* SWDEV-554626 - return hipErrorInvalidDeviceFunction when we can not load module
Return correct error code when modules are empty
* Match the error codes
* Revert the error code
* Don't require powercap support
APUs don't necessarily support setting a power cap from sysfs.
Ignore failures of the file missing.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Show edge temperature in default output if hotspot is missing
APUs don't have a hotspot temperature, they have an edge though.
Use that.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Format all "power" keys as watts
There will be more power keys when APU support is added, so format
them properly.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Don't show power limit in output if it's invalid
APUs can't set power limit using power_cap1 interface. The limit
will be 0 and thus the UX looks weird in default output.
Only add the `/power_limit` if it's valid.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Unify sizes of `amdsmi_power_info_t`
Sizes are used inconsistently. This causes tools to not show
N/A when they should. Make them unified.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
These have an external source of truth.
Also drop the non-existent hipblaslt which isn't in rocm-systems.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* While reusing signals, its possible we can come across a timestamp
that can contain several signals, like when profiling a graph. Reading
timestamps from all signals can make the call severely CPU bound.
Instead cache only that signal so as to avoid the overhead for critical
path.
* Added single process isolation support to execute tests
* Address review comments
* Update README
* Removed requirement of explicit call to clear method
* Added macros for simplified usage
* Updated tests to use process isolation framework
* Adjust summary output format for isolated tests
* Updated rccl_wrap tests
* Used process isolation in AllocTests
* Used process isolation and fixed failing tests
* Modified test output, added signal handling
Updated macros to handle lambdas
* Convert argcheck tests to isolated tests
* Convert proxy tests to isolated tests
* Remove non-supported test
* Fixed file descriptor handling and clearing env vars for tests
* Run pre-commit's whitespace related hooks on projects/rocr-runtime
In order for pre-commit to be useful, everything needs to meet a common
baseline.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Add missing semicolon which would block compilation on big endian CPUs
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Added MPI support to execute unit/functional tests
Update node and process validation
Updated node detection count and modified validation method
Update validation logic to include max procs and nodes
* Address review comments
* Fix warnings
* Added a new NET transport test and clean up
* Added MPI test logging mechanism
* Decoupled GTest framework
* Added Net IB functional tests
* Updated with resource guards
* Added NET IB tests and refactored code
* Update P2pWorkflow test
* Update documentation
* Add MPI_TESTS_ENABLED guard to the file
* Fix Shm and NetIB tests
* Applied refactoring and cleanup
* Replaced BufferGuard with AutoGuard
* Modified test debug logging
* Use macro to reduce NcclTypeTraits code duplication
- Replace repetitive template specializations with a single
DEFINE_NCCL_TYPE_TRAIT macro
- Use stringification operator (#) to auto-generate type name strings
- Add #undef to keep macro from polluting namespace
- Makes adding new type mappings trivial
* Unify buffer initialization with generic pattern function
- Remove initializeBufferWithCustomPattern
- Make initializeBufferWithPattern generic with PatternFunc template param
- Now single function handles all patterns via lambda injection
- Updated all test files to use lambdas for pattern generation
- Pattern logic now visible at call site (self-documenting)
* Unify buffer verification with pluggable pattern function
- Remove verifyBufferWithCustomCheck
- Make verifyBufferData generic with PatternFunc template param
- Single function handles all verification patterns via lambda injection
- Updated all test files to use lambdas
- Better defaults: num_samples=0 means verify all elements
- Pattern logic now visible at call site (self-documenting)
* Docs: Add DeviceBufferHelpers section to MPITestRunner.md
- Document new refactored buffer initialization/verification API
- Explain pluggable pattern functions with lambda examples
- Show type mapping and automatic float/int comparison
- Include migration guide from old API to new unified functions
- Demonstrate best practices with real-world examples
- Reference recent refactoring commits (macro-based type traits)
* Docs: Update documentation and examples
- Update on DeviceBufferHelpers
- Update examples using DeviceBufferHelpers methods, e.g. data verification
* Address review comment.
- Replace manual pattern generation loop with initializeBufferWithPattern call
- Use downloadBuffer to get host copy instead of manual hipMemcpy
* Remove non-existent dependency
* Remove duplicate testcase
* Code cleanup in test files
* Moved common constants to base class
* Fixing the copy back to the original buffer malformed packets
* Addressing Copilot Comments
* Addressing Review comments
* Adjust staging buffer size allocation
Change staging buffer size to match the number of packets.
* Initial work in progress for compute CI workflow
* Update run-ci.py script location, enable test creation
* Add new lines to files
* Add coverage file argument to run-ci.py
* Remove run-ci.py script usage from rocprofiler-compute-continuous-integration.yml workflow
* Add --break-system-packages parameter
* Add --ignore-installed to pip install
* Checkout specific branch until amdclang issue fixed in develop
* Add missing slash to path for cxx compiler
* Remove specific branch from checkout action
* Use run-ci.py in rocprofiler-compute-continuous-integration.yml
* Update install python requirements step
* Fix typo in build-name
* Update run-ci.py to have toggle for code coverage
* Apply ruff formatting
* Ruff again
* Exclude live attach detach and roofline tests in CI
* Add ctest args
* Revert run-ci.py changes
* Try new run-ci-2.py
* Update type of pytest-numprocs argument
* Try casting arg to str
* Fix typo in arg reference
* upgrade pip before running python installs
* Use jammy instead of noble for CI
* Remove python nproc arg from run-ci-2.py
* Switch to MI325 runners for CI
* Fix spacing issue
* Rename run-ci.py to run-code-coverage.py, add new run-ci.py
* Update to ROCm version 7.1.0 to debug sdk issues
* Testing out tarball install again
* Update regex on tarball version
* Update tarball regex on compute
* ruff formatting
* Revert change to systems CI file
* Switch back to rocm-dev install
* ruff formatting again
* Add ld_lib_path for rocm_sysdeps
* Remove excluded tests temporarily
* Add back excluded tests, add timeout for test step
* Address PR feedback
* Add git safe directory lines
* Revert dependencies change to debug new failures
* Exclude roofline again, rework dependencies
* Add in hip-runtime-amd dependency
* Install hip dev package
* Add TEST_FROM_INSTALL cmake arg to compute CI workflow
* Remove test_from_install for now
* Enable roofline tests again
* Fix#2169: rocprofv3 pmc crash on gfx1151
This PR addresses two issues for gfx1151:
- In Pm4Factory::GetGpuId, the first matching entry from the gfxip_map
vector was taken, but "gfx115" came after "gfx11".
- HsaRsrcFactory::GetHsaAgentsCallback would fail when it saw an NPU
agent. Now it ignores it and continues.
* Stop trying to fit too much in one line for default view
The default view is really cramped trying to put a lot of version
information into one line, to the point that some strings are
cropped. Instead of cropping the strings just put each into it's
own line.
For running without a ROCm release installed hide the ROCm version
line.
Sample output:
```
+------------------------------------------------------------------------------+
| AMD-SMI 26.1.0+2a668c34 |
| amdgpu version: Linuxver |
| VBIOS version: 023.010.001.022.000001 |
| Platform: Linux Baremetal |
|-------------------------------------+----------------------------------------|
| BDF GPU-Name | Mem-Uti Temp UEC Power-Usage |
| GPU HIP-ID OAM-ID Partition-Mode | GFX-Uti Fan Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0 ...adeon 890M Graphics | N/A 59 °C 0 17 W |
| 0 0 N/A N/A | 25 % N/A 479/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes: |
| GPU PID Process Name GTT_MEM VRAM_MEM MEM_USAGE CU % |
|==============================================================================|
| No running processes found |
+------------------------------------------------------------------------------+
```
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Don't show amdgpu version on mainline kernels
amdgpu version doesn't exist on a mainline kernel.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Truncate amdgpu version string to 80 characters
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Allow longer AMD-SMI version strings
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Adjusted version header format
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Co-authored-by: Mario Limonciello (AMD) <superm1@kernel.org>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve. -->
- __Reduced Code Duplication__: Version parsing logic moved from individual Dockerfiles to the central build script
- __Improved Edge Case Handling__: Better handling of ROCm versions with and without patch numbers (e.g., `6.2` vs `6.2.0`)
- __Easier Maintenance__: Future version-related changes only need to be made in one place
- __Cleaner Dockerfiles__: Simplified Dockerfiles focus on package installation rather than complex shell logic
- __Updated Platform Support__: Refreshed container matrix to reflect current platform/ROCm version combinations
- __Fix OpenSUSE Docker Generation__: OpenSUSE container generation fails due to a change to the `binutils-gold` package
- __Error Handling__: Fix bug where errors in docker image build were being masked, allowing workflow to pass anyway.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
- Updated `Dockerfile.opensuse` and `Dockerfile.opensuse.ci` docker files to remove `binutils-gold`
- Not needed since we build `binutils` with systems anyways
- Updated `rocprofiler-systems-containers.yml` to remove `pushd/popd` commands and just run the shell scripts
- There was a silent failure observed here, which I verified in this PR before adding the fix for openSUSE
- Refactor ROCm version parsing. Move this logic to the `build-docker.sh` script to reduce duplication.
- Fix bug that caused ROCm 7.0 to fail installation. The trailing `.0` was being trimmed.
- Fixed inconsistencies in `containers.yml` that lead to invalid ROCm-OS_VERSION combinations.
- Formatting fixes
- Removed trailing whitespace
- Fix docker build warnings. Use an `=` rather than ` ` when assigning an environment variable.
This patch enhances compatibility for DXG environments by introducing conditional
checks for DRM operations, particularly around buffer object metadata handling
in IPC scenarios. These changes improve robustness in DXG IPC memory management
without impacting existing functionality in standard Linux environments.
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
* Fix powercap default to enum for sensor_ind
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
* [SWDEV-559965] Refactor amdsmi set power cap
Modified power cap set to accept args with
optional power_cap type. Added power_cap helper
validate_and_set_power_cap(). Fixed JSON output
format.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
---------
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Co-authored-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Add exception handling for native tool path search
* Fix formatting in roofline benchmark code
* Fix detection of .so files
* include hip code and native tool code in standalone binary
* add fallback path for ROCM_PATH
Changes:
- Fixed `amd-smi` showing:
```console
$ amd-smi
Traceback (most recent call last):
File "/opt/rocm/bin/amd-smi", line 53, in <module>
from amdsmi_init import *
File "/opt/rocm/libexec/amdsmi_cli/amdsmi_init.py", line 38, in <module>
from amdsmi import amdsmi_interface, amdsmi_exception
File "/usr/local/lib/python3.8/dist-packages/amdsmi/__init__.py", line 24, in <module>
from .amdsmi_interface import amdsmi_init
File "/usr/local/lib/python3.8/dist-packages/amdsmi/amdsmi_interface.py", line 5581, in <module>
) -> tuple[int, int]:
TypeError: 'type' object is not subscriptable
```
This was a python3.8 issue, which is now resolved by using
`Tuple[int, int]` typing for Python 3.8 compatibility.
* Enable running tests from installation only
* Use cmake option -DTEST_FROM_INSTALL=ON to enable running tests from installation folder only
* It is not possible to run tests from build folder in this case
* This option prevents changing working directory to source folder
* Fix SourceFileLoader to import rocprof-compute main module correctly
* Install sample executables in the test folder
* fix num_xcds_cli_output test
* Fix tests
* Skip autogen. config. test and add a TODO task for re-design of this
test
* Add flexible import of source code in test_gpu_specs.py
* Update cmake to install tests/workloads folder when INSTALL_TESTS=ON
* Fix sys.argv[0] for tests
* fix live attach detach test