* Install rocm-dev in rocprofiler-compute-tarball.yml workflow
* Update paths for push and PR for rocprofiler-compute-tarball.yml
* Add ROCm dependencies to disttest job
* cmake: fix binary link creation and formatting
* Use python3 instead of python3.9 in RHEL 8 and RHEL 9 workflows
* set default python3 to python3.9 in rhel8
* Try alternatives setup for python3 in RHEL8 env
* Add pip install cmake to debug RHEL8 issue
* Remove python3.11 in RHEL8 workflow
* Add back comment regarding RHEL8
---------
Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
* Improve Iteration multiplexing
* Improve iteration multiplexing documentation by adding usage note and
listing caveats
* Bugfixes for iteration multiplexing
* Use merge iteration multiplexing in analysis webui and db mode
* Do not remove Dispatch_ID column in merge iteration multiplexing
since it is needed for analysis of top dispatches based on
duration
* Bugfixes for analysis logic
* Graceful handling of missing counters in case of iteration
multiplexing
* Improved warnings when metrics could not be calculated due to
missing counter data
* Fix the check to prevent showing table when a column is full of
N/A
* Improve detection of empty values when metric evaluation fails
due to missing counter data
* Bugfixes for profile logic
* Fix kernel filtering during roofline benchmark phase
* Update changelog for bugfixes
* Remove unnecessary columns when merging dispatches for iteration multiplexing
* bugfix
* Better analysis warnings
* fix to_std() in parser
* Use median in merge iteration multiplex
* Address review comments
* Fix cmake formatting
* fix None handling of parser util functions
* Enable stochastic counter accuracy test
* fix cmake formatting
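The switch to a median in the iteration multiplexing merge can be illustrated with a minimal sketch. This is not the rocprofiler-compute implementation: the function `merge_iterations`, the grouping column `Kernel_Name`, and the counter names `counter_a` and `counter_b` are hypothetical, chosen only to show why a median is robust to one outlier iteration and how a counter missing from some iterations can be skipped rather than treated as zero.

```python
from collections import defaultdict
from statistics import median


def merge_iterations(rows, key="Kernel_Name"):
    """Merge per-iteration rows into one row per kernel using the median.

    Hypothetical helper: `rows` is a list of dicts, one per dispatch.
    A value of None means the counter was not collected in that iteration.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, value in row.items():
            # Skip the grouping key and any counter missing in this iteration.
            if column != key and value is not None:
                grouped[row[key]][column].append(value)
    return {
        kernel: {column: median(values) for column, values in columns.items()}
        for kernel, columns in grouped.items()
    }


rows = [
    {"Kernel_Name": "k", "counter_a": 10},
    {"Kernel_Name": "k", "counter_a": 12, "counter_b": None},
    {"Kernel_Name": "k", "counter_a": 100, "counter_b": 7},
]
merged = merge_iterations(rows)
```

With these inputs the outlier value 100 does not shift `counter_a`, whereas a mean would be pulled to about 40.7.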
* test: add unit tests for common utilities from PR #1249
* incorporate review comments specific to tests formatting
* use filesystem API instead of std::system for safer cleanup
* Add ghc/filesystem submodule v1.5.14 for portable C++17 filesystem support
* fix: add cmake/GhcFilesystem.cmake for CI submodule auto-checkout
* incorporate review comment
* incorporate review comment
* Update unit tests for alt_rsmi impl
- Create distinct test executable for alt_rsmi testing
- Updated alt_rsmi tests to use public methods
- Compiles alt_rsmi.cc with ARSMI_TEST_BUILD
- Enables external linkage of internal variables
- Only for AltRsmiTests.cpp that manipulates internals
- Clean separation for test behavior
* Address review comments
* restore hidden symbol visibility
[ROCm/rccl commit: 74690ea705]
* Add HasExpertSchedMode device prop
* Add unit tests for HasExpertSchedMode
* Add gfx12 check for HasExpertSchedMode prop
* Update gfx major version check and test for ExpertSchedMode
* Minor fix and ROCr version bump
* Update projects/rocr-runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
* Update projects/rocr-runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
* Apply suggestion from @dayatsin-amd
* Apply suggestion from @dayatsin-amd
---------
Co-authored-by: Stefan Sokolovic <stefan.sokolovic2@amd.com>
Co-authored-by: David Yat Sin <77975354+dayatsin-amd@users.noreply.github.com>
* Optimize RDC counter sampling with greedy packing algorithm
This change significantly reduces the number of rocprofiler-sdk sample calls
by implementing a greedy packing algorithm that groups multiple counters into
the minimal number of hardware profiles.
Key improvements:
- Implement greedy packing algorithm to combine counters into minimal profiles
- Add ProfileSet structure to manage packed counter configurations
- Cache packed profile sets for reuse across queries
- Group telemetry field requests by GPU for bulk processing
- Reduce sample calls by ~35% (from 100 to 65 for typical workloads)
Performance impact:
- 13 counters now packed into 3 profiles (77% compression)
- Reduces overhead from profile creation and context switching
- More efficient utilization of hardware counter resources
Implementation details:
- Added create_profiles_for_counters() using greedy algorithm
- Added sample_counters_with_packing() for bulk sampling
- Modified telemetry layer to use rocp_lookup_bulk()
- Preserves all field transformations and special handling
Testing shows successful packing with expected performance gains.
No functional changes to external APIs or behavior.
Co-Authored-By: Ben Welton <bwelton@amd.com>
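The greedy packing idea described above can be sketched as a first-fit loop. This is an illustration under stated assumptions, not the code behind `create_profiles_for_counters()`: the function `pack_counters`, the predicate `fits`, and the five-counters-per-profile limit are hypothetical stand-ins for whatever validity check the profiling layer applies when a profile is created.

```python
from typing import Callable, Iterable, List


def pack_counters(
    counters: Iterable[str],
    fits: Callable[[List[str]], bool],
) -> List[List[str]]:
    """First-fit greedy packing of counters into profiles.

    `fits` is a hypothetical predicate that returns True when the given
    list of counters can be collected together in one profile.
    """
    profiles: List[List[str]] = []
    for counter in counters:
        for profile in profiles:
            # Place the counter in the first existing profile that accepts it.
            if fits(profile + [counter]):
                profile.append(counter)
                break
        else:
            # No existing profile accepts it, so open a new one.
            if not fits([counter]):
                raise ValueError(f"counter {counter!r} cannot be collected")
            profiles.append([counter])
    return profiles


def at_most_five(profile: List[str]) -> bool:
    # Assumed limit for illustration only; real limits depend on the hardware.
    return len(profile) <= 5


counters = [f"COUNTER_{i}" for i in range(13)]
profiles = pack_counters(counters, at_most_five)
```

Under the assumed limit, the 13 counters land in 3 profiles of sizes 5, 5, and 3. First-fit is a heuristic: it keeps the profile count low but does not guarantee the true minimum for arbitrary constraints.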
* Address PR review feedback
This commit addresses all review comments from the initial PR:
1. Fix division by zero risk in debug logging
- Added check for empty counters vector before calculating compression ratio
- Avoids potential division by zero when logging profile creation stats
2. Improve thread safety for statistics tracking
- Changed static uint64_t to std::atomic<uint64_t> for thread-safe counters
- Prevents race conditions in multi-threaded sampling scenarios
3. Remove unused variable
- Removed unused profile_index variable that was incremented but never used
- Cleaned up dead code
4. Clean up code formatting
- Removed extra blank lines for consistency
- Applied formatting fixes across modified files
5. Refactor code duplication between rocp_lookup and rocp_lookup_bulk
- Created apply_field_transformation() helper function
- Eliminates ~70 lines of duplicated switch statement logic
- Centralizes field transformation logic in single location
- Makes future maintenance easier
6. Document non-rocprofiler metrics handling
- Added comments explaining how bulk lookup handles special cases
- Clarifies that non-profiler fields like KFD_ID are handled in transformation
All changes maintain backward compatibility and pass compilation.
Co-Authored-By: Ben Welton <bwelton@amd.com>
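The division-by-zero fix in item 1 concerns the compression ratio printed in debug logging. A minimal sketch of a guarded calculation follows, assuming the ratio is defined as one minus profiles over counters, which matches the 13-counters-into-3-profiles figure quoted earlier. The function name `compression_ratio` is hypothetical.

```python
def compression_ratio(num_counters: int, num_profiles: int) -> float:
    """Fraction of per-counter profiles saved by packing.

    Returns 0.0 for an empty counter list instead of dividing by zero.
    """
    if num_counters == 0:
        return 0.0
    return 1.0 - num_profiles / num_counters


packed = compression_ratio(13, 3)   # roughly 0.77
empty = compression_ratio(0, 0)     # guarded case
```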
---------
Co-authored-by: Ben Welton <bwelton@amd.com>
Co-authored-by: Adam Pryor <61172547+adam360x@users.noreply.github.com>
Debugging shows that the parameter count has no impact on call relations, so revert this commit.
This reverts commit 5f710768aa2e68c5b06d0ece19f4268cc66f88d4.
Signed-off-by: Yang Su <Yang.Su2@amd.com>
Reviewed-by: Flora Cui <flora.cui@amd.com>
* hsakmt: Expose CWSR and Control stack sizes
This is better than hardcoding values and hoping that they align with
KFD's definitions
Signed-off-by: Kent Russell <kent.russell@amd.com>
* hsakmt: Use CwsrSize and CtlStackSize if available
If KFD provides CwsrSize and CtlStackSize, set the queue's
ctx_save_restore_size and ctl_stack_size to the maximum of the
KFD-provided value and the previously calculated value
Signed-off-by: Kent Russell <kent.russell@amd.com>
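The selection rule in this commit can be sketched as follows. The sketch is written in Python for brevity and is not the hsakmt code: the function name `effective_size` is hypothetical, and treating a reported value of 0 or None as "not provided by KFD" is an assumption.

```python
from typing import Optional


def effective_size(kfd_reported: Optional[int], legacy_computed: int) -> int:
    """Pick the size to use for a save/restore or control stack region."""
    if not kfd_reported:
        # Assumed convention: None or 0 means KFD did not report a size.
        return legacy_computed
    # Never go below the legacy calculation, even if KFD reports less.
    return max(kfd_reported, legacy_computed)
```

Taking the maximum means an older calculation that over-allocates keeps working, while a larger KFD-reported size is always honored.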
* hsakmt: Add warning when ABI<1.20 on GFX1151
CwsrSize and CtlStackSize are reported by KFD as of ABI 1.20. GFX1151
specifically may have issues if these regions are misaligned, so
report a strong warning during topology initialization if the system is
GFX1151 but is using KFD ABI < 1.20
Signed-off-by: Kent Russell <kent.russell@amd.com>
---------
Signed-off-by: Kent Russell <kent.russell@amd.com>
* remove node-count and threshold restrictions from p2p-batching
* remove batching threshold usage, fix typo for using batching-enablement flag
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
[ROCm/rccl commit: 7c1049d2a4]
* Enable Lintian Support for ROCM-SMI
* Enable Lintian Support for ROCMINFO
* Updated Lintian Override File Processing
* Update UT Fix for Lintian rocmsmi,rocminfo
* Update UT Fixes, Review Comments
* Update Review Comments - removed extra white spaces, added error check for gzip, date commands
* Update Review Comments - Correcting License Type
* Sync Lintian ChangeLog
* Changelog data sync enhanced
* Update Review Comments, UT fix
* white space cleanup - precommit check
* Run pre-commit's whitespace related hooks on projects/amdsmi
In order for pre-commit to be useful, everything needs to meet a common
baseline.
* Add whitespace back to Changelog for formatting
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Ignore __CLANG_HIP_RUNTIME_WRAPPER_INCLUDED__. This should not be relying
on declarations from the clang builtin headers. There is no issue declaring
the same intrinsics multiple times. This will enable removal of declarations
from the clang builtin headers.