Fixed incorrect error code expectation in FrequenciesRead
test when calling amdsmi_get_gpu_pci_bandwidth() with nullptr
parameter.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Faster counter accuracy testing
* Better handle SPI_CSN_* metrics for series earlier than MI350
* Use metric filtering to collect only relevant counters for comparison
* Ensure all workload folders are deleted after testing is completed
* Don't use clean_existing=False
* Add manual test for all counter accuracy
* Test env. vars. in rocprofiler-sdk backend
* Improve rocprofiler-sdk backend test case to check for env. vars. and
ensure we do not overwrite irrelevant env. vars.
* Remove unnecessary usage of ROCPROF_INDIVIDUAL_XCC_MODE env. var.
* Formatting fixes
* Test fixes
* Remove redundant code in tests
* Remove usage of utils_mod and use utils instead; this prevents
duplicate imports
* Fix for multi process workload profiling
Native counter collection tool updates:
* Do not dump empty counter data for a process
* Use PID instead of UUID for dumped csv files to facilitate correlation
* Handle merging multiple pairs of rocpd (from sdk tool) and csv (from
native tool) files
* Handle merging multiple pairs of csv (from sdk tool) and csv (from
native tool) files
Rocpd output format updates:
* Merge multiple rocpd databases into a single csv
* Reset dispatch id and kernel id for unique dispatches and unique
kernels respectively
* Retain multiple rocpd databases per run for multi process workloads
* Add test case for multiprocess profiling using rocflop workload
* Add rocflop
* Fix native counter csv to rocprofv3 csv conversion
* Use kernel_id instead of dispatch_id to correlate native counter csv
and kernel trace csv
* python formatting using ruff 0.14 instead of 0.13
## Motivation
When profiling multi-process applications where a parent process sends SIGKILL to child processes, the termination can occur before the profiler has a chance to flush collected data. This PR introduces a configurable delay before SIGKILL signals are forwarded, allowing profiling data to be captured before process termination. This is a workaround.
## Technical Details
- Added new configuration setting `ROCPROFSYS_KILL_DELAY` (default: 0 seconds) to specify a delay before SIGKILL signals are forwarded to other processes
- Implemented `kill_gotcha` component that intercepts the `kill()` system call
- The gotcha only delays SIGKILL signals sent to external processes (pid > 0 and not self)
- Integrated `kill_gotcha_t` into the `preinit_bundle_t` for early initialization
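
The decision rule above can be sketched in Python. This is an illustration only: the actual change is a C++ gotcha wrapper around `kill()`, and the helper names `kill_delay_seconds`, `should_delay`, and `forwarding_kill` are hypothetical. It also assumes a POSIX system and that the setting is a number of seconds.

```python
import os
import signal
import time


def kill_delay_seconds(env=os.environ):
    """Read ROCPROFSYS_KILL_DELAY; fall back to 0 on missing or bad input."""
    try:
        return max(0.0, float(env.get("ROCPROFSYS_KILL_DELAY", "0")))
    except ValueError:
        return 0.0


def should_delay(pid, sig, self_pid):
    """Delay only SIGKILL aimed at one other process (pid > 0, not self)."""
    return sig == signal.SIGKILL and pid > 0 and pid != self_pid


def forwarding_kill(pid, sig, delay, kill_fn=os.kill, sleep_fn=time.sleep):
    """Hypothetical wrapper: sleep first when the rule applies, then forward."""
    if delay > 0 and should_delay(pid, sig, os.getpid()):
        sleep_fn(delay)  # gives the target time to flush profiling data
    return kill_fn(pid, sig)
```

With the default delay of 0 the wrapper forwards every signal immediately, so behavior is unchanged unless the setting is raised.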
* fix: resolve crash when profiling TensorFlow GPU application
* incorporate review comments
* updated min_rows from 3 to 2 for threads table validation as internal threads are not profiled and are now correctly bypassed
* Make cached perfetto traces the default
* Improve cached data and perfetto traces in order to be more aligned with E2E tests
* Addressing PR comments and findings
* Force early instrumentation bundle instantiation
* Sync-up instrumented containers with thread growth data
* Revert ompvv number of host threads to default 8
* Fixed counter track namings for amd-smi
* AIPROFSYST-34 [rocprof-sys] Update documentation describing newly introduced changes to default tracing mechanism
Currently, if the input file name already exists, the tool
appends output to the existing file. Added overwrite, append,
and no (discard) options to choose from.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Co-authored-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
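
A minimal Python sketch of the three choices described above, for illustration only. The function name `open_output` and the `if_exists` parameter are hypothetical, and keeping "append" as the default is an assumption based on the previous behavior.

```python
import os


def open_output(path, if_exists="append"):
    """Return a writable file handle, or None when output is discarded."""
    if not os.path.exists(path):
        return open(path, "w")
    if if_exists == "overwrite":
        return open(path, "w")  # truncate the existing file
    if if_exists == "append":
        return open(path, "a")  # keep existing content, add to the end
    if if_exists == "no":
        return None  # leave the existing file untouched
    raise ValueError(f"unknown policy: {if_exists!r}")
```

Callers need to handle the `None` return for the "no" choice before writing.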
* Install rocm-dev in rocprofiler-compute-tarball.yml workflow
* Update paths for push and PR for rocprofiler-compute-tarball.yml
* Add ROCm dependencies to disttest job
* cmake fix binary link creation and fix format
* Use python3 instead of python3.9 in RHEL 8 and RHEL 9 workflows
* set default python3 to python3.9 in rhel8
* Try alternatives setup for python3 in RHEL8 env
* Add pip install cmake to debug RHEL8 issue
* Remove python3.11 in RHEL8 workflow
* Add back comment regarding RHEL8
---------
Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
* Improve Iteration multiplexing
* Improve iteration multiplexing documentation by adding usage note and
listing caveats
* Bugfixes for iteration multiplexing
* Use merge iteration multiplexing in analysis webui and db mode
* Do not remove Dispatch_ID column in merge iteration multiplexing
since it is needed for analysis of top dispatches based on
duration
* Bugfixes for analysis logic
* Graceful handling of missing counters in case of iteration
multiplexing
* Improved warnings when metrics could not be calculated due to
missing counter data
* Fix the check to prevent showing table when a column is full of
N/A
* Improve detection of empty values when metric evaluation fails
due to missing counter data
* Bugfixes for profile logic
* Fix kernel filtering during roofline benchmark phase
* Update changelog for bugfixes
* Remove unnecessary columns when merging dispatches for iteration multiplexing
* bugfix
* Better analysis warnings
* fix to_std() in parser
* Use median in merge iteration multiplex
* Address review comments
* Fix cmake formatting
* fix None handling of parser util functions
* Enable stochastic counter accuracy test
* fix cmake formatting
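
The median-based merge and the tolerance for missing counters mentioned in the iteration multiplexing changes above could look roughly like the following Python sketch. Grouping by a `Kernel_Name` column and representing rows as dicts are assumptions made for illustration; the tool's actual data layout is not shown in these notes.

```python
from statistics import median


def merge_iterations(rows, key="Kernel_Name"):
    """Collapse repeated dispatches of one kernel into a single row.

    Each row is a dict of counter values. A counter that is absent or
    None in a row (not collected in that iteration) is skipped.
    """
    grouped = {}
    for row in rows:
        per_counter = grouped.setdefault(row[key], {})
        for name, value in row.items():
            if name != key and value is not None:
                per_counter.setdefault(name, []).append(value)
    return [
        {key: kernel, **{name: median(vals) for name, vals in counters.items()}}
        for kernel, counters in grouped.items()
    ]
```

The median is less sensitive than the mean to a single outlier iteration, which is one plausible reason for the choice.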
* test: add unit tests for common utilities from PR #1249
* incorporate review comments specific to tests formatting
* use filesystem API instead of std::system for safer cleanup
* Add ghc/filesystem submodule v1.5.14 for portable C++17 filesystem support
* fix: add cmake/GhcFilesystem.cmake for CI submodule auto-checkout
* incorporate review comment
* incorporate review comment
* Add HasExpertSchedMode device prop
* Add unit tests for HasExpertSchedMode
* Add gfx12 check for HasExpertSchedMode prop
* Update gfx major version check and test for ExpertSchedMode
* Minor fix and ROCr version bump
* Update projects/rocr-runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
* Update projects/rocr-runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
* Apply suggestion from @dayatsin-amd
* Apply suggestion from @dayatsin-amd
---------
Co-authored-by: Stefan Sokolovic <stefan.sokolovic2@amd.com>
Co-authored-by: David Yat Sin <77975354+dayatsin-amd@users.noreply.github.com>
* Optimize RDC counter sampling with greedy packing algorithm
This change significantly reduces the number of rocprofiler-sdk sample calls
by implementing a greedy packing algorithm that groups multiple counters into
the minimal number of hardware profiles.
Key improvements:
- Implement greedy packing algorithm to combine counters into minimal profiles
- Add ProfileSet structure to manage packed counter configurations
- Cache packed profile sets for reuse across queries
- Group telemetry field requests by GPU for bulk processing
- Reduce sample calls by ~35% (from 100 to 65 for typical workloads)
Performance impact:
- 13 counters now packed into 3 profiles (77% compression)
- Reduces overhead from profile creation and context switching
- More efficient utilization of hardware counter resources
Implementation details:
- Added create_profiles_for_counters() using greedy algorithm
- Added sample_counters_with_packing() for bulk sampling
- Modified telemetry layer to use rocp_lookup_bulk()
- Preserves all field transformations and special handling
Testing shows successful packing with expected performance gains.
No functional changes to external APIs or behavior.
Co-Authored-By: Ben Welton <bwelton@amd.com>
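
To illustrate the greedy packing idea described above, here is a first-fit sketch in Python. It is not the RDC implementation (which is C++): `pack_counters` and the `fits` predicate are hypothetical, and the "at most 5 counters per profile" limit is an invented stand-in for the real hardware constraint.

```python
def pack_counters(counters, fits):
    """First-fit greedy packing of counters into profiles.

    `fits(profile)` is a hypothetical predicate standing in for whatever
    check decides that a set of counters can be collected together.
    """
    profiles = []
    for counter in counters:
        for profile in profiles:
            if fits(profile + [counter]):
                profile.append(counter)
                break
        else:
            # No existing profile can take this counter: open a new one.
            profiles.append([counter])
    return profiles


# Illustrative constraint only: at most 5 counters per profile.
counters = [f"COUNTER_{i}" for i in range(13)]
profiles = pack_counters(counters, lambda p: len(p) <= 5)
```

Under this invented limit, the 13 counters land in 3 profiles. First-fit is a heuristic, so in general it gives a small number of profiles rather than a guaranteed minimum.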
* Address PR review feedback
This commit addresses all review comments from the initial PR:
1. Fix division by zero risk in debug logging
- Added check for empty counters vector before calculating compression ratio
- Avoids potential division by zero when logging profile creation stats
2. Improve thread safety for statistics tracking
- Changed static uint64_t to std::atomic<uint64_t> for thread-safe counters
- Prevents race conditions in multi-threaded sampling scenarios
3. Remove unused variable
- Removed unused profile_index variable that was incremented but never used
- Cleaned up dead code
4. Clean up code formatting
- Removed extra blank lines for consistency
- Applied formatting fixes across modified files
5. Refactor code duplication between rocp_lookup and rocp_lookup_bulk
- Created apply_field_transformation() helper function
- Eliminates ~70 lines of duplicated switch statement logic
- Centralizes field transformation logic in single location
- Makes future maintenance easier
6. Document non-rocprofiler metrics handling
- Added comments explaining how bulk lookup handles special cases
- Clarifies that non-profiler fields like KFD_ID are handled in transformation
All changes maintain backward compatibility and pass compilation.
Co-Authored-By: Ben Welton <bwelton@amd.com>
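
The division-by-zero guard from item 1 above can be sketched as follows. The helper name is hypothetical, and reading "compression" as the reduction relative to one profile per counter is an assumption, though it matches the 77% figure quoted for 13 counters in 3 profiles.

```python
def compression_percent(num_counters, num_profiles):
    """Percent reduction versus one profile per counter.

    Returns None when there are no counters, so callers can skip logging.
    """
    if num_counters == 0:
        return None  # guard: avoids dividing by zero for an empty request
    return 100.0 * (1.0 - num_profiles / num_counters)
```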
---------
Co-authored-by: Ben Welton <bwelton@amd.com>
Co-authored-by: Adam Pryor <61172547+adam360x@users.noreply.github.com>
* hsakmt: Expose CWSR and Control stack sizes
This is better than hardcoding values and hoping that they align with
KFD's definitions
Signed-off-by: Kent Russell <kent.russell@amd.com>
* hsakmt: Use CwsrSize and CtlStackSize if available
If KFD is providing the CwsrSize and CtlStackSize, use the maximum
of those and the old calculations for the ctx_save_restore_size
and ctl_stack_size defined in the queue
Signed-off-by: Kent Russell <kent.russell@amd.com>
* hsakmt: Add warning when ABI<1.20 on GFX1151
CwsrSize and CtlStackSize are reported by KFD ABI 1.20. GFX1151
specifically may have some issues if these regions are misaligned, so
report a strong warning during topology initialization if the system is
GFX1151 but is using KFD ABI < 1.20
Signed-off-by: Kent Russell <kent.russell@amd.com>
---------
Signed-off-by: Kent Russell <kent.russell@amd.com>
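
The two rules in the hsakmt commits above can be sketched in Python, for illustration only (the real code is C in libhsakmt). Both function names are hypothetical, and representing an unavailable KFD value as `None` is an assumption.

```python
def effective_size(reported, computed):
    """Use the larger of the KFD-reported size and the legacy calculation.

    `reported` is None when the kernel driver does not provide the value.
    """
    return computed if reported is None else max(reported, computed)


def needs_abi_warning(gfx_target, abi_version, minimum=(1, 20)):
    """True for GFX1151 on a KFD ABI older than 1.20.

    Versions are (major, minor) tuples: compared as floats, 1.3 would
    wrongly rank above 1.20.
    """
    return gfx_target == "gfx1151" and tuple(abi_version) < minimum
```

Taking the maximum means a driver that reports a smaller size than the legacy calculation cannot shrink the allocation.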
* Enable Lintian Support for ROCM-SMI
* Enable Lintian Support for ROCMINFO
* Updated Lintian Override File Processing
* Update UT Fix for Lintian rocmsmi,rocminfo
* Update UT Fixes, Review Comments
* Update Review Comments - removed extra white spaces, added error check for gzip, date commands
* Update Review Comments - Correcting License Type
* Sync Lintian ChangeLog
* Changelog data sync enhanced
* Update Review Comments, UT fix
* white space cleanup - precommit check
* Run pre-commit's whitespace related hooks on projects/amdsmi
In order for pre-commit to be useful, everything needs to meet a common
baseline.
* Add whitespace back to Changelog for formatting
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Ignore __CLANG_HIP_RUNTIME_WRAPPER_INCLUDED__. This should not be relying
on declarations from the clang builtin headers. There is no issue declaring
the same intrinsics multiple times. This will enable removal of declarations
from the clang builtin headers.
* added graceful errors/exit in profile/analyze roofline.csv
* edit if statement truth
* restore if statement truth (roofline_csv needs at least 2 rows)
* addressed comments and skipped showing roof metrics when data invalid
* fix workload merge
* changed warning to error
* removed redundant variable definition
* added roofline csv validate check in TUI
* add test cases to test validation function
* ruff format
* simplified TUI roofline handling
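
A rough Python sketch of the roofline CSV check described above. The function name is hypothetical, and counting the header as one of the two required rows is an assumption; the notes only say that roofline_csv needs at least 2 rows.

```python
import csv


def roofline_csv_is_valid(path, min_rows=2):
    """Return True when the file is readable and has at least `min_rows` rows.

    Assumption: every non-empty row counts, including the header.
    """
    try:
        with open(path, newline="") as handle:
            rows = [row for row in csv.reader(handle) if row]
    except OSError:
        return False  # missing or unreadable file is treated as invalid
    return len(rows) >= min_rows
```

A caller can then report an error and skip the roofline metrics when the check fails, rather than crashing on empty data.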
* Update README documentation links for clarity and consistency across projects
- Changed links in the README files for `clr`, `hipother`, and `hip-tests` to use relative paths instead of absolute URLs, improving navigation within the repository.
* Update CONTRIBUTING documentation to use relative links for improved navigation
- Changed absolute URLs to relative paths in the CONTRIBUTING.md files for the hip and hipother projects, enhancing consistency and ease of access within the repository.