* Fix merging logic for multi process
* Fix dispatch id reset logic in case of rocpd format
* Fix kernel id reset logic in case of csv format
* Revert correlation logic change in csv format
* Do inner join instead of left join
* Added tool for dumping counter and metric values
* Skip Linting
* Added support for iteration multiplexing
* Remove subparser and suppress compute options
* Specify output dir
* Add kernel info
* csv name change
* Added comments
* Support dispatch id-less dataframes
* Formatting fix
* Add default for path
* Print help with no args
* Support only single workload
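The switch from a left join to an inner join changes which rows survive the merge. A left join keeps every row of the left table and fills the gaps with missing values, while an inner join keeps only keys present on both sides. A minimal pure-Python sketch, assuming two hypothetical tables keyed by `Kernel_ID` (the column names and values are illustrative, not taken from the tool):

```python
def left_join(left, right, key):
    """Keep every left row; fill missing right-hand columns with None.

    Assumes at most one right-hand row per key.
    """
    index = {row[key]: row for row in right}
    right_cols = {col for row in right for col in row if col != key}
    joined = []
    for row in left:
        match = index.get(row[key])
        extra = {col: (match.get(col) if match else None) for col in right_cols}
        joined.append({**row, **extra})
    return joined


def inner_join(left, right, key):
    """Keep only left rows whose key also appears in the right table."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Hypothetical data: kernel trace rows and native counter rows.
trace = [{"Kernel_ID": 1, "Name": "gemm"}, {"Kernel_ID": 2, "Name": "copy"}]
counters = [{"Kernel_ID": 1, "SQ_WAVES": 64}]

left_rows = left_join(trace, counters, "Kernel_ID")
inner_rows = inner_join(trace, counters, "Kernel_ID")
```

With this data the left join returns two rows, the second with `SQ_WAVES` set to `None`, while the inner join returns only the row for kernel 1.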
Fixed incorrect error code expectation in FrequenciesRead
test when calling amdsmi_get_gpu_pci_bandwidth() with nullptr
parameter.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Faster counter accuracy testing
* Better handle SPI_CSN_* metrics for series older than MI350
* Use metric filtering to collect only relevant counters for comparison
* Ensure all workload folders are deleted after testing is completed
* Don't use clean_existing=False
* Add manual test for all counter accuracy
* Test env. vars. in rocprofiler-sdk backend
* Improve rocprofiler-sdk backend test case to check for env. vars. and
ensure we do not overwrite irrelevant env. vars.
* Remove unnecessary usage of ROCPROF_INDIVIDUAL_XCC_MODE env. var.
* Formatting fixes
* Test fixes
* Remove redundant code in tests
* Remove usage of utils_mod and use utils instead, this prevents
duplicate imports
* Fix for multi process workload profiling
Native counter collection tool updates:
* Do not dump empty counter data for a process
* Use PID instead of UUID for dumped csv files to facilitate correlation
* Handle merging multiple pairs of rocpd (from sdk tool) and csv (from
native tool) files
* Handle merging multiple pairs of csv (from sdk tool) and csv (from
native tool) files
Rocpd output format updates:
* Merge multiple rocpd databases into a single csv
* Reset dispatch id and kernel id for unique dispatches and unique
kernels respectively
* Retain multiple rocpd databases per run for multi process workloads
* Add test case for multiprocess profiling using rocflop workload
* Add rocflop
* Fix native counter csv to rocprofv3 csv conversion
* Use kernel_id instead of dispatch_id to correlate native counter csv
and kernel trace csv
* python formatting using ruff 0.14 instead of 0.13
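One way to picture the ID reset for multi-process runs: each process numbers its dispatches independently, so IDs collide once the per-process outputs are concatenated. A minimal sketch, assuming hypothetical per-process row lists keyed by PID and assuming a kernel is identified by its name (the field names are illustrative and not the actual rocpd or csv schema):

```python
def merge_and_reset_ids(per_process_rows):
    """Concatenate per-process rows and assign run-wide unique IDs.

    per_process_rows maps a PID to a list of dicts with the hypothetical
    keys "Dispatch_ID" and "Kernel_Name".
    """
    merged = []
    kernel_ids = {}  # kernel name -> new kernel ID (one per unique kernel)
    next_dispatch_id = 1
    for pid in sorted(per_process_rows):
        for row in per_process_rows[pid]:
            name = row["Kernel_Name"]
            # Reuse the ID of a kernel seen before, else allocate the next one.
            kernel_id = kernel_ids.setdefault(name, len(kernel_ids) + 1)
            merged.append(
                {
                    "PID": pid,
                    "Dispatch_ID": next_dispatch_id,  # unique per dispatch
                    "Kernel_ID": kernel_id,
                    "Kernel_Name": name,
                }
            )
            next_dispatch_id += 1
    return merged


# Hypothetical input: both processes start their dispatch IDs at 1.
rows = merge_and_reset_ids(
    {
        200: [
            {"Dispatch_ID": 1, "Kernel_Name": "gemm"},
            {"Dispatch_ID": 2, "Kernel_Name": "copy"},
        ],
        100: [{"Dispatch_ID": 1, "Kernel_Name": "gemm"}],
    }
)
```

Keeping the PID on every merged row is what makes it possible to match a row back to the per-process file it came from.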
## Motivation
When profiling multi-process applications where a parent process sends SIGKILL to child processes, the termination can occur before the profiler has a chance to flush collected data. This PR introduces a configurable delay before SIGKILL signals are forwarded, allowing profiling data to be captured before process termination. This is a workaround.
## Technical Details
- Added new configuration setting `ROCPROFSYS_KILL_DELAY` (default: 0 seconds) to specify a delay before SIGKILL signals are forwarded to other processes
- Implemented `kill_gotcha` component that intercepts the `kill()` system call
- The gotcha only delays SIGKILL signals sent to external processes (pid > 0 and not self)
- Integrated `kill_gotcha_t` into the `preinit_bundle_t` for early initialization
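The decision logic described above can be sketched in a few lines. This is a Python illustration only, not the `kill_gotcha` component itself, which intercepts `kill()` in native code. The function names, the injectable `kill` and `sleep` parameters, and the fallback to 0 for non-numeric values are assumptions made for the example:

```python
import os
import signal
import time


def read_kill_delay(environ=os.environ):
    """Read the delay in seconds; fall back to the documented default of 0."""
    try:
        return max(0.0, float(environ.get("ROCPROFSYS_KILL_DELAY", "0")))
    except ValueError:
        return 0.0  # assumption: treat an unparsable value as "no delay"


def delayed_kill(pid, sig, delay, kill=os.kill, sleep=time.sleep, self_pid=None):
    """Forward a signal, pausing first only for SIGKILL sent to another process."""
    self_pid = os.getpid() if self_pid is None else self_pid
    if sig == signal.SIGKILL and pid > 0 and pid != self_pid and delay > 0:
        sleep(delay)  # give the target time to flush profiling data
    return kill(pid, sig)
```

Other signals, process-group targets (`pid <= 0`), and signals a process sends to itself pass through without any delay, which matches the conditions listed above.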
* fix: resolve crash when profiling TensorFlow GPU application
* incorporate review comments
* updated min_rows from 3 to 2 for threads table validation as internal threads are not profiled and are now correctly bypassed
## Motivation
Missing CODEOWNERS for ROCProfiler-SDK
## Technical Details
Add CODEOWNERS for rocprofiler-sdk project
## JIRA ID
## Test Plan
## Test Result
## Submission Checklist
- [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Add Full Build Capability to theROCK for HIP
### Summary
This PR adds full build support to **theROCK** for HIP-related changes, ensuring that all components are built.
### Changes
- Enabled full build coverage for the following projects:
- `projects/clr`
- `projects/hip`
- `projects/hip-tests`
- `projects/rocr-runtime`
- Updated build configuration to include all targets for the above projects.
- Ensured rocm-libraries is pulled to build optional components.
### Motivation
These changes are required to support HIP development and testing within theROCK by ensuring all components are built together. This improves reliability and integration testing.
* Make cached perfetto traces the default
* Improve cached data and perfetto traces in order to be more aligned with E2E tests
* Addressing PR comments and findings
* Force early instrumentation bundle instantiation
* Sync up instrumented containers with thread growth data
* Revert ompvv number of host threads to default 8
* Fixed counter track namings for amd-smi
* AIPROFSYST-34 [rocprof-sys] Update documentation describing newly introduced changes to default tracing mechanism
Currently, if the input file name already exists, the tool
appends output to the existing file. Added overwrite, append,
and no (discard) options to choose from.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Co-authored-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
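The three choices for an already existing file can be sketched as follows. The function name, the `if_exists` parameter, its `"append"` default (mirroring the previous behavior), and the choice to create the file when it does not exist yet are all assumptions for illustration, not the tool's actual interface:

```python
import os


def open_output(path, if_exists="append"):
    """Open an output file according to a hypothetical if-exists policy.

    Returns a writable file object, or None when the output is discarded.
    """
    if if_exists not in ("overwrite", "append", "no"):
        raise ValueError(f"unknown policy: {if_exists!r}")
    if not os.path.exists(path):
        return open(path, "w")  # nothing to protect, so just create it
    if if_exists == "overwrite":
        return open(path, "w")  # truncate the existing file
    if if_exists == "append":
        return open(path, "a")  # keep existing content, add to the end
    return None  # "no": leave the existing file untouched
```

A caller would check for `None` before writing and skip the output in that case.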
* Install rocm-dev in rocprofiler-compute-tarball.yml workflow
* Update paths for push and PR for rocprofiler-compute-tarball.yml
* Add ROCm dependencies to disttest job
* cmake fix binary link creation and fix format
* Use python3 instead of python3.9 in RHEL 8 and RHEL 9 workflows
* set default python3 to python3.9 in rhel8
* Try alternatives setup for python3 in RHEL8 env
* Add pip install cmake to debug RHEL8 issue
* Remove python3.11 in RHEL8 workflow
* Add back comment regarding RHEL8
---------
Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
* Improve Iteration multiplexing
* Improve iteration multiplexing documentation by adding usage note and
listing caveats
* Bugfixes for iteration multiplexing
* Use merge iteration multiplexing in analysis webui and db mode
* Do not remove Dispatch_ID column in merge iteration multiplexing
since it is needed for analysis of top dispatches based on
duration
* Bugfixes for analysis logic
* Graceful handling of missing counters in case of iteration
multiplexing
* Improved warnings when metrics could not be calculated due to
missing counter data
* Fix the check to prevent showing table when a column is full of
N/A
* Improve detection of empty values when metric evaluation fails
due to missing counter data
* Bugfixes for profile logic
* Fix kernel filtering during roofline benchmark phase
* Update changelog for bugfixes
* Remove unnecessary columns when merging dispatches for iteration multiplexing
* bugfix
* Better analysis warnings
* fix to_std() in parser
* Use median in merge iteration multiplex
* Address review comments
* Fix cmake formatting
* fix None handling of parser util functions
* Enable stochastic counter accuracy test
* fix cmake formatting
* test: add unit tests for common utilities from PR #1249
* incorporate review comments specific to tests formatting
* use filesystem API instead of std::system for safer cleanup
* Add ghc/filesystem submodule v1.5.14 for portable C++17 filesystem support
* fix: add cmake/GhcFilesystem.cmake for CI submodule auto-checkout
* incorporate review comment
* incorporate review comment
* Add HasExpertSchedMode device prop
* Add unit tests for HasExpertSchedMode
* Add gfx12 check for HasExpertSchedMode prop
* Update gfx major version check and test for ExpertSchedMode
* Minor fix and ROCr version bump
* Update projects/rocr-runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
* Update projects/rocr-runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h
* Apply suggestion from @dayatsin-amd
* Apply suggestion from @dayatsin-amd
---------
Co-authored-by: Stefan Sokolovic <stefan.sokolovic2@amd.com>
Co-authored-by: David Yat Sin <77975354+dayatsin-amd@users.noreply.github.com>
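For the iteration multiplexing entries earlier in this log (median-based merging and graceful handling of missing counters), the idea can be sketched as below. With multiplexing, each iteration collects only a subset of counters, so a merged row has to combine values across iterations and tolerate counters that were never collected. The function name and row layout are assumptions for illustration, not the tool's actual data model:

```python
from statistics import median


def merge_iterations(rows, key="Kernel_Name"):
    """Merge rows collected in different iterations into one row per kernel.

    Each counter takes the median of the values actually collected; a counter
    never collected for a kernel stays None instead of raising.
    """
    grouped = {}
    for row in rows:
        counters = grouped.setdefault(row[key], {})
        for name, value in row.items():
            if name == key:
                continue
            values = counters.setdefault(name, [])
            if value is not None:  # skip iterations that missed this counter
                values.append(value)
    return {
        kernel: {
            name: (median(vals) if vals else None)
            for name, vals in counters.items()
        }
        for kernel, counters in grouped.items()
    }


# Hypothetical rows: three iterations of one kernel, one counter never collected.
merged = merge_iterations(
    [
        {"Kernel_Name": "gemm", "SQ_WAVES": 10, "GRBM_GUI_ACTIVE": None},
        {"Kernel_Name": "gemm", "SQ_WAVES": 30, "GRBM_GUI_ACTIVE": None},
        {"Kernel_Name": "gemm", "SQ_WAVES": 20, "GRBM_GUI_ACTIVE": None},
    ]
)
```

The median is less sensitive than the mean to a single outlier iteration, and leaving a missing counter as `None` lets later analysis steps report it as unavailable rather than fail.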
* Optimize RDC counter sampling with greedy packing algorithm
This change significantly reduces the number of rocprofiler-sdk sample calls
by implementing a greedy packing algorithm that groups multiple counters into
the minimal number of hardware profiles.
Key improvements:
- Implement greedy packing algorithm to combine counters into minimal profiles
- Add ProfileSet structure to manage packed counter configurations
- Cache packed profile sets for reuse across queries
- Group telemetry field requests by GPU for bulk processing
- Reduce sample calls by ~35% (from 100 to 65 for typical workloads)
Performance impact:
- 13 counters now packed into 3 profiles (77% compression)
- Reduces overhead from profile creation and context switching
- More efficient utilization of hardware counter resources
Implementation details:
- Added create_profiles_for_counters() using greedy algorithm
- Added sample_counters_with_packing() for bulk sampling
- Modified telemetry layer to use rocp_lookup_bulk()
- Preserves all field transformations and special handling
Testing shows successful packing with expected performance gains.
No functional changes to external APIs or behavior.
Co-Authored-By: Ben Welton <bwelton@amd.com>
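A greedy first-fit packing of this kind can be sketched as follows. This is an illustration of the general technique, not the code in `create_profiles_for_counters()`. The `fits` predicate stands in for whatever hardware compatibility check the real implementation performs, and the limit of five counters per profile is invented for the example. Greedy packing is a heuristic, so it does not guarantee the true minimum in every case:

```python
def pack_counters(counters, fits):
    """Greedy first-fit packing of counters into as few profiles as it can find.

    fits(profile, counter) is a caller-supplied predicate that says whether
    the counter can be added to an existing profile (a list of counters).
    """
    profiles = []
    for counter in counters:
        for profile in profiles:
            if fits(profile, counter):
                profile.append(counter)
                break
        else:
            profiles.append([counter])  # no existing profile can take it
    return profiles


def compression_ratio(num_counters, num_profiles):
    """Fraction of profiles saved versus one profile per counter."""
    if num_counters == 0:
        return 0.0  # avoid dividing by zero for an empty request
    return 1.0 - num_profiles / num_counters


# Hypothetical constraint: at most 5 counters per profile.
profiles = pack_counters([f"C{i}" for i in range(13)], lambda p, c: len(p) < 5)
ratio = compression_ratio(13, len(profiles))
```

Under this invented constraint, 13 counters land in 3 profiles, and `1 - 3/13` is about 0.77, consistent with the 77% figure quoted above.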
* Address PR review feedback
This commit addresses all review comments from the initial PR:
1. Fix division by zero risk in debug logging
- Added check for empty counters vector before calculating compression ratio
- Avoids potential division by zero when logging profile creation stats
2. Improve thread safety for statistics tracking
- Changed static uint64_t to std::atomic<uint64_t> for thread-safe counters
- Prevents race conditions in multi-threaded sampling scenarios
3. Remove unused variable
- Removed unused profile_index variable that was incremented but never used
- Cleaned up dead code
4. Clean up code formatting
- Removed extra blank lines for consistency
- Applied formatting fixes across modified files
5. Refactor code duplication between rocp_lookup and rocp_lookup_bulk
- Created apply_field_transformation() helper function
- Eliminates ~70 lines of duplicated switch statement logic
- Centralizes field transformation logic in single location
- Makes future maintenance easier
6. Document non-rocprofiler metrics handling
- Added comments explaining how bulk lookup handles special cases
- Clarifies that non-profiler fields like KFD_ID are handled in transformation
All changes maintain backward compatibility and pass compilation.
Co-Authored-By: Ben Welton <bwelton@amd.com>
---------
Co-authored-by: Ben Welton <bwelton@amd.com>
Co-authored-by: Adam Pryor <61172547+adam360x@users.noreply.github.com>
* hsakmt: Expose CWSR and Control stack sizes
This is better than hardcoding values and hoping that they align with
KFD's definitions
Signed-off-by: Kent Russell <kent.russell@amd.com>
* hsakmt: Use CwsrSize and CtlStackSize if available
If KFD is providing the CwsrSize and CtlStackSize, use the maximum
of those and the old calculations for the ctx_save_restore_size
and ctl_stack_size defined in the queue
Signed-off-by: Kent Russell <kent.russell@amd.com>
* hsakmt: Add warning when ABI<1.20 on GFX1151
CwsrSize and CtlStackSize are reported by KFD ABI 1.20. GFX1151
specifically may have some issues if these regions are misaligned, so
report a strong warning during topology initialization if the system is
GFX1151 but is using KFD ABI < 1.20
Signed-off-by: Kent Russell <kent.russell@amd.com>
---------
Signed-off-by: Kent Russell <kent.russell@amd.com>
* Enable Lintian Support for ROCM-SMI
* Enable Lintian Support for ROCMINFO
* Updated Lintian Override File Processing
* Update UT Fix for Lintian rocmsmi,rocminfo
* Update UT Fixes, Review Comments
* Update Review Comments - removed extra white spaces, added error check for gzip, date commands
* Update Review Comments - Correcting License Type
* Sync Lintian ChangeLog
* Changelog data sync enhanced
* Update Review Comments, UT fix
* white space cleanup - precommit check
* Run pre-commit's whitespace related hooks on projects/amdsmi
In order for pre-commit to be useful, everything needs to meet a common
baseline.
* Add whitespace back to Changelog for formatting
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>