* Fix dimension mismatch for multi-GPU systems with identical architectures
This change addresses an issue where counter dimensions were incorrectly
shared across all GPU agents with the same architecture name, even when
those agents had different hardware configurations (e.g., different CU counts).
Changes:
- Updated getBlockDimensions() to accept agent ID instead of architecture name
- Made dimension cache agent-specific instead of architecture-specific
- Updated set_dimensions() in AST evaluation to use specific agent ID
- Modified all API functions to handle agent-specific dimension lookups
- Updated tests to work with agent-specific dimensions
This fix ensures that dimensions accurately reflect the actual hardware
configuration of each individual GPU agent, preventing dimension mismatches
in multi-GPU systems where GPUs share the same architecture but have
different physical configurations.
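
The difference between an architecture-keyed and an agent-keyed cache can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Agent` type, `query_dimensions`, the `DIMENSION_CU` key, and the CU counts are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    agent_id: int   # hypothetical unique per-agent identifier
    arch_name: str  # architecture name shared by several agents
    cu_count: int   # hardware value that may differ between agents


def query_dimensions(agent):
    """Hypothetical stand-in for the real per-agent hardware query."""
    return {"DIMENSION_CU": agent.cu_count}


class DimensionCache:
    """Caches dimensions per agent ID rather than per architecture name."""

    def __init__(self):
        self._by_agent = {}

    def get_block_dimensions(self, agent):
        # Keying on agent_id keeps two agents with the same arch_name
        # but different hardware from sharing one cache entry.
        if agent.agent_id not in self._by_agent:
            self._by_agent[agent.agent_id] = query_dimensions(agent)
        return self._by_agent[agent.agent_id]
```

With a cache keyed on `arch_name`, the second of two same-architecture agents would receive the first agent's dimensions; keyed on `agent_id`, each agent gets its own entry.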
Counter ID Representation Changes:
- Modified counter_id encoding to include agent information in bits 37-32
- Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable
- Counter records internally store only 16-bit base metric IDs (bits 15-0)
- Tool reconstructs agent-encoded counter IDs from base metric ID & agent info
- Instance record counter_id field uses bitwise AND mask to extract base metric ID
(counter_id.handle & 0xFFFF) to fit in 16-bit storage
- Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency
- Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed
- All counter lookups now properly handle agent-encoded vs base metric IDs
This ensures counter IDs are consistent between metadata and output records while
maintaining compact storage in instance records.
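
The encoding described above can be sketched as below. The bit positions, the `+ 1` offset, and the `0xFFFF` mask come from this message; the function names are hypothetical, and leaving bits 31-16 at zero is an assumption of the sketch.

```python
AGENT_SHIFT = 32           # agent field starts at bit 32
AGENT_MASK = 0x3F          # six bits wide: bits 37-32
BASE_METRIC_MASK = 0xFFFF  # base metric ID occupies bits 15-0


def encode_counter_id(base_metric_id, logical_node_id):
    """Combine a 16-bit base metric ID with agent information."""
    if not 0 <= base_metric_id <= BASE_METRIC_MASK:
        raise ValueError("base metric ID must fit in 16 bits")
    encoded_agent = logical_node_id + 1  # +1 so agent 0 is not stored as 0
    if not 1 <= encoded_agent <= AGENT_MASK:
        raise ValueError("agent does not fit in bits 37-32")
    return (encoded_agent << AGENT_SHIFT) | base_metric_id


def extract_base_metric_id(counter_id):
    """Mask off everything except the 16-bit base metric ID."""
    return counter_id & BASE_METRIC_MASK


def extract_logical_node_id(counter_id):
    """Return the agent's logical_node_id, or None if no agent is encoded."""
    encoded_agent = (counter_id >> AGENT_SHIFT) & AGENT_MASK
    return None if encoded_agent == 0 else encoded_agent - 1
```

Because the stored agent value is offset by one, a zero agent field unambiguously means "no agent encoded", which is how a plain base metric ID can be told apart from an ID that belongs to agent 0.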
Work around a rare issue on gfx90x ASICs where SDMA_OP_POLLREGMEM
returns before the polled memory has a value of 0.
Remove the previous software workaround of double-polling, as it was not reliable.
* rocr: hsa_amd_pointer_info returns error on shutdown
Decrement the ref count before starting to unload so that API
calls made during shutdown return an error.
Delete blit objects in the agent destructor.
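
The ordering described above (drop the reference first, then unload) can be sketched as follows. This is an illustration only; the `Runtime` class, the status strings, and the `on_unload` hook are hypothetical and do not mirror the actual runtime code.

```python
import threading

SUCCESS = "SUCCESS"
ERROR_NOT_INITIALIZED = "ERROR_NOT_INITIALIZED"


class Runtime:
    def __init__(self, on_unload=None):
        self._lock = threading.Lock()
        self._ref_count = 0
        self._on_unload = on_unload  # hypothetical hook standing in for unload work

    def init(self):
        with self._lock:
            self._ref_count += 1
        return SUCCESS

    def pointer_info(self, ptr):
        # API entry points refuse to run once the ref count has reached zero.
        with self._lock:
            if self._ref_count == 0:
                return ERROR_NOT_INITIALIZED
        return SUCCESS

    def shut_down(self):
        with self._lock:
            if self._ref_count == 0:
                return ERROR_NOT_INITIALIZED
            # Decrement before unloading so concurrent API calls see
            # the runtime as closed while teardown is in progress.
            self._ref_count -= 1
            if self._ref_count > 0:
                return SUCCESS
        if self._on_unload is not None:
            self._on_unload(self)
        return SUCCESS
```

In this sketch, any call that arrives while the unload hook is running already observes a zero ref count and returns an error instead of touching state that is being torn down.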
* Add support for HSA_AMD_SYSTEM_SHUTDOWN_EVENT
Add support for a new event that indicates shutdown within the
hsa_amd_register_system_event_handler API.
* SWDEV-560065 - Revert "SWDEV-555484 - Invalidate capturing stream only for null/legacy stream. (#1032)"
This reverts commit 99613f1009.
* SWDEV-560065 - Revert "SWDEV-542700 - Return an error if stream capture is attempted on the null stream while a stream capture is active. (#450)"
This reverts commit 0647cf1d28.
* Change how cache manager handles child process trace cache
* Sampling and backtrace metrics to cache
* Apply cmake formatting
* Fix parsing of metadata json
* Code clean up
* Fix build nlohmann json from source
* Fix storage parsed finished callback
* Revert sampling for child process
* Change cache file name generation
* Fix thread start stop
* Fix process start end timestamp
* Applied suggestions from code review
* Try with late start of flushing task thread
* Change dockerfiles for ci
* Revert changes on github workflows
* Remove json_fwd.hpp include
* fix dump
* Build nlohmann/json by default
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Update location of build artifacts for nlohmann/json
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Revert use_output_suffix
* Remove unused logs
* Fix cache store inside counter due to structure change
* Remove decode tests from debian ci
* Fix issue where all databases have the same UUID (#1499)
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* Removing the cpack and install steps to save space
* Revert "Remove decode tests from debian ci"
This reverts commit ddabf6dd142dcf438e6b8997b8abe86f2c868468.
* Revert "Removing the cpack and install steps to save space"
This reverts commit 973da3a1ba99d99d529af5269d30e177092f9bfa.
* Add prepare-runner job as dependency to clean up the space
* Fix formatting
* Free up even more space
* Remove verbose for workflows
* remove hw_counters from ext_data
* move space clean up inside container
* try to remove external folder to free up space
* Check space
* Refactor cleanup into its own step
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Djordjevic <aleksandar.djordjevic@amd.com>
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
* SWDEV-555347 - Remove lock contention in async events loop
* SWDEV-555347 - Introduce Pool of AsyncEventItems
* create generic mempool for AsyncEventItem
* Use BaseShared allocate and free for async event pool
---------
Co-authored-by: Rahul Manocha <rmanocha@amd.com>
* Check emulator mode at runtime
* Reduce emu mode function calls to a single call and reuse the result
* Move function to main.cc
* Address feedback
* EmuMode check improvement; convert to AoS
* replace g_isEmuMode with func call
* Add mode check func for every sample
1. Create a small set of NUMA interfaces.
On Linux, the interface is based on system calls rather than libnuma.
On Windows, the interface also works, but the policy class is a dummy.
Unlike Linux, Windows does not provide the numactl tool or a NUMA library to set up a NUMA policy, so
the default policy is followed on Windows: the closest host NUMA node is used to allocate
pinned host memory in hipHostMalloc().
To get the closest host NUMA node of a GPU device, query the new attribute
hipDeviceAttributeHostNumaId. You can then create a thread with CPU affinity on that NUMA node.
For an example, see the test in hip-tests/catch/perftests/memory/hipPerfHostNumaAllocWin.cc.
2. Remove pfnSetThreadGroupAffinity and pfnGetNumaNodeProcessorMaskEx, as these functions have been exposed since Windows 7 and Windows Server 2008.
3. Other minor fixes.
Changes:
- Fix `rocm-smi --setsclk [0 .. n]` for multiple devices to continue on failure when
in a partitioned configuration (e.g., DPX/QPX/CPX).
- Partitioned configurations or devices that do not support changing
sclk/mclk/pcie clocks now continue on failure and report a "not
supported" or other rocm-smi error code for these devices.
- These updates also affect other clock settings such as `--setmclk` and
`--setpcie`.
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
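
The continue-on-failure behavior described above can be sketched as a loop that records a per-device status instead of aborting on the first error. This is an illustration under assumptions: `NotSupportedError`, `set_clock_all`, and the `set_clock` callback are hypothetical names, not rocm-smi functions.

```python
class NotSupportedError(Exception):
    """Hypothetical error raised when a device cannot change the clock."""


def set_clock_all(devices, level, set_clock):
    """Apply a clock level to every device, continuing past failures."""
    results = {}
    for dev in devices:
        try:
            set_clock(dev, level)
            results[dev] = "ok"
        except NotSupportedError:
            # e.g. a partitioned device: report it and move on
            results[dev] = "not supported"
        except RuntimeError as err:
            # any other per-device failure is reported, not fatal
            results[dev] = f"error: {err}"
    return results
```

The caller receives one status per device, so a single unsupported device no longer prevents the remaining devices from being configured.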
* initial commit
* add csv support extraction for non kernel selection mode
* add --kernel-trace for rocprofiler-sdk mode
* make non kernel selective mode runnable
* make kernel selection work with -k
* remove upper case of arg hint
* update documentation
* display each kernel name in only one place and merge instruction IDs that share the same object ID and offset
* remove kernel name display for single-kernel selection
* change log added
---------
Co-authored-by: Fei Zheng <44449748+feizheng10@users.noreply.github.com>
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister
* SWDEV-554174 Added hipHostRegisterIoMemory flag in test cases
* SWDEV-554174 - Fix formatting
* SWDEV-554608 - set HSA_AMD_MEMORY_POOL_UNCACHED_FLAG if IoMemory is set
* SWDEV-554608 - set HSA_AMD_MEMORY_POOL_UNCACHED_FLAG if IoMemory is set
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister
---------
Co-authored-by: Anavena Venkatesh <Anavena.Venkatesh@amd.com>
Co-authored-by: Rambabu Swargam <rambabu.swargam@amd.com>
- Remove unimplemented older API functions
- Remove mentions of reattach API
- Remove details on implementing a process attachment library
- This will return later as a theory of operation
* Check if test exists before adding validation
* Adjust validation parameters for rocpd_string
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Add rdhc script to the rocm-core package
* Create the rdhc symlink within the package itself.
* rdhc tool support is not enabled for Windows.
* [RDHC] Check whether the required pip packages are present and warn if not.
rdhc checks whether the required pip packages are present.
If any are missing, it warns the user and exits gracefully.
Signed-off-by: Saravanan Solaiyappan <saravanan.solaiyappan@amd.com>
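
A check of this kind can be sketched with the standard library as below. This is not the rdhc code: the function names are hypothetical, and the sketch assumes each requirement is given as a top-level importable module name, which does not always match the pip package name.

```python
import importlib.util
import sys


def missing_modules(required):
    """Return the required top-level modules that cannot be found."""
    return [name for name in required
            if importlib.util.find_spec(name) is None]


def check_requirements(required):
    """Warn on stderr about missing modules; return True if all are present."""
    missing = missing_modules(required)
    if missing:
        print("warning: missing required Python packages: "
              + ", ".join(missing), file=sys.stderr)
        return False
    return True
```

A script could call `check_requirements([...])` at startup and exit with status 0 after the warning when it returns `False`, so the user sees a message rather than an import traceback.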