Commit Graph

64719 Commits

Author SHA1 Message Date
Geo Min 8e98b80deb [TheRock CI] Fixing patches for rocm-systems (#1460)
* Fixing patches for rocm-systems

* Adding all

* Adding remaining projects

* Submodule bump

* adding compiler

* adding test commit hash

* Adding artifact group

* adding update for artifact group

* Adding new commit hash
2025-10-28 19:47:17 -07:00
Ajay GunaShekar 22213c0ec3 SWDEV-559569 - enable fixed tests (#1363) 2025-10-28 12:17:15 -07:00
David Galiffi 3d7a5eec0e Setup rocprofsys_root environment variable (#1561)
* Setup `rocprofsys_root` environment variable

* Update `CHANGELOGS`

* Fixed formatting

* Add rocpd output and validation to python tests

* Refactoring environment setup
2025-10-28 13:06:07 -04:00
Venkateshwar Reddy Kandula c5bd693478 [rocprofiler-sdk] Disable HIP/CLR build in rocprofiler-sdk CI jobs (#1574)
* disable HIP/CLR build

* misc. fix
2025-10-28 11:42:11 -05:00
Gopesh Bhardwaj 2be2945228 Version bump and CHANGELOG update for 7.1 (#1563) 2025-10-28 11:53:32 -04:00
Swati Rawat f0f008d494 Update using-rocprofv3-process-attachment.rst (#1534) 2025-10-28 11:52:23 -04:00
ywang103-amd 99183ffd92 fix failure of pc sampling and unit tests (#1526) 2025-10-28 11:30:32 -04:00
systems-assistant[bot] 00b2bd3e8c SWDEV-515530 - Re-enable passing test (#598) 2025-10-28 11:23:30 +01:00
Ajay GunaShekar f8e3858659 remove usage of HIP_RETURN in internal function (#1359) 2025-10-27 15:37:46 -07:00
Rahul Manocha f5d901f016 SWDEV-546311 - implement hipKernelGetLibrary & hipLibraryEnumerateKer… (#1143)
* SWDEV-546311 - implement hipKernelGetLibrary & hipLibraryEnumerateKernels API

* Fix for LibraryEnumerateKernel and KernelGetName

* Update Enumerate Kernels to handle 0 numKernels

* Minor fixes to function names

* fix error checking in internal function

* Update changelog for new apis

---------

Co-authored-by: Rahul Manocha <rmanocha@amd.com>
2025-10-27 14:13:17 -07:00
Shadi Dashmiz 3e59eebf17 SWDEV-558510:Correct max mem per multiprocessor value (#1207)
Signed-off-by: sdashmiz <shadi.dashmiz@amd.com>
2025-10-27 15:45:06 -04:00
David Yat Sin 6497fa0339 rocr: Fix wrong args in memory copy functions (#1520)
Fix incorrect arguments passed into system_region->Lock
2025-10-27 14:12:06 -05:00
Gopesh Bhardwaj 1585fe59cd [Documentation] Repo location and limitation update (#1537) 2025-10-27 12:26:05 -04:00
MachineTom eb69a455ed SWDEV-558844 - Cleanup Os header (#1530)
Remove code that isn't used in the Os header.
2025-10-27 11:52:31 -04:00
systems-assistant[bot] c1926d547e SWDEV-515530 - Re-enable passing tests on NV (#605) 2025-10-27 16:32:37 +01:00
Benjamin Welton d496bcef18 Fix dimension mismatch for multi-GPU systems with identical architect… (#1440)
* Fix dimension mismatch for multi-GPU systems with identical architectures

This change addresses an issue where counter dimensions were incorrectly
shared across all GPU agents with the same architecture name, even when
those agents had different hardware configurations (e.g., different CU counts).

Changes:
- Updated getBlockDimensions() to accept agent ID instead of architecture name
- Made dimension cache agent-specific instead of architecture-specific
- Updated set_dimensions() in AST evaluation to use specific agent ID
- Modified all API functions to handle agent-specific dimension lookups
- Updated tests to work with agent-specific dimensions

This fix ensures that dimensions accurately reflect the actual hardware
configuration of each individual GPU agent, preventing dimension mismatches
in multi-GPU systems where GPUs share the same architecture but have
different physical configurations.

Counter ID Representation Changes:
- Modified counter_id encoding to include agent information in bits 37-32
- Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable
- Counter records internally store only 16-bit base metric IDs (bits 15-0)
- Tool reconstructs agent-encoded counter IDs from base metric ID & agent info
- Instance record counter_id field uses bitwise AND mask to extract base metric ID
  (counter_id.handle & 0xFFFF) to fit in 16-bit storage
- Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency
- Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed
- All counter lookups now properly handle agent-encoded vs base metric IDs

This ensures counter IDs are consistent between metadata and output records while
maintaining compact storage in instance records.
2025-10-27 07:58:20 -07:00
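
The counter ID layout described in the commit message above can be illustrated with a minimal sketch. It assumes only what the message states: the agent's logical_node_id is stored as (value + 1) in bits 37-32, and the base metric ID occupies bits 15-0. The function and constant names are hypothetical and are not taken from rocprofiler-sdk. The message does not describe bits 31-16, so the sketch leaves them zero.

```python
# Illustrative sketch only; names are hypothetical, not the rocprofiler-sdk API.
AGENT_SHIFT = 32                 # agent field starts at bit 32 (per the commit message)
AGENT_MASK = 0x3F                # six bits: 37-32
BASE_METRIC_MASK = 0xFFFF        # base metric ID lives in bits 15-0


def encode_counter_id(base_metric: int, logical_node_id: int) -> int:
    """Combine a 16-bit base metric ID with agent info stored as (id + 1)."""
    if not 0 <= base_metric <= BASE_METRIC_MASK:
        raise ValueError("base metric ID must fit in 16 bits")
    # (logical_node_id + 1) must fit in the six-bit agent field.
    if not 0 <= logical_node_id < AGENT_MASK:
        raise ValueError("logical_node_id + 1 must fit in six bits")
    return ((logical_node_id + 1) << AGENT_SHIFT) | base_metric


def extract_base_metric(handle: int) -> int:
    """Mask off everything above bit 15, as in (counter_id.handle & 0xFFFF)."""
    return handle & BASE_METRIC_MASK


def extract_agent(handle: int):
    """Return the logical_node_id, or None when no agent is encoded."""
    field = (handle >> AGENT_SHIFT) & AGENT_MASK
    # A zero field means "no agent"; the +1 offset keeps agent 0 distinguishable.
    return field - 1 if field else None
```

Storing the agent as (value + 1) means a handle with a zero agent field can be read as a plain base metric ID, while agent 0 still yields a nonzero field.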
systems-assistant[bot] e22856b3ac SWDEV-515562 - Fix and enable hipDeviceReset tests (#594) 2025-10-27 15:07:44 +01:00
systems-assistant[bot] 8cc65f49c4 SWDEV-491296 - Add stream capture testcases to Virtual Memory APIs (#589) 2025-10-27 15:06:51 +01:00
marantic-amd 08d259c24c Fix the issue when sampling JAX with rocpd (#1552) 2025-10-27 09:59:51 -04:00
David Yat Sin f7b180ee7d rocr: SW workaround for gfx90x SDMA poll (#1469)
Workaround for rare issue on gfx90x asics when SDMA_OP_POLLREGMEM
returns before polled memory has value of 0.
Removing previous SW workaround to double-poll as it was not reliable.
2025-10-27 09:33:20 -04:00
David Yat Sin db01d95ebc Users/dayatsin/swdev 519413 hsa amd pointer info return err shutdown (#1509)
* rocr: hsa_amd_pointer_info return err on shutdown

Decrement ref count before starting to unload to make sure API
calls during shutdown return error.

Delete blit objects during agent destructor.

* Add support for HSA_AMD_SYSTEM_SHUTDOWN_EVENT

Add support for new event to indicate shut down within the
hsa_amd_register_system_event_handler API.
2025-10-27 09:32:52 -04:00
systems-assistant[bot] 45d6598724 SWDEV-517867 - Enable Unit_hipStreamCreateWithPriority_MulthreadDefaultflag (#599) 2025-10-27 11:36:40 +01:00
systems-assistant[bot] abaf29d0b6 SWDEV-537855 - Add hipEventDestroy (#554)
Co-authored-by: Vladana Stojiljkovic <Vladana.Stojiljkovic@amd.com>
2025-10-26 21:20:21 +01:00
SaleelK f301053740 clr: Improve logging (#1457) 2025-10-25 15:55:27 -07:00
David Galiffi e22a8e865e Update Timemory submodule (#1539)
- Fixes clang build failure

Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Janicijevic <Aleksandar.Janicijevic@amd.com>
2025-10-25 14:56:43 -04:00
David Galiffi 28c2728b6b Update Dyninst module (#1540)
- Fix nullptr check

------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Janicijevic <Aleksandar.Janicijevic@amd.com>
2025-10-25 14:56:29 -04:00
MachineTom 6a49171fa5 SWDEV-562431 - Fix Unit_hipBindTexture_Negative failure (#1523) 2025-10-24 16:25:22 -04:00
Rakesh Roy e9dac39102 SWDEV-560065 - Revert changes to align error code with Cuda when stream capture is tried on Legacy stream (#1337)
* SWDEV-560065 - Revert "SWDEV-555484 - Invalidate capturing stream only for null/legacy stream. (#1032)"

This reverts commit 99613f1009.

* SWDEV-560065 - Revert "SWDEV-542700 - Return an error if stream capture is attempted on the null stream while a stream capture is active. (#450)"

This reverts commit 0647cf1d28.
2025-10-24 21:33:25 +05:30
Milan Radosavljevic 8806be162c Change how cache manager handles child process trace cache for rocpd (#1033)
* Change how cache manager handles child process trace cache

* Sampling and backtrace metrics to cache

* Apply cmake formatting

* Fix parsing of metadata json

* Code clean up

* Fix build nlohmann json from source

* Fix storage parsed finished callback

* Revert sampling for child process

* Change cache file name generating

* Fix thread start stop

* Fix process start end timestamp

* Applied suggestions from code review

* Try with late start of flushing task thread

* Change dockerfiles for ci

* Revert changes on github workflows

* Remove json_fwd.hpp include

* fix dump

* Build nlohmann/json by default

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Update location of build artifacts for nlohmann/json

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Revert use_output_suffix

* Remove unused logs

* Fix cache store inside counter due to structure change

* Remove decode tests from debian ci

* Fix issue where all databases have the same UUID (#1499)

Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>

* Removing the cpack and install steps to save space

* Revert "Remove decode tests from debian ci"

This reverts commit ddabf6dd142dcf438e6b8997b8abe86f2c868468.

* Revert "Removing the cpack and install steps to save space"

This reverts commit 973da3a1ba99d99d529af5269d30e177092f9bfa.

* Add prepare-runner job as dependency to clean up the space

* Fix formatting

* Free up even more space

* Remove verbose for workflows

* remove hw_counters from ext_data

* move space clean up inside container

* try to remove external folder to free up space

* Check space

* Refactor Cleanup to its own step

---------

Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Aleksandar Djordjevic <aleksandar.djordjevic@amd.com>
Co-authored-by: Aleksandar Djordjevic <adjordje@amd.com>
2025-10-24 11:47:15 -04:00
Rahul Manocha 4f075902fc SWDEV-555347 - Remove lock contention in async events loop (#878)
* SWDEV-555347 - Remove lock contention in async events loop

* SWDEV-555347 - Introduce Pool of AsyncEventItems

* create generic mempool for AsyncEventItem

* Use BaseShared allocate and free for async event pool

---------

Co-authored-by: Rahul Manocha <rmanocha@amd.com>
2025-10-24 08:43:00 -07:00
marandje 7e20e8ec13 SWDEV-548500 - Resolve memory leaks in memory tests (#1093) 2025-10-24 16:27:48 +02:00
pghoshamd 95f721f8a5 Check emulator mode at runtime (#1432)
* Check emulator mode at runtime

* Reduce emu mode function call to one time and use result

* Move function to main.cc

* Address feedback

* EmuMode check improvement; convert to AoS

* replace g_isEmuMode with func call

* Add mode check func for every sample
2025-10-24 10:11:19 -04:00
systems-assistant[bot] 339877853d SWDEV-487395 - Add capture testcases to memcpy APIs (#587) 2025-10-24 12:43:45 +02:00
systems-assistant[bot] 196086042d SWDEV-523137 - Enable and fix failing tests on NV (#602) 2025-10-24 12:41:54 +02:00
Jatin Chaudhary 48313b8655 SWDEV-1 add missing hiperror entries (#1450) 2025-10-24 09:29:27 +01:00
abchoudh-amd a7bbe0c5d2 Use amd-smi Python API instead of CLI (#1334)
* Use amd-smi Python API instead of CLI

Formatting fix

python path

* Update CHANGELOG

* Create amdsmi interface

* Added amdsmi tests

* Removed run

* Prioritize rocm's amdsmi python API

* address review comments

* update changelog

* fix ruff formatting

---------

Co-authored-by: Vignesh Edithal <Vignesh.Edithal@amd.com>
2025-10-24 11:11:33 +05:30
SaleelK 839fb95717 clr: Do not increase signal pool (#1354)
* Do not increase the signal pool when profiling; instead, allow saving off
  timestamps. This is slower, but it is a trade-off against the memory
  footprint of the signals.
2025-10-23 22:05:00 -07:00
MachineTom 5f76cb916d SWDEV-555888 - Refactor Numa code (#1191)
1. Create a set of mini NUMA interfaces.
On Linux, the interface is based on system calls rather than libnuma.
On Windows, the interface also works, but the policy class is a dummy.
Unlike Linux, Windows doesn't provide a numactl tool or a NUMA library to set up NUMA policy, so
the default policy is followed on Windows, that is, using the closest host NUMA node to allocate
pinned host memory in hipHostMalloc().
To get the closest host NUMA node of a GPU device, query the new attribute
hipDeviceAttributeHostNumaId. You can then create a thread with CPU affinity on that NUMA node.
For an example, see the test in hip-tests/catch/perftests/memory/hipPerfHostNumaAllocWin.cc.

2. Remove pfnSetThreadGroupAffinity and pfnGetNumaNodeProcessorMaskEx, as these functions have been exposed since Win7 and Win server 2008.

3. Other minor fixes.
2025-10-23 21:56:15 -04:00
Ioannis Assiouras 602ea0be1e SWDEV-558078 - Fix use-after-free in graph tests due to AsyncEventHandler (#1502) 2025-10-23 22:49:24 +01:00
Julia Jiang 4942f3cae5 SWDEV-555548 - Fix Unit_hipMemPoolMaxAlloc failure on Windows (#1486) 2025-10-23 17:09:46 -04:00
amd-hsivasun 43687b24f8 [Github Actions] Added monorepo_source_of_truth flag (#1525) 2025-10-23 16:37:12 -04:00
nunnikri 45528ea3fc SWDEV-559329 : Added missing hash value needed for module file (#1431) 2025-10-23 12:05:41 -07:00
Pengda Xie a4bbd73dc6 SWDEV-556684 - Remove HSAIL support (#1183) 2025-10-23 11:21:49 -07:00
Kian Cossettini db949445c3 [rocprofiler-systems] Overhaul OpenMP-VV Test compilation (#1389)
* Reworked Compilation

* Formatting

* Change compile log name

* Optimize Code

* Remove gfx940 and gfx941
2025-10-23 13:58:11 -04:00
Venkateshwar Reddy Kandula 8c89ed8ab1 [rocprofiler-sdk][CI] Use rock infra for rocprofiler-sdk build docs jobs (#1518)
* Initial changes to move build docs job to rock infra

* misc. fix

* clean up code.
2025-10-23 11:17:13 -05:00
Venkateshwar Reddy Kandula 40f9f15ece use rhel 8.10 amdgpu kernel driver for rhel 8.8 (#1490) 2025-10-23 09:00:10 -05:00
Charis Poag Jones 933fdc3c7e [SWDEV-558141] Fix rocm-smi --setsclk [0...n] & other clocks in partitioned configurations (#1493)
Changes:
  - Fix `rocm-smi --setsclk [0 .. n]` for multiple devices to continue on fail when
    in a partitioned configuration (ex. in DPX/QPX/CPX/etc).
  - Partitioned configurations or devices which do not support changing
    sclk/mclk/pcie clks will now continue on failure. Will report a "not
    supported" or other (rocm-smi) error codes for these devices.
  - Updates impact other clock settings such as `--setmclk` and
    `--setpcie`.

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-10-23 08:56:41 -05:00
vedithal-amd 2a37cbf2ca Bump VERSION and add CHANGELOG for ROCm 7.1.1 release (#1447) 2025-10-23 09:34:18 -04:00
ywang103-amd ee805d1014 remove option of json as rocprofv3's intermediate file to avoid test failures of outdated code (#1474) 2025-10-23 09:33:54 -04:00
Gopesh Bhardwaj 30bcf123a8 build fix for linker error (#1376) 2025-10-23 17:35:51 +05:30