175 کامیت‌ها

مولف SHA1 پیام تاریخ
Adam Pryor 9425a2f687 [SWDEV-569427] Fix segfault calling bad page info (#2547) 2026-01-13 09:44:49 -06:00
Adam Pryor 5bf6e366dd [SWDEV-548460] Add RDC Policy Reset Message (#2180)
* [SWDEV-548460] Add RDC Policy Reset Message

* [rdc] Bump version to 1.3.0

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

* chore: [rdc] Format CMakeLists.txt

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-12-29 08:31:13 -08:00
Adam Pryor bd6c6852fc [SWDEV-566924] Update KFD_ID metric to use amd-smi instead of rocprof (#2355) 2025-12-18 08:39:19 -06:00
Benjamin Welton e3c051d9b8 [RDC] Optimize RDC counter sampling with greedy packing algorithm (#1590)
* Optimize RDC counter sampling with greedy packing algorithm

This change significantly reduces the number of rocprofiler-sdk sample calls
by implementing a greedy packing algorithm that groups multiple counters into
the minimal number of hardware profiles.

Key improvements:
- Implement greedy packing algorithm to combine counters into minimal profiles
- Add ProfileSet structure to manage packed counter configurations
- Cache packed profile sets for reuse across queries
- Group telemetry field requests by GPU for bulk processing
- Reduce sample calls by ~35% (from 100 to 65 for typical workloads)

Performance impact:
- 13 counters now packed into 3 profiles (77% compression)
- Reduces overhead from profile creation and context switching
- More efficient utilization of hardware counter resources

Implementation details:
- Added create_profiles_for_counters() using greedy algorithm
- Added sample_counters_with_packing() for bulk sampling
- Modified telemetry layer to use rocp_lookup_bulk()
- Preserves all field transformations and special handling

Testing shows successful packing with expected performance gains.
No functional changes to external APIs or behavior.

Co-Authored-By: Ben Welton <bwelton@amd.com>

* Address PR review feedback

This commit addresses all review comments from the initial PR:

1. Fix division by zero risk in debug logging
   - Added check for empty counters vector before calculating compression ratio
   - Avoids potential division by zero when logging profile creation stats

2. Improve thread safety for statistics tracking
   - Changed static uint64_t to std::atomic<uint64_t> for thread-safe counters
   - Prevents race conditions in multi-threaded sampling scenarios

3. Remove unused variable
   - Removed unused profile_index variable that was incremented but never used
   - Cleaned up dead code

4. Clean up code formatting
   - Removed extra blank lines for consistency
   - Applied formatting fixes across modified files

5. Refactor code duplication between rocp_lookup and rocp_lookup_bulk
   - Created apply_field_transformation() helper function
   - Eliminates ~70 lines of duplicated switch statement logic
   - Centralizes field transformation logic in single location
   - Makes future maintenance easier

6. Document non-rocprofiler metrics handling
   - Added comments explaining how bulk lookup handles special cases
   - Clarifies that non-profiler fields like KFD_ID are handled in transformation

All changes maintain backward compatibility and pass compilation.

Co-Authored-By: Ben Welton <bwelton@amd.com>

---------

Co-authored-by: Ben Welton <bwelton@amd.com>
Co-authored-by: Adam Pryor <61172547+adam360x@users.noreply.github.com>
2025-12-17 07:56:33 -06:00
Yazen AL Musaffar 277072f241 Fix for unexpected behavior by ECC_UNCORRECT field (#1088) 2025-12-08 12:07:00 -06:00
Yazen AL Musaffar 16b9160034 [RDC] [SWDEV-551280] RDC to include Error Counters (#1087)
* rdc error counter

* RDC error counters

* fix

* Updates

* updated field names

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>

---------

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>
Co-authored-by: yalmusaf_amdeng <yalmusaf@amd.com>
2025-12-03 15:22:18 -06:00
Dmitrii a2cff3c84d [RDC] Fix GPU_COUNT metric to only count GPUs (#1453)
* [RDC] Fix GPU_COUNT metric to only count GPUs
* [RDC] Clean up float->double casts

---------

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-10-30 12:50:47 -05:00
Dmitrii e0ec72ccdd [rdc] Bump rocprofiler-sdk requirement to 1.1.0 (#1610)
Fixes RDC builds broken by #1563
2025-10-30 10:06:45 -04:00
Dmitrii 8abe24d3b0 rdc: Add CPU support and CPU metrics infrastructure (#770) 2025-09-12 16:14:38 -05:00
Dmitrii a2d3f4a0e0 rdc: Profiler - improve metrics path detection (#333)
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-13 12:33:17 -05:00
Galantsev, Dmitrii 2d41f97290 Bump version to 1.2.0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 543543ff1b]
2025-08-05 20:06:12 -05:00
Galantsev, Dmitrii 45e62ada3d Profiler - Add metrics location
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 0a19a7ffc1]
2025-07-30 16:59:44 -05:00
Galantsev, Dmitrii 758adbc1a3 Profiler - Update counter definitions to match changed api
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 8f3a232613]
2025-07-23 23:27:04 -05:00
Galantsev, Dmitrii 213ccc7e72 RVS - Fix iet_stress by disabling logging
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 3f2f92a37a]
2025-07-22 16:02:14 -05:00
Galantsev, Dmitrii 8fc1d27ecd Profiler - Remove UUID metric
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 059451d48f]
2025-07-22 14:55:28 -05:00
Pryor, Adam bf01498af7 [SWDEV-541958] Fix config (#217)
* [SWDEV-541958] Fix config

Change-Id: I6703821747ade5adb993ab7f386f3658db8a3357

* fixes

Change-Id: I0a1c7d96452d9b2ccb6401b77d73398a67518e91

[ROCm/rdc commit: 6a356e7bb1]
2025-07-21 15:05:49 -05:00
Pryor, Adam 07346922f5 Adam/bill cleanup (#209)
Co-authored-by: Bill(Shuzhou) Liu <shuzhou.liu@amd.com>


[ROCm/rdc commit: ca9d8c4bae]
2025-07-07 15:41:22 -05:00
Galantsev, Dmitrii 1d55c1d820 CMAKE - Format with gersemi
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 40545dcb49]
2025-06-27 17:25:51 -05:00
Galantsev, Dmitrii 89a495e493 Profiler - Remove rocprofiler-v1 remnants
Also force unset HSA_TOOLS_LIB so it doesn't break rocprofiler-sdk

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e73eaf8115]
2025-06-27 13:15:52 -05:00
Galantsev, Dmitrii bb0c4b7653 Python - Add entitycodec
Change-Id: I9dc7f5786e2c5ee5f9756cad7cb12387d05982ae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: cae49cf4f7]
2025-06-24 17:01:43 -05:00
Galantsev, Dmitrii 5151fe9649 CMAKE - CONFIGURE -> CONFIG
Change-Id: I716f713363469091e944bdda5ecd6886a3a43aa1
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 502fcef7b3]
2025-06-24 17:01:43 -05:00
Galantsev, Dmitrii ad14980e9a Profiler - Add partition support
NOTE: GPU ordering used is not the same as in HSA/HIP.

GPUs are ordered via amdsmi and then GPU_ID fields are compared to map
GPU partitions to each other.

Change-Id: If379214f5281d7d5ee98515b3e5ba7affc2e2197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 85b619b2f0]
2025-06-03 19:34:00 -05:00
Galantsev, Dmitrii a14c15ea28 Profiler - Update to 1.0
Change-Id: Iee6d5e7a87a5eb8eed61adccf6729e4d6a144bf8
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 2adc8f82c6]
2025-06-03 19:34:00 -05:00
Pryor, Adam 331f648ba0 RDC Event Process Start/Stop Fix (#193)
Change-Id: Ib68f9909f2a6e0a1e5764298f1012a2bcf7ce1fc

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 76e9846bb1]
2025-06-03 18:07:37 -05:00
Pryor, Adam 151b0301f1 [SWDEV-535739] Align RDC with amdsmi 26.0 (#191)
* Align RDC with amdsmi 26.0.0
* Remove RDCI_IOLINK_TYPE_NUMIOLINKTYPES

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: Ib7f2a22bd9544e0bf74afb1ed8d8f8b79b129b1a

[ROCm/rdc commit: cc7ccf507a]
2025-06-02 18:27:19 -05:00
Pryor, Adam ec661d5d17 [SWDEV-243250] RDC Process Start/Stop integration (#189)
Change-Id: I3d2be33b5d23cd259b3d06fb572f81d19e6c3798

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 0e9c3b2c4f]
2025-06-02 14:42:21 -05:00
Galantsev, Dmitrii 0d352c515e Profiler - Align SMI and Profiler indices
Change-Id: If2bb850ffd1c1b8b16a8f5963a0f6971f82d4863
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eff955fdf7]
2025-05-21 19:11:17 -05:00
adapryor 0702a6a5a2 Profiler - Fix SIMD Utilization
Change-Id: I6775cce9901a714d20e80c8c17e7a563edeb48a4


[ROCm/rdc commit: 33924ea79e]
2025-05-07 00:56:52 -05:00
Galantsev, Dmitrii 1e8bc4dc96 CMAKE - Format with cmake-format
Change-Id: I08e71fc5060b1f6e0168225cc5fe66886c2044bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: fa8b89f4ae]
2025-05-06 17:28:14 -05:00
Galantsev, Dmitrii b6488d150d Profiler - Add SIMD_UTILIZATION (#171)
Change-Id: I19d5acd80dbed8c4fc4e1c85eec71ca89398d299

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 02c0786a2c]
2025-05-06 13:20:03 -07:00
Pryor, Adam 2cb7903b06 [SWDEV-523349/SWDEV-527257] Fix Rdci Config (#161)
Change-Id: Iae21ea8061205f186086a3ed59c6259ddeb1dbe7

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 2db6ddea69]
2025-04-28 11:57:51 -05:00
Galantsev, Dmitrii 375ab5eace Add RDC_FI_GPU_BUSY_PERCENT
AMDSMI needs to merge first and bump the version to at least 24.4.2

Change-Id: I30149bb78c79ebc3de0dabdc8e63fcef12b2f406
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: a5cb334f8b]
2025-04-15 17:00:56 -05:00
Galantsev, Dmitrii e15c5a15fa CMAKE - Bump version to 1.1.0
Change-Id: I0fbc0f6d842c034ad858f30fa6418afd01e11a4f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: ac50573e67]
2025-04-11 17:27:27 -05:00
Galantsev, Dmitrii 0a05e0db08 Profiler - Remove buffer to fix memory leaks
Change-Id: Ia3717ccfc147221557f5469965c2abb76b3f451c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: dfae9cd37f]
2025-04-11 17:27:27 -05:00
Pryor, Adam 9d25978a3f [SWDEV-515192] Fix rdc topo (#146)
Change-Id: I64a8077a56e2eaf99735fafb1010d869a1fdb0c3

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 58811fecbb]
2025-04-10 17:46:08 -05:00
Galantsev, Dmitrii d87fe5bada Profiler - Fix eval fields
The 'value' pointer was being written to a lot and then used for reading
within the same function. This likely caused issues all over RDC when
reading the metrics.

This commit changes it so *value is written to only once.

Change-Id: I83c158c1e46c6ce46ff87d8a2e769f26ffa8c0da
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 91be467cad]
2025-04-09 20:06:21 -05:00
Galantsev, Dmitrii 5276903800 Revert "Implement CPU discovery support"
This reverts commit f967f8a17d15e148464393fcd145af01dc0e1525.


[ROCm/rdc commit: 24024f0e4f]
2025-04-07 20:45:19 -05:00
Galantsev, Dmitrii 8afcedfc96 Revert "Fix breaking changes introduced with CPU support"
This reverts commit e9ac9e4626e3e45ebdfafb39e251d073091429f1.


[ROCm/rdc commit: c96f5db52c]
2025-04-07 20:45:19 -05:00
Galantsev, Dmitrii 3e8f56c430 Fix breaking changes introduced with CPU support
Changes introduced in f0f44d977f
broke RDC if it was compiled without ESMI support, or if esmi driver is
not loaded when RDC is being used.

Change-Id: Id54e1e9002d2e3cf09240081149eed84178700af
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 0aeceefcb3]
2025-04-07 14:41:46 -05:00
Yuan, Perry f0f44d977f Implement CPU discovery support (#77)
* Implement CPU discovery support

SWDEV-482949:

enable the CPU model name info support to the RDC, rdci command
can detect GPU and CPU modules at the same time.
It will query the CPU info through the amdsmi interface like below:

1 GPUs found.
-----------------------------------------------------------------
GPU Index        Device Information
0               AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index        Device Information
0               AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------

Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>

* CMAKE - Add required version for amdsmi

Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 3bdca8b8b6]
2025-03-31 10:58:36 +08:00
Galantsev, Dmitrii 874a7b438f CMAKE - Fix build types
Addresses issue https://github.com/ROCm/rdc/issues/43

Change-Id: I456184358524a6feef4bf83eecb655678c3bc42d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 80ee980cdb]
2025-03-30 18:54:54 -05:00
Galantsev, Dmitrii e80760c890 RVS - Add long-running tests
Change-Id: Iddeb7f2d4fdcd69d7ac1ae94b2fa128ee3011b1a
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bdb2367010]
2025-03-27 23:42:56 -05:00
Galantsev, Dmitrii 3273e2993b Profiler - Remove bootstrap link
Change-Id: Ieea57515d77c2d521d95568c3bc2660cc829d829
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 58350a8bb8]
2025-03-27 23:29:30 -05:00
Galantsev, Dmitrii 7ce869c8d6 CMAKE - Add BUILD_INTERFACE include dirs for rdc_bootstrap
Change-Id: I93df878b21e245277c7a8d9589102a15c2517f4f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 059d015ea4]
2025-03-27 23:29:30 -05:00
Galantsev, Dmitrii bfee4ae9ee Profiler - Add CPC and CPF metrics
Change-Id: I27fd725e9e1868c9afe7624d6e4aafad2a42d47e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 51de344be7]
2025-03-27 19:01:23 -05:00
Pryor, Adam fe868f6763 [SWDEV-498711] RDC Partition Implementation (#119)
* [SWDEV-498711] RDC Partition Implementation

Change-Id: Ibfc3709793770537e4c9d36458f34c6b4f461724
Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 47692d3ed5]
2025-03-27 14:10:11 -05:00
Galantsev, Dmitrii 791fa376e9 Fix amdsmi_get_power_info API
This change creates a workaround for a broken C api in amdsmi.

amdsmi_get_power_info API is broken in rocm 6.4.0 (amdsmi 25.2) and is fixed
in rocm 6.4.1 (amdsmi 25.3).

Breaking AMDSMI change:
https://github.com/ROCm/amdsmi/commit/dc4a16da6fb45d581a6e23c78d340172989418a0

Change-Id: Ib45a2702aa722c7735f3ccd1081d8f62e4d34216
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 929041b556]
2025-03-19 23:12:45 -05:00
Galantsev, Dmitrii 68c02bda78 RVS - Use config files and make GPU aware
Change-Id: I7a5c80ed4e6122d102e494d1ae38b4b7d40c42cd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: f5a4402ce5]
2025-03-11 15:39:16 -05:00
Galantsev, Dmitrii 122ab5c053 RVS - Disable IET test
Change-Id: I015d68735316d2dc6af18d16f972d9f379b76bcf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 247c8c7d5e]
2025-03-11 09:51:08 -05:00
Galantsev, Dmitrii 705d42f0f5 CMAKE - Set fallback version to 0.3.0
Change-Id: I2322bdb7d3a8e4f83346ca4f5d24351ad2a4eccc
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d5f8ff0ab0]
2025-03-04 08:43:32 -06:00