69 commits

Author SHA1 Message Date
Adam Pryor bd6c6852fc [SWDEV-566924] Update KFD_ID metric to use amd-smi instead of rocprof (#2355) 2025-12-18 08:39:19 -06:00
Benjamin Welton e3c051d9b8 [RDC] Optimize RDC counter sampling with greedy packing algorithm (#1590)
* Optimize RDC counter sampling with greedy packing algorithm

This change significantly reduces the number of rocprofiler-sdk sample calls
by implementing a greedy packing algorithm that groups multiple counters into
the minimal number of hardware profiles.

Key improvements:
- Implement greedy packing algorithm to combine counters into minimal profiles
- Add ProfileSet structure to manage packed counter configurations
- Cache packed profile sets for reuse across queries
- Group telemetry field requests by GPU for bulk processing
- Reduce sample calls by ~35% (from 100 to 65 for typical workloads)

Performance impact:
- 13 counters now packed into 3 profiles (77% compression)
- Reduces overhead from profile creation and context switching
- More efficient utilization of hardware counter resources

Implementation details:
- Added create_profiles_for_counters() using greedy algorithm
- Added sample_counters_with_packing() for bulk sampling
- Modified telemetry layer to use rocp_lookup_bulk()
- Preserves all field transformations and special handling

Testing shows successful packing with expected performance gains.
No functional changes to external APIs or behavior.
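
The packing step described above can be sketched roughly as follows. This is a first-fit greedy illustration only, not the code from this commit: the `Counter` and `Profile` types, the `pack_counters()` function, and the per-block slot limits are all hypothetical, and the real create_profiles_for_counters() may decide whether counters fit together in a different way.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical counter description: a name plus the hardware block it uses.
struct Counter {
  std::string name;
  std::string block;  // illustrative block label, e.g. "SQ"
};

// Hypothetical packed profile: its counters and the slots used per block.
struct Profile {
  std::vector<Counter> counters;
  std::map<std::string, std::size_t> used_slots;
};

// First-fit greedy packing: put each counter into the first profile that
// still has a free slot on the counter's block; otherwise open a new profile.
// Counters whose block has no known capacity are skipped.
std::vector<Profile> pack_counters(
    const std::vector<Counter>& counters,
    const std::map<std::string, std::size_t>& slots_per_block) {
  std::vector<Profile> profiles;
  for (const Counter& c : counters) {
    auto cap_it = slots_per_block.find(c.block);
    if (cap_it == slots_per_block.end() || cap_it->second == 0) continue;
    const std::size_t capacity = cap_it->second;

    bool placed = false;
    for (Profile& p : profiles) {
      std::size_t& used = p.used_slots[c.block];
      if (used < capacity) {
        p.counters.push_back(c);
        ++used;
        placed = true;
        break;
      }
    }
    if (!placed) {
      Profile p;
      p.counters.push_back(c);
      p.used_slots[c.block] = 1;
      profiles.push_back(std::move(p));
    }
  }
  return profiles;
}
```

Each resulting profile would then be sampled once, so fewer profiles means fewer sample calls. Greedy first-fit is not guaranteed to be optimal, but it is simple and its result can be cached per counter set.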

Co-Authored-By: Ben Welton <bwelton@amd.com>

* Address PR review feedback

This commit addresses all review comments from the initial PR:

1. Fix division by zero risk in debug logging
   - Added check for empty counters vector before calculating compression ratio
   - Avoids potential division by zero when logging profile creation stats

2. Improve thread safety for statistics tracking
   - Changed static uint64_t to std::atomic<uint64_t> for thread-safe counters
   - Prevents race conditions in multi-threaded sampling scenarios

3. Remove unused variable
   - Removed unused profile_index variable that was incremented but never used
   - Cleaned up dead code

4. Clean up code formatting
   - Removed extra blank lines for consistency
   - Applied formatting fixes across modified files

5. Refactor code duplication between rocp_lookup and rocp_lookup_bulk
   - Created apply_field_transformation() helper function
   - Eliminates ~70 lines of duplicated switch statement logic
   - Centralizes field transformation logic in single location
   - Makes future maintenance easier

6. Document non-rocprofiler metrics handling
   - Added comments explaining how bulk lookup handles special cases
   - Clarifies that non-profiler fields like KFD_ID are handled in transformation

All changes maintain backward compatibility and pass compilation.
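
Review items 1 and 2 above might look something like the sketch below. The names `g_sample_calls`, `record_sample_call()`, `sample_calls()`, and `compression_percent()` are hypothetical and are not taken from the actual change.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical process-wide statistic. std::atomic makes concurrent
// increments from several sampling threads well defined.
std::atomic<std::uint64_t> g_sample_calls{0};

inline void record_sample_call() {
  g_sample_calls.fetch_add(1, std::memory_order_relaxed);
}

inline std::uint64_t sample_calls() {
  return g_sample_calls.load(std::memory_order_relaxed);
}

// Compression ratio in percent, e.g. 13 counters in 3 profiles is about 77%.
// Returns 0.0 for an empty counter list instead of dividing by zero.
inline double compression_percent(std::size_t counter_count,
                                  std::size_t profile_count) {
  if (counter_count == 0) return 0.0;
  return 100.0 * (1.0 - static_cast<double>(profile_count) /
                            static_cast<double>(counter_count));
}
```

Relaxed ordering is assumed to be enough here because the counter is only a statistic and does not guard other data.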

Co-Authored-By: Ben Welton <bwelton@amd.com>

---------

Co-authored-by: Ben Welton <bwelton@amd.com>
Co-authored-by: Adam Pryor <61172547+adam360x@users.noreply.github.com>
2025-12-17 07:56:33 -06:00
Dmitrii a2cff3c84d [RDC] Fix GPU_COUNT metric to only count GPUs (#1453)
* [RDC] Fix GPU_COUNT metric to only count GPUs
* [RDC] Clean up float->double casts

---------

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-10-30 12:50:47 -05:00
Dmitrii e0ec72ccdd [rdc] Bump rocprofiler-sdk requirement to 1.1.0 (#1610)
Fixes RDC builds broken by #1563
2025-10-30 10:06:45 -04:00
Dmitrii 8abe24d3b0 rdc: Add CPU support and CPU metrics infrastructure (#770) 2025-09-12 16:14:38 -05:00
Dmitrii a2d3f4a0e0 rdc: Profiler - improve metrics path detection (#333)
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-13 12:33:17 -05:00
Galantsev, Dmitrii 45e62ada3d Profiler - Add metrics location
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 0a19a7ffc1]
2025-07-30 16:59:44 -05:00
Galantsev, Dmitrii 758adbc1a3 Profiler - Update counter definitions to match changed api
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 8f3a232613]
2025-07-23 23:27:04 -05:00
Galantsev, Dmitrii 213ccc7e72 RVS - Fix iet_stress by disabling logging
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 3f2f92a37a]
2025-07-22 16:02:14 -05:00
Galantsev, Dmitrii 8fc1d27ecd Profiler - Remove UUID metric
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 059451d48f]
2025-07-22 14:55:28 -05:00
Pryor, Adam 07346922f5 Adam/bill cleanup (#209)
Co-authored-by: Bill(Shuzhou) Liu <shuzhou.liu@amd.com>


[ROCm/rdc commit: ca9d8c4bae]
2025-07-07 15:41:22 -05:00
Galantsev, Dmitrii 1d55c1d820 CMAKE - Format with gersemi
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 40545dcb49]
2025-06-27 17:25:51 -05:00
Galantsev, Dmitrii bb0c4b7653 Python - Add entitycodec
Change-Id: I9dc7f5786e2c5ee5f9756cad7cb12387d05982ae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: cae49cf4f7]
2025-06-24 17:01:43 -05:00
Galantsev, Dmitrii 5151fe9649 CMAKE - CONFIGURE -> CONFIG
Change-Id: I716f713363469091e944bdda5ecd6886a3a43aa1
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 502fcef7b3]
2025-06-24 17:01:43 -05:00
Galantsev, Dmitrii ad14980e9a Profiler - Add partition support
NOTE: GPU ordering used is not the same as in HSA/HIP.

GPUs are ordered via amdsmi and then GPU_ID fields are compared to map
GPU partitions to each other.
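
A minimal sketch of that matching step, assuming each side exposes a list of GPU_ID values in its own order. The function name `map_smi_to_profiler()` and the plain-vector inputs are hypothetical simplifications and do not reflect the amdsmi or rocprofiler-sdk APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Builds a map from SMI index to profiler agent index by matching GPU_ID
// values. Devices present on only one side are left out of the map.
std::unordered_map<std::size_t, std::size_t> map_smi_to_profiler(
    const std::vector<std::uint64_t>& smi_gpu_ids,
    const std::vector<std::uint64_t>& profiler_gpu_ids) {
  // Index the profiler side by GPU_ID for constant-time lookup.
  std::unordered_map<std::uint64_t, std::size_t> by_id;
  for (std::size_t i = 0; i < profiler_gpu_ids.size(); ++i) {
    by_id.emplace(profiler_gpu_ids[i], i);
  }

  std::unordered_map<std::size_t, std::size_t> result;
  for (std::size_t s = 0; s < smi_gpu_ids.size(); ++s) {
    auto it = by_id.find(smi_gpu_ids[s]);
    if (it != by_id.end()) result.emplace(s, it->second);
  }
  return result;
}
```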

Change-Id: If379214f5281d7d5ee98515b3e5ba7affc2e2197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 85b619b2f0]
2025-06-03 19:34:00 -05:00
Galantsev, Dmitrii a14c15ea28 Profiler - Update to 1.0
Change-Id: Iee6d5e7a87a5eb8eed61adccf6729e4d6a144bf8
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 2adc8f82c6]
2025-06-03 19:34:00 -05:00
Galantsev, Dmitrii 0d352c515e Profiler - Align SMI and Profiler indices
Change-Id: If2bb850ffd1c1b8b16a8f5963a0f6971f82d4863
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eff955fdf7]
2025-05-21 19:11:17 -05:00
adapryor 0702a6a5a2 Profiler - Fix SIMD Utilization
Change-Id: I6775cce9901a714d20e80c8c17e7a563edeb48a4


[ROCm/rdc commit: 33924ea79e]
2025-05-07 00:56:52 -05:00
Galantsev, Dmitrii 1e8bc4dc96 CMAKE - Format with cmake-format
Change-Id: I08e71fc5060b1f6e0168225cc5fe66886c2044bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: fa8b89f4ae]
2025-05-06 17:28:14 -05:00
Galantsev, Dmitrii b6488d150d Profiler - Add SIMD_UTILIZATION (#171)
Change-Id: I19d5acd80dbed8c4fc4e1c85eec71ca89398d299

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 02c0786a2c]
2025-05-06 13:20:03 -07:00
Galantsev, Dmitrii 0a05e0db08 Profiler - Remove buffer to fix memory leaks
Change-Id: Ia3717ccfc147221557f5469965c2abb76b3f451c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: dfae9cd37f]
2025-04-11 17:27:27 -05:00
Galantsev, Dmitrii d87fe5bada Profiler - Fix eval fields
The 'value' pointer was being written to a lot and then used for reading
within the same function. This likely caused issues all over RDC when
reading the metrics.

This commit changes it so *value is written to only once.
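
The write-once pattern can be illustrated as follows. The `eval_field()` function below, with its sum-and-scale computation, is a hypothetical stand-in and not the actual RDC evaluation code.

```cpp
#include <vector>

// Hypothetical evaluator: sums raw samples and scales the total.
// All intermediate work happens on a local variable, and *value is
// written exactly once, on success. On failure *value is left untouched.
bool eval_field(const std::vector<double>& samples, double scale,
                double* value) {
  if (value == nullptr || samples.empty()) return false;

  double result = 0.0;  // local accumulator, never aliased by the caller
  for (double s : samples) result += s;
  result *= scale;

  *value = result;  // the only write to the output parameter
  return true;
}
```

Keeping intermediate state in a local means the function never reads back a partially updated output, and a caller never observes a half-computed value.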

Change-Id: I83c158c1e46c6ce46ff87d8a2e769f26ffa8c0da
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 91be467cad]
2025-04-09 20:06:21 -05:00
Galantsev, Dmitrii e80760c890 RVS - Add long-running tests
Change-Id: Iddeb7f2d4fdcd69d7ac1ae94b2fa128ee3011b1a
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bdb2367010]
2025-03-27 23:42:56 -05:00
Galantsev, Dmitrii 3273e2993b Profiler - Remove bootstrap link
Change-Id: Ieea57515d77c2d521d95568c3bc2660cc829d829
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 58350a8bb8]
2025-03-27 23:29:30 -05:00
Galantsev, Dmitrii bfee4ae9ee Profiler - Add CPC and CPF metrics
Change-Id: I27fd725e9e1868c9afe7624d6e4aafad2a42d47e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 51de344be7]
2025-03-27 19:01:23 -05:00
Galantsev, Dmitrii 68c02bda78 RVS - Use config files and make GPU aware
Change-Id: I7a5c80ed4e6122d102e494d1ae38b4b7d40c42cd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: f5a4402ce5]
2025-03-11 15:39:16 -05:00
Galantsev, Dmitrii 122ab5c053 RVS - Disable IET test
Change-Id: I015d68735316d2dc6af18d16f972d9f379b76bcf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 247c8c7d5e]
2025-03-11 09:51:08 -05:00
Pryor, Adam 0186fc2481 SWDEV-508477 Eval Flops Percent (#85)
SWDEV-508477 - Profiler add FP*_PERCENT

Change-Id: Idb6250fe6b7ba3df6fe7d30861e0fbbda7e9bdce

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 6f358ddc9e]
2025-01-24 10:07:32 -06:00
Galantsev, Dmitrii 3218c2af5c CMAKE - Rename SMI_*_DIR into AMD_SMI_*_DIR
Change-Id: I3b8b852e6b68f1448c8ed5d5e6ea4579c470ff53
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e033fd4c55]
2025-01-23 20:56:00 -06:00
adapryor 8286a92fc1 Implementation for RDC_FI_PROF_OCCUPANCY_PER_ACTIVE_CU SWDEV-50895
Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I8da7d9846edabe5629c75f50cd2bb4b23e019a17
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/rdc commit: 290b90dc89]
2025-01-21 21:49:19 -06:00
Pryor, Adam 9f1f502d93 SWDEV-510089 Fix rocprof segfaulting on ctrl+c (#94)
Change-Id: Iaa0f3856bb8fed174cbc935b85739414ecd44758

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 0ae4404a09]
2025-01-21 10:30:31 -06:00
Galantsev, Dmitrii b78295c8f8 RVS - Add IET and PEBB tests
Change-Id: Ia032901d74c882e5cbfa5a3164199cd4d571341f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 5861ec7663]
2025-01-08 18:23:13 -06:00
Galantsev, Dmitrii 9d32387925 RVS - Add memory bandwidth test
Change-Id: I4c8990170861f6a0f3853615db68634fdaa7a622
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: b058cbecf1]
2025-01-08 18:23:13 -06:00
Greg Scaffidi 725599b51c Add RDC_FI_PROF_SM_ACTIVE metric.
Signed-off-by: Greg Scaffidi <salvatore.scaffidi@amd.com>
Change-Id: I63aaf5eb05d74ba696ace2b088e17c2cfb1bd74b
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/rdc commit: f4de4b0529]
2024-12-21 15:21:46 -06:00
Galantsev, Dmitrii 755ae0ee5d Profiler - Migrate from rocprofv1 to rocprofv3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

Fixed RDC for Rocprofv3

Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: Ic9162bacf1322b265e6bbcdd9fbb9b1fdef414fd

last updates

Change-Id: I12e168501327c5e4cff8a9273b0512fb0e098fe7

comment

Change-Id: I61da61e66dcc017ec46f98ff4c90fb064c9679e8


[ROCm/rdc commit: 7c91a07a43]
2024-12-20 15:39:02 -06:00
Galantsev, Dmitrii d9b13912c6 Profiler - Remove averaging
Averaging happens very slowly and only confuses people...

Change-Id: I60754d3b896b6ffeb6104bb1c2fcc54e9869b331
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 2c61dfe2ce]
2024-12-11 11:58:50 -06:00
Galantsev, Dmitrii fc83179a9d Profiler - Fix fp64 metric
Change-Id: Iab27e21740c2c51143a9e88d085b80716bf193e2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 2605eda5f3]
2024-12-11 11:27:41 -06:00
Chen Gong a8086b484d rocprofiler: add valu utilization
SWDEV-475242

Judging by the descriptions of "FP32 Engine Activity" and "FP64 Engine Activity" in DCGM,
we do not seem to have an equivalent to these pipe utilizations on our hardware.

In rocprofiler, I think VALU Utilization is the closest to what we want.

Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 251fcbe49d]
2024-11-21 15:24:01 +08:00
Galantsev, Dmitrii 8e657c165c RVS - Fix cookie_t -> rdc_diag_callback_t types issue
Issue introduced in ae9030ab1a

Change-Id: I2b6a8024d45fc44d92cf2770be9887dfc0fb3ede
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e1b57c43f3]
2024-11-12 10:36:52 -06:00
Galantsev, Dmitrii ae9030ab1a RVS - Report test progress in realtime
Change-Id: Id9fea71f242f372f408ecd777c030465b7ef9989
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 37ddd5bf50]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 73c79fcd83 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9c77312c51]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii b0035605ee CMAKE - Find modules at build time
Change-Id: I9370ef1433579aff1a37f3636050f525638d8658
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: cdf1588974]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 39687e8d96 CMAKE - Fix RVS include
Change-Id: I65095cc3d04fc2a5daeee5c809f635cb1662822f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

Revert "Disable RVS as the error scares people"

This reverts commit f3450f61bf.

Change-Id: I5086c25772444aa3bfc4c10abc1ea58d3f3f1f27
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: dd50027748]
2024-11-07 11:18:41 -06:00
Galantsev, Dmitrii 793b2de0cb Profiler - Modify metrics
Remove occupancy metrics and replace with OccupancyPercent

Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE

Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 28acbf0436]
2024-10-15 19:00:30 -05:00
Galantsev, Dmitrii 999cae5e2c SWDEV-466829 - Disable ROCP when in GTest
Change-Id: I3b218fe256717c1dc9187d5f17476dfc990656c2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c40a6308c5]
2024-09-26 17:00:05 -05:00
Bill(Shuzhou) Liu 6372df9447 Update the hsaco for diagnostic on MI300X
Add hsaco for gfx940, gfx941 and gfx942

Change-Id: Ibd55fcc2d036d1190357e1e86d4e170568426d94


[ROCm/rdc commit: 9800528c19]
2024-09-17 14:15:35 -05:00
Galantsev, Dmitrii b50c64b868 Use correct rocprofiler metrics
Change-Id: I26603de7425abb6588f770ed68c22e14d6d20d56
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d4bb33d100]
2024-06-11 11:15:18 -05:00
Galantsev, Dmitrii 73948f95e2 Rewrite rocprofiler plugin
Change-Id: Ic7dd967cc60cacd2b16a465180505ea2a342fccf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 3514225b83]
2024-06-11 03:11:15 -05:00
Galantsev, Dmitrii 29b86095ed Fix rocprofiler plugin
- Replace non-working fields with working ones
    - remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
    - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
    - ROCP_METRICS is only relevant for rocprofv1 (which we are using)

Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 20ca2ce574]
2024-06-06 01:51:39 -05:00
Galantsev, Dmitrii c2a75bbe4c Finalize the rocprofiler fields
Change-Id: I4ed1c4309f21bdcc7281d911663036caf5947182
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 07c414af5e]
2024-06-04 19:49:06 -05:00