Граф коммитов

80 Коммитов

Автор SHA1 Сообщение Дата
Galantsev, Dmitrii 02c0786a2c Profiler - Add SIMD_UTILIZATION (#171)
Change-Id: I19d5acd80dbed8c4fc4e1c85eec71ca89398d299

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-05-06 13:20:03 -07:00
Pryor, Adam 2db6ddea69 [SWDEV-523349/SWDEV-527257] Fix Rdci Config (#161)
Change-Id: Iae21ea8061205f186086a3ed59c6259ddeb1dbe7

Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-04-28 11:57:51 -05:00
Galantsev, Dmitrii a5cb334f8b Add RDC_FI_GPU_BUSY_PERCENT
AMDSMI needs to merge first and bump the version to at least 24.4.2

Change-Id: I30149bb78c79ebc3de0dabdc8e63fcef12b2f406
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-04-15 17:00:56 -05:00
Galantsev, Dmitrii 24024f0e4f Revert "Implement CPU discovery support"
This reverts commit f967f8a17d15e148464393fcd145af01dc0e1525.
2025-04-07 20:45:19 -05:00
Yuan, Perry 3bdca8b8b6 Implement CPU discovery support (#77)
* Implement CPU discovery support

SWDEV-482949:

enable the CPU model name info support to the RDC, rdci command
can detect GPU and CPU modules at the same time.
It will query the CPU info through the amdsmi interface like below:

1 GPUs found.
-----------------------------------------------------------------
GPU Index        Device Information
0               AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index        Device Information
0               AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------

Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>

* CMAKE - Add required version for amdsmi

Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-03-31 10:58:36 +08:00
Galantsev, Dmitrii bdb2367010 RVS - Add long-running tests
Change-Id: Iddeb7f2d4fdcd69d7ac1ae94b2fa128ee3011b1a
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-03-27 23:42:56 -05:00
Galantsev, Dmitrii 51de344be7 Profiler - Add CPC and CPF metrics
Change-Id: I27fd725e9e1868c9afe7624d6e4aafad2a42d47e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-03-27 19:01:23 -05:00
Pryor, Adam 47692d3ed5 [SWDEV-498711] RDC Partition Implementation (#119)
* [SWDEV-498711] RDC Partition Implementation

Change-Id: Ibfc3709793770537e4c9d36458f34c6b4f461724
Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-03-27 14:10:11 -05:00
Galantsev, Dmitrii f5a4402ce5 RVS - Use config files and make GPU aware
Change-Id: I7a5c80ed4e6122d102e494d1ae38b4b7d40c42cd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-03-11 15:39:16 -05:00
Pryor, Adam 6f358ddc9e SWDEV-508477 Eval Flops Percent (#85)
SWDEV-508477 - Profiler add FP*_PERCENT

Change-Id: Idb6250fe6b7ba3df6fe7d30861e0fbbda7e9bdce

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-01-24 10:07:32 -06:00
adapryor 290b90dc89 Implementation for RDC_FI_PROF_OCCUPANCY_PER_ACTIVE_CU SWDEV-50895
Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I8da7d9846edabe5629c75f50cd2bb4b23e019a17
Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-01-21 21:49:19 -06:00
stali b427c07ffe fixed rdc link state print issue 2025-01-22 09:05:49 +08:00
limeng12 016a1d9d39 [SWDEV-230863] Improve the functionality of RdcSmiHealth module.
Memory check:get the threshold of retired page number
EEPROM check:read and verify the checksum
Power/Thermal check: power/thermal throttle status counter

Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e
2025-01-14 08:14:36 +08:00
Galantsev, Dmitrii 83f36f1673 Include assert.h during C compilation (#4)
Fix for https://github.com/ROCm/ROCm/issues/3997. When compiling a C program that includes rdc/rdc.h, multiple assertion errors are thrown without this header included.

Change-Id: Ie5b5c1a1a17c8207cf9b1be23b31193e260d5c1a

Co-authored-by: harkgill-amd <harkgill@amd.com>
2025-01-10 11:29:15 -05:00
Galantsev, Dmitrii 5861ec7663 RVS - Add IET and PEBB tests
Change-Id: Ia032901d74c882e5cbfa5a3164199cd4d571341f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-01-08 18:23:13 -06:00
Galantsev, Dmitrii b058cbecf1 RVS - Add memory bandwidth test
Change-Id: I4c8990170861f6a0f3853615db68634fdaa7a622
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-01-08 18:23:13 -06:00
Li, Star bd7d7c99c1 Fix unit issue in policy feature (#78)
1. For temperature the unit in milli Celsius
2. For power the unit in microwatts.
3. Fix second register call to rdcd doesn't functional because start flag

Co-authored-by: Chao Fei <chao.fei@amd.com>
2025-01-06 09:21:08 +08:00
Pryor, Adam 60b7359161 Implementation for adding pcie_total (#40)
* Implementation for adding pcie_total

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I4b0cfd7095e9d984e939283ee7169d01f55a1847
Signed-off-by: adapryor <Adam.pryor@amd.com>

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I021f29083de651cab9fbe7db98acbe20f65948d4

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I42f3207b745fa787dabe30a85c8e063159d1337d

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-12-26 18:36:41 -06:00
Ma, Li 772481f952 SWDEV-475244 - Memory Usage and Bandwidth: max mem and current mem (#48) 2024-12-23 10:22:53 +08:00
stali 29b6699b62 Enable RDC link Status feature
1.add link status APIs
   2.Add link status example for link status API usage
2024-12-23 09:30:21 +08:00
Greg Scaffidi f4de4b0529 Add RDC_FI_PROF_SM_ACTIVE metric.
Signed-off-by: Greg Scaffidi <salvatore.scaffidi@amd.com>
Change-Id: I63aaf5eb05d74ba696ace2b088e17c2cfb1bd74b
Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-12-21 15:21:46 -06:00
Adam Pryor df170c8801 Implementation for SWDEV-479728:[RDC] - Clock Speed/Power Cap Control
Change-Id: I767a71325527aa3c691e9607953ceafebacfb4d5
Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-12-20 16:03:33 -06:00
stali 8bcb5f7068 Enable RDC topology feature
1.Add topology APIs
2.Add topology example for topology API usage

Change-Id: Ib79c06d0bac85119672f194ba685ebf25029979c
2024-12-16 10:02:41 +08:00
Li Ma 30f9b2ac2f SWDEV-475244 - Memory Usage and Bandwidth: max mem and current mem
Implemented max memory bandwith and current memory bandwidth. Added two
new field ids: RDC_FI_GPU_MEMORY_MAX_BANDWIDTH, RDC_FI_GPU_MEMORY_CUR_BANDWIDTH

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I453e49937a84777146575f4f5bdd69fd4fe53bfc
2024-12-16 09:43:20 +08:00
Chen Gong 251fcbe49d rocprofiler: add valu utilization
SWDEV-475242

For the description of "FP32 Engine Activity" and "FP64 Engine Activity" in dcgm,
It seems that we do not have an equivalent to these pipe-utilizations on our hardware.

In rocprofiler, I think VALU Utilization is the closest to what we want.

Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>
2024-11-21 15:24:01 +08:00
limeng12 853d3b0cc5 Backgroud health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
2024-11-19 14:00:49 +08:00
Bill(Shuzhou) Liu 5e3ebecf80 Correct RDC_FI_PCIE_BANDWIDTH unit
The unit should be mbps instead of GB/second
2024-11-13 09:45:46 -05:00
stali d8fec06bab Enable RDCI policy subsystem
- Enable set and get for policy settings
- Enable register and clear policy events

Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
2024-11-12 20:40:08 -06:00
Galantsev, Dmitrii 9c77312c51 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-07 11:21:22 -06:00
Chao Fei 345ac64a43 Enable RDC policy feature
1. Add policy APIs
2. Add policy example for policy API usage

Change-Id: I14deb7c809d0b865b7bb083842092fc37868025e
Signed-off-by: Chao Fei <Chao.Fei@amd.com>
2024-10-23 20:37:27 -04:00
Li Ma 4bd31b605a SWDEV-475244 - Memory Usage and Bandwidth: memory activity
Implemented memory activity and added a new fied id
RDC_FI_GPU_MEMORY_ACTIVITY.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39
2024-10-22 11:11:31 +08:00
Li Ma b17abf93fa SWDEV-475255 - MM Engine Decoding Throughput
Implemented DEC activity for now due to ENC activity is unavailable in
amdsmi.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f
2024-10-18 10:01:41 +08:00
Galantsev, Dmitrii 28acbf0436 Profiler - Modify metrics
Remove occupancy metrics and replace with OccupancyPercent

Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE

Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-10-15 19:00:30 -05:00
adapryor e20bc58b1c Add XGMI read/write sum metrics
Change-Id: I898b779ea7f5336edf0d047fb1e5d3ec40085baa
2024-10-09 17:02:55 -05:00
Galantsev, Dmitrii c40a6308c5 SWDEV-466829 - Disable ROCP when in GTest
Change-Id: I3b218fe256717c1dc9187d5f17476dfc990656c2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-26 17:00:05 -05:00
Galantsev, Dmitrii d4a868cb69 Increase MAX_NUM_DEVICES limit
Change-Id: I0cf21be156649818fd05a66928054710322b23ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-25 20:58:19 -05:00
Chen Gong 1edd04d84e Implement the code related to the GetMixedComponentVersion()
Change-Id: I98aad97b4cb6498b7f2fc03a2d5ee7c9e949d5f1
Signed-off-by: Chen Gong <curry.gong@amd.com>
2024-09-10 10:06:44 -05:00
Chen Gong 5a3fd9fbc1 Implement rdc_device_get_component_version API related code
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.

Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
2024-09-10 10:06:44 -05:00
Galantsev, Dmitrii bffe4e22fa Add OAM_ID
Change-Id: I771b2f7f088940838c09ba3521a7955faa64e7ec
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-09 21:19:33 -05:00
Galantsev, Dmitrii d85657e5f2 SWDEV-452795 - Disable RAS plugin, fix XGMI
RAS plugin loaded rocm-smi which is in conflict with amd-smi library

Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi

We assume that rocm-smi would get lazy-loaded by RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains different number of elements, the enums are also different.
RDC relies on amd-smi's enums.

One such enum is kDevGpuMetrics:
  rocm-smi: kDevGpuMetrics = 68
  amd-smi:  kDevGpuMetrics = 75

Example of overlapping map definitions:

  $ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
  00000000003c4980 g    DO .data.rel.ro0000000000000008  Base        devInfoTypesStrings
  00000000003db830 g    DO .bss0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  $ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so  | grep devInfoTypesStrings
  00000000003dc590 g    DO .bss0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  00000000003c9c68 g    DO .data.rel.ro0000000000000008  Base        devInfoTypesStrings

Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-26 03:42:07 -05:00
Galantsev, Dmitrii 20ca2ce574 Fix rocprofiler plugin
- Replace non-working fields with working ones
    - remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
    - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
    - ROCP_METRICS is only relevant for rocprofv1 (which we are using)

Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-06 01:51:39 -05:00
Galantsev, Dmitrii 07c414af5e Finalize the rocprofiler fields
Change-Id: I4ed1c4309f21bdcc7281d911663036caf5947182
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-04 19:49:06 -05:00
Galantsev, Dmitrii e11afbf60f Add GPU indexing and fix check for fields in rocprof
- Fix RUNPATH for tests

Change-Id: I79517592b49d27080a010a2e41e5878adf24a157
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-04 12:56:22 -05:00
Galantsev, Dmitrii 53033a5b77 Add memory bandwidth metrics
Change-Id: I310ca8af0536497be619d2bda1e540d1f11c2565
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii c3a4c899d5 Profiler - Add all required metrics
Change-Id: Iea3938df9407789c061c3a6ead9167a69069d6e6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-09 23:24:02 -05:00
Galantsev, Dmitrii 234b2d835b Add rocprofiler plugin
Rename ROCR -> Runtime and ROCP -> Profiler

Change-Id: If90953da8fa5d695b681813dad4a3e7ec26a9c7e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 5525bf8c86 AMDSMI - Add ring hang event
Change-Id: I84696e3cc1a4eba8de48e464f1a208ed9c6e489d
Depends-On: I2e73ba08ee0004f6f30660b2fa425ea94bafceca
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-03 16:45:42 -05:00
Bill(Shuzhou) Liu 61a75d346b Add new XGMI and PCIE bandwidth fields from gpu_metrics
For new ASIC, the RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
are not supported. New fileds RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used.

Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283
2024-05-03 16:18:17 -04:00
Galantsev, Dmitrii 9702d0f2d7 SWDEV-439576 - rocmsmi -> amdsmi
- Migrate to amdsmi library
- NOTE: raslib still uses rocmsmi
- Remove unused rocmsmi service
- Remove unused RDC client code
- Remove RSMI calls from protos/rdc.proto

Change-Id: Ifc34a264c506b0ec5792307ee56b34526268762d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-04-09 20:19:28 -05:00
Galantsev, Dmitrii 81e3a78b1f Remove unsupported rocprofiler metrics
Change-Id: If6cfbcbe018227c591733471ab203fc6675d50af
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-09 15:18:54 -06:00