Граф коммитов

85 Коммитов

Автор SHA1 Сообщение Дата
Pryor, Adam 6f358ddc9e SWDEV-508477 Eval Flops Percent (#85)
SWDEV-508477 - Profiler add FP*_PERCENT

Change-Id: Idb6250fe6b7ba3df6fe7d30861e0fbbda7e9bdce

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-01-24 10:07:32 -06:00
adapryor 290b90dc89 Implementation for RDC_FI_PROF_OCCUPANCY_PER_ACTIVE_CU SWDEV-50895
Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I8da7d9846edabe5629c75f50cd2bb4b23e019a17
Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-01-21 21:49:19 -06:00
stali b427c07ffe fixed rdc link state print issue 2025-01-22 09:05:49 +08:00
limeng12 016a1d9d39 [SWDEV-230863] Improve the functionality of RdcSmiHealth module.
Memory check:get the threshold of retired page number
EEPROM check:read and verify the checksum
Power/Thermal check: power/thermal throttle status counter

Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e
2025-01-14 08:14:36 +08:00
Galantsev, Dmitrii 83f36f1673 Include assert.h during C compilation (#4)
Fix for https://github.com/ROCm/ROCm/issues/3997. When compiling a C program that includes rdc/rdc.h, multiple assertion errors are thrown without this header included.

Change-Id: Ie5b5c1a1a17c8207cf9b1be23b31193e260d5c1a

Co-authored-by: harkgill-amd <harkgill@amd.com>
2025-01-10 11:29:15 -05:00
Galantsev, Dmitrii 5861ec7663 RVS - Add IET and PEBB tests
Change-Id: Ia032901d74c882e5cbfa5a3164199cd4d571341f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-01-08 18:23:13 -06:00
Galantsev, Dmitrii b058cbecf1 RVS - Add memory bandwidth test
Change-Id: I4c8990170861f6a0f3853615db68634fdaa7a622
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-01-08 18:23:13 -06:00
Li, Star bd7d7c99c1 Fix unit issue in policy feature (#78)
1. For temperature the unit in milli Celsius
2. For power the unit in microwatts.
3. Fix second register call to rdcd doesn't functional because start flag

Co-authored-by: Chao Fei <chao.fei@amd.com>
2025-01-06 09:21:08 +08:00
Pryor, Adam 60b7359161 Implementation for adding pcie_total (#40)
* Implementation for adding pcie_total

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I4b0cfd7095e9d984e939283ee7169d01f55a1847
Signed-off-by: adapryor <Adam.pryor@amd.com>

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I021f29083de651cab9fbe7db98acbe20f65948d4

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I42f3207b745fa787dabe30a85c8e063159d1337d

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-12-26 18:36:41 -06:00
Ma, Li 772481f952 SWDEV-475244 - Memory Usage and Bandwidth: max mem and current mem (#48) 2024-12-23 10:22:53 +08:00
stali 29b6699b62 Enable RDC link Status feature
1.add link status APIs
   2.Add link status example for link status API usage
2024-12-23 09:30:21 +08:00
Greg Scaffidi f4de4b0529 Add RDC_FI_PROF_SM_ACTIVE metric.
Signed-off-by: Greg Scaffidi <salvatore.scaffidi@amd.com>
Change-Id: I63aaf5eb05d74ba696ace2b088e17c2cfb1bd74b
Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-12-21 15:21:46 -06:00
Adam Pryor df170c8801 Implementation for SWDEV-479728:[RDC] - Clock Speed/Power Cap Control
Change-Id: I767a71325527aa3c691e9607953ceafebacfb4d5
Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-12-20 16:03:33 -06:00
Galantsev, Dmitrii 7c91a07a43 Profiler - Migrate from rocprofv1 to rocprofv3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

Fixed RDC for Rocprofv3

Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: Ic9162bacf1322b265e6bbcdd9fbb9b1fdef414fd

last updates

Change-Id: I12e168501327c5e4cff8a9273b0512fb0e098fe7

comment

Change-Id: I61da61e66dcc017ec46f98ff4c90fb064c9679e8
2024-12-20 15:39:02 -06:00
stali 8bcb5f7068 Enable RDC topology feature
1.Add topology APIs
2.Add topology example for topology API usage

Change-Id: Ib79c06d0bac85119672f194ba685ebf25029979c
2024-12-16 10:02:41 +08:00
Li Ma 30f9b2ac2f SWDEV-475244 - Memory Usage and Bandwidth: max mem and current mem
Implemented max memory bandwith and current memory bandwidth. Added two
new field ids: RDC_FI_GPU_MEMORY_MAX_BANDWIDTH, RDC_FI_GPU_MEMORY_CUR_BANDWIDTH

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I453e49937a84777146575f4f5bdd69fd4fe53bfc
2024-12-16 09:43:20 +08:00
Galantsev, Dmitrii 2c61dfe2ce Profiler - Remove averaging
Averaging happens very slowly and only confuses people...

Change-Id: I60754d3b896b6ffeb6104bb1c2fcc54e9869b331
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-12-11 11:58:50 -06:00
Chen Gong 251fcbe49d rocprofiler: add valu utilization
SWDEV-475242

For the description of "FP32 Engine Activity" and "FP64 Engine Activity" in dcgm,
It seems that we do not have an equivalent to these pipe-utilizations on our hardware.

In rocprofiler, I think VALU Utilization is the closest to what we want.

Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>
2024-11-21 15:24:01 +08:00
limeng12 853d3b0cc5 Backgroud health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
2024-11-19 14:00:49 +08:00
Bill(Shuzhou) Liu 5e3ebecf80 Correct RDC_FI_PCIE_BANDWIDTH unit
The unit should be mbps instead of GB/second
2024-11-13 09:45:46 -05:00
stali d8fec06bab Enable RDCI policy subsystem
- Enable set and get for policy settings
- Enable register and clear policy events

Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
2024-11-12 20:40:08 -06:00
Galantsev, Dmitrii e1b57c43f3 RVS - Fix cookie_t -> rdc_diag_callback_t types issue
Issue introduced in 37ddd5bf50

Change-Id: I2b6a8024d45fc44d92cf2770be9887dfc0fb3ede
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-12 10:36:52 -06:00
Galantsev, Dmitrii 37ddd5bf50 RVS - Report test progress in realtime
Change-Id: Id9fea71f242f372f408ecd777c030465b7ef9989
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 9c77312c51 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-07 11:21:22 -06:00
Chao Fei 345ac64a43 Enable RDC policy feature
1. Add policy APIs
2. Add policy example for policy API usage

Change-Id: I14deb7c809d0b865b7bb083842092fc37868025e
Signed-off-by: Chao Fei <Chao.Fei@amd.com>
2024-10-23 20:37:27 -04:00
Li Ma 4bd31b605a SWDEV-475244 - Memory Usage and Bandwidth: memory activity
Implemented memory activity and added a new fied id
RDC_FI_GPU_MEMORY_ACTIVITY.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39
2024-10-22 11:11:31 +08:00
Li Ma b17abf93fa SWDEV-475255 - MM Engine Decoding Throughput
Implemented DEC activity for now due to ENC activity is unavailable in
amdsmi.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f
2024-10-18 10:01:41 +08:00
Galantsev, Dmitrii 28acbf0436 Profiler - Modify metrics
Remove occupancy metrics and replace with OccupancyPercent

Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE

Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-10-15 19:00:30 -05:00
adapryor e20bc58b1c Add XGMI read/write sum metrics
Change-Id: I898b779ea7f5336edf0d047fb1e5d3ec40085baa
2024-10-09 17:02:55 -05:00
Galantsev, Dmitrii c40a6308c5 SWDEV-466829 - Disable ROCP when in GTest
Change-Id: I3b218fe256717c1dc9187d5f17476dfc990656c2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-26 17:00:05 -05:00
Galantsev, Dmitrii d4a868cb69 Increase MAX_NUM_DEVICES limit
Change-Id: I0cf21be156649818fd05a66928054710322b23ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-25 20:58:19 -05:00
Chen Gong 1edd04d84e Implement the code related to the GetMixedComponentVersion()
Change-Id: I98aad97b4cb6498b7f2fc03a2d5ee7c9e949d5f1
Signed-off-by: Chen Gong <curry.gong@amd.com>
2024-09-10 10:06:44 -05:00
Chen Gong 5a3fd9fbc1 Implement rdc_device_get_component_version API related code
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.

Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
2024-09-10 10:06:44 -05:00
Galantsev, Dmitrii bffe4e22fa Add OAM_ID
Change-Id: I771b2f7f088940838c09ba3521a7955faa64e7ec
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-09 21:19:33 -05:00
Galantsev, Dmitrii d85657e5f2 SWDEV-452795 - Disable RAS plugin, fix XGMI
RAS plugin loaded rocm-smi which is in conflict with amd-smi library

Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi

We assume that rocm-smi would get lazy-loaded by RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains different number of elements, the enums are also different.
RDC relies on amd-smi's enums.

One such enum is kDevGpuMetrics:
  rocm-smi: kDevGpuMetrics = 68
  amd-smi:  kDevGpuMetrics = 75

Example of overlapping map definitions:

  $ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
  00000000003c4980 g    DO .data.rel.ro0000000000000008  Base        devInfoTypesStrings
  00000000003db830 g    DO .bss0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  $ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so  | grep devInfoTypesStrings
  00000000003dc590 g    DO .bss0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  00000000003c9c68 g    DO .data.rel.ro0000000000000008  Base        devInfoTypesStrings

Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-26 03:42:07 -05:00
Galantsev, Dmitrii 3514225b83 Rewrite rocprofiler plugin
Change-Id: Ic7dd967cc60cacd2b16a465180505ea2a342fccf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-11 03:11:15 -05:00
Galantsev, Dmitrii 20ca2ce574 Fix rocprofiler plugin
- Replace non-working fields with working ones
    - remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
    - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
    - ROCP_METRICS is only relevant for rocprofv1 (which we are using)

Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-06 01:51:39 -05:00
Galantsev, Dmitrii 07c414af5e Finalize the rocprofiler fields
Change-Id: I4ed1c4309f21bdcc7281d911663036caf5947182
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-04 19:49:06 -05:00
Galantsev, Dmitrii e11afbf60f Add GPU indexing and fix check for fields in rocprof
- Fix RUNPATH for tests

Change-Id: I79517592b49d27080a010a2e41e5878adf24a157
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-04 12:56:22 -05:00
Galantsev, Dmitrii 53033a5b77 Add memory bandwidth metrics
Change-Id: I310ca8af0536497be619d2bda1e540d1f11c2565
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii c3a4c899d5 Profiler - Add all required metrics
Change-Id: Iea3938df9407789c061c3a6ead9167a69069d6e6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-09 23:24:02 -05:00
Galantsev, Dmitrii 234b2d835b Add rocprofiler plugin
Rename ROCR -> Runtime and ROCP -> Profiler

Change-Id: If90953da8fa5d695b681813dad4a3e7ec26a9c7e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 5525bf8c86 AMDSMI - Add ring hang event
Change-Id: I84696e3cc1a4eba8de48e464f1a208ed9c6e489d
Depends-On: I2e73ba08ee0004f6f30660b2fa425ea94bafceca
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-03 16:45:42 -05:00
Bill(Shuzhou) Liu 61a75d346b Add new XGMI and PCIE bandwidth fields from gpu_metrics
For new ASIC, the RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
are not supported. New fileds RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used.

Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283
2024-05-03 16:18:17 -04:00
Galantsev, Dmitrii 9702d0f2d7 SWDEV-439576 - rocmsmi -> amdsmi
- Migrate to amdsmi library
- NOTE: raslib still uses rocmsmi
- Remove unused rocmsmi service
- Remove unused RDC client code
- Remove RSMI calls from protos/rdc.proto

Change-Id: Ifc34a264c506b0ec5792307ee56b34526268762d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-04-09 20:19:28 -05:00
Galantsev, Dmitrii 81e3a78b1f Remove unsupported rocprofiler metrics
Change-Id: If6cfbcbe018227c591733471ab203fc6675d50af
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-09 15:18:54 -06:00
Galantsev, Dmitrii f9e80cc37a Use templates for module population
Also add stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.

Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-10 00:27:09 -06:00
Galantsev, Dmitrii eaa1862a80 RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-10 00:27:04 -06:00
Galantsev, Dmitrii 434e40305d LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-09 11:37:11 -06:00
Galantsev, Dmitrii 95e057c88d Simplify ModuleMgr
Change-Id: I3a57876c73e50771fcedb7ca4c67d55ac406b34d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-09 11:37:11 -06:00