Graf commitů

101 Commity

Autor SHA1 Zpráva Datum
zichguan-amd a5227aa61b Make ROCM_DIR default ROCm path for rocprof
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>


[ROCm/rdc commit: c042b4f582]
2024-12-02 11:19:56 -05:00
Chen Gong a8086b484d rocprofiler: add valu utilization
SWDEV-475242

For the description of "FP32 Engine Activity" and "FP64 Engine Activity" in dcgm,
It seems that we do not have an equivalent to these pipe-utilizations on our hardware.

In rocprofiler, I think VALU Utilization is the closest to what we want.

Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 251fcbe49d]
2024-11-21 15:24:01 +08:00
limeng12 71e2727a8f Backgroud health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112


[ROCm/rdc commit: 853d3b0cc5]
2024-11-19 14:00:49 +08:00
Galantsev, Dmitrii 8e657c165c RVS - Fix cookie_t -> rdc_diag_callback_t types issue
Issue introduced in ae9030ab1a

Change-Id: I2b6a8024d45fc44d92cf2770be9887dfc0fb3ede
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e1b57c43f3]
2024-11-12 10:36:52 -06:00
Galantsev, Dmitrii ae9030ab1a RVS - Report test progress in realtime
Change-Id: Id9fea71f242f372f408ecd777c030465b7ef9989
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 37ddd5bf50]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 73c79fcd83 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9c77312c51]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii b0035605ee CMAKE - Find modules at build time
Change-Id: I9370ef1433579aff1a37f3636050f525638d8658
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: cdf1588974]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 39687e8d96 CMAKE - Fix RVS include
Change-Id: I65095cc3d04fc2a5daeee5c809f635cb1662822f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

Revert "Disable RVS as the error scares people"

This reverts commit f3450f61bf.

Change-Id: I5086c25772444aa3bfc4c10abc1ea58d3f3f1f27
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: dd50027748]
2024-11-07 11:18:41 -06:00
Chao Fei d489245fbe Enable RDC policy feature
1. Add policy APIs
2. Add policy example for policy API usage

Change-Id: I14deb7c809d0b865b7bb083842092fc37868025e
Signed-off-by: Chao Fei <Chao.Fei@amd.com>


[ROCm/rdc commit: 345ac64a43]
2024-10-23 20:37:27 -04:00
Li Ma 7e3c4b9a21 SWDEV-475244 - Memory Usage and Bandwidth: memory activity
Implemented memory activity and added a new fied id
RDC_FI_GPU_MEMORY_ACTIVITY.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39


[ROCm/rdc commit: 4bd31b605a]
2024-10-22 11:11:31 +08:00
Li Ma 09c718954c SWDEV-475255 - MM Engine Decoding Throughput
Implemented DEC activity for now due to ENC activity is unavailable in
amdsmi.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f


[ROCm/rdc commit: b17abf93fa]
2024-10-18 10:01:41 +08:00
Galantsev, Dmitrii 793b2de0cb Profiler - Modify metrics
Remove occupancy metrics and replace with OccupancyPercent

Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE

Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 28acbf0436]
2024-10-15 19:00:30 -05:00
adapryor c283aebd1c Add XGMI read/write sum metrics
Change-Id: I898b779ea7f5336edf0d047fb1e5d3ec40085baa


[ROCm/rdc commit: e20bc58b1c]
2024-10-09 17:02:55 -05:00
Galantsev, Dmitrii 999cae5e2c SWDEV-466829 - Disable ROCP when in GTest
Change-Id: I3b218fe256717c1dc9187d5f17476dfc990656c2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c40a6308c5]
2024-09-26 17:00:05 -05:00
Bill(Shuzhou) Liu 6372df9447 Update the hsaco for diagonstic on MI300X
Add hsaco for gfx940, gfx941 and gfx942

Change-Id: Ibd55fcc2d036d1190357e1e86d4e170568426d94


[ROCm/rdc commit: 9800528c19]
2024-09-17 14:15:35 -05:00
Galantsev, Dmitrii f3450f61bf Disable RVS as the error scares people
Change-Id: I572fdb65dd8882ab4fdc1474cb39fc0e493b1eab
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 660c5afaf4]
2024-09-13 10:42:59 -05:00
Chen Gong cd98bb7f90 Implement the code related to the GetMixedComponentVersion()
Change-Id: I98aad97b4cb6498b7f2fc03a2d5ee7c9e949d5f1
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 1edd04d84e]
2024-09-10 10:06:44 -05:00
Chen Gong 891039280f Implement rdc_device_get_component_version API related code
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.

Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 5a3fd9fbc1]
2024-09-10 10:06:44 -05:00
Galantsev, Dmitrii cc3c3ce9b6 Add OAM_ID
Change-Id: I771b2f7f088940838c09ba3521a7955faa64e7ec
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bffe4e22fa]
2024-09-09 21:19:33 -05:00
Tom Rix 27f35431ea Fix build with BUILD_STANDALONE=OFF
When the rdc is built with this configure option
-DBUILD_STANDALONE=OFF

This error is caused
CMake Error at rdc_libs/CMakeLists.txt:106 (export):
  export given target "rdc_client" which is not built by this project.

Resolve this by using conditional

Change-Id: I3f6bb2946c609c7db9fc38015b7d9c8ae766f3a0
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 6762a6dd8b]
2024-07-08 12:49:09 -05:00
Galantsev, Dmitrii 9a2806ac95 SWDEV-452795 - Disable RAS plugin, fix XGMI
RAS plugin loaded rocm-smi which is in conflict with amd-smi library

Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi

We assume that rocm-smi would get lazy-loaded by RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains different number of elements, the enums are also different.
RDC relies on amd-smi's enums.

One such enum is kDevGpuMetrics:
  rocm-smi: kDevGpuMetrics = 68
  amd-smi:  kDevGpuMetrics = 75

Example of overlapping map definitions:

  $ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
  00000000003c4980 g    DO .data.rel.ro0000000000000008  Base        devInfoTypesStrings
  00000000003db830 g    DO .bss0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  $ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so  | grep devInfoTypesStrings
  00000000003dc590 g    DO .bss0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  00000000003c9c68 g    DO .data.rel.ro0000000000000008  Base        devInfoTypesStrings

Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d85657e5f2]
2024-06-26 03:42:07 -05:00
Galantsev, Dmitrii b50c64b868 Use correct rocprofiler metrics
Change-Id: I26603de7425abb6588f770ed68c22e14d6d20d56
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d4bb33d100]
2024-06-11 11:15:18 -05:00
Galantsev, Dmitrii 73948f95e2 Rewrite rocprofiler plugin
Change-Id: Ic7dd967cc60cacd2b16a465180505ea2a342fccf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 3514225b83]
2024-06-11 03:11:15 -05:00
Galantsev, Dmitrii 29b86095ed Fix rocprofiler plugin
- Replace non-working fields with working ones
    - remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
    - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
    - ROCP_METRICS is only relevant for rocprofv1 (which we are using)

Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 20ca2ce574]
2024-06-06 01:51:39 -05:00
Galantsev, Dmitrii c2a75bbe4c Finalize the rocprofiler fields
Change-Id: I4ed1c4309f21bdcc7281d911663036caf5947182
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 07c414af5e]
2024-06-04 19:49:06 -05:00
Galantsev, Dmitrii f73e123900 Add GPU indexing and fix check for fields in rocprof
- Fix RUNPATH for tests

Change-Id: I79517592b49d27080a010a2e41e5878adf24a157
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e11afbf60f]
2024-06-04 12:56:22 -05:00
Maisam Arif d9adf280cd Updated RDC to use AMD-SMI 24.6.0 structs
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I9ef0f3cb786c1238e53cf21df5c6afafac829175


[ROCm/rdc commit: 7c6bd4dc1c]
2024-05-31 10:37:39 -05:00
Galantsev, Dmitrii a80dfd4f00 Add memory bandwidth metrics
Change-Id: I310ca8af0536497be619d2bda1e540d1f11c2565
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 53033a5b77]
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii cff2ac8490 Add rocprofiler_example.cc and fix logging
Change-Id: Ib3ed8754f314edc76ea56bfec9a645d720f8926d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c7fcb1ad25]
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii 83cf97e280 Profiler - Add all required metrics
Change-Id: Iea3938df9407789c061c3a6ead9167a69069d6e6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c3a4c899d5]
2024-05-09 23:24:02 -05:00
Galantsev, Dmitrii 8b317a6490 Add rocprofiler plugin
Rename ROCR -> Runtime and ROCP -> Profiler

Change-Id: If90953da8fa5d695b681813dad4a3e7ec26a9c7e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 234b2d835b]
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 24f30a6ee3 Error if power metric inaccessible
Change-Id: I359c24f24d0200181646d5a7c13a6e0e4d4958b6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 1f5fa94132]
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 93b990ffa0 AMDSMI - Add ring hang event
Change-Id: I84696e3cc1a4eba8de48e464f1a208ed9c6e489d
Depends-On: I2e73ba08ee0004f6f30660b2fa425ea94bafceca
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 5525bf8c86]
2024-05-03 16:45:42 -05:00
Bill(Shuzhou) Liu 79897be094 Add new XGMI and PCIE bandwidth fields from gpu_metrics
For new ASIC, the RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
are not supported. New fileds RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used.

Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283


[ROCm/rdc commit: 61a75d346b]
2024-05-03 16:18:17 -04:00
Galantsev, Dmitrii 028355dff0 SWDEV-439576 - rocmsmi -> amdsmi
- Migrate to amdsmi library
- NOTE: raslib still uses rocmsmi
- Remove unused rocmsmi service
- Remove unused RDC client code
- Remove RSMI calls from protos/rdc.proto

Change-Id: Ifc34a264c506b0ec5792307ee56b34526268762d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9702d0f2d7]
2024-04-09 20:19:28 -05:00
Galantsev, Dmitrii c314326da0 Revert "Sort the ROCr gpu index based on BDF"
Fix 'rdcd diag' compute and system tests.
This reverts commit 4acaddc32d.

Change-Id: Ia092c46649c1d6338fb96ffe7e6feba4b045f027


[ROCm/rdc commit: 662cc0f8b2]
2024-04-09 10:27:19 -05:00
Galantsev, Dmitrii 53ecc0fc81 Remove -X from .hsaco files
Change-Id: I1f1b4f07eb854ce2e254564b83719be52b553b02
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9d55c26247]
2024-03-27 20:35:08 -05:00
Galantsev, Dmitrii dd257bfcac CMAKE - Find hsa-runtime64
Change-Id: Id877eb9cfcc61d81993a6a43703ef2e5f72e1e8f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 6d5d9971c2]
2024-02-19 23:49:38 -05:00
Galantsev, Dmitrii 3c18db8861 SWDEV-444700 - CMAKE - Fix RUNPATH
These RUNPATH changes make it so libraries can be found without setting
LD_LIBRARY_PATH.

Mostly tested on installed RDC binaries and libraries. The
build binaries should also work.

Change-Id: Ifd908a5b61d24dfcbb1d08d21b4ee830156d8643
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 32806681ca]
2024-02-13 16:56:28 -06:00
Galantsev, Dmitrii 185245cafa CMAKE: Reduce install messages size
Change-Id: I6fa7cfe986b1de702492a96bddbfd406501bba50
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: aa5448fc16]
2024-02-06 00:31:32 -06:00
Bill(Shuzhou) Liu d1efa59fe8 Fallback to junction temperature and socket power
If the card does not have edge temperature, fallback to junction
temperature. If the card only have socket power, then use socket
power instead.

Change-Id: I053a67a89cf3b29a34e82123f522c08d7dd68916


[ROCm/rdc commit: 5cfe2b4169]
2024-02-05 10:10:26 -06:00
Galantsev, Dmitrii 703d6c0d44 Use templates for module population
Also add stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.

Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: f9e80cc37a]
2024-01-10 00:27:09 -06:00
Galantsev, Dmitrii 38c60ff90b RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eaa1862a80]
2024-01-10 00:27:04 -06:00
Galantsev, Dmitrii ea624cbb7c LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 434e40305d]
2024-01-09 11:37:11 -06:00
Galantsev, Dmitrii 61cf14d7cc Simplify ModuleMgr
Change-Id: I3a57876c73e50771fcedb7ca4c67d55ac406b34d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 95e057c88d]
2024-01-09 11:37:11 -06:00
Bill(Shuzhou) Liu 4acaddc32d Sort the ROCr gpu index based on BDF
The rocm-smi index is changed to sort based on BDF. The rocr plugin
is also changed based on that.

Change-Id: I5851431db336d50266b253dec1894a7bd9f3554b


[ROCm/rdc commit: 61a2773875]
2023-11-16 09:07:22 -05:00
Galantsev, Dmitrii d4440d392e Upgrade to CXX-17 gtest-1.14
Change-Id: I1c7316f151128cbc9318b226dac14950e399d2c7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 8f9a6796f1]
2023-09-28 12:54:49 -05:00
Galantsev, Dmitrii a337dc062b SWDEV-392942 - Disable rocmtools
Temporarily disable rocmtools because of hsa_shut_down issues

Change-Id: I5e8b6729b8200ccdd5c399862bfc632ba69f884c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 90e824c63b]
2023-04-05 13:20:19 -05:00
Bill(Shuzhou) Liu df95a71a09 Rebuild rdc_ras library on Ubuntu 20.04
Rebuild rdc_ras library on Ubuntu 20.04 for backward compatibilities.
Fallback to rocm_smi for ECC errors if rdc_ras library not available.

Change-Id: I8db9687e3eb54a6f62fce2c8d57a796c6da6b5c4


[ROCm/rdc commit: 29551b1fd0]
2023-03-16 10:02:15 -04:00
Galantsev, Dmitrii c1a76d532a SWDEV-380364 - Resolve dmon + rocmtools halt
* Move hsa_init out of rocmtools and into RDC
* Remove secondary hsa_shut_down from ROCR module

Change-Id: I57d84d41ddc51595b98e734265f10bc5129a7352
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Depends-On: I2b389ee1a9ba3507b2df1fc2fe83598f67731aac


[ROCm/rdc commit: 24b3f138e9]
2023-02-02 18:33:14 -06:00