Also force unset HSA_TOOLS_LIB so it doesn't break rocprofiler-sdk
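A minimal sketch of force-unsetting the variable, for illustration only.
This is not the code from this commit; the helper name
unset_hsa_tools_lib is hypothetical, and the sketch only assumes that
removing HSA_TOOLS_LIB from the process environment is what is wanted.

```python
import os


def unset_hsa_tools_lib(env=None):
    """Remove HSA_TOOLS_LIB from an environment mapping if it is set.

    Hypothetical helper: 'env' defaults to the real process environment.
    Returns the previous value, or None if the variable was not set.
    """
    env = os.environ if env is None else env
    # pop() with a default never raises, so this is safe to call repeatedly.
    return env.pop("HSA_TOOLS_LIB", None)
```

Calling it a second time is a no-op, so it can run unconditionally
before the profiler is initialized.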
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: e73eaf8115]
NOTE: The GPU ordering used here is not the same as in HSA/HIP.
GPUs are ordered via amdsmi, and the GPU_ID fields are then compared to
map GPU partitions to each other.
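The mapping step described above can be sketched as follows. This is an
assumption-laden illustration, not the RDC implementation: the device
records are plain dicts, the "gpu_id" key and the function name
map_by_gpu_id are hypothetical, and the second list stands in for
whatever other enumeration the partitions are matched against.

```python
def map_by_gpu_id(smi_devices, other_devices):
    """Map indices in the amdsmi-ordered list to indices in another list.

    Two entries are treated as the same device when their 'gpu_id'
    fields are equal. Entries with no match are left out of the result.
    """
    # Index the second list by gpu_id for constant-time lookup.
    other_index_by_id = {
        dev["gpu_id"]: idx for idx, dev in enumerate(other_devices)
    }
    mapping = {}
    for smi_idx, dev in enumerate(smi_devices):
        other_idx = other_index_by_id.get(dev["gpu_id"])
        if other_idx is not None:
            mapping[smi_idx] = other_idx
    return mapping
```

Because the result is keyed by position in the amdsmi-ordered list, the
two enumerations may list the same devices in different orders.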
Change-Id: If379214f5281d7d5ee98515b3e5ba7affc2e2197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 85b619b2f0]
AMDSMI needs to be merged first, with its version bumped to at least 24.4.2
Change-Id: I30149bb78c79ebc3de0dabdc8e63fcef12b2f406
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: a5cb334f8b]
The 'value' pointer was being written to many times and then read from
within the same function. This likely caused issues all over RDC when
reading the metrics.
This commit changes it so *value is written to only once.
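The write-once pattern can be sketched as follows. This is a Python
stand-in for illustration, not the code changed by this commit: a
one-element list plays the role of the C output pointer, and the
function name store_average is hypothetical.

```python
def store_average(samples, out):
    """Compute the average of 'samples' and store it in out[0].

    All intermediate state lives in local variables; the output slot is
    written exactly once, at the end, and is never read back.
    """
    total = 0
    for sample in samples:
        total += sample  # accumulate locally, not through 'out'
    average = total / len(samples) if samples else 0.0
    out[0] = average  # the only write to the output slot
```

Keeping intermediate results in locals means a caller never observes a
partially computed value in the output slot.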
Change-Id: I83c158c1e46c6ce46ff87d8a2e769f26ffa8c0da
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 91be467cad]
* Implement CPU discovery support
SWDEV-482949:
Enable CPU model name support in RDC so that the rdci command
can detect GPU and CPU modules at the same time.
The CPU info is queried through the amdsmi interface, as shown below:
1 GPUs found.
-----------------------------------------------------------------
GPU Index Device Information
0 AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index Device Information
0 AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------
Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
* CMAKE - Add required version for amdsmi
Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
---------
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 3bdca8b8b6]
Memory check: get the threshold for the number of retired pages
EEPROM check: read and verify the checksum
Power/Thermal check: power/thermal throttle status counter
Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e
[ROCm/rdc commit: 016a1d9d39]
Fix for https://github.com/ROCm/ROCm/issues/3997. When compiling a C program that includes rdc/rdc.h, multiple assertion errors are thrown without this header included.
Change-Id: Ie5b5c1a1a17c8207cf9b1be23b31193e260d5c1a
Co-authored-by: harkgill-amd <harkgill@amd.com>
[ROCm/rdc commit: 83f36f1673]
1. For temperature, the unit is millidegrees Celsius.
2. For power, the unit is microwatts.
3. Fix the second register call to rdcd not working because of the start flag.
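For readers consuming these values, a minimal conversion sketch,
assuming raw temperature readings in millidegrees Celsius and raw power
readings in microwatts as stated above. The function names are
hypothetical and are not part of the RDC API.

```python
def millicelsius_to_celsius(raw):
    """Convert a temperature reading from millidegrees Celsius to degrees."""
    return raw / 1000.0


def microwatts_to_watts(raw):
    """Convert a power reading from microwatts to watts."""
    return raw / 1_000_000.0
```

For example, a raw temperature of 45000 corresponds to 45 degrees
Celsius, and a raw power value of 150000000 corresponds to 150 W.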
Co-authored-by: Chao Fei <chao.fei@amd.com>
[ROCm/rdc commit: bd7d7c99c1]
Implemented max memory bandwidth and current memory bandwidth. Added two
new field IDs: RDC_FI_GPU_MEMORY_MAX_BANDWIDTH, RDC_FI_GPU_MEMORY_CUR_BANDWIDTH
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I453e49937a84777146575f4f5bdd69fd4fe53bfc
[ROCm/rdc commit: 30f9b2ac2f]
SWDEV-475242
For the descriptions of "FP32 Engine Activity" and "FP64 Engine Activity" in DCGM,
it seems that we do not have an equivalent to these pipe utilizations on our hardware.
In rocprofiler, I think VALU Utilization is the closest to what we want.
Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 251fcbe49d]
Add the RdcSmiHealth module, which calls rocm_smi_lib.
It will support the following health checks:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The gRPC client-side and server-side health functions are added.
The health module is added to rdci.
At present, the XGMI and PCIE checks and part of the Memory check have been implemented.
The others will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
[ROCm/rdc commit: 853d3b0cc5]
- Enable set and get for policy settings
- Enable register and clear policy events
Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
[ROCm/rdc commit: d8fec06bab]