390 commits

Author SHA1 Message Date
limeng12 71e2727a8f Background health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support the following health checks:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The gRPC client-side and server-side health functions are added.
The health module is added to rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112


[ROCm/rdc commit: 853d3b0cc5]
2024-11-19 14:00:49 +08:00
Galantsev, Dmitrii 39758d913c Update changelog for 6.3
Change-Id: I1b2d26f1e6c7963052fb36fd6c40e3d10c22082d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Rawat, Swati <Swati.Rawat@amd.com>


[ROCm/rdc commit: f1428a8226]
2024-11-15 14:10:11 -06:00
Bill(Shuzhou) Liu b813ae3426 Correct RDC_FI_PCIE_BANDWIDTH unit
The unit should be mbps instead of GB/second


[ROCm/rdc commit: 5e3ebecf80]
2024-11-13 09:45:46 -05:00
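The unit correction in commit b813ae3426 matters because the two readings differ by several orders of magnitude. The following is a minimal sketch of the conversion, assuming "mbps" means megabits per second with decimal (SI) prefixes; the commit does not define the abbreviation, and the function name is hypothetical and not part of RDC.

```python
def mbps_to_gigabytes_per_second(mbps: float) -> float:
    """Convert megabits per second to gigabytes per second.

    Assumes decimal prefixes: 1 Mb = 1_000_000 bits, 1 GB = 1_000_000_000 bytes.
    """
    bits_per_second = mbps * 1_000_000
    bytes_per_second = bits_per_second / 8
    return bytes_per_second / 1_000_000_000


if __name__ == "__main__":
    # A reading of 8000 in Mbps corresponds to 1 GB/s under these assumptions,
    # so labelling the same number as GB/second overstates it by a factor of 8000.
    print(mbps_to_gigabytes_per_second(8000))
```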
stali f34e245ba1 Enable RDCI policy subsystem
- Enable set and get for policy settings
- Enable register and clear policy events

Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>


[ROCm/rdc commit: d8fec06bab]
2024-11-12 20:40:08 -06:00
Galantsev, Dmitrii 8e657c165c RVS - Fix cookie_t -> rdc_diag_callback_t types issue
Issue introduced in ae9030ab1a

Change-Id: I2b6a8024d45fc44d92cf2770be9887dfc0fb3ede
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e1b57c43f3]
2024-11-12 10:36:52 -06:00
Galantsev, Dmitrii efd58742db AMDSMI - Fix kRasErrStateStrings in tests
Change-Id: Ia9498fae215397baf7201715574954313c17da93
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 4f7e441566]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii ae9030ab1a RVS - Report test progress in realtime
Change-Id: Id9fea71f242f372f408ecd777c030465b7ef9989
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 37ddd5bf50]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 73c79fcd83 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9c77312c51]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii b0035605ee CMAKE - Find modules at build time
Change-Id: I9370ef1433579aff1a37f3636050f525638d8658
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: cdf1588974]
2024-11-07 11:21:22 -06:00
Galantsev, Dmitrii 39687e8d96 CMAKE - Fix RVS include
Change-Id: I65095cc3d04fc2a5daeee5c809f635cb1662822f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

Revert "Disable RVS as the error scares people"

This reverts commit f3450f61bf.

Change-Id: I5086c25772444aa3bfc4c10abc1ea58d3f3f1f27
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: dd50027748]
2024-11-07 11:18:41 -06:00
Chao Fei d489245fbe Enable RDC policy feature
1. Add policy APIs
2. Add policy example for policy API usage

Change-Id: I14deb7c809d0b865b7bb083842092fc37868025e
Signed-off-by: Chao Fei <Chao.Fei@amd.com>


[ROCm/rdc commit: 345ac64a43]
2024-10-23 20:37:27 -04:00
Li Ma 7e3c4b9a21 SWDEV-475244 - Memory Usage and Bandwidth: memory activity
Implemented memory activity and added a new field ID
RDC_FI_GPU_MEMORY_ACTIVITY.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39


[ROCm/rdc commit: 4bd31b605a]
2024-10-22 11:11:31 +08:00
Li Ma 09c718954c SWDEV-475255 - MM Engine Decoding Throughput
Implemented only DEC activity for now because ENC activity is unavailable in
amdsmi.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f


[ROCm/rdc commit: b17abf93fa]
2024-10-18 10:01:41 +08:00
Galantsev, Dmitrii 793b2de0cb Profiler - Modify metrics
Remove occupancy metrics and replace with OccupancyPercent

Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE

Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 28acbf0436]
2024-10-15 19:00:30 -05:00
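The metric mapping listed in commit 793b2de0cb can be pictured as a small lookup table from RDC-side field names to expressions over profiler counters. This is only an illustrative sketch: the dictionary, the `derive` function, and the way counters are passed in are assumptions, not how RDC or rocprofiler actually evaluate metrics. Only the field and counter names come from the commit message.

```python
from typing import Callable, Dict

Counters = Dict[str, float]

# Hypothetical table: field name -> function of raw counter values.
DERIVED_FIELDS: Dict[str, Callable[[Counters], float]] = {
    "OCCUPANCY_PERCENT": lambda c: c["OccupancyPercent"],
    "GR_ENGINE_ACTIVE": lambda c: c["GPU_UTIL"] / 100,  # percent -> fraction
    "TENSOR_ACTIVE_PERCENT": lambda c: c["MfmaUtil"],
    "FLOPS_64": lambda c: c["FP64_ACTIVE"],
}


def derive(field: str, counters: Counters) -> float:
    """Evaluate one derived field from a dictionary of raw counters."""
    return DERIVED_FIELDS[field](counters)


if __name__ == "__main__":
    # A GPU_UTIL reading of 50 (percent) becomes a GR_ENGINE_ACTIVE value of 0.5.
    print(derive("GR_ENGINE_ACTIVE", {"GPU_UTIL": 50.0}))
```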
adapryor c283aebd1c Add XGMI read/write sum metrics
Change-Id: I898b779ea7f5336edf0d047fb1e5d3ec40085baa


[ROCm/rdc commit: e20bc58b1c]
2024-10-09 17:02:55 -05:00
Galantsev, Dmitrii 999cae5e2c SWDEV-466829 - Disable ROCP when in GTest
Change-Id: I3b218fe256717c1dc9187d5f17476dfc990656c2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c40a6308c5]
2024-09-26 17:00:05 -05:00
Galantsev, Dmitrii 59ea16496e Increase MAX_NUM_DEVICES limit
Change-Id: I0cf21be156649818fd05a66928054710322b23ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d4a868cb69]
2024-09-25 20:58:19 -05:00
Galantsev, Dmitrii a4e55e52ec README: Add known issues section
Change-Id: I298750fdafed556480271cfce31c3fc88984cf0b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 04be1211c1]
2024-09-25 15:10:41 -05:00
Bill(Shuzhou) Liu 6372df9447 Update the hsaco for diagnostic on MI300X
Add hsaco for gfx940, gfx941 and gfx942

Change-Id: Ibd55fcc2d036d1190357e1e86d4e170568426d94


[ROCm/rdc commit: 9800528c19]
2024-09-17 14:15:35 -05:00
Li Ma 5ad5406de3 SWDEV-445415 - Pthread detach instead of pthread join
Detaching the thread that handles shutdown signals, instead of joining
it, avoids the segfault issue on specific ASICs.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I74ac53c027ac370605caaa87115c83fd8027526a


[ROCm/rdc commit: ca569346a3]
2024-09-13 18:32:37 -04:00
Sam Wu b60da58d74 Bump rocm-docs-core to 1.7.2
Update documentation requirements

Change-Id: I19cd1a96309844898e112777412e9c006a8874a0


[ROCm/rdc commit: b5df3a2135]
2024-09-13 14:01:46 -06:00
Galantsev, Dmitrii f3450f61bf Disable RVS as the error scares people
Change-Id: I572fdb65dd8882ab4fdc1474cb39fc0e493b1eab
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 660c5afaf4]
2024-09-13 10:42:59 -05:00
Li Ma 9b705c04eb SWDEV-483668 - Drop -shared-libasan flag for GCC compiler
Libasan is in gcc by default, thus building RDC with ASAN
enabled by GCC doesn't need -shared-libasan.

Change-Id: I8078f7ea5d46c6beea29c2823db3357a67f00b60
Signed-off-by: Li Ma <li.ma@amd.com>


[ROCm/rdc commit: 183c65c8b2]
2024-09-10 23:13:15 -04:00
Chen Gong dc905e20ff Implement the discovery -v command line interface
Call the previously implemented get_rdcd_version and rdc_get_smiversion

Change-Id: If76037d462fa9328c3af8c85423ee4547882e36e
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 0cfca6d93d]
2024-09-10 10:06:44 -05:00
Chen Gong cd98bb7f90 Implement the code related to the GetMixedComponentVersion()
Change-Id: I98aad97b4cb6498b7f2fc03a2d5ee7c9e949d5f1
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 1edd04d84e]
2024-09-10 10:06:44 -05:00
Chen Gong dc85a9e385 Provide a way for rdci to get component version
For rdci, the version information of some components (such as RDCD)
cannot be obtained through the rdc_device_get_component_version API.

Create a fake rdc API here to get them.

Change-Id: I75d8bcd1993873cff209995b58362f75787a4598
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 8db404f84f]
2024-09-10 10:06:44 -05:00
Chen Gong 891039280f Implement rdc_device_get_component_version API related code
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.

Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 5a3fd9fbc1]
2024-09-10 10:06:44 -05:00
Chen Gong 2ae8557614 Add an rdc API to get component version
Change-Id: I56250a6101debeb78628f1fd1dfff7f21c52cdc0
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 69a9f24b6e]
2024-09-10 10:06:44 -05:00
Chen Gong d19c6dfa36 Reorganize the code path of the rdci Discovery Subsystem
Prepare for adding 'detection version information' later

Change-Id: Ib2b5e70b2360b1c5ff87a537f41f34f23c7ed61f
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 45c6d0b03b]
2024-09-10 10:06:44 -05:00
Chen Gong 2aba92bdce Add the function of outputting rdci version information
Change-Id: Iabeec48ba2e109ead7fb6fb07454ebcdc74a11e6
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 6591563d53]
2024-09-10 10:06:44 -05:00
Chen Gong c495b7d086 Add the function of outputting rdcd version information
Change-Id: I0572fd4b98f697660ab9099deabfd4f0fce802f3
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 9f8d447e75]
2024-09-10 10:06:44 -05:00
Chen Gong 5db56b48eb Get the hash value and pass it to rdcd and rdci
Want to display version information along with the hash value.

Change-Id: I0f9ad576f8f66747ce2e84d4f524ccd16d399927
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: ac874d3921]
2024-09-10 10:06:44 -05:00
Galantsev, Dmitrii cc3c3ce9b6 Add OAM_ID
Change-Id: I771b2f7f088940838c09ba3521a7955faa64e7ec
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bffe4e22fa]
2024-09-09 21:19:33 -05:00
Galantsev, Dmitrii 3dd90a6ff2 Update python_interface and remove --enable_pci_id
Change-Id: Ie5d511f3da25221bf60bc669ab172323703a1c45
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bbe0b3573c]
2024-08-26 19:55:53 -04:00
Bill(Shuzhou) Liu 98ba530267 Update the document to install the rdc service
Correct the file path of the rdc.service in the document

Change-Id: Ib161e97abdd5e2a117b2758ff5407b55337ab25b


[ROCm/rdc commit: 56b08ea7c3]
2024-08-21 12:21:57 -05:00
Galantsev, Dmitrii 06a1bba81a INSTALL - Fix rdc groups and lock file check
This fixes issues like:

1.

  /run/lock/rdcd.lock: Bad file descriptor
  Failed to determine owner of lock file.: Numerical result out of range

2.

  rdc.service: Failed to determine group credentials: No such process
  rdc.service: Failed at step GROUP spawning rdcd: No such process

Change-Id: I0ef5eb6ab72d036a3ea8dcb81f7a9108d279f7d6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c015f0fcaa]
2024-08-12 18:54:43 -05:00
Galantsev, Dmitrii ef73a46c6c INSTALL - Add rdci binary
Change-Id: I2b7047989b650d6a3998d7a5b37fad7ade876b17
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c22a737ce7]
2024-08-12 16:58:35 -05:00
Galantsev, Dmitrii d541941b4c INSTALL - Always add video group to rdc user
Previously we relied on "render" to be enough and only used "video" as a
fallback. On some systems like SLES this might not be sufficient.

One issue happened when starting rocprofiler as part of RDC
initialization:
  what():  hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES
The issue only happened when RDC was started with systemd.
Turns out "rdc" user (under which systemctl starts RDC) only had render
but not video group. Adding video group solved the issue.

Change-Id: Idf6a9521ae72a0b28a428869aa7ab1edde3ae259
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 4ebc34095c]
2024-08-12 16:58:35 -05:00
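The group requirement described in commit d541941b4c can be checked before starting the service. The sketch below is an assumption-laden illustration rather than part of the RDC installer: the function names are hypothetical, and only the group names "render" and "video" and the user name "rdc" come from the commit message.

```python
from typing import Iterable, List

REQUIRED_GROUPS = ("render", "video")


def missing_groups(user_groups: Iterable[str],
                   required: Iterable[str] = REQUIRED_GROUPS) -> List[str]:
    """Return the required groups that the user does not belong to."""
    present = set(user_groups)
    return [group for group in required if group not in present]


def groups_of(user: str) -> List[str]:
    """Look up the group names of a local user (Unix only)."""
    import grp
    import os
    import pwd

    primary_gid = pwd.getpwnam(user).pw_gid
    names = []
    for gid in os.getgrouplist(user, primary_gid):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A GID without a name entry cannot match a required group name.
            continue
    return names


if __name__ == "__main__":
    # A user with only "render" is reported as missing "video",
    # which is the situation the commit describes for the "rdc" user.
    print(missing_groups(["render"]))
```

On a real system, `missing_groups(groups_of("rdc"))` would report which of the two groups the service user still lacks, assuming that user exists.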
AravindanC efc8298f66 SWDEV-396819 Update File Permissions for other binary files
Change-Id: I085b482e87a016c82b339e2efe67e3d1b5a7af21


[ROCm/rdc commit: 9155768fe7]
2024-07-25 18:02:29 -07:00
Galantsev, Dmitrii 8234acd12b Azure - Switch to amd-staging branch
Change-Id: If37b4cd804e0ea50ea4031118b83090263fd39f6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 35357d85d0]
2024-07-23 17:08:32 -05:00
Tom Rix 27f35431ea Fix build with BUILD_STANDALONE=OFF
When rdc is built with the configure option
-DBUILD_STANDALONE=OFF

the following error occurs:
CMake Error at rdc_libs/CMakeLists.txt:106 (export):
  export given target "rdc_client" which is not built by this project.

Resolve this by using a conditional.

Change-Id: I3f6bb2946c609c7db9fc38015b7d9c8ae766f3a0
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 6762a6dd8b]
2024-07-08 12:49:09 -05:00
Galantsev, Dmitrii 970cc3e72a Update CHANGELOG.md and README.md for ROCm 6.2
Change-Id: If062cb23290469beef0b04a146c485602377be5d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bd9901324c]
2024-06-26 17:40:59 -05:00
Galantsev, Dmitrii 9a2806ac95 SWDEV-452795 - Disable RAS plugin, fix XGMI
The RAS plugin loaded rocm-smi, which conflicts with the amd-smi library

Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi

We assume that rocm-smi would get lazy-loaded by the RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains a different number of elements, and the enums are also different.
RDC relies on amd-smi's enums.

One such enum is kDevGpuMetrics:
  rocm-smi: kDevGpuMetrics = 68
  amd-smi:  kDevGpuMetrics = 75

Example of overlapping map definitions:

  $ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
  00000000003c4980 g    DO .data.rel.ro 0000000000000008  Base        devInfoTypesStrings
  00000000003db830 g    DO .bss         0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  $ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so  | grep devInfoTypesStrings
  00000000003dc590 g    DO .bss         0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  00000000003c9c68 g    DO .data.rel.ro 0000000000000008  Base        devInfoTypesStrings

Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d85657e5f2]
2024-06-26 03:42:07 -05:00
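The manual `objdump | grep` comparison in commit 9a2806ac95 can be generalized to list every dynamic symbol that two shared libraries both define. This is a rough sketch, not a tool shipped with RDC: the function names are hypothetical, and the parser assumes the usual `objdump --dynamic-syms` row layout, where the first field is a hex address and the last field is the symbol name.

```python
import string
import subprocess
from typing import Set

HEX_DIGITS = set(string.hexdigits)


def defined_dynamic_symbols(objdump_output: str) -> Set[str]:
    """Collect names of defined symbols from `objdump --dynamic-syms` text."""
    names = set()
    for line in objdump_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank lines and table headers
        if not set(fields[0]) <= HEX_DIGITS:
            continue  # rows of interest start with a hex address
        if "*UND*" in fields:
            continue  # undefined symbols are imports, not definitions
        names.add(fields[-1])
    return names


def read_dynamic_symbols(library_path: str) -> str:
    """Run objdump on a shared library and return its raw output."""
    result = subprocess.run(
        ["objdump", "--dynamic-syms", library_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def overlapping_symbols(first_output: str, second_output: str) -> Set[str]:
    """Return symbol names that both objdump listings define."""
    return (defined_dynamic_symbols(first_output)
            & defined_dynamic_symbols(second_output))
```

Calling `overlapping_symbols(read_dynamic_symbols(a), read_dynamic_symbols(b))` on the two library paths from the commit message would be expected to include `devInfoTypesStrings`, assuming both libraries are installed and still export it.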
Galantsev, Dmitrii 3132f91d38 SWDEV-468423 - Install authentication scripts
Change-Id: I4289fa546bf44861c18f71e156c84a4f7dd4a2ed
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: a885944d97]
2024-06-18 17:20:12 -05:00
Galantsev, Dmitrii b50c64b868 Use correct rocprofiler metrics
Change-Id: I26603de7425abb6588f770ed68c22e14d6d20d56
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d4bb33d100]
2024-06-11 11:15:18 -05:00
Galantsev, Dmitrii 73948f95e2 Rewrite rocprofiler plugin
Change-Id: Ic7dd967cc60cacd2b16a465180505ea2a342fccf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 3514225b83]
2024-06-11 03:11:15 -05:00
Galantsev, Dmitrii 29b86095ed Fix rocprofiler plugin
- Replace non-working fields with working ones
    - remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
    - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
    - ROCP_METRICS is only relevant for rocprofv1 (which we are using)

Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 20ca2ce574]
2024-06-06 01:51:39 -05:00
Galantsev, Dmitrii c2a75bbe4c Finalize the rocprofiler fields
Change-Id: I4ed1c4309f21bdcc7281d911663036caf5947182
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 07c414af5e]
2024-06-04 19:49:06 -05:00
Galantsev, Dmitrii f73e123900 Add GPU indexing and fix check for fields in rocprof
- Fix RUNPATH for tests

Change-Id: I79517592b49d27080a010a2e41e5878adf24a157
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e11afbf60f]
2024-06-04 12:56:22 -05:00
Maisam Arif d9adf280cd Updated RDC to use AMD-SMI 24.6.0 structs
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I9ef0f3cb786c1238e53cf21df5c6afafac829175


[ROCm/rdc commit: 7c6bd4dc1c]
2024-05-31 10:37:39 -05:00