From 682998015213d2ebbbed58f82e22e7174b604649 Mon Sep 17 00:00:00 2001
From: Charis Poag
Date: Thu, 5 Dec 2024 19:23:16 -0600
Subject: [PATCH] [SWDEV-495824] AMD SMI reporting CPX partitions incorrectly

Updated changelog to provide options to users on how to fix.

Change-Id: I4fd04b1e65ff9d678b2d13109599f57a03c84d41
Signed-off-by: Charis Poag

[ROCm/amdsmi commit: b911a0606a1a1dedeb080e0773eb05020b762f2d]
---
 projects/amdsmi/CHANGELOG.md | 66 ++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/projects/amdsmi/CHANGELOG.md b/projects/amdsmi/CHANGELOG.md
index 0ad25a11ea..9198857f1d 100644
--- a/projects/amdsmi/CHANGELOG.md
+++ b/projects/amdsmi/CHANGELOG.md
@@ -45,6 +45,72 @@ GPU2 0000:46:00.0 32 Gb/s 512 Gb/s XGMI
 
 ### Known issues
 
+- **AMD SMI only reports 63 GPU devices when setting CPX on all 8 GPUs**
+  When setting CPX as a partition mode, there is a DRM node limitation of 64.
+
+  This is a known limitation of the Linux kernel, not the driver. Other drivers, such as those using PCIe space (e.g., ast), may be occupying the necessary DRM nodes.
+
+  The number of DRM nodes used can be checked via `ls /sys/class/drm`
+
+  Options are as follows:
+  1) ***Workaround - removing other devices using DRM nodes***
+
+     Recommended steps for removing unnecessary drivers:
+     a. Unload amdgpu - `sudo rmmod amdgpu`
+     b. Remove unnecessary driver(s) - ex. `sudo rmmod ast`
+     c. Reload amdgpu - `sudo modprobe amdgpu`
+     d. Confirm `amd-smi list` reports all nodes (this can vary per MI ASIC)
+
+  2) ***Update your OS' kernel***
+     Typically you can find examples online by searching "`Update kernel command line`"
+
+     Ex. "Update kernel Ubuntu 22.04 command line" should provide some good examples.
+     https://phoenixnap.com/kb/how-to-update-kernel-ubuntu
+
+  3) ***Building and installing your own kernel***
+     *This option is helpful for users on OS distributions that have not yet merged the necessary changes.*
+     https://phoenixnap.com/kb/build-linux-kernel
+
+     All changes are in the mainline kernel if users need to build their own.
+
+     References to kernel changes:
+     ```text
+     for libdrm :
+     Author: James Zhu
+
+     Date: Mon Aug 7 10:14:18 2023 -0400
+
+     xf86drm: use drm device name to identify drm node type
+
+     Currently drm node's minor range is used to identify node's type.
+
+     Since kernel drm uses node type name and minor to generate drm
+
+     device name, It will be more general to use drm device name to
+
+     identify drm node type.
+
+     Signed-off-by: James Zhu
+
+     Reviewed-by: Simon Ser
+
+     commit 1080273c2b31db6f031a7f889f3104f53ab4502c
+
+     Author: James Zhu
+
+     Date: Mon Aug 7 10:06:32 2023 -0400
+
+     xf86drm: update DRM_NODE_NAME_MAX supporting more nodes
+
+     Current DRM_NODE_NAME_MAX only can support up to 999 nodes,
+
+     Update to support up to 2^MINORBITS nodes.
+
+     Signed-off-by: James Zhu
+
+     Reviewed-by: Simon Ser
+     ```
+
 ## amd_smi_lib for ROCm 6.3.1
 
 ### Added
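
The changelog entry suggests checking DRM node usage with `ls /sys/class/drm`. That listing also includes connector entries (for example `card0-DP-1`) and a `version` file, so a raw count can overstate the number of nodes. The following is a minimal sketch of a stricter count, not part of AMD SMI. It assumes primary nodes are named `card<N>` and render nodes `renderD<N>`. The function name `count_drm_nodes` and the `DRM_DIR` override are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMD SMI): count entries in a directory
# whose name is PREFIX followed only by digits, e.g. card0 or renderD128.
count_drm_nodes() {
    dir=$1
    prefix=$2
    count=0
    for entry in "$dir"/"$prefix"*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$entry" ] || continue
        name=${entry##*/}
        suffix=${name#"$prefix"}
        case $suffix in
            ''|*[!0-9]*) ;;                  # skip connectors such as card0-DP-1
            *) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# DRM_DIR is an assumed override for testing; the default is the sysfs path
# named in the changelog entry.
drm_dir=${DRM_DIR:-/sys/class/drm}
echo "primary nodes (card<N>):   $(count_drm_nodes "$drm_dir" card)"
echo "render nodes (renderD<N>): $(count_drm_nodes "$drm_dir" renderD)"
```

Running this before and after step b of the workaround (for example, after `sudo rmmod ast`) should show whether a non-amdgpu driver was holding one of the nodes; the counts can then be compared against the 64-node limit described in the entry.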