Commit Graph

707 commits

Author SHA1 Message Date
Charis Poag 6fada8c4a6 Merge amd-staging into amd-master 20240401
Change-Id: I52c8665735e86deed53645197c11889fc7ece8c5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-04-01 17:48:06 -05:00
Charis Poag f5c32b5415 Add ROCm 6.1.1 changelog, ROCm SMI deprecation, vbios fix
* Updates:
    - Add ROCm 6.1.1 Changelog updates
    - Add planned ROCm SMI deprecation notice
    - Fix rocm-smi --showvbios showing extra errors
      for GPUs which do not have a VBIOS (MI300a ASICs)

Change-Id: I0e5ccfe2677f9c7909ca13863a920e323e82b439
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-30 00:11:09 -05:00
guanyu12 fe5648805f Merge amd-staging into amd-master 20240329
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Iea46d075f0ee45bb68469e87b377ce3519b39e2b
2024-03-29 10:26:04 +08:00
Bill(Shuzhou) Liu 750704720b Unlock the mutex when process is dead
After the dead process is detected, pthread_mutex_consistent() will
be called. After that, the pthread_mutex_unlock() should also be
called to unlock it: "It is the responsibility of the application to
recover the state so it can be reused."

Change-Id: I45d3e2e68c3b06779f3acb1e908dbec0c6a39297
2024-03-21 15:31:10 -05:00
guanyu12 8d4261c5c5 Merge amd-staging into amd-master 20240321
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I006fc6c187f134a4851e262fa53ab6bf8d58759d
2024-03-21 14:03:51 +08:00
Charis Poag c5acd4ee88 Update ROCm 6.0/6.1 CHANGELOG.md & README.md
* Updates:
    - [CHANGELOG.md] Provide 6.1 and 6.0 changes
    - [README.md] Update readme with relevant changes
    - [CLI] Updated --showpower to expand on types of power provided to users

Change-Id: Ic653cc81f80b7973654e2c23e1ab70567b930aa7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-20 00:17:33 -05:00
guanyu12 ab8ebd4dea Merge amd-staging into amd-master 20240314
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I1d79ce09196cf101c2a885fd6be8f1094e8d5f9f
2024-03-14 11:15:44 +08:00
Galantsev, Dmitrii 9a3a50f929 Fix misc memory leaks
Change-Id: I3dbf56e98d8c1312f9081956ed590962b2bdace3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-03-08 16:26:47 -06:00
Galantsev, Dmitrii b60541ef42 Fix memory leak created by hanging opendir
Change-Id: I01e372c6a6b427f21e89cb5e4217f876346a35be
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-03-08 16:26:47 -06:00
Galantsev, Dmitrii 46ea462189 Add .github/CONTRIBUTING.md
Change-Id: Ie20c720514666dec307a92ec05fe9c3b56ba9cc5
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-03-08 16:25:35 -06:00
Charis Poag c2035fa1b9 [SWDEV-436308] Add Partition_ID from KFD
* Updates:
    - [CLI] rocm-smi (no arg) and --showhw:
      Now displays 'ID'/'PARTITION ID' from the pcie_id identifier
      Helps users identify which partition # the device is
      Information provided by KFD
      Note: a partition_id of 0 means a primary node (AKA root node),
      e.g. ASICs which do not have partitioning support will show 0
    - [API] Fix partition nodes which do not enumerate with domain:
            Adding kfd's domain, allows ASICs which have domains
            to enumerate in proper order.
            Full pcie_id / bdf propagates to all partition nodes.
    - [API] Update rsmi_dev_pci_id_get() to allow users to extract
      partition_id from device
    - [CLI] Added fix for devices which have a modprobe failure
      where DRM does not come up properly, even though the driver
      shows initialization was successful.
    - [API/Utils] Overloaded print_int_as_hex() template:
      Now accepts bitsize, and prints in smallest byte size
      possible. Note: for a bitsize of < 8, please just print as decimal.

Change-Id: Ib0c6f73b2b9c9fea29442a39a669c432874382d8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-08 10:51:15 -05:00
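
The print_int_as_hex() behaviour described in the commit above (accept a bit size, print in the smallest byte size possible, fall back to decimal below 8 bits) can be illustrated with a minimal Python sketch. The name `int_as_hex`, the rounding of the bit size up to whole bytes, and the masking of extra bits are assumptions made for illustration; the library's C++ template may behave differently.

```python
def int_as_hex(value: int, bitsize: int) -> str:
    """Format value as hex, zero-padded to the smallest whole-byte width.

    Hypothetical helper, not the library's implementation. Values narrower
    than one byte are printed as plain decimal, as the commit note suggests.
    """
    if bitsize < 8:
        return str(value)
    num_bytes = (bitsize + 7) // 8        # round the bit size up to whole bytes
    mask = (1 << (num_bytes * 8)) - 1     # drop any bits that do not fit
    return "0x{:0{width}X}".format(value & mask, width=num_bytes * 2)
```

For example, `int_as_hex(0x1F, 16)` pads the value to two bytes, while `int_as_hex(5, 4)` simply returns the decimal string.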
guanyu12 4d1ea826e1 Merge amd-staging into amd-master 20240308
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I2edf51a9b8f93589bf6eadee7b2691629c433977
2024-03-08 16:22:17 +08:00
David Galiffi 020c7c3e3f Add Doc team to CODEOWNERS file
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Change-Id: Iad8eea0645b63bddb835ed22080facc7d25c1bc0
2024-03-07 11:45:12 -05:00
Istvan Kiss 50a079af0f Update documentation and add python API documentation
Change-Id: Ibccf5b6a5fba81cea42e04a022deac8a3207b9b8
2024-03-06 22:01:30 -05:00
Charis Poag 90160a7c9c Fix rocm_smi library calls
    - [CLI] Rounded VRAM output on CLI, no difference in output
    - [python API] Fixed initializing calls which reuse initializeRsmi()
      calls - now we set a global reference to rocmsmi to use
      throughout API calls (see error below)

Traceback (most recent call last):
  File "/home/charpoag/rocmsmi_pythonapi.py", line 9, in <module>
    rocm_smi.initializeRsmi()
  File "/opt/rocm/libexec/rocm_smi/rocm_smi.py", line 3531, in initializeRsmi
    ret_init = rocmsmi.rsmi_init(0)
NameError: name 'rocmsmi' is not defined

Change-Id: I0eff3b8a432abf6d4344a02b9f638e1191c51a19
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-04 21:08:08 -06:00
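
The NameError in the commit above arises when an initializer uses a library handle that was never bound at module level. A minimal sketch of the "global reference" pattern the message describes follows. `FakeRsmiLibrary`, `load_library` and `initialize_rsmi` are hypothetical stand-ins, not the real rocm_smi.py code.

```python
class FakeRsmiLibrary:
    """Hypothetical stand-in for the loaded shared library."""

    def rsmi_init(self, flags: int) -> int:
        return 0  # 0 stands for success in this sketch


rocmsmi = None  # module-level handle reused by every API call


def load_library() -> FakeRsmiLibrary:
    # Placeholder for the real library-loading step (assumption).
    return FakeRsmiLibrary()


def initialize_rsmi() -> int:
    """Bind the shared handle once, then initialize through it."""
    global rocmsmi
    if rocmsmi is None:
        rocmsmi = load_library()
    return rocmsmi.rsmi_init(0)
```

Because the name is bound at module scope before any call, a script that imports the module and calls the initializer directly no longer hits an undefined name.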
Maisam Arif 3c64c32d99 Merge amd-staging into amd-master 20240304
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I99e9de40a57539407ce06b1e7385830c662134e1
2024-03-04 15:28:58 -06:00
Oliveira, Daniel 35b561fd69 fix: [SWDEV-432974] [rocm/rocm_smi_lib]
Checks returned error by get_gpu_pci_bandwith() before assert

Code changes related to the following:
  * Unit tests

Change-Id: Ia0fe64f168711147c5e66c7917cf633be40dee9f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-03-01 17:30:07 -06:00
Charis Poag a74062cc30 Merge amd-staging into amd-master 20240226
Change-Id: I1d6db79aa35dabbfb4b837ffdb5dd63ff099cbd9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-26 13:50:16 -06:00
Charis Poag 93ed5205f9 Merge amd-staging into amd-master 20240216
Change-Id: Id3e41507ab6143d08cb052710aa19c6f2e402fed
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-16 20:03:19 -06:00
Oliveira, Daniel b4d37caa70 fix: [rocm/rocm_smi_lib] rsmi_dev_activity_metric_get gfx/memory activity does not update with GPU activity
Checks and forces rereading gpu metrics unconditionally

Code changes related to the following:
  * Device::dev_log_gpu_metrics()
  * Examples
  * Unit tests

Change-Id: Ic1c4f34a39f2bf197263f80ddbb84da26345807d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-02-16 09:47:45 -06:00
Oliveira, Daniel ce36198cb1 fix: [rocm/rocm_smi_lib] header cleanup Remove non-unified headers
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards

Code changes related to the following:
  * 'rsmi_dev_metrics_' APIs
  * Functional tests
  * Examples

Change-Id: I7d562a95889361ee6f8f7588f8a790f42c8eb262
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-02-14 17:50:26 -06:00
Charis Poag 4b5ccb57f0 [SWDEV-423481/SWDEV-423393] Align all device identifier details
Updated:
 * [CLI] Fixed vram % - printf style formatting causes many data errors
   This fix updates to the recommended way of outputting formatted data.
   https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
 * [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
       -> CLI name: "GUID"
       -> ROCm SMI calls: no arg, -i, --showhw, --showproduct
 * [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
       -> CLI name: "Node"
       -> ROCm SMI calls: no arg, --showhw, --showproduct
 * [CLI] Added target gfx version from kfd
       -> CLI name: "GFX Version" or "GFX VER"
       -> ROCm SMI calls: --showhw, --showproduct
 * [CLI] Base ROCm CLI
       -> Removed - stacked id formatting:
	   This is to simplify identifiers helpful to users.
	   More identifiers can be found on -i --showhw, --showproduct
 * [CLI] Update -i, --showhw, --showproduct, w/out arg
      -> Card ID/DID/Model/SKU/VBIOS:
            All unsupported values now display "N/A" instead
            of "unknown" or "unsupported"
 * [CLI] Showhw now expands data based on content

Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-13 19:52:29 -05:00
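
The printf-style formatting problem mentioned in the commit above is a known Python quirk: the `%` operator treats a tuple operand as an argument list, so some values fail to format. The sketch below contrasts it with `str.format`. The function names are hypothetical and this is not the CLI's actual code.

```python
def vram_percent(used_bytes: int, total_bytes: int) -> str:
    """Return VRAM use as a percentage string without printf-style escapes."""
    percent = 100 * used_bytes // total_bytes if total_bytes else 0
    return "{}%".format(percent)  # no need to escape the literal percent sign


def describe(value) -> str:
    """str.format handles any value, including tuples."""
    return "value: {}".format(value)


def describe_printf(value) -> str:
    """printf-style version: raises TypeError when value is a 2-tuple."""
    return "value: %s" % value
```

Passing `(1, 2)` to `describe_printf` raises a TypeError because the tuple is unpacked into two arguments for a single `%s`, whereas `describe` formats it as one value.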
Vladimir Stempen 677433b367 Fix [Not supported] status for get_compute_process_info_by_pid
On some systems [rocm-smi --showpids] reports
get_compute_process_info_by_pid, Not supported on the given system
[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN

get_compute_process_info_by_pid fails because cu_occupancy debugfs method
is not provided on some graphics cards and GFX revisions by design

Proposing a change to return success status when only cu_occupancy debugfs method
is not found and provide cu_occupancy invalidation value to mark only
this parameter as UNKNOWN

Change-Id: Iae37070d9bd19483b4e6c8ee24c7d9a4c92f00d7
Signed-off-by: Vladimir Stempen <Vladimir.Stempen@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-13 18:17:47 -05:00
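
The approach proposed in the commit above, returning success and marking only cu_occupancy as UNKNOWN when its source is missing, can be sketched as follows. The sentinel value, the function names and the reader callback are all hypothetical; the library's real invalidation value and file access are not shown in the commit message.

```python
CU_OCCUPANCY_INVALID = 0xFFFFFFFF  # hypothetical "unknown" sentinel


def read_process_info(pid: int, read_cu_occupancy) -> dict:
    """Collect process info, tolerating a missing cu_occupancy source.

    read_cu_occupancy is a caller-supplied function (assumption) that
    raises FileNotFoundError when the debugfs entry does not exist.
    """
    info = {"pid": pid, "cu_occupancy": CU_OCCUPANCY_INVALID}
    try:
        info["cu_occupancy"] = read_cu_occupancy(pid)
    except FileNotFoundError:
        pass  # keep the sentinel; the rest of the record is still valid
    return info


def format_cu_occupancy(info: dict) -> str:
    """Render the sentinel as UNKNOWN, any other value as a number."""
    value = info["cu_occupancy"]
    return "UNKNOWN" if value == CU_OCCUPANCY_INVALID else str(value)
```

Only the one field degrades to UNKNOWN, so the caller can still report the process instead of failing the whole query.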
Galantsev, Dmitrii de9eaaac8c CMAKE - Default to lib instead of lib64
Change-Id: Ib21d41018b091d92c2ed408ff0c4d28e6a74c903
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-12 20:16:28 -06:00
Galantsev, Dmitrii d03061823a Merge amd-staging into amd-master 20240212
Change-Id: I662f2a470446550ba8c612aa1e5be911d7f7489f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-12 11:30:04 -06:00
Bill(Shuzhou) Liu 4e0a7f2f67 Support set min or max clock
In addition to being able to set the clock range, a new setextremum
option is added to set only the min or max clock, as sometimes one of
them may not be supported.

Change-Id: I7c91ba308f3fc6c78efc88117509c515d403a6cb
2024-02-09 09:24:26 -06:00
Galantsev, Dmitrii 1015cba489 Add lychee.toml for dead link checks
Use Lychee[1] to check dead links

[1] - https://github.com/lycheeverse/lychee

Change-Id: I741a2760283da8c21b95e5b516f78e39a9d9a0a1
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-08 18:18:39 -05:00
Charis Poag c18ec624af [SWDEV-437365] Fix --showpower
Updates:
  - [CLI] Switching to use generic rsmi_dev_power_get()
  this is a backwards compatible function to
  retrieve power values. More consistent than
  previous fixes.
  - [API] Update API for rsmi_dev_power_get()
  Now provides @deprecated for this function.
  Providing notes that newer ASICs only support
  current socket power, whereas previous
  ASICs only provided average power.

Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
(cherry picked from commit 51aec98edd)
2024-02-02 19:30:46 -05:00
Charis Poag 51aec98edd [SWDEV-437365] Fix --showpower
Updates:
  - [CLI] Switching to use generic rsmi_dev_power_get()
  this is a backwards compatible function to
  retrieve power values. More consistent than
  previous fixes.
  - [API] Update API for rsmi_dev_power_get()
  Now provides @deprecated for this function.
  Providing notes that newer ASICs only support
  current socket power, whereas previous
  ASICs only provided average power.

Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-02 00:00:38 -06:00
guanyu12 23b3376398 Merge amd-staging into amd-master 20240201
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I285bcd292990730ccad6b663ba6943211e6a5bba
2024-02-01 14:45:10 +08:00
Charis Poag 5d2cd0c271 Add rsmi_dev_target_graphics_version_get
Updates:
   - [API] rsmi_dev_target_graphics_version_get, takes
     reported value from KFD -> parses into human-readable
     values. If device does not support, returns MAX UINT64
     value and RSMI_STATUS_NOT_SUPPORTED.
     Otherwise, puts into base10 format removing
     extra 0's + putting in correct format. If user
     provides nullptr, returning RSMI_STATUS_INVALID_ARGS.
    - [Test/Example] sys_info_read updated to include
     new rsmi_dev_target_graphics_version_get tests

Change-Id: I50f94e06b8733a5dec2eb08f284b44927f36abcd
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-01-29 14:25:24 -06:00
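
As a rough illustration of turning a KFD-reported target version into a human-readable value, the sketch below assumes the raw number is encoded as major * 10000 + minor * 100 + stepping. That encoding, the `gfx` string layout and the function name are assumptions. The commit states that the API itself returns a base-10 number, so this is not the library's output format. The MAX UINT64 "not supported" value is taken from the commit message.

```python
UINT64_MAX = 2**64 - 1  # value the commit says is returned when unsupported


def decode_gfx_target_version(raw: int) -> str:
    """Decode an assumed major*10000 + minor*100 + stepping encoding."""
    if raw == UINT64_MAX:
        return "N/A"
    major = raw // 10000
    minor = (raw // 100) % 100
    stepping = raw % 100
    # Minor and stepping are rendered as hex digits (e.g. stepping 10 -> "a").
    return "gfx{}{:x}{:x}".format(major, minor, stepping)
```

Under these assumptions a raw value of 90010 decodes to "gfx90a" and 110000 decodes to "gfx1100".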
Galantsev, Dmitrii 9386d60522 Merge amd-staging into amd-master 20240124
Change-Id: I358fde8bed15c8b2a240a0be8cf5411e21238b08
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-24 16:26:34 -06:00
Bill(Shuzhou) Liu 73c65b6bfe UMC ECC count return not supported
The current code assumes the err_count sysfs file only has 2 lines, which
changed for umc_err_count when an extra line for deferred errors was added.
The code is changed to relax this check.

Change-Id: I1c469555a5d460d7bc4f4926245646c09c6a2056
2024-01-24 08:31:24 -06:00
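
A relaxed check of the kind described in the commit above might look like the following sketch: instead of requiring exactly two lines, the parser accepts any number of "name: count" lines and only insists that the two original counters are present. The `ue`, `ce` and `de` labels and the function name are assumptions about the file layout, not taken from the commit.

```python
def parse_err_count(text: str) -> dict:
    """Parse 'name: count' lines, tolerating extra counters.

    Assumes labels such as 'ue', 'ce' and 'de' (hypothetical layout).
    """
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'name: count'
        counts[key.strip()] = int(value.strip())
    # Only the two original counters are mandatory; extras are kept as-is.
    if "ue" not in counts or "ce" not in counts:
        raise ValueError("missing ue/ce counters")
    return counts
```

With this shape, a file that gains a third line still parses, and callers that only read the first two counters are unaffected.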
Bill(Shuzhou) Liu 905c25e59b Voltage clock display as 0 when overdrive and voltage not supported
Change the Python tool to not display the above information if it is
not supported.

Change-Id: I48ffd95f07168219a629dfb391c1b4587308286d
2024-01-19 17:11:08 -05:00
guanyu12 68ba8fd4ff Merge amd-staging into amd-master 20240118
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I22971ade4774319930cb0a9bced2e3c3d7e91265
2024-01-18 10:29:57 +08:00
Bill(Shuzhou) Liu a0ec98c30d Return NOT_SUPPORT for set function in VM guest
Fix the unit tests which fail in a VM guest environment.

Change-Id: Id7c58887692bbdecba54f5d2d8463b292e19b4ad
2024-01-17 11:18:25 -06:00
Galantsev, Dmitrii 147af192b5 Remove word 'error' from non-error message
This simplifies grep lookup

Change-Id: I46cd13e0ab414791655fd93e8dcf270a946a6687
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-12 15:18:55 -06:00
Sam Wu 67dc4b0f2a [ROCDOC-95] Standardize documentation for ReadtheDocs
Apply the following changes to project documentation for ReadtheDocs:

- add version number to documentation left navigation bar and page title
- add an "About" section with a license page
- enable htmlzip, pdf, epub formats when publishing on Read the Docs
- set pdf title, author, copyright, and version
- rename .sphinx/.doxygen to sphinx/doxygen
- remove docBin from URL
- update rocm-docs-core dependency
- update dependabot config

Change-Id: Ife8c89a2e9323f436b3e54ef2a9e013c19b3b228
2024-01-11 17:47:58 -05:00
guanyu12 770a177077 Merge amd-staging into amd-master 20240111
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Ia13a1669d77e91446362e2e0c19e84496046c488
2024-01-11 11:30:59 +08:00
Oliveira, Daniel 373621aed3 rocm_smi_lib: Fix gpu_metrics_v1_5 support
Adds support for and implements APIs for 'gpu_metrics_v1_5'

Code changes related to the following:
  * gpu metrics 1.5 support
  * Unit tests
  * Examples

Build changes related to the following: None

Change-Id: Ie8917dd63c1dd1a94467b100fa44b634cebe62b6
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-01-05 14:24:34 -06:00
guanyu12 4b17a34716 Merge amd-staging into amd-master 20231214
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Iebd82680b7ed56abf84ad71a92a267a90a488aa6
2023-12-14 19:31:42 +08:00
Galantsev, Dmitrii 8615d096c3 SWDEV-436561 - Add CODEOWNERS
Change-Id: I4201a0fa76f61dd56c84d644bca049f9846b27fe
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-12-12 11:18:23 -06:00
guanyu12 6793fda4ef Merge amd-staging into amd-master 20231207
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Ic67feea6e7d21338cc3bbd76220f03effec59cbf
2023-12-07 13:21:57 +08:00
Charis Poag c6b0c93e6f Memory partition permission denied fix
Received EACCES return for file that does not have
write access (read only). Permissions would be an
issue, but we check for sudo/root permissions early on.

Change-Id: I98615b02e4acccc59facb42225887a6b7273716b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-12-06 21:51:30 -05:00
Galantsev, Dmitrii 48163b8d4f TESTS - Temporarily disable overdrive tests
Change-Id: Ice06d31e874621abf3135548eedfe2158281891d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-12-06 19:33:17 -06:00
Galantsev, Dmitrii 102c2c692a TESTS - Fix overdrive error on not-supported
Change-Id: I47e7f499229b47b151f4ba4d5fa9c59ac04d6816
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-12-06 02:43:04 -06:00
Galantsev, Dmitrii 1ae7164f20 Merge amd-staging into amd-master 20231205
Change-Id: Ib8b0672f8993cfd995d567f582dd9b33d03ddac4
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-12-05 10:28:15 -06:00
Oliveira, Daniel 8e0d3d5a39 rocm_smi_lib: Fix GPU Metrics Max Elements Read Exceeded
Code changes related to the following:
  * Check smallest copy size for multi-valued metrics
  * Unit tests: gpu_metric_read
  * ROCMSMI examples

Build changes related to the following:
  * CMakeLists.txt

Change-Id: Ieb2363020fa21c93fbacd0edcc1d394eed183051
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-12-04 17:01:08 -06:00
Galantsev, Dmitrii a128867497 Fix ASAN for tests and log metrics better
Change-Id: Ib495cfc28c48a4d291a89673a3b6fc13313845c7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-30 15:39:05 -05:00
Galantsev, Dmitrii 142fbac7ac Add linting via pre-commit and docker
Please see .pre-commit-config.yaml for details

- Add clang-format
- Add cpplint
- Add config for clang-tidy but don't enforce with pre-commit

Change-Id: Ica447c78e6fde94b43bfdc00f5b4efc338363e24
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-28 23:21:36 -05:00