Commit graph

88 Commits

Author SHA1 Message Date
Mukul Joshi 406859ca8a Update KFD SMI event notification handling
The event bitmask in the KFD SMI event message is now replaced with
an event index. Sending an event bitmask, which was a 64-bit field
with only 1 bit set, was quite wasteful of memory and also
potentially limited the interface to 64 events. The kernel now sends
the event index in the SMI event message instead. As a result, update
the KFD SMI event handling to expect the event index in the message.

Change-Id: I3e74620788d3c1f7c0bdaa69e9d9ab3d1aba2c92
2020-09-16 13:24:50 -04:00
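
The difference between the two encodings described in this commit can be illustrated with a minimal sketch. It assumes a zero-based event index purely for illustration; the commit message does not state the numbering the kernel uses, and the function names below are hypothetical rather than part of the library.

```python
MAX_MASK_EVENTS = 64  # a 64-bit mask can name at most 64 distinct events


def index_to_mask(index: int) -> int:
    """Return the one-hot 64-bit mask for a zero-based event index (assumed numbering)."""
    if not 0 <= index < MAX_MASK_EVENTS:
        raise ValueError(f"index {index} does not fit in a 64-bit mask")
    return 1 << index


def mask_to_index(mask: int) -> int:
    """Recover the event index from a mask that has exactly one bit set."""
    if mask <= 0 or mask & (mask - 1):
        raise ValueError("mask must have exactly one bit set")
    return mask.bit_length() - 1
```

The mask form spends eight bytes to carry a single set bit and cannot express a 65th event, whereas a plain index has no such ceiling.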
Chris Freehill cafd678d5d Add missing docs section for EvntNotif
Change-Id: I69187c734d2618ddb4272c58bb76d04646908793
2020-09-11 15:48:56 -05:00
Divya Shikre 54d4b9d500 Adding setsrange, setmrange, setvc, setslevel and setmlevel functionality to rocm lib and cli
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I5fd65ea7bcd5403aaf2e42d2aa28d837929da253
2020-09-08 18:42:39 -04:00
Chris Freehill 0468aa4971 Correct event counter documentation example
Change-Id: I74c41de8e4aacbd42d9e156983369eb76bec3367
2020-08-06 08:49:21 -05:00
Chris Freehill c2439d28e8 Correct usage of bitwise &
Also, fix warning related to catch() and cpplint error.

Change-Id: I4292170538d0f700fccb605814c5058543abe74a
2020-07-26 20:08:24 -05:00
Chris Freehill 52514835f0 Update xgmi event counter documentation
Also:
* fix doxygen manual generation that was altered during
  OAM refactor
* quiet some compile warnings.

Change-Id: I548a3cf00eb887bea3dbf58e362ca6dfe90bde28
2020-07-16 17:42:56 -05:00
Mukul Joshi eea1ed8c3d Add support to retrieve process SDMA usage information.
Also, print SDMA usage information in TestProcInfoRead.

Change-Id: I8d19be3b8653e298c81237e5067eca75a1743e70
2020-07-13 17:32:08 -04:00
Chris Freehill 68155baed5 Handle un-readable kfd properties files
Some systems have kfd sysfs properties entries that
are unreadable--for example, when a multi-gpu system is
dividing the gpus among containers, each container may
only be able to access certain gpus.

Previously, all kfd topology node properties entries were
assumed to be valid. Now, we check for readability before
declaring them "valid".

Fixes SWDEV-240169

Also:
* remove an assertion that would happen when no pcie
device identifier files are found on the system.
* fix cpplint issues

Change-Id: I74321b685159dd2628c890b33c39ad82988cb9dd
2020-07-10 12:35:31 -04:00
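
The readability check described in this commit might look roughly like the sketch below. It is not the library's code: the directory layout (one sub-directory per node, each holding a `properties` file) and the function name are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict


def readable_property_files(nodes_dir: str) -> Dict[str, Path]:
    """Map node name -> properties path, keeping only files that can actually be read."""
    valid: Dict[str, Path] = {}
    for node in sorted(Path(nodes_dir).iterdir()):
        props = node / "properties"  # assumed file name inside each node directory
        try:
            with open(props, "r") as fh:
                fh.read(1)  # attempt a real read, not just an open
        except OSError:
            continue  # missing or unreadable: do not declare this node valid
        valid[node.name] = props
    return valid
```

Attempting the read, rather than only testing permissions up front, also covers files that exist but fail on access, such as those hidden from a container.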
Chris Freehill c2ef9a6879 Fix docs + cmake_utils path issues
This corrects issues that arose after OAM reorganization.
It should address SWDEV-243294.

Also, fix some compile warnings that show up on RHEL.

Change-Id: Id14d444905da35cd7346bcfbcd82b6d0572708c4
2020-07-08 09:47:25 -05:00
Chris Freehill 6594f8f58b Refactor rsmi to support oam
Change-Id: Idc524e01ba06eb5c8d1682becaf5bf8ced5bffcf
2020-06-22 18:51:46 -05:00
Mike Li 488bbb668a Add support to retrieve XGMI hive id
Change-Id: I1eee05dd85ecb856889d1cfe0565454d2f538856
Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
2020-06-19 07:35:23 -07:00
Divya Shikre 2805ed16a4 Adding current voltage feature & gtest.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ic555a3af265e603419e2875d1989a366abc82596
2020-06-16 11:48:56 -04:00
Chris Freehill f946ea37ef Update XGMI perf counter test to show utilization
Also:
* When destroying a counter, make sure to stop the counter first
* In the test, do not stop (disable) the counter before
  reading it.
* Clean up some whitespace in other tests
* Re-add manual pdf file

Change-Id: I0786ef3a994ca568299c77e44f092af8943ac33d
2020-06-10 12:49:49 -04:00
Mukul Joshi e30ebbc787 Add support to retrieve process VRAM usage information.
Change-Id: I60843a99207a658022a26aa346b79f91863833cf
2020-05-26 15:19:24 -04:00
Chris Freehill 8ced9c986a Add RSMI ref manual to packages
Also,
* remove extraneous test files
* fix Doxygen docs. issues
* fix whitespace issues

Change-Id: I9b58b0d68bd125a34f4fe0dc84d609c7b0b6e30e
2020-05-18 23:40:38 -04:00
Mike Li c7d349183a Add functions that are used to query Hardware topology.
Change-Id: I0f4cd02b237bde4d6dccfb0e83e65376ecb1cfaa
Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
2020-05-18 12:37:27 -04:00
Chris Freehill 02e4a9c14f Don't use static variable for monitors
Change-Id: I24b5ccfa94b2d722b070a6c6385af9201d21d9c5
2020-05-15 08:05:06 -04:00
Chris Freehill 8e03d10035 Add ref counting for rsmi init and shutdown
Also, clean lint from kfd_ioctl.h file.

Change-Id: I5a2ae127ab6ab6676a1b075ed10858d0ebfe13c1
2020-05-11 15:57:42 -05:00
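
Reference counting for an init/shutdown pair is commonly structured as in the sketch below. This is a generic illustration under assumed names (`RefCountedLibrary`, `init`, `shut_down`), not the code added by this commit.

```python
import threading


class RefCountedLibrary:
    """Hypothetical stand-in for a library with global init/shutdown."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._refs = 0
        self.initialized = False

    def init(self) -> None:
        with self._lock:
            if self._refs == 0:
                self.initialized = True  # real set-up would run here, once
            self._refs += 1

    def shut_down(self) -> None:
        with self._lock:
            if self._refs == 0:
                raise RuntimeError("shut_down() called without a matching init()")
            self._refs -= 1
            if self._refs == 0:
                self.initialized = False  # tear-down runs only on the last release
```

With this pattern, two independent users in the same process can each call `init()` and `shut_down()` without one tearing down state the other still needs.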
Chris Freehill e1f0d7e85a Use user-mode version of kfd_ioctl.h file
Previously, the kernel-mode version was used.

Change-Id: I82bfff9c019a9059b4d0d198c6cf06dc515cc528
2020-05-07 17:13:59 -05:00
Chris Freehill 2235ede34c Add event notification API
Change-Id: Ib6e8efbe6cdefaa7de1f74bd26993e9b4b011649
2020-05-06 14:07:25 -05:00
Chris Freehill f8b57c3b16 Add device mutual exclusion tests and related fixes
* Added a new test to verify mutual exclusion of access to device
  resources
* Added some missing acquiring of mutexes to some RSMI calls, as
  well as try-catch blocks.

Change-Id: I87aac009878a0b2d1f975e1d5b794d887bb23ff9
2020-04-08 15:05:11 -05:00
Chris Freehill 52196caaee Shared mutex fixes and improvements
* Don't make different shared memory mutexes for different users
* Don't delete (unlink) the shared mutex file if the mutex
  initialization fails. This may mess up other processes that
  are using it. Instead, print a message on how to resolve the
  situation, and then throw an error.

  Note, this situation comes up when debug builds (usually)
  either assert() or otherwise end execution without a proper
  clean up.
* Remove cpplint from shared_mutex code

Change-Id: I5f8ca6150cac5c2405fb97007516da345093f966
2020-04-06 17:08:33 -05:00
Mukul Joshi fd79e5c161 Add rsmi_topo_get_numa_affinity()
Given a device index, return the corresponding NUMA node for the
device.
Also, add NUMA node tests to Sys Info Read test.

Change-Id: I0df4937470e6362e6737ccea568d4b3e5890c91a
2020-04-01 11:38:08 -04:00
Chris Freehill 324c0ca0e5 More general solution to api support hwmon mapping
This solution takes into account that some hwmons use
label files to map sensor types. The previous solution
did not take this into account.

Change-Id: I1d6204573cefa8197b2cfe0ffb412b545df3d80a
2020-03-16 11:37:47 -05:00
Chris Freehill 0bf81ed2f9 Fix segmentation fault that sometimes occurs on release builds
Fixes SWDEV-216441

Change-Id: I3ea01a4edd14000a103de751757dfaadc7d358bb
2020-02-24 17:17:26 -06:00
Chris Freehill 2d6e15190c Add rsmi_compute_process_gpus_get()
Given a process ID, give the device indices that process is
currently using.

Also:
* made corrections to how RSMI, amdgpu (ie, "card#") and
  KFD indices translate from one another
* add a few missing error codes to rsmi_status_string()
* fix some formatting

Change-Id: Icd2cae66bb4fec768da96af7cf9cf8b8b66ec7f9
2020-02-22 10:47:58 -06:00
Chris Freehill d00b9ac07d Security improvements
Improvements include
* adding additional build flags that warn about stack-smashing
and type conversion errors
* run-time checks for valid function input values and adequate
space for the result of arithmetic operations.
* make sure the default case for switch statements does something
besides just assert
* disable using env. var. debugging in release mode

Change-Id: I5f048310c5c56e05d9ec31bcc273404d6a0dd646
2020-01-16 14:56:27 -06:00
Chris Freehill 52dfa4bcca Docs., error checking and test improvements
* Update doc. on api-support function
* Check for valid integer value when reading a monitor int. val.
* If fan-write test attempts to set speed higher than max.
   possible, then skip the test

Change-Id: I01ad0ab1f4caffdb0d2c26e9575f278c35a6b017
2019-11-06 11:19:47 -05:00
Chris Freehill 68d25e82fd Support checking for specific device-getter api support
For device-getter functions, allow users to specify a nullptr
for the provided buffer. In those cases, the function will return
RSMI_STATUS_NOT_SUPPORTED if the hardware or system software does
not support the function. If the function is supported, then
RSMI_STATUS_INVALID_ARGS will be returned, unless a different
error is encountered.

Additionally, tests and documentation were updated to reflect
this change.

Change-Id: Ie7db3a4c8c66af97ebd7ee1e3b95cd331ace9d9c
2019-10-05 15:55:18 -05:00
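
The probing convention described in this commit can be modelled as follows. The sketch is a Python stand-in, not the C API: `None` plays the role of `nullptr`, the status names mirror those in the commit message with placeholder values, and `make_getter` and `is_supported` are hypothetical helpers.

```python
from enum import Enum, auto
from typing import Callable, List, Optional


class Status(Enum):
    # Names mirror the statuses in the commit message; values are placeholders.
    SUCCESS = auto()
    INVALID_ARGS = auto()
    NOT_SUPPORTED = auto()


Getter = Callable[[Optional[List[int]]], Status]


def make_getter(supported: bool) -> Getter:
    """Build a hypothetical device getter that follows the convention."""
    def getter(out: Optional[List[int]]) -> Status:
        if not supported:
            return Status.NOT_SUPPORTED
        if out is None:  # plays the role of a nullptr buffer
            return Status.INVALID_ARGS
        out.append(42)  # made-up reading
        return Status.SUCCESS
    return getter


def is_supported(getter: Getter) -> bool:
    """Probe support by passing no buffer; INVALID_ARGS means 'supported'."""
    return getter(None) is Status.INVALID_ARGS
```

A caller that needs to distinguish "unsupported" from "some other error" would compare against `NOT_SUPPORTED` explicitly instead of treating every other status as a failure.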
Ori Messinger 2412dff6a2 Display GPU vram vendor
Add support and testing for reading the vram vendor associated with
the GPU. The vram vendor can be found as a separate sysfs file at:
/sys/class/drm/card[X]/device/mem_info_vram_vendor
The vram vendor is displayed as a string value.

Change-Id: I12c8e56e57f45aa08d7d6c25338c4e468ed1c7fc
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2019-10-04 11:51:30 -04:00
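
Reading the sysfs file named in this commit can be sketched as below. The path pattern comes from the commit message; the function name, the `drm_root` parameter, and the choice to return `None` when the file is absent are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional


def read_vram_vendor(card: int, drm_root: str = "/sys/class/drm") -> Optional[str]:
    """Return the VRAM vendor string for card[X], or None if the file cannot be read."""
    path = Path(drm_root) / f"card{card}" / "device" / "mem_info_vram_vendor"
    try:
        return path.read_text().strip()  # sysfs values end with a newline
    except OSError:
        return None  # file missing or unreadable on this system
```

Passing a different `drm_root` makes the function easy to exercise against a temporary directory on machines without the hardware.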
Chris Freehill 551b15182b Add functions that tell what capabilities are supported
The new functions added in this commit allow a caller to tell up
front what functions, function variants and monitors are
supported.

Also,
* fixed a few documentation/formatting issues
* fixed a process_info test issue

Change-Id: I2184ab1a4a6898f847e791f273e2185d556e78e9
2019-09-23 13:30:47 -05:00
Chris Freehill 469af303d6 Make bdfid use 32 bit domain if possible
If the 32-bit domain is found in the kfd node properties for
a device, then it will be used when constructing the bdfid.
If it's not present, it will continue to use the 16 bit version.

Also, whether or not 32b or 16b are used for the domain, the
domain will now be placed in the upper 32b of the 64b bdfid.

* Fixed some unrelated doxygen issues

Change-Id: Icb5116daa1ab45ee305bdbe6cd5df5736dd3ffa3
2019-08-27 11:05:58 -04:00
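
The packing described in this commit can be sketched as follows. Placing the domain in the upper 32 bits comes from the commit message; the layout of the lower bits (bus in bits 8-15, device in bits 3-7, function in bits 0-2) is the conventional PCI encoding and is an assumption here, as are the function names.

```python
from typing import Tuple


def pack_bdfid(domain: int, bus: int, device: int, function: int) -> int:
    """Pack PCI location fields into a 64-bit id with the domain in the upper 32 bits."""
    return (
        ((domain & 0xFFFFFFFF) << 32)  # 16-bit domains simply leave high bits zero
        | ((bus & 0xFF) << 8)          # assumed: bus in bits 8-15
        | ((device & 0x1F) << 3)       # assumed: device in bits 3-7
        | (function & 0x7)             # assumed: function in bits 0-2
    )


def unpack_bdfid(bdfid: int) -> Tuple[int, int, int, int]:
    """Inverse of pack_bdfid: return (domain, bus, device, function)."""
    return (
        (bdfid >> 32) & 0xFFFFFFFF,
        (bdfid >> 8) & 0xFF,
        (bdfid >> 3) & 0x1F,
        bdfid & 0x7,
    )
```

Because the domain always occupies the same bit range, callers can decode the id the same way whether the source value was 16 or 32 bits wide.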
Chris Freehill 01e0800741 Fix issues with buffer length when getting brand name
* Specifically, address case when brand name is longer than buffer
provided

* Also, slightly modify prototype to match similar, existing APIs.

* Address some cpplint issues.

Change-Id: Iaf77304e23085123e88f301e4b33bc4e6be2a225
2019-08-26 07:21:02 -04:00
Ori Messinger 7f2d970a80 Display GPU brand name
Add support and testing for reading the brand name associated with
a specific GPU (such as mi25, mi50, mi60, etc). The brand name is
associated with the SKU of the GPU, and some brand names can be
mapped from multiple different SKUs.

Change-Id: I36eb95ca8e72efdd294ccd684841195925dfe820
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2019-08-22 12:24:29 -04:00
Chris Freehill aaecfd6fff Adjust how we read ECC block counter status
This change corresponds to kernel changes.

Change-Id: Ibd977e8b3338349036cb16e55fb0b2c9c187726d
2019-08-09 16:06:43 -05:00
Kent Russell a34832f11e Fix RAS change
RAS formatting changed, so get it to handle both types of sysfs output
until it's normalized
Change-Id: I56f2a2495af8ff4d01011bc614283376afb9ad0a
2019-08-08 12:09:18 -04:00
Chris Freehill 73c54e1fd0 Add support for rsmi_dev_memory_reserved_pages_get()
Also, don't return an error for empty sysfs files. The reserved memory
page file will often have no lines. We don't want it to appear that
this function is not supported if the file is empty.

Change-Id: I1d28bb184ea587bb578fe71dd75adc2a812d09a8
2019-08-06 11:42:03 -05:00
Chris Freehill cf13d6f4d8 Add rsmi_dev_serial_number_get()
Also correct whitespace issues

Change-Id: I7ffe23672304c31ed08d7148b04a19a7d4c3d7ef
2019-07-22 07:09:53 -05:00
Harish Kasiviswanathan 68cb303a44 Add rsmi_dev_drm_render_minor_get()
Function to get the drm minor number associated with ROCm device

Change-Id: I9356b9ca75151882acbb075076bc072f08b73aae
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
2019-07-11 13:12:34 -04:00
Chris Freehill 31e02fdc61 Add rsmi_dev_firmware_version_get()
Change-Id: Iba3e5f3eaa0eb031fc013fc168bded22bc249b5c
2019-07-09 22:50:44 -05:00
Chris Freehill 557e1f5704 Add xgmi error_status and error_reset functions
Also, comment corrections and added check for invalid arguments

Change-Id: I891cbf9b37bfda629914a008811b840323872c02
2019-07-09 09:55:05 -04:00
Chris Freehill 9b93cbe21d Add initial support for getting process information
Added implementation of and tests for
rsmi_dev_compute_process_info_by_pid_get() and
rsmi_dev_compute_process_info_get()

Change-Id: I4c4f5f39fe6701da37916c9ad41449b5d35ac7af
2019-07-03 20:01:43 -05:00
Chris Freehill 1c5e090507 Add rsmi_dev_memory_busy_percent_get()
Change-Id: Ide683b6c72870af547331f4502c5bb8c445d61b5
2019-06-25 19:09:13 -05:00
Chris Freehill ea26baec20 Event counter support
XGMI related events are supported

Change-Id: If17036fe890c8be45da3654353599821b5828c14
2019-06-24 17:40:01 -05:00
Kent Russell 35d2807196 Add support for reading GPU's unique ID
Add support and testing for reading the Unique ID associated with a
specific GPU. This ID will persist across reboots, even if the GPU is
moved to a different machine. Note that this is per-GPU, not per-card,
as some cards have multiple GPUs, and each GPU will get a unique
identifier

Change-Id: Idce50c6febc2ceb1a4c1200d2489ec8b9d8fe174
2019-06-21 08:39:36 -04:00
Chris Freehill 11f714326b Add support for junction, edge and memory temperature sensors (#42)
* If vendor/device/subsystem name is not found, use device ID string

* Update documentation for get-name functions

* Add support for junction, edge and memory temperature sensors
2019-05-24 15:24:49 -05:00
Chris Freehill 59538cd004 If vendor/device/subsystem name is not found, use device ID string (#41)
* If vendor/device/subsystem name is not found, use device ID string
2019-05-16 16:15:17 -05:00
Chris Freehill 7b9ff01a37 Check for root access early for functions that require it
2019-05-15 16:54:20 -05:00
Chris Freehill 53489c1f3d Correct return code of isAMDGpu()
Also, correct some comments, whitespace.
2019-05-13 18:02:03 -05:00
Chris Freehill ae7ca83920 By default, only consider AMD GPUs in RSMI device list
With newly added initialization parameters that can be
passed to rsmi_init(), you can tell RSMI to consider other
devices.

Also:
-fixed incorrect header file name that would break in C
builds
-modified rsmi_init() and rsmi_shut_down() to reinitialize and
clear static structures
2019-05-13 18:02:03 -05:00