* Updates:
- [API/CLI] rsmi_dev_*_partition_set &
rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
EBUSY writes + cleaned up accidental map insertions
(maplookup[] can insert values that are not in the map,
map.at(key) fixes this potential issue)
- [API] rsmi_dev_gpu_metrics_info_get() - returns
RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
outside of 1v1/1v2/1v3
- [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
EBUSY write errors; kept backward compatibility
for other writes which do not care about these states
- [API] rsmi_dev_od_volt_info_get()
& rsmi_dev_od_volt_curve_regions_get() have better logging
+ Expose more details on why they are erroring
- [Utils/logs/example] Expose AMD GPU gfx target version to aid in
system troubleshooting
- [Utils] Added test methods that look at od volt
freq & regions into here - for easier access across
several tests
- [Utils] Updated getRSMIStatusString(new argument - fullstatus;
default to true for backwards compatibility)
-> true shows shortened RSMI STATUS response
- [Utils] Added splitString to cut out noisy return responses
(used in getRSMIStatusString(), when fullstatus = true)
- [Utils] Added getFileCreationDate() to expose build date
of the library - helpful for local builds or experimental builds
- [Utils] Macro cleanup
- [Example] Added a few gpu_metric checks - helpful for upcoming
updates
- [Device] SYSFS/DebugFS - now have better r/w displayed in logs
- [LOGS] Expose library build date - see above for details
- [Tests] Add more warnings/errors to test builds
- [Tests] Moved up Partition tests for ordered test runs - helped
identify issues with GPU BUSY writes
- [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
with waits for busy status found & cleaned up how we checked
for partition changes - with RSMI responses exposed more clearly
- [Tests] perf_determinism - multi gpu now properly runs through
with full resets as needed
- [Tests] volt_freq_curv_read - better error handling with more
verbose output
Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 57b6135e54]
* Updates:
- [API] After discovering all amd gpus, we now properly
map correct bdf (xgmi nodes). Especially important for
partition changes - aka secondary nodes.
- [API] While adding new secondary nodes we now have
better grouping -> due to resorting based on
kfd properties list & matching to primary uniqueid
- [API] All secondary nodes are now AddToDeviceList
with correct bdf (location id), provided by kfd
- [API] Modified AddToDeviceList(..., uint64_t bdfid):
providing an optional field - bdfid. This allows working
around primary pcie cards with xgmi nodes
- [API] Utils - cpplint minor fixes
- [Example] Removed all endl references w/ newline, fixed
spacing, and some incorrect values displaying as hex
(needed dec representation)
- [API] kfd node functions - now print full path of file
for trace logs
- [Tests] power_read.cc: Added in generic power test to
confirm guaranteeing specific return values
Change-Id: I143474e8d64c4915a966e789be6bcea4fa7f4472
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 6f1afd2678]
* Updates:
- [API] Added rsmi_dev_power_get(uint32_t dv_ind,
uint64_t *power,
RSMI_POWER_TYPE
*type)
provides generic get to average or
current power & provides backwards
compatibility
- Added a utility function to get MonitorTypes
(monitor_type_string(type)) &
RSMI_POWER_TYPE (power_type_string(type))
strings
- [Tests] Added rsmi_dev_power_get tests and
provided better verification of return values for
all power APIs
- [Tests] Updated power outputs to show correct
units
- [example] Now uses avg, current, and generic
power functions with type output response
Change-Id: I5ca06ca37fd5f61e100f2835b664d6cdd1ca42e6
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 31a1fcce7d]
* Updates:
- rocm_smi_lib + CLI:
Rename all "NPS mode" -> "memory partition"
related files/functions/API/CLI to align with correct
technical naming
- rocm_smi_main: fixed identifying primary card's unique id
utilize rsmi_dev_unique_id_get to map which
KFD nodes belong to it
- rsmi_dev_*_partition*: now have better logging output
- compute partition tests:
Added 20 sec delay for workaround until GPU
busy is confirmed as the issue
- CPPLint fixes/formatting
- [Example] Moved all endl to "\n" for efficiency
- [Example] Added Edge & Junction temperature examples
- [Example] Added rsmi_minmax_bandwidth_get() example - WIP
Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: b251bb0c9f]
Sending RSMI_STATUS_UNEXPECTED_DATA for drivers
which do not set some clock freqs
Change-Id: I43a9515c2757dddd412bb25cfd54095e63367030
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: f191c2753c]
Code changes related to the following:
* Added 'rsmi_dev_revision_get()' related code
* Test code
* Functional tests
Change-Id: I8c2097c65384a028c8c8437b717d05d52fe45250
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/amdsmi commit: 573620f586]
Updates:
* VoltRead - needed to properly send out RSMI_STATUS_NOT_SUPPORTED
when device does not have voltage hwmon files
* ComputePart. - test failure was likely caused due to EvtNotif
causing conflicts (unknown exactly why). Test passes when
moving it ahead of the event notifier. Both API calls may have
a system resource issue, TBD.
* rocm_smi_example - now indicates when an API call
returns RSMI_STATUS_NOT_SUPPORTED or
RSMI_STATUS_NOT_YET_IMPLEMENTED. Allows example to fully complete
on systems which may not provide support for all API calls.
Change-Id: I520b8584e078d412414e8e5797c664220a7e823a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 78a0812f7f]
Updates:
* Added RSMI_STATUS_SETTING_UNAVAILABLE for
rsmi_dev_compute_partition_set - gives users
better error output when attempting to set
compute partition to values not listed in
available_compute_partition SYSFS
* Updated python --setcomputepartition to
provide better output when receiving
RSMI_STATUS_SETTING_UNAVAILABLE
* Updated all test & example files to check for
RSMI_STATUS_SETTING_UNAVAILABLE when doing
rsmi_dev_compute_partition_set
Change-Id: Ida5d54880d9b9b6e4a0468cdb962fdc0c18d6257
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 0d3558945b]
Updates:
* Added rsmi_dev_compute_partition_reset & rsmi_dev_nps_mode_reset
* Added --resetcomputepartition and --resetnpsmode python smi calls
* Added temp data files rocmsmi_boot_compute_partition_<device num>
& rocmsmi_boot_nps_mode_partition_<device num>, writes UNKNOWN
if data cannot be read or device does not support
* Cleaned up NPS & compute API documentation
* Added creation and reading of API temp files (used in reset
functionality)
* Cleaned up output of rocm_smi_example
* Updated rocm_smi_example to check if running with sudo permission
before executing write API calls (cleans up erroneous output)
* Added template specialization for storing temp data, requires
specific rsmi_type_t enums (restrics what data can be stored)
* Added storage of temp data, if temp files do not exist
* Updated google tests for NPS & compute to include reset API calls
Change-Id: I69895a466b97107617e6dbb355737b84499a76c9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 77c950a4bf]
Updates:
* Added rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
* Added ability to set multiple SYSFS files in debug build
* Added ability to see user's env variables set for debug build
* Added tests for rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
* Added ability to restart AMD GPU driver, used in nps_mode_set
* Updated ROCm_SMI_Manual.pdf to include new APIs
* Added progress bar for long running python_smi_tools, used
in setting nps_mode if runs longer than .1 seconds
Change-Id: I6d61bedd28d7cba6aff432ad2d127ba741b7d15a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 9ef376cd61]
Original updates:
* Added .gitignore to help with future commits
* Updated/added copyrights on modified or added files
* Updated rocm_smi.h/.cc
- Added 3 new SMI API functions:
rsmi_dev_compute_partition_set &
rsmi_dev_compute_partition_get
- Added helpful maps/enums used in
new get/set compute_partition API calls
* Updated rocm_smi.py
- Added --showcomputepartition
- Added --setcomputepartition
- Fixed a few mistypes
* Updated rsmiBindings.py - added helpful class/dict/list
* Updated rocm_smi_example.cc
- Added helpful MACRO to detect if api is not supported.
- Added current_compute_partition set/get rocm lib calls
- Added helpful macro to discover future RSMI errors
- Commented out test_set_freq, was having permission issues
on a Navi21
* Updated rocm_smi_main.cc
- Added helpful map to debug API calls, left in for future use
- Added comment to better understand a non-class function returns
* Added computepartition_read_write.cc/.h
- Added get/set compute partition API test calls
- Confirmed on devices that do not support the API calls, tests pass
* Updated rocm_smi_test/main.cc
- Calls new compute partition gtests
Added following updates from review feedback:
* Updated rocm_smi.h/cc
- Removed C++ API calls, adding support for both C/C++
API calls could cause confusion and adds extra work for us
- rsmi_dev_compute_partition_get -> Fixed an edge case where
user gives a small buffer length size (smaller than data
received), but does not receive the partial buffer back.
google Tests are updated to reflect this find.
* Updated rocm_smi_example.cc
- Fixed test_set_freq, issue was that file was not writable.
We now indicate this warning, so prior errors make sense.
- General test code cleanup. Removed extra code,
by creating loops for tests.
* Updated rocm_smi_main.cc
- Moved and got rid of an external reference to a map used
for debugging RSMI enums, now is a const public reference.
* Updated rocm_smi.py
- Updated python code to identify NOT_SUPPORTED due to
(currently) only a few GPU support the feature
Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
[ROCm/amdsmi commit: 4d7f3f2bc7]