İşleme Grafiği

128 İşleme

Yazar SHA1 Mesaj Tarih
David Yat Sin f58aff630c rocr: Fix sem_post overflow errors
WaitSemaphore and PostSemaphore are used in the HybridMutex
implementation. If HybridMutex did not have to call WaitSemaphore when
acquired, then calling PostSemaphore would cause the internal count
inside sem_t to slowly grow to large values and eventually cause
overflow.

Change-Id: I173fc17c874b49926e56991405e9086ea8c138fc
2024-11-13 21:57:26 -05:00
David Yat Sin 4ec730f1dc rocr: Add HSA_SIGNAL_WAIT_ABORT_TIMEOUT
Add support for abort timeout when hsa_signal_wait_relaxed is called and
signal does not clear within timeout.
timeout is in seconds

Change-Id: If1db5a8af33c82ddc4b48968c3d8eceb97d0ea6d
2024-11-13 21:57:02 -05:00
German Andryeyev 0fc7369ba5 rocr: Disable WaitAny() in AsyncEventsLoop()
- Add the new path to avoid WaitAny() calls  in AsyncEventsLoopp() with
HSA_WAIT_ANY_DEBUG key. The new path is selected by default.
The optimizaiton combines all logic of WaitAny() in a single processing loop
and avoids extra memory allocations or ref counting.  Also it won't spin
on the CPU if all events are busy.

Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a
2024-10-25 14:37:02 -04:00
Jonathan Kim 32bb0764b7 rocr: Fix IPC DMA Buf fragment handling and enable for development
Discarding blocks for reallocation on IPC export for better memory
performance trigger memory violations with DMA BUF exports so bypass
this for now as application performance drops haven't been observed
with the bypass.

The raw fragment should be passed to the DMA Buf export call as well
since offsets will be implicitly applied in the Thunk/KFD for
export/import calls.

Also, use the agent information directly from the pointer
information so that the export call doesn't have to scan memory to find
this.  Pass the node ID in the handle so that the import call doesn't
have to make two thunk imports to fetch the node ID for GPU memory
imports.

Finally, allow the user to use DMA Buf IPC via
HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will
be applied by default.

Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e
2024-09-27 14:40:42 -04:00
Saleel Kudchadker 3baaa6e9c0 rocr: Allocate AQL queue on device memory
- Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create AQL queue in device
memory.
- Before writing AQL packet header to the queue use an SFENCE to ensure
that there is no reodering of the writes over PCIE

Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d
2024-09-05 17:48:02 -04:00
David Yat Sin c8dd4d2b3b rocr: Handle pthread_create returning errors
Rewriting logic to fix issue where pthread_create would return errors
other than EINVAL, and these errors would be ignored.

Change-Id: I573958724dcf886c20e8c14e6a9182303b3ffa06
2024-08-22 12:15:10 -04:00
Jonathan Kim eb30a5bbc7 rocr: Memory copy based on recommended SDMA engines
Recommended SDMA engines for DMA copies are now exposed for better
GPU-GPU performance. ROCr can now select those DMA engines.

Also lock-in host-device copies to SDMA0 and device-host copies to
SDMA1 for better stability and performance.

Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5
2024-08-20 16:22:32 -04:00
James Xu a621bca303 Fix compile errors with musl>=1.2.3
Patch submitted on behalf of user AngryLoki:

The fix repeats common pattern, used for musl, 
e.g: https://github.com/void-linux/void-packages/blob/5ccf1c66a1df2d644e1a0db0a68fca321469c57e/srcpkgs/MangoHud/patches/0001-elfhacks-d_un.d_ptr-is-relative-on-non-glibc-systems.patch#L90.

Quoting:
d_un.d_ptr is relative on non glibc systems

elf(5) documents it this way, glibc diverts from this documentation

Change-Id: I815f88f127ef00c88ae827a8ad48df0d33c92467
2024-08-19 11:02:29 -04:00
Jonathan Kim ea646cf958 Disable DMABUF IPC iplementation
Current DMABUF implemenation is unstable.  Switch back to legacy
support for now.

Change-Id: I3be871f38c6524b0bcc9225bab61de4e57771efb
2024-08-12 13:14:14 -04:00
Saleel Kudchadker 26e105d9ab Initial external logging API
New API to accept a file stream for logging

Co-authored-by: David Yat Sin <David.YatSin@amd.com>

Change-Id: Ie09c35ae14ca86a97eb25f61251be287c55d7169
Signed-off-by: Chris Freehill <cfreehil@amd.com>
2024-08-07 02:59:00 +00:00
David Yat Sin 2f05c2a273 Revert "Use pthread_setaffinity_np"
This reverts commit 1df7a44112e45b7fb447926778490f741601219a.

Change-Id: Ib386c8f944b6da0ef68ddd2be3f26013cd36ef5b
Signed-off-by: Chris Freehill <cfreehil@amd.com>
2024-06-25 12:27:09 -05:00
David Yat Sin 1cee8656df Revert "Use pthread_attr_setaffinity_np when available"
This reverts commit ef95ccf81e59b8608861e8f2f256d981eee19df7.

Reason for revert: Causing performance regressions on some systems

Change-Id: I82951350cafbd57c495852d6f90023a3373f04f6
Signed-off-by: Chris Freehill <cfreehil@amd.com>
2024-06-25 12:27:09 -05:00
David Yat Sin 57b93e02a4 Use pthread_attr_setaffinity_np when available
If pthread_attr_setaffinity_np function exists use it instead of
pthread_setaffinity_np as pthread_setaffinity_np seems to fail to set
the affinity settings on some systems.

Change-Id: Icd8b17039699ac10d9cd5c4dbb6ac44630673949
2024-04-29 15:02:54 +00:00
Shweta.Khatri bc9cac97fe Fixing compilation errors related to MUSL libc
Fix Musl libc NULL errors and unsupported pthread funcs for compatibility.
Also ensures cleanup and error handling irrespective of CPU affinity override.

Fix submitted by github dev - AngryLoki
https://github.com/ROCm/ROCR-Runtime/issues/181

Change-Id: Ia487315e504112be5d3370756f23f6e23b9ae4be
2024-04-17 07:14:15 -04:00
David Yat Sin 8d666dea01 PC Sampling: Allocate resources to retrieve data from trap handler
Allocate required device and host buffers to be able to interact with
the 2nd level trap handler.

Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa
2024-04-11 12:53:00 -04:00
David Yat Sin 0bc244e10a PC Sampling: Create PC Sampling interfaces
Create new interface group for PC Sampling

Change-Id: I59b4cfe9f8d1ae313dc28be1d2ed49f750d8212b
2024-04-11 12:52:23 -04:00
Shweta.Khatri 00b63f7452 Replace lazy_ptr's Init() with reset() method
The function Init() called by one of the constructors of lazy_ptr is undefined.
Replacing with reset method sets the object to an uninitialized state and assigns a new constructor function

Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184

Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3
2024-03-26 15:07:34 -04:00
Jonathan R. Madsen 7ce263b0e4 Update rocprofiler-register support
- add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found
- add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found
- remove report_tool_load_failures_explicit_
- add HSA_TOOLS_DISABLE_REGISTER flag
- add HSA_TOOLS_REPORT_REGISTER_FAILURE
- use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE
- changed rocprofiler-register message to not include the word "error"

Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6
2024-02-22 11:55:25 -05:00
Jonathan Kim 62f3f250ce Optimize and fix SDMA gang copies
Optimizations include:
- Greedy gang by placing gang leaders on first D2D sdma blit context
to avoid dead locking with other gang leaders and items.  Note that
this is fine since we can't avoid an oversubscription problem when
there is only 1 xGMI link anyways, so treat all xGMI links as a single
pipe for ganging.
- Non-leader gang items don't have to poll on dependency signals so this
opens up more non-blocking SDMA channels.
- unlock gang lock when gangs are not needed.
- Change gang factor lookup from vector pair to map and register all
gpus in gang factor lookup regardless of link type so that we can take
advantage of the O(logN) direct key/value lookup time.

Fixes include:
- HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit.
As a result, small copies ended up ganging and hitting latency limit.
Use hardcoded 4096 bytes instead.
- Cap auxillary gang factor to the number of non-XGMI SDMA engines.

Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
2024-01-25 10:42:27 -05:00
Jonathan R. Madsen 8f0ea44c09 Suppress reporting no tools were found with rocprofiler-register
Change-Id: If853517d40e073202d12e2a6b16fb54be5529650
2024-01-17 01:01:19 -05:00
Jonathan Kim e20f41df62 Enable IPC DMA buf
Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation
by default).

Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2
2024-01-16 22:43:27 -05:00
Jonathan Kim 5dfebdbca9 Change IPC implementation to use DMA Bufs
As the KFD IPC IOCTLs will not be upstreamed, change runtime
implementation to use DMA bufs.

DMA buf fds will be passed over abstract unix domain sockets.
The exporter spins a thread that creates a socket server.
The importer connects to the server to fetch the fd.

libDRM will be required to do a manual import and GPU map for
memory that is not already imported and mapped.

For now, use the legacy IPC implementation by default as a
follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY
environment variable.

Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4
2024-01-16 22:43:00 -05:00
David Yat Sin 8d3fee5095 Use HybridMutex for signal mutexes
Implement HybridMutex to improve latencies compared to KernelMutex when
there is contention between several threads calling hsa_signal_create
and hsa_amd_signal_async_handler.

Change-Id: If53377033e749b0050727964c9303f09b02527cc
2024-01-16 21:29:39 +00:00
David Yat Sin 6333fdecf3 Use pthread_setaffinity_np
On some systems, pthread_addr_setaffinity_np does not exist, so we need
to use pthread_setaffinity_np on thread after pthread_create

Provided by Julian Samaroo on github

https: //github.com/RadeonOpenCompute/ROCR-Runtime/pull/143
Change-Id: I4649f94333f2d7b0a5993b370a4bfc48d92acecb
2023-12-18 17:41:49 -05:00
David Yat Sin f07b8f2250 Use CPU_SET_S instead of CPU_SET
Fix incorrect use of CPU_SET on variable size cpu_set_t

Suggested by Christopher E. Moore on github
https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/130

Change-Id: I710b56683ba07c08dcd83c851bf72e4f127a0ad4
2023-12-04 15:05:22 +00:00
David Yat Sin a7a3358067 Implement alternate scratch
The alternate scratch memory is used for dispatches that have a low
number of waves but relatively large wave size.
This allows us to keep the tmpring_size.bits.WAVES field of the main
scratch to full occupancy.

Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3
2023-12-04 15:05:22 +00:00
David Yat Sin dca8f3a21d Implement async scratch reclaim
For devices where the CP FW supports asynchronous scratch reclaim, ROCr
is able to claw-back scratch memory that was assigned to an AQL queue.
With that ability, ROCr does not have to rely on using USO
(use-scratch-once) when assigning large amounts of memory to a queue.
If we reach a situation where we are running low on device memory, ROCr
will attempt to claw-back the scratch memory.

Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7
2023-12-04 15:05:22 +00:00
Jonathan Kim 81c64228e0 Increase SDMA copy size
SDMA4.4 and SDMA5.2+ has increased it's available copy size to 2^30 bytes
represented by exponent as bits set in the COUNT field of the
linear copy.

Also note that the full 2^22 byte limit is available from SDMA4 onwards
as it has corrected the 0x3fffe0 HW limitation from SDMA3.

As copy limit has increase, this can change system performance
so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall
back to the original 0x3fffe0 limit for debugging purposes.

Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee
2023-12-04 15:03:31 +00:00
Jonathan Kim 7df0167821 Enable D2D SDMA Ganging over xGMI
Use all available SDMA engines capped by xGMI bandwith for
all D2D copies within a hive.

By default, set the latency boundary copy size as 4KB and below.
Any copy size in within this boundary will not gang.

Avoid oversubscribing engines by not ganging on engines with
pending non-ganged work.

An enviroment variable HSA_ENABLE_SDMA_GANG has been provided
to override default ganging behaviour.

Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d
2023-07-27 08:58:26 -04:00
Jeremy Newton 132a19e9c3 Fix non-x86 builds
I've just reverted some code what it was in 5.5 by wrapping new x86
specific bits with #if's, e.g.:
- CPUID is x86 specific
- mwait is x86 specific

Change-Id: I6cefae34282c777c7340daf3f934d2a11742502e
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
2023-06-30 01:04:04 -04:00
David Yat Sin a397373cea Add HSA_ENABLE_PEER_SDMA env variable
Add support for HSA_ENABLE_PEER_SDMA env variable that can be used to
disable use of SDMA engines for device-to-device transfers. Note that
setting HSA_ENABLE_SDMA=0 will disable all SDMA transfers and override
HSA_ENABLE_PEER_SDMA values.

Change-Id: I737b3c2b2efcf3ff237f98bc748f49b8252ed24a
2023-05-18 00:10:20 +00:00
David Yat Sin a180c9ee78 Add env var to override SRAM ECC
Add HSA_ENABLE_SRAMECC environment variable that can be used to
override SRAM ECC mode reported by KFD

Change-Id: I2b95511820a2d3d146a76b03070659c0695b61fd
2023-04-27 16:16:05 -04:00
Lancelot SIX 183f5d90aa linux os_thread: improve error handling
On Linux, the os_thread abstraction is built on top of pthread.  Many of
the pthread calls might fail and return error codes.  The error
conditions are only checked via assertions (if ever checked) which means
that when doing a release build, no error condition is checked.  The
same goes for dlsym/dlinfo and clock_gettime.

This commit improves the situation this by checking the error conditions
and acting accordingly.  When the error condition is detected in a
function with a mean to indicate some error to its caller, then this
patch prints some error message and returns.  If there is no way to
propagate the error up the call stack, print some error message and
abort the process.

For the os_info::os_info ctor, the only user is CreateThread, which
checks that the built thread is Valid().  If not, nullptr is returned to
the caller.

It could be possible to use exceptions when functions cannot pass
errors, but for now I only use abort as it is what abort would do with
debug build.

Change-Id: I815703c3b95777cc29bb89a7d654ac879c14a759
2023-04-17 09:48:11 -04:00
David Yat Sin 8ebf5f9c48 Adding scratch memory reservation
Some applications will keep trying to allocate device memory until the
allocation fails. This causes all device memory to be used up and we are
then unable to allocate scratch memory for dispatches. Reserve enough
memory for 1 small scratch allocation.

Change-Id: I968400d41540ba1aca8f28581f229693eec02225
2023-04-06 15:13:36 +00:00
Shweta Khatri 83a307c449 By default, disable mwaitx feature.
This can be enabled by setting HSA_ENABLE_MWAITX=1

Change-Id: I4be00892780beeb8b14c3c5f34aa10b158921bff
2023-03-15 19:57:25 -04:00
David Yat Sin cc48dfdbff Use mwaitx when busy-waiting signals
Use mwaitx instructions when busy waiting for signals to reduce CPU
energy usage.
This can be disabled by setting HSA_ENABLE_MWAITX=0

Change-Id: Ic207895a491b2bf6dacba47ef0921df3faad5b5a
2023-02-22 16:55:43 +00:00
David Yat Sin 0ed1568afc Add function for parse CPUID information
Used to detect whether mwaitx instruction is supported

Change-Id: I66fe906325aa523c8815133cf782df3a17a7edab
2023-02-22 16:55:42 +00:00
Shweta Khatri 8aac885318 Fixes hang due to change in order of initialization of libraries
Fixes hang due to change in order of initialization of libraries
that have cyclical dependencies and they call hsa_init() during their
initialization phase.
This implementation looks for a symbol called "HSA_AMD_TOOL_PRIORITY"
across all loaded shared libraries using dynamic section entries of the
loaded lib instead of using dlopen and dlsym for the same purpose.

Change-Id: I4865f2fd18dd186ec311a432ec38fbb5583805d2
2023-01-26 01:17:22 -05:00
David Yat Sin a4f898ad15 Add env variable to print image SRD contents
Add environment variable HSA_IMAGE_PRINT_SRD to print contents of SRD
registers for image functions

Change-Id: Ifb47a73dcfad8745ee7445e20de96e1021b80bd6
2023-01-13 11:01:04 -05:00
Shweta Khatri 8751e65b79 Fixed callback method for dl_iterate_phdr api which is called for each loaded shared object
Simplified the callback method. Also fixed the way, loaded shared object were getting appended into a string vector,
which was not being passed to this callback method.

Change-Id: I68661dd73f61a11c42fa92f670e8e7b6ffcb5711
2022-11-21 19:00:34 -05:00
David Yat Sin dd255d31b8 Fix uninitialized variable warning
Fix warning when using valgrind

Change-Id: Ie59eaa990b9b5d339a178a2c6f9f4fac0e34e925
2022-09-08 09:10:00 -04:00
David Yat Sin df3fe8c2fb Add env variable to disable CPU affinity override
New environment variable HSA_OVERRIDE_CPU_AFFINITY_DEBUG to
enable/disable overriding CPU affinity.

Default value is enabled(1).

This is a temporary variable and may be removed in the future.

Change-Id: Id6a7c611730471ddc276ca333fde1e57046bf32a
2022-08-19 11:07:49 -04:00
Sean Keely 965df6eef7 Basic SVM profiler.
Mostly a demo at this point.  Logs SVM (aka HMM) info to
HSA_SVM_PROFILE if set.

Example: HSA_SVM_PROFILE=log.txt SomeApp

Change-Id: Ib6fd688f661a21b2c695f586b833be93662a15f4
2022-06-23 19:30:06 -05:00
skhatri e7fc301aa7 Adding support for rocrtracer tools loading without environment variable
During hsa initializing stage, ROCr now searches all the loaded libraries
for a  symbol "HSA_AMD_TOOL_PRIORITY" and adds all those libraries to
the tools library init list.  Tools libraries listed in HSA_TOOLS_LIB
env variable are also loaded in the given order and take priority
over HSA_AMD_TOOL_PRIORITY.

Change-Id: I739af42bbd777c44a9152c11e17dd69979b65e82
2022-06-23 20:08:30 -04:00
David Yat Sin 4ac840269c Add API for available GPU memory
Add support for AMD Agent to return amount of memory available

Change-Id: I5c32e2cebbaa2993b044250aefe434e4cc02d8c2
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-06-07 10:33:18 -04:00
Sean Keely 3ebe99f96d Add experimental option to force discovery of all copy agents.
Discards all user provided async copy agent info and relies on
pointer info discovery.

Change-Id: Ife3e708a49ffccbede4983ab47d5ed0032970857
2022-05-14 18:08:57 -05:00
Sean Keely 0ee82742a7 Switch to CLOCK_BOOTTIME for HSA system clock.
This is consistent with KFD and has significantly better latency.
KFD is taking this as the definition of the SystemClockCounter.

Change-Id: I4c1b3bc58c738206265c55ebefd41356c013bfe5
2022-05-05 15:27:29 -04:00
Jeremy Newton 178a7a5cfa Drop some unnecessary definitions
__x86_64__ and __AMD64__ should be already defined by the compiler to
specify the compilation target and shouldn't be defined manually.

I fixed two x86_64 checks to include VS variables, as removing this
might cause it to fail to compile on that compiler.

Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I600ff449af85bf7d83ecab167d97933922e2d917
2022-04-19 12:22:42 -04:00
Sean Keely 4e9849034d Correct inf loop defect in fast clock init.
Each time delay is grown we need to reset elapsed.  We want to take
the most accurate sample from the set at fixed delay.

Without this we will hang if there is ever an insufficiently accurate,
high unit clock read.

Change-Id: Ic65f364067789ac85a6572d67af2d77528e265bb
2022-04-01 16:15:37 -04:00
Sean Keely 552dcead93 Correct scratch allocation logic to account for asymmetric harvest.
With asym. harvest hw does not issue groups equally to each SE,
occasionally hw will skip an SE so that the distribution reflects
each SE's CU count.  Scratch resources must be allocated to reflect
this asymmetric distribution of groups.

Change-Id: I65e26206500483ea18e6e8796e65ecba5354b029
2022-03-02 19:59:30 -06:00