rocm-systems

Author	SHA1	Message	Date
zichguan-amd	7946ddb647	rocr: check _SC_LEVEL1_DCACHE_LINESIZE before use Support musl Fixes ROCm/ROCR-Runtime#318 Signed-off-by: zichguan-amd <zichuan.guan@amd.com>	2025-07-14 14:44:31 -04:00
Sunday Clement	31b6474801	rocr: Remove Recursive Include Removed unnecessary header include in file to prevent circular include. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>	2025-06-13 12:29:52 -04:00
Sunday Clement	1635746a9c	rocr: Fix Potential Deadlock Moved the call to pthread_mutex_lock to an else statement for better code readability. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>	2025-06-04 10:18:09 -04:00
Sunday Clement	a97b7df4b9	rocr: Fix Potential Deadlock Because eventDescrp->mutex is a non-recursive lock, attempting to acquire the lock with pthread_mutex_lock can cause the system to hang indefinitely if the lock was already acquired by the preceding call to pthread_mutex_trylock. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>	2025-06-04 10:18:09 -04:00
Alysa Liu	369d89ade3	rocr: Fixed inefficient copy operations Changed variable assignments to use std::move() where appropriate Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>	2025-06-02 11:18:36 -04:00
Sunday Clement	293092f32f	rocr: Fix Resource Leak allocated memory was previously not freed in the event of an error with rwlock initialization. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>	2025-05-30 09:16:26 -04:00
David Yat Sin	11da1293de	rocr: Fix compile warnings	2025-05-28 16:12:02 -04:00
David Yat Sin	da2607024b	rocr: Perform memcpy for small code-object loads On large BAR systems, for small-sized code-objects, we get better performance using direct memcpy due to latencies when doing the blit-copy.	2025-05-22 18:39:19 -04:00
Aaron Liu	166b0fa45a	rocr/dtif: add dtif environment variable Using HSA_ENABLE_DTIF to control dtif/native thunk code path Signed-off-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: David Yat Sin <David.YatSin@amd.com>	2025-05-13 16:44:31 -04:00
Tony Gutierrez	6e3c375bf1	rocr: Flags to alloc queue buf/struct in dev mem This builds on a prior change that allowed for allocating a user-mode queue's packet buffer in device memory to also allocate the queue struct in device memory. This provides additional latency benefits particularly for cases where dispatches are performed from the GPU itself. Flags are added to support the various use cases.	2025-04-23 15:53:29 -04:00
lyndonli	c34a2798ce	rocr: Remove redundant Refresh() call The initial call to Refresh() in the constructor is unnecessary as it's handled in Runtime::Load(). Signed-off-by: lyndonli <Lyndon.Li@amd.com>	2025-03-25 09:13:59 -04:00
David Yat Sin	02b38d0614	rocr: Put back scratch_backing_memory_byte_size The scratch_backing_memory_byte_size is not used by CP, but it is currently used by rocgdb. Putting the field back, but we need to find a solution for alt_scratch_backing_memory_byte_size. Also, completely disabling alternate scratch as we need some changes to support debugger.	2025-03-06 16:23:38 -05:00
David Yat Sin	9a950ab788	rocr: Temporarily disable alternate scratch memory Temporarily disable alternate scratch memory usage by default due to some stability issues.	2025-03-03 09:27:29 -05:00
David Yat Sin	aa2f98e6f9	rocr: Update for new async scratch reclaim Updating ROCr code to match new handshake protocol with CP FW for asynchronous scratch reclaim. Increase previous limits when scratch reclaim feature is available.	2025-02-19 21:02:00 -05:00
Sv. Lockal	5d04bd42f3	Fix build issues for musl libc (#267) Change-Id: Ia31330b0f96669966712b58986abeca754c2cbb9	2025-01-29 14:31:05 +00:00
Yiannis Papadopoulos	26bfa0b8f6	rocr/aie: Add dma-buf import support for AIEAgents via the Driver interface Change-Id: I70f8d8772dda7c06944d75042cb3034ddd89aff4	2025-01-27 15:22:46 -05:00
Shweta Khatri	6361466baa	rocr: Use view3dAs2dArray flag, for thick/3D swizzle modes. Added HSA_IMAGE_ENABLE_3D_SWIZZLE_DEBUG environment flag to enable/disable this. Default value is false (view3dAs2dArray = 1) Enabling this flag will enable support for swizzles that do 3D interleaving. Note that all features of 3D images are supported with 2D swizzles, it's just that the access patterns are different and therefore cache hit-rates may be better or worse, depending on how it's used. Volumetric algorithms do better with 3D and apps that tend to access a single slice at a time do better with 2D. Change-Id: Id8574a6710fe4333a1ee331e5ce9195a81434198	2025-01-27 09:28:33 -05:00
David Yat Sin	7ea25ebb85	rocr: Add thread priority for AsyncEventHandler Set priority to maximum for signal event handler and minimum for exceptions event handler. Change-Id: I1b982d3c2e4c880fafc073fe1a542d01692a6fdc	2025-01-24 10:08:12 -05:00
Eddie Richter	e9cc839b2b	rocr/aie: AIE Queue Processing Change-Id: I681c971ba7229037ca85d5529838aa7bbe5820e2	2024-12-10 10:50:02 -05:00
Apurv Mishra	89115369cc	rocr: declare 'args' as class member in 'os_thread' Removed 'args' as a unique pointer and deletion in 'ThreadTrampoline', then declared as a class member. Change-Id: Ia52058392d0170e8b5e57cfdd2c587f47a6f93f0 Signed-off-by: Apurv Mishra <apurv.mishra@amd.com>	2024-11-27 10:27:40 -05:00
David Yat Sin	f58aff630c	rocr: Fix sem_post overflow errors WaitSemaphore and PostSemaphore are used in the HybridMutex implementation. If HybridMutex did not have to call WaitSemaphore when acquired, then calling PostSemaphore would cause the internal count inside sem_t to slowly grow to large values and eventually cause overflow. Change-Id: I173fc17c874b49926e56991405e9086ea8c138fc	2024-11-13 21:57:26 -05:00
David Yat Sin	4ec730f1dc	rocr: Add HSA_SIGNAL_WAIT_ABORT_TIMEOUT Add support for abort timeout when hsa_signal_wait_relaxed is called and signal does not clear within timeout. timeout is in seconds Change-Id: If1db5a8af33c82ddc4b48968c3d8eceb97d0ea6d	2024-11-13 21:57:02 -05:00
German Andryeyev	0fc7369ba5	rocr: Disable WaitAny() in AsyncEventsLoop() - Add the new path to avoid WaitAny() calls in AsyncEventsLoop() with HSA_WAIT_ANY_DEBUG key. The new path is selected by default. The optimization combines all logic of WaitAny() in a single processing loop and avoids extra memory allocations or ref counting. Also it won't spin on the CPU if all events are busy. Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a	2024-10-25 14:37:02 -04:00
Jonathan Kim	32bb0764b7	rocr: Fix IPC DMA Buf fragment handling and enable for development Discarding blocks for reallocation on IPC export for better memory performance triggers memory violations with DMA BUF exports, so bypass this for now, as application performance drops haven't been observed with the bypass. The raw fragment should be passed to the DMA Buf export call as well since offsets will be implicitly applied in the Thunk/KFD for export/import calls. Also, use the agent information directly from the pointer information so that the export call doesn't have to scan memory to find this. Pass the node ID in the handle so that the import call doesn't have to make two thunk imports to fetch the node ID for GPU memory imports. Finally, allow the user to use DMA Buf IPC via HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will be applied by default. Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e	2024-09-27 14:40:42 -04:00
Saleel Kudchadker	3baaa6e9c0	rocr: Allocate AQL queue on device memory - Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create AQL queue in device memory. - Before writing AQL packet header to the queue use an SFENCE to ensure that there is no reordering of the writes over PCIE Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d	2024-09-05 17:48:02 -04:00
David Yat Sin	c8dd4d2b3b	rocr: Handle pthread_create returning errors Rewriting logic to fix issue where pthread_create would return errors other than EINVAL, and these errors would be ignored. Change-Id: I573958724dcf886c20e8c14e6a9182303b3ffa06	2024-08-22 12:15:10 -04:00
Jonathan Kim	eb30a5bbc7	rocr: Memory copy based on recommended SDMA engines Recommended SDMA engines for DMA copies are now exposed for better GPU-GPU performance. ROCr can now select those DMA engines. Also lock-in host-device copies to SDMA0 and device-host copies to SDMA1 for better stability and performance. Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5	2024-08-20 16:22:32 -04:00
James Xu	a621bca303	Fix compile errors with musl>=1.2.3 Patch submitted on behalf of user AngryLoki: The fix repeats common pattern, used for musl, e.g: https://github.com/void-linux/void-packages/blob/5ccf1c66a1df2d644e1a0db0a68fca321469c57e/srcpkgs/MangoHud/patches/0001-elfhacks-d_un.d_ptr-is-relative-on-non-glibc-systems.patch#L90. Quoting: d_un.d_ptr is relative on non glibc systems elf(5) documents it this way, glibc diverts from this documentation Change-Id: I815f88f127ef00c88ae827a8ad48df0d33c92467	2024-08-19 11:02:29 -04:00
Jonathan Kim	ea646cf958	Disable DMABUF IPC implementation Current DMABUF implementation is unstable. Switch back to legacy support for now. Change-Id: I3be871f38c6524b0bcc9225bab61de4e57771efb	2024-08-12 13:14:14 -04:00
Saleel Kudchadker	26e105d9ab	Initial external logging API New API to accept a file stream for logging Co-authored-by: David Yat Sin <David.YatSin@amd.com> Change-Id: Ie09c35ae14ca86a97eb25f61251be287c55d7169 Signed-off-by: Chris Freehill <cfreehil@amd.com>	2024-08-07 02:59:00 +00:00
David Yat Sin	2f05c2a273	Revert "Use pthread_setaffinity_np" This reverts commit 1df7a44112e45b7fb447926778490f741601219a. Change-Id: Ib386c8f944b6da0ef68ddd2be3f26013cd36ef5b Signed-off-by: Chris Freehill <cfreehil@amd.com>	2024-06-25 12:27:09 -05:00
David Yat Sin	1cee8656df	Revert "Use pthread_attr_setaffinity_np when available" This reverts commit ef95ccf81e59b8608861e8f2f256d981eee19df7. Reason for revert: Causing performance regressions on some systems Change-Id: I82951350cafbd57c495852d6f90023a3373f04f6 Signed-off-by: Chris Freehill <cfreehil@amd.com>	2024-06-25 12:27:09 -05:00
David Yat Sin	57b93e02a4	Use pthread_attr_setaffinity_np when available If pthread_attr_setaffinity_np function exists use it instead of pthread_setaffinity_np as pthread_setaffinity_np seems to fail to set the affinity settings on some systems. Change-Id: Icd8b17039699ac10d9cd5c4dbb6ac44630673949	2024-04-29 15:02:54 +00:00
Shweta.Khatri	bc9cac97fe	Fixing compilation errors related to MUSL libc Fix Musl libc NULL errors and unsupported pthread funcs for compatibility. Also ensures cleanup and error handling irrespective of CPU affinity override. Fix submitted by github dev - AngryLoki https://github.com/ROCm/ROCR-Runtime/issues/181 Change-Id: Ia487315e504112be5d3370756f23f6e23b9ae4be	2024-04-17 07:14:15 -04:00
David Yat Sin	8d666dea01	PC Sampling: Allocate resources to retrieve data from trap handler Allocate required device and host buffers to be able to interact with the 2nd level trap handler. Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa	2024-04-11 12:53:00 -04:00
David Yat Sin	0bc244e10a	PC Sampling: Create PC Sampling interfaces Create new interface group for PC Sampling Change-Id: I59b4cfe9f8d1ae313dc28be1d2ed49f750d8212b	2024-04-11 12:52:23 -04:00
Shweta.Khatri	00b63f7452	Replace lazy_ptr's Init() with reset() method The function Init() called by one of the constructors of lazy_ptr is undefined. Replacing with reset method sets the object to an uninitialized state and assigns a new constructor function Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184 Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3	2024-03-26 15:07:34 -04:00
Jonathan R. Madsen	7ce263b0e4	Update rocprofiler-register support - add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found - add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found - remove report_tool_load_failures_explicit_ - add HSA_TOOLS_DISABLE_REGISTER flag - add HSA_TOOLS_REPORT_REGISTER_FAILURE - use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE - changed rocprofiler-register message to not include the word "error" Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6	2024-02-22 11:55:25 -05:00
Jonathan Kim	62f3f250ce	Optimize and fix SDMA gang copies Optimizations include: - Greedy gang by placing gang leaders on first D2D sdma blit context to avoid deadlocking with other gang leaders and items. Note that this is fine since we can't avoid an oversubscription problem when there is only 1 xGMI link anyways, so treat all xGMI links as a single pipe for ganging. - Non-leader gang items don't have to poll on dependency signals so this opens up more non-blocking SDMA channels. - unlock gang lock when gangs are not needed. - Change gang factor lookup from vector pair to map and register all gpus in gang factor lookup regardless of link type so that we can take advantage of the O(logN) direct key/value lookup time. Fixes include: - HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit. As a result, small copies ended up ganging and hitting latency limit. Use hardcoded 4096 bytes instead. - Cap auxiliary gang factor to the number of non-XGMI SDMA engines. Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb	2024-01-25 10:42:27 -05:00
Jonathan R. Madsen	8f0ea44c09	Suppress reporting no tools were found with rocprofiler-register Change-Id: If853517d40e073202d12e2a6b16fb54be5529650	2024-01-17 01:01:19 -05:00
Jonathan Kim	e20f41df62	Enable IPC DMA buf Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation by default). Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2	2024-01-16 22:43:27 -05:00
Jonathan Kim	5dfebdbca9	Change IPC implementation to use DMA Bufs As the KFD IPC IOCTLs will not be upstreamed, change runtime implementation to use DMA bufs. DMA buf fds will be passed over abstract unix domain sockets. The exporter spins a thread that creates a socket server. The importer connects to the server to fetch the fd. libDRM will be required to do a manual import and GPU map for memory that is not already imported and mapped. For now, use the legacy IPC implementation by default as a follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY environment variable. Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4	2024-01-16 22:43:00 -05:00
David Yat Sin	8d3fee5095	Use HybridMutex for signal mutexes Implement HybridMutex to improve latencies compared to KernelMutex when there is contention between several threads calling hsa_signal_create and hsa_amd_signal_async_handler. Change-Id: If53377033e749b0050727964c9303f09b02527cc	2024-01-16 21:29:39 +00:00
David Yat Sin	6333fdecf3	Use pthread_setaffinity_np On some systems, pthread_attr_setaffinity_np does not exist, so we need to use pthread_setaffinity_np on thread after pthread_create Provided by Julian Samaroo on github https://github.com/RadeonOpenCompute/ROCR-Runtime/pull/143 Change-Id: I4649f94333f2d7b0a5993b370a4bfc48d92acecb	2023-12-18 17:41:49 -05:00
David Yat Sin	f07b8f2250	Use CPU_SET_S instead of CPU_SET Fix incorrect use of CPU_SET on variable size cpu_set_t Suggested by Christopher E. Moore on github https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/130 Change-Id: I710b56683ba07c08dcd83c851bf72e4f127a0ad4	2023-12-04 15:05:22 +00:00
David Yat Sin	a7a3358067	Implement alternate scratch The alternate scratch memory is used for dispatches that have a low number of waves but relatively large wave size. This allows us to keep the tmpring_size.bits.WAVES field of the main scratch to full occupancy. Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3	2023-12-04 15:05:22 +00:00
David Yat Sin	dca8f3a21d	Implement async scratch reclaim For devices where the CP FW supports asynchronous scratch reclaim, ROCr is able to claw-back scratch memory that was assigned to an AQL queue. With that ability, ROCr does not have to rely on using USO (use-scratch-once) when assigning large amounts of memory to a queue. If we reach a situation where we are running low on device memory, ROCr will attempt to claw-back the scratch memory. Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7	2023-12-04 15:05:22 +00:00
Jonathan Kim	81c64228e0	Increase SDMA copy size SDMA4.4 and SDMA5.2+ have increased their available copy size to 2^30 bytes, represented by exponent as bits set in the COUNT field of the linear copy. Also note that the full 2^22 byte limit is available from SDMA4 onwards as it has corrected the 0x3fffe0 HW limitation from SDMA3. As the copy limit has increased, this can change system performance, so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall back to the original 0x3fffe0 limit for debugging purposes. Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee	2023-12-04 15:03:31 +00:00
Jonathan Kim	7df0167821	Enable D2D SDMA Ganging over xGMI Use all available SDMA engines capped by xGMI bandwidth for all D2D copies within a hive. By default, set the latency boundary copy size as 4KB and below. Any copy size within this boundary will not gang. Avoid oversubscribing engines by not ganging on engines with pending non-ganged work. An environment variable HSA_ENABLE_SDMA_GANG has been provided to override default ganging behaviour. Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d	2023-07-27 08:58:26 -04:00
Jeremy Newton	132a19e9c3	Fix non-x86 builds I've just reverted some code to what it was in 5.5 by wrapping new x86 specific bits with #if's, e.g.: - CPUID is x86 specific - mwait is x86 specific Change-Id: I6cefae34282c777c7340daf3f934d2a11742502e Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>	2023-06-30 01:04:04 -04:00
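
Commits a97b7df4b9 and 1635746a9c in the list describe a hang caused by calling pthread_mutex_lock on a non-recursive mutex that the same thread had already taken with pthread_mutex_trylock. The following is only a minimal sketch of that pattern, not the actual ROCr code: the struct event_descr type and the lock_once and signal_event functions are hypothetical names introduced for illustration. The blocking lock is attempted only in the branch where trylock reported EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the event descriptor named in the commit. */
struct event_descr {
    pthread_mutex_t mutex; /* default, non-recursive mutex */
    int signaled;
};

/*
 * Take the mutex exactly once. If trylock succeeds, this thread already
 * owns the lock and must not call pthread_mutex_lock again; only the
 * EBUSY branch falls back to the blocking call.
 */
static int lock_once(struct event_descr *ev)
{
    int rc = pthread_mutex_trylock(&ev->mutex);
    if (rc == 0) {
        return 0;                              /* acquired without blocking */
    } else if (rc == EBUSY) {
        return pthread_mutex_lock(&ev->mutex); /* held elsewhere: wait */
    }
    return rc;                                 /* any other error is reported */
}

/* Mark the event as signaled under the lock; returns 0 on success. */
int signal_event(struct event_descr *ev)
{
    assert(ev != NULL);
    int rc = lock_once(ev);
    if (rc != 0) {
        return rc;
    }
    ev->signaled = 1;
    return pthread_mutex_unlock(&ev->mutex);
}
```

Keeping the blocking call in the else branch makes it visible at a glance that each path acquires the lock once, which appears to be the readability point the follow-up commit makes.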

148 Commits
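
Commit c8dd4d2b3b in the list notes that errors from pthread_create other than EINVAL were being ignored. As a hedged illustration only (the start_thread and double_value functions are hypothetical and do not come from the ROCr sources), a wrapper can treat every non-zero return value as a failure instead of testing for one specific error code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical worker used for demonstration: doubles the int it is given. */
void *double_value(void *arg)
{
    int *value = arg;
    *value *= 2;
    return NULL;
}

/*
 * Start a thread and report any non-zero return from pthread_create
 * (EAGAIN, EPERM, EINVAL, ...) as a failure. pthread_create returns the
 * error number directly and does not set errno.
 */
int start_thread(pthread_t *thread, void *(*fn)(void *), void *arg)
{
    assert(thread != NULL && fn != NULL);
    int rc = pthread_create(thread, NULL, fn, arg);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
    }
    return rc;
}
```

A caller would check the return value of start_thread before joining or otherwise using the thread handle.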