rocm-systems

Author	SHA1	Message	Date
Sunday Clement	90e35e8486	rocr: Remove Recursive Include Removed unnecessary header include in file to prevent circular include. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> [ROCm/ROCR-Runtime commit: `31b6474801`]	2025-06-13 12:29:52 -04:00
Sunday Clement	1da312af87	rocr: Fix Potential Deadlock Moved the call to pthread_mutex_lock to an else statement for better code readability. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> [ROCm/ROCR-Runtime commit: `1635746a9c`]	2025-06-04 10:18:09 -04:00
Sunday Clement	25886ecda8	rocr: Fix Potential Deadlock Because eventDescrp->mutex is a non-recursive lock, attempting to acquire the lock with pthread_mutex_lock can cause the system to hang indefinitely if the lock was already acquired by the preceding call to pthread_mutex_trylock. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> [ROCm/ROCR-Runtime commit: `a97b7df4b9`]	2025-06-04 10:18:09 -04:00
Alysa Liu	88dd451c64	rocr: Fixed inefficient copy operations Changed variable assignments to use std::move() where appropriate Signed-off-by: Alysa Liu <Alysa.Liu@amd.com> [ROCm/ROCR-Runtime commit: `369d89ade3`]	2025-06-02 11:18:36 -04:00
Sunday Clement	3d3cca8083	rocr: Fix Resource Leak allocated memory was previously not freed in the event of an error with rwlock initialization. Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> [ROCm/ROCR-Runtime commit: `293092f32f`]	2025-05-30 09:16:26 -04:00
David Yat Sin	1b1d4e017a	rocr:Fix compile warnings [ROCm/ROCR-Runtime commit: `11da1293de`]	2025-05-28 16:12:02 -04:00
David Yat Sin	342e478e7d	rocr: Perform memcpy for small code-object loads On large BAR systems, for small-sized code-objects, we get better performance using direct memcpy because of the latencies incurred when doing the blit-copy. [ROCm/ROCR-Runtime commit: `da2607024b`]	2025-05-22 18:39:19 -04:00
Aaron Liu	137b168b46	rocr/dtif: add dtif environment variable Using HSA_ENABLE_DTIF to control dtif/native thunk code path Signed-off-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: David Yat Sin <David.YatSin@amd.com> [ROCm/ROCR-Runtime commit: `166b0fa45a`]	2025-05-13 16:44:31 -04:00
Tony Gutierrez	6f37386eb2	rocr: Flags to alloc queue buf/struct in dev mem This builds on a prior change that allowed for allocating a user-mode queue's packet buffer in device memory to also allocate the queue struct in device memory. This provides additional latency benefits particularly for cases where dispatches are performed from the GPU itself. Flags are added to support the various use cases. [ROCm/ROCR-Runtime commit: `6e3c375bf1`]	2025-04-23 15:53:29 -04:00
lyndonli	e9c934c116	rocr: Remove redundant Refresh() call The initial call to Refresh() in the constructor is unnecessary as it's handled in Runtime::Load(). Signed-off-by: lyndonli <Lyndon.Li@amd.com> [ROCm/ROCR-Runtime commit: `c34a2798ce`]	2025-03-25 09:13:59 -04:00
David Yat Sin	e130172218	rocr: Put back scratch_backing_memory_byte_size The scratch_backing_memory_byte_size is not used by CP, but it is currently used by rocgdb. Putting the field back, but we need to find a solution for alt_scratch_backing_memory_byte_size. Also, completely disabling alternate scratch as we need some changes to support debugger. [ROCm/ROCR-Runtime commit: `02b38d0614`]	2025-03-06 16:23:38 -05:00
David Yat Sin	d93d05bcf1	rocr: Temporarily disable alternate scratch memory Temporarily disable alternate scratch memory usage by default due to some stability issues. [ROCm/ROCR-Runtime commit: `9a950ab788`]	2025-03-03 09:27:29 -05:00
David Yat Sin	5905b82579	rocr: Update for new async scratch reclaim Updating ROCr code to match new handshake protocol with CP FW for asynchronous scratch reclaim. Increase previous limits when scratch reclaim feature is available. [ROCm/ROCR-Runtime commit: `aa2f98e6f9`]	2025-02-19 21:02:00 -05:00
Sv. Lockal	d1507361ec	Fix build issues for musl libc (#267) Change-Id: Ia31330b0f96669966712b58986abeca754c2cbb9 [ROCm/ROCR-Runtime commit: `5d04bd42f3`]	2025-01-29 14:31:05 +00:00
Yiannis Papadopoulos	428cc5b47c	rocr/aie: Add dma-buf import support for AIEAgents via the Driver interface Change-Id: I70f8d8772dda7c06944d75042cb3034ddd89aff4 [ROCm/ROCR-Runtime commit: `26bfa0b8f6`]	2025-01-27 15:22:46 -05:00
Shweta Khatri	4325142db1	rocr: Use view3dAs2dArray flag, for thick/3D swizzle modes. Added HSA_IMAGE_ENABLE_3D_SWIZZLE_DEBUG environment flag to enable/disable this. Default value is false (view3dAs2dArray = 1). Enabling this flag will enable support for swizzles that do 3D interleaving. Note that all features of 3D images are supported with 2D swizzles; it's just that the access patterns are different and therefore cache hit-rates may be better or worse, depending on how it's used. Volumetric algorithms do better with 3D and apps that tend to access a single slice at a time do better with 2D. Change-Id: Id8574a6710fe4333a1ee331e5ce9195a81434198 [ROCm/ROCR-Runtime commit: `6361466baa`]	2025-01-27 09:28:33 -05:00
David Yat Sin	922b61ddee	rocr: Add thread priority for AsyncEventHandler Set priority to maximum for signal event handler and minimum for exceptions event handler. Change-Id: I1b982d3c2e4c880fafc073fe1a542d01692a6fdc [ROCm/ROCR-Runtime commit: `7ea25ebb85`]	2025-01-24 10:08:12 -05:00
Eddie Richter	8ea388af92	rocr/aie: AIE Queue Processing Change-Id: I681c971ba7229037ca85d5529838aa7bbe5820e2 [ROCm/ROCR-Runtime commit: `e9cc839b2b`]	2024-12-10 10:50:02 -05:00
Apurv Mishra	baf737a3cb	rocr: declare 'args' as class member in 'os_thread' Removed 'args' as a unique pointer and deletion in 'ThreadTrampoline', then declared as a class member. Change-Id: Ia52058392d0170e8b5e57cfdd2c587f47a6f93f0 Signed-off-by: Apurv Mishra <apurv.mishra@amd.com> [ROCm/ROCR-Runtime commit: `89115369cc`]	2024-11-27 10:27:40 -05:00
David Yat Sin	ed5bbc1eeb	rocr: Fix sem_post overflow errors WaitSemaphore and PostSemaphore are used in the HybridMutex implementation. If HybridMutex did not have to call WaitSemaphore when acquired, then calling PostSemaphore would cause the internal count inside sem_t to slowly grow to large values and eventually cause overflow. Change-Id: I173fc17c874b49926e56991405e9086ea8c138fc [ROCm/ROCR-Runtime commit: `f58aff630c`]	2024-11-13 21:57:26 -05:00
David Yat Sin	3e694d739a	rocr: Add HSA_SIGNAL_WAIT_ABORT_TIMEOUT Add support for abort timeout when hsa_signal_wait_relaxed is called and signal does not clear within timeout. timeout is in seconds Change-Id: If1db5a8af33c82ddc4b48968c3d8eceb97d0ea6d [ROCm/ROCR-Runtime commit: `4ec730f1dc`]	2024-11-13 21:57:02 -05:00
German Andryeyev	6617af10e6	rocr: Disable WaitAny() in AsyncEventsLoop() - Add the new path to avoid WaitAny() calls in AsyncEventsLoop() with HSA_WAIT_ANY_DEBUG key. The new path is selected by default. The optimization combines all logic of WaitAny() in a single processing loop and avoids extra memory allocations or ref counting. Also it won't spin on the CPU if all events are busy. Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a [ROCm/ROCR-Runtime commit: `0fc7369ba5`]	2024-10-25 14:37:02 -04:00
Jonathan Kim	ff4690de61	rocr: Fix IPC DMA Buf fragment handling and enable for development Discarding blocks for reallocation on IPC export for better memory performance trigger memory violations with DMA BUF exports so bypass this for now as application performance drops haven't been observed with the bypass. The raw fragment should be passed to the DMA Buf export call as well since offsets will be implicitly applied in the Thunk/KFD for export/import calls. Also, use the agent information directly from the pointer information so that the export call doesn't have to scan memory to find this. Pass the node ID in the handle so that the import call doesn't have to make two thunk imports to fetch the node ID for GPU memory imports. Finally, allow the user to use DMA Buf IPC via HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will be applied by default. Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e [ROCm/ROCR-Runtime commit: `32bb0764b7`]	2024-09-27 14:40:42 -04:00
Saleel Kudchadker	8d1fe1f7ea	rocr: Allocate AQL queue on device memory - Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create AQL queue in device memory. - Before writing the AQL packet header to the queue, use an SFENCE to ensure that there is no reordering of the writes over PCIe. Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d [ROCm/ROCR-Runtime commit: `3baaa6e9c0`]	2024-09-05 17:48:02 -04:00
David Yat Sin	de85c5738e	rocr: Handle pthread_create returning errors Rewriting logic to fix issue where pthread_create would return errors other than EINVAL, and these errors would be ignored. Change-Id: I573958724dcf886c20e8c14e6a9182303b3ffa06 [ROCm/ROCR-Runtime commit: `c8dd4d2b3b`]	2024-08-22 12:15:10 -04:00
Jonathan Kim	b6aa5a4c09	rocr: Memory copy based on recommended SDMA engines Recommended SDMA engines for DMA copies are now exposed for better GPU-GPU performance. ROCr can now select those DMA engines. Also lock-in host-device copies to SDMA0 and device-host copies to SDMA1 for better stability and performance. Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5 [ROCm/ROCR-Runtime commit: `eb30a5bbc7`]	2024-08-20 16:22:32 -04:00
James Xu	e5d7121245	Fix compile errors with musl>=1.2.3 Patch submitted on behalf of user AngryLoki: The fix repeats common pattern, used for musl, e.g: https://github.com/void-linux/void-packages/blob/5ccf1c66a1df2d644e1a0db0a68fca321469c57e/srcpkgs/MangoHud/patches/0001-elfhacks-d_un.d_ptr-is-relative-on-non-glibc-systems.patch#L90. Quoting: d_un.d_ptr is relative on non glibc systems elf(5) documents it this way, glibc diverts from this documentation Change-Id: I815f88f127ef00c88ae827a8ad48df0d33c92467 [ROCm/ROCR-Runtime commit: `a621bca303`]	2024-08-19 11:02:29 -04:00
Jonathan Kim	db44209c11	Disable DMABUF IPC implementation Current DMABUF implementation is unstable. Switch back to legacy support for now. Change-Id: I3be871f38c6524b0bcc9225bab61de4e57771efb [ROCm/ROCR-Runtime commit: `ea646cf958`]	2024-08-12 13:14:14 -04:00
Saleel Kudchadker	bdc02d3054	Initial external logging API New API to accept a file stream for logging Co-authored-by: David Yat Sin <David.YatSin@amd.com> Change-Id: Ie09c35ae14ca86a97eb25f61251be287c55d7169 Signed-off-by: Chris Freehill <cfreehil@amd.com> [ROCm/ROCR-Runtime commit: `26e105d9ab`]	2024-08-07 02:59:00 +00:00
David Yat Sin	14f6875df2	Revert "Use pthread_setaffinity_np" This reverts commit 1df7a44112e45b7fb447926778490f741601219a. Change-Id: Ib386c8f944b6da0ef68ddd2be3f26013cd36ef5b Signed-off-by: Chris Freehill <cfreehil@amd.com> [ROCm/ROCR-Runtime commit: `2f05c2a273`]	2024-06-25 12:27:09 -05:00
David Yat Sin	b4be8a2bfc	Revert "Use pthread_attr_setaffinity_np when available" This reverts commit ef95ccf81e59b8608861e8f2f256d981eee19df7. Reason for revert: Causing performance regressions on some systems Change-Id: I82951350cafbd57c495852d6f90023a3373f04f6 Signed-off-by: Chris Freehill <cfreehil@amd.com> [ROCm/ROCR-Runtime commit: `1cee8656df`]	2024-06-25 12:27:09 -05:00
David Yat Sin	860be91593	Use pthread_attr_setaffinity_np when available If pthread_attr_setaffinity_np function exists use it instead of pthread_setaffinity_np as pthread_setaffinity_np seems to fail to set the affinity settings on some systems. Change-Id: Icd8b17039699ac10d9cd5c4dbb6ac44630673949 [ROCm/ROCR-Runtime commit: `57b93e02a4`]	2024-04-29 15:02:54 +00:00
Shweta.Khatri	4f4d215196	Fixing compilation errors related to MUSL libc Fix Musl libc NULL errors and unsupported pthread funcs for compatibility. Also ensures cleanup and error handling irrespective of CPU affinity override. Fix submitted by github dev - AngryLoki https://github.com/ROCm/ROCR-Runtime/issues/181 Change-Id: Ia487315e504112be5d3370756f23f6e23b9ae4be [ROCm/ROCR-Runtime commit: `bc9cac97fe`]	2024-04-17 07:14:15 -04:00
David Yat Sin	bb10ff65c2	PC Sampling: Allocate resources to retrieve data from trap handler Allocate required device and host buffers to be able to interact with the 2nd level trap handler. Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa [ROCm/ROCR-Runtime commit: `8d666dea01`]	2024-04-11 12:53:00 -04:00
David Yat Sin	8165c03e7b	PC Sampling: Create PC Sampling interfaces Create new interface group for PC Sampling Change-Id: I59b4cfe9f8d1ae313dc28be1d2ed49f750d8212b [ROCm/ROCR-Runtime commit: `0bc244e10a`]	2024-04-11 12:52:23 -04:00
Shweta.Khatri	565dbac2d4	Replace lazy_ptr's Init() with reset() method The function Init() called by one of the constructors of lazy_ptr is undefined. Replacing with reset method sets the object to an uninitialized state and assigns a new constructor function Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184 Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3 [ROCm/ROCR-Runtime commit: `00b63f7452`]	2024-03-26 15:07:34 -04:00
Jonathan R. Madsen	c85e1dc4cd	Update rocprofiler-register support - add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found - add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found - remove report_tool_load_failures_explicit_ - add HSA_TOOLS_DISABLE_REGISTER flag - add HSA_TOOLS_REPORT_REGISTER_FAILURE - use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE - changed rocprofiler-register message to not include the word "error" Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6 [ROCm/ROCR-Runtime commit: `7ce263b0e4`]	2024-02-22 11:55:25 -05:00
Jonathan Kim	15127c6f85	Optimize and fix SDMA gang copies Optimizations include: - Greedy gang by placing gang leaders on first D2D sdma blit context to avoid dead locking with other gang leaders and items. Note that this is fine since we can't avoid an oversubscription problem when there is only 1 xGMI link anyways, so treat all xGMI links as a single pipe for ganging. - Non-leader gang items don't have to poll on dependency signals so this opens up more non-blocking SDMA channels. - unlock gang lock when gangs are not needed. - Change gang factor lookup from vector pair to map and register all gpus in gang factor lookup regardless of link type so that we can take advantage of the O(logN) direct key/value lookup time. Fixes include: - HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit. As a result, small copies ended up ganging and hitting latency limit. Use hardcoded 4096 bytes instead. - Cap auxiliary gang factor to the number of non-XGMI SDMA engines. Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb [ROCm/ROCR-Runtime commit: `62f3f250ce`]	2024-01-25 10:42:27 -05:00
Jonathan R. Madsen	4f7dfe87d2	Suppress reporting no tools were found with rocprofiler-register Change-Id: If853517d40e073202d12e2a6b16fb54be5529650 [ROCm/ROCR-Runtime commit: `8f0ea44c09`]	2024-01-17 01:01:19 -05:00
Jonathan Kim	7ac11c41af	Enable IPC DMA buf Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation by default). Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2 [ROCm/ROCR-Runtime commit: `e20f41df62`]	2024-01-16 22:43:27 -05:00
Jonathan Kim	3c63cf521b	Change IPC implementation to use DMA Bufs As the KFD IPC IOCTLs will not be upstreamed, change runtime implementation to use DMA bufs. DMA buf fds will be passed over abstract unix domain sockets. The exporter spins a thread that creates a socket server. The importer connects to the server to fetch the fd. libDRM will be required to do a manual import and GPU map for memory that is not already imported and mapped. For now, use the legacy IPC implementation by default as a follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY environment variable. Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4 [ROCm/ROCR-Runtime commit: `5dfebdbca9`]	2024-01-16 22:43:00 -05:00
David Yat Sin	cbe9337918	Use HybridMutex for signal mutexes Implement HybridMutex to improve latencies compared to KernelMutex when there is contention between several threads calling hsa_signal_create and hsa_amd_signal_async_handler. Change-Id: If53377033e749b0050727964c9303f09b02527cc [ROCm/ROCR-Runtime commit: `8d3fee5095`]	2024-01-16 21:29:39 +00:00
David Yat Sin	dcd5f16de0	Use pthread_setaffinity_np On some systems, pthread_attr_setaffinity_np does not exist, so we need to use pthread_setaffinity_np on the thread after pthread_create. Provided by Julian Samaroo on github https://github.com/RadeonOpenCompute/ROCR-Runtime/pull/143 Change-Id: I4649f94333f2d7b0a5993b370a4bfc48d92acecb [ROCm/ROCR-Runtime commit: `6333fdecf3`]	2023-12-18 17:41:49 -05:00
David Yat Sin	d81bf9cd57	Use CPU_SET_S instead of CPU_SET Fix incorrect use of CPU_SET on variable size cpu_set_t Suggested by Christopher E. Moore on github https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/130 Change-Id: I710b56683ba07c08dcd83c851bf72e4f127a0ad4 [ROCm/ROCR-Runtime commit: `f07b8f2250`]	2023-12-04 15:05:22 +00:00
David Yat Sin	6140d8a66d	Implement alternate scratch The alternate scratch memory is used for dispatches that have a low number of waves but relatively large wave size. This allows us to keep the tmpring_size.bits.WAVES field of the main scratch to full occupancy. Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3 [ROCm/ROCR-Runtime commit: `a7a3358067`]	2023-12-04 15:05:22 +00:00
David Yat Sin	66b9fdc2d6	Implement async scratch reclaim For devices where the CP FW supports asynchronous scratch reclaim, ROCr is able to claw-back scratch memory that was assigned to an AQL queue. With that ability, ROCr does not have to rely on using USO (use-scratch-once) when assigning large amounts of memory to a queue. If we reach a situation where we are running low on device memory, ROCr will attempt to claw-back the scratch memory. Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7 [ROCm/ROCR-Runtime commit: `dca8f3a21d`]	2023-12-04 15:05:22 +00:00
Jonathan Kim	1931c4f8a4	Increase SDMA copy size SDMA4.4 and SDMA5.2+ have increased their available copy size to 2^30 bytes represented by exponent as bits set in the COUNT field of the linear copy. Also note that the full 2^22 byte limit is available from SDMA4 onwards as it has corrected the 0x3fffe0 HW limitation from SDMA3. As the copy limit has increased, this can change system performance so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall back to the original 0x3fffe0 limit for debugging purposes. Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee [ROCm/ROCR-Runtime commit: `81c64228e0`]	2023-12-04 15:03:31 +00:00
Jonathan Kim	ae3b48d227	Enable D2D SDMA Ganging over xGMI Use all available SDMA engines capped by xGMI bandwidth for all D2D copies within a hive. By default, set the latency boundary copy size as 4KB and below. Any copy size within this boundary will not gang. Avoid oversubscribing engines by not ganging on engines with pending non-ganged work. An environment variable HSA_ENABLE_SDMA_GANG has been provided to override default ganging behaviour. Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d [ROCm/ROCR-Runtime commit: `7df0167821`]	2023-07-27 08:58:26 -04:00
Jeremy Newton	b3f22fef0a	Fix non-x86 builds I've just reverted some code to what it was in 5.5 by wrapping new x86 specific bits with #if's, e.g.: - CPUID is x86 specific - mwait is x86 specific Change-Id: I6cefae34282c777c7340daf3f934d2a11742502e Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com> [ROCm/ROCR-Runtime commit: `132a19e9c3`]	2023-06-30 01:04:04 -04:00
David Yat Sin	14052ab9d0	Add HSA_ENABLE_PEER_SDMA env variable Add support for HSA_ENABLE_PEER_SDMA env variable that can be used to disable use of SDMA engines for device-to-device transfers. Note that setting HSA_ENABLE_SDMA=0 will disable all SDMA transfers and override HSA_ENABLE_PEER_SDMA values. Change-Id: I737b3c2b2efcf3ff237f98bc748f49b8252ed24a [ROCm/ROCR-Runtime commit: `a397373cea`]	2023-05-18 00:10:20 +00:00

147 commits