rocm-systems

Author	SHA1	Message	Date
David Yat Sin	57b93e02a4	Use pthread_attr_setaffinity_np when available If pthread_attr_setaffinity_np function exists use it instead of pthread_setaffinity_np as pthread_setaffinity_np seems to fail to set the affinity settings on some systems. Change-Id: Icd8b17039699ac10d9cd5c4dbb6ac44630673949	2024-04-29 15:02:54 +00:00
Shweta.Khatri	bc9cac97fe	Fixing compilation errors related to MUSL libc Fix Musl libc NULL errors and unsupported pthread funcs for compatibility. Also ensures cleanup and error handling irrespective of CPU affinity override. Fix submitted by github dev - AngryLoki https://github.com/ROCm/ROCR-Runtime/issues/181 Change-Id: Ia487315e504112be5d3370756f23f6e23b9ae4be	2024-04-17 07:14:15 -04:00
David Yat Sin	8d666dea01	PC Sampling: Allocate resources to retrieve data from trap handler Allocate required device and host buffers to be able to interact with the 2nd level trap handler. Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa	2024-04-11 12:53:00 -04:00
David Yat Sin	0bc244e10a	PC Sampling: Create PC Sampling interfaces Create new interface group for PC Sampling Change-Id: I59b4cfe9f8d1ae313dc28be1d2ed49f750d8212b	2024-04-11 12:52:23 -04:00
Shweta.Khatri	00b63f7452	Replace lazy_ptr's Init() with reset() method The function Init() called by one of the constructors of lazy_ptr is undefined. Replacing with reset method sets the object to an uninitialized state and assigns a new constructor function Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184 Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3	2024-03-26 15:07:34 -04:00
Jonathan R. Madsen	7ce263b0e4	Update rocprofiler-register support - add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found - add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found - remove report_tool_load_failures_explicit_ - add HSA_TOOLS_DISABLE_REGISTER flag - add HSA_TOOLS_REPORT_REGISTER_FAILURE - use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE - changed rocprofiler-register message to not include the word "error" Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6	2024-02-22 11:55:25 -05:00
Jonathan Kim	62f3f250ce	Optimize and fix SDMA gang copies Optimizations include: - Greedy gang by placing gang leaders on first D2D sdma blit context to avoid deadlocking with other gang leaders and items. Note that this is fine since we can't avoid an oversubscription problem when there is only 1 xGMI link anyways, so treat all xGMI links as a single pipe for ganging. - Non-leader gang items don't have to poll on dependency signals so this opens up more non-blocking SDMA channels. - unlock gang lock when gangs are not needed. - Change gang factor lookup from vector pair to map and register all gpus in gang factor lookup regardless of link type so that we can take advantage of the O(logN) direct key/value lookup time. Fixes include: - HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit. As a result, small copies ended up ganging and hitting latency limit. Use hardcoded 4096 bytes instead. - Cap auxiliary gang factor to the number of non-XGMI SDMA engines. Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb	2024-01-25 10:42:27 -05:00
Jonathan R. Madsen	8f0ea44c09	Suppress reporting no tools were found with rocprofiler-register Change-Id: If853517d40e073202d12e2a6b16fb54be5529650	2024-01-17 01:01:19 -05:00
Jonathan Kim	e20f41df62	Enable IPC DMA buf Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation by default). Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2	2024-01-16 22:43:27 -05:00
Jonathan Kim	5dfebdbca9	Change IPC implementation to use DMA Bufs As the KFD IPC IOCTLs will not be upstreamed, change runtime implementation to use DMA bufs. DMA buf fds will be passed over abstract unix domain sockets. The exporter spins a thread that creates a socket server. The importer connects to the server to fetch the fd. libDRM will be required to do a manual import and GPU map for memory that is not already imported and mapped. For now, use the legacy IPC implementation by default as a follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY environment variable. Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4	2024-01-16 22:43:00 -05:00
David Yat Sin	8d3fee5095	Use HybridMutex for signal mutexes Implement HybridMutex to improve latencies compared to KernelMutex when there is contention between several threads calling hsa_signal_create and hsa_amd_signal_async_handler. Change-Id: If53377033e749b0050727964c9303f09b02527cc	2024-01-16 21:29:39 +00:00
David Yat Sin	6333fdecf3	Use pthread_setaffinity_np On some systems, pthread_attr_setaffinity_np does not exist, so we need to use pthread_setaffinity_np on thread after pthread_create Provided by Julian Samaroo on github https://github.com/RadeonOpenCompute/ROCR-Runtime/pull/143 Change-Id: I4649f94333f2d7b0a5993b370a4bfc48d92acecb	2023-12-18 17:41:49 -05:00
David Yat Sin	f07b8f2250	Use CPU_SET_S instead of CPU_SET Fix incorrect use of CPU_SET on variable size cpu_set_t Suggested by Christopher E. Moore on github https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/130 Change-Id: I710b56683ba07c08dcd83c851bf72e4f127a0ad4	2023-12-04 15:05:22 +00:00
David Yat Sin	a7a3358067	Implement alternate scratch The alternate scratch memory is used for dispatches that have a low number of waves but relatively large wave size. This allows us to keep the tmpring_size.bits.WAVES field of the main scratch to full occupancy. Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3	2023-12-04 15:05:22 +00:00
David Yat Sin	dca8f3a21d	Implement async scratch reclaim For devices where the CP FW supports asynchronous scratch reclaim, ROCr is able to claw-back scratch memory that was assigned to an AQL queue. With that ability, ROCr does not have to rely on using USO (use-scratch-once) when assigning large amounts of memory to a queue. If we reach a situation where we are running low on device memory, ROCr will attempt to claw-back the scratch memory. Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7	2023-12-04 15:05:22 +00:00
Jonathan Kim	81c64228e0	Increase SDMA copy size SDMA4.4 and SDMA5.2+ have increased their available copy size to 2^30 bytes represented by exponent as bits set in the COUNT field of the linear copy. Also note that the full 2^22 byte limit is available from SDMA4 onwards as it has corrected the 0x3fffe0 HW limitation from SDMA3. As the copy limit has increased, this can change system performance so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall back to the original 0x3fffe0 limit for debugging purposes. Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee	2023-12-04 15:03:31 +00:00
Jonathan Kim	7df0167821	Enable D2D SDMA Ganging over xGMI Use all available SDMA engines capped by xGMI bandwidth for all D2D copies within a hive. By default, set the latency boundary copy size as 4KB and below. Any copy size within this boundary will not gang. Avoid oversubscribing engines by not ganging on engines with pending non-ganged work. An environment variable HSA_ENABLE_SDMA_GANG has been provided to override default ganging behaviour. Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d	2023-07-27 08:58:26 -04:00
Jeremy Newton	132a19e9c3	Fix non-x86 builds I've just reverted some code to what it was in 5.5 by wrapping new x86 specific bits with #if's, e.g.: - CPUID is x86 specific - mwait is x86 specific Change-Id: I6cefae34282c777c7340daf3f934d2a11742502e Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>	2023-06-30 01:04:04 -04:00
David Yat Sin	a397373cea	Add HSA_ENABLE_PEER_SDMA env variable Add support for HSA_ENABLE_PEER_SDMA env variable that can be used to disable use of SDMA engines for device-to-device transfers. Note that setting HSA_ENABLE_SDMA=0 will disable all SDMA transfers and override HSA_ENABLE_PEER_SDMA values. Change-Id: I737b3c2b2efcf3ff237f98bc748f49b8252ed24a	2023-05-18 00:10:20 +00:00
David Yat Sin	a180c9ee78	Add env var to override SRAM ECC Add HSA_ENABLE_SRAMECC environment variable that can be used to override SRAM ECC mode reported by KFD Change-Id: I2b95511820a2d3d146a76b03070659c0695b61fd	2023-04-27 16:16:05 -04:00
Lancelot SIX	183f5d90aa	linux os_thread: improve error handling On Linux, the os_thread abstraction is built on top of pthread. Many of the pthread calls might fail and return error codes. The error conditions are only checked via assertions (if ever checked) which means that when doing a release build, no error condition is checked. The same goes for dlsym/dlinfo and clock_gettime. This commit improves the situation by checking the error conditions and acting accordingly. When the error condition is detected in a function with a means to indicate some error to its caller, then this patch prints some error message and returns. If there is no way to propagate the error up the call stack, print some error message and abort the process. For the os_info::os_info ctor, the only user is CreateThread, which checks that the built thread is Valid(). If not, nullptr is returned to the caller. It could be possible to use exceptions when functions cannot pass errors, but for now I only use abort as it is what abort would do with debug build. Change-Id: I815703c3b95777cc29bb89a7d654ac879c14a759	2023-04-17 09:48:11 -04:00
David Yat Sin	8ebf5f9c48	Adding scratch memory reservation Some applications will keep trying to allocate device memory until the allocation fails. This causes all device memory to be used up and we are then unable to allocate scratch memory for dispatches. Reserve enough memory for 1 small scratch allocation. Change-Id: I968400d41540ba1aca8f28581f229693eec02225	2023-04-06 15:13:36 +00:00
Shweta Khatri	83a307c449	By default, disable mwaitx feature. This can be enabled by setting HSA_ENABLE_MWAITX=1 Change-Id: I4be00892780beeb8b14c3c5f34aa10b158921bff	2023-03-15 19:57:25 -04:00
David Yat Sin	cc48dfdbff	Use mwaitx when busy-waiting signals Use mwaitx instructions when busy waiting for signals to reduce CPU energy usage. This can be disabled by setting HSA_ENABLE_MWAITX=0 Change-Id: Ic207895a491b2bf6dacba47ef0921df3faad5b5a	2023-02-22 16:55:43 +00:00
David Yat Sin	0ed1568afc	Add function to parse CPUID information Used to detect whether mwaitx instruction is supported Change-Id: I66fe906325aa523c8815133cf782df3a17a7edab	2023-02-22 16:55:42 +00:00
Shweta Khatri	8aac885318	Fixes hang due to change in order of initialization of libraries Fixes hang due to change in order of initialization of libraries that have cyclical dependencies and they call hsa_init() during their initialization phase. This implementation looks for a symbol called "HSA_AMD_TOOL_PRIORITY" across all loaded shared libraries using dynamic section entries of the loaded lib instead of using dlopen and dlsym for the same purpose. Change-Id: I4865f2fd18dd186ec311a432ec38fbb5583805d2	2023-01-26 01:17:22 -05:00
David Yat Sin	a4f898ad15	Add env variable to print image SRD contents Add environment variable HSA_IMAGE_PRINT_SRD to print contents of SRD registers for image functions Change-Id: Ifb47a73dcfad8745ee7445e20de96e1021b80bd6	2023-01-13 11:01:04 -05:00
Shweta Khatri	8751e65b79	Fixed callback method for dl_iterate_phdr api which is called for each loaded shared object Simplified the callback method. Also fixed the way loaded shared objects were getting appended into a string vector, which was not being passed to this callback method. Change-Id: I68661dd73f61a11c42fa92f670e8e7b6ffcb5711	2022-11-21 19:00:34 -05:00
David Yat Sin	dd255d31b8	Fix uninitialized variable warning Fix warning when using valgrind Change-Id: Ie59eaa990b9b5d339a178a2c6f9f4fac0e34e925	2022-09-08 09:10:00 -04:00
David Yat Sin	df3fe8c2fb	Add env variable to disable CPU affinity override New environment variable HSA_OVERRIDE_CPU_AFFINITY_DEBUG to enable/disable overriding CPU affinity. Default value is enabled(1). This is a temporary variable and may be removed in the future. Change-Id: Id6a7c611730471ddc276ca333fde1e57046bf32a	2022-08-19 11:07:49 -04:00
Sean Keely	965df6eef7	Basic SVM profiler. Mostly a demo at this point. Logs SVM (aka HMM) info to HSA_SVM_PROFILE if set. Example: HSA_SVM_PROFILE=log.txt SomeApp Change-Id: Ib6fd688f661a21b2c695f586b833be93662a15f4	2022-06-23 19:30:06 -05:00
skhatri	e7fc301aa7	Adding support for rocrtracer tools loading without environment variable During hsa initializing stage, ROCr now searches all the loaded libraries for a symbol "HSA_AMD_TOOL_PRIORITY" and adds all those libraries to the tools library init list. Tools libraries listed in HSA_TOOLS_LIB env variable are also loaded in the given order and take priority over HSA_AMD_TOOL_PRIORITY. Change-Id: I739af42bbd777c44a9152c11e17dd69979b65e82	2022-06-23 20:08:30 -04:00
David Yat Sin	4ac840269c	Add API for available GPU memory Add support for AMD Agent to return amount of memory available Change-Id: I5c32e2cebbaa2993b044250aefe434e4cc02d8c2 Signed-off-by: David Yat Sin <david.yatsin@amd.com>	2022-06-07 10:33:18 -04:00
Sean Keely	3ebe99f96d	Add experimental option to force discovery of all copy agents. Discards all user provided async copy agent info and relies on pointer info discovery. Change-Id: Ife3e708a49ffccbede4983ab47d5ed0032970857	2022-05-14 18:08:57 -05:00
Sean Keely	0ee82742a7	Switch to CLOCK_BOOTTIME for HSA system clock. This is consistent with KFD and has significantly better latency. KFD is taking this as the definition of the SystemClockCounter. Change-Id: I4c1b3bc58c738206265c55ebefd41356c013bfe5	2022-05-05 15:27:29 -04:00
Jeremy Newton	178a7a5cfa	Drop some unnecessary definitions __x86_64__ and __AMD64__ should be already defined by the compiler to specify the compilation target and shouldn't be defined manually. I fixed two x86_64 checks to include VS variables, as removing this might cause it to fail to compile on that compiler. Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com> Change-Id: I600ff449af85bf7d83ecab167d97933922e2d917	2022-04-19 12:22:42 -04:00
Sean Keely	4e9849034d	Correct inf loop defect in fast clock init. Each time delay is grown we need to reset elapsed. We want to take the most accurate sample from the set at fixed delay. Without this we will hang if there is ever an insufficiently accurate, high unit clock read. Change-Id: Ic65f364067789ac85a6572d67af2d77528e265bb	2022-04-01 16:15:37 -04:00
Sean Keely	552dcead93	Correct scratch allocation logic to account for asymmetric harvest. With asym. harvest hw does not issue groups equally to each SE, occasionally hw will skip an SE so that the distribution reflects each SE's CU count. Scratch resources must be allocated to reflect this asymmetric distribution of groups. Change-Id: I65e26206500483ea18e6e8796e65ecba5354b029	2022-03-02 19:59:30 -06:00
Sean Keely	b9a0c1d313	Do not discard fragment allocator blocks multiple times. discardBlock may be called multiple times on the same block. We must not discard the block multiple times or we will corrupt in-use memory accounting. Change-Id: Ife9f3162785965a795dcf81887d4d447cc096e62	2022-02-10 18:39:46 -06:00
Sean Keely	37942c982a	Add HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT. On gfx90a only a reduced number of CUs must be used for cooperative dispatches due to CWSR and launcher interactions with asymmetric harvest. We must use one fewer CUs per SE than the lowest count of CUs on any SE. Also adds env var HSA_COOP_CU_COUNT which enables the cooperative CU count computation. Set to 1 to enable the new computation. This is an opt-in feature that will become enabled by default (opt-out) in a future release. Change-Id: Ifbb75ced3bbc15876eef44922c6a4f6fde8c4c28	2022-01-31 15:22:07 -05:00
Sean Keely	fce6ba052e	Correct documentation typo. ROCM_VISIBLE_DEVICES was used where ROCR_VISIBLE_DEVICES was intended. Change-Id: I644a546f3c9dd0b50898ef8a21dbb8f5c3a36926	2021-12-10 16:19:30 -06:00
Sean Keely	df55cb0450	Rework memory locks to allow device parallelism in alloc/free. Prior solution used a single global lock to protect the memory tracking structures. This change protects the memory tracking structure with a shared mutex (rw lock) in shared (r) mode for memory allocations and frees so that long duration processes, calling to kfd, can be done in parallel. Operations which must modify the memory map take the mutex in exclusive mode (w) and must not call to the thunk while holding the mutex. The fragment allocator now requires separate protection and is protected with a mutex at the device level. Protecting at the device level, rather than pool, allows retention of the current recursive design and allows calling Trim from within Allocate. This could be made finer (pool level locks) but would require backing out of Allocate entirely to call Trim. Trim and any retried Allocation must be done in isolation (per device) or we may report OOM when memory is actually available in some pool's fragment cache. So some device level serialization is required in at least some paths. Change-Id: I7c1e94d6965ffcc602b12fefdd3a6e97b84b5e00	2021-11-24 19:22:05 -06:00
Sean Keely	322588a60e	Add missing return in ScopeGuard::operator=. This omission did not cause problems earlier due to having not been instanced. Change-Id: I7a54f82e06c299902f3bf6b4d3737cc5e30961ad	2021-11-15 18:50:46 -06:00
Sean Keely	19c1e92b4c	Remove io_link workarounds. KFD topology has been corrected and the defaults used by this workaround are no longer true for all chips. Change-Id: I0242d8077e9666ed1cf0dc3985244258ae5c0924	2021-10-11 19:15:07 -05:00
Sean Keely	a8c3ea82a4	Add debug option to skip setting the initial cu mask. Adds debug variable HSA_CU_MASK_SKIP_INIT. Change-Id: I5c742d1184a36fdef818bc50c3b780b859b68560	2021-09-16 23:43:49 -05:00
Sean Keely	2aa0795b33	Improve HSA_CU_MASK parsing efficiency. Delay parsing until after GPU discovery. Use the surfaced GPU count and maximum physical CU count to limit parsed bit masks. This prevents pathological input such as HSA_CU_MASK=0-8000000:0-8000000 from attempting to consume 7TiB. Change-Id: I3773d2db3740c2023b0f6275d1818b69119b0495	2021-08-27 20:05:18 -04:00
Sean Keely	4455250be1	Add HSA_CU_MASK New environment variable HSA_CU_MASK allows users to specify a cu mask to every queue allocated from any GPU. hsa_amd_queue_cu_set_mask is restricted from escaping this mask. A new API hsa_amd_queue_cu_get_mask is added to query the current cu mask. Change-Id: I846c03a5faaca9b95067c31db84b59cc9fce2f03	2021-07-29 02:23:34 -05:00
Sean Keely	206e87d28b	Support debugging hw exceptions. Change-Id: I9780147294af2e9457fa54693580735452ee2ae6	2021-07-16 18:03:26 -05:00
Sean Keely	8adbda1c18	Allocate any size vram request through the fragment allocator. Enables the fragment allocator to handle >2MB allocations, maintaining good TLB alignment. Prior code contained a bug that caused the effective API granule for vram allocations >2MB to be bumped to 2MB. Also adjusts the block cache's block retention heuristic to not count discarded blocks as in use. This will reduce block retention when a significant amount of large blocks or IPC is in use. Change-Id: I30bd85eb87951df822211f799d9cfe579ab109c6	2021-06-10 19:30:54 -05:00
Sean Keely	ca8387768e	Allow limiting debug warning messages. Add macro debug_warning_n to stop printing a message after N instances. Change-Id: Id5f84b11eb63b3a20bd2bcb2ea8f10a066b457ef	2021-06-03 15:25:55 -05:00
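
Note on commit 2aa0795b33 (HSA_CU_MASK parsing): the idea of clamping a parsed range to the discovered hardware limits before expanding it into a bit mask can be sketched as follows. This is only an illustration, not ROCr's parser. It assumes a simplified `gpus:cus` syntax with comma-separated indices and `lo-hi` ranges (the real grammar may differ), and the names `parse_ranges` and `parse_cu_mask` are hypothetical.

```python
def parse_ranges(spec: str, limit: int) -> int:
    """Return a bit mask with bit i set for every index i < limit named in spec.

    Assumed (hypothetical) syntax: comma-separated items, each either a single
    index such as "3" or an inclusive range such as "0-7".
    """
    mask = 0
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        lo_text, sep, hi_text = part.partition("-")
        lo = int(lo_text)
        hi = int(hi_text) if sep else lo
        if lo > hi:
            raise ValueError(f"descending range: {part!r}")
        # Clamp to the hardware limit BEFORE expanding the range, so an input
        # like "0-8000000" never produces a multi-million-bit mask.
        hi = min(hi, limit - 1)
        if lo > hi:
            continue  # the whole range lies beyond the limit
        width = hi - lo + 1
        mask |= ((1 << width) - 1) << lo
    return mask


def parse_cu_mask(value: str, gpu_count: int, max_cu_count: int) -> dict:
    """Map each selected GPU index to its CU bit mask.

    gpu_count and max_cu_count stand in for the values known only after
    GPU discovery, which is why parsing is delayed until then.
    """
    gpu_spec, sep, cu_spec = value.partition(":")
    if not sep:
        raise ValueError("expected '<gpus>:<cus>'")
    gpu_mask = parse_ranges(gpu_spec, gpu_count)
    cu_mask = parse_ranges(cu_spec, max_cu_count)
    return {g: cu_mask for g in range(gpu_count) if (gpu_mask >> g) & 1}


if __name__ == "__main__":
    # The pathological input from the commit message, with 2 GPUs of 4 CUs.
    print(parse_cu_mask("0-8000000:0-8000000", gpu_count=2, max_cu_count=4))
```

With the clamp in place, the cost of parsing depends on the number of GPUs and CUs actually present rather than on the numbers the user typed.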

116 Commits