* SWDEV-561708 Counted queue size from env var
* use counted_queue_size for test
* remove rocrtst changes; add a const for default queue size
* Remove env var from test; use queue->size
* Improve env var documentation
* Correct type
* SWDEV-561708 Initial shared queue pool apis
* Validate params; some fixes in callback function (but still needs to be checked)
* Dtor cleanup
* minor
* Enable profiling; remove callback since aql_queue takes care of it
* setPriority and setCuMask APIs updated for counted queues
* Increasing step and minor version for rocprofiler
* Tests for CountedQueueManager
* tests
* Code refactored to make pool manager part of GpuAgent only (incomplete); unique handles issue pending
* Refactored code to support CQM inside GpuAgent and unique handles; multithreaded test added
* Changed to ASSERT_SUCCESS macros for all tests
* Ring buffer overflow test added
* tests fixed; cleanup added at hsa_shutdown
* priority conversion table changes
* Compiler warnings fixed
* Rewrite 1 test; add desc and improve SetUp() code
* Improvement
* Unified getinfo for both counted and non-counted queues
* Address PR feedback
* Addressing feedback: memleak, data type mismatch, documentation
* improve comment
* format
* Missing HSA_API macros for roctracer
* Revert "Addressing feedback: memleak, data type mismatch, documentation"
This reverts commit 5e498a55fb3640e00d06cec63dcec79293fb23de.
* Improving acquire api doc
* release api doc improved
* error codes for release api doc
* rocr: Add ProtectMemory API and use it in RemoveAccess
Replace munmap + mmap with mprotect when removing memory access.
This improves performance by 5-10x, ensures atomicity (no race
condition window), and prepares for WSL/DXG compatibility fixes.
Suggested-by: David Yat Sin <David.YatSin@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
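A minimal sketch of the idea behind this change, assuming a Linux host: access to an existing mapping is revoked in place with mprotect() instead of being unmapped and mapped again. The helper names `RevokeAccess`, `RestoreAccess` and `ProtectRoundTrip` are hypothetical and are not the ROCr ProtectMemory API.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>

// Hypothetical helper: drop all access to [addr, addr + size) in place.
// The range stays mapped, so no other thread can claim the addresses
// in the window that a munmap followed by mmap would open.
inline bool RevokeAccess(void* addr, std::size_t size) {
  return mprotect(addr, size, PROT_NONE) == 0;
}

// Hypothetical helper: make the same range readable and writable again.
inline bool RestoreAccess(void* addr, std::size_t size) {
  return mprotect(addr, size, PROT_READ | PROT_WRITE) == 0;
}

// Maps one anonymous region, revokes and restores access, and checks
// that the contents survived the round trip.
inline bool ProtectRoundTrip() {
  const std::size_t size = 4096;
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return false;
  static_cast<char*>(p)[0] = 42;
  bool ok = RevokeAccess(p, size) && RestoreAccess(p, size) &&
            static_cast<char*>(p)[0] == 42;
  munmap(p, size);
  return ok;
}
```

Because the pages are never released, the data is still present once access is restored.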
* rocr: Skip CPU mapping operations on WSL
On WSL, CPU cannot access GPU VRAM due to platform restrictions.
CPU access would fault-in system RAM instead, causing data corruption
and memory leaks. Return HSA_STATUS_ERROR to fail fast rather than
silently creating broken mappings. GPU-to-GPU mappings remain functional.
Signed-off-by: Flora Cui <flora.cui@amd.com>
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
* rocr: reduce ifdef linux
v2: Fix IsDXG check logic
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
---------
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
* SWDEV-569319 Replace ScopedAcquire with stdcpp wrappers
* Replace KernelMutex and KernelSharedMutex abstractions with std::mutex and std::shared_mutex
* Replaced unique_locks with lock_guards
* More changes
* Replace new and deletes with smart pointers
* Replaced some more with shared ptrs
* Replacements with smart pointers - pt 2
* missed change
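The pattern this series moves towards can be sketched as follows: a std::lock_guard scopes the critical section and std::unique_ptr owns the heap objects, so no manual unlock or delete is needed. `Item` and `ItemList` are hypothetical types for illustration, not ROCr classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Item {
  int id;
};

class ItemList {
 public:
  void Add(int id) {
    // Released automatically when 'lock' goes out of scope.
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push_back(std::make_unique<Item>(Item{id}));
  }

  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return items_.size();
  }

 private:
  mutable std::mutex mutex_;                  // mutable: locked in const methods
  std::vector<std::unique_ptr<Item>> items_;  // items freed with the list
};
```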
* SWDEV-558848 - vmm api support for rocr on windows
* Fixes to VMM handle Map/Unmap Set/Get Access
* Fix GetShareableHandle to use pointer for shareable handle
* Update os specific map/unmap memory calls
* clang format update
* Minor syntax fixes from code review
Co-authored-by: Yiannis Papadopoulos <102817138+ypapadop-amd@users.noreply.github.com>
---------
Co-authored-by: Rahul Manocha <rmanocha@amd.com>
Co-authored-by: Yiannis Papadopoulos <102817138+ypapadop-amd@users.noreply.github.com>
* Run pre-commit's whitespace related hooks on projects/rocr-runtime
In order for pre-commit to be useful, everything needs to meet a common
baseline.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Add missing semicolon which would block compilation on big endian CPUs
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* SWDEV-555347 - Remove lock contention in async events loop
* SWDEV-555347 - Introduce Pool of AsyncEventItems
* create generic mempool for AsyncEventItem
* Use BaseShared allocate and free for async event pool
---------
Co-authored-by: Rahul Manocha <rmanocha@amd.com>
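A minimal sketch of an object pool of the kind described above, assuming single-threaded use: released items are kept on a free list and handed out again instead of being reallocated. `SimplePool` and `EventItem` are hypothetical names; the actual change allocates through BaseShared, which is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

template <typename T>
class SimplePool {
 public:
  // Reuse a released item when one is available, else allocate a new one.
  // A recycled item keeps its old field values; callers must reinitialize.
  std::unique_ptr<T> Acquire() {
    if (free_.empty()) return std::make_unique<T>();
    std::unique_ptr<T> item = std::move(free_.back());
    free_.pop_back();
    return item;
  }

  // Return an item to the pool for later reuse.
  void Release(std::unique_ptr<T> item) { free_.push_back(std::move(item)); }

  std::size_t FreeCount() const { return free_.size(); }

 private:
  std::vector<std::unique_ptr<T>> free_;
};

struct EventItem {
  int signal_id = 0;
};
```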
* rocr: fix nullptr dereference
Return early in the case that malloc fails to avoid dereferencing of a
null pointer on eventDescrp.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
* rocr: Fix potential nullptr dereference
Return early if sym->section() fails to properly acquire the object.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
---------
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
Co-authored-by: Sunday Clement <Sunday.Clement@amd.com>
Make sure ROCR can be compiled under Windows. The Windows build environment requires extra setup. This change should not alter any functionality under Linux.
* rocr: Fix Incorrect Assertion Check
The wrong variable was used in the assertion statement. The assertion
should check the value of paramEndLoc after it is modified by the call
to find().
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
* rocr: Fix Potential Undefined Behaviour
If the SvmProfileControl destructor is called while event == -1, the
call to close(event) is effectively close(-1), which is undefined
behaviour. This has been changed to only call close() on valid file
descriptors.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
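The guard described above can be sketched as a small RAII wrapper that closes the descriptor only when it is valid. `ScopedFd` is a hypothetical class for illustration and is not the SvmProfileControl code.

```cpp
#include <unistd.h>

#include <cassert>

class ScopedFd {
 public:
  explicit ScopedFd(int fd = -1) : fd_(fd) {}

  // Only close descriptors that were actually opened.
  ~ScopedFd() {
    if (fd_ >= 0) close(fd_);
  }

  ScopedFd(const ScopedFd&) = delete;
  ScopedFd& operator=(const ScopedFd&) = delete;

  bool valid() const { return fd_ >= 0; }
  int get() const { return fd_; }

 private:
  int fd_;
};
```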
* rocr: Add Error Check on Bytes Read
If a read is incomplete, the call to copyTo() now returns an error.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
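A minimal sketch of a bytes-read check, assuming a POSIX file descriptor: the helper reports failure when fewer bytes than requested are available. `ReadExact` is a hypothetical helper and does not reproduce the copyTo() implementation.

```cpp
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>

// Reads exactly 'size' bytes into 'buf'. Returns false on a read error
// or if end-of-file is reached before all bytes have arrived.
inline bool ReadExact(int fd, void* buf, std::size_t size) {
  char* out = static_cast<char*>(buf);
  std::size_t done = 0;
  while (done < size) {
    ssize_t n = read(fd, out + done, size - done);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry
      return false;                  // real read error
    }
    if (n == 0) return false;        // incomplete read
    done += static_cast<std::size_t>(n);
  }
  return true;
}
```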
* rocr: Fix Exception Error
Destructors are implicitly noexcept(true) by default, so unless the
destructor is explicitly marked noexcept(false), any exception that
escapes it, including one thrown by a function it calls, will cause
the program to terminate.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
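One common way to keep a destructor safe under the implicit noexcept rule is to catch inside it, sketched below. `RiskyCleanup` and `SafeOwner` are hypothetical names, and this is not necessarily the approach taken in the commit.

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>

// Stand-in for a cleanup routine that can throw.
inline void RiskyCleanup() { throw std::runtime_error("cleanup failed"); }

class SafeOwner {
 public:
  explicit SafeOwner(bool* failed) : failed_(failed) {}

  // Implicitly noexcept: the exception must be handled here, because
  // letting it escape would call std::terminate.
  ~SafeOwner() {
    try {
      RiskyCleanup();
    } catch (const std::exception&) {
      *failed_ = true;  // record the failure instead of propagating it
    }
  }

 private:
  bool* failed_;
};

static_assert(std::is_nothrow_destructible<SafeOwner>::value,
              "destructors are noexcept unless declared otherwise");
```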
---------
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
Co-authored-by: Sunday Clement <Sunday.Clement@amd.com>
Moved the call to pthread_mutex_lock to an else statement for better
code readability.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 1635746a9c]
Because eventDescrp->mutex is a non-recursive lock, attempting to
acquire it with pthread_mutex_lock can cause the system to hang
indefinitely if the lock was already acquired by the preceding call
to pthread_mutex_trylock.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: a97b7df4b9]
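The trylock-then-lock pattern from the two commits above can be sketched as follows, assuming a default (non-recursive) pthread mutex. `LockOnce` is a hypothetical helper, not the ROCr function.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>

// Acquires 'm' exactly once. Returns true if the lock was taken on the
// fast path (trylock), false if the caller had to block for it.
inline bool LockOnce(pthread_mutex_t* m) {
  if (pthread_mutex_trylock(m) == 0) {
    // Already holding the mutex: calling pthread_mutex_lock here would
    // self-deadlock on a non-recursive mutex.
    return true;
  } else {
    pthread_mutex_lock(m);  // only block when trylock did not succeed
    return false;
  }
}
```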
Allocated memory was previously not freed when rwlock initialization
failed.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 293092f32f]
On large BAR systems, for small-sized code-objects, we get better
performance using a direct memcpy because of the latencies incurred
when doing the blit-copy.
[ROCm/ROCR-Runtime commit: da2607024b]
Using HSA_ENABLE_DTIF to control dtif/native thunk code path
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
[ROCm/ROCR-Runtime commit: 166b0fa45a]
This builds on a prior change that allowed for allocating
a user-mode queue's packet buffer in device memory to also
allocate the queue struct in device memory. This provides
additional latency benefits particularly for cases where
dispatches are performed from the GPU itself. Flags are
added to support the various use cases.
[ROCm/ROCR-Runtime commit: 6e3c375bf1]
The initial call to Refresh() in the constructor is
unnecessary as it's handled in Runtime::Load().
Signed-off-by: lyndonli <Lyndon.Li@amd.com>
[ROCm/ROCR-Runtime commit: c34a2798ce]
The scratch_backing_memory_byte_size is not used by CP, but it is
currently used by rocgdb. Putting the field back, but we need to find a
solution for alt_scratch_backing_memory_byte_size.
Also, completely disabling alternate scratch as we need some changes to
support debugger.
[ROCm/ROCR-Runtime commit: 02b38d0614]
Updating ROCr code to match new handshake protocol with CP FW for
asynchronous scratch reclaim.
Increase previous limits when scratch reclaim feature is available.
[ROCm/ROCR-Runtime commit: aa2f98e6f9]
Added HSA_IMAGE_ENABLE_3D_SWIZZLE_DEBUG environment flag to
enable/disable this. Default value is false (view3dAs2dArray = 1)
Enabling this flag will enable support for swizzles that do 3D
interleaving. Note that all features of 3D images are supported
with 2D swizzles; it's just that the access patterns are different
and therefore cache hit-rates may be better or worse, depending
on how it's used. Volumetric algorithms do better with 3D and apps
that tend to access a single slice at a time do better with 2D.
Change-Id: Id8574a6710fe4333a1ee331e5ce9195a81434198
[ROCm/ROCR-Runtime commit: 6361466baa]
Set priority to maximum for signal event handler and minimum for
exceptions event handler.
Change-Id: I1b982d3c2e4c880fafc073fe1a542d01692a6fdc
[ROCm/ROCR-Runtime commit: 7ea25ebb85]
'args' is no longer a unique pointer that is deleted in
'ThreadTrampoline'; it is now declared as a class member.
Change-Id: Ia52058392d0170e8b5e57cfdd2c587f47a6f93f0
Signed-off-by: Apurv Mishra <apurv.mishra@amd.com>
[ROCm/ROCR-Runtime commit: 89115369cc]
WaitSemaphore and PostSemaphore are used in the HybridMutex
implementation. If HybridMutex did not have to call WaitSemaphore when
acquired, then calling PostSemaphore would cause the internal count
inside sem_t to slowly grow to large values and eventually cause
overflow.
Change-Id: I173fc17c874b49926e56991405e9086ea8c138fc
[ROCm/ROCR-Runtime commit: f58aff630c]
Add support for an abort timeout when hsa_signal_wait_relaxed is called
and the signal does not clear within the timeout.
The timeout is specified in seconds.
Change-Id: If1db5a8af33c82ddc4b48968c3d8eceb97d0ea6d
[ROCm/ROCR-Runtime commit: 4ec730f1dc]
- Add a new path that avoids WaitAny() calls in AsyncEventsLoop(),
controlled by the HSA_WAIT_ANY_DEBUG key. The new path is selected by default.
The optimization combines all of the WaitAny() logic in a single processing
loop and avoids extra memory allocations or ref counting. It also won't spin
on the CPU if all events are busy.
Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a
[ROCm/ROCR-Runtime commit: 0fc7369ba5]
Discarding blocks for reallocation on IPC export (done for better
memory performance) triggers memory violations with DMA BUF exports,
so bypass this for now, as no application performance drops have been
observed with the bypass.
The raw fragment should be passed to the DMA Buf export call as well
since offsets will be implicitly applied in the Thunk/KFD for
export/import calls.
Also, use the agent information directly from the pointer
information so that the export call doesn't have to scan memory to find
this. Pass the node ID in the handle so that the import call doesn't
have to make two thunk imports to fetch the node ID for GPU memory
imports.
Finally, allow the user to use DMA Buf IPC via
HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will
be applied by default.
Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e
[ROCm/ROCR-Runtime commit: 32bb0764b7]
- Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create AQL queue in device
memory.
- Before writing the AQL packet header to the queue, use an SFENCE to
ensure that there is no reordering of the writes over PCIe
Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d
[ROCm/ROCR-Runtime commit: 3baaa6e9c0]
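The write ordering described above can be sketched as follows: the packet body is written first, a store fence is issued, and only then is the header published. `DemoPacket`, `StoreFence` and `Publish` are hypothetical names, the packet layout is simplified and is not the real AQL format, and the non-x86 fallback is an assumption made only to keep the sketch portable.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#if defined(__x86_64__)
#include <xmmintrin.h>  // _mm_sfence
#endif

// Simplified stand-in for a queue slot; the header marks the slot valid.
struct DemoPacket {
  std::atomic<std::uint16_t> header{0};
  std::uint32_t body[4] = {};
};

inline void StoreFence() {
#if defined(__x86_64__)
  _mm_sfence();  // earlier stores become visible before later ones
#else
  std::atomic_thread_fence(std::memory_order_release);
#endif
}

inline void Publish(DemoPacket* slot, std::uint16_t header,
                    const std::uint32_t (&body)[4]) {
  for (int i = 0; i < 4; ++i) slot->body[i] = body[i];  // 1. body first
  StoreFence();                                         // 2. order the writes
  slot->header.store(header, std::memory_order_release);  // 3. header last
}
```

Publishing the header last means a consumer that sees a valid header can rely on the body already being written.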
Rewriting logic to fix issue where pthread_create would return errors
other than EINVAL, and these errors would be ignored.
Change-Id: I573958724dcf886c20e8c14e6a9182303b3ffa06
[ROCm/ROCR-Runtime commit: c8dd4d2b3b]
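A minimal sketch of the error handling described above: every non-zero return code is reported to the caller, not just EINVAL. `CreateThread` and `Noop` are hypothetical, and the retry with default attributes on EINVAL is an assumption about the fallback, not a description of the ROCr code.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>
#include <cstddef>

inline void* Noop(void*) { return nullptr; }

// Returns 0 on success, otherwise the error code from the pthread call.
inline int CreateThread(pthread_t* thread, std::size_t stack_size,
                        void* (*fn)(void*), void* arg) {
  pthread_attr_t attr;
  int rc = pthread_attr_init(&attr);
  if (rc != 0) return rc;

  rc = pthread_attr_setstacksize(&attr, stack_size);
  if (rc == 0) rc = pthread_create(thread, &attr, fn, arg);

  // Assumed fallback: an invalid attribute is retried with defaults.
  if (rc == EINVAL) rc = pthread_create(thread, nullptr, fn, arg);

  pthread_attr_destroy(&attr);
  return rc;  // EAGAIN, EPERM, etc. reach the caller instead of being dropped
}
```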
Recommended SDMA engines for DMA copies are now exposed for better
GPU-GPU performance. ROCr can now select those DMA engines.
Also lock-in host-device copies to SDMA0 and device-host copies to
SDMA1 for better stability and performance.
Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5
[ROCm/ROCR-Runtime commit: eb30a5bbc7]
Current DMABUF implementation is unstable. Switch back to legacy
support for now.
Change-Id: I3be871f38c6524b0bcc9225bab61de4e57771efb
[ROCm/ROCR-Runtime commit: ea646cf958]