The "optimized" version of memcpy is outdated and
was used in win32 only.
Change-Id: I7f2e0e9051e37cec95438266824b5b0025c324c6
[ROCm/clr commit: 7448113cfc]
- UUID needs to be specified in the format GPU-<body>, <body> encodes UUID as a 16 chars
- Convert set UUID in HIP_VISIBLE_DEVICES to device index internally
- Then use existing device index logic for HIP_VISIBLE_DEVICES
Change-Id: I654f492a49cd4d7a9b7339360ab558165240caa5
[ROCm/clr commit: fb217fa9e0]
Mempool may reuse memory without a wait. Hence, the timestamp has
to be preserved and can't be destroyed.
Change-Id: I6f095f44afa69887a4b7aeb3b329804aedd96f3e
[ROCm/clr commit: ffb516db3e]
For all windows allocation on SVM memory tagged with flag ROCCLR_MEM_INTERPROCESS.
hipHostMalloc validation is based on flag. So remove ROCCLR_MEM_INTERPROCESS before check.
Change-Id: I823bbf228d9a4a9acb4abffc01ac6b3f544c6e12
[ROCm/clr commit: 8179fa98a2]
Avoid a deadlock on the host call buffer creation. Since the buffer will be
allocated in the queue thread, then use direct device memory allocation
skipping the global context lock.
Change-Id: I09b55ee03bb42ab5d320c152b52a8c842c5fdcc1
[ROCm/clr commit: 62559a6e5a]
AMD CPUs have had avx512_bf16 support for quite some time now (from
consumer Ryzen 7000 series to enterprise grade CPUs). This
patch should allow users to use the hardware bf16 unit when running the
__host__ variants of the function. This can be enabled via `hipcc ...
-mavx512vl -mavx512bf16`.
Change-Id: I67c377afc95ddfe8d45a048dce078a247d4a1878
[ROCm/clr commit: 49349f168c]
- UUID is Ascii string with a maximum of 21 chars which uniquely identifies a GPU
- Convert set UUID in HIP_VISIBLE_DEVICES to device index internally
- Then use existing device index logic for HIP_VISIBLE_DEVICES
Change-Id: I8cab4fe42459f8209b97f909300789e6e687b9ac
[ROCm/clr commit: 52db98edd9]
If we are using the mask returned by getLastUsedSdmaEngine() then we
need to apply the SDMA Read/Write mask to it before using with HSA
copy_on_engine API.
Change-Id: I6e5dc6c187eeb3c61ee159e9d2a0fa7b4737c06e
[ROCm/clr commit: 3f0bcf7834]
Dispatching multiple packets with ring the doorbell once is not supported by the lower layers
Change-Id: I7665a2dcdd4ef9e47dadfe410180fed64c5a4ee0
[ROCm/clr commit: d7f352dbed]
- Application is passing null for parameter stream in API hipStreamWaitEvent
- When event stream isn't capturing and event is not recorded, causes segfault because we are accessing deviceId() from waitStream
Change-Id: I8b87ffd6f234677f68b66dcb7ef44b2ff04a7c91
[ROCm/clr commit: 880f1f0049]
When large bar is enabled, persistent memory leads to overallocation for 32 bit architecture.
Change-Id: Iae39359d8128588de02e42d77fe58e868b8e71fd
[ROCm/clr commit: bc80802c1a]
=> hipDeviceSynchronize is not allowed during capture.
=> hipEventSynchronize during capture should return hipErrorCapturedEvent error
=> hipEventQuery during capture should return hipErrorCapturedEvent error
hipStreamSynchronize, hipEventSynchronize, hipStreamWaitEvent, hipStreamQuery
For Side Stream(Stream that is not currently under capture):
=> If current thread is capturing in relaxed mode, calls are allowed
=> If any stream in current/concurrent thread is capturing in global mode, calls are not allowed
=> If any stream in current thread is capturing in ThreadLocal mode, calls are not allowed
For Stream that is currently under capture
=> calls are not allowed
=> Any call that is not allowed during capture invalidates the capture sequence
=> It is invalid to call synchronous APIs during capture. Synchronous APIs,
such as hipMemcpy(), enqueue work to the legacy stream and synchronize it before returning.
Change-Id: I201c6e63e1a5d93fd416a3b520264c0fdbe31237
[ROCm/clr commit: 19b4660cbb]
The new logic has a lock for PAL call and doesn't require the lock for queues.
Change-Id: I61b67c3c4abd2ede44809de1d6beed756766032e
[ROCm/clr commit: 2f3ad43c4a]
alwaysResident setting doesn't require per queue residency tracking.
Thus, the logic can be skipped to avoid the lock of queues.
Change-Id: Ib5cff5b79d3ecb8c2f2eb2565cf069f9a69438b0
[ROCm/clr commit: 95e3958748]
- Create a vector to allow multiple TS to be stored in Command.
- This would mean we dont wait for entire batch in Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasnt included in Torch profile scope
beacuse we previously entered kernel names only when profiler is
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
- With https://gerrit-git.amd.com/c/lightning/ec/llvm-project/+/1002628 applied, at -O0 Kernel::dynamicParallelism() returns true but virtual queue isn't created
- This causes segfault inside VirtualGPU::submitKernelInternal() when getVQVirtualAddress() is called
Change-Id: Ia7af042adad2329e870c142caaac3e8fa886f8b8
[ROCm/clr commit: d1fff7cea2]