Windows kills threads on exit without any notification. However,
runtime can still destroy VirtualGPU object from the host thread with
HostQueue destruction.
This change also forces RGP trace transfer on the last capture without
any delays.
Change-Id: I768e87e99e1d23a021e63c12f36e450817743759
Remove global program lock in order to fix too
long kernel launch overhead with multi-threads
on MGPUs.
This patch depends on a compiler patch that makes
LC thread safe.
Change-Id: Ic8a7374d19112764d6de5d483ec5d07a56661d1b
The ROCclr assigns zero-based IDs to GPUs in the order they are
discovered. That zero-based ID is what is used to identify the GPU
on which the HIP_OPS activity took place.
When multiple ranks are used, each rank's first logical device always
has GPU ID 0, regardless of which physical device is selected with
CUDA_VISIBLE_DEVICES. Because of this, when merging trace files from
multiple ranks, GPU IDs from different processes may overlap.
The long term solution is to use the KFD's gpu_id which is stable
across APIs and processes. Unfortunately the gpu_id is not yet exposed
by the ROCr, so for now use the driver's node id.
Change-Id: Ib78854527d600d175bb76e2df0747c33f898c615
- Store last fence scopes and use the last value to determine if we need a cache flush again. This helps cases where hipExtLaunchKernel API is
used.
- Purge code for ROC_EVENT_NO_FLUSH
Change-Id: I531cf9c9c60d5e2b3a9e265d0f52f79ed2fa8a8c
Remove the activity_prof::CallbacksTable. The table was redundant with
the information already stored in the roctracer library. Instead use a
single callback into the roctracer library to query whether the activity
is enabled, and to report it.
Change-Id: I2e05b0881bb4a1953c14361d00ea310d02eb6e0c
Profiling should be enabled for any command reporting activities as the
activity record captures the profilingInfo's start and end timestamps.
Since IS_PROFILER_ON is only used to determine whether API tracing is
enabled, there is no need to expose it globally, it should be a property
of the activity_prof::CallbacksTable.
Change-Id: I44a0d19ed2862606cfbc9a98c1a07a336ab7e26c
The activity_ is only instantiated if profiling is enabled.
Remove the HIP private global record ID. Instead, use the correlation ID
stored in the hip_api_data_t by the profiler while the last HIP function
is in scope.
For NDRange and Copy commands, store the kernel name and byte size
(respectively) in the record.
General cleanups to improve the code's readability.
Change-Id: I01907484b0d9611eb9440c3a7c4865479dc42289
The fix for SWDEV-329789 moved down the last use of the a
command object pointer in order to prevent a race condition.
However, the previous patch did not move down the release of
that command. By releasing the command early, another thread
could get a command with the same pointer. That second thread
could later submit work to the queue using that new command.
The first thread could then perform a comparison against the
queue's last command using its own now-stale pointer. This
could eventually allow the second thread to skip synchornizing
on the queue. This would result in host synchronizations
completing before their device work was actually complete.
Change-Id: I292b7b369743251ceafe453a4c5cae14a6d01046
If for every eventRecord handler is not submitted,
memory is not getting released during hipFree and leads to OOM.
Change-Id: I19b61a0c523502e9e1a3564ce8b791f3e2cea02c
Runtime can reset the last command only if it didn't change
since the query at the beginning of finish()
Change-Id: I629f2d788e9bbaa17ca4e96b1a753f8131e32463
Maintain status of handler callback. For event records we no longer
submit callbacks to reduce the load on the async handler thread. However
without a callback we leak command memory/decrement refcounts. Indicate
status of the handler which we can use to queue a callback when
finish is called.
Change-Id: I89fd02f3d047a0e8162664ee17581a14795f1928
The heap must be cleared once per device, but ROCclr doesn't
create a queue per device in HIP. Hence, the clear operation will
be performed during the first queue creation.
Change-Id: I52ceb06d67d11cde6d019c5ab510059f426a9bfb
Enqueue a handler callback for hipEventRecords(aka marker_ts_) for every
64 submits, This recycles the memory if we dont end up calling
synchronize for the longest time.
Change-Id: I3d39fe76d52a5d81387927edd85b5663b563682c
Adding virtual memory management APIs to rocclr.
The HIP layer will handle virtual allocs on devices.
Change-Id: Ia978f105c2c3fed3959c77580ba228e845105754
- Add a global cache state for a device to indicate scopes of submitted
AQL packets
- Remove scopes for TS marker if hipEventReleaseToDevice is passed. Set
env ROC_EVENT_NO_FLUSH=1 to use NOP AQL for event records.
It would flush caches by default with system scope release.
- Calling finish() should ensure if caches are flushed, if not queue a
marker
Change-Id: Ibbbdbb1cd7ac61cb35649169212142545be159e0