Windows aligns fields to 8 bytes even in 32-bit builds.
Add BUG_CLR_SYSMEM_POOL to control sysmempool.
Change-Id: I8622aabc9f7391ed7dd8583b252ce9eb41d62293
[ROCm/clr commit: 6bb7d1afdc]
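A minimal sketch of the alignment difference noted above, assuming a hypothetical `PoolHeader` struct (the name and fields are illustrative, not taken from the runtime). Marking the 64-bit field with `alignas(8)` pins the layout, so it no longer depends on the platform default for 32-bit builds:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical header; names are illustrative, not from ROCm/clr.
struct PoolHeader {
  std::uint32_t tag;
  // Without alignas, some 32-bit ABIs place this field at offset 4,
  // while Windows places it at offset 8.
  alignas(8) std::uint64_t payload;
};

// Compile-time checks that the layout is the same on every target.
static_assert(offsetof(PoolHeader, payload) == 8, "payload must sit at offset 8");
static_assert(sizeof(PoolHeader) == 16, "layout must match across builds");
```

The static assertions fail the build, rather than corrupting memory at run time, if a compiler lays the struct out differently.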
Applications may submit commands without waiting
for the GPU. That causes unreleased SW commands to accumulate.
Make sure the runtime flushes the SW queue if it grows beyond a
threshold controlled by DEBUG_CLR_MAX_BATCH_SIZE.
Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396
[ROCm/clr commit: 8657a77029]
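The threshold-based flush described above could look roughly like this. It is a sketch only: `SoftwareQueue`, `submit()` and `flush()` are hypothetical names, and the constructor argument stands in for the value read from DEBUG_CLR_MAX_BATCH_SIZE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software queue; names are illustrative, not the runtime's.
class SoftwareQueue {
 public:
  explicit SoftwareQueue(std::size_t max_batch_size) : max_batch_size_(max_batch_size) {
    assert(max_batch_size > 0);
  }

  // Records a command and flushes once the backlog exceeds the threshold.
  void submit(int command) {
    pending_.push_back(command);
    if (pending_.size() > max_batch_size_) flush();
  }

  // Stand-in for waiting on the GPU and releasing completed commands.
  void flush() {
    pending_.clear();
    ++flush_count_;
  }

  std::size_t pending() const { return pending_.size(); }
  std::size_t flush_count() const { return flush_count_; }

 private:
  std::size_t max_batch_size_;
  std::vector<int> pending_;
  std::size_t flush_count_ = 0;
};
```

With a threshold of 4, the fifth `submit()` triggers the first flush, so the backlog stays bounded even if the application never waits.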
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for the CPU status update of the submitted commands
every 50 calls. That allows the runtime to drain the commands and
destroy HSA signals.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
[ROCm/clr commit: 364dfb0ed1]
Ensure the member functions Alloc() and Free() of command_pool_ are not
accessed after command_pool_ is destructed.
Signed-off-by: Chong Li <chongli2@amd.com>
Change-Id: Ic2d36423302518a030bd61fa399290ebe2ed8194
[ROCm/clr commit: e6a5c81221]
hipDeviceSynchronize called from __hipUnregisterFatBinary
accesses static maps and monitors. This change ensures these objects
are not destroyed before __hipUnregisterFatBinary is called.
Additionally, it disables the teardown process for static builds.
Change-Id: I46b58641d60efcf6637a8e99cdd786ffe9e2c77d
[ROCm/clr commit: 9b33db9b24]
- awaitCompletion code may spin-wait endlessly in cases where we
don't submit a handler. One such case is the hipExt*Launch API, which
takes a stop event. There we optimize the stop event by attaching
a signal to the dispatch packet, but don't submit a handler when we attach
the signal. If awaitCompletion() is called after that, we
would keep waiting on the command status on the host rather than simply
checking the signal value.
Change-Id: Ie8bf175aeefa3f9e4299b1ae7ae9108dad67e283
[ROCm/clr commit: 561fb8a459]
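A rough sketch of the completion check described above. `HwSignal`, `Command` and `isComplete()` are hypothetical stand-ins for the runtime's types, and the "1 = busy, 0 = done" signal convention is an assumption made for this example:

```cpp
#include <atomic>

// Hypothetical stand-in for an HSA signal attached to a dispatch packet.
struct HwSignal {
  std::atomic<long> value{1};  // assumed convention: 1 = busy, 0 = done
};

// Hypothetical stand-in for a runtime command.
struct Command {
  std::atomic<bool> host_complete{false};  // set by a handler, if one was submitted
  HwSignal* signal = nullptr;              // signal attached to the dispatch packet
  bool handler_submitted = false;
};

// Returns true once the command is known to be complete.
// Without a handler nobody updates host_complete, so read the signal instead
// of spinning on the host-side status.
inline bool isComplete(const Command& cmd) {
  if (!cmd.handler_submitted && cmd.signal != nullptr) {
    return cmd.signal->value.load(std::memory_order_acquire) == 0;
  }
  return cmd.host_complete.load(std::memory_order_acquire);
}
```

A wait loop built on `isComplete()` terminates as soon as the signal reaches zero, even though the host-side status is never updated for that command.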
Switch commands creation to the new suballocator to avoid
frequent expensive OS calls
Change-Id: I3597c811820e577c15708bad8b8a41aa53acc400
[ROCm/clr commit: 5b0bfdcbad]
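For illustration, a fixed-size suballocator can be sketched as below. This is not the runtime's implementation: `CommandPool`, the slot and chunk sizes, and `chunk_count()` are assumptions; only the `Alloc()`/`Free()` names echo the log. The point is that memory is requested from the system one chunk at a time, while individual commands are served from a free list.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical fixed-size suballocator: allocates large chunks rarely and
// hands out small slots from a free list.
class CommandPool {
 public:
  CommandPool(std::size_t slot_size, std::size_t slots_per_chunk)
      : slot_size_(roundUp(slot_size)), slots_per_chunk_(slots_per_chunk) {
    assert(slot_size > 0 && slots_per_chunk > 0);
  }

  // Returns a slot, growing the pool by one chunk only when it is empty.
  void* Alloc() {
    if (free_list_.empty()) grow();
    void* slot = free_list_.back();
    free_list_.pop_back();
    return slot;
  }

  // Returns a slot to the free list; no system call is involved.
  void Free(void* slot) { free_list_.push_back(slot); }

  std::size_t chunk_count() const { return chunks_.size(); }

 private:
  // Rounds the slot size up so every slot is suitably aligned.
  static std::size_t roundUp(std::size_t n) {
    const std::size_t a = alignof(std::max_align_t);
    return (n + a - 1) / a * a;
  }

  void grow() {
    chunks_.push_back(std::make_unique<unsigned char[]>(slot_size_ * slots_per_chunk_));
    unsigned char* base = chunks_.back().get();
    for (std::size_t i = 0; i < slots_per_chunk_; ++i) {
      free_list_.push_back(base + i * slot_size_);
    }
  }

  std::size_t slot_size_;
  std::size_t slots_per_chunk_;
  std::vector<std::unique_ptr<unsigned char[]>> chunks_;
  std::vector<void*> free_list_;
};
```

A freed slot is reused by the next `Alloc()`, so a steady stream of create/destroy pairs never goes back to the system allocator. The sketch is not thread-safe; a real pool would need a lock or per-thread caches.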
- Create a vector to allow multiple TS to be stored in Command.
- This means we don't wait for the entire batch in the Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If profiling is enabled dynamically, as with Torch, a crash
can happen if hipGraphInstantiate wasn't included in the Torch profile scope,
because we previously stored kernel names only when a profiler was
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
- The correlation_id held random junk values, which we inserted in
the dispatch AQL packet whenever we had a valid timestamp, even when no
profiler was attached.
- Also make sure we don't write the reserved2 field in the AQL
packet at all if no profiler is attached.
Change-Id: Icdb7493198c1bb5e2d786a97e027288660854cd7
[ROCm/clr commit: 9a6ddae7b2]
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- Explicit TS object now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues
Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f
[ROCm/clr commit: 40f41f4d0b]
- Refactor code and cleanup logic for callback saving for event records
Change-Id: I5c56aa8e9c968a5bca70fb07ad1796da318e9e89
[ROCm/clr commit: 1338ff37e8]
Remove the activity_prof::CallbacksTable. The table was redundant with
the information already stored in the roctracer library. Instead use a
single callback into the roctracer library to query whether the activity
is enabled, and to report it.
Change-Id: I2e05b0881bb4a1953c14361d00ea310d02eb6e0c
[ROCm/clr commit: 52eb28930a]
Profiling should be enabled for any command reporting activities as the
activity record captures the profilingInfo's start and end timestamps.
Since IS_PROFILER_ON is only used to determine whether API tracing is
enabled, there is no need to expose it globally; it should be a property
of the activity_prof::CallbacksTable.
Change-Id: I44a0d19ed2862606cfbc9a98c1a07a336ab7e26c
[ROCm/clr commit: e713b5c7d0]
The activity_ is only instantiated if profiling is enabled.
Remove the HIP private global record ID. Instead, use the correlation ID
stored in the hip_api_data_t by the profiler while the last HIP function
is in scope.
For NDRange and Copy commands, store the kernel name and byte size
(respectively) in the record.
General cleanups to improve the code's readability.
Change-Id: I01907484b0d9611eb9440c3a7c4865479dc42289
[ROCm/clr commit: 4fbae91468]
If a handler is not submitted for every eventRecord,
memory is not released during hipFree, which leads to OOM.
Change-Id: I19b61a0c523502e9e1a3564ce8b791f3e2cea02c
[ROCm/clr commit: 7b1c6d06d5]
Set release scope as system for dispatch AQL when events are passed to
hip*LaunchKernelGGL*
Change-Id: I93b91591e0ab023f1ecc5247f7905eca26147358
[ROCm/clr commit: 02566677cf]
Enqueue a handler callback for hipEventRecords (aka marker_ts_) every
64 submits. This recycles the memory if we don't end up calling
synchronize for a long time.
Change-Id: I3d39fe76d52a5d81387927edd85b5663b563682c
[ROCm/clr commit: fa76f03654]
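The "every 64 submits" rule above can be sketched with a simple counter. `MarkerThrottle` and `shouldEnqueueHandler()` are hypothetical names used only for this example:

```cpp
#include <cstdint>

// Hypothetical helper: decides which event-record markers get a handler.
class MarkerThrottle {
 public:
  explicit MarkerThrottle(std::uint32_t interval) : interval_(interval) {}

  // Returns true on every interval-th submit, false otherwise.
  bool shouldEnqueueHandler() {
    if (++count_ < interval_) return false;
    count_ = 0;
    return true;
  }

 private:
  std::uint32_t interval_;
  std::uint32_t count_ = 0;
};
```

With an interval of 64, submits 64, 128, 192 and so on carry a handler, which bounds how long completed markers can hold on to memory without an explicit synchronize.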
- Add a global cache state for a device to indicate scopes of submitted
AQL packets
- Remove scopes for TS marker if hipEventReleaseToDevice is passed. Set
env ROC_EVENT_NO_FLUSH=1 to use NOP AQL for event records.
By default, event records flush caches with a system-scope release.
- finish() should check whether caches are flushed and, if not, queue a
marker
Change-Id: Ibbbdbb1cd7ac61cb35649169212142545be159e0
[ROCm/clr commit: 8eeaa998c0]
- Queue handler for hipEventRecord(aka marker_ts_) only if there is a
callback associated with it.
Change-Id: I8a9877ae0e342556053abbaacc9510744a8e772a
[ROCm/clr commit: 3c3c0ca4c5]
The optimization is controlled with ROCR_SKIP_KERNEL_ARG_COPY.
This is an initial check-in for experiments. Extra changes are
necessary for full support:
- handle graph capture with the original sysmem alloc
- avoid memobject references, otherwise there is a race condition with
reuse of the arg buffer
- Remove arg setup from hip
Change-Id: Ib0af710f93e79834711fa4049a7c66093711e68b
[ROCm/clr commit: 7e12cf6318]
std::mem_fun() and std::bind2nd() were removed in C++17. Switch to
simpler logic that does not require those functions.
Change-Id: I19a31f076e1813e367615bd377b424046ce144c7
[ROCm/clr commit: d934612948]
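As an illustration of this kind of replacement (the `Listener` type and `notifyAll()` are hypothetical, not the code touched by the commit), a `std::bind2nd(std::mem_fun(...), arg)` call can usually be rewritten as a plain range-based loop:

```cpp
#include <vector>

// Hypothetical type with a one-argument member function.
struct Listener {
  int last = 0;
  void notify(int status) { last = status; }
};

// Pre-C++17 form, no longer available in C++17:
//   std::for_each(v.begin(), v.end(),
//                 std::bind2nd(std::mem_fun(&Listener::notify), status));
// Equivalent logic without the removed adaptors:
inline void notifyAll(const std::vector<Listener*>& listeners, int status) {
  for (Listener* listener : listeners) {
    listener->notify(status);
  }
}
```

The loop needs no `<functional>` adaptors and compiles under every language standard from C++11 onward.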
Add ref counting to ProfilingSignal class to track the last release.
If a signal was used in the marker, then don't reuse it,
but create a new one for internal usage.
Don't rely on HSA callback for the command status update if there
are no pending dispatches.
Change-Id: I19f14ed9d80acfe79993b343b2187635f8428a20
[ROCm/clr commit: ff15c0893e]
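A minimal sketch of reference counting that reports the last release, assuming a hypothetical `ProfilingSignalSketch` class (the real ProfilingSignal has other members and may count differently):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical ref-counted signal wrapper; not the runtime's class.
class ProfilingSignalSketch {
 public:
  // Takes an additional reference, e.g. when a marker starts using the signal.
  void retain() { refs_.fetch_add(1, std::memory_order_relaxed); }

  // Drops one reference. Returns true when this call dropped the last one,
  // which is the point where the signal may be recycled or destroyed.
  bool release() {
    const int previous = refs_.fetch_sub(1, std::memory_order_acq_rel);
    assert(previous > 0);
    return previous == 1;
  }

  int referenceCount() const { return refs_.load(std::memory_order_acquire); }

 private:
  std::atomic<int> refs_{1};  // the creator holds the first reference
};
```

Because `release()` returns true for exactly one caller, only that caller decides whether the signal is reused or replaced by a new one.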
The HSA signal callback may occur during the actual marker submit. That
may cause a deadlock because of the shared lock_ object. Create the new
notify_lock_ field to protect the notification.
Change-Id: I9752af84e59895530620fac3932c6fc276de8658
[ROCm/clr commit: f34c1b9ff8]
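The deadlock and its fix can be sketched as follows. `MarkerSketch` is hypothetical, and calling the callback directly from `submit()` is a simplification that models the callback firing while the submit path still holds `lock_`:

```cpp
#include <mutex>

// Hypothetical marker object with separate submit and notify locks.
class MarkerSketch {
 public:
  // Submission path: holds lock_ for the whole submit. The callback may fire
  // before submit() returns, modeled here by invoking it directly.
  void submit() {
    std::lock_guard<std::mutex> guard(lock_);
    submitted_ = true;
    onSignalCallback();  // would deadlock if it also tried to take lock_
  }

  // Notification path: protected by its own mutex, never by lock_.
  void onSignalCallback() {
    std::lock_guard<std::mutex> guard(notify_lock_);
    ++notifications_;
  }

  int notifications() {
    std::lock_guard<std::mutex> guard(notify_lock_);
    return notifications_;
  }

 private:
  std::mutex lock_;         // protects submission state
  std::mutex notify_lock_;  // protects notification state
  bool submitted_ = false;
  int notifications_ = 0;
};
```

Since the two paths never contend for the same mutex, a callback that arrives in the middle of a submit completes instead of blocking.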
The runtime can't assign internal HSA signals to HIP events, because
the HIP application can destroy the HIP stream, or signal reuse may
occur internally. Switch to global HSA signals for HIP events.
Change-Id: Ieaea2d6b039e492b2e7c5112782a8f4e601e50a1
[ROCm/clr commit: ce8dad2ecc]
We do not want to release resources during setStatus in HIP because of Graphs
Change-Id: Idc7b188ab5f8be6975ea91005dd2bbf177401f8c
[ROCm/clr commit: 133287f31f]
If an AMD event contains a reference to a HW event, then the runtime
can check or wait for the HW event. The CPU status update will occur later,
after the HSA signal callback, but it's not important for the result.
Change-Id: I591391a953bbdba6a25ac07e2cd98aeb17cd4596
[ROCm/clr commit: 85c70a7495]