- NULL stream was waiting for itself to be empty before each command.
- Force "blocking" streams to wait for the NULL stream to empty. This was
missing before.
- async copy was disabling itself via trueAsync=false for common cases.
Refactor:
- rename _null_stream to _default_stream.
- move some of the null-sync functions to defaultSync and make them
device member functions.
- Move staging buffer locks inside the staging buffer code.
- Remove the dedicated per-device completion_signal and per-device lock;
instead allocate signals from the per-stream pool. This eliminates
the lock and allows more concurrency.
- Remove switch HIP_DISABLE_BIDIR_MEMCPY.
- refactor staging buffer to operate on hsa* data structures not
hc::accelerator.
- use hsa_memory_allocate to allocate staging buffers rather than
am_alloc.
- Refactor device reset with single member function. Don't reallocate
staging buffers on reset.
- Properly track dependencies based on command type. Add new deps for
H2D and D2D rather than overloading H2D.
- Control debug tracing with HIP_DB=mask (env var). See src/hip_hcc.cpp
for mask values:
#define DB_API 0 /* 0x01 - shortcut to enable HIP_TRACE_API on single switch */
#define DB_SYNC 1 /* 0x02 - trace synchronization pieces */
#define DB_MEM 2 /* 0x04 - trace memory allocation / deallocation */
#define DB_COPY1 3 /* 0x08 - trace memory copy commands */
#define DB_SIGNAL 4 /* 0x10 - trace signal pool commands */
- Combine with HIP_TRACE to see debug with API trace.
- Use colors to distinguish different flows of debug.
- Add define COMPILE_DB_TRACE to allow removing all debug at compile-time
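
The relation between the DB_* bit positions and the HIP_DB mask can be sketched as below. This is a minimal illustration, not the code in src/hip_hcc.cpp: the helpers parseDbMask, dbEnabled and dbMaskFromEnv are hypothetical names, and accepting both decimal and 0x-prefixed hex is an assumption about how the variable is parsed.

```cpp
#include <cassert>
#include <cstdlib>

// Bit positions as listed in the commit message above.
#define DB_API    0 /* 0x01 */
#define DB_SYNC   1 /* 0x02 */
#define DB_MEM    2 /* 0x04 */
#define DB_COPY1  3 /* 0x08 */
#define DB_SIGNAL 4 /* 0x10 */

// Hypothetical helper: turn the text of HIP_DB into a bit mask.
// Base 0 lets strtoul accept decimal ("10") or hex ("0x0a"); this
// parsing rule is an assumption. An unset variable yields mask 0.
inline unsigned parseDbMask(const char* text) {
    if (text == nullptr) return 0u;
    return static_cast<unsigned>(std::strtoul(text, nullptr, 0));
}

// Hypothetical helper: true if the trace category at bit position
// `bit` (one of the DB_* values) is enabled in `mask`.
inline bool dbEnabled(unsigned mask, int bit) {
    assert(bit >= 0 && bit < 32);
    return (mask & (1u << bit)) != 0u;
}

// Hypothetical helper: read the mask from the environment.
inline unsigned dbMaskFromEnv() {
    return parseDbMask(std::getenv("HIP_DB"));
}
```

Under these assumptions, HIP_DB=0x0a would enable DB_SYNC (0x02) and DB_COPY1 (0x08) and leave the other categories off.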
With this change, the CPU memcpy waits until all commands in the current stream are done.
Note that it waits only on the current stream, not on other streams.
Add dependency for host-to-host copy.
Add debug mode for HIP_DISABLE_HW_COPY_DEP and
HIP_DISABLE_HW_KERNEL_DEP - setting these to -1 now ignores
all dependencies.
hipEnvVar is the base test case, called by hipEnvVarDriver
at run time.
The test case covers normal use of the environment variable,
invalid values/sequences, and the use of CUDA_VISIBLE_DEVICES as an
alternative.
- Add chicken bits to use host-side dependency management.
- Add optional PinInPlace path for unpinned copies
- Synchronize before pinned memcpy path.
- Add mutex to protect two threads launching to same stream.
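
The idea of guarding a stream with a mutex can be sketched as follows. SketchStream is a hypothetical stand-in, not the HIP stream type, and the queue of integer command ids is an assumption made only to keep the example small; the sketch shows the locking pattern, not the actual launch path.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a stream; not the HIP implementation.
class SketchStream {
public:
    // Each launch takes the stream lock, so two threads enqueueing to
    // the same stream cannot interleave inside the critical section.
    void enqueue(int commandId) {
        std::lock_guard<std::mutex> guard(mutex_);
        queue_.push_back(commandId);
    }

    // Number of commands enqueued so far, read under the same lock.
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return queue_.size();
    }

private:
    mutable std::mutex mutex_;  // protects queue_
    std::vector<int> queue_;    // assumed command queue (ids only)
};
```

Any number of threads may call enqueue on the same SketchStream; the lock serializes the updates to the queue.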