- Updated kernel arg manager to support allocating kernel args on multiple devices for a single graph.
- Updated AQL path to capture on the device where the graph node is added.
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
* SWDEV-550626 - Refactor atomics header and tests
1. Introduce __HIP_ATOMIC_BACKWARD_COMPAT.
By default we define __HIP_ATOMIC_BACKWARD_COMPAT=1 so that
HIP atomic functions keep the old assumptions. Users who
want the new behavior, which assumes no-fine-grained and
no-remote memory by default, can define
__HIP_ATOMIC_BACKWARD_COMPAT=0.
2. Use __HIP_ATOMIC_BACKWARD_COMPAT_MEMORY to replace the
original __HIP_FINE_GRAINED_MEMORY in the atomic header,
and apply __HIP_FINE_GRAINED_MEMORY to all
atomicXXX_system() functions to prevent failures on memory
allocated by hipHostMalloc().
3. Replace HIP_TEST_FINE_GRAINED_MEMORY with
HIP_TEST_ATOMIC_BACKWARD_COMPAT_MEMORY in hip-tests.
4. Fix negative test errors.
Fix managed memory test error on memory order.
Make some other minor changes.
As a result, all originally disabled tests are enabled.
5. Add more atomics tests in some cases.
6. Reduce test time in each case.
Reduce iteration number to 1 for tests that cost too much time.
7. Move common code into hip_test_common.hh
Return error when ext_fine_grain_pool is unavailable for
hipHostMallocUncached, hipHostAllocUncached and
hipExtHostRegisterUncached.
Disable related tests on Navi4x, where
ext_fine_grain_pool is unavailable.
Previously we enabled dynamic loading mode whenever BUILD_SHARED_LIBS was set, but this is not correct. We should only load dynamically if the amd_comgr library itself is shared.
Background: we have a configuration where we use a statically linked comgr stub in order to achieve LLVM isolation (it dynamically loads the comgr and compiler into a dedicated link namespace) in an otherwise dynamically linked clr.
hipMemPtrGetInfo was returning the error hipErrorInvalidValue if it
was called on a nullptr. However, this does not match the malloc
convention where a nullptr has size zero; for example,
malloc_usable_size() returns zero if called on a nullptr.
This commit changes hipMemPtrGetInfo to set the size to zero and
return hipSuccess when called with a nullptr. (This also fits with
hipMalloc and hipFree usage, since hipMalloc of size zero results in a
nullptr, and hipFree of a nullptr is successful.)
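
The nullptr convention described above can be sketched without a GPU. This is only an illustration, not the runtime's implementation: `DemoError`, `DemoAllocTable` and `DemoMemPtrGetInfo` are hypothetical names, and treating an unknown non-null pointer as an error is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-ins for hipError_t values; not the real HIP enum.
enum DemoError { kDemoSuccess = 0, kDemoInvalidValue = 1 };

// Hypothetical allocation table mapping base pointers to allocation sizes.
using DemoAllocTable = std::map<const void*, std::size_t>;

// Follows the malloc-style convention: a nullptr has size zero and is not an error.
inline DemoError DemoMemPtrGetInfo(const DemoAllocTable& table, const void* ptr,
                                   std::size_t* size) {
  assert(size != nullptr);  // the sketch requires a valid output pointer
  if (ptr == nullptr) {
    *size = 0;
    return kDemoSuccess;
  }
  auto it = table.find(ptr);
  if (it == table.end()) {
    return kDemoInvalidValue;  // assumption: unknown pointers remain an error
  }
  *size = it->second;
  return kDemoSuccess;
}
```

With this shape, a caller can query any pointer that came back from an allocation call, including the nullptr produced by a zero-size allocation, without special-casing it.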
The build of the ROCR backend will be enabled by default on Windows.
It requires the DLL loader until the ROCR DLL is always available on Windows for every configuration.
* Lay the foundation to batch packets efficiently for graphs.
* Dynamically copy packets up to a max threshold set with
  DEBUG_HIP_GRAPH_BATCH_SIZE; otherwise stagger the packet copy in
  power-of-two sizes.
* The default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256.
* If timestamps (TS) are not collected for a signal for reuse, create a new signal.
  This can potentially increase the signal footprint if the handler doesn't run
  fast enough.
During process tear-down we wait on all signals before releasing them:

    VirtualGPU::HwQueueTracker::~HwQueueTracker() {
      for (auto& signal : signal_list_) {
        CpuWaitForSignal(signal);
        signal->release();
      }
      [...]
    }

In the case where we exit the process after a GPU error that did not
cause an abort (ulimit -c == 0), waiting for the signal can be skipped.
With the device in the error state, no progress is made, and the signal
is probably never going to be modified again:

    inline bool WaitForSignal(hsa_signal_t signal, bool active_wait = false, bool yield = false) {
      [...]
      if (HIP_SKIP_ABORT_ON_GPU_ERROR && amd::Device::IsGPUInError()) {
        ClPrint(amd::LOG_ERROR, amd::LOG_SIG,
                "Device not Stable, while waiting for Signal ="
                "(0x%lx) for %d ns",
                signal.handle, kTimeout4Secs);
        return true;
      }
      [...]
    }

However, after calling CpuWaitForSignal, the call to "release" can
end up in a signal dtor which also tries to wait on the signal. Because
the GPU is in the error state, we never receive the signal, and the
process hangs during tear-down. This happens with the ProfilingSignal dtor:

    ProfilingSignal::~ProfilingSignal() {
      if (signal_.handle != 0) {
        if (hsa_signal_load_relaxed(signal_) > 0) {
          LogError("Runtime shouldn't destroy a signal that is still busy!");
          if (hsa_signal_wait_scacquire(signal_, HSA_SIGNAL_CONDITION_LT, kInitSignalValueOne,
                                        kUnlimitedWait, HSA_WAIT_STATE_BLOCKED) != 0) {
          }
        }
        hsa_signal_destroy(signal_);
      }
    }

This dtor should check that the GPU is not in the error state before
trying to wait, which is what this patch implements.
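
The guard described above can be sketched with mocks in place of the HSA runtime. Everything here is hypothetical and for illustration only: `g_demo_gpu_in_error` stands in for amd::Device::IsGPUInError(), and `DemoSignal`, `DemoWait` and `DemoProfilingSignal` are not the real types. The actual patch may be structured differently.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for amd::Device::IsGPUInError().
std::atomic<bool> g_demo_gpu_in_error{false};

// Hypothetical mock of a signal: a positive value means "still busy".
struct DemoSignal {
  long value = 0;
  int waits = 0;           // counts how many waits were attempted
  bool destroyed = false;  // set once the owner has released the signal
};

// Stand-in for a blocking wait. The mock completes the signal immediately;
// with a real device in the error state such a wait would never return.
inline void DemoWait(DemoSignal& s) {
  ++s.waits;
  s.value = 0;
}

class DemoProfilingSignal {
 public:
  explicit DemoProfilingSignal(DemoSignal* s) : signal_(s) { assert(s != nullptr); }

  ~DemoProfilingSignal() {
    // Only wait on a busy signal while the device can still make progress.
    if (signal_->value > 0 && !g_demo_gpu_in_error.load()) {
      DemoWait(*signal_);
    }
    signal_->destroyed = true;  // destruction happens in both cases
  }

 private:
  DemoSignal* signal_;
};
```

The point of the sketch is the ordering: the error-state check comes before the wait, while the destroy step is unconditional, so tear-down always completes.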
Bug: SWDEV-555043
Bug: SWDEV-553435
Bug: SWDEV-553679
Bug: SWDEV-555119
* SWDEV-541623 - CUDA parity for hipLaunchCooperativeKernelMultiDevice and hipExtLaunchMultiKernelMultiDevice
when numDevices does not match the system devices
* SWDEV-541623 - Enable Unit_hipExtLaunchMultiKernelMultiDevice_Negative_MultiKernelSameDevice
---------
Co-authored-by: agunashe <ajay.gunashekar@amd.com>
* SWDEV-548838 Add local and global fence support for barrier function
The original barrier function did not distinguish between local and global scope. There was only __CLK_LOCAL_MEM_FENCE, which triggered both local and global fences. This commit introduces __CLK_LOCAL_MEM_FENCE and __CLK_GLOBAL_MEM_FENCE, which properly distinguish the scopes.
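
As a rough host-side illustration of the difference, the sketch below decodes a fence-flag bitmask the old way and the new way. The flag values and the names `kDemoLocalMemFence`, `kDemoGlobalMemFence`, `DemoLegacyScopes` and `DemoScopes` are assumptions made for the example; the real values come from the device headers.

```cpp
#include <cassert>

// Hypothetical flag values for illustration; the real values live in the headers.
constexpr unsigned kDemoLocalMemFence = 1u << 0;
constexpr unsigned kDemoGlobalMemFence = 1u << 1;

struct DemoFenceScopes {
  bool local;
  bool global;
};

// Old behavior as described above: any fence request fenced both scopes.
inline DemoFenceScopes DemoLegacyScopes(unsigned flags) {
  const bool any = flags != 0;
  return DemoFenceScopes{any, any};
}

// New behavior: each flag selects only its own scope.
inline DemoFenceScopes DemoScopes(unsigned flags) {
  // Reject bits the sketch does not know about.
  assert((flags & ~(kDemoLocalMemFence | kDemoGlobalMemFence)) == 0);
  return DemoFenceScopes{(flags & kDemoLocalMemFence) != 0,
                         (flags & kDemoGlobalMemFence) != 0};
}
```

A caller that only needs local-memory ordering can then pass the local flag alone and avoid paying for a global fence.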
---------
Co-authored-by: Tim <Tim.Gu@Amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Tim Gu <timgu102@amd.com>
* Fix grid_group::group_dim to return grid_dim and not block_dim
* Add unit test for grid_group.group_dim()
* Fix unit test errors
* Skip group_dim() assertions for base_type test
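
The group_dim fix can be pictured with a small host-side mock. `DemoDim3`, `DemoGridGroup` and `DemoMakeGrid` are hypothetical names; the real grid_group reads its dimensions from the launch configuration on the device.

```cpp
#include <cassert>

struct DemoDim3 {
  unsigned x, y, z;
};

// Hypothetical mock of a grid-wide cooperative group.
struct DemoGridGroup {
  DemoDim3 grid_dim;   // number of blocks in the grid
  DemoDim3 block_dim;  // number of threads per block

  // For a grid-level group the members are blocks, so group_dim()
  // reports the grid dimensions rather than the block dimensions.
  DemoDim3 group_dim() const { return grid_dim; }
};

// Builds the mock from a launch configuration, rejecting empty dimensions.
inline DemoGridGroup DemoMakeGrid(DemoDim3 grid, DemoDim3 block) {
  assert(grid.x > 0 && grid.y > 0 && grid.z > 0);
  assert(block.x > 0 && block.y > 0 && block.z > 0);
  return DemoGridGroup{grid, block};
}
```

A unit test along the lines of the one added here would launch with distinct grid and block dimensions so that returning block_dim by mistake is detectable.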
- Clean up and standardization of MIT licenses after discussion with legal team.
- Update README.md with blurb for top-level files.
- MIT License explicitly mentioned for relevant projects.
- Removal of years.
- Copyright attribution should be to `Advanced Micro Devices, Inc.` and not `AMD ROCm(TM) Software`
- Removal of `All rights reserved.`
- Reduce line width of the text for readability.
- Add clear visual separators for additional licenses.
- Convert text files to markdown format for aforementioned separators.
- Update build scripts to point to renamed files.
- Fixed SMI doc references
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
- The graph nodes have been updated to capture the device ID from the capture stream or the current device when explicitly added.
- Update the device ID for the memcpy node, ensuring that the device where the memory is allocated is taken into account for H2D and D2H pinned operations.
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
* SWDEV-551080
* Fix the condition for taking the shader path; the size check was moved
incorrectly
* Also account for a bitmask returned for preferred engines
- Refactor deviceLocalAlloc arguments
- Refactor hostAlloc code, have cleaner interface
- The kernel args buffer needs to have the execute flag set, as CP enforces this on
certain newer HW.