* add additional runtime checks and gfx1201 fix
This commit contains three fixes:
- increase the max. number of files at the beginning of the run to the
max. allowed by the system
- check for large BAR support. We do not abort if it is not available,
but print a warning.
- for gfx1201, do not use uncached memory at the moment.
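The first fix above (raising the open-file limit) can be sketched as follows. This is only an illustration, assuming that "max. allowed by the system" means the hard limit of RLIMIT_NOFILE; the helper name raise_open_file_limit is hypothetical and not taken from the commit.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_NOFILE
#include <cstdio>          // std::perror

// Hypothetical helper (name not from the commit): raise the soft limit on
// open file descriptors to the hard limit. Returns true if the soft limit
// equals the hard limit afterwards.
bool raise_open_file_limit() {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
    std::perror("getrlimit");
    return false;
  }
  if (lim.rlim_cur == lim.rlim_max) {
    return true;  // already at the maximum, nothing to do
  }
  lim.rlim_cur = lim.rlim_max;  // soft limit may go up to the hard limit
  if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
    std::perror("setrlimit");  // report only; the caller may continue
    return false;
  }
  return true;
}
```

Calling this once at the beginning of the run, before any files or sockets are opened, matches the ordering described in the commit message.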
* Change get_arch_name to return const char*
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix C++ new syntax
not sure how it compiled before
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* use snprintf instead of strncpy
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* destructor cleanup
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add const keyword
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Add put to all pes from all lanes concurrently
* Remove wg_init, use size_t for size params, 64bit data exchange (more
bits for verification masking)
* Rename to flood-test, add put,putnbi,p,get,getnbi,g variants, count time
correctly
* Add flood tester to the testing script
* add to the gda test cases, without the _g variant, which is not implemented.
[ROCm/rocshmem commit: cca7872bcf]
- use the reduce_psync buffers for synchronization in allreduce, not the
barrier_psync.
- execute a wwg barrier after the allreduce operation. After internal
discussion it was determined that it is required for correctness.
[ROCm/rocshmem commit: 6f512e92a5]
* byteswap<T> returns by value
* replace hand-rolled implementations with Clang __builtin_bswap<N> intrinsics
* new high-level interface endian::to_be, endian::from_be, etc. to indicate conversion direction
[ROCm/rocshmem commit: cf8b72a047]
* Error out when IPC gets selected when it is impossible to run it.
* Use RTLD_LAZY when dlopening
* Do not dlclose libbnxt/ionic/mlx5.so as that breaks libibverbs
[ROCm/rocshmem commit: 47f6fa6267]
* util: dlsym optional helper
Like DLSYM_HELPER, but does not return if the symbol is not found.
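One way to picture an optional lookup is sketched below. DLSYM_HELPER itself is not shown in this log, so the function template dlsym_optional is a hypothetical stand-in, not the macro added by the commit: it leaves the pointer null when the symbol is missing and lets the caller decide what to do.

```cpp
#include <dlfcn.h>   // dlopen, dlsym, dlerror
#include <cstddef>   // std::size_t

// Hypothetical optional lookup (not the macro from the commit): returns
// nullptr when the symbol is missing instead of bailing out of the caller.
template <typename Fn>
Fn dlsym_optional(void* handle, const char* name) {
  dlerror();  // clear any stale error state
  void* sym = dlsym(handle, name);
  return reinterpret_cast<Fn>(sym);  // nullptr if the symbol was not found
}
```

A caller would then guard the optional feature with a null check, for example enabling a fast path only if the returned pointer is non-null. On older glibc versions this needs to be linked with -ldl.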
Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
* gda ionic: sync dv and fw headers
Sync dv and fw headers to match out-of-tree libionic and firmware.
Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
* gda ionic: collapsed cqe
Detect and enable collapsed cqe if supported by drivers and firmware.
Fall back to regular completion queue.
Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
---------
Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
[ROCm/rocshmem commit: 1494c24f9a]
* fix reduction for gfx942 and 1201
match the synchronization of internal_putmem_wg and internal_getmem_wg
to their non-internal counterparts. the internal_putmem_wg is used in
the ipc reduction
* move specialization to internal_putmem
[ROCm/rocshmem commit: 8d2504d6c1]
it fails in about 50% of the cases. Will revisit later why it fails,
but RO is at the moment lower priority, so disabling the test for now.
[ROCm/rocshmem commit: ed2f75f1de]
* bnxt_post_wqe_amo_single with fetching = true would return
before releasing the send queue lock, resulting in a deadlock.
* Release the send queue lock before returning from the function.
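The pattern behind this fix can be illustrated with a host-only analogue. The types SpinLock and SendQueue and the function post_amo are hypothetical and only mirror the control flow described above; the actual bnxt_post_wqe_amo_single is device code and is not reproduced here.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical spin lock standing in for the send queue lock.
struct SpinLock {
  std::atomic<bool> held{false};
  void lock() {
    while (held.exchange(true, std::memory_order_acquire)) {
      // spin until the previous holder releases the lock
    }
  }
  void unlock() { held.store(false, std::memory_order_release); }
};

// Hypothetical send queue: a lock plus a counter of posted entries.
struct SendQueue {
  SpinLock lock;
  uint64_t posted = 0;
};

// The bug described above has the shape
//   lock(); post(); if (fetching) return value;  /* lock never released */
//   unlock(); return 0;
// The fixed shape releases the lock on every path before returning.
uint64_t post_amo(SendQueue& sq, bool fetching) {
  sq.lock.lock();
  const uint64_t slot = sq.posted++;           // work done under the lock
  const uint64_t result = fetching ? slot : 0;
  sq.lock.unlock();                            // release before any return
  return result;
}
```

With the early return removed, a second fetching call on the same queue acquires the lock normally instead of spinning forever.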
[ROCm/rocshmem commit: 016e08120a]
* gda: validate and exit cleanly when forced GDA config is invalid
* ro: validate and exit cleanly when forced RO config is invalid
[ROCm/rocshmem commit: 80a710ac0a]
Changed rocshmem_get_device_ctx() to properly copy the full rocshmem_ctx_t
structure and return only the ctx_opaque pointer instead of trying to
copy directly to a void pointer.
The prior implementation could cause undefined behavior or memory corruption
because it copied 16 bytes of data into an 8-byte destination. It worked so far because the
ctx_opaque field is at the proper offset, but the incorrect memcpy would overwrite other
allocations and cause issues.
This fixes the context memory handling when passing device context from
host to device kernels.
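The difference between the two copies can be sketched on the host as follows. The struct example_ctx_t and its second field are assumptions chosen to give a 16-byte structure on a 64-bit platform; only the ctx_opaque field name comes from the commit message, and the real code copies from device-visible memory rather than through a plain pointer.

```cpp
#include <cstring>

// Assumed layout for illustration: two pointers, 16 bytes on 64-bit hosts.
// Only the ctx_opaque name appears in the commit message.
struct example_ctx_t {
  void* ctx_opaque;
  void* team_opaque;  // hypothetical second field
};

// Broken shape (do not do this): the destination is a single pointer, so
// copying sizeof(example_ctx_t) bytes writes past the end of it.
//   void* out;
//   std::memcpy(&out, src, sizeof(example_ctx_t));

// Fixed shape: copy the whole structure into a destination of the same
// size, then hand back only the field the caller needs.
void* get_device_ctx(const example_ctx_t* src) {
  example_ctx_t tmp;
  std::memcpy(&tmp, src, sizeof(tmp));
  return tmp.ctx_opaque;
}
```

The broken variant appears to work because ctx_opaque is the first field, which is why the overflow went unnoticed until it clobbered a neighboring allocation.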
[ROCm/rocshmem commit: cf6a53e81c]
* Add functional test for barrier_all_on_stream
* Add rocshmem_barrier_all_on_stream support for GDA and RO backends
Implements rocshmem_barrier_all_on_stream operation for
GPU Direct Access and Reverse Offload backends.
Previously, rocshmem_barrier_all_on_stream was only supported for IPC backend.
* Add functional test for rocshmem_broadcastmem_on_stream
* Add host-side rocshmem_broadcastmem_on_stream API
Implement stream-based broadcast collective operation
- Add rocshmem_broadcastmem_on_stream host API and kernel implementation
- Add functional test TeamBroadcastmemOnStreamTester with multi-stream
support and correctness verification
- Use per-workgroup contexts to avoid contention across parallel streams
API:
rocshmem_broadcastmem_on_stream(team, dest, source, nelems, pe_root, stream)
* Add functional test for rocshmem_getmem_on_stream
* Add host-side rocshmem_getmem_on_stream API
Implement stream-based point-to-point RMA get operation
- Add rocshmem_getmem_on_stream host API and kernel implementation
- Support for asynchronous getmem operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group collective getmem for efficient memory transfer
API:
rocshmem_getmem_on_stream(dest, source, nelems, pe, stream)
(AI Assist)
* Add host-side rocshmem_putmem_on_stream API
- Add rocshmem_putmem_on_stream for asynchronous remote writes
- Support for concurrent RMA operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group device collective operation
API:
rocshmem_putmem_on_stream(dest, source, bytes, pe, stream)
(AI Assist)
* Add functional test for rocshmem_putmem_on_stream
* Add host-side rocshmem_putmem_signal_on_stream API
Enables asynchronous putmem operations with signaling on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_putmem_signal_kernel
- Host interface putmem_signal_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Public API
Function signature:
void rocshmem_putmem_signal_on_stream(void *dest, const void *source,
size_t bytes, uint64_t *sig_addr,
uint64_t signal, int sig_op,
int pe, hipStream_t stream);
* Add functional test for rocshmem_putmem_signal_on_stream
* Add host-side rocshmem_signal_wait_until_on_stream API
Enables asynchronous signal wait operations on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_signal_wait_until_kernel
- Host interface signal_wait_until_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Native uint64_t support in wait_until API (generated from P2P_SYNC.py)
Function signature:
void rocshmem_signal_wait_until_on_stream(uint64_t *sig_addr, int cmp,
uint64_t cmp_value,
hipStream_t stream);
(AI Assist)
* Add functional test for rocshmem_signal_wait_until_on_stream
* Add documentation for stream API functions
This commit adds API documentation for the following host-side
stream functions:
- rocshmem_barrier_all_on_stream (collective routines)
- rocshmem_broadcastmem_on_stream (collective routines)
- rocshmem_getmem_on_stream (RMA operations)
- rocshmem_putmem_on_stream (RMA operations)
- rocshmem_putmem_signal_on_stream (signaling operations)
- rocshmem_signal_wait_until_on_stream (point-to-point sync)
The documentation includes function signatures, parameter descriptions,
and detailed explanations of asynchronous behavior and stream handling.
(AI Assist)
* Rename "bytes" -> "nelems"
* Add "_TEST_" to the variables used in tests
* Remove incorrect hipStreamDefault usage
hipStreamDefault is not the default stream; it is a stream-creation flag.
If stream == nullptr, pass it to the kernel launch as-is; the kernel is then launched on the default stream.
[ROCm/rocshmem commit: d0c8380650]
* Let functional tests build without external MPI
* Fix error conditions when using uuid startup with internal MPI
* Do not abort if libibverbs is not found but not using GDA
* Enabled RO functional test initialized with TEST_UUID
* Reduce load time for ro backend_can_run and prevent mpilib_dlclose
crashing
* Fix case TEST_UUID=1, ROCSHMEM_BACKEND='' (autoloading gda)
[ROCm/rocshmem commit: c99bc21e10]
* reenable gfx1100
use the modified version of the flat_store_short assembly instruction as suggested by the compiler team (32bit input value instead of 16bit)
* add fix for gfx1201
add the same fix for gfx1201 that was introduced for gfx1100
[ROCm/rocshmem commit: 224c969bef]
* Add host-side rocshmem_alltoallmem_on_stream function
Function signature:
rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void *dest,
const void *source, size_t size,
hipStream_t stream)
- The function launches rocshmem_alltoallmem_kernel which calls
device-side alltoall<char> workgroup collective through default context.
- Uses dynamic block size determination via occupancy API.
- Implemented for all backends.
* Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends
When allocating memory for alltoall_pSync_pool in setup_teams() and
teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE
instead of ROCSHMEM_ALLTOALL_SYNC_SIZE.
* Add functional test for team_alltoallmem_on_stream
This commit adds a new functional test to verify the correctness of
the host-side rocshmem_team_alltoallmem_on_stream API.
* Add documentation for rocshmem_alltoallmem_on_stream
This commit adds API documentation for the host-side
rocshmem_alltoallmem_on_stream function in the collective routines
section. The documentation includes:
[ROCm/rocshmem commit: 5577feb70d]
* gda: add check for active interfaces when selecting the GDA backend
* fix __func__ macro in rocshmem_ctx_pe_quiet
* gda: switch to more generic RDMA NIC term in has_active_ib_interface
* gda: add active MLX5 and Pensando vendor ID checks for backend selection
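A rough sketch of what such a selection check might look like is given below. It is not the code from this commit: NicInfo and has_active_rdma_nic are hypothetical names, the way the NIC list is gathered is left out, and the constants are the PCI vendor IDs commonly listed for Mellanox (0x15b3) and Pensando (0x1dd8).

```cpp
#include <cstdint>
#include <vector>

// PCI vendor IDs as commonly listed for these vendors.
constexpr uint32_t kVendorMellanox = 0x15b3;  // MLX5 devices
constexpr uint32_t kVendorPensando = 0x1dd8;  // Pensando (ionic) devices

// Hypothetical summary of one NIC as seen by the backend-selection code.
struct NicInfo {
  uint32_t vendor_id;
  bool port_active;
};

// Returns true if at least one NIC is from a recognized vendor and has an
// active port, i.e. GDA could plausibly be selected.
bool has_active_rdma_nic(const std::vector<NicInfo>& nics) {
  for (const NicInfo& nic : nics) {
    const bool known_vendor = nic.vendor_id == kVendorMellanox ||
                              nic.vendor_id == kVendorPensando;
    if (known_vendor && nic.port_active) {
      return true;
    }
  }
  return false;
}
```

When the function returns false, the selection logic would fall through to another backend rather than picking GDA and failing later.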
[ROCm/rocshmem commit: 29000a5644]