rocm-systems

Author	SHA1	Message	Date
Yiltan	258d264ecc	Add default context alltoall API (#350 ) [ROCm/rocshmem commit: `fddbe7b15d`]	2025-12-10 11:43:15 -05:00
Aurelien Bouteiller	972893bab2	Reenable building test-only with external MPI (#352 ) [ROCm/rocshmem commit: `1a16b3bedc`]	2025-12-10 11:40:29 -05:00
Aurelien Bouteiller	92459fa840	Update version to 3.2.0 for 7.2.0 rocm release (#351 ) [ROCm/rocshmem commit: `ef5f2be215`]	2025-12-09 10:26:55 -05:00
Anatolii Rozanov	f98c72d627	Add host API for _on_stream operations (#340 ) Add functional test for barrier_all_on_stream * Add rocshmem_barrier_all_on_stream support for GDA and RO backends Implements rocshmem_barrier_all_on_stream operation for GPU Direct Access and Reverse Offload backends. Previously, rocshmem_barrier_all_on_stream was only supported for IPC backend. * Add functional test for rocshmem_broadcastmem_on_stream * Add host-side rocshmem_broadcastmem_on_stream API Implement stream-based broadcast collective operation - Add rocshmem_broadcastmem_on_stream host API and kernel implementation - Add functional test TeamBroadcastmemOnStreamTester with multi-stream support and correctness verification - Use per-workgroup contexts to avoid contention across parallel streams API: rocshmem_broadcastmem_on_stream(team, dest, source, nelems, pe_root, stream) * Add functional test for rocshmem_getmem_on_stream * Add host-side rocshmem_getmem_on_stream API Implement stream-based point-to-point RMA get operation - Add rocshmem_getmem_on_stream host API and kernel implementation - Support for asynchronous getmem operations on HIP streams - Add backend support for GDA, RO, and IPC contexts - Use work-group collective getmem for efficient memory transfer API: rocshmem_getmem_on_stream(dest, source, nelems, pe, stream) (AI Assist) * Add host-side rocshmem_putmem_on_stream API - Add rocshmem_putmem_on_stream for asynchronous remote writes - Support for concurrent RMA operations on HIP streams - Add backend support for GDA, RO, and IPC contexts - Use work-group device collective operation API: rocshmem_putmem_on_stream(dest, source, bytes, pe, stream) (AI Assist) * Add functional test for rocshmem_putmem_on_stream * Add host-side rocshmem_putmem_signal_on_stream API Enables asynchronous putmem operations with signaling on HIP streams. 
The implementation includes: - Kernel wrapper rocshmem_putmem_signal_kernel - Host interface putmem_signal_on_stream method - Context layer support across all backends (IPC, GDA, RO) - Public API Function signature: void rocshmem_putmem_signal_on_stream(void dest, const void source, size_t bytes, uint64_t sig_addr, uint64_t signal, int sig_op, int pe, hipStream_t stream); Add functional test for rocshmem_putmem_signal_on_stream * Add host-side rocshmem_signal_wait_until_on_stream API Enables asynchronous signal wait operations on HIP streams. The implementation includes: - Kernel wrapper rocshmem_signal_wait_until_kernel - Host interface signal_wait_until_on_stream method - Context layer support across all backends (IPC, GDA, RO) - Native uint64_t support in wait_until API (generated from P2P_SYNC.py) Function signature: void rocshmem_signal_wait_until_on_stream(uint64_t sig_addr, int cmp, uint64_t cmp_value, hipStream_t stream); (AI Assist) Add functional test for rocshmem_signal_wait_until_on_stream * Add documentation for stream API functions This commit adds API documentation for the following host-side stream functions: - rocshmem_barrier_all_on_stream (collective routines) - rocshmem_broadcastmem_on_stream (collective routines) - rocshmem_getmem_on_stream (RMA operations) - rocshmem_putmem_on_stream (RMA operations) - rocshmem_putmem_signal_on_stream (signaling operations) - rocshmem_signal_wait_until_on_stream (point-to-point sync) The documentation includes function signatures, parameter descriptions, and detailed explanations of asynchronous behavior and stream handling. (AI Assist) * Rename "bytes" -> "nelems" * Add "_TEST_" to the variables used in tests * Remove incorrect hipStreamDefault usage hipStreamDefault is not a default stream. This is a flag. If stream == nullptr, then just pass it to kernel. It will launch the kernel on the default stream [ROCm/rocshmem commit: `d0c8380650`]	2025-12-09 08:55:46 -06:00
Dimple Prajapati	b9c172de16	Add IBGDA backend flag to enable bitcode generation (#347 ) * Change to enable ibgda bitcode compilation * Apply suggestion from @abouteiller --------- Co-authored-by: Aurelien Bouteiller <aurelien.bouteiller@amd.com> [ROCm/rocshmem commit: `fbe57306b9`]	2025-12-08 16:19:48 -08:00
Avinash Kethineedi	4a0a3cc6e3	Refactor: modularize RMA and AMO WQE posting functions (#331 ) * Refactor: modularize RMA and AMO WQE posting functions - Extract shared logic for SQ/CQ waiting, doorbell ringing, and WQE building * Remove unused variables * Update return buffer address calculation for atomics [ROCm/rocshmem commit: `1acf454048`]	2025-12-08 14:54:41 -06:00
Yiltan	9b77387067	Fix docs rendering issue (#349 ) [ROCm/rocshmem commit: `d5bcb3a201`]	2025-12-08 15:54:06 -05:00
Yiltan	cf1db0529a	Remove unused fence policy (#348 ) [ROCm/rocshmem commit: `ecd4c9f561`]	2025-12-08 14:06:53 -05:00
Aurelien Bouteiller	92c56e7fbd	Functional tests without MPI support (#343 ) * Let functional tests build without external MPI * Fix error conditions when using uuid startup with internal MPI * Do not abort if libibverbs is not found but not using GDA * Enabled RO functional test initialized with TEST_UUID * Reduce load time for ro backend_can_run and prevent mpilib_dlclose crashing * Fix case TEST_UUID=1, ROCSHMEM_BACKEND='' (autoloading gda) [ROCm/rocshmem commit: `c99bc21e10`]	2025-12-08 11:46:16 -05:00
Yiltan	1c3ce17f13	[GDA/BNXT] Optimize Alltoall using put signal (#334 ) * Modularize bnxt * add post_wqe_amo_single * add alltoall with putsignal impl * make ringing the doorbell optional [ROCm/rocshmem commit: `baaf8091b5`]	2025-12-05 12:41:22 -05:00
Avinash Kethineedi	1ecc355062	IPC: insert `__threadfence_system()` after *wg RMA APIs to guarantee global memory visibility (#346 ) [ROCm/rocshmem commit: `f907ef91e4`]	2025-12-04 10:21:25 -06:00
Edgar Gabriel	3d658b558b	reenable gfx1100 (#328 ) * reenable gfx1100 use the modified version of the flat_store_short assembly instruction as suggested by the compiler team (32bit input value instead of 16bit) * add fix for gfx1201 add the same fix for gfx1201 that was introduced for gfx1100 [ROCm/rocshmem commit: `224c969bef`]	2025-12-03 13:49:38 -06:00
Anatolii Rozanov	4b04b540bf	Add host API for alltoallmem_on_stream collective operation (#333 ) * Add host-side rocshmem_alltoallmem_on_stream function Function signature: rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void dest, const void source, size_t size, hipStream_t stream) - The function launches rocshmem_alltoallmem_kernel which calls device-side alltoall<char> workgroup collective through default context. - Uses dynamic block size determination via occupancy API. - Implemented for all backends. * Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends When allocating memory for alltoall_pSync_pool in setup_teams() and teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE instead of ROCSHMEM_ALLTOALL_SYNC_SIZE. * Add functional test for team_alltoallmem_on_stream This commit adds a new functional test to verify the correctness of the host-side rocshmem_team_alltoallmem_on_stream API. * Add documentation for rocshmem_alltoallmem_on_stream This commit adds API documentation for the host-side rocshmem_alltoallmem_on_stream function in the collective routines section. The documentation includes: [ROCm/rocshmem commit: `5577feb70d`]	2025-12-03 08:40:24 -05:00
Yiltan	0f32739b52	Updated important missing environment variables (#344 ) [ROCm/rocshmem commit: `8b350a51fe`]	2025-12-02 11:40:30 -05:00
Aurelien Bouteiller	1e3a161c74	MLX5 cards have a vendor-id that does not match the pci-vendor-id for (#342 ) some reason. Signed-off-by: Aurelien Bouteiller <abouteil@amd.com> [ROCm/rocshmem commit: `0f7da76018`]	2025-12-02 11:32:37 -05:00
Kutovoi, Vadim	dde9e2464e	gda: add check for active interfaces when selecting the GDA backend (#327 ) * gda: add check for active interfaces when selecting the GDA backend * fix __func__ macro in rocshmem_ctx_pe_quiet * gda: switch to more generic RDMA NIC term in has_active_ib_interface * gda: add active MLX5 and Pensando vendor ID checks for backend selection [ROCm/rocshmem commit: `29000a5644`]	2025-12-01 15:49:25 -05:00
Adel Johar	2c243feb1b	[Docs] Move environment variables to separate page (#341 ) [ROCm/rocshmem commit: `ba77bdd9a6`]	2025-12-01 14:25:27 -05:00
Yiltan	77ba8cc76c	Cleanup readme.md [ROCm/rocshmem commit: `774159a08f`]	2025-12-01 10:43:18 -05:00
Yiltan	2079193495	Update docs for GDA (#337 ) [ROCm/rocshmem commit: `5606fdafd6`]	2025-12-01 09:38:11 -05:00
Yiltan	f9caef6908	Add rocshmem_int64_p (#335 ) [ROCm/rocshmem commit: `d9e2890222`]	2025-11-26 10:31:23 -05:00
Yiltan	a40485d292	Alltoall bug fix for RCCL (#329 ) [ROCm/rocshmem commit: `7ab9823169`]	2025-11-20 11:16:00 -05:00
Edgar Gabriel	db4c6293cc	add relaxed_ordering option (#324 ) * add relaxed_ordering option add an environment variable that allows to control setting the IBV_ACCESS_RELAXED_ORDERING flag when registering memory with the ibv_reg_mr* functions. * missed a spot [ROCm/rocshmem commit: `2ae2033648`]	2025-11-20 08:20:25 -06:00
Yiltan	b126537b55	[GDA] Alltoall optimization - single warp (#319 ) * Remove testing of data types As the collective is templated, we are just testing if sizeof(T) works * Added single threaded variants * Applied thread puts optimization to barrier * Apply single threaded optimization to alltoall * This optimization only works on bnxt, so place a switch to protect it * Handle the edge case where the thread count is smaller than the number of PEs [ROCm/rocshmem commit: `1347d5d628`]	2025-11-19 14:25:29 -05:00
Anatolii Rozanov	8d8dffbd43	Fix typo in README.md (#326 ) [ROCm/rocshmem commit: `618bdef082`]	2025-11-19 08:36:01 -06:00
Yiltan	cfb3d42524	Update change log for rocm 7.1.1 (#320 ) (#321 ) (cherry picked from commit 783ab37f68ffb72b4baffda516fcc19e2f28804e) [ROCm/rocshmem commit: `1fb0fc4bd0`]	2025-11-14 13:51:03 -05:00
Edgar Gabriel	2b0bca5c87	disable gfx1100 temporarily (#322 ) [ROCm/rocshmem commit: `ef3ba6cd45`]	2025-11-14 10:46:19 -07:00
Yiltan	a500bc8029	only use rocm_install if we build the tools (#316 ) [ROCm/rocshmem commit: `73786e203e`]	2025-11-12 10:58:49 -05:00
Edgar Gabriel	722b54fddb	replace MPI function call. (#317 ) * replace MPI function call. * add two missing defs for RO [ROCm/rocshmem commit: `e1a7e20b1b`]	2025-11-12 07:38:47 -06:00
Dana Robinson	237f64065f	Fix typo in CONTRIBUTING.md (#315 ) [ROCm/rocshmem commit: `65790c1b4f`]	2025-11-09 12:58:19 -06:00
Yiltan	740cbe6098	Use dlopen for libnuma (#312 ) [ROCm/rocshmem commit: `80f0a39866`]	2025-11-07 10:12:11 -05:00
Edgar Gabriel	3c25349ec1	initial commit for gfx12 support (#305 ) [ROCm/rocshmem commit: `d185fe3555`]	2025-11-07 08:54:03 -06:00
Edgar Gabriel	5e6a4e15f6	disable memory tests (#310 ) disable fine-grain and coarse-grain memory tests until a fix is available in ROCm 7.1 and/or our CI image. Otherwise we might miss other errors due to constant CI failures. [ROCm/rocshmem commit: `4fc5541d78`]	2025-11-07 08:04:31 -06:00
Allen Hubbe	5e82060ba0	gda: fix getmem_nbi_wg source and dest (#311 ) A copy paste mistake in a previous commit caused source and dest to be reversed. Correct the source and dest params. Fixes: `e8a7371007` Signed-off-by: Allen Hubbe <allen.hubbe@amd.com> [ROCm/rocshmem commit: `e2dcf99456`]	2025-11-06 16:21:20 -06:00
Yiltan	7348bac9bf	Memset queues (#313 ) [ROCm/rocshmem commit: `cd9b5ee806`]	2025-11-06 14:16:53 -05:00
Yiltan	bf19d70a29	Added ibv_wrapper which opens library using dlopen (#309 ) [ROCm/rocshmem commit: `110f9c8793`]	2025-11-05 16:12:44 -05:00
Allen Hubbe	e8a7371007	gda ionic: use all threads in wave operations (#295 ) Use all available threads for polling the cq to increase the maximum message rate. Even when posting a single wqe in the wave, use all available threads for polling the cq to reserve space in the sq. Changes were needed in the rocshmem abstraction to avoid disabling gpu threads, like taking turns or using only the first thread in a wave or wavefront. To avoid breaking other gda implementations, reimplement turn-based or single thread strategy in post_wqe_rma_turn and post_wqe_rma_single. Signed-off-by: Allen Hubbe <allen.hubbe@amd.com> [ROCm/rocshmem commit: `6de67d5d7c`]	2025-11-05 11:01:14 -06:00
Aurelien Bouteiller	51cf7c6c05	python venv madness round 2: use ensurepip if installed (#308 ) When creating a python venv during the install_dependencies script, we try to use ensurepip if it is installed, as it deals better with cases where multiple venvs are active simultaneously. (as seen in CI buildbot) [ROCm/rocshmem commit: `b7a6d86c6b`]	2025-11-05 10:52:22 -05:00
Aurelien Bouteiller	76e8750d88	Add backend type query method, use it to disable 32bit amo testers on gda (#307 ) * Add backend type query method, use it to disable 32bit amo testers on gda * The infrateam testers work [ROCm/rocshmem commit: `8c175315f2`]	2025-11-05 10:24:07 -05:00
erieaton-amd	4aaa1a27f5	Fix rocshmem_ptr definition signature (#306 ) Makes the signature of the definition match the declaration in rocshmem.hpp. Signed-off-by: Eric Eaton <erieaton@amd.com> [ROCm/rocshmem commit: `7b5765ec0e`]	2025-11-04 12:42:47 -05:00
Aurelien Bouteiller	e622398337	install_dependencies pip issues with ubuntu 24 (#302 ) * The install_dependencies script would fail on ubuntu 24.04 they changed how pip works so we need to create a venv first now * Fix install_dependencies for ubuntu 22 * Make sure we build in the builddir and install in the installdir combine installdir for ucx and ompi when user-provided by INSTALL_DIR retain prior behavior if not overridden to avoid breaking CI scripts [ROCm/rocshmem commit: `e155af8704`]	2025-10-31 16:34:36 -04:00
Yiltan	3535ce8c0a	Alltoall linear parallel Optimization (#303 ) [ROCm/rocshmem commit: `8dd2112ec8`]	2025-10-31 10:26:44 -04:00
Yiltan	2f8a1c02a4	[GDA] Implement internal_direct_barrier_wg (#299 ) [ROCm/rocshmem commit: `5f87bb061b`]	2025-10-31 10:26:24 -04:00
Allen Hubbe	fa7841f0d4	functional_tests: n, nskip, nloop, nlarge options (#297 ) To make the functional tests more useful for benchmarking, allow user to specify the number of loops and related parameters via command options. Signed-off-by: Allen Hubbe <allen.hubbe@amd.com> [ROCm/rocshmem commit: `ed91c8cce2`]	2025-10-30 11:54:49 -04:00
Edgar Gabriel	0ad710e537	minor change to MPI detection logic (#294 ) somehow the test whether we requested MPI support or not stopped working, although no obvious code change can be located. Make the if-statement more stringent by explicitly testing whether USE_MPI_SUPPORT is "ON". [ROCm/rocshmem commit: `c0285ac0ce`]	2025-10-28 12:54:26 -05:00
akolliasAMD	6f6719dbab	renamed memcpy to memcpy_lane (#296 ) [ROCm/rocshmem commit: `87d87cc881`]	2025-10-28 09:33:13 -06:00
Yiltan	163148ce7e	[GDA] Improve error messages (#292 ) [ROCm/rocshmem commit: `ff29d139cb`]	2025-10-27 16:51:15 -04:00
Yiltan	0bd07a26be	[GDA/BNXT] Implemented CQE Collapsing (#279 ) [ROCm/rocshmem commit: `6290db319c`]	2025-10-23 14:53:44 -04:00
Aurelien Bouteiller	bdb30e2984	Tests/syncall (#291 ) * SyncAll test case would run Sync * Despecialized name for argument reader * Rename sync-test to team-sync-test as it uses teams * Another stab at probing NUM_GPUS [ROCm/rocshmem commit: `054bc33dc4`]	2025-10-23 13:40:41 -04:00
Edgar Gabriel	3eadf8cc62	fix Win_flush prototype in function table (#289 ) the bug was exposed when trying to compile a backend with HDP flush support. [ROCm/rocshmem commit: `e2c6bb8bd4`]	2025-10-23 08:43:41 -05:00
Edgar Gabriel	d37af80d7e	add support for GPUs using wavefront size of 32 (#285 ) * add gfx1100 support Add support for Radeon 7900 GPUs (RX and PRO), and 7800 PRO. I was contemplating to add gfx1101 and gfx1102 GPUs as well, but those are the lower end models that are more unlikely to be used for compute intensive jobs. In addition, I do not have access to them to test the support. * update WF_SIZe for different options Radeon systems use a WarpSize of 32, unlike current Instinct systems, which use a warp size of 64. For the device side, a gfx specific ifdef is sufficient. For the host side, we need to query the device properties. * adjust functional tests to wf_size of 32 * update unit tests to handle wf_size of 32 * address reviewer comments [ROCm/rocshmem commit: `d0c2845031`]	2025-10-22 16:04:58 -05:00


429 Commits