Commit graph

445 Commits

Author SHA1 Message Date
Aurelien Bouteiller bcdf60def6 Enable new a2a (pr 334) on ionic as well (#366)
* Enable new a2a (pr 334) on ionic as well

* Apply suggestions from AI code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/rocshmem commit: 82d91433c9]
2026-01-06 20:41:51 -05:00
Edgar Gabriel e38f98fad5 fix reduction test for gfx1201 (#374)
* fix reduction for gfx942 and 1201

Match the synchronization of internal_putmem_wg and internal_getmem_wg
to their non-internal counterparts. internal_putmem_wg is used in
the IPC reduction.

* move specialization to internal_putmem

[ROCm/rocshmem commit: 8d2504d6c1]
2026-01-06 10:15:38 -06:00
Edgar Gabriel cc727261de disable the putmem_signal_on_stream on RO (#376)
It fails in about 50% of the cases. We will revisit the cause later,
but RO is currently lower priority, so the test is disabled for now.

[ROCm/rocshmem commit: ed2f75f1de]
2026-01-06 08:10:46 -06:00
Aurelien Bouteiller abb1e0684a Do not hardcode wf_size==64 in ionic provider (#367)
* Do not hardcode wf_size==64 in ionic provider

* Simpler same_qp_mask in ionic


[ROCm/rocshmem commit: 0c496d83d6]
2026-01-05 18:36:58 -05:00
Omri Mor 56bfb13644 QueuePair: prefix bnxt functions and variables (#373)
[ROCm/rocshmem commit: f5940f6b9a]
2025-12-22 14:46:17 -08:00
Omri Mor c43dc136f3 [Bugfix] GDA/bnxt: release SQ lock before return (#372)
* bnxt_post_wqe_amo_single with fetching = true would return
  before releasing the send queue lock, resulting in a deadlock.
* Release the send queue lock before returning from the function.
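
The deadlock described above follows a common pattern: an early return path that skips the unlock. The following is a minimal host-side sketch of the corrected shape, not the actual bnxt code. `SendQueue`, `sq_lock`, `sq_unlock`, and `post_amo_single` are hypothetical names, and the atomic operation is simulated with plain arithmetic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a send queue guarded by a spin lock.
struct SendQueue {
  std::atomic_flag lock = ATOMIC_FLAG_INIT;
  uint64_t posted = 0;  // number of work requests posted so far
};

inline void sq_lock(SendQueue &sq) {
  while (sq.lock.test_and_set(std::memory_order_acquire)) {
    // spin until the lock is free
  }
}

inline void sq_unlock(SendQueue &sq) {
  sq.lock.clear(std::memory_order_release);
}

// Simulated fetch-add: returns the previous value when fetching, else 0.
uint64_t post_amo_single(SendQueue &sq, uint64_t *target, uint64_t add,
                         bool fetching) {
  assert(target != nullptr);
  sq_lock(sq);
  uint64_t old = *target;
  *target += add;
  ++sq.posted;
  if (fetching) {
    sq_unlock(sq);  // the buggy shape returned here with the lock still held
    return old;
  }
  sq_unlock(sq);
  return 0;
}
```

With the unlock missing on the fetching path, the second call on the same queue would spin forever in `sq_lock`.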

[ROCm/rocshmem commit: 016e08120a]
2025-12-22 12:05:00 -08:00
Kutovoi, Vadim a4b99485a9 gda/ro: validate and exit cleanly when forced GDA config is invalid (#354)
* gda: validate and exit cleanly when forced GDA config is invalid

* ro: validate and exit cleanly when forced RO config is invalid

[ROCm/rocshmem commit: 80a710ac0a]
2025-12-22 10:54:33 +00:00
Omri Mor ed38201b90 gda: fix incorrect casts from void* to uintptr_t (#369)
[ROCm/rocshmem commit: e8fc5e67c4]
2025-12-19 16:18:49 -08:00
Dimple Prajapati e21c087f2a [BugFix] Fix rocshmem_get_device_ctx to return ctx_opaque pointer (#359)
Changed rocshmem_get_device_ctx() to properly copy the full rocshmem_ctx_t
structure and return only the ctx_opaque pointer instead of trying to
copy directly to a void pointer.

The prior implementation could cause undefined behavior or memory corruption
because it copied 16 bytes of data into an 8-byte destination. It worked so far because the
ctx_opaque field is at the proper offset, but the incorrect memcpy would overwrite other allocations
and cause issues.
This fixes the context memory handling when passing the device context from
the host to device kernels.
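
As an illustration of the size mismatch described above, here is a minimal host-only sketch. The two-pointer `ctx_t` layout and the `get_ctx_opaque` helper are assumptions made for the example. They are not the real rocshmem_ctx_t definition or the actual fix, which may use a device-to-host copy instead of std::memcpy.

```cpp
#include <cassert>
#include <cstring>

// Assumed layout for illustration: two pointers, 16 bytes on a 64-bit target.
struct ctx_t {
  void *ctx_opaque;
  void *team_opaque;
};

// Buggy shape (do not do this):
//   void *out;
//   std::memcpy(&out, src, sizeof(ctx_t));  // writes 16 bytes into 8
//
// Corrected shape: copy the whole structure into a properly sized temporary,
// then hand back only the field the caller needs.
void *get_ctx_opaque(const ctx_t *src) {
  assert(src != nullptr);
  ctx_t tmp;
  std::memcpy(&tmp, src, sizeof(tmp));
  return tmp.ctx_opaque;
}
```

The temporary costs one extra structure on the stack, and the copy size always matches the destination size.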

[ROCm/rocshmem commit: cf6a53e81c]
2025-12-19 10:01:02 -05:00
Aurelien Bouteiller dde4902844 Fix driver.sh script for systems where neither amd-smi nor rocm-smi is found (#370)

[ROCm/rocshmem commit: 5eaa152010]
2025-12-19 10:00:11 -05:00
dependabot[bot] 750d3f8b2e Bump urllib3 from 2.5.0 to 2.6.0 in /docs/sphinx (#365)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.5.0 to 2.6.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.5.0...2.6.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.6.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rocshmem commit: 166a591216]
2025-12-19 09:55:42 -05:00
Yiltan 43363894a8 remove cq lock (#357)
[ROCm/rocshmem commit: b28a56bd54]
2025-12-15 09:23:24 -05:00
Yiltan 5e2ba952f3 [GDA/BNXT] Remove doorbell arbitration (#363)
[ROCm/rocshmem commit: fe1a28e409]
2025-12-15 09:23:01 -05:00
Edgar Gabriel f9fd5d3cdd use 64 threads for reduction test (#360)
* use 64 threads for reduction test

much faster with IPC backend.

* change all relevant collective tests.

[ROCm/rocshmem commit: c35210f174]
2025-12-15 08:14:18 -06:00
yugang-amd 195fe4e5ee GDA docs style edits (#362)
* Update docs/install.rst

Co-authored-by: yugang-amd <yugang.wang@amd.com>

* Update docs/sphinx/_toc.yml.in

Co-authored-by: yugang-amd <yugang.wang@amd.com>

* Apply suggestions from code review

Co-authored-by: yugang-amd <yugang.wang@amd.com>

---------

Co-authored-by: Yiltan <ytemucin@amd.com>

[ROCm/rocshmem commit: bbad1d8539]
2025-12-10 17:03:58 -05:00
Aurelien Bouteiller e783b47388 Bump version number for post-7.2 devel (#356)
[ROCm/rocshmem commit: 64460f0ec9]
2025-12-10 13:03:20 -05:00
Yiltan 258d264ecc Add default context alltoall API (#350)
[ROCm/rocshmem commit: fddbe7b15d]
2025-12-10 11:43:15 -05:00
Aurelien Bouteiller 972893bab2 Reenable building test-only with external MPI (#352)
[ROCm/rocshmem commit: 1a16b3bedc]
2025-12-10 11:40:29 -05:00
Aurelien Bouteiller 92459fa840 Update version to 3.2.0 for 7.2.0 rocm release (#351)
[ROCm/rocshmem commit: ef5f2be215]
2025-12-09 10:26:55 -05:00
Anatolii Rozanov f98c72d627 Add host API for *_on_stream operations (#340)
* Add functional test for barrier_all_on_stream

* Add rocshmem_barrier_all_on_stream support for GDA and RO backends

Implements rocshmem_barrier_all_on_stream operation for
GPU Direct Access and Reverse Offload backends.

Previously, rocshmem_barrier_all_on_stream was only supported for IPC backend.

* Add functional test for rocshmem_broadcastmem_on_stream

* Add host-side rocshmem_broadcastmem_on_stream API

Implement stream-based broadcast collective operation

- Add rocshmem_broadcastmem_on_stream host API and kernel implementation
- Add functional test TeamBroadcastmemOnStreamTester with multi-stream
  support and correctness verification
- Use per-workgroup contexts to avoid contention across parallel streams

API:
rocshmem_broadcastmem_on_stream(team, dest, source, nelems, pe_root, stream)

* Add functional test for rocshmem_getmem_on_stream

* Add host-side rocshmem_getmem_on_stream API

Implement stream-based point-to-point RMA get operation

- Add rocshmem_getmem_on_stream host API and kernel implementation
- Support for asynchronous getmem operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group collective getmem for efficient memory transfer

API:
rocshmem_getmem_on_stream(dest, source, nelems, pe, stream)

(AI Assist)

* Add host-side rocshmem_putmem_on_stream API

- Add rocshmem_putmem_on_stream for asynchronous remote writes
- Support for concurrent RMA operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group device collective operation

API:
rocshmem_putmem_on_stream(dest, source, bytes, pe, stream)

(AI Assist)

* Add functional test for rocshmem_putmem_on_stream

* Add host-side rocshmem_putmem_signal_on_stream API

Enables asynchronous putmem operations with signaling on HIP streams.

The implementation includes:
- Kernel wrapper rocshmem_putmem_signal_kernel
- Host interface putmem_signal_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Public API

Function signature:
void rocshmem_putmem_signal_on_stream(void *dest, const void *source,
                                      size_t bytes, uint64_t *sig_addr,
                                      uint64_t signal, int sig_op,
                                      int pe, hipStream_t stream);

* Add functional test for rocshmem_putmem_signal_on_stream

* Add host-side rocshmem_signal_wait_until_on_stream API

Enables asynchronous signal wait operations on HIP streams.

The implementation includes:
- Kernel wrapper rocshmem_signal_wait_until_kernel
- Host interface signal_wait_until_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Native uint64_t support in wait_until API (generated from P2P_SYNC.py)

Function signature:
void rocshmem_signal_wait_until_on_stream(uint64_t *sig_addr, int cmp,
                                          uint64_t cmp_value,
                                          hipStream_t stream);

(AI Assist)

* Add functional test for rocshmem_signal_wait_until_on_stream

* Add documentation for stream API functions

This commit adds API documentation for the following host-side
stream functions:

- rocshmem_barrier_all_on_stream (collective routines)
- rocshmem_broadcastmem_on_stream (collective routines)
- rocshmem_getmem_on_stream (RMA operations)
- rocshmem_putmem_on_stream (RMA operations)
- rocshmem_putmem_signal_on_stream (signaling operations)
- rocshmem_signal_wait_until_on_stream (point-to-point sync)

The documentation includes function signatures, parameter descriptions,
and detailed explanations of asynchronous behavior and stream handling.

(AI Assist)

* Rename "bytes" -> "nelems"

* Add "_TEST_" to the variables used in tests

* Remove incorrect hipStreamDefault usage

hipStreamDefault is a stream-creation flag, not a handle to the default stream.

If stream == nullptr, pass it to the kernel launch unchanged; the kernel then launches on the default stream.

[ROCm/rocshmem commit: d0c8380650]
2025-12-09 08:55:46 -06:00
Dimple Prajapati b9c172de16 Add IBGDA backend flag to enable bitcode generation (#347)
* Change to enable ibgda bitcode compilation

* Apply suggestion from @abouteiller

---------

Co-authored-by: Aurelien Bouteiller <aurelien.bouteiller@amd.com>

[ROCm/rocshmem commit: fbe57306b9]
2025-12-08 16:19:48 -08:00
Avinash Kethineedi 4a0a3cc6e3 Refactor: modularize RMA and AMO WQE posting functions (#331)
* Refactor: modularize RMA and AMO WQE posting functions
  - Extract shared logic for SQ/CQ waiting, doorbell ringing, and WQE building
* Remove unused variables
* Update return buffer address calculation for atomics

[ROCm/rocshmem commit: 1acf454048]
2025-12-08 14:54:41 -06:00
Yiltan 9b77387067 Fix docs rendering issue (#349)
[ROCm/rocshmem commit: d5bcb3a201]
2025-12-08 15:54:06 -05:00
Yiltan cf1db0529a Remove unused fence policy (#348)
[ROCm/rocshmem commit: ecd4c9f561]
2025-12-08 14:06:53 -05:00
Aurelien Bouteiller 92c56e7fbd Functional tests without MPI support (#343)
* Let functional tests build without external MPI

* Fix error conditions when using uuid startup with internal MPI

* Do not abort if libibverbs is not found but not using GDA

* Enabled RO functional test initialized with TEST_UUID

* Reduce load time for ro backend_can_run and prevent mpilib_dlclose
crashing

* Fix case TEST_UUID=1, ROCSHMEM_BACKEND='' (autoloading gda)

[ROCm/rocshmem commit: c99bc21e10]
2025-12-08 11:46:16 -05:00
Yiltan 1c3ce17f13 [GDA/BNXT] Optimize Alltoall using put signal (#334)
* Modularize bnxt

* add post_wqe_amo_single

* add alltoall with putsignal impl

* make ringing the doorbell optional

[ROCm/rocshmem commit: baaf8091b5]
2025-12-05 12:41:22 -05:00
Avinash Kethineedi 1ecc355062 IPC: insert __threadfence_system() after *wg RMA APIs to guarantee global memory visibility (#346)
[ROCm/rocshmem commit: f907ef91e4]
2025-12-04 10:21:25 -06:00
Edgar Gabriel 3d658b558b reenable gfx1100 (#328)
* reenable gfx1100

use the modified version of the flat_store_short assembly instruction as suggested by the compiler team (32bit input value instead of 16bit)

* add fix for gfx1201

add the same fix for gfx1201 that was introduced for gfx1100

[ROCm/rocshmem commit: 224c969bef]
2025-12-03 13:49:38 -06:00
Anatolii Rozanov 4b04b540bf Add host API for alltoallmem_on_stream collective operation (#333)
* Add host-side rocshmem_alltoallmem_on_stream function

Function signature:
  rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void *dest,
                                 const void *source, size_t size,
                                 hipStream_t stream)

- The function launches rocshmem_alltoallmem_kernel which calls
device-side alltoall<char> workgroup collective through default context.
- Uses dynamic block size determination via occupancy API.
- Implemented for all backends.

* Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends

When allocating memory for alltoall_pSync_pool in setup_teams() and
teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE
instead of ROCSHMEM_ALLTOALL_SYNC_SIZE.

* Add functional test for team_alltoallmem_on_stream

This commit adds a new functional test to verify the correctness of
the host-side rocshmem_team_alltoallmem_on_stream API.

* Add documentation for rocshmem_alltoallmem_on_stream

This commit adds API documentation for the host-side
rocshmem_alltoallmem_on_stream function in the collective routines
section. The documentation includes:

[ROCm/rocshmem commit: 5577feb70d]
2025-12-03 08:40:24 -05:00
Yiltan 0f32739b52 Updated important missing environment variables (#344)
[ROCm/rocshmem commit: 8b350a51fe]
2025-12-02 11:40:30 -05:00
Aurelien Bouteiller 1e3a161c74 MLX5 cards have a vendor-id that does not match the pci-vendor-id for some reason (#342)

Signed-off-by: Aurelien Bouteiller <abouteil@amd.com>

[ROCm/rocshmem commit: 0f7da76018]
2025-12-02 11:32:37 -05:00
Kutovoi, Vadim dde9e2464e gda: add check for active interfaces when selecting the GDA backend (#327)
* gda: add check for active interfaces when selecting the GDA backend

* fix __func__ macro in rocshmem_ctx_pe_quiet

* gda: switch to more generic RDMA NIC term in has_active_ib_interface

* gda: add active MLX5 and Pensando vendor ID checks for backend selection

[ROCm/rocshmem commit: 29000a5644]
2025-12-01 15:49:25 -05:00
Adel Johar 2c243feb1b [Docs] Move environment variables to separate page (#341)
[ROCm/rocshmem commit: ba77bdd9a6]
2025-12-01 14:25:27 -05:00
Yiltan 77ba8cc76c Cleanup readme.md
[ROCm/rocshmem commit: 774159a08f]
2025-12-01 10:43:18 -05:00
Yiltan 2079193495 Update docs for GDA (#337)
[ROCm/rocshmem commit: 5606fdafd6]
2025-12-01 09:38:11 -05:00
Yiltan f9caef6908 Add rocshmem_int64_p (#335)
[ROCm/rocshmem commit: d9e2890222]
2025-11-26 10:31:23 -05:00
Yiltan a40485d292 Alltoall bug fix for RCCL (#329)
[ROCm/rocshmem commit: 7ab9823169]
2025-11-20 11:16:00 -05:00
Edgar Gabriel db4c6293cc add relaxed_ordering option (#324)
* add relaxed_ordering option

Add an environment variable that controls whether the
IBV_ACCESS_RELAXED_ORDERING flag is set when registering memory with the
ibv_reg_mr* functions.

* missed a spot

[ROCm/rocshmem commit: 2ae2033648]
2025-11-20 08:20:25 -06:00
Yiltan b126537b55 [GDA] Alltoall optimization - single warp (#319)
* Remove testing of data types
As the collective is templated, we are just testing if sizeof(T) works

* Added single-threaded variants

* Applied thread puts optimization to barrier

* Apply single threaded optimization to alltoall

* This optimization only works on bnxt, so place a switch to protect it

* Handle the edge case where the thread count is smaller than the number of PEs
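
One common way to handle fewer threads than PEs is a strided loop, where each thread covers several PEs. The sketch below shows that pattern on the host. It is an assumption about the general technique and not the code from this commit. `pes_for_thread` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

// Returns the PEs that thread `tid` handles when `num_threads` threads
// share `num_pes` PEs. Works for num_threads < num_pes and the reverse.
std::vector<int> pes_for_thread(int tid, int num_threads, int num_pes) {
  assert(num_threads > 0 && tid >= 0 && tid < num_threads);
  std::vector<int> mine;
  for (int pe = tid; pe < num_pes; pe += num_threads) {
    mine.push_back(pe);  // a device kernel would issue the put to `pe` here
  }
  return mine;
}
```

With 4 threads and 10 PEs, thread 1 covers PEs 1, 5, and 9, so every PE is visited exactly once across the threads.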

[ROCm/rocshmem commit: 1347d5d628]
2025-11-19 14:25:29 -05:00
Anatolii Rozanov 8d8dffbd43 Fix typo in README.md (#326)
[ROCm/rocshmem commit: 618bdef082]
2025-11-19 08:36:01 -06:00
Yiltan cfb3d42524 Update change log for rocm 7.1.1 (#320) (#321)
(cherry picked from commit 783ab37f68ffb72b4baffda516fcc19e2f28804e)

[ROCm/rocshmem commit: 1fb0fc4bd0]
2025-11-14 13:51:03 -05:00
Edgar Gabriel 2b0bca5c87 disable gfx1100 temporarily (#322)
[ROCm/rocshmem commit: ef3ba6cd45]
2025-11-14 10:46:19 -07:00
Yiltan a500bc8029 only use rocm_install if we build the tools (#316)
[ROCm/rocshmem commit: 73786e203e]
2025-11-12 10:58:49 -05:00
Edgar Gabriel 722b54fddb replace MPI function call. (#317)
* replace MPI function call.

* add two missing defs for RO

[ROCm/rocshmem commit: e1a7e20b1b]
2025-11-12 07:38:47 -06:00
Dana Robinson 237f64065f Fix typo in CONTRIBUTING.md (#315)
[ROCm/rocshmem commit: 65790c1b4f]
2025-11-09 12:58:19 -06:00
Yiltan 740cbe6098 Use dlopen for libnuma (#312)
[ROCm/rocshmem commit: 80f0a39866]
2025-11-07 10:12:11 -05:00
Edgar Gabriel 3c25349ec1 initial commit for gfx12 support (#305)
[ROCm/rocshmem commit: d185fe3555]
2025-11-07 08:54:03 -06:00
Edgar Gabriel 5e6a4e15f6 disable memory tests (#310)
Disable fine-grain and coarse-grain memory tests until a fix is
available in ROCm 7.1 and/or our CI image. Otherwise we might miss other
errors due to constant CI failures.

[ROCm/rocshmem commit: 4fc5541d78]
2025-11-07 08:04:31 -06:00
Allen Hubbe 5e82060ba0 gda: fix getmem_nbi_wg source and dest (#311)
A copy paste mistake in a previous commit caused source and dest to
be reversed.  Correct the source and dest params.

Fixes: e8a7371007

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>

[ROCm/rocshmem commit: e2dcf99456]
2025-11-06 16:21:20 -06:00
Yiltan 7348bac9bf Memset queues (#313)
[ROCm/rocshmem commit: cd9b5ee806]
2025-11-06 14:16:53 -05:00