Commit Graph

421 Commits

Author SHA1 Message Date
Aurelien Bouteiller c99bc21e10 Functional tests without MPI support (#343)
* Let functional tests build without external MPI

* Fix error conditions when using uuid startup with internal MPI

* Do not abort if libibverbs is not found but not using GDA

* Enabled RO functional test initialized with TEST_UUID

* Reduce load time for ro backend_can_run and prevent mpilib_dlclose
crashing

* Fix case TEST_UUID=1, ROCSHMEM_BACKEND='' (autoloading gda)
2025-12-08 11:46:16 -05:00
Yiltan baaf8091b5 [GDA/BNXT] Optimize Alltoall using put signal (#334)
* Modularize bnxt

* add post_wqe_amo_single

* add alltoall with putsignal impl

* make ringing the doorbell optional
2025-12-05 12:41:22 -05:00
Avinash Kethineedi f907ef91e4 IPC: insert __threadfence_system() after *wg RMA APIs to guarantee global memory visibility (#346) 2025-12-04 10:21:25 -06:00
Edgar Gabriel 224c969bef reenable gfx1100 (#328)
* reenable gfx1100

use the modified version of the flat_store_short assembly instruction as suggested by the compiler team (32bit input value instead of 16bit)

* add fix for gfx1201

add the same fix for gfx1201 that was introduced for gfx1100
2025-12-03 13:49:38 -06:00
Anatolii Rozanov 5577feb70d Add host API for alltoallmem_on_stream collective operation (#333)
* Add host-side rocshmem_alltoallmem_on_stream function

Function signature:
  rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void *dest,
                                 const void *source, size_t size,
                                 hipStream_t stream)

- The function launches rocshmem_alltoallmem_kernel which calls
device-side alltoall<char> workgroup collective through default context.
- Uses dynamic block size determination via occupancy API.
- Implemented for all backends.

* Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends

When allocating memory for alltoall_pSync_pool in setup_teams() and
teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE
instead of ROCSHMEM_ALLTOALL_SYNC_SIZE.

* Add functional test for team_alltoallmem_on_stream

This commit adds a new functional test to verify the correctness of
the host-side rocshmem_team_alltoallmem_on_stream API.

* Add documentation for rocshmem_alltoallmem_on_stream

This commit adds API documentation for the host-side
rocshmem_alltoallmem_on_stream function in the collective routines
section. The documentation includes:
2025-12-03 08:40:24 -05:00
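The data movement that an alltoallmem collective performs can be sketched with a host-only reference model. This is not rocSHMEM code and launches no kernel. It assumes, following the OpenSHMEM alltoallmem convention, that `size` is the number of bytes exchanged between each pair of PEs. The function name `reference_alltoallmem` and the vector-of-buffers representation are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only reference model (assumption: not the rocSHMEM implementation).
// source[pe] and dest[pe] stand in for PE `pe`'s buffers, each holding
// npes * size bytes. Block `dst_pe` of the sender's source buffer lands in
// block `src_pe` of the receiver's dest buffer.
void reference_alltoallmem(std::vector<std::vector<char>> &dest,
                           const std::vector<std::vector<char>> &source,
                           std::size_t size) {
  const std::size_t npes = source.size();
  for (std::size_t src_pe = 0; src_pe < npes; ++src_pe) {
    for (std::size_t dst_pe = 0; dst_pe < npes; ++dst_pe) {
      std::memcpy(dest[dst_pe].data() + src_pe * size,
                  source[src_pe].data() + dst_pe * size, size);
    }
  }
}
```

A model like this can serve as the expected-result oracle in a functional test: fill each PE's source buffer with a recognizable pattern, run the collective, and compare against the model's output.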
Yiltan 8b350a51fe Updated important missing environment variables (#344) 2025-12-02 11:40:30 -05:00
Aurelien Bouteiller 0f7da76018 MLX5 cards have a vendor-id that does not match the pci-vendor-id for (#342)
some reason.

Signed-off-by: Aurelien Bouteiller <abouteil@amd.com>
2025-12-02 11:32:37 -05:00
Kutovoi, Vadim 29000a5644 gda: add check for active interfaces when selecting the GDA backend (#327)
* gda: add check for active interfaces when selecting the GDA backend

* fix __func__ macro in rocshmem_ctx_pe_quiet

* gda: switch to more generic RDMA NIC term in has_active_ib_interface

* gda: add active MLX5 and Pensando vendor ID checks for backend selection
2025-12-01 15:49:25 -05:00
Adel Johar ba77bdd9a6 [Docs] Move environment variables to separate page (#341) 2025-12-01 14:25:27 -05:00
Yiltan 774159a08f Cleanup readme.md 2025-12-01 10:43:18 -05:00
Yiltan 5606fdafd6 Update docs for GDA (#337) 2025-12-01 09:38:11 -05:00
Yiltan d9e2890222 Add rocshmem_int64_p (#335) 2025-11-26 10:31:23 -05:00
Yiltan 7ab9823169 Alltoall bug fix for RCCL (#329) 2025-11-20 11:16:00 -05:00
Edgar Gabriel 2ae2033648 add relaxed_ordering option (#324)
* add relaxed_ordering option

add an environment variable that controls whether the
IBV_ACCESS_RELAXED_ORDERING flag is set when registering memory with the
ibv_reg_mr* functions.

* missed a spot
2025-11-20 08:20:25 -06:00
Yiltan 1347d5d628 [GDA] Alltoall optimization - single warp (#319)
* Remove testing of data types
As the collective is templated, we are just testing if sizeof(T) works

* Added single threaded variants

* Applied thread puts optimization to barrier

* Apply single threaded optimization to alltoall

* This optimization only works on bnxt, so place a switch to protect it

* Handle the edge case where the thread count is smaller than the number of PEs
2025-11-19 14:25:29 -05:00
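The commit does not say how the case of fewer threads than PEs is handled. One common pattern is a strided loop in which each thread serves several PEs. The sketch below models that pattern on the host in plain C++. The function name `assign_pes_to_threads` is hypothetical, and the approach is an assumption, not the code from this commit.

```cpp
#include <vector>

// Host-side model of a strided work assignment (assumption: illustrative
// only). Thread `tid` serves PEs tid, tid + num_threads, tid + 2*num_threads,
// and so on. Every PE is served exactly once, whether num_threads is smaller
// or larger than num_pes.
std::vector<std::vector<int>> assign_pes_to_threads(int num_threads,
                                                    int num_pes) {
  std::vector<std::vector<int>> work(num_threads);
  for (int tid = 0; tid < num_threads; ++tid) {
    for (int pe = tid; pe < num_pes; pe += num_threads) {
      work[tid].push_back(pe);
    }
  }
  return work;
}
```

With 4 threads and 10 PEs, thread 0 serves PEs 0, 4 and 8. With 64 threads and 8 PEs, threads 8 through 63 get no work.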
Anatolii Rozanov 618bdef082 Fix typo in README.md (#326) 2025-11-19 08:36:01 -06:00
Yiltan 1fb0fc4bd0 Update change log for rocm 7.1.1 (#320) (#321)
(cherry picked from commit cb239930bc4167806269cbb94e26cf02837daa7a)
2025-11-14 13:51:03 -05:00
Edgar Gabriel ef3ba6cd45 disable gfx1100 temporarily (#322) 2025-11-14 10:46:19 -07:00
Yiltan 73786e203e only use rocm_install if we build the tools (#316) 2025-11-12 10:58:49 -05:00
Edgar Gabriel e1a7e20b1b replace MPI function call. (#317)
* replace MPI function call.

* add two missing defs for RO
2025-11-12 07:38:47 -06:00
Dana Robinson 65790c1b4f Fix typo in CONTRIBUTING.md (#315) 2025-11-09 12:58:19 -06:00
Yiltan 80f0a39866 Use dlopen for libnuma (#312) 2025-11-07 10:12:11 -05:00
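Loading a library with dlopen rather than linking against it lets the program start on systems where the library is absent. A minimal sketch of the pattern follows, assuming a Linux system with `<dlfcn.h>`. The helper names `sketch_load_symbol` and `sketch_numa_usable` are hypothetical and are not taken from the rocSHMEM sources. `numa_available()` is a real libnuma function, and the soname `libnuma.so.1` is an assumption about the target system.

```cpp
#include <dlfcn.h>

// Hypothetical helper: open `lib_name` and look up `symbol_name`.
// On success returns the symbol and stores the library handle in
// *handle_out. On any failure returns nullptr and leaves *handle_out null.
void *sketch_load_symbol(const char *lib_name, const char *symbol_name,
                         void **handle_out) {
  *handle_out = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);
  if (*handle_out == nullptr) {
    return nullptr;  // library not installed: caller can fall back
  }
  void *sym = dlsym(*handle_out, symbol_name);
  if (sym == nullptr) {
    dlclose(*handle_out);
    *handle_out = nullptr;
  }
  return sym;
}

// Hypothetical usage: report whether libnuma is present and usable.
// numa_available() returns -1 when the NUMA API cannot be used.
bool sketch_numa_usable() {
  void *handle = nullptr;
  void *sym = sketch_load_symbol("libnuma.so.1", "numa_available", &handle);
  if (sym == nullptr) {
    return false;
  }
  using numa_available_fn = int (*)();
  auto fn = reinterpret_cast<numa_available_fn>(sym);
  const bool usable = fn() != -1;
  dlclose(handle);
  return usable;
}
```

On older glibc versions the program may need to be linked with `-ldl`.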
Edgar Gabriel d185fe3555 initial commit for gfx12 support (#305) 2025-11-07 08:54:03 -06:00
Edgar Gabriel 4fc5541d78 disable memory tests (#310)
disable fine-grain and coarse-grain memory tests until a fix is
available in ROCm 7.1 and/or our CI image. Otherwise we might miss other
errors due to constant CI failures.
2025-11-07 08:04:31 -06:00
Allen Hubbe e2dcf99456 gda: fix getmem_nbi_wg source and dest (#311)
A copy paste mistake in a previous commit caused source and dest to
be reversed.  Correct the source and dest params.

Fixes: 6de67d5d7c

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
2025-11-06 16:21:20 -06:00
Yiltan cd9b5ee806 Memset queues (#313) 2025-11-06 14:16:53 -05:00
Yiltan 110f9c8793 Added ibv_wrapper which opens library using dlopen (#309) 2025-11-05 16:12:44 -05:00
Allen Hubbe 6de67d5d7c gda ionic: use all threads in wave operations (#295)
Use all available threads for polling the cq to increase the maximum
message rate.  Even when posting a single wqe in the wave, use all
available threads for polling the cq to reserve space in the sq.

Changes were needed in the rocshmem abstraction to avoid disabling gpu
threads, like taking turns or using only the first thread in a wave or
wavefront.  To avoid breaking other gda implementations, reimplement
turn-based or single thread strategy in post_wqe_rma_turn and
post_wqe_rma_single.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
2025-11-05 11:01:14 -06:00
Aurelien Bouteiller b7a6d86c6b python venv madness round 2: use ensurepip if installed (#308)
When creating a python venv during the install_dependencies script, we try to use ensurepip if it is installed, as it deals better with cases where multiple venvs are active simultaneously. (as seen in CI buildbot)
2025-11-05 10:52:22 -05:00
Aurelien Bouteiller 8c175315f2 Add backend type query method, use it to disable 32bit amo testers on gda (#307)
* Add backend type query method, use it to disable 32bit amo testers on
gda

* The infrateam testers work
2025-11-05 10:24:07 -05:00
erieaton-amd 7b5765ec0e Fix rocshmem_ptr definition signature (#306)
Makes the signature of the definition match the declaration in rocshmem.hpp.

Signed-off-by: Eric Eaton <erieaton@amd.com>
2025-11-04 12:42:47 -05:00
Aurelien Bouteiller e155af8704 install_dependencies pip issues with ubuntu 24 (#302)
* The install_dependencies script would fail on ubuntu 24.04
they changed how pip works so we need to create a venv first now

* Fix install_dependencies for ubuntu 22

* Make sure we build in the builddir and install in the installdir
combine installdir for ucx and ompi when user-provided by INSTALL_DIR
retain prior behavior if not overridden to avoid breaking CI scripts
2025-10-31 16:34:36 -04:00
Yiltan 8dd2112ec8 Alltoall linear parallel Optimization (#303) 2025-10-31 10:26:44 -04:00
Yiltan 5f87bb061b [GDA] Implement internal_direct_barrier_wg (#299) 2025-10-31 10:26:24 -04:00
Allen Hubbe ed91c8cce2 functional_tests: n, nskip, nloop, nlarge options (#297)
To make the functional tests more useful for benchmarking, allow user to
specify the number of loops and related parameters via command options.

Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
2025-10-30 11:54:49 -04:00
Edgar Gabriel c0285ac0ce minor change to MPI detection logic (#294)
somehow the test whether we requested MPI support or not stopped
working, although no obvious code change can be located.

Make the if-statement more stringent by explicitly testing whether
USE_MPI_SUPPORT is "ON".
2025-10-28 12:54:26 -05:00
akolliasAMD 87d87cc881 renamed memcpy to memcpy_lane (#296) 2025-10-28 09:33:13 -06:00
Yiltan ff29d139cb [GDA] Improve error messages (#292) 2025-10-27 16:51:15 -04:00
Yiltan 6290db319c [GDA/BNXT] Implemented CQE Collapsing (#279) 2025-10-23 14:53:44 -04:00
Aurelien Bouteiller 054bc33dc4 Tests/syncall (#291)
* SyncAll test case would run Sync

* Despecialized name for argument reader

* Rename sync-test to team-sync-test as it uses teams

* Another stab at probing NUM_GPUS
2025-10-23 13:40:41 -04:00
Edgar Gabriel e2c6bb8bd4 fix Win_flush prototype in function table (#289)
the bug was exposed when trying to compile a backend with HDP flush
support.
2025-10-23 08:43:41 -05:00
Edgar Gabriel d0c2845031 add support for GPUs using wavefront size of 32 (#285)
* add gfx1100 support

Add support for Radeon 7900 GPUs (RX and PRO), and 7800 PRO.

I was contemplating to add gfx1101 and gfx1102 GPUs as well, but those are the lower end models that are more unlikely to be used for compute intensive jobs. In addition, I do not have access to them to test the support.

* update WF_SIZE for different options

Radeon systems use a WarpSize of 32, unlike current Instinct systems,
which use a warp size of 64. For the device side, a gfx specific ifdef
is sufficient. For the host side, we need to query the device
properties.

* adjust functional tests to wf_size of 32

* update unit tests to handle wf_size of 32

* address reviewer comments
2025-10-22 16:04:58 -05:00
Avinash Kethineedi 955c22aeed Add ROCSHMEM_CTX_INVALID for invalid context handling (#287)
* Add `ROCSHMEM_CTX_INVALID` for invalid context handling
  - Define `ROCSHMEM_CTX_INVALID` as {nullptr, nullptr}
  - Add == and != operators to rocshmem_ctx_t
  - Use `ROCSHMEM_CTX_INVALID` on failed context creation
  - Skip ctx destroy if context is invalid

* Update docs for context create and destroy APIs usage and behavior
2025-10-22 12:00:56 -05:00
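The bullet points in this commit can be sketched in a self-contained way. The struct below is a hypothetical stand-in for rocshmem_ctx_t. The commit only states that the invalid value is {nullptr, nullptr}, so the field names, the `sketch_` functions, and their return conventions are assumptions made for illustration.

```cpp
// Hypothetical stand-in for rocshmem_ctx_t (field names are assumptions).
struct sketch_ctx_t {
  void *ctx_opaque;
  void *team_opaque;
};

// Equality operators, mirroring the "== and !=" item in the commit.
inline bool operator==(const sketch_ctx_t &a, const sketch_ctx_t &b) {
  return a.ctx_opaque == b.ctx_opaque && a.team_opaque == b.team_opaque;
}
inline bool operator!=(const sketch_ctx_t &a, const sketch_ctx_t &b) {
  return !(a == b);
}

// Sentinel for an invalid context: both pointers null.
constexpr sketch_ctx_t SKETCH_CTX_INVALID{nullptr, nullptr};

// Hypothetical create: returns 0 on success, nonzero on failure.
// On failure the output context is set to the invalid sentinel.
int sketch_ctx_create(bool simulate_failure, sketch_ctx_t *ctx) {
  static int backing_ctx, backing_team;  // dummy storage for the sketch
  if (simulate_failure) {
    *ctx = SKETCH_CTX_INVALID;
    return 1;
  }
  *ctx = {&backing_ctx, &backing_team};
  return 0;
}

// Hypothetical destroy: skips invalid contexts.
// Returns true if a context was destroyed, false if it was skipped.
bool sketch_ctx_destroy(sketch_ctx_t *ctx) {
  if (*ctx == SKETCH_CTX_INVALID) {
    return false;
  }
  *ctx = SKETCH_CTX_INVALID;
  return true;
}
```

A sentinel plus comparison operators lets callers check the handle itself after a failed create instead of tracking the return code separately, and makes destroying an invalid handle a harmless no-op.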
Yiltan b534423de7 Provide an error when there are no NICs on a system (#286) 2025-10-20 13:07:56 -04:00
Aurelien Bouteiller c44f4ece1f Print an error and quit cleanly if GDA required but could not init (#284) 2025-10-20 13:04:13 -04:00
Yiltan c3eeae473b Implement rocshmem_pe_quiet() (#282)
Co-authored-by: Aurelien Bouteiller <aurelien.bouteiller@amd.com>
2025-10-20 11:42:39 -04:00
Edgar Gabriel 6f74cdfd75 update tester for RO (#281)
update the tester script to only test the amo functions on RO that are
expected to pass. We can revisit the non-passing tests later; for now
they prevent CI from passing, and RO is simply
lower priority than other asks.
2025-10-20 09:03:17 -05:00
Yiltan 9338c84480 Updated docs for ROCm 7.x.x (#239)
Co-authored-by: Aurelien Bouteiller <aurelien.bouteiller@amd.com>
Co-authored-by: yugang-amd <yugang.wang@amd.com>
2025-10-17 12:10:37 -04:00
Aurelien Bouteiller aef74812ae Add ROCSHMEM_GDA_PROVIDER envvar (#280)
* Add ROCSHMEM_GDA_PROVIDER env control

* Single name for a single concept: vendor->provider
2025-10-17 10:46:05 -04:00
Edgar Gabriel ed957302d4 update to build system: (#277)
- add the PMIx library at compile time only if PMIx support was
  found. This is required, e.g., when compiling rocSHMEM with ompi
  4.0/4.1, which do not have a built-in PMIx version.
- setting USE_EXTERNAL_MPI=OFF ensures that we do not
  check for external MPI libraries (even if one is available).
2025-10-17 07:42:11 -05:00