* Support pipelining codegen and template specialization
* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)
* Remove need for FUNC_INDEX_TOTAL
* Add pipeline field to device function key construction logic
* Avoid unneeded codegen for LL/LL64 kernels
* Modify conditions and add pipeline dtypes env
* Optimize selection for both gfx942 and gfx950
* Increase pipeline bitfield width
* Use __forceinline__ for all device functions
* Realign reduceCopy with original form
* Add opt-out option to enable perf debugs
* Remove force-reduce-pipelining option from README
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Support gfx950 in topo_expl
* Fix dependencies and fetch fmt from sources
* Remove third_party folder in make clean
* Add empty target when fmt is found
* Add MI350 example
* Update README.md
---------
Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
* add direct allgather algorithm
* minor fix
* add debug print for memory allocation tracker
* add message size threshold for direct allgather
* scatter transfers across ranks
* update changelog
* minor fix
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* enable direct AG when pxn is ON on MI300X or MI350
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Changed `TestBedChild` protocol to send the result code before the return value to avoid hanging if the call fails. Switched `TestBedChild::GetUniqueId` to use this.
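The ordering described above can be sketched as follows. This is a minimal illustration over a POSIX pipe, not the actual `TestBedChild` code: the names `sendReply`, `recvReply`, `Reply`, and `kSuccess` are hypothetical, and it assumes the return value is simply omitted when the call fails. It requires C++17.

```cpp
#include <unistd.h>

#include <cstdint>
#include <optional>

// Hypothetical names throughout; a sketch of the "result code first" ordering.
constexpr int32_t kSuccess = 0;

struct Reply {
  int32_t resultCode;             // always present
  std::optional<uint64_t> value;  // present only when resultCode == kSuccess
};

// Child side: write the result code first, then the value only on success.
inline bool sendReply(int fd, int32_t resultCode, uint64_t value) {
  if (write(fd, &resultCode, sizeof(resultCode)) !=
      static_cast<ssize_t>(sizeof(resultCode))) {
    return false;
  }
  if (resultCode != kSuccess) return true;  // nothing follows a failure
  return write(fd, &value, sizeof(value)) == static_cast<ssize_t>(sizeof(value));
}

// Parent side: read the result code first and read the value only on success,
// so a failed call never leaves the parent blocked waiting for a value.
inline Reply recvReply(int fd) {
  Reply reply{-1, std::nullopt};
  if (read(fd, &reply.resultCode, sizeof(reply.resultCode)) !=
      static_cast<ssize_t>(sizeof(reply.resultCode))) {
    reply.resultCode = -1;  // treat a short read as a transport error
    return reply;
  }
  if (reply.resultCode != kSuccess) return reply;
  uint64_t value = 0;
  if (read(fd, &value, sizeof(value)) == static_cast<ssize_t>(sizeof(value))) {
    reply.value = value;
  } else {
    reply.resultCode = -1;
  }
  return reply;
}
```

With this ordering the reader decides whether to wait for a value only after it knows the call succeeded.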
* First commit for rccl multi node test workflow
* Adding workflow dispatch
* Added branch based pull trigger
* Changed typo in branch name
* Add input variables to push
* Removed input variables to push
* Added self hosted runner for Vultr cloud
* Skipping build and only running test
* Changed test runner label name
* Made changes to executable paths in test script
* Made changes to run
* Made changes to cd into cvs dir
* This is a dummy commit
* Added cmake options
* Modified build options
* Committing build changes
* Adding rccl and rccl-tests
* Re-ordering rccl and rccl-tests
* adding --global command
* modified cmake command
* modified script paths
* Testing OIDC for rccl repo
* Testing OIDC for rccl repo
* Testing build and upload workflow
* use default env variable for AMDGPU families on push workflow trigger
* Adding cleanup and correct role
* Adding additional yml files
* Fixing typo
* Adding new sha
* Adding correct gpu target
* Adding back venv bin activate
* Adding workflow dispatch for tests
* Testing
* Adding cat
* Adding cat
* Adding rocm dir change
* Adding checkout
* cat with sudo
* rccl checkout
* correcting branch
* removing sudo
* trying to adjust correct path
* Adding output dir path
* Use docker container with pre-installed MPI
* Adding back build steps
* Fixing SHA
* Adding exclusion logic
* Adding test
* Adding CI check
* Removing testing
* Limit to build only rccl, rccl-tests and required dependencies
* Adding test
* Removing test
* Removing quote
* Reverting test
* PR comments
---------
Co-authored-by: arravikum <arravikum@amd.com>
Co-authored-by: Marius Brehler <marius.brehler@amd.com>
- Updated ncclDevFuncId to use a hash-based lookup with std::unordered_map.
- Keys are now 64-bit integers, which pack coll, algo, proto, devRedOp, and type fields.
- Improved flexibility and maintainability by moving away from row-based indexing.
- Added error handling for missing keys in the hash map.
- Aligned key generation logic with generate.py and updated generate.py.
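The packed-key lookup described in this list might look roughly like the sketch below. The 8-bit field width, the field order, and the names `makeKey`, `FuncTable`, and `lookupFuncId` are illustrative assumptions, not the layout used by RCCL or generate.py. It requires C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Assumed layout for illustration only: five fields of 8 bits each.
constexpr unsigned kFieldBits = 8;
constexpr uint64_t kFieldMask = (1ull << kFieldBits) - 1;

// Pack coll, algo, proto, devRedOp, and type into one 64-bit key.
inline uint64_t makeKey(uint64_t coll, uint64_t algo, uint64_t proto,
                        uint64_t devRedOp, uint64_t type) {
  // Each field must fit in its slot, otherwise keys would collide.
  assert(coll <= kFieldMask && algo <= kFieldMask && proto <= kFieldMask &&
         devRedOp <= kFieldMask && type <= kFieldMask);
  return (coll << (4 * kFieldBits)) | (algo << (3 * kFieldBits)) |
         (proto << (2 * kFieldBits)) | (devRedOp << kFieldBits) | type;
}

// Hypothetical table type: packed key -> device function id.
using FuncTable = std::unordered_map<uint64_t, int>;

// Returns std::nullopt for a missing key so the caller can report an error
// instead of indexing past the end of a row-based table.
inline std::optional<int> lookupFuncId(const FuncTable& table, uint64_t key) {
  auto it = table.find(key);
  if (it == table.end()) return std::nullopt;
  return it->second;
}
```

Because the key is built from the field values rather than from a row position, adding a field or a new combination does not shift the indices of existing entries, provided the generator and the runtime agree on the packing.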
* File containing test for comm.h
* Update CommTest.cpp
Added gtest API for assert
* Update CommTest.cpp
Adding copyright
* Update CommTest.cpp
Removing info and tested as not required.
* Update and rename CommTest.cpp to CommTests.cpp
* Update CMakeLists.txt
* change gfx950 algo/proto selection for multinode allreduce, allgather, reduceScatter
* gfx950 tuning: enable tuning for broadcast, allreduce starts LL128 earlier and switches to ring earlier, change LL128 start for allgather and reduceScatter
* lower LL128 threshold
* update reduceScatter LL128 min to match LL max for consistency
* enable multinode PXN and increase chunksize for gfx950
* change LL128 start to 128KB, adjust ring-start according to node-count
* disable code-path for fused-AR on LL128 for gfx950
* use LL128 starting from 1KB for multinode allgather on gfx950
* start LL128 earlier for multinode reduceScatter on gfx950
* start LL128 earlier for multinode broadcast on gfx950
* set multinode allreduce to start simple on 64MB for gfx950
* start LL128 from 1KB for multinode broadcast on gfx950
* setting multinode AR to use tree instead of ring at 16MB, 64MB, 128MB
* set multinode broadcast to use LL for up to 256KB depending on node-count for gfx950
* adjust algo for 32MB multinode allreduce on gfx950
* make 32MB tree LL128 for multinode AR on gfx950
* make sure ring is not picked on 2N allreduce on small sizes
Leverages the traits of extended-scope fine-grain memory to remove a device-scope acquire-release fence. This improves throughput for single-node workloads on gfx942 and gfx950 for some input sizes (e.g., roughly 32 MiB to 256 MiB) when using the simple protocol. Multinode workloads on MI300X see a smaller but statistically significant uplift for some message sizes. To disable the optimization at runtime, set the environment variable RCCL_GFX942_CHEAP_FENCE_ON to 0.
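A runtime toggle of this kind could be read as in the sketch below. The helper name `cheapFenceEnabled` is hypothetical, and the sketch assumes the optimization is on by default and is turned off only when the variable is set to exactly `0`; how RCCL actually parses the variable is not shown here.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: the optimization stays enabled unless the environment
// variable is present and equal to "0" (default-on is an assumption).
inline bool cheapFenceEnabled() {
  const char* value = std::getenv("RCCL_GFX942_CHEAP_FENCE_ON");
  const bool disabled = (value != nullptr) && (std::strcmp(value, "0") == 0);
  return !disabled;
}
```

From a shell, exporting `RCCL_GFX942_CHEAP_FENCE_ON=0` before launching the application would then select the original fence behavior.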
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
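Turning a runtime flag such as useAcc into a template parameter lets the compiler drop the unused branch from each instantiation. The sketch below shows the general pattern on a plain host loop. The function `addBias`, its signature, and the assumption that useAcc selects whether an extra accumulator input is added are all illustrative; this is not the allreduceWithBias kernel. It requires C++17.

```cpp
#include <cstddef>

// Illustrative only: out[i] = in[i] + bias[i], plus acc[i] when UseAcc is true.
// UseAcc is a compile-time constant, so the loop body contains no runtime
// branch on the flag in either instantiation.
template <bool UseAcc>
inline void addBias(const float* in, const float* bias, const float* acc,
                    float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    float v = in[i] + bias[i];
    if constexpr (UseAcc) {
      v += acc[i];  // compiled in only for addBias<true>
    }
    out[i] = v;
  }
}
```

A caller would pick the instantiation once, outside the hot loop, for example `useAcc ? addBias<true>(in, bias, acc, out, n) : addBias<false>(in, bias, nullptr, out, n)`.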
This PR moves RCCL toward gating changes on CI failures. Right now, only the build and unit tests run consistently per PR. We should eventually add status badges for all single-node and multi-node tests once those tests run in presubmit and continuously on develop.
* Added new binary for executing unit tests
Added new unit tests for argcheck.cc and alt_rsmi.cc
Modified how unit tests are executed so that they can cover static functions:
a bash script converts static functions and variables to non-static
on the fly, restricted to the debug build type.
* Added new unit tests for src/transport/shm.cc
* Added new unit tests for graph/xml.cc