* Add new implementation of direct send/recv reduce scatter
* Resolved conflicts
* Add multiple channels support to the reduction kernel of direct reduce scatter and adjust offset into buffer to utilize multiple channels.
* Resolve validation issue when number of elements is not divisible by number of channels leaving elements unaccount for in reduction.
* fix proxy hang
* set maxSrcs to 64 in reduceCopy
* optimize multi-channel code
* fix validation issue in single node MI300
* Tune the message size range for 2,4, and 8 Nodes
* Move Direct RS into separate kernel
* Add Copyright
* resolve review comments
* resolve review comments
* fix merge build issue
* revert move Direct RS into separate kernel
* address review comments
* address review comments
---------
Co-authored-by: KawtharShafie <kawtharshafie@gmail.com>
Co-authored-by: Ghadeer Alabandi <abandiga@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
* Fix channel overuse for 1 rank comms
* limit channels when warpSpeed is enabled but not used
* enable std::min check against # of CUs for maxChannels computation when warpSpeed is enabled
---------
Co-authored-by: Mustafa Abduljabbar <muabdulj@amd.com>
Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
* ROCSHMEM linking/building to match MSCCL++ style
* add rocSHMEM as a submodule
* Move rocSHMEM submodule to ext-src/rocSHMEM
* Adding submodule support proper, as well as a patch for rocshmem
* Cleaning up INCLUDE_DIR vs INCLUDE_DIRS mixup
* updating patch file
* Pointing rocshmem submodule to edgars fixup patch
* Adding IBVERBS link to the submodule build
* More IBVERBS patching
* pin rocshmem submodule to b534423
* Adding IPC support in rocSHMEM build
* updating rocshmem submodule to resolve CQ errors
* Updating submodule to include recent a2a optimizations
* invoke rocshmem alltoall from rccl
* Updating submodule to CQ error number hang
* Updating submodule to include a2a improvements and bug fixes
* Updating submodule to point to Yiltan's fork and doorbell ring removal commit
* Updating hash to correspond with submodule change
* Updating to no-ctx wg call and updating submodule
* copy-in/copy-out using multiples CUs
* Updating rocSHMEM submodule to include doorbell improvs
* updating gitmodule to point to upstream
* code cleanup and adjust threashold
* guard rocshmem a2a invocation
* Only build with rocshmem when specified
* code cleanup
* address review comments
* Removing debugging failure case
Signed-off-by: Thomas Huber <thomas.huber@amd.com>
* whitespace fix
* Adding rocshmem compile guard
* Removing unneccesary comment
Signed-off-by: Thomas Huber <thomas.huber@amd.com>
* remove commented lines
* address review comments
* cleanup
---------
Signed-off-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k12-27.cs-aus.dcgpu>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
Co-authored-by: Islam <nusislam@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-03.cs-aus.dcgpu>
[ROCm/rccl commit: 27648b0900]
* Force ring in WarpSpeed manual mode and log event
* Skip usage for non-ring in WarpSpeed auto mode
* Enable WarpSpeed when its CU count is set
[ROCm/rccl commit: 93fdcb160c]
* Add ncclCommDump API
* remove trailing whitespace changes
* Add more proxy trace timestamps
* Add facebook_rccl namespace before proxyTrace timestamp call
* Clean up ProxyTrae construction
* Move updateProxyOpCounter to member function
* Move setProxyOpTimestamp to member function
* Move addNewProxyOp to member function
* Make internal methods private
* Make ProxyTrace thread safe
* Fix unit tests
* Fix overwritten ProxyTrace DONE setting in net.cc
[ROCm/rccl commit: 08dd75712f]
* Use one side stream per process
* Handle multiple GPUs per process
* Reset stream when not found
* Address review comments
* Fix missing mutex initializer
[ROCm/rccl commit: 185e78a8f0]
* Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x
Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that.
* Update src/init.cc to use ERROR instead of WARN
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 05f914c997]
* Enhance logging in NCCL initialization
It's convenient to log comms obj and default channels together for debugging
* Add opCount to collDevWork and update increment logic
Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms)
* Clarify opCount increment logic in enqueue.cc
Updated comment to clarify incrementing opCount for intranode communications.
* Refactor NCCL_INIT logging format
Updated logging format for NCCL_INIT to improve clarity.
* Remove duplicate INFO logging in init.cc
[ROCm/rccl commit: b00ee4c83c]
* Added ERROR message class to handle fatal error messages.
New ERROR message class will print the message in all debug level,
including none.
Change some of the fatal error message to be in ERROR instead of WARN.
Added new error handler function to print out more meaningful error
message in the future.
* Added CHANGELOG entry.
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Change to no longer reuse NONE as ERROR. ERROR is now a separated class.
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
[ROCm/rccl commit: 1ce83d5cc0]
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
[ROCm/rccl commit: 12f51ba8bf]
* Gate based on ROCM version, safe for ROCm 7.0.2 and beyond.
* Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950. Thanks Nilesh.
* Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.
[ROCm/rccl commit: c70f5b4621]
In this commit it disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA)
Original PR https://github.com/ROCm/rccl/pull/1927
[ROCm/rccl commit: 00a42c80f3]
* Revert disabling of context tracking for Radeon
Original commit df3b7e47
`Disable context tracking for the current version. (#1839)`
* Add env variable for disabling of context tracking for Radeon
`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking
* Update docs/how-to/rccl-usage-tips.rst
Fix grammar, thanks @amd-jnovotny
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING
* Revert changes in includes and rename util function
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
[ROCm/rccl commit: 07925ec027]
Device API (Experimental)
* Introduces device-side APIs to integrate NCCL communication directly into application kernels.
* Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
* Supports Multimem for hardware multicast using NVLink SHARP.
* Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
* Introduces device communicators created using ncclDevCommCreate.
* Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
* Experimental APIs - signatures and functionality may evolve in future releases.
* No ABI compatibility is guaranteed — applications must be recompiled with each new NCCL release.
Symmetric memory improvements
* Support for aggregating symmetric operations using ncclGroupStart/End APIs.
* Reimplement symmetric kernels using device API.
New Host APIs
* Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.
CE (Copy Engine) Collectives
* Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
* Free up SM capacity for the application to do computation at the same time.
* To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t.
NCCL Inspector Plugin
* Introduces an Inspector plugin for always-on performance monitoring.
* Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
* Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
* Lightweight to enable via environment variables NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE.
CMake support (Experiemental)
* Adds a CMake build system as an alternative to existing Makefiles.
* Known issues: pkg.build and Device API currently do not work with CMake.
* The known issues will be addressed in a future release.
Decreased max CTA count from 32 to 16 on Blackwell
* SM overhead is decreased by 50% with this improvement.
* This may cause some perf drop on Blackwell because of the reduced SM usage.
* If the extra SM capacity is not desired, two options are available to restore to previous behavior: 1) Setting NCCL_MIN_CTAS=32 NCCL_MAX_CTAS=32 environment variables; 2) setting communicator config to over-write max CTA count to 32.
* Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.
Plugins
* Network
* App-aware Network plugin. NCCL passes information about communication operations to be executed on the network end point. This allows for better tuning of network end points and their use in the plugins.
* Improve handling of physical and virtual network devices and load/unload.
* Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
* Add Multi-Request Net API. Using this will help NCCL to anticipate multiple send/recv requests and optimize for it. See maxMultiRequestSize field in ncclNetProperties_v11_t.
* Profiler
* Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
* Add Inspector Profiler Plugin (see section above).
* Add a hook to Google’s CoMMA profiler on github.
* Tuner
* Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
* Add NVL Domain Information API.
* Support multiple plugin types from a single shared object.
New Parameterization and ncclConfig changes:
* Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
* Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
* Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as opt in.
* Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
* Enable PxN over C2C by default
* PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
* This behavior can be overridden by setting NCCL_PXN_C2C=0.
Other Improvements:
* Allow FP8 support for non-reductive operations on pre sm90 devices. (See https://github.com/pytorch/pytorch/pull/151594#discussion_r2135777776)
* Fix NVLS+CollNet and temporarily disables COLLNET_CHAIN for >8 GPUs.
* Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (https://github.com/NVIDIA/nccl/issues/1798)
* Modernize mutex management. Convert to std::mutex and std::lock_guard.
* Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
* Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
* NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
* Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
* Move libvirt bridge interface to the end of possible interfaces so that they are considered last. These interfaces are usually virtual bridges to relay traffic to containers running on the host and cannot be used for traffic to a remote node and are therefore unsuitable.
[ROCm/rccl commit: f1308997d0]
* [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0
* [rocm_regression] Check firmware version
* [rocm_regression] Resolve review comments
* [rocm_regression] Move hsa env checking into init once func
* [rocm_regression] Prevent hot fix version in firmware
* [rocm_regression] Improve unit tests
[ROCm/rccl commit: 361d596229]
* add direct allgather algorithm
* minor fix
* add debug print for memory allocation tracker
* add message size threshold for direct allgather
* scatter transfers across ranks
* update changelog
* minor fix
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* enable direct AG when pxn is ON on MI300X or MI350
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
[ROCm/rccl commit: 5e7937effb]
Leverages the traits of extended-scope fine-grain memory to get rid of a device-scope acquire-release fence. This improves throughput for single node workloads on gfx942 and gfx950 for some input sizes (e.g., ~32 MiB to about 256 MiB) when using the simple protocol. Multinode workloads on MI300X see a smaller but statistically significant uplift for some message sizes. Runtime disablement is supported via setting the environment variable RCCL_GFX942_CHEAP_FENCE_ON to 0.
[ROCm/rccl commit: 1aa2570b48]
* Support fused all reduce and elementwise operations
Add additional "acc" parameter to RCCL Replayer logs
Add flag which indicates availability of new API
* Fix Recorder json parsing
* Remove unreachable code
* Remove extra acc pointer check
* .
* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"
This reverts commit 4cadf3597c.
* Use noinline to reduce kernels linking time
* Don't use noinline for gfx942 and gfx950 to avoid perf regression
---------
Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
[ROCm/rccl commit: 9a4213356d]
Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable
[ROCm/rccl commit: 11fabf1de1]
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.
Extend the P2C (PXN over C2C) support to send/receive operations.
Support compilation with GCC 14 (Issues #1743, #1751).
Fix the unloading of network plugins that also provide tuner capability.
Fix the change of the current device across the calls to ncclCommDestroy()
and ncclCommAbort().
A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads. While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB. Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).
[ROCm/rccl commit: 7c12c627c6]
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.
[ROCm/rccl commit: 020dcf0a7c]