Commit graph

361 commits

Author SHA1 Message Date
Atul Kulkarni 892d258319 Add missing header in alloc.h (#2086) 2025-12-04 11:26:19 -06:00
Wenkai Du 185e78a8f0 Use one side stream per process (#2063)
* Use one side stream per process

* Handle multiple GPUs per process

* Reset stream when not found

* Address review comments

* Fix missing mutex initializer
2025-12-02 10:03:15 -08:00
Pedram Alizadeh fb67e5b467 Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037)
* Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic

* Switching to hip_bf16.h from ROCm 6.0.0
2025-11-13 15:56:18 -05:00
Kapil S. Pawar acdafac49f [RcclReplayer] Compile without the need for RCCL to be compiled (#2039) 2025-11-10 15:38:48 -06:00
Bertan Dogancay b1e680adc0 [GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030) 2025-11-07 15:15:25 -05:00
Bertan Dogancay a9bb7e9807 [Launch] Enable Implicit order launch with serial mode (#2033) 2025-11-07 13:29:53 -05:00
Ghadeer Ahmed H Alabandi 45991fadad [NET] Enable capping the number of QPs created for send/recv colls (#1998) 2025-11-07 00:47:01 +00:00
Arm Patinyasakdikul 1ce83d5cc0 Added ERROR message class to handle fatal error messages. (#2002)
* Added ERROR message class to handle fatal error messages.

The new ERROR message class prints the message at all debug levels,
including none.

Changed some of the fatal error messages to use ERROR instead of WARN.

Added a new error handler function to print more meaningful error
messages in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separate class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-10-30 16:14:20 -05:00
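
The level gating described in the commit above can be illustrated with a minimal Python sketch. The `DebugLevel` enum, its ordering, and the `should_print` helper are hypothetical names chosen for illustration and do not reflect RCCL's actual implementation; the only behavior taken from the commit message is that ERROR messages are printed at every debug level, including none.

```python
from enum import IntEnum


class DebugLevel(IntEnum):
    # Simplified, hypothetical ordering; not RCCL's actual enum.
    NONE = 0
    WARN = 1
    INFO = 2


def should_print(message_class: str, configured: DebugLevel) -> bool:
    """Return True if a message of this class is shown at the configured level."""
    if message_class == "ERROR":
        # ERROR is its own class and bypasses the level check entirely.
        return True
    if message_class == "WARN":
        return configured >= DebugLevel.WARN
    if message_class == "INFO":
        return configured >= DebugLevel.INFO
    # Unknown classes are suppressed in this sketch.
    return False
```

Under these assumptions, a fatal message reclassified from WARN to ERROR stays visible even when debugging output is switched off.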
Arm Patinyasakdikul 84fdcab68a Added copyrights for Palamida scan 7.2. (#2018) 2025-10-30 13:33:20 -05:00
isaki001 641c0eb51c P2p batching hang-fix (#2011)
* prevent batching when send/recv bytes don't match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default
2025-10-30 13:32:01 -05:00
Mustafa Abduljabbar 12f51ba8bf [Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978)
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
2025-10-29 23:24:32 -05:00
Nilesh M Negi c35bc721ad Fix ncclDevFuncId for AllReduceWithBias (#1980) 2025-10-17 09:28:57 -05:00
gilbertlee-amd fedddb452c Enabling gdrcopy option for gfx950 (#1955) 2025-10-15 10:55:25 -06:00
alex-breslow-amd c70f5b4621 [gfx950] Make bypassing __threadfence the default for multinode. (#1947)
* Gate based on ROCM version, safe for ROCm 7.0.2 and beyond.
* Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950.  Thanks Nilesh.
* Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.
2025-10-15 09:15:36 -07:00
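
The version gate mentioned in the commit above ("safe for ROCm 7.0.2 and beyond") amounts to a component-wise version comparison. The following is a rough Python sketch of that idea only; `parse_version` and `cheap_fence_default_allowed` are hypothetical helpers, and the dotted-string input format is an assumption rather than how RCCL actually obtains the ROCm version.

```python
# Minimum ROCm version named in the commit message.
MIN_SAFE_ROCM = (7, 0, 2)


def parse_version(text: str) -> tuple:
    """Parse 'major.minor.patch' into a 3-tuple, padding missing parts with 0."""
    parts = [int(p) for p in text.split(".")]
    parts = (parts + [0, 0, 0])[:3]
    return tuple(parts)


def cheap_fence_default_allowed(rocm_version: str) -> bool:
    """Hypothetical gate: True when the version is at least 7.0.2."""
    # Tuples compare element by element, so (7, 1, 0) >= (7, 0, 2) holds.
    return parse_version(rocm_version) >= MIN_SAFE_ROCM
```

Comparing tuples instead of raw strings avoids the usual pitfall where "7.10.0" sorts before "7.2.0" lexicographically.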
isaki001 0f99fd84a3 gfx950 channel tuning for ReduceScatter and AllGather (#1940)
* add channel thresholds to override channel-count adjustments
2025-10-14 09:50:44 -05:00
Artem Kuzmitckii 00a42c80f3 Reverse logic of context tracking enablement from #1927 (#1971)
In this commit it is disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA)
Original PR https://github.com/ROCm/rccl/pull/1927
2025-10-09 10:24:09 +02:00
BertanDogancay 3f94267f21 Merge remote-tracking branch 'nccl/master' into develop 2025-10-06 18:36:49 -04:00
Nilesh M Negi 342ec086e3 Revert "changes for hugepages backed host buffer for larger allocations (#1841)" (#1951)
This reverts commit 65b69bf318.
2025-10-02 23:43:09 -05:00
Bhuvan Mital 65b69bf318 changes for hugepages backed host buffer for larger allocations (#1841) 2025-09-28 00:40:22 -05:00
Artem Kuzmitckii 07925ec027 Revert disabling of context tracking for Radeon (#1927)
* Revert disabling of context tracking for Radeon

Original commit 6fc228e2
 `Disable context tracking for the current version. (#1839)`

* Add env variable for disabling of context tracking for Radeon

`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking

* Update docs/how-to/rccl-usage-tips.rst

Fix grammar, thanks @amd-jnovotny

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING

* Revert changes in includes and rename util function

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-27 15:19:50 -04:00
Mustafa Abduljabbar 0dd2b2f65e Fix extra token typo (#1943) 2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar 7a329bbd94 Expose symbols for RCCL algo/proto/channels selection functions (#1923)
* Unhide symbols for algo/proto functions

* Add all_gather direct usage detection
2025-09-25 18:58:30 -04:00
corey-derochie-amd d86cf78810 Moved new functions to the bottom of the function table to maintain backward compatibility (#1931)
* Moved new functions to the bottom of the function table to maintain backward compatibility

* Added ordering fixes to api_trace.cc
2025-09-23 13:30:27 -06:00
Mustafa Abduljabbar c1e1f2faeb Use batched P2P to enhance alltoall small message performance (#1902)
* Batch P2P operations (2 per CU/channel) and update channel-part mapping

- Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs

- P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes

* Address single node regression and channel per net peer

* Add batching threshold

* Add enable switch for batching

* Update CHANGELOG.md

* Add minor comment change

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-22 16:25:10 -04:00
corey-derochie-amd ed095cad35 Moved latency_profiler license into subdirs and updated NOTICES. (#1918) 2025-09-18 12:54:39 -06:00
Venkateshwar Reddy Kandula 0cc896910e due nccl api sync update RCCL_API_TRACE_VERSION_PATCH to 2 (#1916) 2025-09-18 07:36:50 -06:00
Nilesh M Negi da06c69cb8 [INIT] Use rocm-smi API instead of CLI for querying FW version (#1920) 2025-09-17 19:17:19 -05:00
isaki001 9c36439354 add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889) 2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar 7ccc6f268f Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md
2025-09-03 08:54:13 -04:00
ycui1984 361d596229 [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 (#1867)
* [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0
* [rocm_regression] Check firmware version
* [rocm_regression] Resolve review comments
* [rocm_regression] Move hsa env checking into init once func
* [rocm_regression] Prevent hot fix version in firmware
* [rocm_regression] Improve unit tests
2025-08-29 11:18:23 -05:00
BertanDogancay 08a7be231b Merge remote-tracking branch 'nccl/master' into develop 2025-08-28 15:46:28 -05:00
Nusrat Islam df448862c3 Device allocation tracker (#1878)
* alloc: add memory allocation tracker

* alloc: add tracker for ncclCuMemAlloc() APIs

* alloc: add null pointer check during free
2025-08-27 09:30:51 -05:00
Mustafa Abduljabbar 277747c199 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-26 15:03:54 -04:00
Nusrat Islam 5e7937effb Add direct allgather algorithm (#1868)
* add direct allgather algorithm

* minor fix

* add debug print for memory allocation tracker

* add message size threshold for direct allgather

* scatter transfers across ranks

* update changelog

* minor fix

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* enable direct AG when pxn is ON on MI300X or MI350

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-25 07:55:10 -05:00
Mustafa Abduljabbar c1b3cd8911 Have ncclDevFuncId use 64-Bit keyed map with field packing (#1857)
- Updated ncclDevFuncId to use a hash-based lookup with std::unordered_map.
- Keys are now 64-bit integers, which pack coll, algo, proto, devRedOp, and type fields.
- Improved flexibility and maintainability by moving away from row-based indexing.
- Added error handling for missing keys in the hash map.
- Aligned key generation logic with generate.py and updated generate.py.
2025-08-19 16:41:19 -04:00
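
The field packing described in the commit above can be sketched as follows. This is an illustration in Python, not the C++ code from the commit: the 8-bit field widths, the field order, and the names `pack_key` and `lookup_func_id` are assumptions. Only the general idea comes from the commit message: pack coll, algo, proto, devRedOp, and type into one 64-bit key, look it up in a hash map, and report missing keys as errors.

```python
# Field widths in bits. Illustrative assumptions, not RCCL's real layout.
FIELD_BITS = {"coll": 8, "algo": 8, "proto": 8, "dev_red_op": 8, "dtype": 8}


def pack_key(coll: int, algo: int, proto: int, dev_red_op: int, dtype: int) -> int:
    """Pack the five fields into a single integer that fits in 64 bits."""
    values = {
        "coll": coll,
        "algo": algo,
        "proto": proto,
        "dev_red_op": dev_red_op,
        "dtype": dtype,
    }
    key = 0
    for name, bits in FIELD_BITS.items():
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        # Shift earlier fields left and append this field in the low bits.
        key = (key << bits) | value
    return key


def lookup_func_id(table: dict, coll: int, algo: int, proto: int,
                   dev_red_op: int, dtype: int) -> int:
    """Look up a function id by packed key, failing loudly on a missing key."""
    key = pack_key(coll, algo, proto, dev_red_op, dtype)
    try:
        return table[key]
    except KeyError:
        raise LookupError(f"no device function for key {key:#x}") from None
```

A keyed map of this kind does not depend on the order in which entries are generated, which is presumably what the commit means by moving away from row-based indexing.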
isaki001 44121db890 [TUNING] gfx950 16N tuning (#1835)
* change gfx950 algo/proto selection for multinode allreduce, allgather, reduceScatter
* gfx950 tuning: enable tuning for broadcast, allreduce starts LL128 earlier and switches to ring earlier, change LL128 start for allgather and reduceScatter
* lower LL128 threshold
* update reduceScatter LL128 min to match LL max for consistency
* enable multinode PXN and increase chunksize for gfx950
* change LL128 start to 128KB, adjust ring-start according to node-count
* disable code-path for fused-AR on LL128 for gfx950
* use LL128 starting from 1KB for multinode allgather on gfx950
* start LL128 earlier for multinode reduceScatter on gfx950
* start LL128 earlier for multinode broadcast on gfx950
* set multinode allreduce to start simple on 64MB for gfx950
* start LL128 from 1KB for multinode broadcast on gfx950
* setting multinode AR to use tree instead of ring at 16MB, 64MB, 128MB
* set multinode broadcast to use LL for up to 256KB depending on node-count for gfx950
* adjust algo for 32MB  multinode allreduce on gfx950
* make 32MB tree LL128 for multinode AR on gfx950
* make sure ring is not picked on 2N allreduce on small sizes
2025-08-15 15:12:45 -05:00
mberenjk c61152baa4 Added useAcc as a template parameter to address the performance regression (#1856)
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-08-14 15:58:54 -05:00
Karthikeyan Arumugam 6d41e5ba99 Add cstring header explicitly as it is removed from HIP (#1859) 2025-08-13 15:14:22 -07:00
Avinash 3f8cac388e Compiler warnings fix 2 (#1801)
* Changes to device code

* Changes to src/misc

* Changes to graph

* src/include changes

* src/transport changes

* changes in init, enqueue, proxy

* Changes to CMakeLists.txt

* Additional changes to device code

* Additional changes to net.cc

* adding 'compiler warning' tag to ease upstream merge

* typo correction

* Addressing comments

* Additional changes for new commits
2025-08-05 17:36:23 -05:00
ycui1984 874cd657ef Add collective latency profiler (#1785)
* [LatencyProfiler] Initial commit

* [LatencyProfiler] Add unit tests

* [LatencyProfiler] add more

* [LatencyProfiler] Pass unit tests

* [LatencyProfiler] Add hooks to integrate with meta internal tools

* [LatencyProfiler] Restore install.sh

* [LatencyProfiler] Resolved comments 1. add proper license 2. use proper namespace

* [LatencyProfiler] Add header
2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar 4ce3df8d3a Optimize alltoall for 64 GPUs and above for gfx942 (#1828)
Add pxn and p2p net chunksize mi300x tuning
2025-07-30 15:14:43 -04:00
mberenjk c84ee3d298 Upcast FP8 to Half (FP16) for Sum Operation (#1775)
* adding hadd and hadd2 support using builtin functions.

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-07-29 11:33:06 -05:00
Atul Kulkarni 1c3d1b3842 Added new unit tests for src/transport/shm.cc (#1689) 2025-07-25 05:54:42 -05:00
Wenkai Du 9a4213356d Support fused all reduce and elementwise operations (#1729)
* Support fused all reduce and elementwise operations

Add additional "acc" parameter to RCCL Replayer logs

Add flag which indicates availability of new API

* Fix Recorder json parsing

* Remove unreachable code

* Remove extra acc pointer check

* .

* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"

This reverts commit 9d72be7b2f.

* Use noinline to reduce kernels linking time

* Don't use noinline for gfx942 and gfx950 to avoid perf regression

---------

Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2025-07-23 09:04:17 -07:00
alex-breslow-amd 11fabf1de1 Cheaper threadfence for gfx942 in postPeer [1/N]: enable for single node allreduce (#1766)
Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable
2025-07-22 07:15:15 -07:00
Kamil Iskra 7c12c627c6 NCCL 2.27.6-1
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.

Extend the P2C (PXN over C2C) support to send/receive operations.

Support compilation with GCC 14 (Issues #1743, #1751).

Fix the unloading of network plugins that also provide tuner capability.

Fix the change of the current device across the calls to ncclCommDestroy()
and ncclCommAbort().

A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads.  While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB.  Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).
2025-07-11 07:32:13 -07:00
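
The stack-size note in the release message above can also be checked programmatically. The sketch below uses Python's standard `resource` module (Unix only) to read the same soft limit that `ulimit -s` prints. The helper names and the three category strings are hypothetical, and the classification only mirrors the note: "unlimited" is the setting reported to cause crashes, and 8192 KB is the value reported as sufficient.

```python
import resource

# Default Linux stack limit that the release note reports as sufficient.
RECOMMENDED_KB = 8192


def stack_limit_advice(soft_limit: int) -> str:
    """Classify a soft RLIMIT_STACK value given in bytes."""
    if soft_limit == resource.RLIM_INFINITY:
        # The case the note warns about: glibc may shrink thread stacks.
        return "unlimited"
    if soft_limit // 1024 >= RECOMMENDED_KB:
        return "at-or-above-8192KB"
    # The note makes no claim about smaller limits; this only flags them.
    return "below-8192KB"


def current_stack_advice() -> str:
    """Classify the calling process's current soft stack limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return stack_limit_advice(soft)
```

A launcher script could call `current_stack_advice()` on each node before starting a job, since the note points out that the setting has to be propagated to other nodes as well.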
Nilesh M Negi 6b4ad0fd74 [BUILD] Use fmt-header instead of libfmt (#1791) 2025-07-10 17:19:53 -05:00
Nilesh M Negi 2c099fe29a [INIT] Fix fallback for unsupported user-specified runtime unroll factor (#1780)
* [INIT] Fix fallback for unsupported user-specified runtime unroll factor
* Add CollTrace guard
* Move `commSetUnrollFactor()` to rccl_wrap.cc
* Modify comments in the device-code generator script
2025-07-10 10:56:18 -05:00
mberenjk 697bee4ee8 Improving build time by removing the gfx11xx and host code from rccl_float8.h (#1789)
* removing extra build time by removing the gfx11xx arch from using hip_fp8

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-07-09 14:03:47 -05:00
Bertan Dogancay e96c8473a1 [DEVICE] Enable PAT algo for RCCL 1ppn (#1756)
* Enable PAT algo for RCCL 1ppn
2025-07-04 13:45:18 -04:00