
993 Commits

Author SHA1 Message Date
Arm Patinyasakdikul ff75860d73 Fix unroll factor display bug. (#1969) 2025-10-10 15:35:06 -05:00
Surya Periaswamy 5bd5079de1 MSCCL++ fix split path null deref (#1959)
* Add speriaswamy-amd to CODEOWNERS
* MSCCL++: fix split path null deref; key maps by parent ncclUniqueId
* removed no-op
2025-10-09 14:08:38 -05:00
Rahul Vaidya 6b200ee6c5 Fix LL128 proto selection to respect user setting (#1822) 2025-10-09 14:08:03 -05:00
Nusrat Islam d22a39e954 Update direct AG and single node LL threshold (#1944)
* update AG direct and single node LL threshold

* update thresholds based on MI350 experimental results

* disable using LL for direct AG

* enable direct AG for lower GPU counts

* direct AG single node tuning

* fix in-place buffer allocation for AG unit test

* whitespace fix

* gate direct AG for gfx950 and gfx942

---------

Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>
2025-10-09 10:48:50 -05:00
Artem Kuzmitckii 00a42c80f3 Reverse logic of context tracking enablement from #1927 (#1971)
In this commit it is disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA)
Original PR https://github.com/ROCm/rccl/pull/1927
2025-10-09 10:24:09 +02:00
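The opt-in behaviour described in 00a42c80f3 (off by default, on when `RCCL_ENABLE_CONTEXT_TRACKING=1`) can be sketched as follows. This is only an illustration of the pattern: the helper names `envFlagEnabled` and `contextTrackingEnabled` are hypothetical and are not taken from the RCCL sources, and the assumption that only the exact value `1` enables the feature is ours.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only when the variable is set to exactly "1".
// An unset variable or any other value is treated as "disabled".
inline bool envFlagEnabled(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical wrapper: context tracking is opt-in, so the default is false.
inline bool contextTrackingEnabled() {
  return envFlagEnabled("RCCL_ENABLE_CONTEXT_TRACKING");
}
```

With this shape, a user would opt in with `export RCCL_ENABLE_CONTEXT_TRACKING=1` before launching the application; leaving the variable unset keeps tracking off.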
BertanDogancay 3f94267f21 Merge remote-tracking branch 'nccl/master' into develop 2025-10-06 18:36:49 -04:00
Nilesh M Negi 342ec086e3 Revert "changes for hugepages backed host buffer for larger allocations (#1841)" (#1951)
This reverts commit 65b69bf318.
2025-10-02 23:43:09 -05:00
amd-jiali 5978d2f9ab Print out the hipRuntimeVersion message from WARN to always show up (#1911)
Authored-by: Jiali Li <jialili@amd.com>
2025-10-02 11:32:32 -05:00
Bhuvan Mital 65b69bf318 changes for hugepages backed host buffer for larger allocations (#1841) 2025-09-28 00:40:22 -05:00
Artem Kuzmitckii 07925ec027 Revert disabling of context tracking for Radeon (#1927)
* Revert disabling of context tracking for Radeon

Original commit 6fc228e2
 `Disable context tracking for the current version. (#1839)`

* Add env variable for disabling of context tracking for Radeon

`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking

* Update docs/how-to/rccl-usage-tips.rst

Fix grammar, thanks @amd-jnovotny

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING

* Revert changes in includes and rename util function

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-27 15:19:50 -04:00
alex-breslow-amd 45166f6586 Gate code by rocm_version (#1945) 2025-09-26 13:28:41 -07:00
Mustafa Abduljabbar 0dd2b2f65e Fix extra token typo (#1943) 2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar 7a329bbd94 Expose symbols for RCCL algo/proto/channels selection functions (#1923)
* Unhide symbols for algo/proto functions

* Add all_gather direct usage detection
2025-09-25 18:58:30 -04:00
Larry Meadows cb14fccdcc - LL Protocol: Add missing fences for gfx950, this fixes the hang issue (#1932)
- Remove asm flat_store_dwordx4, not needed
2025-09-25 14:07:07 -07:00
corey-derochie-amd d86cf78810 Moved new functions to the bottom of the function table to maintain backward compatibility (#1931)
* Moved new functions to the bottom of the function table to maintain backward compatibility

* Added ordering fixes to api_trace.cc
2025-09-23 13:30:27 -06:00
Mustafa Abduljabbar c1e1f2faeb Use batched P2P to enhance alltoall small message performance (#1902)
* Batch P2P operations (2 per CU/channel) and update channel-part mapping

- Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs

- P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes

* Address single node regression and channel per net peer

* Add batching threshold

* Add enable switch for batching

* Update CHANGELOG.md

* Add minor comment change

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-22 16:25:10 -04:00
corey-derochie-amd 9b04b2a42f Added an implementation of ncclSymGetKernelPtr for when GENERATE_SYM_KERNELS is not defined, as it is normally generated code. (#1925) 2025-09-19 07:52:33 -06:00
corey-derochie-amd ed095cad35 Moved latency_profiler license into subdirs and updated NOTICES. (#1918) 2025-09-18 12:54:39 -06:00
Venkateshwar Reddy Kandula 0cc896910e due nccl api sync update RCCL_API_TRACE_VERSION_PATCH to 2 (#1916) 2025-09-18 07:36:50 -06:00
Nilesh M Negi da06c69cb8 [INIT] Use rocm-smi API instead of CLI for querying FW version (#1920) 2025-09-17 19:17:19 -05:00
Karthik Ganesan 740dfd1efd Update prims_simple.h to keep header file access to rccl_metadata.h uniform (#1906)
Header files in device/ folder are directly referenced in the code base except here.
2025-09-16 08:58:50 -05:00
Bertan Dogancay 93d86dd8e3 [BUILD] Stop generating sym kernels by default (#1907)
* Stop generating sym kernels by default
2025-09-15 12:19:35 -04:00
mberenjk ada4e12360 disabling msccl for fp8 datatype (#1888)
* disabling msccl for fp8 datatype

---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-09-11 13:09:34 -05:00
Wenkai Du de9ebd8a8b Treat PIX and PXB as same GDR distance (#1894) 2025-09-11 10:44:10 -05:00
isaki001 9c36439354 add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889) 2025-09-10 14:25:23 -05:00
Wenkai Du c2bccf9156 Enable LL128 and use same tuning table for gfx942 4 NICs (#1898) 2025-09-10 11:11:15 -04:00
Mustafa Abduljabbar 7ccc6f268f Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md
2025-09-03 08:54:13 -04:00
ycui1984 361d596229 [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 (#1867)
* [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0
* [rocm_regression] Check firmware version
* [rocm_regression] Resolve review comments
* [rocm_regression] Move hsa env checking into init once func
* [rocm_regression] Prevent hot fix version in firmware
* [rocm_regression] Improve unit tests
2025-08-29 11:18:23 -05:00
BertanDogancay 08a7be231b Merge remote-tracking branch 'nccl/master' into develop 2025-08-28 15:46:28 -05:00
Nusrat Islam df448862c3 Device allocation tracker (#1878)
* alloc: add memory allocation tracker

* alloc: add tracker for ncclCuMemAlloc() APIs

* alloc: add null pointer check during free
2025-08-27 09:30:51 -05:00
Mustafa Abduljabbar 277747c199 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-26 15:03:54 -04:00
Nusrat Islam b882af9ffd fixup: remove extra semicolon (#1881) 2025-08-26 10:57:25 -05:00
Nusrat Islam 5e7937effb Add direct allgather algorithm (#1868)
* add direct allgather algorithm

* minor fix

* add debug print for memory allocation tracker

* add message size threshold for direct allgather

* scatter transfers across ranks

* update changelog

* minor fix

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* enable direct AG when pxn is ON on MI300X or MI350

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-25 07:55:10 -05:00
Nilesh M Negi bf6660ee4e [BUILD] Populate host_table entries only for 1 unroll (#1871) 2025-08-23 00:15:38 -05:00
Arm Patinyasakdikul 28a83c3ea6 Removing "Could not find any local path from gpu X to net." warning (#1866)
* Removing "Could not find any local path from gpu X to net." warning to avoid confusion.
2025-08-20 16:52:35 -05:00
Arm Patinyasakdikul fb882e80f6 Remove noinline attribute from reduceCopyPacks and (#1864)
reduceCopyPacksWithBias.
2025-08-19 20:24:31 -05:00
Mustafa Abduljabbar c1b3cd8911 Have ncclDevFuncId use 64-Bit keyed map with field packing (#1857)
- Updated ncclDevFuncId to use a hash-based lookup with std::unordered_map.
- Keys are now 64-bit integers, which pack coll, algo, proto, devRedOp, and type fields.
- Improved flexibility and maintainability by moving away from row-based indexing.
- Added error handling for missing keys in the hash map.
- Aligned key generation logic with generate.py and updated generate.py.
2025-08-19 16:41:19 -04:00
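As a rough illustration of the approach described in c1b3cd8911, the sketch below packs the five fields named in the message (coll, algo, proto, devRedOp, type) into a 64-bit key and looks it up in a `std::unordered_map`. The 8-bit field width, the field order, and the function names `packFuncKey` and `lookupFuncId` are assumptions made for this example; the actual layout in RCCL and generate.py may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Assumed field width for illustration only: 8 bits per field.
constexpr unsigned kFieldBits = 8;
constexpr std::uint64_t kFieldMask = (1ull << kFieldBits) - 1;

// Pack the five selector fields into one 64-bit key (hypothetical layout,
// most significant field first: coll | algo | proto | devRedOp | type).
inline std::uint64_t packFuncKey(unsigned coll, unsigned algo, unsigned proto,
                                 unsigned devRedOp, unsigned type) {
  // Each field must fit in its slot, otherwise neighbouring fields would collide.
  assert(coll <= kFieldMask && algo <= kFieldMask && proto <= kFieldMask &&
         devRedOp <= kFieldMask && type <= kFieldMask);
  std::uint64_t key = coll;
  key = (key << kFieldBits) | algo;
  key = (key << kFieldBits) | proto;
  key = (key << kFieldBits) | devRedOp;
  key = (key << kFieldBits) | type;
  return key;
}

// Look up a device function index by key; returns -1 when the key is missing,
// so the caller can report an error instead of indexing out of range.
inline int lookupFuncId(const std::unordered_map<std::uint64_t, int>& table,
                        std::uint64_t key) {
  auto it = table.find(key);
  return it == table.end() ? -1 : it->second;
}
```

Compared with row-based indexing, a keyed map tolerates sparse combinations: only the (coll, algo, proto, devRedOp, type) tuples that were actually generated need an entry, and a missing tuple surfaces as an explicit lookup failure.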
Nusrat Islam 6ade5065b4 device: optimize threadfence for ll64 protocol (#1858)
* device: optimize threadfence for ll64 protocol

* device: use __atomic_signal_fence()

---------

Co-authored-by: Nusrat Islam <nusislam@useocpslog-003.amd.com>
2025-08-18 09:16:41 -05:00
Nilesh M Negi c3b8de4ec8 [DEVICE] Use noinline for LLGenericOp only on gfx950 (#1849) 2025-08-15 15:15:02 -05:00
isaki001 44121db890 [TUNING] gfx950 16N tuning (#1835)
* change gfx950 algo/proto selection for multinode allreduce, allgather, reduceScatter
* gfx950 tuning: enable tuning for broadcast, allreduce starts LL128 earlier and switches to ring earlier, change LL128 start for allgather and reduceScatter
* lower LL128 threshold
* update reduceScatter LL128 min to match LL max for consistency
* enable multinode PXN and increase chunksize for gfx950
* change LL128 start to 128KB, adjust ring-start according to node-count
* disable code-path for fused-AR on LL128 for gfx950
* use LL128 starting from 1KB for multinode allgather on gfx950
* start LL128 earlier for multinode reduceScatter on gfx950
* start LL128 earlier for multinode broadcast on gfx950
* set multinode allreduce to start simple on 64MB for gfx950
* start LL128 from 1KB for multinode broadcast on gfx950
* setting multinode AR to use tree instead of ring at 16MB, 64MB, 128MB
* set multinode broadcast to use LL for up to 256KB depending on node-count for gfx950
* adjust algo for 32MB multinode allreduce on gfx950
* make 32MB tree LL128 for multinode AR on gfx950
* make sure ring is not picked on 2N allreduce on small sizes
2025-08-15 15:12:45 -05:00
alex-breslow-amd 1aa2570b48 Disable the __threadfence on the sender side of the simple protocol when possible. (#1830)
Leverages the traits of extended-scope fine-grain memory to get rid of a device-scope acquire-release fence.  This improves throughput for single node workloads on gfx942 and gfx950 for some input sizes (e.g., ~32 MiB to about 256 MiB) when using the simple protocol.  Multinode workloads on MI300X see a smaller but statistically significant uplift for some message sizes.  Runtime disablement is supported via setting the environment variable RCCL_GFX942_CHEAP_FENCE_ON to 0.
2025-08-15 07:54:54 -07:00
mberenjk c61152baa4 Added useAcc as a template parameter to address the performance regression (#1856)
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-08-14 15:58:54 -05:00
Karthikeyan Arumugam 6d41e5ba99 Add cstring header explicitly as it is removed from HIP (#1859) 2025-08-13 15:14:22 -07:00
isaki001 74d82a8145 enable more events for LL128 NPKIT trace collection (#1827) 2025-08-07 11:19:36 -05:00
Avinash 3f8cac388e Compiler warnings fix 2 (#1801)
* Changes to device code

* Changes to src/misc

* Changes to graph

* src/include changes

* src/transport changes

* changes in init, enqueue, proxy

* Changes to CMakeLists.txt

* Additional changes to device code

* Additional changes to net.cc

* adding 'compiler warning' tag to ease upstream merge

* typo correction

* Addressing comments

* Additional changes for new commits
2025-08-05 17:36:23 -05:00
Arm Patinyasakdikul 6fc228e247 Disable context tracking for the current version. (#1839) 2025-08-04 10:48:00 -05:00
Nilesh M Negi bd55f876e9 [DEVICE] Add unroll=2 for gfx950 multi-node (#1824) 2025-07-31 02:35:26 -05:00
ycui1984 874cd657ef Add collective latency profiler (#1785)
* [LatencyProfiler] Initial commit

* [LatencyProfiler] Add unit tests

* [LatencyProfiler] add more

* [LatencyProfiler] Pass unit tests

* [LatencyProfiler] Add hooks to integrate with meta internal tools

* [LatencyProfiler] Restore install.sh

* [LatencyProfiler] Resolved comments 1. add proper license 2. use proper namespace

* [LatencyProfiler] Add header
2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar 4ce3df8d3a Optimize alltoall for 64 GPUs and above for gfx942 (#1828)
Add pxn and p2p net chunksize mi300x tuning
2025-07-30 15:14:43 -04:00
mberenjk c84ee3d298 Upcast FP8 to Half (FP16) for Sum Operation (#1775)
* adding hadd and hadd2 support using builtin functions.

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-07-29 11:33:06 -05:00