Improvements for GB200 systems
* Optimize the network performance by alternating the direction of the
  rings and the NIC to GPU assignment across communicators to limit
  unnecessary sharing.
* Fix the detection of C2C links in case GPU Direct RDMA is disabled
  between a GPU and a NIC.
* Fix PXN support on MNNVL systems, where NCCL would try (and fail) to
  share regular host memory across multiple nodes.
* Fix P2C (PXN over C2C), which is now preferred over regular PXN.  This
  support is currently preliminary and is disabled by default; use
  NCCL_PXN_C2C=1 to enable.
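
For illustration, a minimal sketch of opting in to P2C from a Python launcher. Only the NCCL_PXN_C2C variable comes from this commit; the helper name nccl_env_with_p2c is hypothetical, and the sketch assumes the variable has to be present in the environment before the process that initializes NCCL starts.

```python
import os


def nccl_env_with_p2c(base_env=None):
    """Return a copy of the environment with P2C (PXN over C2C) enabled.

    Hypothetical helper: it only sets the NCCL_PXN_C2C variable named in
    the commit message and leaves every other variable untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_PXN_C2C"] = "1"  # opt in; the feature is disabled by default
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when launching the job, so the setting does not leak into the launcher's own environment.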

Further reduce the overheads of CUDA graph capturing, which increased in
NCCL 2.26.2 for large graphs.

Optimize the network performance on DGX B200 systems by adjusting the
bandwidths provided to the graph search algorithm.

Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8.

Restore the plugin name handling logic to make it possible to specify a
path to the plugin (Issue #1732).

Restore the ability to change NCCL_COLLNET_ENABLE during execution
(Issue #1741).

Add an example tuner plugin with CSV-based overrides.

Remove an x86 dependency from the example profiler.
Author: Kamil Iskra
Date: 2025-06-18 10:34:47 -07:00
Parent: 72d2432094
Commit: 3ea7eedf3b
33 changed files: 2740 additions and 143 deletions
@@ -0,0 +1,24 @@
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1024,tree,simple,2,1,8,-1,-1,0.15,45.2,12.5
allreduce,1024,ring,simple,4,1,8,-1,-1,0.12,52.1,10.8
allreduce,1024,tree,ll,2,1,8,-1,-1,0.18,41.3,15.2
allreduce,1024,ring,ll,4,1,8,-1,-1,0.14,48.7,12.1
allreduce,32768,tree,simple,2,1,8,-1,-1,0.25,156.8,25.3
allreduce,32768,ring,simple,4,1,8,-1,-1,0.18,189.2,18.4
allreduce,32768,ring,ll128,8,1,8,-1,-1,0.16,201.5,16.2
allreduce,1048576,ring,simple,4,1,8,-1,-1,0.45,425.6,45.1
allreduce,1048576,ring,ll128,8,1,8,-1,-1,0.38,482.3,38.7
allreduce,1048576,nvls,simple,16,1,8,-1,-1,0.32,551.2,32.1
broadcast,1024,tree,simple,2,1,8,-1,-1,0.08,89.4,8.2
broadcast,1024,ring,simple,4,1,8,-1,-1,0.12,71.3,12.1
broadcast,32768,tree,simple,2,1,8,-1,-1,0.18,234.7,18.5
broadcast,32768,ring,ll128,4,1,8,-1,-1,0.15,267.8,15.2
broadcast,1048576,ring,simple,4,1,8,-1,-1,0.35,612.4,35.1
broadcast,1048576,ring,ll128,8,1,8,-1,-1,0.28,702.1,28.3
allreduce,1024,tree,simple,2,2,16,-1,-1,0.22,38.1,22.4
allreduce,1024,ring,simple,4,2,16,-1,-1,0.19,42.7,19.6
allreduce,32768,ring,simple,4,2,16,-1,-1,0.28,145.2,28.1
allreduce,32768,ring,ll128,8,2,16,-1,-1,0.24,167.8,24.3
allreduce,1048576,ring,simple,4,2,16,-1,-1,0.58,387.5,58.2
allreduce,1048576,ring,ll128,8,2,16,-1,-1,0.48,456.9,48.1
allreduce,1048576,nvls,simple,16,2,16,-1,-1,0.42,512.6,42.3
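
To make the column layout above concrete, here is a minimal sketch that reads rows in this format and picks one configuration per query. It is not the tuner plugin's own logic: the selection rule (exact match on collective, size, nodes, and ranks, then lowest cost_metric) is an assumption made for illustration, and the names load_rows and best_config are hypothetical. The sample text reuses three rows from the file.

```python
import csv
import io

# Three rows copied from the CSV above, with the same header.
SAMPLE = """\
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1024,tree,simple,2,1,8,-1,-1,0.15,45.2,12.5
allreduce,1024,ring,simple,4,1,8,-1,-1,0.12,52.1,10.8
allreduce,1048576,nvls,simple,16,1,8,-1,-1,0.32,551.2,32.1
"""

INT_FIELDS = ("size_bytes", "channels", "nodes", "ranks", "pipeOps", "regBuff")
FLOAT_FIELDS = ("cost_metric", "bandwidth_gbps", "latency_us")


def load_rows(text):
    """Parse CSV text into dicts with numeric columns converted."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = dict(record)
        for name in INT_FIELDS:
            row[name] = int(row[name])
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def best_config(rows, collective, size_bytes, nodes, ranks):
    """Return the matching row with the lowest cost_metric, or None.

    Assumed rule for illustration only; the real plugin may match differently.
    """
    matches = [
        r for r in rows
        if r["collective"] == collective
        and r["size_bytes"] == size_bytes
        and r["nodes"] == nodes
        and r["ranks"] == ranks
    ]
    return min(matches, key=lambda r: r["cost_metric"]) if matches else None
```

For example, best_config(load_rows(SAMPLE), "allreduce", 1024, 1, 8) selects the ring/simple row with 4 channels, because its cost_metric of 0.12 is lower than the tree row's 0.15.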