4 Commits

Author SHA1 Message Date
nawrinsu 6d22ce9b1a Fix protocol and channel override when tuner is used (#1985)
* Fix protocol and channel override when tuner is used

* Added comment

* Fix README for basic tuner implementation

[ROCm/rccl commit: 166268d715]
2025-11-03 13:56:34 -08:00
Arm Patinyasakdikul 54194a17c3 Added ERROR message class to handle fatal error messages. (#2002)
* Added ERROR message class to handle fatal error messages.

The new ERROR message class prints its message at every debug level,
including none.

Change some of the fatal error messages to use ERROR instead of WARN.

Added a new error handler function so that more meaningful error
messages can be printed in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separate class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 1ce83d5cc0]
2025-10-30 16:14:20 -05:00
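
The behaviour described in the commit above (an ERROR class that prints at every debug level, including none, and is separate from NONE) can be sketched as a simple level filter. This is only an illustration under assumptions, not RCCL's actual implementation: the enum `LogLevel`, its numeric values, and the function `shouldPrint` are hypothetical names introduced here.

```cpp
// Hypothetical debug levels, ordered from least to most verbose.
// ERROR is a separate message class that sits outside the verbosity
// ordering (assumption: the real library may encode this differently).
enum class LogLevel { NONE = 0, WARN = 1, INFO = 2, ERROR = 100 };

// Returns true when a message of class `msg` should be printed, given
// the debug level `configured` selected by the user.
bool shouldPrint(LogLevel configured, LogLevel msg) {
  if (msg == LogLevel::ERROR) return true;  // fatal errors always print
  if (msg == LogLevel::NONE) return false;  // NONE is a level, not a message class
  return static_cast<int>(msg) <= static_cast<int>(configured);
}
```

With this filter, a WARN message is suppressed when the configured level is NONE, while an ERROR message still gets through, which is the reason for moving fatal messages from WARN to ERROR.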
Kamil Iskra 44d92cf9df NCCL 2.27.7-1
Prevent initialization failures in certain configurations when attempting
to load fp8-specific symmetric multicast kernels on GPUs older than
Blackwell.


[ROCm/rccl commit: 593de54e52]
2025-07-24 10:39:53 -07:00
Kamil Iskra 5b471d77b2 NCCL 2.27.5-1
Improvements for GB200 systems
* Optimize the network performance by alternating the direction of the
  rings and the NIC to GPU assignment across communicators to limit
  unnecessary sharing.
* Fix the detection of C2C links in case GPU Direct RDMA is disabled
  between a GPU and a NIC.
* Fix PXN support on MNNVL systems, where NCCL would try (and fail) to
  share regular host memory across multiple nodes.
* Fix P2C (PXN over C2C), which is now preferred over regular PXN.  This
  support is currently preliminary and is disabled by default; use
  NCCL_PXN_C2C=1 to enable.

Further reduce the overheads of CUDA graph capturing, which increased in
NCCL 2.26.2 for large graphs.

Optimize the network performance on DGX B200 systems by adjusting the
bandwidths provided to the graph search algorithm.

Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8.

Restore the plugin name handling logic to make it possible to specify a
path to the plugin (Issue #1732).

Restore the ability to change NCCL_COLLNET_ENABLE during execution
(Issue #1741).

Add an example tuner plugin with CSV-based overrides.

Remove an x86 dependency from the example profiler.


[ROCm/rccl commit: 3ea7eedf3b]
2025-06-18 10:34:47 -07:00
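
The NCCL 2.27.5-1 entry above mentions an example tuner plugin with CSV-based overrides. The sketch below shows the general idea of such a table: each row restricts an override to a collective and a message-size range. The column layout (`collective,min_bytes,max_bytes,algorithm,protocol,channels`), the `Override` struct, and the functions `parseLine` and `findOverride` are all assumptions made for illustration; they are not taken from the plugin itself, whose actual format may differ.

```cpp
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One override row (hypothetical layout, see the note above).
struct Override {
  std::string collective;
  unsigned long long minBytes;
  unsigned long long maxBytes;
  std::string algorithm;
  std::string protocol;
  int channels;
};

// Parses one CSV line. Returns std::nullopt for blank lines, comment
// lines starting with '#', rows with the wrong number of fields, and
// rows whose numeric fields do not parse.
std::optional<Override> parseLine(const std::string& line) {
  if (line.empty() || line[0] == '#') return std::nullopt;
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ',')) fields.push_back(field);
  if (fields.size() != 6) return std::nullopt;
  try {
    return Override{fields[0], std::stoull(fields[1]), std::stoull(fields[2]),
                    fields[3], fields[4], std::stoi(fields[5])};
  } catch (const std::exception&) {
    return std::nullopt;  // non-numeric size or channel count
  }
}

// Returns the first row matching the collective and message size,
// or nullptr when no override applies.
const Override* findOverride(const std::vector<Override>& table,
                             const std::string& collective,
                             unsigned long long bytes) {
  for (const auto& row : table) {
    if (row.collective == collective && bytes >= row.minBytes &&
        bytes <= row.maxBytes) {
      return &row;
    }
  }
  return nullptr;
}
```

First-match lookup keeps the file order meaningful: more specific rows can be listed ahead of broader fallbacks.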