* Add another rome model and override
* Fix bug
* Fix typo
* Add ring
* Update ring
* Fix model matching
* Clean up
* Clean up
* Reverse rings for NCCL_RINGS input
* Only reverse NCCL_RINGS for ring graph
* Fix mapping issue when using NCCL_RINGS
* Add NCCL_RINGS_REMAP to handle inconsistent net names
* Template unroll for RCCL kernels
* Adding unroll template arg during CMake hipification
* Reduce linking parallel jobs to avoid OOM in CI
* Work around issues with unit tests
SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking
* CI: do not use -j 16 when building
* CI: use -j 8 when building
* Only reduce parallel linking jobs for extended CI
* Restore original Jenkins command. Change parallel linking jobs in CMake
* Disable MSCCLPP
---------
Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"
This reverts commit 5be3b713ef.
* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"
This reverts commit ad31d93f3d.
* [GRAPH] Use channel shuffling only for IB systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [GRAPH] Define channels=48 for gfx94 RoCE systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [GRAPH] Increase channels for RoCE gfx94 systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Add ring simple chunk size tuning
* Modify the tuning table to improve broadcast performance for 8MB to 32MB on single-node MI300X after ring simple chunk size tuning
* Modify the tuning table to improve reduce performance for 1MB to 4MB on single-node MI300X after ring simple chunk size tuning
---------
Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
Add support for IB SHARP 1PPN operation with user buffers.
Improve support for MNNVL, add NVLS support and multi-clique support.
* Detect the NVLS clique through NVML
* Exchange XML between peers in the same NVLS clique and fuse XMLs
before creating the topology graph.
* Rework bootstrap allgather algorithms to allow for large allgather
operations intra-node (XML exchange).
Net/IB: add support for dynamic GID detection.
* Automatically select RoCEv2/IPv4 interface by default. Allow
selecting IPv6 or even the network/mask.
Reduce NVLS memory usage.
* Add stepSize as property of a connection to allow for different
sizes on different peers; set it to 128K for NVLink SHARP.
Improve tuner loading
* Look for more paths, be more consistent with the network device
plugin.
* Also search for tuner support inside the net plugin.
Improve tuner API
* Add context to support multi-device per process.
Add magic number around comm object to detect comm corruption.
* Add some basic checks around communicators so that we can report a
problem when a communicator gets corrupted or a wrong comm pointer
is passed to NCCL.
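
A minimal sketch of the magic-number idea described above, in C. It is illustrative only and not the NCCL implementation: the struct layout, the sentinel value, and all `demo_*` names are hypothetical. The sentinel is stored at both ends of the object, so either a stray overwrite or a pointer to something that is not a communicator fails the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel value; the real constant is not given in this log. */
#define DEMO_COMM_MAGIC 0x0123456789ABCDEFULL

typedef enum { DEMO_OK = 0, DEMO_INVALID_ARGUMENT = 1 } demo_result_t;

/* Hypothetical communicator: the payload is bracketed by two magic fields. */
typedef struct {
  uint64_t start_magic; /* first field of the object */
  int rank;
  int n_ranks;
  uint64_t end_magic;   /* last field of the object */
} demo_comm_t;

/* Stamp both sentinels when the communicator is created. */
static void demo_comm_init(demo_comm_t *comm, int rank, int n_ranks) {
  comm->start_magic = DEMO_COMM_MAGIC;
  comm->rank = rank;
  comm->n_ranks = n_ranks;
  comm->end_magic = DEMO_COMM_MAGIC;
}

/* Validate a communicator pointer before using it in an API call. */
static demo_result_t demo_comm_check(const demo_comm_t *comm) {
  if (comm == NULL) return DEMO_INVALID_ARGUMENT;
  if (comm->start_magic != DEMO_COMM_MAGIC ||
      comm->end_magic != DEMO_COMM_MAGIC) {
    /* Corrupted object, or a pointer to something that is not a comm. */
    return DEMO_INVALID_ARGUMENT;
  }
  return DEMO_OK;
}
```

Calling such a check at the top of each entry point lets the library report an error instead of acting on corrupted state.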
Fix net/IB error path. GitHub PR #1164
Fix collnet rail mapping with split comm.
Fix packet reordering issue causing bootstrap mismatch
* Use a different tag in ncclTransportP2pSetup for the connectInfo
exchange and the following barrier.
Fix hang when crossNic is inconsistent between ranks.
Fix minCompCap/maxCompCap computation. GitHub issue #1184
Add support for alternating rings, allow for cross-nic rings without
cross-rail communication.
Add support for user buffer registration for network send/recv.
Optimize aggregated operations to better utilize all channels.
Add flattening for BCM PCI gen5 switches.
Add support for inter-node NVLink communication
Add support for port fusion in NET/IB.
Add support for ReduceScatter and AllGather using Collnet.
Update net API to v8.
Fix hang during A2A connection.