Commit graph

272 commits

Author SHA1 Message Date
Wenkai Du c8d3543d3f Add back missing net flush (#1376) 2024-10-15 08:12:26 -07:00
Wenkai Du 5c367a21d0 Improve model matching for GPUs with alltoall XGMI connection (#1372) 2024-10-11 09:53:14 -07:00
Wenkai Du b55b6be0cb Fix crash when PXN is enabled on some platforms (#1369) 2024-10-11 09:02:59 -07:00
corey-derochie-amd c11f6b1531 Only set minNchannels if we are actually using MSCCL, checked using comm->mscclCompatible. (#1337) 2024-10-08 10:20:55 -06:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Nusrat Islam 833435be18 graph: fix for MI300X 64 GPU case (#1308)
PR #1290 introduced a failure for 64 GPU case on MI300X. This PR
fixes the failure.
2024-08-26 18:37:58 -05:00
Wenkai Du 532b70afb6 Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
Wenkai Du d3171b51b7 Fix gfx940 CPX mode (#1290) 2024-08-16 08:46:06 +08:00
Wenkai Du eff56735b0 Fix model matching with PXN enable (#1295) 2024-08-16 06:16:00 +08:00
akolliasAMD d6c317d6ae removed hcc mentions (#1291) 2024-08-14 15:04:13 -06:00
Pedram Alizadeh a25ca9bb90 adding new tuning table for very large number of nodes (#1288) 2024-08-09 10:47:42 -04:00
akolliasAMD c246e25f8e gfx12 Disable ll protocol (#1268) 2024-07-26 08:59:55 -06:00
Nusrat Islam 6f331b0d43 Enable CPX mode for MI300X (#1259)
* graph: enable cpx mode for MI300X

* graph: tune limits for cpx and cleanup
2024-07-19 11:30:37 -05:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
Nilesh M Negi a1ef217b32 Consistent channel shuffling for MI300X multi-node (#1255)
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"

This reverts commit 5be3b713ef.

* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"

This reverts commit ad31d93f3d.
2024-07-18 10:18:09 -05:00
Nilesh M Negi 67e867271f [GRAPH] Disable MSCCL override of no. of channels (#1187)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-15 10:45:21 -05:00
Nilesh M Negi 5be3b713ef [GRAPH] Use channel shuffling only for IB systems (#1228)
* [GRAPH] Use channel shuffling only for IB systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Define channels=48 for gfx94 RoCE systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Increase channels for RoCE gfx94 systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-02 12:20:40 -05:00
Nusrat Islam b09ea29d66 graph: fix minNchannels for multi-node overwrite (#1230) 2024-06-26 16:56:10 -05:00
Wenkai Du ad31d93f3d Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)
This reverts channel stride change in commit
0948eecbba
2024-06-25 14:03:30 -07:00
saurabhAMD e170f41ddd Unit Tests for testing channels (#1222) 2024-06-25 10:10:10 -05:00
Nusrat Islam 05df0f8cea graph: fix minNchannels for multi-node
Multi-node rccl was not correctly setting the minNchannels value. This
PR fixes the bug.
2024-06-24 16:42:44 -05:00
Nusrat Islam 9660e2e2dc Merge pull request #1200 from nusislam/multi-node-256-fix
graph: fix multi-node channel count
2024-06-07 14:34:20 -05:00
gilbertlee-amd 9b94a1052f Disabling NUMA matching for model 79 for some VM configs (#1204) 2024-06-06 17:15:04 -06:00
Nusrat Islam 526cce9bf4 graph: restrict maxChannels to 64 for multi-node and RCCL_ENABLE_INTRANET=1 2024-06-06 10:58:41 -05:00
Nusrat Islam 6ab20a7c6b graph: fix multi-node minChannel count 2024-06-06 10:56:39 -05:00
Nusrat Islam 9746d8ca3f set MIN_NCHANNEL limit to 64 for multi-node 2024-06-03 13:05:05 -05:00
Nusrat Islam ef442f8f92 set MAXCHANNELS to 128 2024-06-03 13:05:05 -05:00
Nusrat Islam 9f654f6cf5 graph: restrict MAXCHANNELS for certain platforms 2024-06-03 13:05:01 -05:00
gilbertlee-amd 0948eecbba Changing channel stride for MI300X multinode (#1196)
* Shuffling MI300X multi-node channels
* Updating tree channel logic
2024-06-03 10:00:55 -06:00
gilbertlee-amd 354e0b29a6 Addressing possible out-of-bounds mem access during channel duplication (#1193) 2024-05-30 14:02:14 -06:00
Wenkai Du 73221b4230 Add ring simple chunk size tuning (#1180)
* Add ring simple chunk size tuning

* modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning

* modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
2024-05-29 07:59:47 -07:00
Pedram Alizadeh 73acf3eeec modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X (#1172) 2024-05-08 15:49:33 -04:00
mberenjk 408278209d Adding ASAN changes to address memory leak issue (#1170)
Co-authored-by: akolliasAMD <akollias@amd.com>
2024-05-08 09:16:00 -05:00
Wenkai Du b18784d8b8 Add compiler warning for uninitialized variable and fix (#1163)
* Add compiler warning for uninitialized variable and fix

* Add -Wsometimes-uninitialized

* Convert warning to error
2024-05-08 07:00:25 -07:00
Wenkai Du f679db6ff6 Use normal permute path when one NIC per GPU (#1171) 2024-05-08 06:59:57 -07:00
Wenkai Du b513c3970a Bypass NVIDIA Ampere related tuning (#1165) 2024-05-03 17:57:16 -07:00
Wenkai Du bb58b1c258 Fix ignore NUMA not being observed for NICs during model matching (#1164) 2024-05-03 16:42:07 -07:00
Wenkai Du 9e0c9b4ed8 Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154) 2024-04-25 07:19:18 -07:00
BertanDogancay e1a835910e Merge remote-tracking branch 'nccl/master' into develop 2024-04-23 13:34:00 -07:00
Wenkai Du 220066197a Use hipExtMallocWithFlags to allocate host memory on APU (#1149)
Also use SM60 as CUDA compatibility level.
2024-04-17 16:56:38 -07:00
gilbertlee-amd 4cb62f999a Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
2024-04-15 12:03:57 -06:00
Sylvain Jeaugey ab2b89c4c3 2.21.5-1
Add support for IB SHARP 1PPN operation with user buffers.
Improve support for MNNVL, add NVLS support and multi-clique support.
 * Detect the NVLS clique through NVML
 * Exchange XML between peers in the same NVLS clique and fuse XMLs
   before creating the topology graph.
 * Rework bootstrap allgather algorithms to allow for large allgather
   operations intra-node (XML exchange).
Net/IB: add support for dynamic GID detection.
 * Automatically select RoCEv2/IPv4 interface by default. Allow to
   select IPv6 or even the network/mask.
Reduce NVLS memory usage.
 * Add stepSize as property of a connection to allow for different
   sizes on different peers; set it to 128K for NVLink SHARP.
Improve tuner loading
 * Look for more paths, be more consistent with the network device
   plugin.
 * Also search for tuner support inside the net plugin.
Improve tuner API
 * Add context to support multi-device per process.
Add magic number around comm object to detect comm corruption.
 * Add some basic check around communicators so that we can report a
   problem when a communicator gets corrupted or a wrong comm pointer
   is passed to NCCL.
Fix net/IB error path. Github PR #1164
Fix collnet rail mapping with split comm.
Fix packet reordering issue causing bootstrap mismatch
 * Use a different tag in ncclTransportP2pSetup for the connectInfo
   exchange and the following barrier.
Fix hang when crossNic is inconsistent between ranks.
Fix minCompCap/maxCompCap computation. Github issue #1184
2024-04-02 01:53:21 -07:00
Wenkai Du df98a6957d Add another Rome model (#1095) 2024-02-28 10:46:05 -08:00
Sylvain Jeaugey 48bb7fec79 2.20.5-1
Fix UDS connection failure when using ncclCommSplit. Issue #1185
2024-02-26 02:52:39 -08:00
Wenkai Du 74f9e5db64 Add new GPU model (#1080) 2024-02-23 12:19:42 -08:00
Bertan Dogancay 2fb12a9358 Merge pull request #1079 from BertanDogancay/2.19.4-sync
2.19.4 Sync
2024-02-16 09:50:11 -07:00
akolliasAMD bac57421c7 Allow bus id to be null (#1085)
* Allow bus id to be null
2024-02-15 16:36:51 -07:00
Sylvain Jeaugey b6475625fb 2.20.3-1
Add support for alternating rings, allow for cross-nic rings without
cross-rail communication.
Add support for user buffer registration for network send/recv.
Optimize aggregated operations to better utilize all channels.
Add flattening for BCM PCI gen5 switches.
Add support for inter-node NVLink communication
Add support for port fusion in NET/IB.
Add support for ReduceScatter and AllGather using Collnet.
Update net API to v8.
Fix hang during A2A connection.
2024-02-13 04:22:38 -08:00
Wenkai Du d999d9ad21 Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-09 11:31:03 -06:00