282 Commits

Author SHA1 Message Date
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
qiwei_ji f2ee8d9132 Check nvlink_node instead of xgmi_node in xml.cc (#1407)
It seems the code here is meant to check xgmi_node instead. If it checks the node for "nvlink", it verifies the link_info every time.
If it checks the node for "xgmi" and the answer is yes, it does not need to check the vsmi topo interface.
2025-01-06 17:09:27 -08:00
Hujingbo ad4c36dc34 increase p2p channels for Intel platform (#1448)
Co-authored-by: hujingbo <hujingbo@kuaishou.com>
2024-12-10 07:33:37 -08:00
Benjamin Kitor a05329bd0d Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P
2024-12-03 13:12:03 -08:00
gilbertlee-amd 000575867c Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal (#1431)
* Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal
2024-11-25 11:24:54 -07:00
corey-derochie-amd 1c45962273 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.
2024-11-04 09:46:42 -07:00
Avinash d6006f0425 Memory leak fixes in hostside functions (#1388)
memory leak fixes for parseRome4P2H and ncclTopoAddGPU
2024-10-30 14:25:56 -05:00
gilbertlee-amd 0cbce2a757 Adding support for odd nodes for model_87 (#1309) 2024-10-24 08:38:12 -06:00
Arm Patinyasakdikul 29f87c7191 Increased maximum number of XML nodes to support CPX mode. (#1386) 2024-10-23 11:15:11 -05:00
Wenkai Du c8d3543d3f Add back missing net flush (#1376) 2024-10-15 08:12:26 -07:00
Wenkai Du 5c367a21d0 Improve model matching for GPUs with alltoall XGMI connection (#1372) 2024-10-11 09:53:14 -07:00
Wenkai Du b55b6be0cb Fix crash when PXN is enabled on some platforms (#1369) 2024-10-11 09:02:59 -07:00
corey-derochie-amd c11f6b1531 Only set minNchannels if we are actually using MSCCL, checked using comm->mscclCompatible. (#1337) 2024-10-08 10:20:55 -06:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Nusrat Islam 833435be18 graph: fix for MI300X 64 GPU case (#1308)
PR #1290 introduced a failure for 64 GPU case on MI300X. This PR
fixes the failure.
2024-08-26 18:37:58 -05:00
Wenkai Du 532b70afb6 Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
Wenkai Du d3171b51b7 Fix gfx940 CPX mode (#1290) 2024-08-16 08:46:06 +08:00
Wenkai Du eff56735b0 Fix model matching with PXN enable (#1295) 2024-08-16 06:16:00 +08:00
akolliasAMD d6c317d6ae removed hcc mentions (#1291) 2024-08-14 15:04:13 -06:00
Pedram Alizadeh a25ca9bb90 adding new tuning table for very large number of nodes (#1288) 2024-08-09 10:47:42 -04:00
akolliasAMD c246e25f8e gfx12 Disable ll protocol (#1268) 2024-07-26 08:59:55 -06:00
Nusrat Islam 6f331b0d43 Enable CPX mode for MI300X (#1259)
* graph: enable cpx mode for MI300X

* graph: tune limits for cpx and cleanup
2024-07-19 11:30:37 -05:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
Nilesh M Negi a1ef217b32 Consistent channel shuffling for MI300X multi-node (#1255)
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"

This reverts commit 5be3b713ef.

* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"

This reverts commit ad31d93f3d.
2024-07-18 10:18:09 -05:00
Nilesh M Negi 67e867271f [GRAPH] Disable MSCCL override of no. of channels (#1187)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-15 10:45:21 -05:00
Nilesh M Negi 5be3b713ef [GRAPH] Use channel shuffling only for IB systems (#1228)
* [GRAPH] Use channel shuffling only for IB systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Define channels=48 for gfx94 RoCE systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Increase channels for RoCE gfx94 systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-02 12:20:40 -05:00
Nusrat Islam b09ea29d66 graph: fix minNchannels for multi-node overwrite (#1230) 2024-06-26 16:56:10 -05:00
Wenkai Du ad31d93f3d Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)
This reverts channel stride change in commit
0948eecbba
2024-06-25 14:03:30 -07:00
saurabhAMD e170f41ddd Unit Tests for testing channels (#1222) 2024-06-25 10:10:10 -05:00
Nusrat Islam 05df0f8cea graph: fix minNchannels for multi-node
Multi-node rccl was not correctly setting the minNchannels value. This
PR fixes the bug.
2024-06-24 16:42:44 -05:00
Sylvain Jeaugey 178b6b7590 2.22.3-1
Rework core for NVIDIA Trusted Computing
 * Compress work structs so that they are shared between channels
 * Utilize the full amount of kernel argument space permitted (4k)
   before resorting to work fifo.
 * Rework the task preprocessing phase.
 * Use a separate abortDevFlag which is kept in sync with abortFlag
   using cudaMemcpy operations.
 * Rename src/include/align.h to src/include/bitops.h

Add lazy connection establishment for collective operations
 * Move buffer allocation and connection establishment to the first
   collective operation using that algorithm.
 * Accelerate init time and reduce memory usage.
 * Avoid allocating NVLS buffers if all calls are registered.
 * Compute algo/proto in ncclLaunchCollTasksInfo early on.
 * Connect peers in ncclCollPreconnectFunc if not connected already.
 * Also move shared buffer creation to the first send/recv call.

Accelerate intra-node NVLink detection
 * Make each rank only detect NVLinks attached to its GPU.
 * Fuse XMLs to reconstruct the full NVLink topology

Add init profiling to report time spent in different init phases.
 * Report timings of bootstrap, allgather, search, connect, etc.
 * Add new "PROFILE" category for NCCL_DEBUG_SUBSYS.

Add support for PCI p2p on split PCI switches
 * Detect split PCI switches through a kernel module exposing
   switch information.
 * Update the topology XML and graph to add those inter-switch
   connections.

Add cost estimation API
 * Add a new ncclGroupEndSimulate primitive to return the estimated
   time a group would take.

Net/IB: Add separate traffic class for fifo messages
 * Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages
   independently from NCCL_IB_TC.
   Merges PR #1194

Net/IB: Add support for IB router
 * Use flid instead of lid if subnets do not match
 * Warn if flid is 0

Optimizations and fixes for device network offload (unpack)
 * Double the default number of channels
 * Cache netDeviceType
 * Fix save/increment head logic to enable Tree support.

Support ncclGroupStart/End for ncclCommAbort/Destroy
 * Allow Abort/Destroy to be called within a group when managing
   multiple GPUs with a single process.

Improve Tuner API
 * Provide to the plugin the original cost table so that the plugin
   can leave unknown or disabled algo/proto combinations untouched.
 * Remove nvlsSupport and collnetSupport.

Do not print version to stdout when using a debug file
 * Also print version from all processes with INFO debug level.
   Fixes issue #1271

Fix clang warnings in NVTX headers
 * Update NVTX headers to the latest version
   Fixes issue #1270

Disable port fusion in heterogeneous systems
 * Do not fuse ports if a mix of multi-port and single port are detected.

Fix NVLS graphs search for dual NICs.
 * Fix NVLS graph search when we have more than one NIC per GPU.

Fix crash with collnetDirect
 * Add separate graph search for collnetDirect, testing alltoall paths
   and working similarly to the NVLS search.

Fix hang when nodes have different CPU types
 * Add the CPU type to the rank peer info.
 * Align all ranks on the CPU type after the first allgather.
 * Only use the aligned CPU type for all tuning operations.
   Fixes issue #1136
   Fixes issue #1184

Fix performance of registered send/recv operations
 * Allow for single full size operations
 * Add INFO to confirm the registration of send/recv buffers.

Move all sync ops to finalize stage
 * Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has
   been called.

Improve error reporting during SHM segment creation

Improve support of various compilers
   Merges PR #1177
   Merges PR #1228

Allow net and tuner plugins to be statically linked
 * Search for ncclNet or ncclTuner symbols in the main binary.
   Merges PR #979

Plugin examples includes cleanup
 * Harmonize err.h and common.h usage.
 * Add mixed plugin with both net and tuner.
2024-06-19 01:57:16 -07:00
Nusrat Islam 9660e2e2dc Merge pull request #1200 from nusislam/multi-node-256-fix
graph: fix multi-node channel count
2024-06-07 14:34:20 -05:00
gilbertlee-amd 9b94a1052f Disabling NUMA matching for model 79 for some VM configs (#1204) 2024-06-06 17:15:04 -06:00
Nusrat Islam 526cce9bf4 graph: restrict maxChannels to 64 for multi-node and RCCL_ENABLE_INTRANET=1 2024-06-06 10:58:41 -05:00
Nusrat Islam 6ab20a7c6b graph: fix multi-node minChannel count 2024-06-06 10:56:39 -05:00
Nusrat Islam 9746d8ca3f set MIN_NCHANNEL limit to 64 for multi-node 2024-06-03 13:05:05 -05:00
Nusrat Islam ef442f8f92 set MAXCHANNELS to 128 2024-06-03 13:05:05 -05:00
Nusrat Islam 9f654f6cf5 graph: restrict MAXCHANNELS for certain platforms 2024-06-03 13:05:01 -05:00
gilbertlee-amd 0948eecbba Changing channel stride for MI300X multinode (#1196)
* Shuffling MI300X multi-node channels
* Updating tree channel logic
2024-06-03 10:00:55 -06:00
gilbertlee-amd 354e0b29a6 Addressing possible out-of-bounds mem access during channel duplication (#1193) 2024-05-30 14:02:14 -06:00
Wenkai Du 73221b4230 Add ring simple chunk size tuning (#1180)
* Add ring simple chunk size tuning

* modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning

* modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
2024-05-29 07:59:47 -07:00
Pedram Alizadeh 73acf3eeec modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X (#1172) 2024-05-08 15:49:33 -04:00
mberenjk 408278209d Adding ASAN changes to address memory leak issue (#1170)
Co-authored-by: akolliasAMD <akollias@amd.com>
2024-05-08 09:16:00 -05:00
Wenkai Du b18784d8b8 Add compiler warning for uninitialized variable and fix (#1163)
* Add compiler warning for uninitialized variable and fix

* Add -Wsometimes-uninitialized

* Convert warning to error
2024-05-08 07:00:25 -07:00
Wenkai Du f679db6ff6 Use normal permute path when one NIC per GPU (#1171) 2024-05-08 06:59:57 -07:00
Wenkai Du b513c3970a Bypass NVIDIA Ampere related tuning (#1165) 2024-05-03 17:57:16 -07:00
Wenkai Du bb58b1c258 Fix ignore NUMA not being observed for NICs during model matching (#1164) 2024-05-03 16:42:07 -07:00
Wenkai Du 9e0c9b4ed8 Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154) 2024-04-25 07:19:18 -07:00
BertanDogancay e1a835910e Merge remote-tracking branch 'nccl/master' into develop 2024-04-23 13:34:00 -07:00