rocm-systems

Author	SHA1	Message	Date
BertanDogancay	36343be84f	Merge remote-tracking branch 'nccl/master' into develop	2025-01-23 12:08:46 -06:00
qiwei_ji	f2ee8d9132	Check nvlink_node instead of xgmi_node in xml.cc (#1407 ) It seems like here wants to check xgmi_node instead. If checks node for "nvlink", it will verify the link_info everytime. If checks node for "xgmi", when get yes answer, it won't need check vsmi topo interface.	2025-01-06 17:09:27 -08:00
Hujingbo	ad4c36dc34	increase p2p channels for Intel platform (#1448 ) Co-authored-by: hujingbo <hujingbo@kuaishou.com>	2024-12-10 07:33:37 -08:00
Benjamin Kitor	a05329bd0d	Add Topologies for 16-GPU gfx942 SuperNode (#1417 ) * Add Topologies for 16-GPU gfx942 SuperNode - Add GigaIO topologies to tools/topo_expl for dev and testing - Add GigaIO Columba 16 GPU romeModel and adjust topology matching algorithm in rome_models for 16 GPU system - Fix bug which failed to match Rome Model when using subsets of system resources (i.e. ROCR_VISIBLE_DEVICES is set) - Fixes for topo_expl * Fix bug w/ 1H16P	2024-12-03 13:12:03 -08:00
gilbertlee-amd	000575867c	Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal (#1431 ) * Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal	2024-11-25 11:24:54 -07:00
corey-derochie-amd	1c45962273	Hide or fix all build warnings (#1331 ) * Changing C-strings to be const. * Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension. * Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings. * Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check. * Fixed VLA in rccl UT.	2024-11-04 09:46:42 -07:00
Avinash	d6006f0425	Memory leak fixes in hostside functions (#1388 ) memory leak fixes for parseRome4P2H and ncclTopoAddGPU	2024-10-30 14:25:56 -05:00
gilbertlee-amd	0cbce2a757	Adding support for odd nodes for model_87 (#1309 )	2024-10-24 08:38:12 -06:00
Arm Patinyasakdikul	29f87c7191	Increased maximum number of XML nodes to support CPX mode. (#1386 )	2024-10-23 11:15:11 -05:00
Wenkai Du	c8d3543d3f	Add back missing net flush (#1376 )	2024-10-15 08:12:26 -07:00
Wenkai Du	5c367a21d0	Improve model matching for GPUs with alltoall XGMI connection (#1372 )	2024-10-11 09:53:14 -07:00
Wenkai Du	b55b6be0cb	Fix crash when PXN is enabled on some platforms (#1369 )	2024-10-11 09:02:59 -07:00
corey-derochie-amd	c11f6b1531	Only set `minNchannels` if we are actually using MSCCL, checked using `comm->mscclCompatible`. (#1337 )	2024-10-08 10:20:55 -06:00
BertanDogancay	84081064a0	Merge remote-tracking branch 'nccl/master' into develop	2024-10-02 09:31:25 -05:00
Wenkai Du	e453f1ced9	Add another Rome model (#1354 )	2024-10-01 17:41:27 -05:00
Nusrat Islam	833435be18	graph: fix for MI300X 64 GPU case (#1308 ) PR #1290 introduced a failure for 64 GPU case on MI300X. This PR fixes the failure.	2024-08-26 18:37:58 -05:00
Wenkai Du	532b70afb6	Add new Rome model (#1304 ) * Add another rome model and override * Fix bug * Fix typo * Add ring * Update ring * Fix model matching * Clean up * Clean up * Reverse rings for NCCL_RINGS input * Only reverse NCCL_RINGS for ring graph * Fix mapping issue when using NCCL_RINGS * Add NCCL_RINGS_REMAP to handle inconsistent net names	2024-08-23 08:45:43 +08:00
Wenkai Du	d3171b51b7	Fix gfx940 CPX mode (#1290 )	2024-08-16 08:46:06 +08:00
Wenkai Du	eff56735b0	Fix model matching with PXN enable (#1295 )	2024-08-16 06:16:00 +08:00
akolliasAMD	d6c317d6ae	removed hcc mentions (#1291 )	2024-08-14 15:04:13 -06:00
Pedram Alizadeh	a25ca9bb90	adding new tuning table for very large number of nodes (#1288 )	2024-08-09 10:47:42 -04:00
akolliasAMD	c246e25f8e	gfx12 Disable ll protocol (#1268 )	2024-07-26 08:59:55 -06:00
Nusrat Islam	6f331b0d43	Enable CPX mode for MI300X (#1259 ) * graph: enable cpx mode for MI300X * graph: tune limits for cpx and cleanup	2024-07-19 11:30:37 -05:00
Wenkai Du	89349f2ce4	Template unroll for RCCL kernels (#1250 ) * Template unroll for RCCL kernels * Adding unroll template arg during CMake hipification * Reduce linking parallel jobs to avoid OOM in CI * Workaround issues with UT tests SWDEV-469533: register spill fix is needed for mainline build LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs Use -parallel-jobs=8 for linking * CI: do not use -j 16 when building * CI: use -j 8 when building * Only reduce parallel linking job for CI extended * Restore original jenkins command. Change parallel linking jobs in cmake * Disable MSCCLPP --------- Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>	2024-07-19 08:15:59 -07:00
Nilesh M Negi	a1ef217b32	Consistent channel shuffling for MI300X multi-node (#1255 ) * Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)" This reverts commit `5be3b713ef`. * Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)" This reverts commit `ad31d93f3d`.	2024-07-18 10:18:09 -05:00
Nilesh M Negi	67e867271f	[GRAPH] Disable MSCCL override of no. of channels (#1187 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-07-15 10:45:21 -05:00
Nilesh M Negi	5be3b713ef	[GRAPH] Use channel shuffling only for IB systems (#1228 ) * [GRAPH] Use channel shuffling only for IB systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [GRAPH] Define channels=48 for gfx94 RoCE systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [GRAPH] Increase channels for RoCE gfx94 systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-07-02 12:20:40 -05:00
Nusrat Islam	b09ea29d66	graph: fix minNchannels for multi-node overwrite (#1230 )	2024-06-26 16:56:10 -05:00
Wenkai Du	ad31d93f3d	Revert "Changing channel stride for MI300X multinode (#1196 )" (#1224 ) This reverts channel stride change in commit `0948eecbba`	2024-06-25 14:03:30 -07:00
saurabhAMD	e170f41ddd	Unit Tests for testing channels (#1222 )	2024-06-25 10:10:10 -05:00
Nusrat Islam	05df0f8cea	graph: fix minNchannels for multi-node Multi-node rccl was not correctly setting the minNchannels value. This PR fixes the bug.	2024-06-24 16:42:44 -05:00
Sylvain Jeaugey	178b6b7590	2.22.3-1 Rework core for NVIDIA Trusted Computing * Compress work structs so that they are shared between channels * Utilize the full amount of kernel argument space permitted (4k) before resorting to work fifo. * Rework the task preprocessing phase. * Use a separate abortDevFlag which is kept in sync with abortFlag using cudaMemcpy operations. * Rename src/include/align.h to src/include/bitops.h Add lazy connection establishment for collective operations * Move buffer allocation and connection establishment to the first collective operation using that algorithm. * Accelerate init time and reduce memory usage. * Avoid allocating NVLS buffers if all calls are registered. * Compute algo/proto in ncclLaunchCollTasksInfo early on. * Connect peers in ncclCollPreconnectFunc if not connected already. * Also move shared buffer creation to the first send/recv call. Accelerate intra-node NVLink detection * Make each rank only detect NVLinks attached to its GPU. * Fuse XMLs to reconstruct the full NVLink topology Add init profiling to report time spent in different init phases. * Report timings of bootstrap, allgather, search, connect, etc. * Add new "PROFILE" category for NCCL_DEBUG_SUBSYS. Add support for PCI p2p on split PCI switches * Detect split PCI switches through a kernel module exposing switch information. * Update the topology XML and graph to add those inter-switch connections. Add cost estimation API * Add a new ncclGroupEndSimulate primitive to return the estimated time a group would take. Net/IB: Add separate traffic class for fifo messages * Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages independently from NCCL_IB_TC. Merges PR #1194 Net/IB: Add support for IB router * Use flid instead of lid if subnets do not match * Warn if flid is 0 Optimizations and fixes for device network offload (unpack) * Double the default number of channels * Cache netDeviceType * Fix save/increment head logic to enable Tree support. Support ncclGroupStart/End for ncclCommAbort/Destroy * Allow Abort/Destroy to be called within a group when managing multiple GPUs with a single process. Improve Tuner API * Provide to the plugin the original cost table so that the plugin can leave unknown or disabled algo/proto combinations untouched. * Remove nvlsSupport and collnetSupport. Do not print version to stdout when using a debug file * Also print version from all processes with INFO debug level. Fixes issue #1271 Fix clang warnings in NVTX headers * Update NVTX headers to the latest version Fixes issue #1270 Disable port fusion in heterogeneous systems * Do not fuse ports if a mix of multi-port and single port are detected. Fix NVLS graphs search for dual NICs. * Fix NVLS graph search when we have more than one NIC per GPU. Fix crash with collnetDirect * Add separate graph search for collnetDirect, testing alltoall paths and working similarly to the NVLS search. Fix hang when nodes have different CPU types * Add the CPU type to the rank peer info. * Align all ranks on the CPU type after the first allgather. * Only use the aligned CPU type for all tuning operations. Fixes issue #1136 Fixes issue #1184 Fix performance of registered send/recv operations * Allow for single full size operations * Add INFO to confirm the registration of send/recv buffers. Move all sync ops to finalize stage * Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has been called. Improve error reporting during SHM segment creation Improve support of various compilers Merges PR #1177 Merges PR #1228 Allow net and tuner plugins to be statically linked * Search for ncclNet or ncclTuner symbols in the main binary. Merges PR #979 Plugin examples includes cleanup * Harmonize err.h and common.h usage. * Add mixed plugin with both net and tuner.	2024-06-19 01:57:16 -07:00
Nusrat Islam	9660e2e2dc	Merge pull request #1200 from nusislam/multi-node-256-fix graph: fix multi-node channel count	2024-06-07 14:34:20 -05:00
gilbertlee-amd	9b94a1052f	Disabling NUMA matching for model 79 for some VM configs (#1204 )	2024-06-06 17:15:04 -06:00
Nusrat Islam	526cce9bf4	graph: restrict maxChannels to 64 for multi-node and RCCL_ENABLE_INTRANET=1	2024-06-06 10:58:41 -05:00
Nusrat Islam	6ab20a7c6b	graph: fix multi-node minChannel count	2024-06-06 10:56:39 -05:00
Nusrat Islam	9746d8ca3f	set MIN_NCHANNEL limit to 64 for multi-node	2024-06-03 13:05:05 -05:00
Nusrat Islam	ef442f8f92	set MAXCHANNELS to 128	2024-06-03 13:05:05 -05:00
Nusrat Islam	9f654f6cf5	graph: restrict MAXCHANNELS for certain platforms	2024-06-03 13:05:01 -05:00
gilbertlee-amd	0948eecbba	Changing channel stride for MI300X multinode (#1196 ) * Shuffling MI300X multi-node channels * Updating tree channel logic	2024-06-03 10:00:55 -06:00
gilbertlee-amd	354e0b29a6	Addressing possible out-of-bounds mem access during channel duplication (#1193 )	2024-05-30 14:02:14 -06:00
Wenkai Du	73221b4230	Add ring simple chunk size tuning (#1180 ) * Add ring simple chunk size tuning * modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning * modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning --------- Co-authored-by: PedramAlizadeh <pmohamma@amd.com>	2024-05-29 07:59:47 -07:00
Pedram Alizadeh	73acf3eeec	modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X (#1172 )	2024-05-08 15:49:33 -04:00
mberenjk	408278209d	Adding ASAN changes to address memory leak issue (#1170 ) Co-authored-by: akolliasAMD <akollias@amd.com>	2024-05-08 09:16:00 -05:00
Wenkai Du	b18784d8b8	Add compiler warning for uninitialized variable and fix (#1163 ) * Add compiler warning for uninitialized variable and fix * Add -Wsometimes-uninitialized * Convert warning to error	2024-05-08 07:00:25 -07:00
Wenkai Du	f679db6ff6	Use normal permute path when one NIC per GPU (#1171 )	2024-05-08 06:59:57 -07:00
Wenkai Du	b513c3970a	Bypass NVIDIA Ampere related tuning (#1165 )	2024-05-03 17:57:16 -07:00
Wenkai Du	bb58b1c258	Fix ignore NUMA not being observed for NICs during model matching (#1164 )	2024-05-03 16:42:07 -07:00
Wenkai Du	9e0c9b4ed8	Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154 )	2024-04-25 07:19:18 -07:00
BertanDogancay	e1a835910e	Merge remote-tracking branch 'nccl/master' into develop	2024-04-23 13:34:00 -07:00

282 Commits