rocm-systems

Author	SHA1	Message	Date
Wenkai Du	c8d3543d3f	Add back missing net flush (#1376)	2024-10-15 08:12:26 -07:00
Wenkai Du	5c367a21d0	Improve model matching for GPUs with alltoall XGMI connection (#1372)	2024-10-11 09:53:14 -07:00
Wenkai Du	b55b6be0cb	Fix crash when PXN is enabled on some platforms (#1369)	2024-10-11 09:02:59 -07:00
corey-derochie-amd	c11f6b1531	Only set `minNchannels` if we are actually using MSCCL, checked using `comm->mscclCompatible`. (#1337)	2024-10-08 10:20:55 -06:00
BertanDogancay	84081064a0	Merge remote-tracking branch 'nccl/master' into develop	2024-10-02 09:31:25 -05:00
Wenkai Du	e453f1ced9	Add another Rome model (#1354)	2024-10-01 17:41:27 -05:00
Nusrat Islam	833435be18	graph: fix for MI300X 64 GPU case (#1308) PR #1290 introduced a failure for 64 GPU case on MI300X. This PR fixes the failure.	2024-08-26 18:37:58 -05:00
Wenkai Du	532b70afb6	Add new Rome model (#1304) * Add another rome model and override * Fix bug * Fix typo * Add ring * Update ring * Fix model matching * Clean up * Clean up * Reverse rings for NCCL_RINGS input * Only reverse NCCL_RINGS for ring graph * Fix mapping issue when using NCCL_RINGS * Add NCCL_RINGS_REMAP to handle inconsistent net names	2024-08-23 08:45:43 +08:00
Wenkai Du	d3171b51b7	Fix gfx940 CPX mode (#1290)	2024-08-16 08:46:06 +08:00
Wenkai Du	eff56735b0	Fix model matching with PXN enable (#1295)	2024-08-16 06:16:00 +08:00
akolliasAMD	d6c317d6ae	removed hcc mentions (#1291)	2024-08-14 15:04:13 -06:00
Pedram Alizadeh	a25ca9bb90	adding new tuning table for very large number of nodes (#1288)	2024-08-09 10:47:42 -04:00
akolliasAMD	c246e25f8e	gfx12 Disable ll protocol (#1268)	2024-07-26 08:59:55 -06:00
Nusrat Islam	6f331b0d43	Enable CPX mode for MI300X (#1259) * graph: enable cpx mode for MI300X * graph: tune limits for cpx and cleanup	2024-07-19 11:30:37 -05:00
Wenkai Du	89349f2ce4	Template unroll for RCCL kernels (#1250) * Template unroll for RCCL kernels * Adding unroll template arg during CMake hipification * Reduce linking parallel jobs to avoid OOM in CI * Workaround issues with UT tests SWDEV-469533: register spill fix is needed for mainline build LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs Use -parallel-jobs=8 for linking * CI: do not use -j 16 when building * CI: use -j 8 when building * Only reduce parallel linking job for CI extended * Restore original jenkins command. Change parallel linking jobs in cmake * Disable MSCCLPP --------- Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>	2024-07-19 08:15:59 -07:00
Nilesh M Negi	a1ef217b32	Consistent channel shuffling for MI300X multi-node (#1255) * Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)" This reverts commit `5be3b713ef`. * Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)" This reverts commit `ad31d93f3d`.	2024-07-18 10:18:09 -05:00
Nilesh M Negi	67e867271f	[GRAPH] Disable MSCCL override of no. of channels (#1187) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-07-15 10:45:21 -05:00
Nilesh M Negi	5be3b713ef	[GRAPH] Use channel shuffling only for IB systems (#1228) * [GRAPH] Use channel shuffling only for IB systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [GRAPH] Define channels=48 for gfx94 RoCE systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [GRAPH] Increase channels for RoCE gfx94 systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-07-02 12:20:40 -05:00
Nusrat Islam	b09ea29d66	graph: fix minNchannels for multi-node overwrite (#1230)	2024-06-26 16:56:10 -05:00
Wenkai Du	ad31d93f3d	Revert "Changing channel stride for MI300X multinode (#1196)" (#1224) This reverts channel stride change in commit `0948eecbba`	2024-06-25 14:03:30 -07:00
saurabhAMD	e170f41ddd	Unit Tests for testing channels (#1222)	2024-06-25 10:10:10 -05:00
Nusrat Islam	05df0f8cea	graph: fix minNchannels for multi-node Multi-node rccl was not correctly setting the minNchannels value. This PR fixes the bug.	2024-06-24 16:42:44 -05:00
Nusrat Islam	9660e2e2dc	Merge pull request #1200 from nusislam/multi-node-256-fix graph: fix multi-node channel count	2024-06-07 14:34:20 -05:00
gilbertlee-amd	9b94a1052f	Disabling NUMA matching for model 79 for some VM configs (#1204)	2024-06-06 17:15:04 -06:00
Nusrat Islam	526cce9bf4	graph: restrict maxChannels to 64 for multi-node and RCCL_ENABLE_INTRANET=1	2024-06-06 10:58:41 -05:00
Nusrat Islam	6ab20a7c6b	graph: fix multi-node minChannel count	2024-06-06 10:56:39 -05:00
Nusrat Islam	9746d8ca3f	set MIN_NCHANNEL limit to 64 for multi-node	2024-06-03 13:05:05 -05:00
Nusrat Islam	ef442f8f92	set MAXCHANNELS to 128	2024-06-03 13:05:05 -05:00
Nusrat Islam	9f654f6cf5	graph: restrict MAXCHANNELS for certain platforms	2024-06-03 13:05:01 -05:00
gilbertlee-amd	0948eecbba	Changing channel stride for MI300X multinode (#1196) * Shuffling MI300X multi-node channels * Updating tree channel logic	2024-06-03 10:00:55 -06:00
gilbertlee-amd	354e0b29a6	Addressing possible out-of-bounds mem access during channel duplication (#1193)	2024-05-30 14:02:14 -06:00
Wenkai Du	73221b4230	Add ring simple chunk size tuning (#1180) * Add ring simple chunk size tuning * modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning * modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning --------- Co-authored-by: PedramAlizadeh <pmohamma@amd.com>	2024-05-29 07:59:47 -07:00
Pedram Alizadeh	73acf3eeec	modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X (#1172)	2024-05-08 15:49:33 -04:00
mberenjk	408278209d	Adding ASAN changes to address memory leak issue (#1170) Co-authored-by: akolliasAMD <akollias@amd.com>	2024-05-08 09:16:00 -05:00
Wenkai Du	b18784d8b8	Add compiler warning for uninitialized variable and fix (#1163) * Add compiler warning for uninitialized variable and fix * Add -Wsometimes-uninitialized * Convert warning to error	2024-05-08 07:00:25 -07:00
Wenkai Du	f679db6ff6	Use normal permute path when one NIC per GPU (#1171)	2024-05-08 06:59:57 -07:00
Wenkai Du	b513c3970a	Bypass NVIDIA Ampere related tuning (#1165)	2024-05-03 17:57:16 -07:00
Wenkai Du	bb58b1c258	Fix ignore NUMA not being observed for NICs during model matching (#1164)	2024-05-03 16:42:07 -07:00
Wenkai Du	9e0c9b4ed8	Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154)	2024-04-25 07:19:18 -07:00
BertanDogancay	e1a835910e	Merge remote-tracking branch 'nccl/master' into develop	2024-04-23 13:34:00 -07:00
Wenkai Du	220066197a	Use hipExtMallocWithFlags to allocate host memory on APU (#1149) Also use SM60 as CUDA compatibility level.	2024-04-17 16:56:38 -07:00
gilbertlee-amd	4cb62f999a	Rail optimization for rings (#1140) - Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)	2024-04-15 12:03:57 -06:00
Sylvain Jeaugey	ab2b89c4c3	2.21.5-1 Add support for IB SHARP 1PPN operation with user buffers. Improve support for MNNVL, add NVLS support and multi-clique support. * Detect the NVLS clique through NVML * Exchange XML between peers in the same NVLS clique and fuse XMLs before creating the topology graph. * Rework bootstrap allgather algorithms to allow for large allgather operations intra-node (XML exchange). Net/IB: add support for dynamic GID detection. * Automatically select RoCEv2/IPv4 interface by default. Allow to select IPv6 or even the network/mask. Reduce NVLS memory usage. * Add stepSize as property of a connection to allow for different sizes on different peers; set it to 128K for NVLink SHARP. Improve tuner loading * Look for more paths, be more consistent with the network device plugin. * Also search for tuner support inside the net plugin. Improve tuner API * Add context to support multi-device per process. Add magic number around comm object to detect comm corruption. * Add some basic check around communicators so that we can report a problem when a communicator gets corrupted or a wrong comm pointer is passed to NCCL. Fix net/IB error path. Github PR #1164 Fix collnet rail mapping with split comm. Fix packet reordering issue causing bootstrap mismatch * Use a different tag in ncclTransportP2pSetup for the connectInfo exchange and the following barrier. Fix hang when crossNic is inconsistent between ranks. Fix minCompCap/maxCompCap computation. Github issue #1184	2024-04-02 01:53:21 -07:00
Wenkai Du	df98a6957d	Add another Rome model (#1095)	2024-02-28 10:46:05 -08:00
Sylvain Jeaugey	48bb7fec79	2.20.5-1 Fix UDS connection failure when using ncclCommSplit. Issue #1185	2024-02-26 02:52:39 -08:00
Wenkai Du	74f9e5db64	Add new GPU model (#1080)	2024-02-23 12:19:42 -08:00
Bertan Dogancay	2fb12a9358	Merge pull request #1079 from BertanDogancay/2.19.4-sync 2.19.4 Sync	2024-02-16 09:50:11 -07:00
akolliasAMD	bac57421c7	Allow bus id to be null (#1085) * Allow bus id to be null	2024-02-15 16:36:51 -07:00
Sylvain Jeaugey	b6475625fb	2.20.3-1 Add support for alternating rings, allow for cross-nic rings without cross-rail communication. Add support for user buffer registration for network send/recv. Optimize aggregated operations to better utilize all channels. Add flattening for BCM PCI gen5 switches. Add support for inter-node NVLink communication Add support for port fusion in NET/IB. Add support for ReduceScatter and AllGather using Collnet. Update net API to v8. Fix hang during A2A connection.	2024-02-13 04:22:38 -08:00
Wenkai Du	d999d9ad21	Merge remote-tracking branch 'rccl/develop' into 2.19.4	2024-02-09 11:31:03 -06:00


272 commits