Commit Graph

545 Commits

Author SHA1 Message Date
gilbertlee-amd 5bcd3768cc Minor fix for some msccl installations (#862) 2023-08-23 13:48:58 -06:00
Wenkai Du 6a0a6a37d9 Use relaxed atomics for LL on GFX11 (#859) 2023-08-21 16:28:39 -07:00
David Pagan 2ec2648247 Fix static_assert string literal that contains a "\%" (#860)
This is no longer valid: such literals can only contain simple escape
sequences. Removing the '\' fixes the issue. The assert message now
compiles and emits the '%' as expected.
2023-08-21 16:19:59 -07:00
akolliasAMD d33cd5a233 NCCL_TREES variable and rome model fixes (#856) 2023-08-21 10:35:37 -06:00
Wenkai Du f70e3e569b gfx11: don't use LL for sendrecv (#853)
* gfx11: don't use LL for sendrecv

* Use builtin instead of inline asm
2023-08-17 08:50:51 -07:00
Wenkai Du 7044599575 Add new model support (#847)
* Add new model support

* Update new rings
2023-08-10 17:14:51 -07:00
Ziyue Yang d33a70e620 NPKit update (#844)
* NPKit update

1. Enable NPKit for MSCCL kernels
2. Fix NPKit context index calculation for sendrecv kernels

* Update build script for npkit
2023-08-08 17:30:40 -07:00
Wenkai Du d65c0830c6 Detect HIP_UNCACHED_MEMORY support from HIP version (#842) 2023-08-04 10:17:04 -07:00
Wenkai Du 8e58b65873 gfx11xx: disable LL protocol to workaround mtype issue (#840) 2023-08-04 07:53:07 -07:00
Wenkai Du 60efe26549 Fix merge error and replace inline asm (#838) 2023-08-03 13:46:40 -07:00
Wenkai Du c8085eb704 Improve collective trace (#835) 2023-08-03 07:16:12 -07:00
Bertan Dogancay 64c32d1c5b Disable MSCCL kernels at compile time (#834)
* Disable MSCCL kernels at compile time
2023-08-02 09:45:18 -06:00
gilbertlee-amd afa4a5ecf8 Updating Doxygen documentation (#831) 2023-07-28 16:09:06 -06:00
Wenkai Du 3db371c9a5 Revert "Enable Ll128 on gfx90a (#823)" (#829)
This reverts commit 420f8af6a0.

Also increase number of parallel jobs for linking
2023-07-27 20:25:18 -07:00
Ziyue Yang f7dc7b7e6a Fix MSCCL proxy number of chunks calculation (#821)
The number of transmissions parsed from an MSCCL algorithm is a 1-based value,
but it was treated as a 0-based value when calculating the proxy number of chunks.
This commit fixes the issue.
2023-07-26 13:24:49 -07:00
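To make the 1-based versus 0-based mismatch described in the commit above concrete, here is a minimal sketch. Both helper functions are hypothetical and are not taken from the RCCL or MSCCL sources. The commit message does not say in which direction the real miscount went, so the sketch only shows that mixing the two conventions shifts the result by one.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; not the RCCL/MSCCL code.

// Correct for a 1-based value: the parsed number already is the count.
inline int countFromOneBased(int parsed) {
  assert(parsed >= 1);
  return parsed;
}

// Correct only for a 0-based value: the parsed number is the highest index,
// so the count is one larger.
inline int countFromZeroBased(int parsed) {
  assert(parsed >= 0);
  return parsed + 1;
}
```

Feeding a 1-based value of 4 into `countFromZeroBased` yields 5 instead of 4, which is the kind of off-by-one a convention mix-up produces.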
Wenkai Du 420f8af6a0 Enable Ll128 on gfx90a (#823) 2023-07-26 11:44:15 -07:00
Wenkai Du 4d20b4b758 Replace atomicExch with __atomic_store_n (#818)
* Replace atomicExch with __atomic_store_n

* Remove extra semicolon
2023-07-25 11:15:21 -07:00
Nusrat Islam 47f754e6f5 Merge pull request #810 from nusislam/tune-send-recv
device: fine-tune RCCL send-recv on MI250/MI200
2023-07-25 10:18:12 -05:00
Bertan Dogancay 8bab4f04b7 Implement RCCL Replayer (#817)
* Implement RCCL Replayer
2023-07-24 16:26:22 -06:00
Wenkai Du a7fcd58a97 Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)
2023-07-21 07:31:27 -07:00
Nusrat Islam b2a0a0bd3a device: fine-tune RCCL send-recv 2023-07-17 21:45:52 -05:00
Ziyue Yang 848e60b00c Fix path finding in msccl internal scheduler (#794) 2023-07-12 13:31:28 -07:00
Wenkai Du 0f14e5a640 npkit: separate network timing between send and test (#798) 2023-07-10 09:31:49 -07:00
Nusrat Islam 58e53dfd37 device: fine tune MI200/MI250 simple protocol performance
With the Simple protocol, an unroll factor of 4 offers better
performance for most of the collectives (on MI200, MI250, and
MI300), except large-message allreduce with the Ring algorithm
on MI250 and MI200. This PR changes the default unroll factor
to 4 while adding fine tuning for reduction operations.
2023-07-08 20:21:18 -05:00
Wenkai Du e0c70af46b Merge remote-tracking branch 'nccl/master' into nccl_sync 2023-07-05 07:53:53 -07:00
Wenkai Du ce6a2ffac8 Merge pull request #782 from ROCmSoftwarePlatform/2.18.3
Sync up with NCCL 2.18.3
2023-06-29 15:04:16 -07:00
akolliasAMD 9bba4a2f2a added npkit support into the all_gather run ring algorithm (#790) 2023-06-29 13:59:54 -06:00
Dmitrii Gabor 6e24ef4e1f Prevent WR index truncation in the InfiniBand transport plugin 2023-06-28 11:39:19 +02:00
Wenkai Du abd0615351 Merge remote-tracking branch 'nccl/master' into develop 2023-06-26 22:51:56 +00:00
arvindcheru bd14ac8b59 ASAN build excluding additional files, Algodir support for share folder
* ASAN build excluding additional files, Algodir support for share folder (#786)
* Algodir support for share folder
2023-06-23 10:57:20 -04:00
Bertan Dogancay 0c77c66221 Disable Colltrace for --fast option (#778)
* Disable Colltrace for --fast option

* Limit nprocs for CI
2023-06-21 14:16:09 -06:00
akolliasAMD 9bdf6797a5 fixed npkit size to never be a negative number (#779) 2023-06-21 08:26:40 -06:00
gilbertlee-amd 52a28ff2fc Switching to using atomicAdd_system within kernel for collective trace (#780) 2023-06-20 17:49:52 -06:00
Nusrat Islam 3a741787bf device: use unroll factor based on platforms 2023-06-14 13:36:15 -05:00
Bertan Dogancay f35777e9b0 improve compilation time and create timetrace plot (#773)
* improve compilation time and create time-trace plot

* set default value for nproc
2023-06-14 09:17:51 -06:00
Sylvain Jeaugey ea38312273 2.18.3-1
Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.
2023-06-14 01:29:17 -07:00
akolliasAMD 9cdac774ea Wall clock update and npkit trace script Update (#771)
* changed builtin clock to wall_clock64
* updated npkit_Trace_generator to the new version of npkit
2023-06-07 17:47:10 -06:00
Cory Bloor b1a65afd58 Fix build on additional architectures (#740)
* Fix build on additional architectures

Instead of directly wrapping a platform-specific operation with a
preprocessor check against a gfx macro, it can be more flexible to
check a macro that can be overridden by the user. The gfx macro can then
just provide the default value for the macro, resulting in the same
default behaviour as if the gfx macro was checked directly but with
more control at build-time.

For example, to build rccl without using buffer_wbinvl1_vol on
gfx902, but still use the default on other archs, a user could
export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before
configuring the build. This flexibility isn't always necessary, but
it's nicer to have it and not need it than to need it and not have it.

* Define WARP_SIZE using warpSize builtin
2023-06-06 16:45:50 -06:00
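The pattern described in the commit above, where an architecture macro only supplies the default for a user-overridable feature macro, might look like the sketch below. `DEMO_ARCH_LEGACY` and `DEMO_USE_CACHE_INVALIDATE` are hypothetical names chosen for illustration. They are not RCCL macros, and the defaults shown are assumptions.

```cpp
// Hypothetical macros for illustration only; not taken from RCCL.
// The feature macro is defined here only if the user has not already set it,
// for example with -DDEMO_USE_CACHE_INVALIDATE=0 on the compiler command line.
#ifndef DEMO_USE_CACHE_INVALIDATE
#  if defined(DEMO_ARCH_LEGACY)
#    define DEMO_USE_CACHE_INVALIDATE 0  // assumed default for this architecture
#  else
#    define DEMO_USE_CACHE_INVALIDATE 1  // assumed default everywhere else
#  endif
#endif

// Code checks the feature macro, never the architecture macro directly.
constexpr bool useCacheInvalidate() {
  return DEMO_USE_CACHE_INVALIDATE != 0;
}
```

Because the architecture check sits inside the `#ifndef`, a value passed at build time always wins, while an unconfigured build behaves as if the architecture macro had been tested directly.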
Wenkai Du 3af90902c8 Add NCCL_NCHANNELS_PER_PEER override (#767)
Also fix topol_expl build issue
2023-06-06 08:41:38 -07:00
Bertan Dogancay d52b6c0d24 add DMA_BUF support (#763)
* add DMA_BUF support

* remove unused libraries in src/init.cc

* change NCCL_ALL to NCCL_INIT

* remove extra pointer functions in transport/net.cc
2023-06-01 12:46:42 -06:00
Wenkai Du 5a38ff192b Rework barrier and event code (#761)
* Rework barrier and event code

* Switch to inline asm
2023-05-31 13:36:51 -07:00
Nusrat Islam 4d1cfb17c8 device: change unroll factor
The default value of unroll factor is 2. Changing the unroll
factor to 4 provides better performance for most of the collectives.
2023-05-25 15:42:35 -05:00
Ziyue Yang 7d6e7bcd7d revert npkit (#748) 2023-05-24 07:41:05 -07:00
Ziyue Yang ed252c30f4 Limit MSCCL reduce unrolling to pow-2 cases to shrink kernel size (#746) 2023-05-19 11:46:36 -07:00
Ziyue Yang 11676267b5 fix min, max and avg (#745) 2023-05-18 11:02:59 -07:00
Wen-Heng (Jack) Chung eba4e9e100 Merge pull request #742 from whchung/skip_done_event_msccl
Allow skipping doneEvent inside MSCCL.
2023-05-18 10:17:20 -05:00
Wenkai Du 403cda6322 Fix merge error (#744) 2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung ca4a1dfd67 Address review feedback and make the flag disabled by default. 2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung 12dba425de Skip doneEvent inside MSCCL by default.
Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default.

The env var controls whether doneEvent is used when invoking MSCCL
kernels.

Skipping doneEvent would cause the firmware to skip L2 cache flush,
resulting in overall performance improvement.
2023-05-17 16:49:42 +00:00
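As a rough illustration of an environment-variable switch that is off by default, like the one the commit above describes, consider the sketch below. `envFlagEnabled` is a hypothetical helper and not RCCL's actual parameter-handling code. Only the variable name `RCCL_MSCCL_ENABLE_DONE_EVENT` comes from the commit message.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper for illustration; RCCL's real parameter code may differ.
// Returns defaultValue when the variable is unset or empty; otherwise treats
// any non-zero integer as "enabled".
inline bool envFlagEnabled(const char* name, bool defaultValue) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') {
    return defaultValue;
  }
  return std::strtol(value, nullptr, 10) != 0;
}
```

A caller would query it as `envFlagEnabled("RCCL_MSCCL_ENABLE_DONE_EVENT", false)` and take the doneEvent path only when the result is true.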
Wenkai Du 4ca7742c61 Revert "Ensure memory copy integrity during transport setup (#731)" (#741)
* Revert "Ensure memory copy integrity during transport setup (#731)"

This reverts commit 36e453c61e.

Add stream synchronization in ncclStrongStreamRelease.

* Use event record and wait
2023-05-16 10:34:47 -07:00