Commit Graph

206 Commits

Author SHA1 Message Date
BertanDogancay 9ff53eeeae Merge remote-tracking branch 'nccl/master' into develop 2024-01-30 14:43:43 -08:00
BertanDogancay 81ddf9de89 Merge remote-tracking branch 'nccl/v2.19' into develop 2024-01-24 15:25:33 -08:00
Bertan Dogancay c4dbf8a914 Fix collective trace when rccl is configured (#1056)
* Fix collective trace when rccl is configured
2024-01-22 09:26:44 -07:00
Wenkai Du 7e25d5bc55 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case
2024-01-21 19:00:50 -08:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Nilesh M Negi 249e9f7f65 Un-escaped character causes error with address sanitizer builds (#992)
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Jenkins <jenkins-compute@amd.com>
2024-01-09 13:28:32 -06:00
Pedram Alizadeh aa5c84c997 Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
Sync to nccl 2.18.6
2024-01-09 13:29:29 -05:00
akolliasAMD f4858e14b2 rearranged how the min and max functions are part of msccl (#1025)
* rearranged how the min and max functions are part of msccl

* added more coverage on in place graph tests
2023-12-21 08:58:33 -07:00
PedramAlizadeh 0d515f9388 resolved conflicts, fixed the localNetCount/0 bug 2023-12-18 08:11:34 +00:00
Ziyue Yang 655742a3a6 Fully disable MSCCL when machine is not matched (#1017)
* Disable MSCCL algorithm meta loading when machine is not matched

* fully disable init

* fix potential segfault
2023-12-13 08:36:21 -08:00
Wenkai Du 12c08fc52a msccl: build same number of kernels as in ROCm 5.7 (#1005)
Removed fullOps kernels from build
2023-12-07 13:36:04 -06:00
Wen-Heng (Jack) Chung 293f0fb752 Use a map to host scratch buffers (#1004)
* Use a map to host scratch buffers

* Address review feedbacks. Deliberately keep mscclSetupScratch function.
2023-12-05 13:15:28 -06:00
Bertan Dogancay 7c0f49a878 IFC mix build (#998) 2023-12-02 18:49:52 -07:00
Wenkai Du 4ba65d1d6a Increase max channels to 64 (#993) 2023-12-01 16:01:11 -08:00
Ziyue Yang 4bb0b4a380 Move MSCCL algorithm loading to initialization to workaround HIP graph conflict (#982)
* MSCCL: pre-specify channels and pre-load algorithms

* add mutex

* fix bug

* clean include

* disable all-gathers temporarily
2023-11-30 09:47:20 -08:00
akolliasAMD 56ce9ef05f recreated pr 914 to work with current develop branch (#979) 2023-11-28 16:33:47 -07:00
Ziyue Yang 7fc891bc8d Fix MSCCL work FIFO allocation with HIP graph enabled (#967) 2023-11-15 16:43:28 -08:00
Sylvain Jeaugey 88d44d777f 2.19.4-1
Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.
2023-11-13 10:36:12 -08:00
Wenkai Du 5a800e00cd msccl: enable basic collective trace (#959)
To avoid increasing number of kernels, colltrace is only enabled with
RCCL_MSCCL_FORCE_FULLOPS=1
2023-11-08 20:14:28 -08:00
Wenkai Du f484ff17b9 msccl: add templated kernel (#945)
* msccl: add templated kernel

* Use defines to improve code readability

* Fix kernel indexing and review feedback
2023-11-02 17:21:53 -07:00
Wenkai Du a497722894 NPkit: misc fixes for MSCCL (#936)
* msccl: add xcc_id to timestamp sync

* NPKit: add timestamp for rrc operator

* NPKit: add timestamp for MSCCL init
2023-10-30 10:00:12 -07:00
Nilesh M Negi 1e5ca6820b Fix gcnArchName bug in topology dump (#937)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-28 12:30:36 -05:00
Ziyue Yang 4c117e5335 Fix MSCCL work FIFO out-of-bound issue (#935) 2023-10-27 11:24:52 -07:00
Nilesh M Negi f22df90e5c remove gcnArch support (#920)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 12:09:15 -05:00
Wen-Heng (Jack) Chung 7ee5c1c28b Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)

Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>

* Only build gfx941

* demo

* fine tune malloc

* Fix merge errors

* Fix merge errors

* Disable parallel build

* Adopt --amdgpu-kernarg-preload-count

* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"

This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.

* Revert CMake changes.

* NPKIT changes.

* Remove some license declarations.

* Address code review feedbacks on msccl_kernel_impl.h

* Update CMakeLists.txt

* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count

* Fix NPKIT trace logic.

---------

Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2023-10-12 20:17:08 -05:00
Sylvain Jeaugey 8c6c595185 2.19.3-1
H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.
2023-09-26 05:57:15 -07:00
Sylvain Jeaugey 3435178b6c Merge remote-tracking branch 'origin/master' into v2.19 2023-09-26 05:55:56 -07:00
Sylvain Jeaugey f9c3dc251e 2.19.1-1
Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.
2023-09-26 05:50:33 -07:00
Kaiming Ouyang 4365458757 Fix cudaMemcpyAsync bug
We are trying to use the copy result of first cudaMemcpyAsync in the
second cudaMemcpyAsync without sync in between. This patch fixes it
by allocating a CPU side array to cache device side addr so that we
can avoid this consecutive cuda mem copy.

Fixes #957
2023-09-20 05:51:14 -07:00
akolliasAMD b85d73c02e changed the form that RCCL_TREE uses (#888)
* changed the form that RCCL_TREE uses
2023-09-15 15:01:33 -06:00
Wenkai Du 26e982d913 Reduce NPKit latency overhead in MSCCL kernel (#893)
* Reduce NPKit latency overhead in MSCCL kernel

* Fix build error without NPKit enable
2023-09-15 13:28:26 -07:00
Audrey MP e58ec78d35 Gcn arch name (#886)
We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
Andy li e1dc4d5e42 enable hip graph on multi-node (#884)
* initial checkin

* enable msccl when hip graph is on

* remove the commented out code of msccl enable check

* clean up the code

* remove the msccl HighestTransportType check logic
2023-09-11 15:30:04 -07:00
Nusrat Islam a283f55f12 msccl: add NPKIT profiling for MSCCL send-recv 2023-09-08 13:11:16 -05:00
David Pagan 2ec2648247 Fix static_assert string literal that contains a "\%". This is no longer (#860)
valid. They can only be simple escape sequences. Removing '\' fixes
issue. Assert message now compiles and emits the '%' as expected.
2023-08-21 16:19:59 -07:00
Ziyue Yang d33a70e620 NPKit update (#844)
* NPKit update

1. Enable NPKit for MSCCL kernels
2. Fix NPKit context index calculation for sendrecv kernels

* Update build script for npkit
2023-08-08 17:30:40 -07:00
Wenkai Du d65c0830c6 Detect HIP_UNCACHED_MEMORY support from HIP version (#842) 2023-08-04 10:17:04 -07:00
Wenkai Du c8085eb704 Improve collective trace (#835) 2023-08-03 07:16:12 -07:00
Wenkai Du a7fcd58a97 Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)
2023-07-21 07:31:27 -07:00
Wenkai Du 0f14e5a640 npkit: separate network timing between send and test (#798) 2023-07-10 09:31:49 -07:00
Wenkai Du ce6a2ffac8 Merge pull request #782 from ROCmSoftwarePlatform/2.18.3
Sync up with NCCL 2.18.3
2023-06-29 15:04:16 -07:00
akolliasAMD 9bba4a2f2a added npkit support into the all_gather run ring algorithm (#790) 2023-06-29 13:59:54 -06:00
Wenkai Du abd0615351 Merge remote-tracking branch 'nccl/master' into develop 2023-06-26 22:51:56 +00:00
akolliasAMD 9bdf6797a5 fixed npkit size to never be a negative number (#779) 2023-06-21 08:26:40 -06:00
Sylvain Jeaugey ea38312273 2.18.3-1
Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.
2023-06-14 01:29:17 -07:00
akolliasAMD 9cdac774ea Wall clock update and npkit trace script Update (#771)
* changed builtin clock to wall_clock64
* updated npkit_Trace_generator to the new version of npkit
2023-06-07 17:47:10 -06:00
Cory Bloor b1a65afd58 Fix build on additional architectures (#740)
* Fix build on additional architectures

Instead of directly wrapping a platform-specific operation with a
preprocessor check against a gfx macro, it can be more flexible to
check a macro that can be overridden by the user. The gfx macro can then
just provide the default value for the macro, resulting in the same
default behaviour as if the gfx macro was checked directly but with
more control at build-time.

For example, to build rccl without using buffer_wbinvl1_vol on
gfx902, but still use the default on other archs, a user could
export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before
configuring the build. This flexibility isn't always necessary, but
it's nicer to have it and not need it than to need it and not have it.

* Define WARP_SIZE using warpSize builtin
2023-06-06 16:45:50 -06:00
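The overridable-macro pattern described in commit b1a65afd58 can be sketched roughly as follows. This is a minimal illustration, not RCCL's actual code: DEMO_ARCH_GFX_A and DEMO_USE_FEATURE are hypothetical names standing in for a compiler-defined gfx macro and a user-facing switch.

```cpp
// Hypothetical names: DEMO_ARCH_GFX_A stands in for a compiler-defined gfx
// macro and DEMO_USE_FEATURE for a user-overridable switch. Neither is taken
// from RCCL.
#ifndef DEMO_USE_FEATURE
#  ifdef DEMO_ARCH_GFX_A
#    define DEMO_USE_FEATURE 1   // default: on for this architecture
#  else
#    define DEMO_USE_FEATURE 0   // default: off everywhere else
#  endif
#endif

// Code checks only the overridable macro, never the architecture macro directly.
constexpr bool demo_feature_enabled() { return DEMO_USE_FEATURE != 0; }

constexpr int demo_flush_cost() {
#if DEMO_USE_FEATURE
  return 1;  // placeholder for the platform-specific operation
#else
  return 0;  // operation skipped
#endif
}
```

In this sketch, passing -DDEMO_USE_FEATURE=1 (or =0) on the compiler command line takes precedence over the architecture-based default, because the #ifndef guard only supplies a value when none was given.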
Ziyue Yang 7d6e7bcd7d revert npkit (#748) 2023-05-24 07:41:05 -07:00
Ziyue Yang 11676267b5 fix min, max and avg (#745) 2023-05-18 11:02:59 -07:00
Wenkai Du 897745a266 Remove references to NVLS functions 2023-05-05 07:55:20 -07:00