커밋 그래프

166 커밋

작성자 SHA1 메시지 날짜
Wenkai Du ce6a2ffac8 Merge pull request #782 from ROCmSoftwarePlatform/2.18.3
Sync up with NCCL 2.18.3
2023-06-29 15:04:16 -07:00
akolliasAMD 9bba4a2f2a added npkit support into the all_gather run ring algorithm (#790) 2023-06-29 13:59:54 -06:00
Wenkai Du abd0615351 Merge remote-tracking branch 'nccl/master' into develop 2023-06-26 22:51:56 +00:00
akolliasAMD 9bdf6797a5 fixed npkit size to never be a negative number (#779) 2023-06-21 08:26:40 -06:00
Sylvain Jeaugey ea38312273 2.18.3-1
Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.
2023-06-14 01:29:17 -07:00
akolliasAMD 9cdac774ea Wall clock update and npkit trace script Update (#771)
* changed builtin clock to wall_clock64
* updated npkit_Trace_generator to the new version of npkit
2023-06-07 17:47:10 -06:00
Cory Bloor b1a65afd58 Fix build on additional architectures (#740)
* Fix build on additional architectures

Instead of directly wrapping a platform-specific operation with a
preprocessor check against a gfx macro, it can be more flexible to
check a macro that can be overriden by the user. The gfx macro can then
just provide the default value for the macro, resulting in the same
default behaviour as if the gfx macro was checked directly but with
more control at build-time.

For example, to build rccl without using buffer_wbinvl1_vol on
gfx902, but still use the default on other archs, a user could
export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before
configuring the build. This flexibility isn't always necessary, but
it's nicer to have it and not need it than to need it and not have it.

* Define WARP_SIZE using warpSize builtin
2023-06-06 16:45:50 -06:00
Ziyue Yang 7d6e7bcd7d revert npkit (#748) 2023-05-24 07:41:05 -07:00
Ziyue Yang 11676267b5 fix min, max and avg (#745) 2023-05-18 11:02:59 -07:00
Wenkai Du 897745a266 Remove references to NVLS functions 2023-05-05 07:55:20 -07:00
Wenkai Du 53a1f91857 Merge remote-tracking branch 'nccl/master' into develop 2023-04-25 15:38:32 -07:00
Sylvain Jeaugey d97a32fac8 2.18.1-1
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
2023-04-18 03:58:25 -07:00
Wenkai Du 4b09ffba43 msccl: print stack and memory usage (#723)
* msccl: print stack and memory usage

* Update number of kernels calculation
2023-04-14 14:59:03 -07:00
Ziyue Yang e3b2342f39 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message
2023-03-14 14:34:25 -07:00
Sylvain Jeaugey 5d3ab08b69 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
2023-03-01 00:39:04 -08:00
Ziyue Yang f4bf47f325 NPKit: improve clock calibration and fix GPU clock API (#683)
* Improve clock calibration in NPKit

* Improve gfx macro

* Fix macro
2023-02-17 12:26:57 -07:00
Wenkai Du f7a456122c Remove workaround and use indirect function call (#684) 2023-02-14 13:59:48 -08:00
Wenkai Du e1cb45ff22 Merge remote-tracking branch 'nccl/master' into HEAD 2023-02-04 01:44:43 +00:00
Ziyue Yang adafc0f759 Add MSCCL Support (#658)
* Add MSCCL support

* Add alignment and message size checking

* Fix nRanks checking, in-place and out-of-place tests and group call handling

* Fix hipGraph unit test

* Change MSCCL init warning to INFO

* Revise license info
2022-12-12 15:51:04 -08:00
akolliasAMD eca623df07 decreased warp size for gfx110x (#655) 2022-12-01 12:19:21 -07:00
Sylvain Jeaugey 28189e2df8 2.16.2-1
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
2022-11-30 02:31:59 -08:00
Wenkai Du 562dd87036 Move hipify to cmake stage
Add minimal ROCm/HIP version requirements for Graph support
2022-11-14 18:10:45 +00:00
Wenkai Du 9a077e6947 Merge remote-tracking branch 'nccl/master' into develop 2022-11-03 21:17:42 +00:00
Sylvain Jeaugey cb111f764a 2.15.5-1
Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.
2022-10-25 00:55:55 -07:00
Wenkai Du 4f0e223db4 Merge remote-tracking branch 'nccl/master' into develop 2022-10-20 15:41:29 +00:00
Wenkai Du 9ddf0e0649 Support P2P with invisible devices (#636)
* Support P2P with invisible devices

* Update copyright year
2022-10-17 10:24:59 -07:00
Ziyue Yang 7d6bbc19d4 apply npkit 2022-10-14 01:28:17 +00:00
Edgar Gabriel e645b02cd8 introduce a hw topology aware bintree
for hayabusa architecture.
2022-10-03 15:26:21 +00:00
Sylvain Jeaugey da8152e57a 2.15.1-1
Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.
2022-09-27 02:31:13 -07:00
Wenkai Du a06e14e39b Misc fixes and disable binTree 2022-09-14 00:26:19 +00:00
Edgar Gabriel be935d7ce7 Merge branch 'develop' into 2.13.4 2022-09-13 17:19:04 -05:00
Edgar Gabriel 65e2ae20e5 add binary tree
In addition, introduce the ability to have 2 trees at the same time.
Only for allreduce at the moment.
2022-09-13 20:52:32 +00:00
Gilbert Lee 009e79623f Merge branch 'develop' into 2.13.4 2022-09-09 23:07:04 +00:00
gilbertlee-amd dd56135a9a Updating stream caching (#614)
- Adding non-captured hipStream for use in setup
2022-09-09 16:30:15 -06:00
Wenkai Du a79d9e3586 Merge remote-tracking branch 'nccl/master' into develop 2022-09-09 16:05:38 +00:00
Wenkai Du 7bbce085cc Enable LL128 protocol support (#605)
* Enable LL128 protocol support

* Use shared memory object directly when possible
2022-09-08 14:45:27 -07:00
gilbertlee-amd 47b2fc3a30 Adding opt-in hipGraph support for RCCL via RCCL_ENABLE_HIPGRAPH (#608)
Adding opt-in hipGraph support via RCCL_ENABLE_HIPGRAPH
2022-09-06 10:29:46 -06:00
akolliasAMD 06bce9d0c9 added stream synch after hipMemset (#609) 2022-08-30 16:18:37 -06:00
Cosmic Fusion 080fc2d9d6 fix error: use of undeclared identifier 'free'
include stdlib.h to fix compilation error in rccl :

[39/58] Building CXX object CMakeFiles/rccl.dir/src/misc/signals.cc.o
FAILED: CMakeFiles/rccl.dir/src/misc/signals.cc.o 
/opt/rocm/bin/hipcc -DENABLE_COLLTRACE -DHAVE_BFD -DHAVE_CPLUS_DEMANGLE -DUSE_ROCM_SMI64CONFIG -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -Drccl_EXPORTS -I/home/cosmo/build/flgrwqa/build/include -I/home/cosmo/build/flgrwqa/build/include/rccl -I/home/cosmo/build/flgrwqa/rccl/src -I/home/cosmo/build/flgrwqa/rccl/src/include -I/home/cosmo/build/flgrwqa/rccl/src/collectives -I/home/cosmo/build/flgrwqa/rccl/src/collectives/device -I/opt/hsa/include -fPIC -fvisibility=hidden -fgpu-rdc -parallel-jobs=8 -Wno-format-nonliteral -x hip --offload-arch=gfx803 --offload-arch=gfx900:xnack- --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx1030 -std=c++14 -MD -MT CMakeFiles/rccl.dir/src/misc/signals.cc.o -MF CMakeFiles/rccl.dir/src/misc/signals.cc.o.d -o CMakeFiles/rccl.dir/src/misc/signals.cc.o -c /home/cosmo/build/flgrwqa/rccl/src/misc/signals.cc
In file included from /home/cosmo/build/flgrwqa/rccl/src/misc/signals.cc:8:
/home/cosmo/build/flgrwqa/rccl/src/include/BfdBacktrace.hpp:138:9: error: use of undeclared identifier 'free'
        free(file->syms);
        ^
/home/cosmo/build/flgrwqa/rccl/src/include/BfdBacktrace.hpp:155:5: error: use of undeclared identifier 'free'
    free(file->syms);
    ^
2022-08-19 20:25:06 +03:00
Wenkai Du 14b8ff153f Repurpose profiling implementation to simple timestamps tracing (#600) 2022-08-18 15:34:46 -07:00
Sylvain Jeaugey c4e2aa6c79 2.14.3-1
Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
2022-08-18 02:53:17 -07:00
akolliasAMD 451c287aa6 Removing redundant LOAD and STORE on primitives plus adding some atomics (#585) 2022-07-21 13:04:57 -06:00
Sylvain Jeaugey 19ab67d172 2.13.4-1
Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
2022-07-11 08:10:34 -07:00
Yifan Xiong 80f53cc171 Reduce AlltoAll port usage in send/recv proxy (#577)
* Reduce AlltoAll port usage when connecting proxy

Reuse socket ports when connecting proxies in AlltoAll.

Existing port usage in AlltoAll is O(n) for recv and O(n) for send,
reusing socket ports in server or client side will make one of them
O(1), reusing both will reduce the total port usage to O(1) and enables
AlltoAll in >64 MI200 nodes.

* Update changelog accordingly

Update changelog accordingly.
2022-07-07 16:15:52 -07:00
gilbertlee-amd a89a9966aa Adding git hash info to version output line (#572) 2022-06-28 16:42:51 -06:00
Ziyue Yang 6e93fafdc3 Add Feature - Add NPKit Support in RCCL (#564)
* apply npkit

* fix bug

* add npkit in readme
2022-06-20 14:30:19 -07:00
Edgar 0336ffdf70 Introduce multi-rank support per device.
This is a single commit of the source code changes required to
introduce support for multiple ranks per device.
A new interface (ncclCommRankInitMulti) has to be used to make use of
this new feature.
2022-06-10 14:23:12 +00:00
Wenkai Du 7a6c6927ae Enable timing profile option (#558) 2022-06-03 07:05:13 -07:00
Aristotelis e0864e7093 Merge remote-tracking branch 'ncclRepo/master' into develop 2022-06-02 15:27:24 +00:00
Wenkai Du eef812bed7 Revert chunksteps changes (#555) 2022-05-31 14:45:51 -07:00