rocm-systems

Author	SHA1	Message	Date
akolliasAMD	9cdac774ea	Wall clock update and npkit trace script Update (#771 ) * changed builtin clock to wall_clock64 * updated npkit_Trace_generator to the new version of npkit	2023-06-07 17:47:10 -06:00
Cory Bloor	b1a65afd58	Fix build on additional architectures (#740 ) * Fix build on additional architectures Instead of directly wrapping a platform-specific operation with a preprocessor check against a gfx macro, it can be more flexible to check a macro that can be overridden by the user. The gfx macro can then just provide the default value for the macro, resulting in the same default behaviour as if the gfx macro was checked directly but with more control at build-time. For example, to build rccl without using buffer_wbinvl1_vol on gfx902, but still use the default on other archs, a user could export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before configuring the build. This flexibility isn't always necessary, but it's nicer to have it and not need it than to need it and not have it. * Define WARP_SIZE using warpSize builtin	2023-06-06 16:45:50 -06:00
Ziyue Yang	7d6e7bcd7d	revert npkit (#748 )	2023-05-24 07:41:05 -07:00
Ziyue Yang	11676267b5	fix min, max and avg (#745 )	2023-05-18 11:02:59 -07:00
Wenkai Du	897745a266	Remove references to NVLS functions	2023-05-05 07:55:20 -07:00
Wenkai Du	53a1f91857	Merge remote-tracking branch 'nccl/master' into develop	2023-04-25 15:38:32 -07:00
Wenkai Du	4b09ffba43	msccl: print stack and memory usage (#723 ) * msccl: print stack and memory usage * Update number of kernels calculation	2023-04-14 14:59:03 -07:00
Ziyue Yang	e3b2342f39	MSCCL: Improve executor and integrate scheduler (#694 ) * MSCCL: improve executor and add scheduler for testing * Use external scheduler * Fix cmake error * Address comments * Fix thread safe issue * Make MSCCL lifecycle APIs thread safe * Make MSCCL internal scheduler aware of topology hint * Revise error message	2023-03-14 14:34:25 -07:00
Sylvain Jeaugey	5d3ab08b69	2.17.1-1 Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize.	2023-03-01 00:39:04 -08:00
Ziyue Yang	f4bf47f325	NPKit: improve clock calibration and fix GPU clock API (#683 ) * Improve clock calibration in NPKit * Improve gfx macro * Fix macro	2023-02-17 12:26:57 -07:00
Wenkai Du	f7a456122c	Remove workaround and use indirect function call (#684 )	2023-02-14 13:59:48 -08:00
Wenkai Du	e1cb45ff22	Merge remote-tracking branch 'nccl/master' into HEAD	2023-02-04 01:44:43 +00:00
Ziyue Yang	adafc0f759	Add MSCCL Support (#658 ) * Add MSCCL support * Add alignment and message size checking * Fix nRanks checking, in-place and out-of-place tests and group call handling * Fix hipGraph unit test * Change MSCCL init warning to INFO * Revise license info	2022-12-12 15:51:04 -08:00
akolliasAMD	eca623df07	decreased warp size for gfx110x (#655 )	2022-12-01 12:19:21 -07:00
Sylvain Jeaugey	28189e2df8	2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments.	2022-11-30 02:31:59 -08:00
Wenkai Du	562dd87036	Move hipify to cmake stage Add minimal ROCm/HIP version requirements for Graph support	2022-11-14 18:10:45 +00:00
Wenkai Du	9a077e6947	Merge remote-tracking branch 'nccl/master' into develop	2022-11-03 21:17:42 +00:00
Sylvain Jeaugey	cb111f764a	2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs.	2022-10-25 00:55:55 -07:00
Wenkai Du	4f0e223db4	Merge remote-tracking branch 'nccl/master' into develop	2022-10-20 15:41:29 +00:00
Wenkai Du	9ddf0e0649	Support P2P with invisible devices (#636 ) * Support P2P with invisible devices * Update copyright year	2022-10-17 10:24:59 -07:00
Ziyue Yang	7d6bbc19d4	apply npkit	2022-10-14 01:28:17 +00:00
Edgar Gabriel	e645b02cd8	introduce a hw topology aware bintree for hayabusa architecture.	2022-10-03 15:26:21 +00:00
Sylvain Jeaugey	da8152e57a	2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities.	2022-09-27 02:31:13 -07:00
Wenkai Du	a06e14e39b	Misc fixes and disable binTree	2022-09-14 00:26:19 +00:00
Edgar Gabriel	be935d7ce7	Merge branch 'develop' into 2.13.4	2022-09-13 17:19:04 -05:00
Edgar Gabriel	65e2ae20e5	add binary tree In addition, introduce the ability to have 2 trees at the same time. Only for allreduce at the moment.	2022-09-13 20:52:32 +00:00
Gilbert Lee	009e79623f	Merge branch 'develop' into 2.13.4	2022-09-09 23:07:04 +00:00
gilbertlee-amd	dd56135a9a	Updating stream caching (#614 ) - Adding non-captured hipStream for use in setup	2022-09-09 16:30:15 -06:00
Wenkai Du	a79d9e3586	Merge remote-tracking branch 'nccl/master' into develop	2022-09-09 16:05:38 +00:00
Wenkai Du	7bbce085cc	Enable LL128 protocol support (#605 ) * Enable LL128 protocol support * Use shared memory object directly when possible	2022-09-08 14:45:27 -07:00
gilbertlee-amd	47b2fc3a30	Adding opt-in hipGraph support for RCCL via RCCL_ENABLE_HIPGRAPH (#608 ) Adding opt-in hipGraph support via RCCL_ENABLE_HIPGRAPH	2022-09-06 10:29:46 -06:00
akolliasAMD	06bce9d0c9	added stream synch after hipMemset (#609 )	2022-08-30 16:18:37 -06:00
Cosmic Fusion	080fc2d9d6	fix error: use of undeclared identifier 'free' include stdlib.h to fix compilation error in rccl : [39/58] Building CXX object CMakeFiles/rccl.dir/src/misc/signals.cc.o FAILED: CMakeFiles/rccl.dir/src/misc/signals.cc.o /opt/rocm/bin/hipcc -DENABLE_COLLTRACE -DHAVE_BFD -DHAVE_CPLUS_DEMANGLE -DUSE_ROCM_SMI64CONFIG -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -Drccl_EXPORTS -I/home/cosmo/build/flgrwqa/build/include -I/home/cosmo/build/flgrwqa/build/include/rccl -I/home/cosmo/build/flgrwqa/rccl/src -I/home/cosmo/build/flgrwqa/rccl/src/include -I/home/cosmo/build/flgrwqa/rccl/src/collectives -I/home/cosmo/build/flgrwqa/rccl/src/collectives/device -I/opt/hsa/include -fPIC -fvisibility=hidden -fgpu-rdc -parallel-jobs=8 -Wno-format-nonliteral -x hip --offload-arch=gfx803 --offload-arch=gfx900:xnack- --offload-arch=gfx906:xnack- --offload-arch=gfx908:xnack- --offload-arch=gfx90a:xnack- --offload-arch=gfx90a:xnack+ --offload-arch=gfx1030 -std=c++14 -MD -MT CMakeFiles/rccl.dir/src/misc/signals.cc.o -MF CMakeFiles/rccl.dir/src/misc/signals.cc.o.d -o CMakeFiles/rccl.dir/src/misc/signals.cc.o -c /home/cosmo/build/flgrwqa/rccl/src/misc/signals.cc In file included from /home/cosmo/build/flgrwqa/rccl/src/misc/signals.cc:8: /home/cosmo/build/flgrwqa/rccl/src/include/BfdBacktrace.hpp:138:9: error: use of undeclared identifier 'free' free(file->syms); ^ /home/cosmo/build/flgrwqa/rccl/src/include/BfdBacktrace.hpp:155:5: error: use of undeclared identifier 'free' free(file->syms); ^	2022-08-19 20:25:06 +03:00
Wenkai Du	14b8ff153f	Repurpose profiling implementation to simple timestamps tracing (#600 )	2022-08-18 15:34:46 -07:00
Sylvain Jeaugey	c4e2aa6c79	2.14.3-1 Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function. Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier.	2022-08-18 02:53:17 -07:00
akolliasAMD	451c287aa6	Removing redundant LOAD and STORE on primitives plus adding some atomics (#585 )	2022-07-21 13:04:57 -06:00
Sylvain Jeaugey	19ab67d172	2.13.4-1 Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy.	2022-07-11 08:10:34 -07:00
Yifan Xiong	80f53cc171	Reduce AlltoAll port usage in send/recv proxy (#577 ) * Reduce AlltoAll port usage when connecting proxy Reuse socket ports when connecting proxies in AlltoAll. Existing port usage in AlltoAll is O(n) for recv and O(n) for send, reusing socket ports in server or client side will make one of them O(1), reusing both will reduce the total port usage to O(1) and enables AlltoAll in >64 MI200 nodes. * Update changelog accordingly Update changelog accordingly.	2022-07-07 16:15:52 -07:00
gilbertlee-amd	a89a9966aa	Adding git hash info to version output line (#572 )	2022-06-28 16:42:51 -06:00
Ziyue Yang	6e93fafdc3	Add Feature - Add NPKit Support in RCCL (#564 ) * apply npkit * fix bug * add npkit in readme	2022-06-20 14:30:19 -07:00
Edgar	0336ffdf70	Introduce multi-rank support per device. This is a single commit of the source code changes required to introduce support for multiple ranks per device. A new interface (ncclCommRankInitMulti) has to be used to make use of this new feature.	2022-06-10 14:23:12 +00:00
Wenkai Du	7a6c6927ae	Enable timing profile option (#558 )	2022-06-03 07:05:13 -07:00
Aristotelis	e0864e7093	Merge remote-tracking branch 'ncclRepo/master' into develop	2022-06-02 15:27:24 +00:00
Wenkai Du	eef812bed7	Revert chunksteps changes (#555 )	2022-05-31 14:45:51 -07:00
akolliasAMD	98f0809a39	Added creation of new tree and added switch for using treesplit for specific cases (#551 )	2022-05-25 18:55:14 -04:00
gilbertlee-amd	700b473211	Moving opt-in custom signal handler from UnitTests into RCCL (#550 ) * Enable via RCCL_ENABLE_SIGNALHANDLER=1	2022-05-20 09:56:38 -06:00
Sylvain Jeaugey	7aa1c46fd5	2.12.12-1 Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process.	2022-05-13 00:26:57 -07:00
Wenkai Du	d28e1cb44f	Merge remote-tracking branch 'nccl/master' into develop	2022-04-18 11:15:25 -07:00
Wenkai Du	15b572751e	Increase chunk steps of broadcast and reduce (#528 )	2022-04-07 13:34:04 -07:00
Wenkai Du	bbe780ca6c	Support multiple tuning tables (#522 ) * Support multiple tuning tables * [UnitTests] Skip managed memory testing	2022-03-31 17:09:21 -07:00


160 commits