rocm-systems

Author	SHA1	Message	Date
gilbertlee-amd	afa4a5ecf8	Updating Doxygen documentation (#831)	2023-07-28 16:09:06 -06:00
Wenkai Du	3db371c9a5	Revert "Enable LL128 on gfx90a (#823)" (#829) This reverts commit `420f8af6a0`. Also increase number of parallel jobs for linking	2023-07-27 20:25:18 -07:00
Ziyue Yang	f7dc7b7e6a	Fix MSCCL proxy number of chunks calculation (#821) The number of transmissions parsed from the MSCCL algorithm is a 1-based value, but it was treated as a 0-based value when calculating the proxy number of chunks. This commit fixes the issue.	2023-07-26 13:24:49 -07:00
Wenkai Du	420f8af6a0	Enable LL128 on gfx90a (#823)	2023-07-26 11:44:15 -07:00
Wenkai Du	4d20b4b758	Replace atomicExch with __atomic_store_n (#818) * Replace atomicExch with __atomic_store_n * Remove extra semicolon	2023-07-25 11:15:21 -07:00
Nusrat Islam	47f754e6f5	Merge pull request #810 from nusislam/tune-send-recv device: fine-tune RCCL send-recv on MI250/MI200	2023-07-25 10:18:12 -05:00
Bertan Dogancay	8bab4f04b7	Implement RCCL Replayer (#817) * Implement RCCL Replayer	2023-07-24 16:26:22 -06:00
Wenkai Du	a7fcd58a97	Enable gfx94x (#808) (#816) (cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)	2023-07-21 07:31:27 -07:00
Nusrat Islam	b2a0a0bd3a	device: fine-tune RCCL send-recv	2023-07-17 21:45:52 -05:00
Ziyue Yang	848e60b00c	Fix path finding in msccl internal scheduler (#794)	2023-07-12 13:31:28 -07:00
Wenkai Du	0f14e5a640	npkit: separate network timing between send and test (#798)	2023-07-10 09:31:49 -07:00
Nusrat Islam	58e53dfd37	device: fine tune MI200/MI250 simple protocol performance With the Simple protocol, an unroll factor of 4 offers better performance for most collectives (on MI200, MI250, and MI300), except large-message allreduce with the Ring algorithm on MI250 and MI200. This PR changes the default unroll factor to 4 while adding fine tuning for reduction operations.	2023-07-08 20:21:18 -05:00
Wenkai Du	e0c70af46b	Merge remote-tracking branch 'nccl/master' into nccl_sync	2023-07-05 07:53:53 -07:00
Wenkai Du	ce6a2ffac8	Merge pull request #782 from ROCmSoftwarePlatform/2.18.3 Sync up with NCCL 2.18.3	2023-06-29 15:04:16 -07:00
akolliasAMD	9bba4a2f2a	added npkit support into the all_gather run ring algorithm (#790)	2023-06-29 13:59:54 -06:00
Dmitrii Gabor	6e24ef4e1f	Prevent WR index truncation in the InfiniBand transport plugin	2023-06-28 11:39:19 +02:00
Wenkai Du	abd0615351	Merge remote-tracking branch 'nccl/master' into develop	2023-06-26 22:51:56 +00:00
arvindcheru	bd14ac8b59	ASAN build excluding additional files, Algodir support for share folder * ASAN build excluding additional files, Algodir support for share folder (#786) * Algodir support for share folder	2023-06-23 10:57:20 -04:00
Bertan Dogancay	0c77c66221	Disable Colltrace for --fast option (#778) * Disable Colltrace for --fast option * Limit nprocs for CI	2023-06-21 14:16:09 -06:00
akolliasAMD	9bdf6797a5	fixed npkit size to never be a negative number (#779)	2023-06-21 08:26:40 -06:00
gilbertlee-amd	52a28ff2fc	Switching to using atomicAdd_system within kernel for collective trace (#780)	2023-06-20 17:49:52 -06:00
Nusrat Islam	3a741787bf	device: use unroll factor based on platforms	2023-06-14 13:36:15 -05:00
Bertan Dogancay	f35777e9b0	improve compilation time and create timetrace plot (#773) * improve compilation time and create time-trace plot * set default value for nproc	2023-06-14 09:17:51 -06:00
Sylvain Jeaugey	ea38312273	2.18.3-1 Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC. Fix hang with Collnet on bfloat16 on systems with less than one NIC per GPU. Fix long initialization time. Fix data corruption with Collnet when mixing multi-process and multi-GPU per process. Fix crash when shared memory creation fails. Fix Avg operation with Collnet/Chain. Fix performance of alltoall at scale with more than one NIC per GPU. Fix performance for DGX H800. Fix race condition in connection progress causing a crash. Fix network flush with Collnet. Fix performance of aggregated allGather/reduceScatter operations. Fix PXN operation when CUDA_VISIBLE_DEVICES is set. Fix NVTX3 compilation issues on Debian 10.	2023-06-14 01:29:17 -07:00
akolliasAMD	9cdac774ea	Wall clock update and npkit trace script Update (#771) * changed builtin clock to wall_clock64 * updated npkit_Trace_generator to the new version of npkit	2023-06-07 17:47:10 -06:00
Cory Bloor	b1a65afd58	Fix build on additional architectures (#740) * Fix build on additional architectures Instead of directly wrapping a platform-specific operation with a preprocessor check against a gfx macro, it can be more flexible to check a macro that can be overridden by the user. The gfx macro can then just provide the default value for the macro, resulting in the same default behaviour as if the gfx macro was checked directly but with more control at build-time. For example, to build rccl without using buffer_wbinvl1_vol on gfx902, but still use the default on other archs, a user could export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before configuring the build. This flexibility isn't always necessary, but it's nicer to have it and not need it than to need it and not have it. * Define WARP_SIZE using warpSize builtin	2023-06-06 16:45:50 -06:00
Wenkai Du	3af90902c8	Add NCCL_NCHANNELS_PER_PEER override (#767) Also fix topol_expl build issue	2023-06-06 08:41:38 -07:00
Bertan Dogancay	d52b6c0d24	add DMA_BUF support (#763) * add DMA_BUF support * remove unused libraries in src/init.cc * change NCCL_ALL to NCCL_INIT * remove extra pointer functions in transport/net.cc	2023-06-01 12:46:42 -06:00
Wenkai Du	5a38ff192b	Rework barrier and event code (#761) * Rework barrier and event code * Switch to inline asm	2023-05-31 13:36:51 -07:00
Nusrat Islam	4d1cfb17c8	device: change unroll factor The default value of unroll factor is 2. Changing the unroll factor to 4 provides better performance for most of the collectives.	2023-05-25 15:42:35 -05:00
Ziyue Yang	7d6e7bcd7d	revert npkit (#748)	2023-05-24 07:41:05 -07:00
Ziyue Yang	ed252c30f4	Limit MSCCL reduce unrolling to pow-2 cases to shrink kernel size (#746)	2023-05-19 11:46:36 -07:00
Ziyue Yang	11676267b5	fix min, max and avg (#745)	2023-05-18 11:02:59 -07:00
Wen-Heng (Jack) Chung	eba4e9e100	Merge pull request #742 from whchung/skip_done_event_msccl Allow skipping doneEvent inside MSCCL.	2023-05-18 10:17:20 -05:00
Wenkai Du	403cda6322	Fix merge error (#744)	2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung	ca4a1dfd67	Address review feedback and make the flag disabled by default.	2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung	12dba425de	Skip doneEvent inside MSCCL by default. Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default. The env var controls whether to use doneEvent when invoking MSCCL kernels. Skipping doneEvent would cause the firmware to skip L2 cache flush, resulting in overall performance improvement.	2023-05-17 16:49:42 +00:00
Wenkai Du	4ca7742c61	Revert "Ensure memory copy integrity during transport setup (#731)" (#741) * Revert "Ensure memory copy integrity during transport setup (#731)" This reverts commit `36e453c61e`. Add stream synchronization in ncclStrongStreamRelease. * Use event record and wait	2023-05-16 10:34:47 -07:00
Wenkai Du	8bb3340fcb	Skip checking of some settings in Cray OS (#739)	2023-05-09 07:59:56 -07:00
Wenkai Du	897745a266	Remove references to NVLS functions	2023-05-05 07:55:20 -07:00
Wenkai Du	53a1f91857	Merge remote-tracking branch 'nccl/master' into develop	2023-04-25 15:38:32 -07:00
Wenkai Du	36e453c61e	Ensure memory copy integrity during transport setup (#731)	2023-04-25 14:41:43 -07:00
Sylvain Jeaugey	d97a32fac8	2.18.1-1 Add support for IB SHARP to NVLS (NVLink SHARP algorithm). Add NVLS+Tree algorithm. Add support for memory management using cuMem* functions. Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804). Add ncclCommSplit primitive, with resource sharing option in config. Fix alltoallv hang (#788) Increase number of channels on H100 when we're not limited by NVLink. Improve error reporting in case of IB failure, printing local and remote ID (#779). Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802). Fix context creation for progress thread (#803). NET/IB: add option to use multiple QPs in round-robin mode. Fix tree performance issue when NVB is disabled on HCM topologies.	2023-04-18 03:58:25 -07:00
Wenkai Du	4b09ffba43	msccl: print stack and memory usage (#723) * msccl: print stack and memory usage * Update number of kernels calculation	2023-04-14 14:59:03 -07:00
Kaiming Ouyang	006b6bc7dc	Add a comment to shutdown() in ncclSocketClose	2023-04-13 09:13:44 -07:00
Kaiming Ouyang	367e9b61c3	Shutdown socket before close in ncclSocketClose()	2023-04-13 09:11:52 -07:00
Ziyue Yang	7289c05146	MSCCL: Fix memcpy bug (#721)	2023-04-11 14:46:53 -07:00
Ziyue Yang	c8e33b1232	fix msccl stream usage (#717)	2023-03-24 10:59:36 -07:00
Wenkai Du	b02fd04165	Fix unit test HIP graph error (#712)	2023-03-20 15:34:09 -07:00
Ziyue Yang	e3b2342f39	MSCCL: Improve executor and integrate scheduler (#694) * MSCCL: improve executor and add scheduler for testing * Use external scheduler * Fix cmake error * Address comments * Fix thread safe issue * Make MSCCL lifecycle APIs thread safe * Make MSCCL internal scheduler aware of topology hint * Revise error message	2023-03-14 14:34:25 -07:00
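
Commit b1a65afd58 above describes checking a user-overridable macro whose default comes from a gfx macro, rather than checking the gfx macro directly. A minimal sketch of that pattern follows. Only the macro name RCCL_USE_WBINVL1_VOL is taken from the commit message; the architecture test (`__gfx1030__`) and the helper function are assumptions made for illustration and are not taken from the RCCL sources.

```cpp
// A value passed on the command line (-DRCCL_USE_WBINVL1_VOL=...) wins.
// The default is derived from the architecture only when nothing was passed.
#ifndef RCCL_USE_WBINVL1_VOL
#  if defined(__gfx1030__)  // assumption: an example arch that opts out by default
#    define RCCL_USE_WBINVL1_VOL 0
#  else
#    define RCCL_USE_WBINVL1_VOL 1
#  endif
#endif

// Hypothetical helper: reports which path was selected at compile time.
// Real code would guard the platform-specific instruction with the same macro.
constexpr bool usesWbinvl1Vol() {
#if RCCL_USE_WBINVL1_VOL
  return true;
#else
  return false;
#endif
}
```

Compiled with no extra flags on a host compiler, the default branch applies and the helper returns true. Adding `-DRCCL_USE_WBINVL1_VOL=0` to the compiler flags flips the result without any source change, which is the build-time control the commit message argues for.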


533 Commits