rocm-systems

Author	SHA1	Message	Date
alex-breslow-amd	2f6b20c00a	Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681) for Single Node on Some GFX9 Systems. Using a single slice rather than the typical two provides about a 5% speedup (sometimes more, sometimes less) on some GFX9 systems for single node.	2025-05-29 16:17:35 -07:00
Nilesh M Negi	12517a957e	Re-apply unroll=1 and 112 channels for gfx950 (#1706 ) * Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667) This reverts commit `329e13efff`. * Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620) This reverts commit `b17338d164`.	2025-05-28 14:58:10 -05:00
PedramAlizadeh	7f878baef0	Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641 )" This reverts commit `00c1eb098c`.	2025-05-21 20:21:27 -05:00
corey-derochie-amd	170acf3bda	Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues. (#1546 ) * Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …" This reverts commit `824b81c034`. * [UT] Modify max stack size to 496 * adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION * addressing the ci failure * Adding the device tag --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-05-14 15:33:03 -05:00
Mustafa Abduljabbar	00c1eb098c	[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641 ) * Update LL128 elems per thread * Precompute ix[g] in LL128 prim * Make Threadthreshold part of tuning models * Ignore channel tuning when channels are env controlled * Tune LL128 max limit for AG * Tune LL128 max limit for RS * Retune AR LL128 limits due to changes * Update CHANGELOG.md --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-05-14 14:35:54 -05:00
Dingming Wu	51f87fbb43	Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683) * Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv(). For ROCm older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol. This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime. * check hip runtime ver via hipRuntimeGetVersion * move the detection to ncclinit func * correct rocm version integer * update warning message * avoid unnecessary info msg on hsa_no_scratch_reclaim detection	2025-05-14 10:12:45 -05:00
mberenjk	1cefcee51f	moving the thread_fence to apply before atomic fetch (#1672 ) * applying thread_fence only on warp 0 before atomic fetch --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-05-14 10:10:05 -05:00
gilbertlee-amd	9ef45df8f7	Fix when more than 64 channels are used for multi-collective group calls (#1688 ) * Fix when more than 64 channels are used for multi-collective group calls * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-05-12 18:05:57 -05:00
Avinash	5f6805b4f4	RCCL Multinode DMA Buffer crash fix (#1682) This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF. * Initial test commit * Handling Dmabuf_fd opening and closing * Cleanup * Use DMABuff or Peermem as needed * Using user input for ibDmaBufSupportInitOnce * Revert all changes to rocmwrap.cc * Revert all changes to rocmwrap.cc * Changing to func definition braces * Reverting line removal in utils.h * useDmaBuf to calculate flushEnabled	2025-05-08 19:17:39 -05:00
Avinash	c54a0c085a	collective trace improvements for debugging (#1661 )	2025-05-07 13:37:31 -05:00
Bertan Dogancay	590ad6acc2	Merge pull request #1662 from BertanDogancay/2.25 [SYNC] 2.25.1-1	2025-05-06 09:39:09 -04:00
Mustafa Abduljabbar	f3f3336468	Fix topo explorer's compatibility with NCCL 2.24 (#1671 ) * Fix build issues * Fix failure to find path remote rank	2025-05-05 15:26:29 -04:00
Bertan Dogancay	acfac55516	[Graph] Try using P2P by default (#1670 )	2025-05-02 11:54:30 -04:00
Nilesh M Negi	329e13efff	Revert "[SRC] Enable unroll=1 for gfx950 (#1602 )" (#1667 ) * Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" This reverts commit `307bc10781`. * Update Changelog --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-04-30 23:33:08 -05:00
BertanDogancay	cb6e23ae67	Merge remote-tracking branch 'nccl/master' into develop	2025-04-30 13:31:41 -05:00
Tim	dc0c5f9153	minor fix for empty scope (group) (#1666 )	2025-04-30 13:29:13 -04:00
BertanDogancay	a6bf9bfc9e	Merge remote-tracking branch 'nccl/master' into develop	2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar	82afb2bcfe	Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628 ) * Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled * Algo/protocol/max channels can be obtained with the new RCCL API * Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc * Add usage example in topo-explorer tool	2025-04-23 15:44:56 -04:00
Bertan Dogancay	ac8ec4c08c	Fix NPKit for SendRecv (#1651 )	2025-04-21 12:34:47 -04:00
Tim	9a55ff60a9	RCCL Replayer update (#1603 ) RCCL recorder w/ suggested change and UT	2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar	52bfdf05dc	Address nested designator compiler warning issue (#1633 )	2025-04-18 17:09:50 -04:00
Nusrat Islam	f20c33effd	Fix MSCCLPP accuracy issue for allreduce7 (#1634 ) * ext-src: fix a graph-mode bug in allreduce7 * change MSCCLPP threshold to 16MB * ext-src: change message size threshold for allreduce7 * ext-src: address review comments	2025-04-18 08:54:32 -05:00
AbandiGa	7a84c5dbb0	added copyright (#1635 )	2025-04-14 09:46:18 -05:00
Dingming Wu	1786c0268b	Adding #include <dlfcn.h> in profiler.cc to pass build (#1632) Without the header, the build fails with the following errors: ``` error: use of undeclared identifier 'RTLD_NOW' error: use of undeclared identifier 'RTLD_LOCAL' error: use of undeclared identifier 'dlerror' error: use of undeclared identifier 'dlsym' error: use of undeclared identifier 'dlclose' ```	2025-04-10 08:48:18 -07:00
Arm Patinyasakdikul	52654e2301	Add device synchronization before destroying proxy thread. (#1631) This commit ensures that the GPU finishes all kernels before the communicator thread is destroyed.	2025-04-10 10:44:16 -05:00
Pedram Alizadeh	e40ff4f84a	all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627 ) * Enabling LL128 by default on MI300 * Add missing CUDACHECK * Adjust BW correction factors to fix the Tree->Ring switching point * Refactor and add ll128 AR logarithmic factor to tuning models * Move RCCL tuning changes to a separate file * Use enum for tunable indexing * Use explicit indexing in tuning models to avoid mismatch issues * Place rcclGetSizePerRank in a function * Remove HIP ifdef for rccl-only call --------- Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>	2025-04-10 11:43:54 -04:00
Mustafa Abduljabbar	4be06f04d8	Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618 ) * Enable LL/LL128 cutoff points in tuning models * Initializing ll/ll128 model cutoffs for MI300 * Use RCCL_LL_LIMITS_UNDEFINED --------- Co-authored-by: PedramAlizadeh <pmohamma@amd.com>	2025-04-02 16:26:23 -04:00
Nilesh M Negi	b17338d164	Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596 )" (#1620 ) * Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" This reverts commit `1df73e209e`. * [DOC] Update Changelog * [DOC] Update CHANGELOG	2025-03-28 17:57:06 -05:00
Bertan Dogancay	532f54c244	Merge pull request #1559 from BertanDogancay/2.23 [SYNC] 2.23.4-1	2025-03-28 17:06:56 -04:00
Nilesh M Negi	307bc10781	[SRC] Enable unroll=1 for gfx950 (#1602 ) * [SRC] Enable unroll=1 for gfx950 * Fix typo from rebase in generate.py * Support for unroll=1 and gfx90a when building for all GPU targets --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-03-27 18:21:35 -05:00
BertanDogancay	0b2062c560	Merge remote-tracking branch 'nccl/master' into develop	2025-03-27 12:53:04 -05:00
isaki001	9dc23d9265	Disable mscclpp (#1614 ) * disable mscclpp by default	2025-03-25 15:21:16 -05:00
gilbertlee-amd	626dc50ab5	Removing the experimental clique kernel files (#1610 )	2025-03-20 18:10:01 -06:00
Wenkai Du	90ad586d94	Add fault injection of starting warps with random variations (#1593) * Add fault injection of starting warps with random variations. This is done by inserting random delays after __syncthreads(). The feature can be turned off with FAULT_INJECTION=OFF in cmake. * Remove manually introduced bug for demo purpose * Use only one thread per warp for checking wall clock	2025-03-20 16:11:43 -07:00
corey-derochie-amd	6505639cf4	removed gfx940 and gfx941 (#1606 ) * removed gfx940 and gfx941 * removed gfx940 and gfx941 * Update "gfx94" to "gfx942" in init.cc * Updated remaining "gfx94" updates to "gfx942" * Update filenames and variables from gfx940 to gfx942 --------- Co-authored-by: akolliasAMD <akollias@amd.com>	2025-03-20 09:34:53 -06:00
Wenkai Du	bd0092e8f1	GDRCOPY support: Off by default (#1605 )	2025-03-18 08:17:01 -07:00
Avinash	ccb0820743	Memory leak fix when numIBDevices = 0 (#1429 ) * Initial commit for testing * Fix memory leak in checkOptions * Fix memory leak in checkOption * x * Delete cmake-3.28.2-linux-x86_64.sh * gcn changes * gcn memleak fixes * gcn leak fix * memory leak fixes for parseRome4P2H and ncclTopoAddGPU * Keeping only necessary file for fixes Deleting temporary scripts I created for debugging and testing * changing to GCN_ARCH_NAME_LEN * Added sanity check directory * refactoring scripts * Updated to sanity checks folder * Initial fixes * changes in tools * pointing RCCL lib build to debug version * Removed second pthread_detach * Removing sanity checks * Keeping only code changes * addressing memory leaks in ncclIbinit --------- Co-authored-by: Chao Chen <cchen104@amd.com>	2025-03-17 11:21:19 -05:00
Mustafa Abduljabbar	f67b2cc908	Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 (#1604 ) * Add reduce_scatter LL and LL128 thresholds * Always honor user choice for protocol	2025-03-17 11:21:01 -04:00
Wenkai Du	245c2de909	Enable LL128 on gfx942 (#1549 )	2025-03-16 15:10:05 -07:00
Nilesh M Negi	1df73e209e	[GRAPH] Increase default nChannels to 112 for gfx950 (#1596 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2025-03-14 14:47:03 -07:00
Wenkai Du	4237caad69	Limit P2P channels per peer to not exceed max channels (#1594) * Limit P2P channels per peer to not exceed max channels * [UT] test single GPU cases for all collectives * [UT] fix out of range root value	2025-03-11 09:32:09 -07:00
Tim	f6c6d451a9	mscclpp compatibility check for ubr (#1573 )	2025-03-09 22:10:47 -04:00
Nusrat Islam	ac823818aa	misc/msccl: force use of mscclpp (#1581 )	2025-03-04 12:48:59 -06:00
Bertan Dogancay	d88cca3098	[Transport] Fix IntraNet (#1582 )	2025-03-04 13:30:36 -05:00
Wenkai Du	f957c4fe22	NPKit: enable reduce scatter profiling (#1580 )	2025-03-04 10:03:56 -08:00
Nusrat Islam	23c0b7bd84	misc/msccl: Read graph capture status for every collective call (#1576 ) * misc/msccl: read graphCaptureStatus for every collective call * fix a bug in checking whether UBR is enabled in MSCCLPP * cmake: Fix patch reversal order * misc/msccl: add logging	2025-02-28 17:16:07 -06:00
Wenkai Du	60c1264d27	Improve RDMA flushing by writing dummy payload with RO=0 (#1570) * Improve RDMA flushing by writing dummy payload with RO=0 * Rename env var for disabling this change to RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING	2025-02-27 16:20:32 -08:00
Nilesh M Negi	02b8e504f0	[BUILD] Fix generate.py for gfx950 (#1575 ) * [BUILD] Fix generate.py for gfx950 Signed-off-by: nilenegi <Nilesh.Negi@amd.com> * [BUILD] Cleaner way to check for gfx targets Signed-off-by: nilenegi <Nilesh.Negi@amd.com> --------- Signed-off-by: nilenegi <Nilesh.Negi@amd.com>	2025-02-26 18:43:37 -06:00
Bertan Dogancay	85eb1f16bc	Use bit reversal based mapping for multi-node (#1572 )	2025-02-26 09:48:03 -05:00
Pedram Alizadeh	f268553ee4	enable building rccl for gfx950 (#1571 )	2025-02-25 16:13:48 -05:00

887 commits