rocm-systems

Author	SHA1	Message	Date
Atul Kulkarni	892d258319	Add missing header in alloc.h (#2086 )	2025-12-04 11:26:19 -06:00
Wenkai Du	185e78a8f0	Use one side stream per process (#2063 ) * Use one side stream per process * Handle multiple GPUs per process * Reset stream when not found * Address review comments * Fix missing mutex initializer	2025-12-02 10:03:15 -08:00
Pedram Alizadeh	fb67e5b467	Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037 ) * Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic * Switching to hip_bf16.h from ROCm 6.0.0	2025-11-13 15:56:18 -05:00
Kapil S. Pawar	acdafac49f	[RcclReplayer] Compile without the need for RCCL to be compiled (#2039 )	2025-11-10 15:38:48 -06:00
Bertan Dogancay	b1e680adc0	[GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030 )	2025-11-07 15:15:25 -05:00
Bertan Dogancay	a9bb7e9807	[Launch] Enable Implicit order launch with serial mode (#2033 )	2025-11-07 13:29:53 -05:00
Ghadeer Ahmed H Alabandi	45991fadad	[NET] Enable capping the number of QPs created for send/recv colls (#1998 )	2025-11-07 00:47:01 +00:00
Arm Patinyasakdikul	1ce83d5cc0	Added ERROR message class to handle fatal error messages. (#2002 ) * Added ERROR message class to handle fatal error messages. New ERROR message class will print the message in all debug level, including none. Change some of the fatal error message to be in ERROR instead of WARN. Added new error handler function to print out more meaningful error message in the future. * Added CHANGELOG entry. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Change to no longer reuse NONE as ERROR. ERROR is now a separated class. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-10-30 16:14:20 -05:00
Arm Patinyasakdikul	84fdcab68a	Added copyrights for Palamida scan 7.2. (#2018 )	2025-10-30 13:33:20 -05:00
isaki001	641c0eb51c	P2p batching hang-fix (#2011 ) * prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes * correct computation for channel to part mapping * update changelog * disabling p2p-batching by default	2025-10-30 13:32:01 -05:00
Mustafa Abduljabbar	12f51ba8bf	[Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978 ) * Add initial commit to increase tb size to 512 * Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X * Adjust nthreads for LL * Opt threads for reduce_scatter upper small range * Add macro for single node * Restrict MSCCL to 256 threads to prevent mem access fault * Support pre-MI350 compatibility * Partially refactor threadblock size override * Use const macros instead of numerals * opt out of unused function	2025-10-29 23:24:32 -05:00
Nilesh M Negi	c35bc721ad	Fix ncclDevFuncId for AllReduceWithBias (#1980 )	2025-10-17 09:28:57 -05:00
gilbertlee-amd	fedddb452c	Enabling gdrcopy option for gfx950 (#1955 )	2025-10-15 10:55:25 -06:00
alex-breslow-amd	c70f5b4621	[gfx950] Make bypassing __threadfence the default for multinode. (#1947 ) * Gate based on ROCM version, safe for ROCm 7.0.2 and beyond. * Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950. Thanks Nilesh. * Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.	2025-10-15 09:15:36 -07:00
isaki001	0f99fd84a3	gfx950 channel tuning for ReduceScatter and AllGather (#1940 ) * add channel thresholds to override channel-count adjustments	2025-10-14 09:50:44 -05:00
Artem Kuzmitckii	00a42c80f3	Reverse logic of context tracking enablement from #1927 (#1971 ) In this commit it disabled by default and can be enabled via `RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA) Original PR https://github.com/ROCm/rccl/pull/1927	2025-10-09 10:24:09 +02:00
BertanDogancay	3f94267f21	Merge remote-tracking branch 'nccl/master' into develop	2025-10-06 18:36:49 -04:00
Nilesh M Negi	342ec086e3	Revert "changes for hugepages backed host buffer for larger allocations (#1841 )" (#1951 ) This reverts commit `65b69bf318`.	2025-10-02 23:43:09 -05:00
Bhuvan Mital	65b69bf318	changes for hugepages backed host buffer for larger allocations (#1841 )	2025-09-28 00:40:22 -05:00
Artem Kuzmitckii	07925ec027	Revert disabling of context tracking for Radeon (#1927 ) * Revert disabling of context tracking for Radeon Original commit `6fc228e2` `Disable context tracking for the current version. (#1839)` * Add env variable for disabling of context tracking for Radeon `export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking * Update docs/how-to/rccl-usage-tips.rst Fix grammar, thanks @amd-jnovotny Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING * Revert changes in includes and rename util function --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-27 15:19:50 -04:00
Mustafa Abduljabbar	0dd2b2f65e	Fix extra token typo (#1943 )	2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar	7a329bbd94	Expose symbols for RCCL algo/proto/channels selection functions (#1923 ) * Unhide symbols for algo/proto functions * Add all_gather direct usage detection	2025-09-25 18:58:30 -04:00
corey-derochie-amd	d86cf78810	Moved new functions to the bottom of the function table to maintain backward compatibility (#1931 ) * Moved new functions to the bottom of the function table to maintain backward compatibility * Added ordering fixes to api_trace.cc	2025-09-23 13:30:27 -06:00
Mustafa Abduljabbar	c1e1f2faeb	Use batched P2P to enhance alltoall small message performance (#1902 ) * Batch P2P operations (2 per CU/channel) and update channel-part mapping - Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs - P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes * Address single node regression and channel per net peer * Add batching threshold * Add enable switch for batching * Update CHANGELOG.md * Add minor comment change * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-22 16:25:10 -04:00
corey-derochie-amd	ed095cad35	Moved latency_profiler license into subdirs and updated NOTICES. (#1918 )	2025-09-18 12:54:39 -06:00
Venkateshwar Reddy Kandula	0cc896910e	due nccl api sync update RCCL_API_TRACE_VERSION_PATCH to 2 (#1916 )	2025-09-18 07:36:50 -06:00
Nilesh M Negi	da06c69cb8	[INIT] Use rocm-smi API instead of CLI for querying FW version (#1920 )	2025-09-17 19:17:19 -05:00
isaki001	9c36439354	add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889 )	2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar	7ccc6f268f	Force enable proto and/or algo after model selection (#1799 ) * Force enable proto or algo * Remove inc nccl_common.h * Move logic and add error checks * Fix topo_expl compatibility * Allow algo/proto overrides * Remove extra function decl * Clarify warning message * Move algo/proto overrides into separate functions * Update CHANGELOG.md	2025-09-03 08:54:13 -04:00
ycui1984	361d596229	[rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 (#1867 ) * [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0 * [rocm_regression] Check firmware version * [rocm_regression] Resolve review comments * [rocm_regression] Move hsa env checking into init once func * [rocm_regression] Prevent hot fix version in firmware * [rocm_regression] Improve unit tests	2025-08-29 11:18:23 -05:00
BertanDogancay	08a7be231b	Merge remote-tracking branch 'nccl/master' into develop	2025-08-28 15:46:28 -05:00
Nusrat Islam	df448862c3	Device allocation tracker (#1878 ) * alloc: add memory allocation tracker * alloc: add tracker for ncclCuMemAlloc() APIs * alloc: add null pointer check during free	2025-08-27 09:30:51 -05:00
Mustafa Abduljabbar	277747c199	[Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861 ) * Support pipelining codegen and template specialization * Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16) * Remove need for FUNC_INDEX_TOTAL * Add pipeline field to device function key construction logic * Avoid unneeded codegen for LL/LL64 kernels * Modify conditions and add pipeline dtypes env * Optimize selection for both gfx942 and gfx950 * Increase pipeline bitfield width * Use __forceinline__ for all device functions * Realign reduceCopy with original form * Add opt-out option to enable perf debugs * Remove force-reduce-pipelining option from README * Update CHANGELOG.md --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-08-26 15:03:54 -04:00
Nusrat Islam	5e7937effb	Add direct allgather algorithm (#1868 ) * add direct allgather algorithm * minor fix * add debug print for memory allocation tracker * add message size threshold for direct allgather * scatter transfers across ranks * update changelog * minor fix * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * enable direct AG when pxn is ON on MI300X or MI350 --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-08-25 07:55:10 -05:00
Mustafa Abduljabbar	c1b3cd8911	Have ncclDevFuncId use 64-Bit keyed map with field packing (#1857 ) - Updated ncclDevFuncId to use a hash-based lookup with std::unordered_map. - Keys are now 64-bit integers, which pack coll, algo, proto, devRedOp, and type fields. - Improved flexibility and maintainability by moving away from row-based indexing. - Added error handling for missing keys in the hash map. - Aligned key generation logic with generate.py and updated generate.py.	2025-08-19 16:41:19 -04:00
isaki001	44121db890	[TUNING] gfx950 16N tuning (#1835 ) * change gfx950 algo/proto selection for multinode allreduce, allgather, reduceScatter * gfx950 tuning: enable tuning for broadcast, allreduce starts LL128 earlier and switches to ring earlier, change LL128 start for allgather and reduceScatter * lower LL128 threshold * update reduceScatter LL128 min to match LL max for consistency * enable multinode PXN and increase chunksize for gfx950 * change LL128 start to 128KB, adjust ring-start according to node-count * disable code-path for fused-AR on LL128 for gfx950 * use LL128 starting from 1KB for multinode allgather on gfx950 * start LL128 earlier for multinode reduceScatter on gfx950 * start LL128 earlier for multinode broadcast on gfx950 * set multinode allreduce to start simple on 64MB for gfx950 * start LL128 from 1KB for multinode broadcast on gfx950 * setting multinode AR to use tree instead of ring at 16MB, 64MB, 128MB * set multinode broadcast to use LL for up to 256KB depending on node-count for gfx950 * adjust algo for 32MB multinode allreduce on gfx950 * make 32MB tree LL128 for multinode AR on gfx950 * make sure ring is not picked on 2N allreduce on small sizes	2025-08-15 15:12:45 -05:00
mberenjk	c61152baa4	Added useAcc as a template parameter to address the performance regression (#1856 ) * Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-08-14 15:58:54 -05:00
Karthikeyan Arumugam	6d41e5ba99	Add cstring header explictly as it is removed from HIP (#1859 )	2025-08-13 15:14:22 -07:00
Avinash	3f8cac388e	Compiler warnings fix 2 (#1801 ) * Changes to device code * Changes to src/misc * Changes to graph * src/include changes * src/transport changes * changes in init, enqueue, proxy * Changes to CMakeLists.txt * Additional changes to device code * Additional changes to net.cc * adding 'compiler warning' tag to ease upstream merge' * typo correction * Addessing comments * Additional changes for new commits	2025-08-05 17:36:23 -05:00
ycui1984	874cd657ef	Add collective latency profiler (#1785 ) * [LatencyProfiler] Initial commit * [LatencyProfiler] Add unit tests * [LatencyProfiler] add more * [LatencyProfiler] Pass unit tests * [LatencyProfiler] Add hooks to integrate with meta internal tools * [LatencyProfiler] Restore install.sh * [LatencyProfiler] Resolved comments 1. add proper license 2. use proper namespace * [LatencyProfiler] Add header	2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar	4ce3df8d3a	Optimize alltoall for 64 GPUs and above for gfx942 (#1828 ) Add pxn and p2p net chunksize mi300x tuning	2025-07-30 15:14:43 -04:00
mberenjk	c84ee3d298	Upcast FP8 to Half (FP16) for Sum Operation (#1775 ) * adding hadd and hadd2 support using builtin functions. --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-07-29 11:33:06 -05:00
Atul Kulkarni	1c3d1b3842	Added new unit tests for src/transport/shm.cc (#1689 )	2025-07-25 05:54:42 -05:00
Wenkai Du	9a4213356d	Support fused all reduce and elementwise operations (#1729 ) * Support fused all reduce and elementwise operations Add additional "acc" parameter to RCCL Replayer logs Add flag which indicates availability of new API * Fix Recorder json parsing * Remove unreachable code * Remove extra acc pointer check * . * Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)" This reverts commit `9d72be7b2f`. * Use noinline to reduce kernels linking time * Don't use noinline for gfx942 and gfx950 to avoid perf regression --------- Co-authored-by: AtlantaPepsi <timhu102@amd.com> Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>	2025-07-23 09:04:17 -07:00
alex-breslow-amd	11fabf1de1	Cheaper threadfence for gfx942 in postPeer [1/N]: enable for single node allreduce (#1766 ) Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable	2025-07-22 07:15:15 -07:00
Kamil Iskra	7c12c627c6	NCCL 2.27.6-1 Improve support for DirectNIC (CX8) * Add support for XDR speed detection. * When DirectNIC is enabled, report only the RDMA interfaces. Extend the P2C (PXN over C2C) support to send/receive operations. Support compilation with GCC 14 (Issues #1743, #1751). Fix the unloading of network plugins that also provide tuner capability. Fix the change of the current device across the calls to ncclCommDestroy() and ncclCommAbort(). A note for users on MNNVL systems: please ensure an adequate stack size for NCCL threads. While the default Linux stack size limit of 8192 KB is known to be sufficient, we've seen crashes if the limit is changed to "unlimited", as it causes the glibc library to unexpectedly decrease the stack size of NCCL's background threads to just 2048 KB. Use "ulimit -s" in bash to print the current limit; if needed, reset it to 8192 KB using "ulimit -s 8192" (one also needs to ensure that the new setting is propagated to other nodes when launching a multi-node NCCL job).	2025-07-11 07:32:13 -07:00
Nilesh M Negi	6b4ad0fd74	[BUILD] Use fmt-header instead of libfmt (#1791 )	2025-07-10 17:19:53 -05:00
Nilesh M Negi	2c099fe29a	[INIT] Fix fallback for unsupported user-specified runtime unroll factor (#1780 ) * [INIT] Fix fallback for unsupported user-specified runtime unroll factor * Add CollTrace guard * Move `commSetUnrollFactor()` to rccl_wrap.cc * Modify comments in the device-code generator script	2025-07-10 10:56:18 -05:00
mberenjk	697bee4ee8	Improving build time by removing the gfx11xx and host code from rccl_float8.h (#1789 ) * removing extra build time by removing the gfx11xx arch from using hip_fp8 --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-07-09 14:03:47 -05:00
Bertan Dogancay	e96c8473a1	[DEVICE] Enable PAT algo for RCCL 1ppn (#1756 ) * Enable PAT algo for RCCL 1ppn	2025-07-04 13:45:18 -04:00

361 Commits