rocm-systems

Author	SHA1	Message	Date
Arm Patinyasakdikul	ff75860d73	Fix unroll factor display bug. (#1969 )	2025-10-10 15:35:06 -05:00
Surya Periaswamy	5bd5079de1	MSCCL++ fix split path null deref (#1959 ) * Add speriaswamy-amd to CODEOWNERS * MSCCL++: fix split path null deref; key maps by parent ncclUniqueId * removed no-op	2025-10-09 14:08:38 -05:00
Rahul Vaidya	6b200ee6c5	Fix LL128 proto selection to respect user setting (#1822 )	2025-10-09 14:08:03 -05:00
Nusrat Islam	d22a39e954	Update direct AG and single node LL threshold (#1944 ) * update AG direct and single node LL threshold * update thresholds based on MI350 experimental results * disable using LL for direct AG * enable direct AG for lower GPU counts * direct AG single node tuning * fix in-place buffer allocation for AG unit test * whitespace fix * gate direct AG for gfx950 and gfx942 --------- Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>	2025-10-09 10:48:50 -05:00
Artem Kuzmitckii	00a42c80f3	Reverse logic of context tracking enablement from #1927 (#1971 ) In this commit it is disabled by default and can be enabled via `RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA) Original PR https://github.com/ROCm/rccl/pull/1927	2025-10-09 10:24:09 +02:00
BertanDogancay	3f94267f21	Merge remote-tracking branch 'nccl/master' into develop	2025-10-06 18:36:49 -04:00
Nilesh M Negi	342ec086e3	Revert "changes for hugepages backed host buffer for larger allocations (#1841 )" (#1951 ) This reverts commit `65b69bf318`.	2025-10-02 23:43:09 -05:00
amd-jiali	5978d2f9ab	Print out the hipRuntimeVersion message from WARN to always show up (#1911 ) Authored-by: Jiali Li <jialili@amd.com>	2025-10-02 11:32:32 -05:00
Bhuvan Mital	65b69bf318	changes for hugepages backed host buffer for larger allocations (#1841 )	2025-09-28 00:40:22 -05:00
Artem Kuzmitckii	07925ec027	Revert disabling of context tracking for Radeon (#1927 ) * Revert disabling of context tracking for Radeon Original commit `6fc228e2` `Disable context tracking for the current version. (#1839)` * Add env variable for disabling of context tracking for Radeon `export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking * Update docs/how-to/rccl-usage-tips.rst Fix grammar, thanks @amd-jnovotny Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING * Revert changes in includes and rename util function --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-27 15:19:50 -04:00
alex-breslow-amd	45166f6586	Gate code by rocm_version (#1945 )	2025-09-26 13:28:41 -07:00
Mustafa Abduljabbar	0dd2b2f65e	Fix extra token typo (#1943 )	2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar	7a329bbd94	Expose symbols for RCCL algo/proto/channels selection functions (#1923 ) * Unhide symbols for algo/proto functions * Add all_gather direct usage detection	2025-09-25 18:58:30 -04:00
Larry Meadows	cb14fccdcc	- LL Protocol: Add missing fences for gfx950, this fixes the hang issue (#1932 ) - Remove asm flat_store_dwordx4, not needed	2025-09-25 14:07:07 -07:00
corey-derochie-amd	d86cf78810	Moved new functions to the bottom of the function table to maintain backward compatibility (#1931 ) * Moved new functions to the bottom of the function table to maintain backward compatibility * Added ordering fixes to api_trace.cc	2025-09-23 13:30:27 -06:00
Mustafa Abduljabbar	c1e1f2faeb	Use batched P2P to enhance alltoall small message performance (#1902 ) * Batch P2P operations (2 per CU/channel) and update channel-part mapping - Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs - P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes * Address single node regression and channel per net peer * Add batching threshold * Add enable switch for batching * Update CHANGELOG.md * Add minor comment change * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-22 16:25:10 -04:00
corey-derochie-amd	9b04b2a42f	Added an implementation of `ncclSymGetKernelPtr` for when `GENERATE_SYM_KERNELS` is not defined, as it is normally generated code. (#1925 )	2025-09-19 07:52:33 -06:00
corey-derochie-amd	ed095cad35	Moved latency_profiler license into subdirs and updated NOTICES. (#1918 )	2025-09-18 12:54:39 -06:00
Venkateshwar Reddy Kandula	0cc896910e	due nccl api sync update RCCL_API_TRACE_VERSION_PATCH to 2 (#1916 )	2025-09-18 07:36:50 -06:00
Nilesh M Negi	da06c69cb8	[INIT] Use rocm-smi API instead of CLI for querying FW version (#1920 )	2025-09-17 19:17:19 -05:00
Karthik Ganesan	740dfd1efd	Update prims_simple.h to keep header file access to rccl_metadata.h uniform (#1906 ) Header files in device/ folder are directly referenced in the code base except here.	2025-09-16 08:58:50 -05:00
Bertan Dogancay	93d86dd8e3	[BUILD] Stop generating sym kernels by default (#1907 ) * Stop generating sym kernels by default	2025-09-15 12:19:35 -04:00
mberenjk	ada4e12360	disabling msccl for fp8 datatype (#1888 ) * disabling msccl for fp8 datatype --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-09-11 13:09:34 -05:00
Wenkai Du	de9ebd8a8b	Treat PIX and PXB as same GDR distance (#1894 )	2025-09-11 10:44:10 -05:00
isaki001	9c36439354	add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889 )	2025-09-10 14:25:23 -05:00
Wenkai Du	c2bccf9156	Enable LL128 and use same tuning table for gfx942 4 NICs (#1898 )	2025-09-10 11:11:15 -04:00
Mustafa Abduljabbar	7ccc6f268f	Force enable proto and/or algo after model selection (#1799 ) * Force enable proto or algo * Remove inc nccl_common.h * Move logic and add error checks * Fix topo_expl compatibility * Allow algo/proto overrides * Remove extra function decl * Clarify warning message * Move algo/proto overrides into separate functions * Update CHANGELOG.md	2025-09-03 08:54:13 -04:00
ycui1984	361d596229	[rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 (#1867 ) * [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0 * [rocm_regression] Check firmware version * [rocm_regression] Resolve review comments * [rocm_regression] Move hsa env checking into init once func * [rocm_regression] Prevent hot fix version in firmware * [rocm_regression] Improve unit tests	2025-08-29 11:18:23 -05:00
BertanDogancay	08a7be231b	Merge remote-tracking branch 'nccl/master' into develop	2025-08-28 15:46:28 -05:00
Nusrat Islam	df448862c3	Device allocation tracker (#1878 ) * alloc: add memory allocation tracker * alloc: add tracker for ncclCuMemAlloc() APIs * alloc: add null pointer check during free	2025-08-27 09:30:51 -05:00
Mustafa Abduljabbar	277747c199	[Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861 ) * Support pipelining codegen and template specialization * Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16) * Remove need for FUNC_INDEX_TOTAL * Add pipeline field to device function key construction logic * Avoid unneeded codegen for LL/LL64 kernels * Modify conditions and add pipeline dtypes env * Optimize selection for both gfx942 and gfx950 * Increase pipeline bitfield width * Use __forceinline__ for all device functions * Realign reduceCopy with original form * Add opt-out option to enable perf debugs * Remove force-reduce-pipelining option from README * Update CHANGELOG.md --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-08-26 15:03:54 -04:00
Nusrat Islam	b882af9ffd	fixup: remove extra semicolon (#1881 )	2025-08-26 10:57:25 -05:00
Nusrat Islam	5e7937effb	Add direct allgather algorithm (#1868 ) * add direct allgather algorithm * minor fix * add debug print for memory allocation tracker * add message size threshold for direct allgather * scatter transfers across ranks * update changelog * minor fix * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * enable direct AG when pxn is ON on MI300X or MI350 --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-08-25 07:55:10 -05:00
Nilesh M Negi	bf6660ee4e	[BUILD] Populate host_table entries only for 1 unroll (#1871 )	2025-08-23 00:15:38 -05:00
Arm Patinyasakdikul	28a83c3ea6	Removing "Could not find any local path from gpu X to net." warning (#1866 ) * Removing "Could not find any local path from gpu X to net." warning to avoid confusion.	2025-08-20 16:52:35 -05:00
Arm Patinyasakdikul	fb882e80f6	Remove noinline attribute from reduceCopyPacks and (#1864 ) reduceCopyPacksWithBias.	2025-08-19 20:24:31 -05:00
Mustafa Abduljabbar	c1b3cd8911	Have ncclDevFuncId use 64-Bit keyed map with field packing (#1857 ) - Updated ncclDevFuncId to use a hash-based lookup with std::unordered_map. - Keys are now 64-bit integers, which pack coll, algo, proto, devRedOp, and type fields. - Improved flexibility and maintainability by moving away from row-based indexing. - Added error handling for missing keys in the hash map. - Aligned key generation logic with generate.py and updated generate.py.	2025-08-19 16:41:19 -04:00
Nusrat Islam	6ade5065b4	device: optimize threadfence for ll64 protocol (#1858 ) * device: optimize threadfence for ll64 protocol * device: use __atomic_signal_fence() --------- Co-authored-by: Nusrat Islam <nusislam@useocpslog-003.amd.com>	2025-08-18 09:16:41 -05:00
Nilesh M Negi	c3b8de4ec8	[DEVICE] Use noinline for LLGenericOp only on gfx950 (#1849 )	2025-08-15 15:15:02 -05:00
isaki001	44121db890	[TUNING] gfx950 16N tuning (#1835 ) * change gfx950 algo/proto selection for multinode allreduce, allgather, reduceScatter * gfx950 tuning: enable tuning for broadcast, allreduce starts LL128 earlier and switches to ring earlier, change LL128 start for allgather and reduceScatter * lower LL128 threshold * update reduceScatter LL128 min to match LL max for consistency * enable multinode PXN and increase chunksize for gfx950 * change LL128 start to 128KB, adjust ring-start according to node-count * disable code-path for fused-AR on LL128 for gfx950 * use LL128 starting from 1KB for multinode allgather on gfx950 * start LL128 earlier for multinode reduceScatter on gfx950 * start LL128 earlier for multinode broadcast on gfx950 * set multinode allreduce to start simple on 64MB for gfx950 * start LL128 from 1KB for multinode broadcast on gfx950 * setting multinode AR to use tree instead of ring at 16MB, 64MB, 128MB * set multinode broadcast to use LL for up to 256KB depending on node-count for gfx950 * adjust algo for 32MB multinode allreduce on gfx950 * make 32MB tree LL128 for multinode AR on gfx950 * make sure ring is not picked on 2N allreduce on small sizes	2025-08-15 15:12:45 -05:00
alex-breslow-amd	1aa2570b48	Disable the __threadfence on the sender side of the simple protocol when possible. (#1830 ) Leverages the traits of extended-scope fine-grain memory to get rid of a device-scope acquire-release fence. This improves throughput for single node workloads on gfx942 and gfx950 for some input sizes (e.g., ~32 MiB to about 256 MiB) when using the simple protocol. Multinode workloads on MI300X see a smaller but statistically significant uplift for some message sizes. Runtime disablement is supported via setting the environment variable RCCL_GFX942_CHEAP_FENCE_ON to 0.	2025-08-15 07:54:54 -07:00
mberenjk	c61152baa4	Added useAcc as a template parameter to address the performance regression (#1856 ) * Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-08-14 15:58:54 -05:00
Karthikeyan Arumugam	6d41e5ba99	Add cstring header explicitly as it is removed from HIP (#1859 )	2025-08-13 15:14:22 -07:00
isaki001	74d82a8145	enable more events for LL128 NPKIT trace collection (#1827 )	2025-08-07 11:19:36 -05:00
Avinash	3f8cac388e	Compiler warnings fix 2 (#1801 ) * Changes to device code * Changes to src/misc * Changes to graph * src/include changes * src/transport changes * changes in init, enqueue, proxy * Changes to CMakeLists.txt * Additional changes to device code * Additional changes to net.cc * adding 'compiler warning' tag to ease upstream merge * typo correction * Addressing comments * Additional changes for new commits	2025-08-05 17:36:23 -05:00
Arm Patinyasakdikul	6fc228e247	Disable context tracking for the current version. (#1839 )	2025-08-04 10:48:00 -05:00
Nilesh M Negi	bd55f876e9	[DEVICE] Add unroll=2 for gfx950 multi-node (#1824 )	2025-07-31 02:35:26 -05:00
ycui1984	874cd657ef	Add collective latency profiler (#1785 ) * [LatencyProfiler] Initial commit * [LatencyProfiler] Add unit tests * [LatencyProfiler] add more * [LatencyProfiler] Pass unit tests * [LatencyProfiler] Add hooks to integrate with meta internal tools * [LatencyProfiler] Restore install.sh * [LatencyProfiler] Resolved comments 1. add proper license 2. use proper namespace * [LatencyProfiler] Add header	2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar	4ce3df8d3a	Optimize alltoall for 64 GPUs and above for gfx942 (#1828 ) Add pxn and p2p net chunksize mi300x tuning	2025-07-30 15:14:43 -04:00
mberenjk	c84ee3d298	Upcast FP8 to Half (FP16) for Sum Operation (#1775 ) * adding hadd and hadd2 support using builtin functions. --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-07-29 11:33:06 -05:00


993 Commits