rocm-systems

Author	SHA1	Message	Date
Mustafa Abduljabbar	93fdcb160c	[WarpSpeed] Improve handling for auto and manual modes (#2125 ) * Force ring in WarpSpeed manual mode and log event * Skip usage for non-ring in WarpSpeed auto mode * Enable WarpSpeed when its CU count is set	2026-01-06 10:21:49 -05:00
Avinash	6f62165369	Virtual device enablement (minimal changes) (#2110) * minimal changes * Setting Default tuning table * Add warnings NIC merge across PCIe Root complexes, NUMA --------- Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>	2025-12-25 15:06:33 -06:00
Mustafa Abduljabbar	d009ab144e	[Device] WarpSpeed enablement and single node CU and perf opt for MI350 (#2073 )	2025-12-11 19:04:35 -05:00
Ahmed Khan	08dd75712f	Add ncclCommDump API (#2068) * Add ncclCommDump API * remove trailing whitespace changes * Add more proxy trace timestamps * Add facebook_rccl namespace before proxyTrace timestamp call * Clean up ProxyTrace construction * Move updateProxyOpCounter to member function * Move setProxyOpTimestamp to member function * Move addNewProxyOp to member function * Make internal methods private * Make ProxyTrace thread safe * Fix unit tests * Fix overwritten ProxyTrace DONE setting in net.cc	2025-12-11 15:02:35 -07:00
Wenkai Du	185e78a8f0	Use one side stream per process (#2063 ) * Use one side stream per process * Handle multiple GPUs per process * Reset stream when not found * Address review comments * Fix missing mutex initializer	2025-12-02 10:03:15 -08:00
Dingming Wu	b811645688	Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027 )	2025-11-11 09:46:51 -06:00
Dingming Wu	05f914c997	Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x (#2023 ) * Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that. * Update src/init.cc to use ERROR instead of WARN Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>	2025-11-10 11:54:35 -06:00
Dingming Wu	b00ee4c83c	Increment opCount for intra-node comms as well (#2024 ) * Enhance logging in NCCL initialization It's convenient to log comms obj and default channels together for debugging * Add opCount to collDevWork and update increment logic Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms) * Clarify opCount increment logic in enqueue.cc Updated comment to clarify incrementing opCount for intranode communications. * Refactor NCCL_INIT logging format Updated logging format for NCCL_INIT to improve clarity. * Remove duplicate INFO logging in init.cc	2025-11-10 11:23:49 -06:00
Bertan Dogancay	b1e680adc0	[GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030 )	2025-11-07 15:15:25 -05:00
alex-breslow-amd	56e0b4e445	[gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A (#2017 ) * Internal benchmarking shows nice single-node performance uplift for MI300A and MI350	2025-11-06 12:12:45 -08:00
Arm Patinyasakdikul	1ce83d5cc0	Added ERROR message class to handle fatal error messages. (#2002 ) * Added ERROR message class to handle fatal error messages. New ERROR message class will print the message in all debug level, including none. Change some of the fatal error message to be in ERROR instead of WARN. Added new error handler function to print out more meaningful error message in the future. * Added CHANGELOG entry. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Change to no longer reuse NONE as ERROR. ERROR is now a separated class. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-10-30 16:14:20 -05:00
Nilesh M Negi	8444b3c6e9	Fix gfx950 gating conditions to match ROCm 7.0.2 (#2003 )	2025-10-29 23:27:04 -05:00
Mustafa Abduljabbar	12f51ba8bf	[Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978 ) * Add initial commit to increase tb size to 512 * Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X * Adjust nthreads for LL * Opt threads for reduce_scatter upper small range * Add macro for single node * Restrict MSCCL to 256 threads to prevent mem access fault * Support pre-MI350 compatibility * Partially refactor threadblock size override * Use const macros instead of numerals * opt out of unused function	2025-10-29 23:24:32 -05:00
alex-breslow-amd	c70f5b4621	[gfx950] Make bypassing __threadfence the default for multinode. (#1947 ) * Gate based on ROCM version, safe for ROCm 7.0.2 and beyond. * Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950. Thanks Nilesh. * Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.	2025-10-15 09:15:36 -07:00
Surya Periaswamy	5bd5079de1	MSCCL++ fix split path null deref (#1959 ) * Add speriaswamy-amd to CODEOWNERS * MSCCL++: fix split path null deref; key maps by parent ncclUniqueId * removed no-op	2025-10-09 14:08:38 -05:00
Artem Kuzmitckii	00a42c80f3	Reverse logic of context tracking enablement from #1927 (#1971) In this commit, context tracking is disabled by default and can be enabled via `RCCL_ENABLE_CONTEXT_TRACKING=1` for both CDNA and RDNA. Original PR: https://github.com/ROCm/rccl/pull/1927	2025-10-09 10:24:09 +02:00
BertanDogancay	3f94267f21	Merge remote-tracking branch 'nccl/master' into develop	2025-10-06 18:36:49 -04:00
Nilesh M Negi	342ec086e3	Revert "changes for hugepages backed host buffer for larger allocations (#1841 )" (#1951 ) This reverts commit `65b69bf318`.	2025-10-02 23:43:09 -05:00
amd-jiali	5978d2f9ab	Print out the hipRuntimeVersion message from WARN to always show up (#1911 ) Authored-by: Jiali Li <jialili@amd.com>	2025-10-02 11:32:32 -05:00
Bhuvan Mital	65b69bf318	changes for hugepages backed host buffer for larger allocations (#1841 )	2025-09-28 00:40:22 -05:00
Artem Kuzmitckii	07925ec027	Revert disabling of context tracking for Radeon (#1927 ) * Revert disabling of context tracking for Radeon Original commit `6fc228e2` `Disable context tracking for the current version. (#1839)` * Add env variable for disabling of context tracking for Radeon `export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking * Update docs/how-to/rccl-usage-tips.rst Fix grammar, thanks @amd-jnovotny Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING * Revert changes in includes and rename util function --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-27 15:19:50 -04:00
Nilesh M Negi	da06c69cb8	[INIT] Use rocm-smi API instead of CLI for querying FW version (#1920 )	2025-09-17 19:17:19 -05:00
ycui1984	361d596229	[rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 (#1867 ) * [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0 * [rocm_regression] Check firmware version * [rocm_regression] Resolve review comments * [rocm_regression] Move hsa env checking into init once func * [rocm_regression] Prevent hot fix version in firmware * [rocm_regression] Improve unit tests	2025-08-29 11:18:23 -05:00
BertanDogancay	08a7be231b	Merge remote-tracking branch 'nccl/master' into develop	2025-08-28 15:46:28 -05:00
Nusrat Islam	5e7937effb	Add direct allgather algorithm (#1868 ) * add direct allgather algorithm * minor fix * add debug print for memory allocation tracker * add message size threshold for direct allgather * scatter transfers across ranks * update changelog * minor fix * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * enable direct AG when pxn is ON on MI300X or MI350 --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-08-25 07:55:10 -05:00
alex-breslow-amd	1aa2570b48	Disable the __threadfence on the sender side of the simple protocol when possible. (#1830 ) Leverages the traits of extended-scope fine-grain memory to get rid of a device-scope acquire-release fence. This improves throughput for single node workloads on gfx942 and gfx950 for some input sizes (e.g., ~32 MiB to about 256 MiB) when using the simple protocol. Multinode workloads on MI300X see a smaller but statistically significant uplift for some message sizes. Runtime disablement is supported via setting the environment variable RCCL_GFX942_CHEAP_FENCE_ON to 0.	2025-08-15 07:54:54 -07:00
Avinash	3f8cac388e	Compiler warnings fix 2 (#1801) * Changes to device code * Changes to src/misc * Changes to graph * src/include changes * src/transport changes * changes in init, enqueue, proxy * Changes to CMakeLists.txt * Additional changes to device code * Additional changes to net.cc * adding 'compiler warning' tag to ease upstream merge * typo correction * Addressing comments * Additional changes for new commits	2025-08-05 17:36:23 -05:00
Arm Patinyasakdikul	6fc228e247	Disable context tracking for the current version. (#1839 )	2025-08-04 10:48:00 -05:00
Nilesh M Negi	bd55f876e9	[DEVICE] Add unroll=2 for gfx950 multi-node (#1824 )	2025-07-31 02:35:26 -05:00
ycui1984	874cd657ef	Add collective latency profiler (#1785 ) * [LatencyProfiler] Initial commit * [LatencyProfiler] Add unit tests * [LatencyProfiler] add more * [LatencyProfiler] Pass unit tests * [LatencyProfiler] Add hooks to integrate with meta internal tools * [LatencyProfiler] Restore install.sh * [LatencyProfiler] Resolved comments 1. add proper license 2. use proper namespace * [LatencyProfiler] Add header	2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar	4ce3df8d3a	Optimize alltoall for 64 GPUs and above for gfx942 (#1828 ) Add pxn and p2p net chunksize mi300x tuning	2025-07-30 15:14:43 -04:00
Wenkai Du	9a4213356d	Support fused all reduce and elementwise operations (#1729 ) * Support fused all reduce and elementwise operations Add additional "acc" parameter to RCCL Replayer logs Add flag which indicates availability of new API * Fix Recorder json parsing * Remove unreachable code * Remove extra acc pointer check * . * Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)" This reverts commit `9d72be7b2f`. * Use noinline to reduce kernels linking time * Don't use noinline for gfx942 and gfx950 to avoid perf regression --------- Co-authored-by: AtlantaPepsi <timhu102@amd.com> Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>	2025-07-23 09:04:17 -07:00
alex-breslow-amd	11fabf1de1	Cheaper threadfence for gfx942 in postPeer [1/N]: enable for single node allreduce (#1766 ) Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable	2025-07-22 07:15:15 -07:00
Wenkai Du	708ad75f7a	Disable P2P net option by default (#1793 )	2025-07-14 08:55:39 -07:00
Kamil Iskra	7c12c627c6	NCCL 2.27.6-1 Improve support for DirectNIC (CX8) * Add support for XDR speed detection. * When DirectNIC is enabled, report only the RDMA interfaces. Extend the P2C (PXN over C2C) support to send/receive operations. Support compilation with GCC 14 (Issues #1743, #1751). Fix the unloading of network plugins that also provide tuner capability. Fix the change of the current device across the calls to ncclCommDestroy() and ncclCommAbort(). A note for users on MNNVL systems: please ensure an adequate stack size for NCCL threads. While the default Linux stack size limit of 8192 KB is known to be sufficient, we've seen crashes if the limit is changed to "unlimited", as it causes the glibc library to unexpectedly decrease the stack size of NCCL's background threads to just 2048 KB. Use "ulimit -s" in bash to print the current limit; if needed, reset it to 8192 KB using "ulimit -s 8192" (one also needs to ensure that the new setting is propagated to other nodes when launching a multi-node NCCL job).	2025-07-11 07:32:13 -07:00
Nilesh M Negi	2c099fe29a	[INIT] Fix fallback for unsupported user-specified runtime unroll factor (#1780 ) * [INIT] Fix fallback for unsupported user-specified runtime unroll factor * Add CollTrace guard * Move `commSetUnrollFactor()` to rccl_wrap.cc * Modify comments in the device-code generator script	2025-07-10 10:56:18 -05:00
Dingming Wu	020dcf0a7c	Add proxyTrace (#1732 ) This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.	2025-06-25 23:01:34 -05:00
Bertan Dogancay	675b495a00	[NPKit] Create dump dir regardless of default or user provided path (#1757 )	2025-06-21 21:18:20 -05:00
BertanDogancay	aaf023976a	Merge remote-tracking branch 'nccl/master' into develop	2025-06-20 07:54:49 -05:00
Kamil Iskra	3ea7eedf3b	NCCL 2.27.5-1 Improvements for GB200 systems * Optimize the network performance by alternating the direction of the rings and the NIC to GPU assignment across communicators to limit unnecessary sharing. * Fix the detection of C2C links in case GPU Direct RDMA is disabled between a GPU and a NIC. * Fix PXN support on MNNVL systems, where NCCL would try (and fail) to share regular host memory across multiple nodes. * Fix P2C (PXN over C2C), which is now preferred over regular PXN. This support is currently preliminary and is disabled by default; use NCCL_PXN_C2C=1 to enable. Further reduce the overheads of CUDA graph capturing, which increased in NCCL 2.26.2 for large graphs. Optimize the network performance on DGX B200 systems by adjusting the bandwidths provided to the graph search algorithm. Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8. Restore the plugin name handling logic to make it possible to specify a path to the plugin (Issue #1732). Restore the ability to change NCCL_COLLNET_ENABLE during execution (Issue #1741). Add an example tuner plugin with CSV-based overrides. Remove an x86 dependency from the example profiler.	2025-06-18 10:34:47 -07:00
Bertan Dogancay	39211c6b41	[NPKit] Use default output directory when env var is not set (#1747 )	2025-06-16 15:26:53 -04:00
Nilesh M Negi	9d72be7b2f	[DEVICE] Adding ability to choose unroll factor at runtime (#1734 ) * Adding runtime unroll factor selection via RCCL_UNROLL_FACTOR * [BUILD] Add support for user-defined UNROLL for debugging * Update CHANGELOG.md * Fix COLLTRACE errors in CI * Add debug statements for unroll and resolve warnings * Incorporate UNROLL into ONLY_FUNCS for debugging --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com> Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-06-11 00:07:59 -05:00
Arm Patinyasakdikul	d5b5f6b159	Increase default WORK_FIFO size to accommodate larger alltoall. (#1722 )	2025-06-05 09:02:45 -05:00
Avinash	e94b360246	SPLITCOMM design fix in src/misc/msccl (#1715 ) * Fix TOC-TOU in mcclInit * Improving vector resize thread safety * Initial commit rank to comm change * Removing unwanted include header changes * Updated CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-06-01 21:00:38 -05:00
Kamil Iskra	72d2432094	NCCL 2.27.3-1 Symmetric memory API and symmetric kernels * Redesign from the ground up, enabling major latency and bandwidth improvements. * Add new API calls to register user-allocated memory among communicator ranks into a NCCL window: ncclCommWindowRegister() and ncclCommWindowDeregister(). The calls currently support symmetric registration for P2P and NVLS, and require VMM memory buffers (i.e., CUMEM must be operational). * Implement specialized kernels taking advantage of symmetrically registered memory, with performance gains expected particularly for small to medium message sizes. * The kernels support 32 bit floating point types and smaller, and sum as the reduction operator, with no more than one collective operation per group. * Floating point summation is always done in fp32 accumulators (with the exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus, the accuracy with fp8 and fp16 data types should be much improved. * This initial implementation supports non-network communicators only (P2P and NVLS transports). * To explore this functionality users need to use the new memory registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag and all ranks of a communicator must pass buffers at the same offset in the same registration when invoking a collective NCCL operation. Add support for DGX Spark. Add support for DirectNIC (CX8) to the internal IB plugin. Add a new ncclCommShrink() API call * It is a non-collective call similar to ncclCommSplit(), which makes it possible to exclude some (possibly unresponsive) ranks from the parent communicator. Add support for loading multiple network plugins * This enables the creation of generic containers that can work across a range of providers. * Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to load. NVLink SHARP (NVLS) improvements * Implement NVLS+IB SHARP support for AllGather and ReduceScatter with user buffer registration. 
This improves performance and reduces the number of CTAs needed to achieve peak bandwidth. * Gracefully fall back by default to other transports if NVLS initialization fails (the old behavior of returning an error code from a NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1). * Decrease the NVLS channel count to 24 on Blackwell systems with multiple NVLink domains per communicator. * Enable fine-tuning of NCCL behavior per communicator using new "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs". Profiler improvements * Extend the init function by adding communicator name, comm id (hash), rank, number of ranks, number of nodes, and the NCCL log function to the argument list. This makes the name and the comm id available to all events in the communicator without explicitly passing them to each individual event. Add the communicator id and rank to the profiler trace filename. Now, the communicator name can be set via a new "ncclConfig_t" member "commName". * Improve the accuracy of the GPU kernel events by providing GPU-generated timestamps for the start and stop of every NCCL operation. * Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep states. * Add support for network-defined event updates (through "recordEventState"). * Report the correct number of channels used by every collective/p2p operation (used to be set to nMaxChannels for collectives and absent for p2ps). * Fix the logic on proxyCtrl Idle/Active events (Issue #1162). * Fix an issue where the network proxy profiler could lose track of an event identifier (Issue #1682). * Improve the backward compatibility with plugins older than v4. * Ensure that the work counters are 0-initialized. * Fix a potential race condition in the network profiler that could result in an event being linked to a wrong parent. 
MNNVL improvements * Increase to 16 the number of NICs used to communicate between MNNVL domains on GB200 systems, to optimize the performance of collective operations. * Add support for more complex MNNVL topologies with up to 32 NICs per node. * If the MNNVL fabric initialization was unsuccessful, NCCL will now fail by default, so as to avoid inadvertently falling back to a potentially much slower network transport. Such failures are typically due to a misconfigured IMEX support on the system. To continue without MNNVL, restart the job with NCCL_MNNVL_ENABLE=0. * Fix a potential hang in alltoall-like communication patterns at a scale of over 80 ranks. * Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no longer needs to be specified on MNNVL systems). * Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL systems. * Fix the graph search to exclude non-local NICs. * Fix the SHM transport to use fabric handles on MNNVL systems. NIC Fusion improvements * Disable the creation of fused NICs for physical devices that haven't been merged. * Flatten multiple ports to a single PCI device within the internal IB plugin and reparent dual-port NICs under the first PCI parent. If the parent is not a PCI switch, PCI devices for fused NICs won't be duplicated. * Route traffic on GB200-CX8 systems through DirectNIC, not the host interface. Improve support for platforms with C2C connectivity (e.g., GB200) * Enable GPUDirect RDMA for the NICs by default. * Add support for P2C (PXN over C2C) and the LL128 protocol. Extend NCCL fault tolerance in multithreaded scenarios * Support the creation of multiple nonblocking communicators within a single group and polling in parallel for the completion using multiple threads (one per communicator). Enable ncclImplicitOrderLaunch for CUDA 12.9+ * This can potentially speed up NCCL_IMPLICIT_LAUNCH_ORDER. 
Improve the netSocket transport latency and control * Provide finer control over the size of the socket send/receive buffers, the task size, and the number of sockets that a single peer can open. * Add support for the inlining of small messages behind the header when using multiple sockets per connection. Improve the readability of the CPU affinity in the debug output * Print it as a range string rather than a bitmask. Fix a potential race condition in graph execution * A contention could arise when mixing graph and non-graph execution. Improve PXN connection code * Avoid duplicate and unused connections. RAS fixes * Fix a memory corruption at job termination time in case of a previously failed initialization of a RAS socket connection. * Fix a race condition leading to a crash when generating a RAS report during communicator initialization (Issues #1669, #1718). * Fix a potential race condition when gathering data for a RAS status report. Fix a potential memory corruption in ncclCommSplit() * Memory could get corrupted when resource sharing was in use and the size of the NVLink domain in the new communicator was smaller than in the old one. Fix asynchronous graph upload * Fix a small memory leak. * Fix oversynchronization. Add a check for out-of-memory conditions in ncclMemAlloc() Clean up the NCCL socket code * accept() will retry also if just reading the magic failed (Issue #1613). * connect() will retry also if poll() did not return a POLLOUT event (Issue #1618). * Add error checking in a few instances (Issue #1539). * Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574). * Clean up the debug output, downgrading WARN messages to INFO in non-critical cases, and printing the peer's address where relevant. Switch NCCL_DEBUG_FILE to line buffering * This should help avoid mixed-up partial output lines in multithreaded cases. Other minor fixes * Improve the checks for buffer overflows in the graph code (Issue #1585). 
* Extend logging and state clearing to all four events in the internal IB plugin (Issue #1650). * Fix the error path in case IB communication is not ready (Issue #1489). * Add ECE logging for IB fabric. * Fix various minor issues in the graph module (Issue #1635). * Clean up the debug output in the graph code, downgrading WARN messages to INFO in non-critical cases. * Add a missing argument to a directSend() call (Issue #1628). * Remove duplicate code in sendProxySetup() (Issue #1420). * Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507). * Fix compiler warnings with GCC 14. * Fix a typo in a comment (Issue #1236).	2025-05-29 20:56:40 -07:00
alex-breslow-amd	2f6b20c00a	Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681 ) for Single Node on Some GFX9 Systems Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.	2025-05-29 16:17:35 -07:00
Nilesh M Negi	12517a957e	Re-apply unroll=1 and 112 channels for gfx950 (#1706 ) * Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667) This reverts commit `329e13efff`. * Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620) This reverts commit `b17338d164`.	2025-05-28 14:58:10 -05:00
Dingming Wu	51f87fbb43	Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683 ) * Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() For rocm older than 6.4, we need to set HSA_NO_SCRATCH_RECLAIM=1 to use LL128 protocol. This Env is set outside of RCCL, add the logging to detect whether its set during runtime. * check hip runtime ver via hipRuntimeGetVersion * move the detection to ncclinit func * correct rocm version integer * update warning message * avoid unnecessary info msg on hsa_no_scratch_reclaim detection	2025-05-14 10:12:45 -05:00
Avinash	c54a0c085a	collective trace improvements for debugging (#1661 )	2025-05-07 13:37:31 -05:00
Bertan Dogancay	590ad6acc2	Merge pull request #1662 from BertanDogancay/2.25 [SYNC] 2.25.1-1	2025-05-06 09:39:09 -04:00


293 Commits