This PR is intended to move RCCL to gating changes on CI failures. Right now, only build/unittests run per PR consistently. We should eventually add all single and multi-node test status badges once those tests are running in presubmit and continuously on develop
* Added new binary for executing unit tests
Added new unit tests for argcheck.cc and alt_rsmi.cc files
Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
* Added new unit tests for src/transport/shm.cc
* Added new unit tests for graph/xml.cc
- Introduced double-buffering to reduce copy overhead and overlap BF16 arithmetic with data prefetching.
- Aimed to improve performance of reduction-based collectives by up to 10%.
- Implemented based on recommendations from Guennadi Riguer (AMD)
- Added --force-reduce-pipeline option to install.sh to activate this optimization for BF16 reductions.
- Feature is disabled by default to prevent regressions with large messages until auto-tuning logic is upstreamed.
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Prevent initialization failures in certain configurations when attempting
to load fp8-specific symmetric multicast kernels on GPUs older than
Blackwell.
* Support fused all reduce and elementwise operations
Add additional "acc" parameter to RCCL Replayer logs
Add flag which indicates availability of new API
* Fix Recorder json parsing
* Remove unreachable code
* Remove extra acc pointer check
* .
* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"
This reverts commit 9d72be7b2f.
* Use noinline to reduce kernels linking time
* Don't use noinline for gfx942 and gfx950 to avoid perf regression
---------
Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable
- Added a check to skip issues labeled "ongoing" in the close-old-issues script
- Adjusted the condition to compare both creation and update dates against six months ago
* Increased max stack size to 640
* Added new binary for executing unit tests
Added new unit tests for argcheck.cc and alt_rsmi.cc files
Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
We add 3 different issue types issue/question/RFE and add some predefined
questions to speed up the debugging process.
We also add a custom action which will close all issues create mode than 6
months ago which have not been updated for more than a month.
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.
Extend the P2C (PXN over C2C) support to send/receive operations.
Support compilation with GCC 14 (Issues #1743, #1751).
Fix the unloading of network plugins that also provide tuner capability.
Fix the change of the current device across the calls to ncclCommDestroy()
and ncclCommAbort().
A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads. While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB. Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).