* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()
For ROCm older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime.
* check hip runtime ver via hipRuntimeGetVersion
* move the detection to ncclinit func
* correct rocm version integer
* update warning message
* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
* [SRC] Enable unroll=1 for gfx950
* Fix typo from rebase in generate.py
* Support for unroll=1 and gfx90a when building for all GPU targets
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
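The HSA_NO_SCRATCH_RECLAIM detection described in the first commit above can be modelled roughly as follows. This is a minimal Python sketch, not RCCL's code: the function names are hypothetical, the 6.4 cutoff is taken from the commit message, and the version integer is assumed to be encoded as major * 10000000 + minor * 100000 + patch.

```python
import os


def decode_version(version):
    """Split an encoded runtime version into (major, minor, patch).

    Assumed encoding: major * 10_000_000 + minor * 100_000 + patch.
    """
    major, rest = divmod(version, 10_000_000)
    minor, patch = divmod(rest, 100_000)
    return major, minor, patch


def scratch_reclaim_warning(version, env=None):
    """Return a warning string if LL128 prerequisites look unmet, else None."""
    env = os.environ if env is None else env
    major, minor, _ = decode_version(version)
    if (major, minor) >= (6, 4):
        return None  # newer runtimes do not need the variable
    if env.get("HSA_NO_SCRATCH_RECLAIM") == "1":
        return None  # variable was set outside the library, nothing to report
    return "HSA_NO_SCRATCH_RECLAIM=1 is not set; LL128 needs it on ROCm older than 6.4"
```

Returning None in the common case mirrors the later commit that avoids an unnecessary info message when nothing is wrong.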
* Add fault injection of starting warps with random variations
This is done by inserting random delays after __syncthreads().
The feature can be turned off by FAULT_INJECTION=OFF in cmake.
* Remove manually introduced bug for demo purpose
* Use only one thread per warp for checking wall clock
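The idea behind the fault injection above can be illustrated on the CPU. The sketch below is an analogy only: Python threads stand in for warps, threading.Barrier stands in for __syncthreads(), and the FAULT_INJECTION flag and run_workers helper are hypothetical names, not part of the actual kernels.

```python
import random
import threading
import time

FAULT_INJECTION = True  # hypothetical switch, mirroring the cmake option


def run_workers(num_workers=4, max_delay_s=0.002, seed=0):
    """Release all workers from a barrier, then perturb their start times."""
    rng = random.Random(seed)
    delays = [rng.uniform(0, max_delay_s) for _ in range(num_workers)]
    barrier = threading.Barrier(num_workers)
    order = []
    lock = threading.Lock()

    def worker(idx):
        barrier.wait()  # stands in for __syncthreads()
        if FAULT_INJECTION:
            time.sleep(delays[idx])  # random delay injected after the barrier
        with lock:
            order.append(idx)  # record the order in which workers proceed

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Code that is correct must produce the same result for every completion order; a result that changes with the seed points to a missing synchronization.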
* removed gfx940 and gfx941
* Update "gfx94" to "gfx942" in init.cc
* Updated remaining "gfx94" updates to "gfx942"
* Update filenames and variables from gfx940 to gfx942
---------
Co-authored-by: akolliasAMD <akollias@amd.com>
* Fix collective trace
* Use nontemporal for st_global
* Fix previous commit
* Add HDP flush to data receive path
* Fix previous commit
* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH
* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH
Both are on by default. Turning both off skips all flushes and will likely
result in data errors.
* Enable GDR copy by default
* Remove GDR flush env var because it is disabled by GDC flush
* Output kernel collective trace at comm destroy by default
* Limit kernel timeout messages to 100
* Use system relaxed atomic for loadInt
* Refine timeout messages and use atomic for setting offset from CPU
* Add kernel trace for barrier timeout
* Add backup barrier to avoid race in atomicAdd
* Use different counters for different warps
* Rework barrier implementation
* Fix for other GFX
* Use __hip_atomic_store and __hip_atomic_load
* Fix bug in previous commit
* Don't reset barrier values in running kernel
* Update trace format
* Fix typo
* Switch back to hip_atomic_fetch_add
* Use same barrier implementation for all GFX
* Remove extra threadfence
* Turn off HDP flush by default
Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush
* Remove unnecessary changes from alternative barrier implementation
* Added back __threadfence_block
* Revert back to threadfence for gfx other than gfx94x
Add Blackwell/SM100 support
* Add compilation for sm100
* Add graph search speeds for Blackwell
* Optimize graph search to converge on large NVLink domains
* Limit NVLS heads to 32
* Increase various limits to fit large NVLink domains
* Add extra checks for IMEX setup, needed for MNNVL
* Increase MAXCHANNELS to 64
Extend NVTX instrumentation to track NCCL communicators
* Add communicator ID to NVTX traces to allow for correlation
between ranks.
RAS fixes
Network user buffer support for collectives
* Leverage user buffer registration to achieve zero-copy
inter-node communications for Ring, NVLS and Collnet
Add RAS subsystem
* Create a RAS thread keeping track of all NCCL communicators.
* Add a ncclras tool contacting the RAS thread and getting a
report.
Add fp8 support
* Add support for e5m2 and e4m3 8-bit floating point operations.
* Use Tree/PAT algorithms when possible for better numerical
stability.
Add NIC fusion
* Add a NET API to ask the network plugin to fuse a set of
interfaces together.
* Fuse multiple NICs under the same PCI switch as a single,
larger NIC.
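As a rough illustration of the fusion rule above, the following sketch groups NICs by the PCI switch they sit under and presents each group as one larger device. The dictionary fields (name, pci_switch, speed_gbps) and the fuse_nics helper are hypothetical; the real NET API call is not shown here.

```python
from collections import defaultdict


def fuse_nics(nics):
    """Merge NICs that share a PCI switch into one logical NIC."""
    by_switch = defaultdict(list)
    for nic in nics:
        by_switch[nic["pci_switch"]].append(nic)

    fused = []
    for switch, members in by_switch.items():
        fused.append({
            "name": "+".join(m["name"] for m in members),
            "pci_switch": switch,
            # The fused device is modelled with the summed bandwidth.
            "speed_gbps": sum(m["speed_gbps"] for m in members),
        })
    return fused
```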
Socket connection failure retry
* Retry in case of socket connection failure (unreachable host)
* Avoid "Software caused connection abort" errors on retries
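A retry loop of the kind described above might look like the sketch below. It is an assumption-laden illustration: connect_with_retry is a hypothetical helper, the retry count and delay are arbitrary, and only the "unreachable host" error (EHOSTUNREACH) is treated as retryable.

```python
import errno
import time


def connect_with_retry(connect, retries=3, delay_s=0.1, sleep=time.sleep):
    """Call connect() until it succeeds or the retries are exhausted."""
    for attempt in range(retries + 1):
        try:
            return connect()
        except OSError as err:
            # Anything other than "host unreachable" is re-raised immediately,
            # as is the final failed attempt.
            if err.errno != errno.EHOSTUNREACH or attempt == retries:
                raise
            sleep(delay_s)
```

Passing the connect callable in keeps the sketch independent of any particular socket setup.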
QP connection failure retry
* Retry in case of IB QP connection failure during ibv_modify_qp.
NET API improvements
* Allow plugins to force a flush in case data and completion
ordering is not guaranteed.
* Indicate when completion is not needed (e.g. for the LL128
protocol), allowing plugins to skip generating a completion.
* Allow for full offload of allgather operations when using one
GPU per node.
NCCL_ALGO/NCCL_PROTO strict enforcement
* Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify
ALGO/PROTO filters for each collective operation.
* Strictly enforce the ALGO/PROTO filters, no longer fall back
on the ring algorithm when the filtering leaves no option and
error out instead.
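The strict enforcement described above can be sketched as follows. The filter syntax used here (collective:algo1,algo2 entries separated by semicolons) is a simplified, hypothetical stand-in for the real NCCL_ALGO grammar, and the SUPPORTED table is invented for illustration.

```python
# Hypothetical table of algorithms available per collective.
SUPPORTED = {"allreduce": ["ring", "tree"], "allgather": ["ring", "pat"]}


def parse_filters(spec):
    """Parse 'coll:a,b;coll2:c' into {coll: [a, b], coll2: [c]}."""
    filters = {}
    for entry in filter(None, spec.split(";")):
        coll, _, algos = entry.partition(":")
        filters[coll.strip().lower()] = [
            a.strip().lower() for a in algos.split(",") if a.strip()
        ]
    return filters


def select_algos(coll, spec, supported=SUPPORTED):
    """Apply the filter strictly: no silent fallback to ring."""
    allowed = parse_filters(spec).get(coll)
    candidates = supported[coll]
    if allowed is None:
        return list(candidates)  # no filter given for this collective
    remaining = [a for a in candidates if a in allowed]
    if not remaining:
        raise ValueError(f"no algorithm left for {coll} after filtering with {spec!r}")
    return remaining
```

The key behaviour is the final raise: when the filter leaves nothing, the call fails instead of quietly picking ring.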
Enable CUMEM host allocations
* Use cumem functions for host memory allocation by default.
Improved profiler plugin API
* Avoid dependencies with NCCL includes.
* Add information on whether the buffer is registered or not
Adjust PAT tuning
* Improve transition between PAT and ring at scale.
Fix hangs when running with different CPU architectures
* Detect when we use a mix of GPU architectures
* Ensure Algo/Proto decisions are made based on that unified
state.
Fix FD leak in UDS
* Fix a leak when mapping buffers intra-node with cumem IPCs.
Fix crash when mixing buffer registration and graph buffer registration.
* Separate local and graph registration to avoid crashes when we free
buffers.
Fix user buffer registration with dmabuf
* Make ncclSend/ncclRecv communication with buffer registration functional
on network plugins relying on dmabuf for buffer registration.
Fix crash in IB code caused by uninitialized fields.
Fix non-blocking ncclSend/ncclRecv
* Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking
mode even though the operation was not enqueued onto the stream.
* Issue #1495
Various compiler tweaks and fixes
* PR #758
Fix typo in ncclTopoPrintGraph
* Issue #1468
Add scalable init API
* Add new ncclCommInitRankScalable to allow for passing multiple
unique IDs to the init function.
* Spreads the load onto multiple bootstrap roots, allowing for
constant bootstrap time.
* Requires multiple ranks to create a unique ID, and the CPU-side
ID exchange code to call allgather[v] instead of broadcast.
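The CPU-side ID exchange described above can be simulated without any GPU. In this sketch, a few ranks each create an ID, an allgatherv-style exchange gives every rank the full list, and each rank is mapped to one root. The choice of which ranks create IDs and how ranks map to roots is hypothetical, as are all function names; the real IDs come from the library, not from strings.

```python
from bisect import bisect_right


def root_ranks(nranks, nroots):
    """Hypothetical choice of ranks that each create one unique ID."""
    return [i * nranks // nroots for i in range(nroots)]


def root_for_rank(rank, nranks, nroots):
    """Index of the root serving `rank`: the closest creator at or below it."""
    return bisect_right(root_ranks(nranks, nroots), rank) - 1


def simulate_id_exchange(nranks, nroots):
    """Return the ID list every rank holds after the allgatherv."""
    creators = set(root_ranks(nranks, nroots))
    # Each rank contributes one ID or nothing; contributions are
    # concatenated in rank order, as allgatherv would do.
    contributions = [
        [f"uid-from-rank-{r}"] if r in creators else [] for r in range(nranks)
    ]
    gathered = [uid for c in contributions for uid in c]
    return [list(gathered) for _ in range(nranks)]
```

With a single root this degenerates to the broadcast pattern; with several roots every rank ends up holding the whole list, which is what it would then pass to the init call.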
Accelerate init bootstrap operations
* Reduce the number of calls to allgather.
* Allow roots to reply early to ranks when information is already
available.
* Add an option to use ncclNet instead of sockets to perform
bootstrap allgather operations.
Add PAT algorithms for Allgather and ReduceScatter
* Parallel Aggregated Trees, variation of Bruck algorithm.
* Logarithmic number of network steps for small sizes at scale.
* Only supports one rank per node at the moment.
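To show where the logarithmic step count comes from, here is a small simulation of the classic Bruck allgather. It is not the PAT implementation, only the textbook algorithm that PAT is described as a variation of; the bruck_allgather name is chosen for this example.

```python
def bruck_allgather(values):
    """Simulate a Bruck allgather; return (per-rank results, step count)."""
    n = len(values)
    # blocks[r] holds the data of ranks r, r+1, r+2, ... (mod n), in that order.
    blocks = [[v] for v in values]
    steps = 0
    dist = 1
    while dist < n:
        # Each rank already holds `dist` blocks and needs at most n - dist more.
        count = min(dist, n - dist)
        # Rank r receives the first `count` blocks held by rank (r + dist) % n.
        blocks = [blocks[r] + blocks[(r + dist) % n][:count] for r in range(n)]
        dist *= 2
        steps += 1

    # Undo the rotation so that index i holds rank i's value on every rank.
    results = []
    for r in range(n):
        ordered = [None] * n
        for i, v in enumerate(blocks[r]):
            ordered[(r + i) % n] = v
        results.append(ordered)
    return results, steps
```

Because the amount of data held doubles at each step, n ranks need ceil(log2(n)) steps, compared with n - 1 steps for a ring.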
Add support for registered buffers for intra-node communication.
* Allow registered user buffers to be accessed directly intra-node
* Avoids extra copies in algorithms which permit it, saving
memory bandwidth and helping with compute overlap.
Add profiler plugin API
* New plugin API for profiling
* Supports various levels of profiling, with a hierarchy.
Asynchronous graph allocation
* Make calls to cudaMalloc and cudaMemcpy during graph allocation
asynchronous.
* Significantly speeds up graph capture.
Use fatal IB asynchronous events to stop network operation
* Avoids many other error messages
* Only fatal errors are affected; potentially transient errors
(e.g. port down) do not cause an immediate stop.
Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node
* P2P would cause a significant performance degradation when using
many GPUs, and therefore many interleaved data flows.
* Disable P2P through the CPU when we have 3+ GPUs per node; keep it
enabled when we only have 2 GPUs.
Improve the init logs to report the real NCCL function.
* Make the log report ncclCommInitRank or ncclCommSplit, rather than
the generic ncclCommInitRankFunc.
Add a parameter to set the location of the user configuration file.
* Add NCCL_CONF_FILE environment variable to set where the user's
configuration file resides.
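The lookup implied above can be sketched as below. The user_conf_path helper is hypothetical, and the fallback to a .nccl.conf file in the home directory is an assumption about the default location.

```python
import os


def user_conf_path(env=None):
    """Return the user configuration file path, honouring NCCL_CONF_FILE."""
    env = os.environ if env is None else env
    override = env.get("NCCL_CONF_FILE")
    if override:
        return override  # explicit location wins
    # Assumed default: a dot-file in the user's home directory.
    return f"{env.get('HOME', '')}/.nccl.conf"
```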
Increase default IB timeout
* Increase IB timeout value from 18 to 20.
* Should help avoid fatal errors on large RoCE systems.
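To put the two values in perspective, the arithmetic below assumes the setting is the InfiniBand timeout exponent, where the actual timeout is 4.096 microseconds times 2 raised to the value. The helper name is chosen for this example.

```python
def ib_timeout_seconds(value):
    """Timeout in seconds for an exponent value: 4.096 us * 2**value."""
    return 4.096e-6 * (2 ** value)


for value in (18, 20):
    print(f"timeout {value}: {ib_timeout_seconds(value):.3f} s")
```

Under that assumption, going from 18 to 20 raises the timeout from about 1.07 s to about 4.29 s, a factor of four.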
Add new check for nvidia peermem
* On Linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
present; check for /sys/module/nvidia_peermem/version instead.
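The two-path check described above can be sketched as follows. The peermem_loaded helper and its sysfs_root parameter are hypothetical (the parameter exists only so the check can be exercised against a fake directory tree); the two paths are the ones named in the entry.

```python
import os

# Legacy location first, then the one that still exists on newer kernels.
PEERMEM_PATHS = (
    "sys/kernel/mm/memory_peers",
    "sys/module/nvidia_peermem/version",
)


def peermem_loaded(sysfs_root="/"):
    """Return True if either indicator of nvidia peermem is present."""
    return any(
        os.path.exists(os.path.join(sysfs_root, path)) for path in PEERMEM_PATHS
    )
```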
Fix old performance regression when mixing small and large operations.
* Improves distribution of work on channels.
Fix crash when NUMA IDs are equal to -1.
* Can happen when a NIC is a virtual NIC, or when Linux doesn't
know which NUMA node a device is attached to
* Issue NVIDIA/nccl-tests#233
Fix tree graph search when NCCL_CROSS_NIC is set to 1.
* Would force NCCL to use the balanced_tree pattern, thereby
disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
* Would also try to use alternate rings even though it was not
needed.
Compiler tweaks and fixes
* PR #1177
* PR #1228
Fix stack smash
* PR #1325
Fixes for multi-node NVLink + IB operation
Coverity fixes and comments.
* Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid.
* Checking `mscclEnabled` for the process and the topology to gate MSCCL++.
* Allowed `mscclForceEnable` to enable MSCCL++.
* Added restrictions around calling MSCCL++ collectives (#1281)
* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.
* Renamed and refactored some mscclpp types.
* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.
* Disable MSCCL++ when using managed memory buffers as it isn't supported.
* Added datatype and op constraints for MSCCL++ AllReduce.
* Added documentation on MSCCL++ restrictions to the README.
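The AllGather size restriction listed above amounts to a simple predicate. The sketch below uses a hypothetical helper name and assumes the size being checked is the message size in bytes; whether that is measured per rank or in total is not stated here.

```python
def allgather_size_allowed(nbytes):
    """True if the message size is a non-zero multiple of 32 bytes."""
    return nbytes > 0 and nbytes % 32 == 0
```

A caller would fall back to the regular path whenever this returns False.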
* [BUILD] Support custom CMake flags in MSCCLPP (#1275)
* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] CMake flags to support build-id in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] Fix CMake warnings in MSCCLPP build
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>
* Link to libmscclpp_nccl statically (#1282)
* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.
* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.
* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.
* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)
* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)
* Include mscclpp as a git submodule (#1314)
* Added the desired mscclpp commit as a git submodule.
* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.
* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.
* Enabled MSCCL++ feature build.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.