* erase handle key from mscclpp communicator during deregistration
* remove check on buffer size being a multiple of 32 from registration/deregistration routines since these checks are applied during enqueue
* add check for greater than zero buffer size in mscclpp registration
[ROCm/rccl commit: 19105206f6]
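The registration changes above can be pictured with a minimal sketch. Every name here (`Communicator`, `registerBuffer`, `deregisterBuffer`) is hypothetical and chosen only for illustration; none of it is the actual RCCL or MSCCL++ API. It assumes the communicator keeps a map from handle to buffer size.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for the communicator's handle bookkeeping.
struct Communicator {
  std::unordered_map<void*, std::size_t> handles;  // handle -> buffer size in bytes
};

// Registers a buffer. Only a greater-than-zero size is required here;
// the multiple-of-32 check is assumed to be applied later, at enqueue time.
bool registerBuffer(Communicator& comm, void* buff, std::size_t size) {
  if (buff == nullptr || size == 0) return false;
  comm.handles[buff] = size;
  return true;
}

// Deregisters a buffer and erases its key so no stale handle remains.
// Returns false if the handle was never registered.
bool deregisterBuffer(Communicator& comm, void* handle) {
  return comm.handles.erase(handle) == 1;  // erase returns the number of removed keys
}
```

In this sketch, deregistering the same handle a second time reports failure because the key is already gone.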
* Fix collective trace
* Use nontemporal for st_global
* Fix previous commit
* Add HDP flush to data receive path
* Fix previous commit
* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH
* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH
Both are on by default. Turning both off skips all flushes and will likely
result in data errors.
* Enable GDR copy by default
* Remove GDR flush env var because it is disabled by GDC flush
* Output kernel collective trace at comm destroy by default
* Limit kernel timeout messages to 100
* Use system relaxed atomic for loadInt
* Refine timeout messages and use atomic for setting offset from CPU
* Add kernel trace for barrier timeout
* Add backup barrier to avoid race in atomicAdd
* Use different counters for different warps
* Rework barrier implementation
* Fix for other GFX
* Use __hip_atomic_store and __hip_atomic_load
* Fix bug in previous commit
* Don't reset barrier values in running kernel
* Update trace format
* Fix typo
* Switch back to hip_atomic_fetch_add
* Use same barrier implementation for all GFX
* Remove extra threadfence
* Turn off HDP flush by default
Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush
* Remove unnecessary changes from alternative barrier implementation
* Added back __threadfence_block
* Revert back to threadfence for gfx other than gfx94x
[ROCm/rccl commit: caba0bc049]
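As a rough illustration of the final default described above (HDP flush off unless `RCCL_NET_HDP_FLUSH=1` is set), the sketch below reads the variable directly from the environment. The function name `hdpFlushRequested` is hypothetical, and the library's own parameter handling may well differ.

```cpp
#include <cstdlib>

// Illustrative only: reads RCCL_NET_HDP_FLUSH straight from the environment.
// The function name is an assumption, not part of RCCL.
bool hdpFlushRequested() {
  const char* value = std::getenv("RCCL_NET_HDP_FLUSH");
  if (value == nullptr) return false;  // unset: HDP flush stays off by default
  return std::atoi(value) != 0;        // e.g. RCCL_NET_HDP_FLUSH=1 switches it on
}
```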
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.
* Cast the linger value to `int` sooner to avoid a scope where its type is unclear.
* Added CHANGELOG entry for this feature.
[ROCm/rccl commit: 2e35417fe5]
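A minimal sketch of environment-controlled socket options, assuming POSIX sockets. The commit message does not name the env params, so `EXAMPLE_SOCKET_REUSEADDR` and `EXAMPLE_SOCKET_LINGER` below are placeholders, and the helpers `envInt` and `applySocketOptions` are hypothetical.

```cpp
#include <cstdlib>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: read an integer env var, falling back to a default.
static int envInt(const char* name, int fallback) {
  const char* v = std::getenv(name);
  return v ? std::atoi(v) : fallback;
}

// Applies SO_REUSEADDR and SO_LINGER to fd according to two placeholder
// env vars. Returns 0 on success, -1 if a setsockopt call fails.
int applySocketOptions(int fd) {
  const int reuse = envInt("EXAMPLE_SOCKET_REUSEADDR", 0);
  // Held as int from the start; a negative value means "leave the default".
  const int lingerSecs = envInt("EXAMPLE_SOCKET_LINGER", -1);

  if (reuse != 0) {
    const int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) != 0) return -1;
  }
  if (lingerSecs >= 0) {
    struct linger lg;
    lg.l_onoff = 1;            // enable lingering on close
    lg.l_linger = lingerSecs;  // timeout in seconds
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) != 0) return -1;
  }
  return 0;
}
```

A caller would invoke `applySocketOptions(fd)` right after `socket()` and before `bind()` or `connect()`.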
* ext-src: add MSCCLPP memory registration APIs
* update mem-reg patch with mscclpp helper routine to check if buffer is registered
* RCCL integration of MSCCL++ user-buffer registration APIs
* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined
* ext-src: update mscclpp mem-reg patch
* add helper routine to patch
* check handle before MSCCL++ deregister
* fix typo to replace send buff with recv buff
* if there is no mscclpp registration, don't fall back to rccl deRegister during the deRegister call, since it will return an error
* Apply suggestions from code review
Whitespace suggestions and reducing diffs to avoid future merge conflicts
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
* rename helper functions and change their return type
* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory
---------
Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
[ROCm/rccl commit: e9b6bbca8a]
It seems the code here is meant to check xgmi_node instead. If it checks the node for "nvlink", it will verify the link_info every time.
If it checks the node for "xgmi" and gets a positive answer, it won't need to check the vsmi topo interface.
[ROCm/rccl commit: f2ee8d9132]
* Switched calls to `cudaMemcpyAsync` to be `cudaMemcpy` in `ncclTransportP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.
* Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.
[ROCm/rccl commit: c158d3a9b4]
* Initialize all ranks to the same value to avoid failures in the AllReduce unit test for the FP8 type
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
[ROCm/rccl commit: 39483c55f8]
Certain CMake functions deduplicate arguments by default. For example, if we
have two `target_link_options` calls, one with `-Xoffload-linker -opt-A` and
then one with `-Xoffload-linker -opt-B`, the final link command would be
`-Xoffload-linker -opt-A -opt-B`, which is not what we want.
[ROCm/rccl commit: 7386fac64a]
* Add RCCL debugging guide
* Changes from external review
* More edits from internal review
* Additional edits
* Minor correction
* More changes after external review
* Integrate index and ToC changes with incoming merge changes
* Integrate feedback from management review
* Minor edits from the internal review
[ROCm/rccl commit: 6d34fb7632]