* Fix collective trace
* Use nontemporal for st_global
* Fix previous commit
* Add HDP flush to data receive path
* Fix previous commit
* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH
* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH
Both are on by default. Turning both off skips all flushes and will likely
result in data errors.
* Enable GDR copy by default
* Remove GDR flush env var because it is disabled by GDC flush
* Output kernel collective trace at comm destroy by default
* Limit kernel timeout messages to 100
* Use system relaxed atomic for loadInt
* Refine timeout messages and use atomic for setting offset from CPU
* Add kernel trace for barrier timeout
* Add backup barrier to avoid race in atomicAdd
* Use different counters for different warps
* Rework barrier implementation
* Fix for other GFX
* Use __hip_atomic_store and __hip_atomic_load
* Fix bug in previous commit
* Don't reset barrier values in running kernel
* Update trace format
* Fix typo
* Switch back to hip_atomic_fetch_add
* Use same barrier implementation for all GFX
* Remove extra threadfence
* Turn off HDP flush by default
Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush
* Remove unnecessary changes from alternative barrier implementation
* Added back __threadfence_block
* Revert back to threadfence for gfx other than gfx94x
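
The flush toggle described above (off by default, switched on with `RCCL_NET_HDP_FLUSH=1`) can be pictured with a small host-side sketch. This is an illustration only, not RCCL's parameter code: the helper names `envIntOr` and `hdpFlushEnabled` are hypothetical, and the only name taken from the log is the environment variable itself.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not RCCL's actual parser): read an integer-valued
// environment variable, falling back to `defaultValue` when the variable
// is unset, empty, or not a valid base-10 integer.
inline long envIntOr(const char* name, long defaultValue) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return defaultValue;
  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  return (*end == '\0') ? value : defaultValue;
}

// Assumed semantics: HDP flush stays off unless RCCL_NET_HDP_FLUSH is set
// to a non-zero value, matching the "off by default" entry in the log.
inline bool hdpFlushEnabled() {
  return envIntOr("RCCL_NET_HDP_FLUSH", /*defaultValue=*/0) != 0;
}
```

With this sketch, leaving the variable unset or setting it to `0` yields `false`, and `RCCL_NET_HDP_FLUSH=1` yields `true`.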
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.
* Cast the linger value to `int` earlier so that its type is never ambiguous.
* Added CHANGELOG entry for this feature.
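
As a rough sketch of what environment-controlled `SO_REUSEADDR` and `SO_LINGER` handling can look like on a POSIX system: the variable names `EXAMPLE_SOCKET_REUSEADDR` and `EXAMPLE_SOCKET_LINGER` and both helper functions below are hypothetical stand-ins, not the parameter names or code added by this change. The linger value is converted to `int` once, before it is used.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/socket.h>

// Hypothetical variable names, used only for this illustration.
constexpr const char* kReuseAddrEnv = "EXAMPLE_SOCKET_REUSEADDR";
constexpr const char* kLingerEnv = "EXAMPLE_SOCKET_LINGER";

// Returns the variable's integer value, or `fallback` if it is unset.
inline int envInt(const char* name, int fallback) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  return raw ? std::atoi(raw) : fallback;
}

// Applies SO_REUSEADDR and SO_LINGER to `fd` according to the environment.
// A negative linger value leaves SO_LINGER at the system default.
// Returns 0 on success, -1 if any setsockopt call fails.
inline int applySocketOptions(int fd) {
  const int reuse = (envInt(kReuseAddrEnv, 0) != 0) ? 1 : 0;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse)) != 0) {
    return -1;
  }

  const int lingerSeconds = envInt(kLingerEnv, -1);  // already an int here
  if (lingerSeconds >= 0) {
    struct linger lg;
    lg.l_onoff = 1;               // enable lingering on close
    lg.l_linger = lingerSeconds;  // timeout in seconds
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) != 0) {
      return -1;
    }
  }
  return 0;
}
```

A caller would invoke `applySocketOptions(fd)` right after `socket()` and before `bind()` or `connect()`.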
* ext-src: add MSCCLPP memory registration APIs
* update mem-reg patch with mscclpp helper routine to check if buffer is registered
* RCCL integration of MSCCL++ user-buffer registration APIs
* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined
* ext-src: update mscclpp mem-reg patch
* add helper routine to patch
* check handle before MSCCL++ deregister
* fix typo to replace send buff with recv buff
* in case of no mscclpp registration, during deRegister call, don't fall back to rccl deRegister which will return an error
* Apply suggestions from code review
Whitespace suggestions and reducing diffs to avoid future merge conflicts
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
* rename helper functions and change their return type
* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory
---------
Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
It seems the code here is meant to check xgmi_node instead. If it checks the node for "nvlink", it verifies the link_info every time.
If it checks the node for "xgmi" and gets a positive answer, it does not need to check the vsmi topo interface.
* Switched calls from `cudaMemcpyAsync` to `cudaMemcpy` in `ncclTransportP2pSetup` to avoid a race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.
* Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.
Certain CMake functions deduplicate arguments by default. For example, if we
have two `target_link_options` calls, one with `-Xoffload-linker -opt-A` and
one with `-Xoffload-linker -opt-B`, the final link command contains
`-Xoffload-linker -opt-A -opt-B`, which is not what we want.
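
The effect can be reproduced with a simplified model of de-duplication, written in C++ purely for illustration. The functions `dedupKeepFirst`, `groupPairs`, and `join` are hypothetical and are not CMake code; they assume that de-duplication keeps the first occurrence of each argument. Grouping each flag with its value into one token is the idea behind CMake's documented `SHELL:` prefix; whether that is the fix used in this commit is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Simplified model of option de-duplication: keep the first occurrence of
// each argument and drop later repeats.
inline std::vector<std::string> dedupKeepFirst(const std::vector<std::string>& args) {
  std::unordered_set<std::string> seen;
  std::vector<std::string> out;
  for (const auto& arg : args) {
    if (seen.insert(arg).second) out.push_back(arg);
  }
  return out;
}

// Groups consecutive (flag, value) pairs into single tokens, so a repeated
// flag is no longer a duplicate on its own.
inline std::vector<std::string> groupPairs(const std::vector<std::string>& args) {
  assert(args.size() % 2 == 0 && "expects flag/value pairs");
  std::vector<std::string> out;
  for (std::size_t i = 0; i < args.size(); i += 2) {
    out.push_back(args[i] + " " + args[i + 1]);
  }
  return out;
}

// Joins arguments with single spaces to form a command-line fragment.
inline std::string join(const std::vector<std::string>& args) {
  std::string out;
  for (const auto& arg : args) {
    if (!out.empty()) out += ' ';
    out += arg;
  }
  return out;
}
```

In this model, de-duplicating the four separate arguments drops the second `-Xoffload-linker`, while de-duplicating the grouped pairs keeps both flags next to their values.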
* Add RCCL debugging guide
* Changes from external review
* More edits from internal review
* Additional edits
* Minor correction
* More changes after external review
* Integrate index and ToC changes with incoming merge changes
* Integrate feedback from management review
* Minor edits from the internal review
* Add Topologies for 16-GPU gfx942 SuperNode
- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
* Fix bug w/ 1H16P