It looks like this code should check for an "xgmi" node instead. If it checks the node for "nvlink", it verifies link_info every time.
If it checks the node for "xgmi", a positive match means it no longer needs to query the vsmi topo interface.
* Add Topologies for 16-GPU gfx942 SuperNode
- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
* Fix bug w/ 1H16P
* Changing C-strings to be const.
* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.
* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.
* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.
* Fixed VLA in rccl UT.
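
The VLA and `#define` changes above can be illustrated with a minimal sketch. The function `sumOfFirst` and the constant `kScale` are hypothetical names chosen for this example; they are not taken from the RCCL sources.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of RCCL: sums 1..n and scales the result.
int sumOfFirst(int n) {
  assert(n >= 0);
  // Before: `#define SCALE 2` inside the function stayed visible after it
  // and could clash with a later definition. constexpr keeps it block-scoped.
  constexpr int kScale = 2;
  // Before: `int values[n];` is a variable-length array, which C++ compilers
  // accept only as an extension. std::vector is the portable replacement.
  std::vector<int> values(n);
  std::iota(values.begin(), values.end(), 1);  // fills 1, 2, ..., n
  return kScale * std::accumulate(values.begin(), values.end(), 0);
}
```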
* Add another rome model and override
* Fix bug
* Fix typo
* Add ring
* Update ring
* Fix model matching
* Clean up
* Clean up
* Reverse rings for NCCL_RINGS input
* Only reverse NCCL_RINGS for ring graph
* Fix mapping issue when using NCCL_RINGS
* Add NCCL_RINGS_REMAP to handle inconsistent net names
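
One plausible reading of the ring reversal above, as a rough sketch only. It assumes that a ring list is a string with rings separated by `|` and entries separated by spaces, and that "reverse" means reversing the order of entries within each ring. The helper `reverseRings` is hypothetical and is not the RCCL implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reverses the entry order inside each ring.
// Assumed format: rings separated by '|', entries separated by spaces.
std::string reverseRings(const std::string& rings) {
  std::string out;
  std::istringstream ringStream(rings);
  std::string ring;
  bool firstRing = true;
  while (std::getline(ringStream, ring, '|')) {
    // Split one ring into whitespace-separated entries.
    std::istringstream tokenStream(ring);
    std::vector<std::string> tokens;
    std::string token;
    while (tokenStream >> token) tokens.push_back(token);
    std::reverse(tokens.begin(), tokens.end());
    // Re-join the entries, keeping the original ring order.
    if (!firstRing) out += '|';
    firstRing = false;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
      if (i > 0) out += ' ';
      out += tokens[i];
    }
  }
  return out;
}
```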
* Template unroll for RCCL kernels
* Adding unroll template arg during CMake hipification
* Reduce linking parallel jobs to avoid OOM in CI
* Work around issues with unit tests
SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking
* CI: do not use -j 16 when building
* CI: use -j 8 when building
* Only reduce parallel linking job for CI extended
* Restore original jenkins command. Change parallel linking jobs in cmake
* Disable MSCCLPP
---------
Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"
This reverts commit 5be3b713ef.
* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"
This reverts commit ad31d93f3d.
* [GRAPH] Use channel shuffling only for IB systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [GRAPH] Define channels=48 for gfx94 RoCE systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [GRAPH] Increase channels for RoCE gfx94 systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Rework core for NVIDIA Trusted Computing
* Compress work structs so that they are shared between channels
* Utilize the full amount of kernel argument space permitted (4k)
before resorting to work fifo.
* Rework the task preprocessing phase.
* Use a separate abortDevFlag which is kept in sync with abortFlag
using cudaMemcpy operations.
* Rename src/include/align.h to src/include/bitops.h
Add lazy connection establishment for collective operations
* Move buffer allocation and connection establishment to the first
collective operation using that algorithm.
* Accelerate init time and reduce memory usage.
* Avoid allocating NVLS buffers if all calls are registered.
* Compute algo/proto in ncclLaunchCollTasksInfo early on.
* Connect peers in ncclCollPreconnectFunc if not connected already.
* Also move shared buffer creation to the first send/recv call.
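
The lazy connection idea above can be sketched as a connect-on-first-use guard. `LazyComm`, `kNumAlgos`, and `setupCount` are hypothetical names for illustration; the real NCCL code paths (ncclLaunchCollTasksInfo, ncclCollPreconnectFunc) are considerably more involved.

```cpp
#include <array>
#include <cassert>

constexpr int kNumAlgos = 3;  // hypothetical number of algorithms

// Hypothetical stand-in for a communicator that connects lazily.
struct LazyComm {
  std::array<bool, kNumAlgos> connected{};  // all false at init
  int setupCount = 0;                       // counts expensive setups performed

  // Called at the start of every collective. Buffers and connections for an
  // algorithm are set up only the first time that algorithm is used.
  void ensureConnected(int algo) {
    assert(algo >= 0 && algo < kNumAlgos);
    if (connected[algo]) return;
    ++setupCount;  // stands in for buffer allocation and peer connection
    connected[algo] = true;
  }
};
```

An algorithm that is never used never pays the setup cost, which is where the init-time and memory savings come from.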
Accelerate intra-node NVLink detection
* Make each rank only detect NVLinks attached to its GPU.
* Fuse XMLs to reconstruct the full NVLink topology
Add init profiling to report time spent in different init phases.
* Report timings of bootstrap, allgather, search, connect, etc.
* Add new "PROFILE" category for NCCL_DEBUG_SUBSYS.
Add support for PCI p2p on split PCI switches
* Detect split PCI switches through a kernel module exposing
switch information.
* Update the topology XML and graph to add those inter-switch
connections.
Add cost estimation API
* Add a new ncclGroupEndSimulate primitive to return the estimated
time a group would take.
Net/IB: Add separate traffic class for fifo messages
* Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages
independently from NCCL_IB_TC.
Merges PR #1194
Net/IB: Add support for IB router
* Use flid instead of lid if subnets do not match
* Warn if flid is 0
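
The address selection described above can be sketched as follows. `RemotePort` and `pickDestinationLid` are hypothetical, simplified names; they are not the NCCL or libibverbs structures.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical, simplified view of a remote IB port.
struct RemotePort {
  std::uint64_t subnetPrefix;
  std::uint16_t lid;
  std::uint16_t flid;
};

// Same subnet: use the regular lid. Different subnets (traffic crosses an
// IB router): use the flid, and warn if it is 0.
std::uint16_t pickDestinationLid(std::uint64_t localSubnetPrefix,
                                 const RemotePort& remote) {
  if (remote.subnetPrefix == localSubnetPrefix) return remote.lid;
  if (remote.flid == 0) {
    std::fprintf(stderr, "warning: subnets differ but remote flid is 0\n");
  }
  return remote.flid;
}
```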
Optimizations and fixes for device network offload (unpack)
* Double the default number of channels
* Cache netDeviceType
* Fix save/increment head logic to enable Tree support.
Support ncclGroupStart/End for ncclCommAbort/Destroy
* Allow Abort/Destroy to be called within a group when managing
multiple GPUs with a single process.
Improve Tuner API
* Provide to the plugin the original cost table so that the plugin
can leave unknown or disabled algo/proto combinations untouched.
* Remove nvlsSupport and collnetSupport.
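
A rough sketch of how a plugin might use the cost table while leaving unknown or disabled combinations untouched. The table dimensions, the `kIgnore` sentinel, and `preferCombination` are assumptions made for this example; consult the tuner plugin header for the actual interface.

```cpp
#include <array>
#include <cassert>

constexpr int kNumAlgo = 2;       // hypothetical table dimensions
constexpr int kNumProto = 3;
constexpr float kIgnore = -1.0f;  // assumed sentinel for unknown/disabled entries

using CostTable = std::array<std::array<float, kNumProto>, kNumAlgo>;

// Marks (algo, proto) as preferred by setting its cost to 0, but only when
// the entry is in range and enabled. Disabled entries are left untouched.
bool preferCombination(CostTable& table, int algo, int proto) {
  if (algo < 0 || algo >= kNumAlgo || proto < 0 || proto >= kNumProto) return false;
  if (table[algo][proto] == kIgnore) return false;
  table[algo][proto] = 0.0f;
  return true;
}
```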
Do not print version to stdout when using a debug file
* Also print version from all processes with INFO debug level.
Fixes issue #1271
Fix clang warnings in NVTX headers
* Update NVTX headers to the latest version
Fixes issue #1270
Disable port fusion in heterogeneous systems
* Do not fuse ports if a mix of multi-port and single port are detected.
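
The heterogeneity check above can be sketched as below. `shouldFusePorts` and its input (one port count per NIC) are hypothetical; the actual detection logic in NCCL is not shown in this log.

```cpp
#include <cassert>
#include <vector>

// Hypothetical check: portsPerNic[i] is the number of ports on NIC i.
// Fusion is allowed only when every NIC is multi-port; a mix of multi-port
// and single-port NICs disables it.
bool shouldFusePorts(const std::vector<int>& portsPerNic) {
  bool hasSingle = false;
  bool hasMulti = false;
  for (int ports : portsPerNic) {
    assert(ports >= 1);
    if (ports > 1) hasMulti = true;
    else hasSingle = true;
  }
  return hasMulti && !hasSingle;
}
```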
Fix NVLS graphs search for dual NICs.
* Fix NVLS graph search when we have more than one NIC per GPU.
Fix crash with collnetDirect
* Add separate graph search for collnetDirect, testing alltoall paths
and working similarly to the NVLS search.
Fix hang when nodes have different CPU types
* Add the CPU type to the rank peer info.
* Align all ranks on the CPU type after the first allgather.
* Only use the aligned CPU type for all tuning operations.
Fixes issue #1136
Fixes issue #1184
Fix performance of registered send/recv operations
* Allow for single full size operations
* Add INFO to confirm the registration of send/recv buffers.
Move all sync ops to finalize stage
* Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has
been called.
Improve error reporting during SHM segment creation
Improve support of various compilers
Merges PR #1177
Merges PR #1228
Allow net and tuner plugins to be statically linked
* Search for ncclNet or ncclTuner symbols in the main binary.
Merges PR #979
Plugin examples includes cleanup
* Harmonize err.h and common.h usage.
* Add mixed plugin with both net and tuner.
* Add ring simple chunk size tuning
* Modify the tuning table to improve broadcast performance for 8MB to 32MB on single-node MI300X after ring simple chunk size tuning
* Modify the tuning table to improve reduce performance for 1MB to 4MB on single-node MI300X after ring simple chunk size tuning
---------
Co-authored-by: PedramAlizadeh <pmohamma@amd.com>