782 Commits

Author SHA1 Message Date
Arm Patinyasakdikul 29f87c7191 Increased maximum number of XML nodes to support CPX mode. (#1386) 2024-10-23 11:15:11 -05:00
Wenkai Du e0780ba4d4 Fix topology discovery in container with subset of GPUs (#1384)
* Fix topology discovery in container with subset of GPUs

* Move links counting out of loop
2024-10-22 13:50:23 -07:00
Bertan Dogancay 373f113524 Dynamically select unroll factor to build for when targeting local arch (#1371)
* Dynamically select unroll factor to build for when targeting local arch only
2024-10-21 10:53:11 -04:00
Wenkai Du 7c077db307 Increase CQ size to 3*MAX_REQUESTS (#1374)
* Increase CQ size to 3*MAX_REQUESTS

Suggested by Rukhsana Ansari <rukhsana.ansari@broadcom.com>

* Reword comments based on feedback from Rukhsana
2024-10-18 11:01:03 -07:00
akolliasAMD af5678641d added atomic acquire for gfx12 on prims_simple (#1382) 2024-10-18 11:26:38 -06:00
Wenkai Du c8d3543d3f Add back missing net flush (#1376) 2024-10-15 08:12:26 -07:00
Wenkai Du 821d2e1f30 Allow zero byte sendrecv in alltoallv (#1349)
* Allow zero byte sendrecv in alltoallv

* Fix previous merge error
2024-10-11 10:40:32 -07:00
Wenkai Du 5c367a21d0 Improve model matching for GPUs with alltoall XGMI connection (#1372) 2024-10-11 09:53:14 -07:00
Arm Patinyasakdikul 133ea201cf Increase default number of channels for MI300A in multi-node scenario. (#1366)
This commit changes the default number of channels for MI300A from 8 up to 24.
This helps bring up multi-node performance to the expected level.
2024-10-11 11:37:48 -05:00
Wenkai Du b55b6be0cb Fix crash when PXN is enabled on some platforms (#1369) 2024-10-11 09:02:59 -07:00
corey-derochie-amd c11f6b1531 Only set minNchannels if we are actually using MSCCL, checked using comm->mscclCompatible. (#1337) 2024-10-08 10:20:55 -06:00
akolliasAMD bc519fd733 disabled wbinvl1 for gfx9x on ll128 (#1365) 2024-10-08 08:43:29 -06:00
Nilesh M Negi 8ad76f8d10 [TRANSPORT] Add RCCL_FORCE_ENABLE_GDRDMA for debugging (#1356)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-10-06 18:43:49 -05:00
Bertan Dogancay 2dd10c8f17 [BUILD] Move code generation to python from CMake (#1360)
* Use generate.py for func generation

* Convert AddUnroll.cmake to bash
2024-10-03 10:21:19 -04:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Ziyue Yang 7830af5844 Fix size matching in MSCCL (#1318) 2024-10-01 13:32:41 -07:00
Mustafa Abduljabbar 03a3ef3c34 MSCCL Multithreaded regression root cause fix (#1347)
* Make sure the target device is used for MSCCL

* Enable single process mode by default to use MSCCL in MT

* Create a per-rank state when GPUs share a thread
2024-09-25 15:24:25 -04:00
Nilesh M Negi 105ff1611f [TRANSPORT] GDRDMA enablement for linux kernel 6.4.0 or newer (#1328)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-25 11:29:52 -05:00
Mustafa Abduljabbar 2fe1e9f7db Fix MSCCLPP seg-fault when RCCL_MSCCL_ENABLE_SINGLE_PROCESS is enabled (#1338)
Removing unnecessary changes.

rename unique hosts function

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

use updated function name

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

Missed one instance of `mscclIsMultithreadedComm`.
2024-09-20 11:22:05 -05:00
corey-derochie-amd 853a0586b4 Moved mscclpp_ncclGetUniqueId call into ncclCommInitRankFunc (#1332)
* Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid.

* Checking `mscclEnabled` for the process and the topology to gate MSCCL++.

* Allowed `mscclForceEnable` to enable MSCCL++.
2024-09-16 16:41:40 -06:00
corey-derochie-amd 736a705875 Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2024-09-11 09:55:16 -06:00
mberenjk 4ceb672179 replacing nccl/cuda related part of the api_trace.h with rccl/hip (#1326)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2024-09-10 11:05:14 -05:00
corey-derochie-amd e056fe8f7e Disable MSCCL for the non-multi-process case by default (#1307)
* Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread.

* Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.
2024-09-04 11:11:50 -06:00
Nusrat Islam 833435be18 graph: fix for MI300X 64 GPU case (#1308)
PR #1290 introduced a failure for 64 GPU case on MI300X. This PR
fixes the failure.
2024-08-26 18:37:58 -05:00
Wenkai Du 532b70afb6 Add new Rome model (#1304)
* Add another Rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
mberenjk db840f024e adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297)
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2024-08-22 12:36:07 -05:00
Wenkai Du d3171b51b7 Fix gfx940 CPX mode (#1290) 2024-08-16 08:46:06 +08:00
Wenkai Du eff56735b0 Fix model matching with PXN enable (#1295) 2024-08-16 06:16:00 +08:00
akolliasAMD d6c317d6ae removed hcc mentions (#1291) 2024-08-14 15:04:13 -06:00
Pedram Alizadeh a25ca9bb90 adding new tuning table for very large number of nodes (#1288) 2024-08-09 10:47:42 -04:00
Tim 4200964202 Adding core binding in info (#1212)
Signed-off-by: AtlantaPepsi <timhu102@amd.com>
2024-08-08 11:36:24 -04:00
Richard Barnes d09b152aa0 Remove unused but set variable from all_reduce.h (#1258)
Allows `-Wunused-but-set-variable` to pass
2024-07-29 08:11:24 -07:00
Richard Barnes 86a4ad6e8b Remove unused but set variable from prims_ll128.h (#1257)
Allows `-Wunused-but-set-variable` to pass
2024-07-29 08:11:01 -07:00
Richard Barnes 7ad432ee23 Remove unused but set variable from prims_ll.h (#1256)
Allows `-Wunused-but-set-variable` to pass
2024-07-29 08:10:38 -07:00
akolliasAMD c246e25f8e gfx12 Disable ll protocol (#1268) 2024-07-26 08:59:55 -06:00
corey-derochie-amd 69135976d6 Fix bug where the first collective call was using MSCCL instead of MSCCL++ (#1260) 2024-07-22 15:46:47 -06:00
saurabhAMD cf311b71ee Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing (#1265)
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing

* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
2024-07-22 10:21:29 -05:00
corey-derochie-amd b31b4082dd Only initialize MSCCL++ when runtime-enabled. (#1266) 2024-07-22 00:41:31 -06:00
Nusrat Islam 6f331b0d43 Enable CPX mode for MI300X (#1259)
* graph: enable cpx mode for MI300X

* graph: tune limits for cpx and cleanup
2024-07-19 11:30:37 -05:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
Nilesh M Negi a1ef217b32 Consistent channel shuffling for MI300X multi-node (#1255)
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"

This reverts commit 5be3b713ef.

* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"

This reverts commit ad31d93f3d.
2024-07-18 10:18:09 -05:00
Nilesh M Negi 67e867271f [GRAPH] Disable MSCCL override of no. of channels (#1187)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-15 10:45:21 -05:00
corey-derochie-amd 9cbb3da224 Only enable MSCCL++ AllReduce for message sizes that are multiples of 32 (#1253)
* Only enable MSCCL++ AllReduce for message sizes that are multiples of 32. MSCCL++ does not handle other sizes.

* Sanitized MSCCL++ logging.
2024-07-12 17:04:23 -07:00
corey-derochie-amd 6dc47eecd7 Integrated RCCL with MSCCL++ for small message sizes (#1231) 2024-07-12 15:32:58 -06:00
Rahul Vaidya c755b9cf93 Improved version reporting in NCCL_DEBUG=VERSION (#1232)
* Improved version reporting in NCCL_DEBUG=VERSION.

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Version reporting changes

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Versioning changes: Initialized char arrays to null and fixed typo.

---------

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
2024-07-12 08:14:29 -05:00
akolliasAMD 63e4d76e23 gfx12 initial enablement (#1219) 2024-07-10 13:32:09 -06:00
corey-derochie-amd 0c36d571ea Enable multi-threading for MSCCL (#1203)
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
2024-07-04 09:34:38 -06:00
Wenkai Du 45f3fbc52f Checking kernel header files only when missing sysfs entry (#1239) 2024-07-03 15:53:15 -07:00
Nilesh M Negi 5be3b713ef [GRAPH] Use channel shuffling only for IB systems (#1228)
* [GRAPH] Use channel shuffling only for IB systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Define channels=48 for gfx94 RoCE systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Increase channels for RoCE gfx94 systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-02 12:20:40 -05:00