Commit graph

1528 commits

Author SHA1 Message Date
Wenkai Du c8d3543d3f Add back missing net flush (#1376) 2024-10-15 08:12:26 -07:00
Wenkai Du 62d10fdc25 msccl: disable 1-shot xmls (#1375)
MSCCL 1-shot xmls may cause different output values on different ranks.
Disabling them for now to avoid undefined behavior in applications.
2024-10-14 15:10:53 -07:00
Wenkai Du a680e329e6 Temporarily disable MSCCL all gather XMLs due to UT failure (#1373) 2024-10-12 08:43:16 -07:00
Wenkai Du 821d2e1f30 Allow zero byte sendrecv in alltoallv (#1349)
* Allow zero byte sendrecv in alltoallv

* Fix previous merge error
2024-10-11 10:40:32 -07:00
Wenkai Du 5c367a21d0 Improve model matching for GPUs with alltoall XGMI connection (#1372) 2024-10-11 09:53:14 -07:00
Arm Patinyasakdikul 133ea201cf Increase default number of channels for MI300A in multi-node scenario. (#1366)
This commit raises the default number of channels for MI300A from 8 to 24.
This brings multi-node performance up to the expected level.
2024-10-11 11:37:48 -05:00
Wenkai Du b55b6be0cb Fix crash when PXN is enabled on some platforms (#1369) 2024-10-11 09:02:59 -07:00
Nusrat Islam 6160603d4c ext-src: Fix compiler warnings for MSCCLPP integration (#1368) 2024-10-10 08:20:02 -05:00
Nilesh M Negi 364a6c2130 [BUILD] Simplify CMake args for building MSCCLPP (#1363)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-10-09 23:52:04 -05:00
Nilesh M Negi 41a2c02773 [BUILD] Require use of Python3 interpreter (#1367)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-10-09 22:36:50 -05:00
Nusrat Islam 4d68751ce1 Add a custom allreduce algorithm in MSCCLPP for cpx mode (#1362)
* cmake: remove mscclpp patch after build is complete

To enable mscclpp in cpx mode, a patch, cpx.patch, needs to be applied.
The patch can be removed once the build is done, which helps the
build process the next time.

* Use read-based mscclpp allreduce from rccl

MSCCLPP by default uses remote write in the allreduce kernel for
large (> 1MB) messages. This PR adds an allreduce kernel that uses
remote read. Users must set the environment variable
MSCCLPP_READ_ALLRED=1 to use it.
2024-10-08 14:42:12 -05:00
corey-derochie-amd c11f6b1531 Only set minNchannels if we are actually using MSCCL, checked using comm->mscclCompatible. (#1337) 2024-10-08 10:20:55 -06:00
akolliasAMD bc519fd733 disabled wbinvl1 for gfx9x on ll128 (#1365) 2024-10-08 08:43:29 -06:00
Nilesh M Negi 8ad76f8d10 [TRANSPORT] Add RCCL_FORCE_ENABLE_GDRDMA for debugging (#1356)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-10-06 18:43:49 -05:00
akolliasAMD 7fb9189760 Regression timing fix (#1361)
* Removed testbed initialization on standalone tests
* .jenkins: re-enabled all tests
2024-10-03 10:41:26 -06:00
Bertan Dogancay 2dd10c8f17 [BUILD] Move code generation to python from CMake (#1360)
* Use generate.py for func generation

* Convert AddUnroll.cmake to bash
2024-10-03 10:21:19 -04:00
dependabot[bot] 038517b169 Bump rocm-docs-core from 1.7.2 to 1.8.2 in /docs/sphinx (#1348)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.7.2 to 1.8.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.8.2/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.7.2...v1.8.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-02 16:33:26 -06:00
Bertan Dogancay 833b185a2d Merge pull request #1358 from BertanDogancay/nccl-2.21-sync 2024-10-02 18:21:06 -04:00
Nusrat Islam d13f9c44f5 Enable MSCCLPP use in CPX mode (#1355)
This PR enables the use of MSCCLPP in CPX mode for 8 GPUs.
2024-10-02 11:52:04 -05:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Ziyue Yang 7830af5844 Fix size matching in MSCCL (#1318) 2024-10-01 13:32:41 -07:00
Nilesh M Negi 8b3ed8f104 [CI] Temporarily disable RCCL UT Standalone.RegressionTiming in CI (#1350)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-27 14:08:36 -05:00
corey-derochie-amd 7231808c58 Checkout submodules with shallow depth (#1353)
* Make submodules shallow

* Updated README for the shallow checkout changes.
2024-09-27 11:07:16 -06:00
spolifroni-amd 06a0ddb3b4 Merge pull request #1345 from ROCm/spolifroni-amd/update-changelog
Updated the 6.2.1 changelog so that it reflects what's in the 6.2.1 release notes
2024-09-27 10:15:30 -04:00
Mustafa Abduljabbar 03a3ef3c34 MSCCL Multithreaded regression root cause fix (#1347)
* Make sure the target device is used for MSCCL

* Enable single process mode by default to use MSCCL in MT

* Create a per-rank state when GPUs share a thread
2024-09-25 15:24:25 -04:00
Nilesh M Negi 105ff1611f [TRANSPORT] GDRDMA enablement for linux kernel 6.4.0 or newer (#1328)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-25 11:29:52 -05:00
Tim 40e93ebc29 Remove 0 size UBR (#1346)
ncclCommRegister, required for UBR, calls IB dmabuf regMr directly, which forbids zero-size messages
2024-09-24 18:16:51 -04:00
Nilesh M Negi 3c61e934f2 [BUILD] Enable MSCCL++ for gfx942 variants (#1344)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-23 19:05:49 -05:00
Sandra Polifroni 7f87b0cd85 Updated the information for 6.2.1 in the changelog so that it reflects what's in the 6.2.1 release notes 2024-09-23 14:27:58 -04:00
Nilesh M Negi 707377b3cd Add Dockerfile to build rccl and rccl-tests (#1011)
* [BUILD] Add Dockerfile for RCCL and RCCL-Tests

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Update docker/Dockerfile.ubuntu

Typo for LD_LIBRARY_PATH

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update docker/Dockerfile.ubuntu

use `-b` for `git clone` instead of additional `git checkout`

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update docker/Dockerfile.ubuntu

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2024-09-22 03:53:16 -05:00
Mustafa Abduljabbar 2fe1e9f7db Fix MSCCLPP seg-fault when RCCL_MSCCL_ENABLE_SINGLE_PROCESS is enabled (#1338)
Removing unnecessary changes.

rename unique hosts function

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

use updated function name

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

Missed one instance of `mscclIsMultithreadedComm`.
2024-09-20 11:22:05 -05:00
gilbertlee-amd 575afee5de Fixing install.sh to properly accept spaces in ONLY_FUNCS (#1339) 2024-09-18 17:25:36 -06:00
corey-derochie-amd 853a0586b4 Moved mscclpp_ncclGetUniqueId call into ncclCommInitRankFunc (#1332)
* Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid.

* Checking `mscclEnabled` for the process and the topology to gate MSCCL++.

* Allowed `mscclForceEnable` to enable MSCCL++.
2024-09-16 16:41:40 -06:00
dependabot[bot] ad94c651ad Bump rocm-docs-core from 1.7.1 to 1.7.2 in /docs/sphinx (#1306)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.7.1 to 1.7.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.7.1...v1.7.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 16:57:45 -06:00
Mustafa Abduljabbar 05c7b7e69b RCCL Tuner Plugin Docs 2024-09-12 13:43:45 -05:00
corey-derochie-amd b3b0ffdbf3 Added nlohmann/json:v3.11.3 as a submodule in ext-src and passed its path into the mscclpp build to avoid downloading the package at build time. (#1330) 2024-09-11 16:54:26 -06:00
corey-derochie-amd 736a705875 Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2024-09-11 09:55:16 -06:00
saurabhAMD 4856309413 Making variable names consistent in EnvVars.cpp (#1327)
* Making variable names consistent in EnvVars.cpp
2024-09-11 09:23:31 -05:00
mberenjk 4ceb672179 replacing nccl/cuda related part of the api_trace.h with rccl/hip (#1326)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2024-09-10 11:05:14 -05:00
saurabhAMD 289a80c4e9 Enabling Unit Tests for CPX mode (#1324)
* Unit Tests for RCCL in CPX mode

* Override the pow2gpus value set by cpx mode with the user argument

* Adding comment for UT_POW2_GPUS

* Additional comment on why using pow2gpus for cpx mode.
2024-09-09 10:12:33 -05:00
dependabot[bot] c85ac2bd1c Bump cryptography from 42.0.7 to 43.0.1 in /docs/sphinx (#1317)
Bumps [cryptography](https://github.com/pyca/cryptography) from 42.0.7 to 43.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/42.0.7...43.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-06 14:28:54 -06:00
Tim 8169cf1dfd Merge pull request #1320 from AtlantaPepsi/UT_cpx_hotfix
Temporary patch for unit tests in cpx mode
2024-09-06 12:07:03 -04:00
Ziyue Yang 8282baae7f Revise MSCCL link in README to Azure repo (#1311) 2024-09-05 17:10:49 -05:00
randyh62 4e2eeafdf6 Update README.md (#1321)
update note formatting
2024-09-05 14:23:36 -07:00
Nilesh M Negi d3012d3307 [BUILD] Support clang++ compiler (#1316)
* [BUILD] Support clang++ compiler

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Enable check_symbol_exists for BFD and clang++

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Define default C compiler

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-05 09:59:58 -05:00
randyh62 391c7ea070 what-is-rccl (#1312)
* what-is-rccl

* create Installation instructions from README

* update README link

* Add using-nccl

* Add note about docs

* correct doc path

* sources to source

* correct docs link
2024-09-05 06:54:48 -07:00
Tim 757d1891e9 Update EnvVars.cpp 2024-09-04 16:55:36 -04:00
corey-derochie-amd e056fe8f7e Disable MSCCL for the non-multi-process case by default (#1307)
* Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread.

* Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.
2024-09-04 11:11:50 -06:00
Wenkai Du 1a48e19b18 topo_expl: update sm fields in topo xml files (#1310) 2024-08-29 12:03:51 -07:00