Commit graph

1513 commits

Author SHA1 Message Date
Bertan Dogancay 2dd10c8f17 [BUILD] Move code generation to python from CMake (#1360)
* Use generate.py for func generation

* Convert AddUnroll.cmake to bash
2024-10-03 10:21:19 -04:00
dependabot[bot] 038517b169 Bump rocm-docs-core from 1.7.2 to 1.8.2 in /docs/sphinx (#1348)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.7.2 to 1.8.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.8.2/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.7.2...v1.8.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-02 16:33:26 -06:00
Bertan Dogancay 833b185a2d Merge pull request #1358 from BertanDogancay/nccl-2.21-sync 2024-10-02 18:21:06 -04:00
Nusrat Islam d13f9c44f5 Enable MSCCLPP use in CPX mode (#1355)
This PR enables the use of MSCCLPP in CPX mode for 8 GPUs.
2024-10-02 11:52:04 -05:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Ziyue Yang 7830af5844 Fix size matching in MSCCL (#1318) 2024-10-01 13:32:41 -07:00
Nilesh M Negi 8b3ed8f104 [CI] Temporarily disable RCCL UT Standalone.RegressionTiming in CI (#1350)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-27 14:08:36 -05:00
corey-derochie-amd 7231808c58 Checkout submodules with shallow depth (#1353)
* Make submodules shallow

* Updated README for the shallow checkout changes.
2024-09-27 11:07:16 -06:00
spolifroni-amd 06a0ddb3b4 Merge pull request #1345 from ROCm/spolifroni-amd/update-changelog
Updated 6.2.1 changelog so that it reflects what's in the 6.2.1 RN
2024-09-27 10:15:30 -04:00
Mustafa Abduljabbar 03a3ef3c34 MSCCL Multithreaded regression root cause fix (#1347)
* Make sure the target device is used for MSCCL

* Enable single process mode by default to use MSCCL in MT

* Create a per-rank state when GPUs share a thread
2024-09-25 15:24:25 -04:00
Nilesh M Negi 105ff1611f [TRANSPORT] GDRDMA enablement for linux kernel 6.4.0 or newer (#1328)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-25 11:29:52 -05:00
Tim 40e93ebc29 Remove 0 size UBR (#1346)
ncclCommRegister, required for UBR, will call IB dmabuf regMr directly which forbids 0 size message
2024-09-24 18:16:51 -04:00
Nilesh M Negi 3c61e934f2 [BUILD] Enable MSCCL++ for gfx942 variants (#1344)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-23 19:05:49 -05:00
Sandra Polifroni 7f87b0cd85 Updated the information for 6.2.1 in the changelog so that it reflects what's in the 6.2.1 release notes 2024-09-23 14:27:58 -04:00
Nilesh M Negi 707377b3cd Add Dockerfile to build rccl and rccl-tests (#1011)
* [BUILD] Add Dockerfile for RCCL and RCCL-Tests

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Update docker/Dockerfile.ubuntu

Typo for LD_LIBRARY_PATH

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update docker/Dockerfile.ubuntu

use `-b` for `git clone` instead of additional `git checkout`

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update docker/Dockerfile.ubuntu

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2024-09-22 03:53:16 -05:00
Mustafa Abduljabbar 2fe1e9f7db Fix MSCCLPP seg-fault when RCCL_MSCCL_ENABLE_SINGLE_PROCESS is enabled (#1338)
Removing unnecessary changes.

rename unique hosts function

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

use updated function name

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

Missed one instance of `mscclIsMultithreadedComm`.
2024-09-20 11:22:05 -05:00
gilbertlee-amd 575afee5de Fixing install.sh to properly accept spaces in ONLY_FUNCS (#1339) 2024-09-18 17:25:36 -06:00
corey-derochie-amd 853a0586b4 Moved mscclpp_ncclGetUniqueId call into ncclCommInitRankFunc (#1332)
* Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid.

* Checking `mscclEnabled` for the process and the topology to gate MSCCL++.

* Allowed `mscclForceEnable` to enable MSCCL++.
2024-09-16 16:41:40 -06:00
dependabot[bot] ad94c651ad Bump rocm-docs-core from 1.7.1 to 1.7.2 in /docs/sphinx (#1306)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.7.1 to 1.7.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.7.1...v1.7.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 16:57:45 -06:00
Mustafa Abduljabbar 05c7b7e69b RCCL Tuner Plugin Docs 2024-09-12 13:43:45 -05:00
corey-derochie-amd b3b0ffdbf3 Added nlohmann/json:v3.11.3 as a submodule in ext-src and passed its path into the mscclpp build to avoid downloading the package at build time. (#1330) 2024-09-11 16:54:26 -06:00
corey-derochie-amd 736a705875 Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2024-09-11 09:55:16 -06:00
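
The two gating rules described in the "Re-enabled MSCCL++ (#1325)" entry above can be modeled roughly as follows. This is only an illustrative Python sketch, not RCCL code: the function names `mscclpp_runtime_enabled` and `allgather_size_ok` are hypothetical, the build-time `ENABLE_MSCCLPP` definition is represented by a plain boolean, and the sketch assumes that MSCCL++ stays off when the environment variable requests it without build support (the commit message only says a warning is logged). Whether the 32-byte rule applies to the per-rank or the total message size is not stated in the log, so the sketch simply takes a byte count.

```python
import os
import warnings


def mscclpp_runtime_enabled(built_with_mscclpp, env=None):
    """Model the RCCL_MSCCLPP_ENABLE default described in the commit message.

    built_with_mscclpp stands in for the build-time ENABLE_MSCCLPP definition.
    """
    env = os.environ if env is None else env
    # Default is 1 when ENABLE_MSCCLPP is defined at build time, 0 otherwise.
    default = "1" if built_with_mscclpp else "0"
    requested = env.get("RCCL_MSCCLPP_ENABLE", default) == "1"
    if requested and not built_with_mscclpp:
        # The log mentions a warning here; returning False is an assumption.
        warnings.warn(
            "RCCL_MSCCLPP_ENABLE=1 but ENABLE_MSCCLPP was not defined at build time"
        )
        return False
    return requested


def allgather_size_ok(nbytes):
    """MSCCL++ AllGather restriction: non-zero multiple of 32 bytes."""
    return nbytes > 0 and nbytes % 32 == 0
```

For example, `allgather_size_ok(64)` passes while `allgather_size_ok(0)` and `allgather_size_ok(48)` do not, which in this model would mean those calls take the regular path instead of MSCCL++.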
saurabhAMD 4856309413 Making variable names consistent in EnvVars.cpp (#1327)
* Making variable names consistent in EnvVars.cpp
2024-09-11 09:23:31 -05:00
mberenjk 4ceb672179 replacing nccl/cuda related part of the api_trace.h with rccl/hip (#1326)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2024-09-10 11:05:14 -05:00
saurabhAMD 289a80c4e9 Enabling Unit Tests for CPX mode (#1324)
* Unit Tests for RCCL in CPX mode

* override pow2gpus set by cpx mode by user argument

* Adding comment for UT_POW2_GPUS

* Additional comment on why using pow2gpus for cpx mode.
2024-09-09 10:12:33 -05:00
dependabot[bot] c85ac2bd1c Bump cryptography from 42.0.7 to 43.0.1 in /docs/sphinx (#1317)
Bumps [cryptography](https://github.com/pyca/cryptography) from 42.0.7 to 43.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/42.0.7...43.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-06 14:28:54 -06:00
Tim 8169cf1dfd Merge pull request #1320 from AtlantaPepsi/UT_cpx_hotfix
Temporary patch for unit tests in cpx mode
2024-09-06 12:07:03 -04:00
Ziyue Yang 8282baae7f Revise MSCCL link in README to Azure repo (#1311) 2024-09-05 17:10:49 -05:00
randyh62 4e2eeafdf6 Update README.md (#1321)
update note formatting
2024-09-05 14:23:36 -07:00
Nilesh M Negi d3012d3307 [BUILD] Support clang++ compiler (#1316)
* [BUILD] Support clang++ compiler

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Enable check_symbol_exists for BFD and clang++

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Define default C compiler

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-09-05 09:59:58 -05:00
randyh62 391c7ea070 what-is-rccl (#1312)
* what-is-rccl

* create Installation instructions from README

* update README link

* Add using-nccl

* Add note about docs

* correct doc path

* sources to source

* correct docs link
2024-09-05 06:54:48 -07:00
Tim 757d1891e9 Update EnvVars.cpp 2024-09-04 16:55:36 -04:00
corey-derochie-amd e056fe8f7e Disable MSCCL for the non-multi-process case by default (#1307)
* Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread.

* Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.
2024-09-04 11:11:50 -06:00
Wenkai Du 1a48e19b18 topo_expl: update sm fields in topo xml files (#1310) 2024-08-29 12:03:51 -07:00
Nusrat Islam 833435be18 graph: fix for MI300X 64 GPU case (#1308)
PR #1290 introduced a failure for 64 GPU case on MI300X. This PR
fixes the failure.
2024-08-26 18:37:58 -05:00
Nilesh M Negi 607e34dd99 [BUILD] Enable RCCL build with amdclang++ (#1128)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-25 13:44:22 -04:00
Edgar Gabriel bba3559334 Merge pull request #1299 from edgargabriel/topic/remove-multirank-examples
Remove MultiRank examples
2024-08-23 08:32:16 -05:00
Wenkai Du 532b70afb6 Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
mberenjk db840f024e adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297)
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2024-08-22 12:36:07 -05:00
dependabot[bot] 93b0c7418f Bump rocm-docs-core from 1.6.2 to 1.7.1 in /docs/sphinx (#1305)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.6.2 to 1.7.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.6.2...v1.7.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-20 13:04:36 -06:00
Wenkai Du d3171b51b7 Fix gfx940 CPX mode (#1290) 2024-08-16 08:46:06 +08:00
Wenkai Du eff56735b0 Fix model matching with PXN enable (#1295) 2024-08-16 06:16:00 +08:00
Edgar Gabriel 8953a26bcd Remove MultiRank examples
Remove the MultiRank examples. The feature was never released (because
it didn't work reliably), and it might just cause confusion if somebody
sees it. In addition, the location in tools was suboptimal.
2024-08-14 14:11:16 -07:00
akolliasAMD d6c317d6ae removed hcc mentions (#1291) 2024-08-14 15:04:13 -06:00
dependabot[bot] dfd5106a4b Bump rocm-docs-core from 1.5.0 to 1.6.2 in /docs/sphinx (#1287)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.5.0 to 1.6.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.5.0...v1.6.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-09 11:04:57 -06:00
Pedram Alizadeh a25ca9bb90 adding new tuning table for very large number of nodes (#1288) 2024-08-09 10:47:42 -04:00
Tim 4200964202 Adding core binding in info (#1212)
Signed-off-by: AtlantaPepsi <timhu102@amd.com>
2024-08-08 11:36:24 -04:00
Nilesh M Negi a2474846f5 [README] Tips on using less than 8 MI300 GPUs (#1270)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-06 11:12:09 -05:00
Nilesh M Negi 4f31ab85ea [BUILD] Update gfxTargets for ASAN build (#1242)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-06 10:53:51 -05:00