Commit graph

1374 commits

Author SHA1 Message Date
corey-derochie-amd c8f4dedfd1 Added nlohmann/json:v3.11.3 as a submodule in ext-src and passed its path into the mscclpp build to avoid downloading the package at build time. (#1330)
[ROCm/rccl commit: b3b0ffdbf3]
2024-09-11 16:54:26 -06:00
corey-derochie-amd 9ffd893c5a Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 736a705875]
2024-09-11 09:55:16 -06:00
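
The restrictions listed in the "Re-enabled MSCCL++ (#1325)" entry above can be pictured with a small sketch. This is not RCCL's code. It only restates two rules from the commit message: the MSCCL++ AllGather path requires a non-zero message size that is a multiple of 32 bytes, and `RCCL_MSCCLPP_ENABLE` defaults to 1 only when the library was built with `ENABLE_MSCCLPP`. The function names, the `built_with_mscclpp` parameter, and the choice to fall back to "disabled" after the warning are assumptions made for illustration.

```python
from typing import Mapping, Optional, Tuple

# Taken from the commit message: "non-zero 32-byte multiple message sizes".
ALLGATHER_ALIGNMENT_BYTES = 32


def allgather_size_ok(nbytes: int) -> bool:
    """Hypothetical helper: True for a non-zero multiple of 32 bytes."""
    return nbytes > 0 and nbytes % ALLGATHER_ALIGNMENT_BYTES == 0


def resolve_mscclpp_enable(
    env: Mapping[str, str], built_with_mscclpp: bool
) -> Tuple[bool, Optional[str]]:
    """Hypothetical helper: return (enabled, warning_or_None).

    built_with_mscclpp stands in for the ENABLE_MSCCLPP build-time define.
    """
    raw = env.get("RCCL_MSCCLPP_ENABLE")
    # Default: 1 if the build defines ENABLE_MSCCLPP, otherwise 0.
    requested = built_with_mscclpp if raw is None else raw.strip() == "1"
    if requested and not built_with_mscclpp:
        # The commit adds a log warning for this case. Treating the feature
        # as disabled afterwards is an assumption of this sketch.
        return False, "RCCL_MSCCLPP_ENABLE=1 but ENABLE_MSCCLPP is not defined"
    return requested, None
```

For example, `resolve_mscclpp_enable({}, True)` reports the feature as enabled with no warning, while setting the variable to `1` on a build without `ENABLE_MSCCLPP` yields a warning string.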
saurabhAMD e3b39ab309 Making variable names consistent in EnvVars.cpp (#1327)
* Making variable names consistent in EnvVars.cpp

[ROCm/rccl commit: 4856309413]
2024-09-11 09:23:31 -05:00
mberenjk 78e0b3fe9e replacing nccl/cuda related part of the api_trace.h with rccl/hip (#1326)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 4ceb672179]
2024-09-10 11:05:14 -05:00
saurabhAMD fdaef9dd82 Enabling Unit Tests for CPX mode (#1324)
* Unit Tests for RCCL in CPX mode

* override pow2gpus set by cpx mode by user argument

* Adding comment for UT_POW2_GPUS

* Additional comment on why using pow2gpus for cpx mode.

[ROCm/rccl commit: 289a80c4e9]
2024-09-09 10:12:33 -05:00
dependabot[bot] 7873e551b1 Bump cryptography from 42.0.7 to 43.0.1 in /docs/sphinx (#1317)
Bumps [cryptography](https://github.com/pyca/cryptography) from 42.0.7 to 43.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/42.0.7...43.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: c85ac2bd1c]
2024-09-06 14:28:54 -06:00
Tim 144a54f178 Merge pull request #1320 from AtlantaPepsi/UT_cpx_hotfix
Temporary patch for unit tests in cpx mode

[ROCm/rccl commit: 8169cf1dfd]
2024-09-06 12:07:03 -04:00
Ziyue Yang 7830806b4b Revise MSCCL link in README to Azure repo (#1311)
[ROCm/rccl commit: 8282baae7f]
2024-09-05 17:10:49 -05:00
randyh62 e2d093cc3a Update README.md (#1321)
update note formatting

[ROCm/rccl commit: 4e2eeafdf6]
2024-09-05 14:23:36 -07:00
Nilesh M Negi 9018e2f466 [BUILD] Support clang++ compiler (#1316)
* [BUILD] Support clang++ compiler

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Enable check_symbol_exists for BFD and clang++

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Define default C compiler

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: d3012d3307]
2024-09-05 09:59:58 -05:00
randyh62 0f98c58804 what-is-rccl (#1312)
* what-is-rccl

* create Installation instructions from README

* update README link

* Add using-nccl

* Add note about docs

* correct doc path

* sources to source

* correct docs link

[ROCm/rccl commit: 391c7ea070]
2024-09-05 06:54:48 -07:00
Tim 1bd3db8fc7 Update EnvVars.cpp
[ROCm/rccl commit: 757d1891e9]
2024-09-04 16:55:36 -04:00
corey-derochie-amd dc04844405 Disable MSCCL for the non-multi-process case by default (#1307)
* Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread.

* Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.

[ROCm/rccl commit: e056fe8f7e]
2024-09-04 11:11:50 -06:00
Wenkai Du 27e0569eed topo_expl: update sm fields in topo xml files (#1310)
[ROCm/rccl commit: 1a48e19b18]
2024-08-29 12:03:51 -07:00
Nusrat Islam 7676d49260 graph: fix for MI300X 64 GPU case (#1308)
PR #1290 introduced a failure for 64 GPU case on MI300X. This PR
fixes the failure.

[ROCm/rccl commit: 833435be18]
2024-08-26 18:37:58 -05:00
Nilesh M Negi 7c4389ee31 [BUILD] Enable RCCL build with amdclang++ (#1128)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 607e34dd99]
2024-08-25 13:44:22 -04:00
Edgar Gabriel 5368ea024a Merge pull request #1299 from edgargabriel/topic/remove-multirank-examples
Remove MultiRank examples

[ROCm/rccl commit: bba3559334]
2024-08-23 08:32:16 -05:00
Wenkai Du 157cc5f6ba Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names

[ROCm/rccl commit: 532b70afb6]
2024-08-23 08:45:43 +08:00
mberenjk 886b576722 adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297)
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>

[ROCm/rccl commit: db840f024e]
2024-08-22 12:36:07 -05:00
dependabot[bot] 03e6755cf0 Bump rocm-docs-core from 1.6.2 to 1.7.1 in /docs/sphinx (#1305)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.6.2 to 1.7.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.6.2...v1.7.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 93b0c7418f]
2024-08-20 13:04:36 -06:00
Wenkai Du d5a33e7107 Fix gfx940 CPX mode (#1290)
[ROCm/rccl commit: d3171b51b7]
2024-08-16 08:46:06 +08:00
Wenkai Du e770054b7f Fix model matching with PXN enable (#1295)
[ROCm/rccl commit: eff56735b0]
2024-08-16 06:16:00 +08:00
Edgar Gabriel 3d44929d08 Remove MultiRank examples
Remove the MultiRank examples. The feature was never released (because
it didn't work reliably), and it might just cause confusion if somebody
sees it. In addition, the location in tools was suboptimal.


[ROCm/rccl commit: 8953a26bcd]
2024-08-14 14:11:16 -07:00
akolliasAMD 38e189bb1e removed hcc mentions (#1291)
[ROCm/rccl commit: d6c317d6ae]
2024-08-14 15:04:13 -06:00
dependabot[bot] 09fe7f9b97 Bump rocm-docs-core from 1.5.0 to 1.6.2 in /docs/sphinx (#1287)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.5.0 to 1.6.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.5.0...v1.6.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: dfd5106a4b]
2024-08-09 11:04:57 -06:00
Pedram Alizadeh e40a0939f1 adding new tuning table for very large number of nodes (#1288)
[ROCm/rccl commit: a25ca9bb90]
2024-08-09 10:47:42 -04:00
Tim 9fdecceefb Adding core binding in info (#1212)
Signed-off-by: AtlantaPepsi <timhu102@amd.com>

[ROCm/rccl commit: 4200964202]
2024-08-08 11:36:24 -04:00
Nilesh M Negi 3e52d15ced [README] Tips on using less than 8 MI300 GPUs (#1270)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: a2474846f5]
2024-08-06 11:12:09 -05:00
Nilesh M Negi 35f4a405f0 [BUILD] Update gfxTargets for ASAN build (#1242)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 4f31ab85ea]
2024-08-06 10:53:51 -05:00
Ziyue Yang 30e3db969f Fix number of loops in p2p-latency-test (#1286)
[ROCm/rccl commit: 145a13235a]
2024-08-05 13:35:56 -07:00
Nilesh M Negi 713ed3341d [BUILD] Disable MSCCLPP build by default (#1283)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: cb2e0615d7]
2024-08-02 23:17:51 -05:00
Tim 3261e2a5fd Adding User Buffer Registration support for Unit test (#1199)
* Adding UBR support for UT SendRecv

Signed-off-by: Tim Hu <timhu102@amd.com>

* Update test/common/TestBedChild.cpp

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: Tim Hu <timhu102@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: a4793286c7]
2024-07-30 13:39:25 -04:00
Wenkai Du 27b7998d13 Restore number of parallel linking jobs (#1278)
* Restore number of parallel linking jobs

* Dynamically adjust number of linker jobs with limit of 16 jobs max

* Fix typo

* Add cgroup v1 support

[ROCm/rccl commit: ca5341d419]
2024-07-30 08:04:14 -07:00
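
The "Restore number of parallel linking jobs (#1278)" entry above mentions adjusting the number of linker jobs dynamically with a cap of 16. The commit message does not say how the number is derived, so the sketch below is only one plausible shape: it bounds the job count by CPU count and by a memory budget, then applies the cap of 16. The function name, the per-job memory figure of 4 GiB, and the use of a memory limit at all are assumptions, not the actual CMake logic.

```python
MAX_LINK_JOBS = 16  # cap stated in the commit message

GIB = 1024 ** 3


def pick_link_jobs(
    cpu_count: int,
    mem_limit_bytes: int,
    mem_per_job_bytes: int = 4 * GIB,  # assumed budget per linker job
) -> int:
    """Hypothetical helper: choose a parallel link job count.

    mem_limit_bytes would be whatever memory ceiling applies to the build,
    for instance a container limit. How the real build obtains it is not
    described in the commit message.
    """
    by_memory = mem_limit_bytes // mem_per_job_bytes
    # Never drop below one job, never exceed the cap of 16.
    return max(1, min(MAX_LINK_JOBS, cpu_count, by_memory))
```

Under these assumptions a 64-core host with 256 GiB would link with 16 jobs, while the same host limited to 8 GiB would use 2.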
Pedram Alizadeh 562eb08978 Adding tuner plugin example for MI300 (#1274)
[ROCm/rccl commit: b005c13292]
2024-07-29 15:43:36 -04:00
Richard Barnes 92d874be50 Remove unused but set variable from all_reduce.h (#1258)
Allows `-Wunused-but-set-variable` to pass

[ROCm/rccl commit: d09b152aa0]
2024-07-29 08:11:24 -07:00
Richard Barnes 3d208c8eb9 Remove unused but set variable from prims_ll128.h (#1257)
Allows `-Wunused-but-set-variable` to pass

[ROCm/rccl commit: 86a4ad6e8b]
2024-07-29 08:11:01 -07:00
Richard Barnes 780324296c Remove unused but set variable from prims_ll.h (#1256)
Allows `-Wunused-but-set-variable` to pass

[ROCm/rccl commit: 7ad432ee23]
2024-07-29 08:10:38 -07:00
akolliasAMD 37c44d531b gfx12 Disable ll protocol (#1268)
[ROCm/rccl commit: c246e25f8e]
2024-07-26 08:59:55 -06:00
Sam Wu 0aa81c3194 Double compile timeout for extended ci to 400 min (#1277)
[ROCm/rccl commit: 05dca6def9]
2024-07-26 09:59:36 -05:00
Benjamin Kitor d2df042c36 topo_expl: Update channel masks for >64 channels (#1279)
[ROCm/rccl commit: 4bc118336a]
2024-07-25 17:27:34 -07:00
Joseph Macaranas 496e98a73f Merge pull request #1262 from ROCm/amd/jmacaran/externalCImainline
External CI: Add triggers for mainline branch

[ROCm/rccl commit: 00cd4dae1e]
2024-07-22 20:41:01 -04:00
corey-derochie-amd 94910b8f80 Fix bug where the first collective call was using MSCCL instead of MSCCL++ (#1260)
[ROCm/rccl commit: 69135976d6]
2024-07-22 15:46:47 -06:00
saurabhAMD 24e1ed5288 Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing (#1265)
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing

* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing

[ROCm/rccl commit: cf311b71ee]
2024-07-22 10:21:29 -05:00
corey-derochie-amd f2b2372056 Only initialize MSCCL++ when runtime-enabled. (#1266)
[ROCm/rccl commit: b31b4082dd]
2024-07-22 00:41:31 -06:00
mberenjk 863b213fd2 adding rocprof and pytorch parser scripts (#1214)
* adding rocprof parser script

* adding the support for multiple json files

* adding pytorch profiler script

* remove filtering from pytorch log

* addressing the comments and adding the feature to parse all kernels

* completing the report for torch profiler

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 519843d2cf]
2024-07-19 14:51:28 -05:00
Nusrat Islam df63c9772f Enable CPX mode for MI300X (#1259)
* graph: enable cpx mode for MI300X

* graph: tune limits for cpx and cleanup

[ROCm/rccl commit: 6f331b0d43]
2024-07-19 11:30:37 -05:00
Wenkai Du 54e4899607 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>

[ROCm/rccl commit: 89349f2ce4]
2024-07-19 08:15:59 -07:00
Nilesh M Negi 73e17b3e70 Consistent channel shuffling for MI300X multi-node (#1255)
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"

This reverts commit 8a7dd0e590.

* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"

This reverts commit bb6fab3d8e.

[ROCm/rccl commit: a1ef217b32]
2024-07-18 10:18:09 -05:00
amd-jmacaran 1ade35ac64 External CI: Add triggers for mainline branch
[ROCm/rccl commit: 346fee4c83]
2024-07-17 23:16:49 -04:00
Nilesh M Negi 13134c6c64 [GRAPH] Disable MSCCL override of no. of channels (#1187)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 67e867271f]
2024-07-15 10:45:21 -05:00