Commit graph

1052 commits

Author SHA1 Message Date
Wenkai Du 446c8cbf66 msccl: reduce debug output when using NCCL_DEBUG=INFO (#932)
[ROCm/rccl commit: fb0eccb57b]
2023-10-25 08:05:19 -07:00
Wen-Heng (Jack) Chung 769f00db5c Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931)
[ROCm/rccl commit: bfb8642450]
2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung 89a8493ef8 Introduce allgather MSCCL XML specification for MI250X up to 320KB. (#930)
[ROCm/rccl commit: 3f9ffe4788]
2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung fc2a13c077 Introduce 1-shot allreduce for MI250X Hayabusa. (#929)
[ROCm/rccl commit: 72d5fbddfd]
2023-10-24 16:31:18 -05:00
Wenkai Du cc4de02a86 Add missing gfx942 support (#927)
[ROCm/rccl commit: c4e65fd382]
2023-10-23 12:04:37 -07:00
akolliasAMD bc7df769a2 AllReduceTests, fixed the number of roots (#925)
[ROCm/rccl commit: d8dc282eeb]
2023-10-20 10:25:11 -06:00
dependabot[bot] 187e9c1958 Bump urllib3 from 1.26.17 to 1.26.18 in /docs/sphinx (#921)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.17 to 1.26.18.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.17...1.26.18)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: b173c13831]
2023-10-20 10:15:42 -06:00
searlmc1 212453b2fb Remove quotes causing asan build breakage
The quotes around "-fsanitize=address -shared-libasan" cause the BUILD_ADDRESS_SANITIZER build to fail; remove the quotes

[ROCm/rccl commit: f59de10524]
2023-10-19 16:13:39 -07:00
Bertan Dogancay 1a538d0218 Update install.sh --fast and README (#924)
[ROCm/rccl commit: 3807c203fc]
2023-10-19 16:35:10 -06:00
Wenkai Du 6f0f614d0b Remove LDS based software barriers from MSCCL (#923)
[ROCm/rccl commit: dbb5611a3a]
2023-10-19 16:39:41 -05:00
Wenkai Du edeea499b5 Update rome models (#922)
[ROCm/rccl commit: 4278a9918b]
2023-10-18 17:28:01 -07:00
Wen-Heng (Jack) Chung 49e52e7269 Introduce 1pass allreduce. Tailor it for very small message sizes <= 20KB. (#919)
[ROCm/rccl commit: 341926c60a]
2023-10-16 16:31:08 -05:00
Wenkai Du e0cc7de446 NPKit: add xcc_id field (#918)
[ROCm/rccl commit: 39812ce757]
2023-10-13 15:24:59 -07:00
Wenkai Du c0bd012e6c Fix incorrect arch name parsing (#916)
[ROCm/rccl commit: 1b80d041cb]
2023-10-13 10:01:11 -07:00
Wenkai Du 102f0165d6 Port init_once fix from NCCL (#915)
[ROCm/rccl commit: 6d0b5c1e89]
2023-10-13 08:01:12 -07:00
dependabot[bot] 376de87fa9 Bump rocm-docs-core from 0.25.0 to 0.26.0 in /docs/sphinx (#917)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.25.0 to 0.26.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.25.0...v0.26.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: f7e530259d]
2023-10-13 08:27:04 -06:00
Wen-Heng (Jack) Chung dfa0d98f9e Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)

Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>

* Only build gfx941

* demo

* fine tune malloc

* Fix merge errors

* Fix merge errors

* Disable parallel build

* Adopt --amdgpu-kernarg-preload-count

* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"

This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.

* Revert CMake changes.

* NPKIT changes.

* Remove some license declarations.

* Address code review feedbacks on msccl_kernel_impl.h

* Update CMakeLists.txt

* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count

* Fix NPKIT trace logic.

---------

Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>

[ROCm/rccl commit: 7ee5c1c28b]
2023-10-12 20:17:08 -05:00
mberenjk 9a0c9ba3e9 adding cuda support for EmptyKernelTest (#913)
[ROCm/rccl commit: 7e2d905376]
2023-10-11 14:11:12 -05:00
dependabot[bot] 5096358a70 Bump gitpython from 3.1.35 to 3.1.37 in /docs/sphinx (#912)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.35 to 3.1.37.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.35...3.1.37)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 01d9da8046]
2023-10-10 15:43:49 -06:00
gilbertlee-amd c1a7b56b9b Adding a simple EmptyKernelTest to measure launch latency (#910)
[ROCm/rccl commit: 7dbf47e07b]
2023-10-04 17:22:48 -06:00
Bertan Dogancay 6f7965796f Revert "Remove 2H4P condition from P2P channels adjustment (#890)" (#904)
This reverts commit 057e30e705.

[ROCm/rccl commit: a6ff4618c7]
2023-10-04 09:46:11 -06:00
dependabot[bot] c0a707ea50 Bump rocm-docs-core from 0.24.2 to 0.25.0 in /docs/sphinx (#909)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.2 to 0.25.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.2...v0.25.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: f2600af812]
2023-10-04 09:14:59 -06:00
dependabot[bot] 928cf93c4b Bump urllib3 from 1.26.15 to 1.26.17 in /docs/sphinx (#906)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.15 to 1.26.17.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.15...1.26.17)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 4b4e7ecdf9]
2023-10-03 15:54:37 -06:00
akolliasAMD 1ffd3eff31 Dma buf support optin (#905)
* dmaBufSupport Optin added on every part of the code that should invoke it

[ROCm/rccl commit: 28d7fe5629]
2023-10-03 03:17:48 -06:00
Edgar Gabriel e6c3e9fd8e turn bfd compilation off by default
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.


[ROCm/rccl commit: 88a55cef83]
2023-09-29 20:25:33 +00:00
akolliasAMD 12b2fc9774 install.sh fix (#903)
[ROCm/rccl commit: a773def279]
2023-09-29 07:42:17 -06:00
Cen Zhao d3c20a1210 Update install.sh to take "--static" option (#894)
* Update install.sh to take "--static" option

* Fix static build errors

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>

[ROCm/rccl commit: fb57a438d7]
2023-09-27 12:45:21 -04:00
Bertan Dogancay b35ea4bd78 Modify All-To-All doc (#896)
* Modify All-To-All doc

* Update nccl.h.in

* update unit-tests

---------

Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>

[ROCm/rccl commit: c1f57a7041]
2023-09-27 12:45:21 -04:00
dependabot[bot] 01c72d16d5 Bump gitpython from 3.1.34 to 3.1.35 in /docs/sphinx (#898)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.34...3.1.35)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 50bc92f1d5]
2023-09-27 12:45:21 -04:00
dependabot[bot] 2c5a37a6b1 Bump cryptography from 41.0.3 to 41.0.4 in /docs/sphinx (#897)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.3 to 41.0.4.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.3...41.0.4)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 1bbc3742b0]
2023-09-27 12:45:21 -04:00
Pedram Alizadeh 279da575be Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)
[ROCm/rccl commit: 3f6c2b9b32]
2023-09-27 12:44:36 -04:00
akolliasAMD 6f7eb65308 changed the form that RCCL_TREE uses (#888)
* changed the form that RCCL_TREE uses

[ROCm/rccl commit: b85d73c02e]
2023-09-15 15:01:33 -06:00
Wenkai Du 3cc41809dd Reduce NPKit latency overhead in MSCCL kernel (#893)
* Reduce NPKit latency overhead in MSCCL kernel

* Fix build error without NPKit enable

[ROCm/rccl commit: 26e982d913]
2023-09-15 13:28:26 -07:00
Wenkai Du 057e30e705 Remove 2H4P condition from P2P channels adjustment (#890)
[ROCm/rccl commit: 16dd05a58a]
2023-09-13 12:54:21 -07:00
Ziyue Yang 6d593761dc Add single-node MI300X topology (#889)
[ROCm/rccl commit: c1bfd5f0d8]
2023-09-13 11:07:17 -07:00
akolliasAMD 8685535346 Fixed topo_expl (#891)
[ROCm/rccl commit: 762a42859e]
2023-09-13 12:05:35 -06:00
Wenkai Du b0a16d80ff Fix crash when NPKit is enabled (#887)
[ROCm/rccl commit: 6a4d5ec089]
2023-09-13 11:00:12 -07:00
Audrey MP 2e3d45a53a Gcn arch name (#886)
We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.

[ROCm/rccl commit: e58ec78d35]
2023-09-12 15:34:40 -04:00
Andy li 43a9fd00ee enable hip graph on multi-node (#884)
* initial checkin

* enable msccl when hip graph is on

* remove the commented out code of msccl enable check

* clean up the code

* remove the msccl HighestTransportType check logic

[ROCm/rccl commit: e1dc4d5e42]
2023-09-11 15:30:04 -07:00
Nusrat Islam e0ddc8f549 Merge pull request #880 from nusislam/msccl-npkit
msccl: add NPKIT profiling for MSCCL send-recv

[ROCm/rccl commit: e46602e44a]
2023-09-08 14:13:14 -05:00
Nusrat Islam ffbfe43500 msccl: add NPKIT profiling for MSCCL send-recv
[ROCm/rccl commit: a283f55f12]
2023-09-08 13:11:16 -05:00
dependabot[bot] ae27ee7108 Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx (#882)
* Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.22.0 to 0.24.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.22.0...v0.24.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements.in

* Update requirements.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>

[ROCm/rccl commit: a893e8a4ab]
2023-09-07 11:27:53 -06:00
dependabot[bot] ecd3fb42b0 Bump gitpython from 3.1.32 to 3.1.34 in /docs/sphinx (#879)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.32 to 3.1.34.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.32...3.1.34)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 62a09100a6]
2023-09-06 14:08:45 -06:00
Bertan Dogancay 2aa31c89df RCCL should use hipPointerAttribute_t.type (#872)
[ROCm/rccl commit: 6230b5f6b3]
2023-09-05 09:44:12 -06:00
Wenkai Du 009990efca Remove --hipcc-func-supp with recent compilers (#874)
* Remove --hipcc-func-supp with recent compilers

* Remove HIP_UNCACHED_MEMORY detection from header file

[ROCm/rccl commit: 2baca3a55a]
2023-09-01 07:53:18 -07:00
dependabot[bot] 6ec15d550d Bump rocm-docs-core from 0.21.0 to 0.22.0 in /docs/sphinx (#875)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.21.0 to 0.22.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/v0.22.0/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.21.0...v0.22.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: e642681fd3]
2023-09-01 08:46:43 -06:00
Wenkai Du be412b848b Update ll_latency_test and add CUDA version (#873)
[ROCm/rccl commit: c6dd6f6237]
2023-08-30 16:29:42 -07:00
dependabot[bot] 29b01e4b3b Bump rocm-docs-core from 0.20.0 to 0.21.0 in /docs/sphinx (#870)
* Bump rocm-docs-core from 0.20.0 to 0.21.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.20.0 to 0.21.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.20.0...v0.21.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* replace noCI with ci:docs-only label

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>

[ROCm/rccl commit: a433fcc726]
2023-08-30 08:56:38 -06:00
gilbertlee-amd 5fe857c562 More robust msccl shared directory location discovery (#868)
[ROCm/rccl commit: 4297315de7]
2023-08-30 08:10:14 -06:00
Pedram Alizadeh b4f96a23e6 optimizing COLL_UNROLL for MI100 machines (#863)
[ROCm/rccl commit: e7f27c66e0]
2023-08-29 12:49:06 -04:00