1037 Commits

Author SHA1 Message Date
dependabot[bot] 376de87fa9 Bump rocm-docs-core from 0.25.0 to 0.26.0 in /docs/sphinx (#917)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.25.0 to 0.26.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.25.0...v0.26.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: f7e530259d]
2023-10-13 08:27:04 -06:00
Wen-Heng (Jack) Chung dfa0d98f9e Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)

Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>

* Only build gfx941

* demo

* fine tune malloc

* Fix merge errors

* Fix merge errors

* Disable parallel build

* Adopt --amdgpu-kernarg-preload-count

* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"

This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.

* Revert CMake changes.

* NPKIT changes.

* Remove some license declarations.

* Address code review feedback on msccl_kernel_impl.h

* Update CMakeLists.txt

* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count

* Fix NPKIT trace logic.

---------

Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>

[ROCm/rccl commit: 7ee5c1c28b]
2023-10-12 20:17:08 -05:00
mberenjk 9a0c9ba3e9 adding cuda support for EmptyKernelTest (#913)
[ROCm/rccl commit: 7e2d905376]
2023-10-11 14:11:12 -05:00
dependabot[bot] 5096358a70 Bump gitpython from 3.1.35 to 3.1.37 in /docs/sphinx (#912)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.35 to 3.1.37.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.35...3.1.37)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 01d9da8046]
2023-10-10 15:43:49 -06:00
gilbertlee-amd c1a7b56b9b Adding a simple EmptyKernelTest to measure launch latency (#910)
[ROCm/rccl commit: 7dbf47e07b]
2023-10-04 17:22:48 -06:00
Bertan Dogancay 6f7965796f Revert "Remove 2H4P condition from P2P channels adjustment (#890)" (#904)
This reverts commit 057e30e705.

[ROCm/rccl commit: a6ff4618c7]
2023-10-04 09:46:11 -06:00
dependabot[bot] c0a707ea50 Bump rocm-docs-core from 0.24.2 to 0.25.0 in /docs/sphinx (#909)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.2 to 0.25.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.2...v0.25.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: f2600af812]
2023-10-04 09:14:59 -06:00
dependabot[bot] 928cf93c4b Bump urllib3 from 1.26.15 to 1.26.17 in /docs/sphinx (#906)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.15 to 1.26.17.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.15...1.26.17)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 4b4e7ecdf9]
2023-10-03 15:54:37 -06:00
akolliasAMD 1ffd3eff31 Dma buf support optin (#905)
* dmaBufSupport Optin added on every part of the code that should invoke it

[ROCm/rccl commit: 28d7fe5629]
2023-10-03 03:17:48 -06:00
Edgar Gabriel e6c3e9fd8e turn bfd compilation off by default
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.

[ROCm/rccl commit: 88a55cef83]
2023-09-29 20:25:33 +00:00
akolliasAMD 12b2fc9774 install.sh fix (#903)
[ROCm/rccl commit: a773def279]
2023-09-29 07:42:17 -06:00
Cen Zhao d3c20a1210 Update install.sh to take "--static" option (#894)
* Update install.sh to take "--static" option

* Fix static build errors

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>

[ROCm/rccl commit: fb57a438d7]
2023-09-27 12:45:21 -04:00
Bertan Dogancay b35ea4bd78 Modify All-To-All doc (#896)
* Modify All-To-All doc

* Update nccl.h.in

* update unit-tests

---------

Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>

[ROCm/rccl commit: c1f57a7041]
2023-09-27 12:45:21 -04:00
dependabot[bot] 01c72d16d5 Bump gitpython from 3.1.34 to 3.1.35 in /docs/sphinx (#898)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.34...3.1.35)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 50bc92f1d5]
2023-09-27 12:45:21 -04:00
dependabot[bot] 2c5a37a6b1 Bump cryptography from 41.0.3 to 41.0.4 in /docs/sphinx (#897)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.3 to 41.0.4.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.3...41.0.4)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 1bbc3742b0]
2023-09-27 12:45:21 -04:00
Pedram Alizadeh 279da575be Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)
[ROCm/rccl commit: 3f6c2b9b32]
2023-09-27 12:44:36 -04:00
akolliasAMD 6f7eb65308 changed the form that RCCL_TREE uses (#888)
* changed the form that RCCL_TREE uses

[ROCm/rccl commit: b85d73c02e]
2023-09-15 15:01:33 -06:00
Wenkai Du 3cc41809dd Reduce NPKit latency overhead in MSCCL kernel (#893)
* Reduce NPKit latency overhead in MSCCL kernel

* Fix build error without NPKit enable

[ROCm/rccl commit: 26e982d913]
2023-09-15 13:28:26 -07:00
Wenkai Du 057e30e705 Remove 2H4P condition from P2P channels adjustment (#890)
[ROCm/rccl commit: 16dd05a58a]
2023-09-13 12:54:21 -07:00
Ziyue Yang 6d593761dc Add single-node MI300X topology (#889)
[ROCm/rccl commit: c1bfd5f0d8]
2023-09-13 11:07:17 -07:00
akolliasAMD 8685535346 Fixed topo_expl (#891)
[ROCm/rccl commit: 762a42859e]
2023-09-13 12:05:35 -06:00
Wenkai Du b0a16d80ff Fix crash when NPKit is enabled (#887)
[ROCm/rccl commit: 6a4d5ec089]
2023-09-13 11:00:12 -07:00
Audrey MP 2e3d45a53a Gcn arch name (#886)
We use CMake to determine whether we're compiling against a version of ROCm that supports gcnArchName, and handle architecture checking accordingly. The change includes a few helper functions as drop-ins for the functionality gcnArch was previously used for: sometimes to enable flags, and sometimes to set frequencies.

[ROCm/rccl commit: e58ec78d35]
2023-09-12 15:34:40 -04:00
Andy li 43a9fd00ee enable hip graph on multi-node (#884)
* initial checkin

* enable msccl when hip graph is on

* remove the commented out code of msccl enable check

* clean up the code

* remove the msccl HighestTransportType check logic

[ROCm/rccl commit: e1dc4d5e42]
2023-09-11 15:30:04 -07:00
Nusrat Islam e0ddc8f549 Merge pull request #880 from nusislam/msccl-npkit
msccl: add NPKIT profiling for MSCCL send-recv

[ROCm/rccl commit: e46602e44a]
2023-09-08 14:13:14 -05:00
Nusrat Islam ffbfe43500 msccl: add NPKIT profiling for MSCCL send-recv
[ROCm/rccl commit: a283f55f12]
2023-09-08 13:11:16 -05:00
dependabot[bot] ae27ee7108 Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx (#882)
* Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.22.0 to 0.24.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.22.0...v0.24.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements.in

* Update requirements.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>

[ROCm/rccl commit: a893e8a4ab]
2023-09-07 11:27:53 -06:00
dependabot[bot] ecd3fb42b0 Bump gitpython from 3.1.32 to 3.1.34 in /docs/sphinx (#879)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.32 to 3.1.34.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.32...3.1.34)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 62a09100a6]
2023-09-06 14:08:45 -06:00
Bertan Dogancay 2aa31c89df RCCL should use hipPointerAttribute_t.type (#872)
[ROCm/rccl commit: 6230b5f6b3]
2023-09-05 09:44:12 -06:00
Wenkai Du 009990efca Remove --hipcc-func-supp with recent compilers (#874)
* Remove --hipcc-func-supp with recent compilers

* Remove HIP_UNCACHED_MEMORY detection from header file

[ROCm/rccl commit: 2baca3a55a]
2023-09-01 07:53:18 -07:00
dependabot[bot] 6ec15d550d Bump rocm-docs-core from 0.21.0 to 0.22.0 in /docs/sphinx (#875)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.21.0 to 0.22.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/v0.22.0/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.21.0...v0.22.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: e642681fd3]
2023-09-01 08:46:43 -06:00
Wenkai Du be412b848b Update ll_latency_test and add CUDA version (#873)
[ROCm/rccl commit: c6dd6f6237]
2023-08-30 16:29:42 -07:00
dependabot[bot] 29b01e4b3b Bump rocm-docs-core from 0.20.0 to 0.21.0 in /docs/sphinx (#870)
* Bump rocm-docs-core from 0.20.0 to 0.21.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.20.0 to 0.21.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.20.0...v0.21.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* replace noCI with ci:docs-only label

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>

[ROCm/rccl commit: a433fcc726]
2023-08-30 08:56:38 -06:00
gilbertlee-amd 5fe857c562 More robust msccl shared directory location discovery (#868)
[ROCm/rccl commit: 4297315de7]
2023-08-30 08:10:14 -06:00
Pedram Alizadeh b4f96a23e6 optimizing COLL_UNROLL for MI100 machines (#863)
[ROCm/rccl commit: e7f27c66e0]
2023-08-29 12:49:06 -04:00
Bertan Dogancay bcb8075e38 Add 0-byte test for send/recv (#865)
[ROCm/rccl commit: 0a01dc2f19]
2023-08-29 09:14:18 -06:00
Wenkai Du 4fefe1ce7d rccl-prim-test: use non-temporal access (#867)
[ROCm/rccl commit: aa95985867]
2023-08-28 08:28:05 -07:00
Bertan Dogancay 487391e8bb Add ncclCommSplit test (#852)
Add ncclSplitCommTest

[ROCm/rccl commit: 9d11cd092f]
2023-08-25 16:26:45 -06:00
Wenkai Du af04103d72 Add MSCCL xml files (#861)
[ROCm/rccl commit: aeca1af374]
2023-08-23 14:12:34 -07:00
gilbertlee-amd 3dd880fe74 Minor fix for some msccl installations (#862)
[ROCm/rccl commit: 5bcd3768cc]
2023-08-23 13:48:58 -06:00
arvindcheru 5e60fb93d5 366827 - Disable file reorg backward compatibility support by default (#849)
* Disable file reorg backward compatibility support by default

- File Reorg backward compatibility option set to OFF

* Update install.sh

[ROCm/rccl commit: 6ee758382e]
2023-08-22 09:14:49 -04:00
Wenkai Du 5983f0e371 Use relaxed atomics for LL on GFX11 (#859)
[ROCm/rccl commit: 6a0a6a37d9]
2023-08-21 16:28:39 -07:00
David Pagan 75e3927f50 Fix static_assert string literal that contains a "\%". This is no longer (#860)
valid. They can only be simple escape sequences. Removing '\' fixes
issue. Assert message now compiles and emits the '%' as expected.

[ROCm/rccl commit: 2ec2648247]
2023-08-21 16:19:59 -07:00
akolliasAMD 56129830a6 NCCL_TREES variable and rome model fixes (#856)
[ROCm/rccl commit: d33cd5a233]
2023-08-21 10:35:37 -06:00
Wenkai Du 47330a62a6 p2p/ll-latency-test: convert to single thread tests (#857)
[ROCm/rccl commit: 148e3430f4]
2023-08-21 07:48:37 -07:00
Wenkai Du 6fdb4103b7 gfx11: don't use LL for sendrecv (#853)
* gfx11: don't use LL for sendrecv

* Use builtin instead of inline asm

[ROCm/rccl commit: f70e3e569b]
2023-08-17 08:50:51 -07:00
dependabot[bot] 84c77f7652 Bump gitpython from 3.1.31 to 3.1.32 in /docs/sphinx (#850)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.32.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.31...3.1.32)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 621ccd783b]
2023-08-11 15:08:00 -06:00
Wenkai Du 0c31452135 Add new model support (#847)
* Add new model support

* Update new rings

[ROCm/rccl commit: 7044599575]
2023-08-10 17:14:51 -07:00
Bertan Dogancay f602b16bcb Fix mscclLoadAlgo error (#846)
[ROCm/rccl commit: da107ff2bc]
2023-08-09 11:39:21 -06:00
Ziyue Yang 18811f6159 NPKit update (#844)
* NPKit update

1. Enable NPKit for MSCCL kernels
2. Fix NPKit context index calculation for sendrecv kernels

* Update build script for npkit

[ROCm/rccl commit: d33a70e620]
2023-08-08 17:30:40 -07:00