Commit graph

1165 commits

Author SHA1 Message Date
Wenkai Du a497722894 NPkit: misc fixes for MSCCL (#936)
* msccl: add xcc_id to timestamp sync

* NPKit: add timestamp for rrc operator

* NPKit: add timestamp for MSCCL init
2023-10-30 10:00:12 -07:00
Nilesh M Negi 1e5ca6820b Fix gcnArchName bug in topology dump (#937)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-28 12:30:36 -05:00
Ziyue Yang 4c117e5335 Fix MSCCL work FIFO out-of-bound issue (#935) 2023-10-27 11:24:52 -07:00
Nilesh M Negi 96ec3ffe2e SRC/INIT: fix typo for ENABLE_PROFILING (#934)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 23:52:46 -05:00
Nilesh M Negi f22df90e5c remove gcnArch support (#920)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 12:09:15 -05:00
Wenkai Du fb0eccb57b msccl: reduce debug output when using NCCL_DEBUG=INFO (#932) 2023-10-25 08:05:19 -07:00
Wen-Heng (Jack) Chung bfb8642450 Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931) 2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung 3f9ffe4788 Introduce allgather MSCCL XML specification for MI250X up to 320KB. (#930) 2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung 72d5fbddfd Introduce 1-shot allreduce for MI250X Hayabusa. (#929) 2023-10-24 16:31:18 -05:00
Wenkai Du c4e65fd382 Add missing gfx942 support (#927) 2023-10-23 12:04:37 -07:00
akolliasAMD d8dc282eeb AllReduceTests,fixed the number of roots (#925) 2023-10-20 10:25:11 -06:00
dependabot[bot] b173c13831 Bump urllib3 from 1.26.17 to 1.26.18 in /docs/sphinx (#921)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.17 to 1.26.18.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.17...1.26.18)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-20 10:15:42 -06:00
searlmc1 dd5f01aeaf Merge pull request #926 from ROCmSoftwarePlatform/searlmc1-patch-1
Remove quotes causing asan build breakage
2023-10-20 07:52:39 -07:00
searlmc1 f59de10524 Remove quotes causing asan build breakage
The quotes around "-fsanitize=address -shared-libasan" cause the BUILD_ADDRESS_SANITIZER build to fail; remove the quotes
2023-10-19 16:13:39 -07:00
Bertan Dogancay 3807c203fc Update install.sh --fast and README (#924) 2023-10-19 16:35:10 -06:00
Wenkai Du dbb5611a3a Remove LDS based software barriers from MSCCL (#923) 2023-10-19 16:39:41 -05:00
Wenkai Du 4278a9918b Update rome models (#922) 2023-10-18 17:28:01 -07:00
Wen-Heng (Jack) Chung 341926c60a Introduce 1pass allreduce. Tailor it for very small message sizes <= 20KB. (#919) 2023-10-16 16:31:08 -05:00
Wenkai Du 39812ce757 NPKit: add xcc_id field (#918) 2023-10-13 15:24:59 -07:00
Wenkai Du 1b80d041cb Fix incorrect arch name parsing (#916) 2023-10-13 10:01:11 -07:00
Wenkai Du 6d0b5c1e89 Port init_once fix from NCCL (#915) 2023-10-13 08:01:12 -07:00
dependabot[bot] f7e530259d Bump rocm-docs-core from 0.25.0 to 0.26.0 in /docs/sphinx (#917)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.25.0 to 0.26.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.25.0...v0.26.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-13 08:27:04 -06:00
Wen-Heng (Jack) Chung 7ee5c1c28b Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)

Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>

* Only build gfx941

* demo

* fine tune malloc

* Fix merge errors

* Fix merge errors

* Disable parallel build

* Adopt --amdgpu-kernarg-preload-count

* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"

This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.

* Revert CMake changes.

* NPKIT changes.

* Remove some license declarations.

* Address code review feedbacks on msccl_kernel_impl.h

* Update CMakeLists.txt

* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count

* Fix NPKIT trace logic.

---------

Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2023-10-12 20:17:08 -05:00
mberenjk 7e2d905376 adding cuda support for EmptyKernelTest (#913) 2023-10-11 14:11:12 -05:00
dependabot[bot] 01d9da8046 Bump gitpython from 3.1.35 to 3.1.37 in /docs/sphinx (#912)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.35 to 3.1.37.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.35...3.1.37)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-10 15:43:49 -06:00
gilbertlee-amd 7dbf47e07b Adding a simple EmptyKernelTest to measure launch latency (#910) 2023-10-04 17:22:48 -06:00
Bertan Dogancay a6ff4618c7 Revert "Remove 2H4P condition from P2P channels adjustment (#890)" (#904)
This reverts commit 16dd05a58a.
2023-10-04 09:46:11 -06:00
dependabot[bot] f2600af812 Bump rocm-docs-core from 0.24.2 to 0.25.0 in /docs/sphinx (#909)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.2 to 0.25.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.2...v0.25.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-04 09:14:59 -06:00
dependabot[bot] 4b4e7ecdf9 Bump urllib3 from 1.26.15 to 1.26.17 in /docs/sphinx (#906)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.15 to 1.26.17.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.15...1.26.17)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-03 15:54:37 -06:00
akolliasAMD 28d7fe5629 Dma buf support optin (#905)
* dmaBufSupport Optin added on every part of the code that should invoke it
2023-10-03 03:17:48 -06:00
Edgar Gabriel c90ef5f035 Merge pull request #899 from edgargabriel/topic/disable-bfd-by-default
turn bfd compilation off by default
2023-10-01 09:40:05 -05:00
Edgar Gabriel 88a55cef83 turn bfd compilation off by default
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.
2023-09-29 20:25:33 +00:00
akolliasAMD a773def279 install.sh fix (#903) 2023-09-29 07:42:17 -06:00
Cen Zhao fb57a438d7 Update install.sh to take "--static" option (#894)
* Update install.sh to take "--static" option

* Fix static build errors

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2023-09-27 12:45:21 -04:00
Bertan Dogancay c1f57a7041 Modify All-To-All doc (#896)
* Modify All-To-All doc

* Update nccl.h.in

* update unit-tests

---------

Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
2023-09-27 12:45:21 -04:00
dependabot[bot] 50bc92f1d5 Bump gitpython from 3.1.34 to 3.1.35 in /docs/sphinx (#898)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.34...3.1.35)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-27 12:45:21 -04:00
dependabot[bot] 1bbc3742b0 Bump cryptography from 41.0.3 to 41.0.4 in /docs/sphinx (#897)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.3 to 41.0.4.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.3...41.0.4)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-27 12:45:21 -04:00
Pedram Alizadeh 3f6c2b9b32 Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895) 2023-09-27 12:44:36 -04:00
akolliasAMD b85d73c02e changed the form that RCCL_TREE uses (#888)
* changed the form that RCCL_TREE uses
2023-09-15 15:01:33 -06:00
Wenkai Du 26e982d913 Reduce NPKit latency overhead in MSCCL kernel (#893)
* Reduce NPKit latency overhead in MSCCL kernel

* Fix build error without NPKit enable
2023-09-15 13:28:26 -07:00
Wenkai Du 16dd05a58a Remove 2H4P condition from P2P channels adjustment (#890) 2023-09-13 12:54:21 -07:00
Ziyue Yang c1bfd5f0d8 Add single-node MI300X topology (#889) 2023-09-13 11:07:17 -07:00
akolliasAMD 762a42859e Fixed topo_expl (#891) 2023-09-13 12:05:35 -06:00
Wenkai Du 6a4d5ec089 Fix crash when NPKit is enabled (#887) 2023-09-13 11:00:12 -07:00
Audrey MP e58ec78d35 Gcn arch name (#886)
We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
Andy li e1dc4d5e42 enable hip graph on multi-node (#884)
* initial checkin

* enable msccl when hip graph is on

* remove the commented out code of msccl enable check

* clean up the code

* remove the msccl HighestTransportType check logic
2023-09-11 15:30:04 -07:00
Nusrat Islam e46602e44a Merge pull request #880 from nusislam/msccl-npkit
msccl: add NPKIT profiling for MSCCL send-recv
2023-09-08 14:13:14 -05:00
Nusrat Islam a283f55f12 msccl: add NPKIT profiling for MSCCL send-recv 2023-09-08 13:11:16 -05:00
dependabot[bot] a893e8a4ab Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx (#882)
* Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.22.0 to 0.24.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.22.0...v0.24.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements.in

* Update requirements.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
2023-09-07 11:27:53 -06:00
dependabot[bot] 62a09100a6 Bump gitpython from 3.1.32 to 3.1.34 in /docs/sphinx (#879)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.32 to 3.1.34.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.32...3.1.34)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-06 14:08:45 -06:00