Wenkai Du
a497722894
NPkit: misc fixes for MSCCL (#936)
* msccl: add xcc_id to timestamp sync
* NPKit: add timestamp for rrc operator
* NPKit: add timestamp for MSCCL init
2023-10-30 10:00:12 -07:00
Nilesh M Negi
1e5ca6820b
Fix gcnArchName bug in topology dump (#937)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-28 12:30:36 -05:00
Ziyue Yang
4c117e5335
Fix MSCCL work FIFO out-of-bound issue (#935)
2023-10-27 11:24:52 -07:00
Nilesh M Negi
96ec3ffe2e
SRC/INIT: fix typo for ENABLE_PROFILING (#934)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 23:52:46 -05:00
Nilesh M Negi
f22df90e5c
remove gcnArch support (#920)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 12:09:15 -05:00
Wenkai Du
fb0eccb57b
msccl: reduce debug output when using NCCL_DEBUG=INFO (#932)
2023-10-25 08:05:19 -07:00
Wen-Heng (Jack) Chung
bfb8642450
Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931)
2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung
3f9ffe4788
Introduce allgather MSCCL XML specification for MI250X up to 320KB. (#930)
2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung
72d5fbddfd
Introduce 1-shot allreduce for MI250X Hayabusa. (#929)
2023-10-24 16:31:18 -05:00
Wenkai Du
c4e65fd382
Add missing gfx942 support (#927)
2023-10-23 12:04:37 -07:00
akolliasAMD
d8dc282eeb
AllReduceTests,fixed the number of roots (#925)
2023-10-20 10:25:11 -06:00
dependabot[bot]
b173c13831
Bump urllib3 from 1.26.17 to 1.26.18 in /docs/sphinx (#921)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.17 to 1.26.18.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.17...1.26.18)
---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-20 10:15:42 -06:00
searlmc1
dd5f01aeaf
Merge pull request #926 from ROCmSoftwarePlatform/searlmc1-patch-1
Remove quotes causing asan build breakage
2023-10-20 07:52:39 -07:00
searlmc1
f59de10524
Remove quotes causing asan build breakage
The quotes around "-fsanitize=address -shared-libasan" cause the BUILD_ADDRESS_SANITIZER build to fail; remove the quotes
2023-10-19 16:13:39 -07:00
Bertan Dogancay
3807c203fc
Update install.sh --fast and README (#924)
2023-10-19 16:35:10 -06:00
Wenkai Du
dbb5611a3a
Remove LDS based software barriers from MSCCL (#923)
2023-10-19 16:39:41 -05:00
Wenkai Du
4278a9918b
Update rome models (#922)
2023-10-18 17:28:01 -07:00
Wen-Heng (Jack) Chung
341926c60a
Introduce 1pass allreduce. Tailor it for very small message sizes <= 20KB. (#919)
2023-10-16 16:31:08 -05:00
Wenkai Du
39812ce757
NPKit: add xcc_id field (#918)
2023-10-13 15:24:59 -07:00
Wenkai Du
1b80d041cb
Fix incorrect arch name parsing (#916)
2023-10-13 10:01:11 -07:00
Wenkai Du
6d0b5c1e89
Port init_once fix from NCCL (#915)
2023-10-13 08:01:12 -07:00
dependabot[bot]
f7e530259d
Bump rocm-docs-core from 0.25.0 to 0.26.0 in /docs/sphinx (#917)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.25.0 to 0.26.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.25.0...v0.26.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-13 08:27:04 -06:00
Wen-Heng (Jack) Chung
7ee5c1c28b
Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
* Only build gfx941
* demo
* fine tune malloc
* Fix merge errors
* Fix merge errors
* Disable parallel build
* Adopt --amdgpu-kernarg-preload-count
* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"
This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.
* Revert CMake changes.
* NPKIT changes.
* Remove some license declarations.
* Address code review feedback on msccl_kernel_impl.h
* Update CMakeLists.txt
* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count
* Fix NPKIT trace logic.
---------
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2023-10-12 20:17:08 -05:00
mberenjk
7e2d905376
adding cuda support for EmptyKernelTest (#913)
2023-10-11 14:11:12 -05:00
dependabot[bot]
01d9da8046
Bump gitpython from 3.1.35 to 3.1.37 in /docs/sphinx (#912)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.35 to 3.1.37.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.35...3.1.37)
---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-10 15:43:49 -06:00
gilbertlee-amd
7dbf47e07b
Adding a simple EmptyKernelTest to measure launch latency (#910)
2023-10-04 17:22:48 -06:00
Bertan Dogancay
a6ff4618c7
Revert "Remove 2H4P condition from P2P channels adjustment (#890)" (#904)
This reverts commit 16dd05a58a.
2023-10-04 09:46:11 -06:00
dependabot[bot]
f2600af812
Bump rocm-docs-core from 0.24.2 to 0.25.0 in /docs/sphinx (#909)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.2 to 0.25.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.2...v0.25.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-04 09:14:59 -06:00
dependabot[bot]
4b4e7ecdf9
Bump urllib3 from 1.26.15 to 1.26.17 in /docs/sphinx (#906)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.15 to 1.26.17.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.15...1.26.17)
---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-03 15:54:37 -06:00
akolliasAMD
28d7fe5629
Dma buf support optin (#905)
* dmaBufSupport opt-in added to every part of the code that should invoke it
2023-10-03 03:17:48 -06:00
Edgar Gabriel
c90ef5f035
Merge pull request #899 from edgargabriel/topic/disable-bfd-by-default
turn bfd compilation off by default
2023-10-01 09:40:05 -05:00
Edgar Gabriel
88a55cef83
turn bfd compilation off by default
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.
2023-09-29 20:25:33 +00:00
akolliasAMD
a773def279
install.sh fix (#903)
2023-09-29 07:42:17 -06:00
Cen Zhao
fb57a438d7
Update install.sh to take "--static" option (#894)
* Update install.sh to take "--static" option
* Fix static build errors
---------
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2023-09-27 12:45:21 -04:00
Bertan Dogancay
c1f57a7041
Modify All-To-All doc (#896)
* Modify All-To-All doc
* Update nccl.h.in
* update unit-tests
---------
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
2023-09-27 12:45:21 -04:00
dependabot[bot]
50bc92f1d5
Bump gitpython from 3.1.34 to 3.1.35 in /docs/sphinx (#898)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.34...3.1.35)
---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-27 12:45:21 -04:00
dependabot[bot]
1bbc3742b0
Bump cryptography from 41.0.3 to 41.0.4 in /docs/sphinx (#897)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.3 to 41.0.4.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.3...41.0.4)
---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-27 12:45:21 -04:00
Pedram Alizadeh
3f6c2b9b32
Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)
2023-09-27 12:44:36 -04:00
akolliasAMD
b85d73c02e
changed the form that RCCL_TREE uses (#888)
* changed the form that RCCL_TREE uses
2023-09-15 15:01:33 -06:00
Wenkai Du
26e982d913
Reduce NPKit latency overhead in MSCCL kernel (#893)
* Reduce NPKit latency overhead in MSCCL kernel
* Fix build error without NPKit enable
2023-09-15 13:28:26 -07:00
Wenkai Du
16dd05a58a
Remove 2H4P condition from P2P channels adjustment (#890)
2023-09-13 12:54:21 -07:00
Ziyue Yang
c1bfd5f0d8
Add single-node MI300X topology (#889)
2023-09-13 11:07:17 -07:00
akolliasAMD
762a42859e
Fixed topo_expl (#891)
2023-09-13 12:05:35 -06:00
Wenkai Du
6a4d5ec089
Fix crash when NPKit is enabled (#887)
2023-09-13 11:00:12 -07:00
Audrey MP
e58ec78d35
Gcn arch name (#886)
Use CMake to determine whether we're compiling against a version of ROCm that supports gcnArchName, and handle architecture checking accordingly. This includes a few helper functions as drop-in replacements for what gcnArch was used for before: sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
Andy li
e1dc4d5e42
enable hip graph on multi-node (#884)
* initial checkin
* enable msccl when hip graph is on
* remove the commented out code of msccl enable check
* clean up the code
* remove the msccl HighestTransportType check logic
2023-09-11 15:30:04 -07:00
Nusrat Islam
e46602e44a
Merge pull request #880 from nusislam/msccl-npkit
msccl: add NPKIT profiling for MSCCL send-recv
2023-09-08 14:13:14 -05:00
Nusrat Islam
a283f55f12
msccl: add NPKIT profiling for MSCCL send-recv
2023-09-08 13:11:16 -05:00
dependabot[bot]
a893e8a4ab
Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx (#882)
* Bump rocm-docs-core from 0.22.0 to 0.24.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.22.0 to 0.24.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.22.0...v0.24.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* Update requirements.in
* Update requirements.txt
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
2023-09-07 11:27:53 -06:00
dependabot[bot]
62a09100a6
Bump gitpython from 3.1.32 to 3.1.34 in /docs/sphinx (#879)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.32 to 3.1.34.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.32...3.1.34)
---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-06 14:08:45 -06:00