dependabot[bot]
e6e99a1ae9
Bump rocm-docs-core from 0.29.0 to 0.30.1 in /docs/sphinx ( #1008 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.29.0 to 0.30.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.29.0...v0.30.1)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 9c3fea1751 ]
2023-12-07 10:32:06 -07:00
Wen-Heng (Jack) Chung
0266febb31
Let 320KB message size use LL protocol. ( #1006 )
...
[ROCm/rccl commit: 8e8323252a ]
2023-12-06 18:14:31 -06:00
Wen-Heng (Jack) Chung
33aa8b67be
Use a map to host scratch buffers ( #1004 )
...
* Use a map to host scratch buffers
* Address review feedback. Deliberately keep the mscclSetupScratch function.
[ROCm/rccl commit: 293f0fb752 ]
2023-12-05 13:15:28 -06:00
Nilesh M Negi
403a91137c
Fix gcnArch bug in IFC mix build ( #998 ) ( #1002 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: bc44e3faa7 ]
2023-12-04 16:20:22 -06:00
Bertan Dogancay
ae0bdad45c
IFC mix build ( #998 )
...
[ROCm/rccl commit: 7c0f49a878 ]
2023-12-02 18:49:52 -07:00
Wenkai Du
b38b7fa3a2
Increase max channels to 64 ( #993 )
...
[ROCm/rccl commit: 4ba65d1d6a ]
2023-12-01 16:01:11 -08:00
pradeep-ramanna
bf57487384
Fix GPU to NIC mapping for peer-to-peer ( #994 )
...
[ROCm/rccl commit: 0b53f79196 ]
2023-12-01 08:00:17 -08:00
Ziyue Yang
cef45b8311
Fix mscclAlgoHandle not initialized issue ( #995 )
...
[ROCm/rccl commit: e44e112a17 ]
2023-12-01 07:58:01 -08:00
dependabot[bot]
d237322b8d
Bump cryptography from 41.0.4 to 41.0.6 in /docs/sphinx ( #985 )
...
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.4...41.0.6)
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: ddd5c07b56 ]
2023-11-30 15:05:50 -07:00
Ziyue Yang
f0c47d085e
Move MSCCL algorithm loading to initialization to work around HIP graph conflict ( #982 )
...
* MSCCL: pre-specify channels and pre-load algorithms
* add mutex
* fix bug
* clean include
* disable all-gathers temporarily
[ROCm/rccl commit: 4bb0b4a380 ]
2023-11-30 09:47:20 -08:00
Bertan Dogancay
5efe13655d
Renaming unit-tests package ( #987 )
...
[ROCm/rccl commit: 20b02af19b ]
2023-11-29 15:05:32 -07:00
dependabot[bot]
51e0dd2ab8
Bump rocm-docs-core from 0.28.0 to 0.29.0 in /docs/sphinx ( #980 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.28.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.28.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 10a7cb7556 ]
2023-11-29 09:39:33 -07:00
akolliasAMD
bd982864d5
Recreated PR 914 to work with the current develop branch ( #979 )
...
[ROCm/rccl commit: 56ce9ef05f ]
2023-11-28 16:33:47 -07:00
akolliasAMD
81cc39899b
NPKit trace script now syncs on the average difference per rank ( #981 )
...
[ROCm/rccl commit: c71bae1608 ]
2023-11-28 11:03:55 -07:00
gilbertlee-amd
d0a194ec16
JitterBench ( #975 )
...
[ROCm/rccl commit: 213869a6b4 ]
2023-11-23 11:14:11 -07:00
Wenkai Du
dcf623f2ec
Add special handling of gfx940 ( #976 )
...
* Add special handling of gfx940
* Update ring base
[ROCm/rccl commit: 50b2dd9fd7 ]
2023-11-22 15:07:36 -08:00
dependabot[bot]
68ffd1e90d
Bump rocm-docs-core from 0.27.0 to 0.28.0 in /docs/sphinx ( #969 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.28.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.27.0...v0.28.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: b1c746e7b5 ]
2023-11-22 08:58:26 -07:00
Wenkai Du
4c2fa05a23
msccl: allocate scratch as ext-scope fine-grained ( #968 )
...
[ROCm/rccl commit: 569d3f7d59 ]
2023-11-16 09:57:25 -06:00
searlmc1
b5642f39ed
Update README.md ( #955 )
...
Remove references to HCC, which was removed from ROCm about two years ago.
[ROCm/rccl commit: 15fa77bb57 ]
2023-11-15 18:01:45 -08:00
Wenkai Du
7c0920cd62
Fix kernel command line warnings ( #961 )
...
* Fix kernel command line warnings
* Remove while loop
[ROCm/rccl commit: bc8661f092 ]
2023-11-15 18:01:12 -08:00
Ziyue Yang
6ce074d92d
Fix MSCCL work FIFO allocation with HIP graph enabled ( #967 )
...
[ROCm/rccl commit: 7fc891bc8d ]
2023-11-15 16:43:28 -08:00
Bertan Dogancay
9e8eb41337
Check to support older ROCm versions ( #963 )
...
[ROCm/rccl commit: 198f14923b ]
2023-11-15 12:36:31 -07:00
Ziyue Yang
2351578d5b
Optimize MSCCL all-gather algorithms for gfx942 ( #964 )
...
[ROCm/rccl commit: 7ae95db5b8 ]
2023-11-15 08:18:59 -08:00
Ziyue Yang
2c6eededec
Optimize MSCCL reduce primitive switching for gfx942 ( #962 )
...
* Optimize reduce primitive switching for gfx942
* address comment
[ROCm/rccl commit: df128879a6 ]
2023-11-15 08:18:44 -08:00
Wenkai Du
534af85d0f
msccl: enable basic collective trace ( #959 )
...
To avoid increasing the number of kernels, colltrace is only enabled with
RCCL_MSCCL_FORCE_FULLOPS=1.
[ROCm/rccl commit: 5a800e00cd ]
2023-11-08 20:14:28 -08:00
Bertan Dogancay
1eef637d8c
Revert "Remove hip::device ( #954 )" ( #956 )
...
This reverts commit 2d04f5390d.
[ROCm/rccl commit: 8e0258a73d ]
2023-11-07 13:40:41 -07:00
Wen-Heng (Jack) Chung
270aa41f6b
Use send instead of sendWithBarrier. ( #727 )
...
[ROCm/rccl commit: efc42d9045 ]
2023-11-07 13:47:24 -06:00
dependabot[bot]
cf60052394
Bump rocm-docs-core from 0.26.0 to 0.27.0 in /docs/sphinx ( #951 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.26.0...v0.27.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 7291144c94 ]
2023-11-07 10:01:16 -07:00
Nusrat Islam
83a36c65c1
Merge pull request #950 from nusislam/msccl-red2
...
msccl: remove cases from numReduction switch statement
[ROCm/rccl commit: 022735d208 ]
2023-11-04 02:48:03 -05:00
Bertan Dogancay
2d04f5390d
Remove hip::device ( #954 )
...
[ROCm/rccl commit: 7edb486154 ]
2023-11-03 19:31:00 -06:00
Wenkai Du
aa02d2b675
Use parallel init of LDS and adjust P2P channels for gfx94x ( #943 )
...
* Use parallel init of LDS and adjust P2P channels for gfx94x
* Move another init to parallel
* Fix NCCL_NCHANNELS_PER_PEER setting
[ROCm/rccl commit: dbcba2923b ]
2023-11-03 16:06:49 -07:00
Nusrat Islam
b6f47bad7c
msccl: remove cases from numReduction switch statement
...
[ROCm/rccl commit: f545b94d4b ]
2023-11-03 16:56:51 -05:00
gilbertlee-amd
d8471eaddf
Adding LaunchBench tool ( #952 )
...
[ROCm/rccl commit: d50bab28bf ]
2023-11-03 12:04:52 -06:00
Wenkai Du
1557a1f258
msccl: use 32-bit LDS access and add RCCL_MSCCL_FORCE_FULLOPS ( #953 )
...
[ROCm/rccl commit: bb84345943 ]
2023-11-03 10:38:02 -07:00
akolliasAMD
4cd86b185c
MSCCL stream fix ( #948 )
...
[ROCm/rccl commit: 988efe605a ]
2023-11-03 09:10:52 -06:00
Wenkai Du
297023a7a6
msccl: add templated kernel ( #945 )
...
* msccl: add templated kernel
* Use defines to improve code readability
* Fix kernel indexing and review feedback
[ROCm/rccl commit: f484ff17b9 ]
2023-11-02 17:21:53 -07:00
Nusrat Islam
7c24d9970e
Merge pull request #946 from nusislam/msccl-redop
...
msccl: remove dereference of reduce args
[ROCm/rccl commit: 61aed56ca7 ]
2023-11-02 16:22:24 -05:00
Nusrat Islam
88c8bb1495
msccl: remove dereference of reduce args
...
It can be removed because the msccl kernel will never execute this code
according to the current msccl setup.
[ROCm/rccl commit: 6b80a0d0d4 ]
2023-11-02 13:20:00 -05:00
Wenkai Du
3eeaea3f00
msccl: use atomic to set dependency flags ( #941 )
...
[ROCm/rccl commit: a7400218a2 ]
2023-10-31 14:46:57 -07:00
akolliasAMD
691df735a3
Revert "Introduce allgather for MSCCL on 8 sockets up to 320KB. ( #931 )" ( #939 )
...
This reverts commit 769f00db5c.
[ROCm/rccl commit: 9f02ee8dea ]
2023-10-30 23:52:58 -06:00
Wenkai Du
b736b506c0
NPKit: misc fixes for MSCCL ( #936 )
...
* msccl: add xcc_id to timestamp sync
* NPKit: add timestamp for rrc operator
* NPKit: add timestamp for MSCCL init
[ROCm/rccl commit: a497722894 ]
2023-10-30 10:00:12 -07:00
Nilesh M Negi
8b1254a4f1
Fix gcnArchName bug in topology dump ( #937 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 1e5ca6820b ]
2023-10-28 12:30:36 -05:00
Ziyue Yang
e1dfb82023
Fix MSCCL work FIFO out-of-bound issue ( #935 )
...
[ROCm/rccl commit: 4c117e5335 ]
2023-10-27 11:24:52 -07:00
Nilesh M Negi
b4ba7cc79d
SRC/INIT: fix typo for ENABLE_PROFILING ( #934 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 96ec3ffe2e ]
2023-10-26 23:52:46 -05:00
Nilesh M Negi
706750597c
remove gcnArch support ( #920 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: f22df90e5c ]
2023-10-26 12:09:15 -05:00
Wenkai Du
446c8cbf66
msccl: reduce debug output when using NCCL_DEBUG=INFO ( #932 )
...
[ROCm/rccl commit: fb0eccb57b ]
2023-10-25 08:05:19 -07:00
Wen-Heng (Jack) Chung
769f00db5c
Introduce allgather for MSCCL on 8 sockets up to 320KB. ( #931 )
...
[ROCm/rccl commit: bfb8642450 ]
2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung
89a8493ef8
Introduce allgather MSCCL XML specification for MI250X up to 320KB. ( #930 )
...
[ROCm/rccl commit: 3f9ffe4788 ]
2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung
fc2a13c077
Introduce 1-shot allreduce for MI250X Hayabusa. ( #929 )
...
[ROCm/rccl commit: 72d5fbddfd ]
2023-10-24 16:31:18 -05:00
Wenkai Du
cc4de02a86
Add missing gfx942 support ( #927 )
...
[ROCm/rccl commit: c4e65fd382 ]
2023-10-23 12:04:37 -07:00