Wenkai Du
cfc04a8aef
p2p-latency-tests: fix build by switching to gcnArchName ( #1030 )
...
* p2p-latency-tests: fix build by switching to gcnArchName
* rccl-prim-test: switch to gcnArchName
2024-01-04 13:36:48 -08:00
Wenkai Du
abf265a911
Rework barriers and adjust scope of atomics ( #1019 )
2024-01-04 08:18:48 -08:00
Ziyue Yang
0a53077c9c
Improve MSCCL algorithms ( #1023 )
2024-01-03 14:51:34 -08:00
akolliasAMD
f4858e14b2
rearranged how the min and max functions are part of msccl ( #1025 )
...
* rearranged how the min and max functions are part of msccl
* added more coverage on in place graph tests
2023-12-21 08:58:33 -07:00
dependabot[bot]
7e1cbb440d
Bump rocm-docs-core from 0.30.2 to 0.30.3 in /docs/sphinx ( #1024 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.2 to 0.30.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.2...v0.30.3)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-20 10:37:13 -07:00
dependabot[bot]
d8c53e90d7
Bump rocm-docs-core from 0.30.1 to 0.30.2 in /docs/sphinx ( #1021 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.1 to 0.30.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.1...v0.30.2)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-19 13:34:37 -07:00
akolliasAMD
a924454f0f
CMake does not allow capital letters in package names ( #1020 )
2023-12-15 12:39:17 -07:00
Ziyue Yang
655742a3a6
Fully disable MSCCL when machine is not matched ( #1017 )
...
* Disable MSCCL algorithm meta loading when machine is not matched
* fully disable init
* fix potential segfault
2023-12-13 08:36:21 -08:00
Wenkai Du
53d807a5b9
msccl: disable on multi-node ( #1018 )
2023-12-13 07:41:40 -08:00
Wenkai Du
81602814a7
msccl: fix data corruption with MTYPE_RW ( #1014 )
2023-12-11 20:33:15 -08:00
Bertan Dogancay
fca459baaf
correct package name ( #1012 )
2023-12-11 09:40:29 -07:00
Wenkai Du
7965c8b53c
Fix memory fence and use non-temporal store ( #1007 )
...
* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x
2023-12-09 12:16:08 -08:00
Ziyue Yang
c002f20029
Fix MSCCL scratch allocation ( #1010 )
2023-12-08 17:47:10 -06:00
Ziyue Yang
bb144dcd50
Tune MSCCL all-reduce algorithm ( #1009 )
2023-12-08 17:47:02 -06:00
Wen-Heng (Jack) Chung
baadda4bd8
Relax workgroup barrier implementation for MSCCL send/recv ops. ( #997 )
...
* Trim logic.
* Revert "Trim logic."
This reverts commit 8f2dba6c764108acf2bf5428366b9f41d4d206b9.
* Introduce MSCCL template parameters to send / recv.
* Address review feedback.
2023-12-08 17:46:53 -06:00
Wenkai Du
12c08fc52a
msccl: build same number of kernels as in ROCm 5.7 ( #1005 )
...
Removed fullOps kernels from build
2023-12-07 13:36:04 -06:00
dependabot[bot]
9c3fea1751
Bump rocm-docs-core from 0.29.0 to 0.30.1 in /docs/sphinx ( #1008 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.29.0 to 0.30.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.29.0...v0.30.1)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-07 10:32:06 -07:00
Wen-Heng (Jack) Chung
8e8323252a
Let 320KB message size use LL protocol. ( #1006 )
2023-12-06 18:14:31 -06:00
Wen-Heng (Jack) Chung
293f0fb752
Use a map to host scratch buffers ( #1004 )
...
* Use a map to host scratch buffers
* Address review feedback. Deliberately keep mscclSetupScratch function.
2023-12-05 13:15:28 -06:00
Nilesh M Negi
bc44e3faa7
Fix gcnArch bug in IFC mix build ( #998 ) ( #1002 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-12-04 16:20:22 -06:00
Bertan Dogancay
7c0f49a878
IFC mix build ( #998 )
2023-12-02 18:49:52 -07:00
Wenkai Du
4ba65d1d6a
Increase max channels to 64 ( #993 )
2023-12-01 16:01:11 -08:00
pradeep-ramanna
0b53f79196
Fix GPU to NIC mapping for peertopeer ( #994 )
2023-12-01 08:00:17 -08:00
Ziyue Yang
e44e112a17
Fix mscclAlgoHandle not initialized issue ( #995 )
2023-12-01 07:58:01 -08:00
dependabot[bot]
ddd5c07b56
Bump cryptography from 41.0.4 to 41.0.6 in /docs/sphinx ( #985 )
...
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.4...41.0.6)
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-30 15:05:50 -07:00
Ziyue Yang
4bb0b4a380
Move MSCCL algorithm loading to initialization to work around HIP graph conflict ( #982 )
...
* MSCCL: pre-specify channels and pre-load algorithms
* add mutex
* fix bug
* clean include
* disable all-gathers temporarily
2023-11-30 09:47:20 -08:00
Bertan Dogancay
20b02af19b
Renaming unit-tests package ( #987 )
2023-11-29 15:05:32 -07:00
dependabot[bot]
10a7cb7556
Bump rocm-docs-core from 0.28.0 to 0.29.0 in /docs/sphinx ( #980 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.28.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.28.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 09:39:33 -07:00
akolliasAMD
56ce9ef05f
recreated pr 914 to work with current develop branch ( #979 )
2023-11-28 16:33:47 -07:00
akolliasAMD
c71bae1608
npkit trace script now syncs the average difference per rank ( #981 )
2023-11-28 11:03:55 -07:00
gilbertlee-amd
213869a6b4
JitterBench ( #975 )
2023-11-23 11:14:11 -07:00
Wenkai Du
50b2dd9fd7
Add special handling of gfx940 ( #976 )
...
* Add special handling of gfx940
* Update ring base
2023-11-22 15:07:36 -08:00
dependabot[bot]
b1c746e7b5
Bump rocm-docs-core from 0.27.0 to 0.28.0 in /docs/sphinx ( #969 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.28.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.27.0...v0.28.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-22 08:58:26 -07:00
Wenkai Du
569d3f7d59
msccl: allocate scratch as ext-scope fine-grained ( #968 )
2023-11-16 09:57:25 -06:00
searlmc1
15fa77bb57
Update README.md ( #955 )
...
Remove references to HCC, which was removed from ROCm ~2yrs ago
2023-11-15 18:01:45 -08:00
Wenkai Du
bc8661f092
Fix kernel command line warnings ( #961 )
...
* Fix kernel command line warnings
* Remove while loop
2023-11-15 18:01:12 -08:00
Ziyue Yang
7fc891bc8d
Fix MSCCL work FIFO allocation with HIP graph enabled ( #967 )
2023-11-15 16:43:28 -08:00
Bertan Dogancay
198f14923b
Check to support older ROCm versions ( #963 )
2023-11-15 12:36:31 -07:00
Ziyue Yang
7ae95db5b8
Optimize MSCCL all-gather algorithms for gfx942 ( #964 )
2023-11-15 08:18:59 -08:00
Ziyue Yang
df128879a6
Optimize MSCCL reduce primitive switching for gfx942 ( #962 )
...
* Optimize reduce primitive switching for gfx942
* address comment
2023-11-15 08:18:44 -08:00
Wenkai Du
5a800e00cd
msccl: enable basic collective trace ( #959 )
...
To avoid increasing number of kernels, colltrace is only enabled with
RCCL_MSCCL_FORCE_FULLOPS=1
2023-11-08 20:14:28 -08:00
Bertan Dogancay
8e0258a73d
Revert "Remove hip::device ( #954 )" ( #956 )
...
This reverts commit 7edb486154.
2023-11-07 13:40:41 -07:00
Wen-Heng (Jack) Chung
efc42d9045
Use send instead of sendWithBarrier. ( #727 )
2023-11-07 13:47:24 -06:00
dependabot[bot]
7291144c94
Bump rocm-docs-core from 0.26.0 to 0.27.0 in /docs/sphinx ( #951 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.26.0...v0.27.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-07 10:01:16 -07:00
Nusrat Islam
022735d208
Merge pull request #950 from nusislam/msccl-red2
...
msccl: remove cases from numReduction switch statement
2023-11-04 02:48:03 -05:00
Bertan Dogancay
7edb486154
Remove hip::device ( #954 )
2023-11-03 19:31:00 -06:00
Wenkai Du
dbcba2923b
Use parallel init of LDS and adjust P2P channels for gfx94x ( #943 )
...
* Use parallel init of LDS and adjust P2P channels for gfx94x
* Move another init to parallel
* Fix NCCL_NCHANNELS_PER_PEER setting
2023-11-03 16:06:49 -07:00
Nusrat Islam
f545b94d4b
msccl: remove cases from numReduction switch statement
2023-11-03 16:56:51 -05:00
gilbertlee-amd
d50bab28bf
Adding LaunchBench tool ( #952 )
2023-11-03 12:04:52 -06:00
Wenkai Du
bb84345943
msccl: use 32-bit LDS access and add RCCL_MSCCL_FORCE_FULLOPS ( #953 )
2023-11-03 10:38:02 -07:00