Wenkai Du
43bbee4dcc
Remove hipEventDisableSystemFence ( #1122 )
...
There is no indication that disabling the system fence improves latency.
Removing it per recommendation from HIP.
[ROCm/rccl commit: 5976f757dd ]
2024-03-25 08:01:57 -07:00
Pedram Alizadeh
61f89d680d
msccl algorithms tuning for alltoall on MI300 ( #1120 )
...
Co-authored-by: PedramAlizadeh <amd@pmohamma.com >
[ROCm/rccl commit: c2fc1d6809 ]
2024-03-21 20:35:29 -04:00
corey-derochie-amd
9c2a57259d
Added @corey-derochie-amd as a code owner (to rocm-documentation) ( #1119 )
...
[ROCm/rccl commit: 606d3e6b6e ]
2024-03-21 14:56:05 -06:00
dependabot[bot]
d956fe9cbd
Bump rocm-docs-core from 0.36.0 to 0.37.0 in /docs/sphinx ( #1117 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.36.0 to 0.37.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.36.0...v0.37.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: cb80586fb9 ]
2024-03-20 09:25:14 -06:00
Nilesh M Negi
f93831cf6a
BUILD: Enable RCCL static build ( #1114 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 53fad75001 ]
2024-03-15 12:18:18 -05:00
srawat
7c8cf72d35
refactor RCCL ( #1112 )
...
* refactor RCCL
* rccl updates
* Update index.rst
* refactor
* Update what-is-rccl.rst
[ROCm/rccl commit: 45ee5734dd ]
2024-03-15 14:14:47 +05:30
Pedram Alizadeh
17b9546da9
msccl algorithms tuning for allgather on MI300 ( #1110 )
...
[ROCm/rccl commit: 50f22e8317 ]
2024-03-14 12:18:26 -04:00
dependabot[bot]
7e22922051
Bump rocm-docs-core from 0.35.1 to 0.36.0 in /docs/sphinx ( #1109 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.35.1 to 0.36.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.35.1...v0.36.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 0867562b18 ]
2024-03-12 09:38:20 -06:00
Andy li
e373bd44bf
Enable fp8 support ( #1101 )
...
* initial checkin
* resolve cr comments
* resolve the build issue
* fix the data correctness issue
* update fp8 header file and update the unit test for fp8 support
* remove fp16 from fp8 headers
* fix ut issue and catch up the latest code from develop
* update according to cr comments
* update ut according to cr comments
* update num floats for each SumPostDiv from 4 to 6
* update fp8 header file name
* fix the typo
[ROCm/rccl commit: 6777e65c1d ]
2024-03-08 15:17:53 -08:00
Wenkai Du
2354601589
Improve debug messages of memory allocations ( #1107 )
...
[ROCm/rccl commit: ff951e607d ]
2024-03-08 10:55:10 -08:00
Wenkai Du
c2eff3ecd9
topo_expl: 2.19.4 update and fix build error ( #1098 )
...
[ROCm/rccl commit: d2224fd3e1 ]
2024-03-07 08:52:50 -08:00
Wenkai Du
6dd45024f8
msccl: fix scratch memory allocation after API change ( #1103 )
...
[ROCm/rccl commit: 77615cce28 ]
2024-03-06 11:11:04 -08:00
dependabot[bot]
64e4e20da5
Bump rocm-docs-core from 0.35.0 to 0.35.1 in /docs/sphinx ( #1100 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.35.0 to 0.35.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.35.0...v0.35.1 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 1f7b6e18d7 ]
2024-03-06 11:15:33 -07:00
yhuiYH
45c166554d
Merge pull request #1099 from ROCm/LisaDelaney-patch-1
...
link fix
[ROCm/rccl commit: 12441e8f6c ]
2024-03-05 13:54:04 -05:00
Lisa
d067682641
link fix
...
[ROCm/rccl commit: a032cb9eeb ]
2024-03-05 09:01:10 -07:00
Bertan Dogancay
1dfe5cca64
Fix bug when configuring for only LL128 ( #1097 )
...
[ROCm/rccl commit: a279e7f32d ]
2024-03-01 18:09:39 -07:00
Wenkai Du
e5aedb153e
Add support for using contiguous memory for GPU direct RDMA ( #1096 )
...
Enabled by env var RCCL_NET_CONTIGUOUS_MEM=1
[ROCm/rccl commit: cbd955627e ]
2024-02-29 10:06:43 -08:00
Wenkai Du
058886cb20
Add another Rome model ( #1095 )
...
[ROCm/rccl commit: df98a6957d ]
2024-02-28 10:46:05 -08:00
Bertan Dogancay
cee279fd99
Implement ROCTX ( #1094 )
...
* Implement roctx
[ROCm/rccl commit: b617aecc31 ]
2024-02-27 15:46:15 -07:00
dependabot[bot]
d0a346a738
Bump rocm-docs-core from 0.34.2 to 0.35.0 in /docs/sphinx ( #1092 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.34.2 to 0.35.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.34.2...v0.35.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: dae6df6d16 ]
2024-02-26 16:57:14 -07:00
dependabot[bot]
0272742733
Bump cryptography from 42.0.2 to 42.0.4 in /docs/sphinx ( #1090 )
...
Bumps [cryptography](https://github.com/pyca/cryptography ) from 42.0.2 to 42.0.4.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pyca/cryptography/compare/42.0.2...42.0.4 )
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: beb1e487ad ]
2024-02-26 16:47:14 -07:00
Tim
826d20495f
Adding FP16 cases to unit tests ( #1093 )
...
Signed-off-by: Tim Hu <timhu102@amd.com >
[ROCm/rccl commit: 0d06b0f1de ]
2024-02-26 12:08:04 -05:00
Wenkai Du
874998033f
Add new GPU model ( #1080 )
...
[ROCm/rccl commit: 74f9e5db64 ]
2024-02-23 12:19:42 -08:00
Wenkai Du
4b31894d70
Update RCCL/MSCCL work FIFO depth to 256K ( #1091 )
...
[ROCm/rccl commit: c5ab37211b ]
2024-02-21 17:15:11 -08:00
Bertan Dogancay
4b4bdd904e
LL128 check if all XGMI ( #1089 )
...
[ROCm/rccl commit: b275ed0b56 ]
2024-02-21 09:41:40 -07:00
Pedram Alizadeh
bf48d1bc4d
msccl algorithms tuning for allreduce on MI300 ( #1088 )
...
[ROCm/rccl commit: 5a0f9990a9 ]
2024-02-21 11:31:56 -05:00
dependabot[bot]
3cd03179cb
Bump cryptography from 42.0.0 to 42.0.2 in /docs/sphinx ( #1087 )
...
Bumps [cryptography](https://github.com/pyca/cryptography ) from 42.0.0 to 42.0.2.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pyca/cryptography/compare/42.0.0...42.0.2 )
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: b7e3f1da14 ]
2024-02-20 15:03:10 -07:00
dependabot[bot]
46ada18646
Bump rocm-docs-core from 0.34.0 to 0.34.2 in /docs/sphinx ( #1086 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.34.0 to 0.34.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.34.0...v0.34.2 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 7e47a77339 ]
2024-02-16 11:21:27 -07:00
Bertan Dogancay
32e1c8cba0
Merge pull request #1079 from BertanDogancay/2.19.4-sync
...
2.19.4 Sync
[ROCm/rccl commit: 2fb12a9358 ]
2024-02-16 09:50:11 -07:00
BertanDogancay
24d9e1c36b
Increase max stack size when ll128 enabled
...
[ROCm/rccl commit: b098120c40 ]
2024-02-15 15:56:59 -08:00
akolliasAMD
e0dd21028f
Allow bus id to be null ( #1085 )
...
* Allow bus id to be null
[ROCm/rccl commit: bac57421c7 ]
2024-02-15 16:36:51 -07:00
BertanDogancay
ef72944015
Disable unsupported ld/st instructions
...
[ROCm/rccl commit: 6f3310605c ]
2024-02-15 13:58:16 -08:00
BertanDogancay
7842411fb3
Merge remote-tracking branch 'rccl/develop' into 2.19.4
...
[ROCm/rccl commit: 76f83f95ab ]
2024-02-15 13:37:14 -08:00
akolliasAMD
5d44815d95
Npkit updates ( #1084 )
...
* changed warmup runs to be opt-in
[ROCm/rccl commit: 16d7f372b7 ]
2024-02-15 07:48:45 -07:00
Wenkai Du
c4e9e2b18a
Use native half without conversion ( #1083 )
...
[ROCm/rccl commit: 51003c9980 ]
2024-02-13 16:57:34 -08:00
Wenkai Du
2f14acf770
Fix undefined symbol when nvtx is not enabled ( #1082 )
...
[ROCm/rccl commit: 1f0af90206 ]
2024-02-13 14:03:43 -08:00
Bertan Dogancay
bee47d9e91
Add stack size UT ( #1081 )
...
* Add stack size UT
[ROCm/rccl commit: dc2d486ba0 ]
2024-02-12 17:56:15 -07:00
BertanDogancay
de6f20b7ae
Fix docs
...
[ROCm/rccl commit: 32cca51894 ]
2024-02-11 22:32:55 -08:00
Wenkai Du
d5f5091e5d
Merge remote-tracking branch 'rccl/develop' into 2.19.4
...
[ROCm/rccl commit: d999d9ad21 ]
2024-02-09 11:31:03 -06:00
Wenkai Du
6775a75906
2.18.5 fix ( #1077 )
...
* Revert "Revert "2.18.5-1""
This reverts commit 7cc572ecf9.
* Fix initial net device value
[ROCm/rccl commit: 5669b0d7b6 ]
2024-02-09 09:18:38 -08:00
dependabot[bot]
59eac59cea
Bump rocm-docs-core from 0.33.2 to 0.34.0 in /docs/sphinx ( #1078 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.33.2 to 0.34.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.2...v0.34.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 3e505a991c ]
2024-02-09 10:12:07 -07:00
Bertan Dogancay
45ed3ef4e7
Nvtx support ( #1076 )
...
* NVTX support
[ROCm/rccl commit: 8a442faa12 ]
2024-02-08 14:08:24 -07:00
Wenkai Du
1538b908ac
msccl: use relaxed atomics on scratch buffer ( #1075 )
...
[ROCm/rccl commit: 5257c753c5 ]
2024-02-08 12:09:56 -08:00
dependabot[bot]
ff2be03272
Bump rocm-docs-core from 0.33.1 to 0.33.2 in /docs/sphinx ( #1073 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.33.1 to 0.33.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.1...v0.33.2 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: be45f0effd ]
2024-02-08 09:26:47 -07:00
Wenkai Du
ce39eefe65
Doubling P2P channels per peer on single node gfx94x only ( #1074 )
...
[ROCm/rccl commit: 704c9ef0d1 ]
2024-02-07 14:05:57 -08:00
dependabot[bot]
c0745fe0b8
Bump rocm-docs-core from 0.33.0 to 0.33.1 in /docs/sphinx ( #1071 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.33.0 to 0.33.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.0...v0.33.1 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: a9214032fc ]
2024-02-06 16:00:30 -07:00
dependabot[bot]
b6868a1573
Bump cryptography from 41.0.6 to 42.0.0 in /docs/sphinx ( #1070 )
...
Bumps [cryptography](https://github.com/pyca/cryptography ) from 41.0.6 to 42.0.0.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pyca/cryptography/compare/41.0.6...42.0.0 )
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: ca007ddad3 ]
2024-02-06 15:59:52 -07:00
Wenkai Du
57e508f2e4
Doubling P2P channels per peer on single node only ( #1069 )
...
[ROCm/rccl commit: 1d989f6524 ]
2024-02-02 12:41:00 -08:00
Wenkai Du
e319d0a49d
Merge remote-tracking branch 'rccl/develop' into HEAD
...
[ROCm/rccl commit: e64324a64a ]
2024-02-01 12:17:09 -06:00
Nilesh M Negi
f23716de80
Enable kernarg preloading for ROCm 6.1 ( #1068 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 2458f158b1 ]
2024-02-01 12:14:04 -06:00