Wenkai Du
0ff5fc0bad
npkit: add broadcast trace ( #1166 )
...
[ROCm/rccl commit: a0cef69110 ]
2024-05-07 14:00:16 -07:00
Pak Nin Lui
df3d462dd9
Merge pull request #1167 from paklui/dmabuf
...
fix typo for DMABUF_ENABLE
[ROCm/rccl commit: 92a4fc6204 ]
2024-05-07 08:48:44 -07:00
dependabot[bot]
0d025525ad
Bump jinja2 from 3.1.3 to 3.1.4 in /docs/sphinx ( #1168 )
...
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.3...3.1.4)
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: eb562e7b22 ]
2024-05-06 15:35:34 -06:00
paklui
dd8e937948
fix typo for DMABUF_ENABLE
...
[ROCm/rccl commit: 140b7dd40f ]
2024-05-06 13:27:50 -07:00
Wenkai Du
c782aba364
Bypass NVIDIA Ampere related tuning ( #1165 )
...
[ROCm/rccl commit: b513c3970a ]
2024-05-03 17:57:16 -07:00
Wenkai Du
7c811a7582
Fix ignore NUMA not being observed for NICs during model matching ( #1164 )
...
[ROCm/rccl commit: bb58b1c258 ]
2024-05-03 16:42:07 -07:00
Wenkai Du
9638535690
Fix build error when roctracer-dev package is not installed ( #1161 )
...
[ROCm/rccl commit: 6f5a8ce1fb ]
2024-05-01 13:55:09 -07:00
Wenkai Du
3906e992f8
MSCCL: add support for out-of-place all reduce ( #1156 )
...
[ROCm/rccl commit: 4e1b8c1cbb ]
2024-04-28 19:49:09 -07:00
Wenkai Du
703014e960
Add back tree simple chunk size tuning ( #1157 )
...
[ROCm/rccl commit: cd6e840e0b ]
2024-04-28 19:48:53 -07:00
Nilesh M Negi
b99b89e7a2
[GRAPH] Reduce NCCL_TOPO_MAX_NODES to 64 ( #1153 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: b90436d292 ]
2024-04-27 23:41:11 -05:00
Tim
afeaa17475
Merge pull request #1158 from AtlantaPepsi/NPKit_fix
...
Prevent segfault from npkit-enabled rccl build
[ROCm/rccl commit: cc39e91c6f ]
2024-04-26 12:44:04 -04:00
AtlantaPepsi
8cf28704ce
prevent segfault from npkit-enabled rccl build
...
Signed-off-by: AtlantaPepsi <timhu102@amd.com>
[ROCm/rccl commit: 67246649ac ]
2024-04-26 10:54:27 -05:00
Wenkai Du
3c94f98688
Revert "Use relaxed atomics for LL on GFX11 ( #859 )" ( #1148 )
...
This reverts commit 5983f0e371.
Use inline asm for 128b load on GFX11 for better performance.
[ROCm/rccl commit: f330b82985 ]
2024-04-26 07:49:55 -07:00
Bertan Dogancay
dea5e83940
[UT] Start supporting multiple group calls and graphs ( #1151 )
...
* Start supporting multiple group calls UT
[ROCm/rccl commit: 0ec41f1386 ]
2024-04-25 11:11:16 -06:00
Shilei Tian
9a203f439c
SWDEV-455705: Fix a UB that could lead to miscompilation ( #1155 )
...
[ROCm/rccl commit: efe99057b0 ]
2024-04-25 10:10:01 -07:00
Wenkai Du
e494f29235
Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ ( #1154 )
...
[ROCm/rccl commit: 9e0c9b4ed8 ]
2024-04-25 07:19:18 -07:00
Bertan Dogancay
ed152c5b89
Update CHANGELOG.md for RCCL 2.20.5 ( #1150 )
...
[ROCm/rccl commit: dcc75797a1 ]
2024-04-24 09:07:49 -06:00
Bertan Dogancay
2ad3fee222
Merge pull request #1111 from BertanDogancay/2.20
...
2.20.5 Sync
[ROCm/rccl commit: 8753bec3ea ]
2024-04-24 09:05:41 -06:00
BertanDogancay
36f9492cda
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: e1a835910e ]
2024-04-23 13:34:00 -07:00
Wenkai Du
35f8d269f8
Use hipExtMallocWithFlags to allocate host memory on APU ( #1149 )
...
Also use SM60 as CUDA compatibility level.
[ROCm/rccl commit: 220066197a ]
2024-04-17 16:56:38 -07:00
corey-derochie-amd
34fb1007a7
Updated CHANGELOG for next release ( #1146 )
...
* Updated CHANGELOG to release for ROCm 6.1.0 (#1142)
* Fixed missing CHANGELOG notes from ROCm 5.5 through unreleased 6.1 (#1141)
* Update CHANGELOG.md for ROCm release 5.5
(cherry picked from commit 83342e865445b233319466d4a620c1166ecaf181)
* Update CHANGELOG.md for ROCm 5.7.0
(cherry picked from commit a7c3b8dcb5cd0654f0a39cb3be4fdf7e8c820577)
* Added ROCm 6.0 and 6.1 CHANGELOG notes.
---------
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
(cherry picked from commit 28a2b09304)
* Updated CHANGELOG to release for ROCm 6.1.0
* Removed empty sections from CHANGELOG in latest releases.
(cherry picked from commit 164c9553717f2c3bce86a372764ea73030dd5f72)
* Reverted ROCm 6.1.0 block to "Unreleased"
[ROCm/rccl commit: a14137c062 ]
2024-04-15 16:29:40 -06:00
corey-derochie-amd
fa5d8d7a6b
Created PR template for the rccl repo ( #1118 )
...
[ROCm/rccl commit: 8f471ba537 ]
2024-04-15 15:34:42 -06:00
gilbertlee-amd
422a7ffcbb
Rail optimization for rings ( #1140 )
...
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
[ROCm/rccl commit: 4cb62f999a ]
2024-04-15 12:03:57 -06:00
Bertan Dogancay
8ddb74e3b1
Add unique files to source list ( #1144 )
...
[ROCm/rccl commit: 3caad91f32 ]
2024-04-15 09:46:53 -06:00
dependabot[bot]
fb20f695ca
Bump idna from 3.4 to 3.7 in /docs/sphinx ( #1143 )
...
Bumps [idna](https://github.com/kjd/idna) from 3.4 to 3.7.
- [Release notes](https://github.com/kjd/idna/releases)
- [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.rst)
- [Commits](https://github.com/kjd/idna/compare/v3.4...v3.7)
---
updated-dependencies:
- dependency-name: idna
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: c50eaddc28 ]
2024-04-12 09:28:39 -06:00
corey-derochie-amd
28a2b09304
Fixed missing CHANGELOG notes from ROCm 5.5 through unreleased 6.1 ( #1141 )
...
* Update CHANGELOG.md for ROCm release 5.5
(cherry picked from commit 83342e865445b233319466d4a620c1166ecaf181)
* Update CHANGELOG.md for ROCm 5.7.0
(cherry picked from commit a7c3b8dcb5cd0654f0a39cb3be4fdf7e8c820577)
* Added ROCm 6.0 and 6.1 CHANGELOG notes.
---------
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
[ROCm/rccl commit: 3361abe786 ]
2024-04-11 15:04:40 -06:00
mberenjk
da835cff9c
replacing rccl_bfloat16 with hip_bfloat16 ( #1126 )
...
Co-authored-by: mberenjk <mberenjk@amd.com>
[ROCm/rccl commit: 428837ffe4 ]
2024-04-11 11:30:37 -05:00
dependabot[bot]
165d51b255
Bump rocm-docs-core from 0.38.0 to 0.38.1 in /docs/sphinx ( #1139 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.38.0 to 0.38.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.38.0...v0.38.1)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: d3899c0581 ]
2024-04-11 09:32:54 -06:00
arvindcheru
2c0284885a
Update Depends with correct HIP Runtime package name ( #1130 )
...
[ROCm/rccl commit: c1b8eab8e1 ]
2024-04-09 19:27:07 -04:00
Wenkai Du
99c7fc29ba
NPKit: doubling size of event buffers following MAXCHANNELS change ( #1135 )
...
[ROCm/rccl commit: 0ce68f21d4 ]
2024-04-09 08:02:58 -07:00
Wenkai Du
0941d6bc6e
Fix buffer overflow when parsing kernel cmdline ( #1133 )
...
[ROCm/rccl commit: 137571fa01 ]
2024-04-08 11:12:20 -07:00
gilbertlee-amd
62b9f0d3a7
[topo_expl] Adding -n option to override number of nodes ( #1134 )
...
[ROCm/rccl commit: 93982533d7 ]
2024-04-04 15:11:47 -06:00
Wenkai Du
890fafc2f7
rccl_prim_test: increase max number of workgroups and test iterations ( #1132 )
...
[ROCm/rccl commit: e8c76fd806 ]
2024-04-03 11:29:21 -07:00
dependabot[bot]
d6aba883d4
Bump rocm-docs-core from 0.37.0 to 0.38.0 in /docs/sphinx ( #1127 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.37.0 to 0.38.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.37.0...v0.38.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: d0d1bfdeda ]
2024-03-27 11:24:30 -06:00
arvindcheru
a285fda3a1
Static Build update - Moved all cmake install() to rocm-cmake APIs, static build update ( #1123 )
...
[ROCm/rccl commit: c0a51dc84b ]
2024-03-26 11:11:09 -04:00
corey-derochie-amd
62a6a07d49
Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. ( #1125 )
...
[ROCm/rccl commit: 503a472a25 ]
2024-03-25 16:29:13 -06:00
corey-derochie-amd
19897f8d90
Fixes the copyright comment block on each of topo_expl/models/*.xml. The format was not valid XML. ( #1124 )
...
[ROCm/rccl commit: 9eefc68cb5 ]
2024-03-25 16:21:17 -06:00
Wenkai Du
43bbee4dcc
Remove hipEventDisableSystemFence ( #1122 )
...
There is no indication that disabling the system fence improves latency.
Removing it per the recommendation from HIP.
[ROCm/rccl commit: 5976f757dd ]
2024-03-25 08:01:57 -07:00
Pedram Alizadeh
61f89d680d
msccl algorithms tuning for alltoall on MI300 ( #1120 )
...
Co-authored-by: PedramAlizadeh <amd@pmohamma.com>
[ROCm/rccl commit: c2fc1d6809 ]
2024-03-21 20:35:29 -04:00
corey-derochie-amd
9c2a57259d
Added @corey-derochie-amd as a code owner (to rocm-documentation) ( #1119 )
...
[ROCm/rccl commit: 606d3e6b6e ]
2024-03-21 14:56:05 -06:00
dependabot[bot]
d956fe9cbd
Bump rocm-docs-core from 0.36.0 to 0.37.0 in /docs/sphinx ( #1117 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.36.0 to 0.37.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.36.0...v0.37.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: cb80586fb9 ]
2024-03-20 09:25:14 -06:00
jbachan
b492ab6313
Merge pull request #1217 from crazy-JiangDongHua/bugfix_undo_plan
...
Fixes a bug in the plan enqueue logic where plans could silently fail to launch for some communicators. Triggered when all of the following are true:
1. Multiple communicators per ncclGroup.
2. Communicators within a group have different plan counts.
3. Intra-process launch barrier disabled.
[ROCm/rccl commit: 6dd51f15bf ]
2024-03-18 10:12:26 -07:00
Nilesh M Negi
f93831cf6a
BUILD: Enable RCCL static build ( #1114 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 53fad75001 ]
2024-03-15 12:18:18 -05:00
srawat
7c8cf72d35
refactor RCCL ( #1112 )
...
* refactor RCCL
* rccl updates
* Update index.rst
* refactor
* Update what-is-rccl.rst
[ROCm/rccl commit: 45ee5734dd ]
2024-03-15 14:14:47 +05:30
Pedram Alizadeh
17b9546da9
msccl algorithms tuning for allgather on MI300 ( #1110 )
...
[ROCm/rccl commit: 50f22e8317 ]
2024-03-14 12:18:26 -04:00
dependabot[bot]
7e22922051
Bump rocm-docs-core from 0.35.1 to 0.36.0 in /docs/sphinx ( #1109 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.35.1 to 0.36.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.35.1...v0.36.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 0867562b18 ]
2024-03-12 09:38:20 -06:00
FrankJ
894d3459a5
[bugfix] save undo plans in some cases
...
[ROCm/rccl commit: 9ef920a77b ]
2024-03-12 00:00:16 +08:00
Andy li
e373bd44bf
Enable fp8 support ( #1101 )
...
* initial checkin
* resolve cr comments
* resolve the build issue
* fix the data correctness issue
* update fp8 header file and update the unit test for fp8 support
* remove fp16 from fp8 headers
* fix ut issue and catch up the latest code from develop
* update according to cr comments
* update ut according to cr comments
* update num floats for each SumPostDiv from 4 to 6
* update fp8 header file name
* fix the typo
[ROCm/rccl commit: 6777e65c1d ]
2024-03-08 15:17:53 -08:00
Wenkai Du
2354601589
Improve debug messages of memory allocations ( #1107 )
...
[ROCm/rccl commit: ff951e607d ]
2024-03-08 10:55:10 -08:00
Wenkai Du
c2eff3ecd9
topo_expl: 2.19.4 update and fix build error ( #1098 )
...
[ROCm/rccl commit: d2224fd3e1 ]
2024-03-07 08:52:50 -08:00