Pedram Alizadeh
73acf3eeec
modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X ( #1172 )
2024-05-08 15:49:33 -04:00
mberenjk
408278209d
Adding ASAN changes to address memory leak issue ( #1170 )
...
Co-authored-by: akolliasAMD <akollias@amd.com >
2024-05-08 09:16:00 -05:00
Wenkai Du
b18784d8b8
Add compiler warning for uninitialized variable and fix ( #1163 )
...
* Add compiler warning for uninitialized variable and fix
* Add -Wsometimes-uninitialized
* Convert warning to error
2024-05-08 07:00:25 -07:00
Wenkai Du
f679db6ff6
Use normal permute path when one NIC per GPU ( #1171 )
2024-05-08 06:59:57 -07:00
Wenkai Du
a0cef69110
npkit: add broadcast trace ( #1166 )
2024-05-07 14:00:16 -07:00
Pak Nin Lui
92a4fc6204
Merge pull request #1167 from paklui/dmabuf
...
fix typo for DMABUF_ENABLE
2024-05-07 08:48:44 -07:00
dependabot[bot]
eb562e7b22
Bump jinja2 from 3.1.3 to 3.1.4 in /docs/sphinx ( #1168 )
...
Bumps [jinja2](https://github.com/pallets/jinja ) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases )
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst )
- [Commits](https://github.com/pallets/jinja/compare/3.1.3...3.1.4 )
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-06 15:35:34 -06:00
paklui
140b7dd40f
fix typo for DMABUF_ENABLE
2024-05-06 13:27:50 -07:00
Wenkai Du
b513c3970a
Bypass NVIDIA Ampere related tuning ( #1165 )
2024-05-03 17:57:16 -07:00
Wenkai Du
bb58b1c258
Fix ignore NUMA not being observed for NICs during model matching ( #1164 )
2024-05-03 16:42:07 -07:00
Wenkai Du
6f5a8ce1fb
Fix build error when roctracer-dev package is not installed ( #1161 )
2024-05-01 13:55:09 -07:00
Wenkai Du
4e1b8c1cbb
MSCCL: add support for out-of-place all reduce ( #1156 )
2024-04-28 19:49:09 -07:00
Wenkai Du
cd6e840e0b
Add back tree simple chunk size tuning ( #1157 )
2024-04-28 19:48:53 -07:00
Nilesh M Negi
b90436d292
[GRAPH] Reduce NCCL_TOPO_MAX_NODES to 64 ( #1153 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-04-27 23:41:11 -05:00
Tim
cc39e91c6f
Merge pull request #1158 from AtlantaPepsi/NPKit_fix
...
Prevent segfault from npkit-enabled rccl build
2024-04-26 12:44:04 -04:00
AtlantaPepsi
67246649ac
prevent segfault from npkit-enabled rccl build
...
Signed-off-by: AtlantaPepsi <timhu102@amd.com >
2024-04-26 10:54:27 -05:00
Wenkai Du
f330b82985
Revert "Use relaxed atomics for LL on GFX11 ( #859 )" ( #1148 )
...
This reverts commit 6a0a6a37d9 .
Use inline asm for 128b load on GFX11 for better performance.
2024-04-26 07:49:55 -07:00
Bertan Dogancay
0ec41f1386
[UT] Start supporting multiple group calls and graphs ( #1151 )
...
* Start supporting multiple group calls UT
2024-04-25 11:11:16 -06:00
Shilei Tian
efe99057b0
SWDEV-455705: Fix UB that could lead to miscompilation ( #1155 )
2024-04-25 10:10:01 -07:00
Wenkai Du
9e0c9b4ed8
Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ ( #1154 )
2024-04-25 07:19:18 -07:00
Bertan Dogancay
dcc75797a1
Update CHANGELOG.md for RCCL 2.20.5 ( #1150 )
2024-04-24 09:07:49 -06:00
Bertan Dogancay
8753bec3ea
Merge pull request #1111 from BertanDogancay/2.20
...
2.20.5 Sync
2024-04-24 09:05:41 -06:00
BertanDogancay
e1a835910e
Merge remote-tracking branch 'nccl/master' into develop
2024-04-23 13:34:00 -07:00
Wenkai Du
220066197a
Use hipExtMallocWithFlags to allocate host memory on APU ( #1149 )
...
Also use SM60 as CUDA compatibility level.
2024-04-17 16:56:38 -07:00
corey-derochie-amd
a14137c062
Updated CHANGELOG for next release ( #1146 )
...
* Updated CHANGELOG to release for ROCm 6.1.0 (#1142 )
* Fixed missing CHANGELOG notes from ROCm 5.5 through unreleased 6.1 (#1141 )
* Update CHANGELOG.md for ROCm release 5.5
(cherry picked from commit 975327be45f2313dc7249f9c54ad90870e833a4a)
* Update CHANGELOG.md for ROCm 5.7.0
(cherry picked from commit ac8db8d8e0853f1783c10e2858f6c3b86e4d27cb)
* Added ROCm 6.0 and 6.1 CHANGELOG notes.
---------
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com >
(cherry picked from commit 3361abe786 )
* Updated CHANGELOG to release for ROCm 6.1.0
* Removed empty sections from CHANGELOG in latest releases.
(cherry picked from commit 164c9553717f2c3bce86a372764ea73030dd5f72)
* Reverted ROCm 6.1.0 block to "Unreleased"
2024-04-15 16:29:40 -06:00
corey-derochie-amd
8f471ba537
Created PR template for the rccl repo ( #1118 )
2024-04-15 15:34:42 -06:00
gilbertlee-amd
4cb62f999a
Rail optimization for rings ( #1140 )
...
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
2024-04-15 12:03:57 -06:00
Bertan Dogancay
3caad91f32
Add unique files to source list ( #1144 )
2024-04-15 09:46:53 -06:00
dependabot[bot]
c50eaddc28
Bump idna from 3.4 to 3.7 in /docs/sphinx ( #1143 )
...
Bumps [idna](https://github.com/kjd/idna ) from 3.4 to 3.7.
- [Release notes](https://github.com/kjd/idna/releases )
- [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.rst )
- [Commits](https://github.com/kjd/idna/compare/v3.4...v3.7 )
---
updated-dependencies:
- dependency-name: idna
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-12 09:28:39 -06:00
corey-derochie-amd
3361abe786
Fixed missing CHANGELOG notes from ROCm 5.5 through unreleased 6.1 ( #1141 )
...
* Update CHANGELOG.md for ROCm release 5.5
(cherry picked from commit 975327be45f2313dc7249f9c54ad90870e833a4a)
* Update CHANGELOG.md for ROCm 5.7.0
(cherry picked from commit ac8db8d8e0853f1783c10e2858f6c3b86e4d27cb)
* Added ROCm 6.0 and 6.1 CHANGELOG notes.
---------
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com >
2024-04-11 15:04:40 -06:00
mberenjk
428837ffe4
replacing rccl_bfloat16 with hip_bfloat16 ( #1126 )
...
Co-authored-by: mberenjk <mberenjk@amd.com >
2024-04-11 11:30:37 -05:00
dependabot[bot]
d3899c0581
Bump rocm-docs-core from 0.38.0 to 0.38.1 in /docs/sphinx ( #1139 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.38.0 to 0.38.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.38.0...v0.38.1 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-11 09:32:54 -06:00
arvindcheru
c1b8eab8e1
Update Depends with correct HIP Runtime package name ( #1130 )
2024-04-09 19:27:07 -04:00
Wenkai Du
0ce68f21d4
NPKit: doubling size of event buffers following MAXCHANNELS change ( #1135 )
2024-04-09 08:02:58 -07:00
Wenkai Du
137571fa01
Fix buffer overflow when parsing kernel cmdline ( #1133 )
2024-04-08 11:12:20 -07:00
gilbertlee-amd
93982533d7
[topo_expl] Adding -n option to override number of nodes ( #1134 )
2024-04-04 15:11:47 -06:00
Wenkai Du
e8c76fd806
rccl_prim_test: increase max number of workgroups and test iterations ( #1132 )
2024-04-03 11:29:21 -07:00
dependabot[bot]
d0d1bfdeda
Bump rocm-docs-core from 0.37.0 to 0.38.0 in /docs/sphinx ( #1127 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.37.0 to 0.38.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.37.0...v0.38.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-27 11:24:30 -06:00
arvindcheru
c0a51dc84b
Static Build update - Moved all cmake install() to rocm-cmake APIs, static build update ( #1123 )
2024-03-26 11:11:09 -04:00
corey-derochie-amd
503a472a25
Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. ( #1125 )
2024-03-25 16:29:13 -06:00
corey-derochie-amd
9eefc68cb5
Fixes the copyright comment block on each of topo_expl/models/*.xml. The format was not valid XML. ( #1124 )
2024-03-25 16:21:17 -06:00
Wenkai Du
5976f757dd
Remove hipEventDisableSystemFence ( #1122 )
...
There is no indication that disabling system fence has any latency improvement.
Removing it per recommendation from HIP.
2024-03-25 08:01:57 -07:00
Pedram Alizadeh
c2fc1d6809
msccl algorithms tuning for alltoall on MI300 ( #1120 )
...
Co-authored-by: PedramAlizadeh <amd@pmohamma.com >
2024-03-21 20:35:29 -04:00
corey-derochie-amd
606d3e6b6e
Added @corey-derochie-amd as a code owner (to rocm-documentation) ( #1119 )
2024-03-21 14:56:05 -06:00
dependabot[bot]
cb80586fb9
Bump rocm-docs-core from 0.36.0 to 0.37.0 in /docs/sphinx ( #1117 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.36.0 to 0.37.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.36.0...v0.37.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-20 09:25:14 -06:00
jbachan
6dd51f15bf
Merge pull request #1217 from crazy-JiangDongHua/bugfix_undo_plan
...
Bug in plan enqueue logic where plans could be silently not launched for some communicators. Triggered when all of the following are true:
1. Multiple communicators per ncclGroup.
2. Communicators within a group have different plan counts.
3. Intra-process launch barrier disabled.
2024-03-18 10:12:26 -07:00
Nilesh M Negi
53fad75001
BUILD: Enable RCCL static build ( #1114 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-03-15 12:18:18 -05:00
srawat
45ee5734dd
refactor RCCL ( #1112 )
...
* refactor RCCL
* rccl updates
* Update index.rst
* refactor
* Update what-is-rccl.rst
2024-03-15 14:14:47 +05:30
Pedram Alizadeh
50f22e8317
msccl algorithms tuning for allgather on MI300 ( #1110 )
2024-03-14 12:18:26 -04:00
dependabot[bot]
0867562b18
Bump rocm-docs-core from 0.35.1 to 0.36.0 in /docs/sphinx ( #1109 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.35.1 to 0.36.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.35.1...v0.36.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-12 09:38:20 -06:00