BertanDogancay
c2c9ed2acb
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 9ff53eeeae ]
2024-01-30 14:43:43 -08:00
BertanDogancay
a2b3e1ab2d
correct data type
...
[ROCm/rccl commit: 31ec5d5cb0 ]
2024-01-28 19:55:19 -08:00
Wenkai Du
99564b560c
Remove enhcompat.cc
...
[ROCm/rccl commit: 6afabf0d0b ]
2024-01-24 17:13:30 -08:00
Wenkai Du
a7a60a06ec
Fix sendrecv merge
...
[ROCm/rccl commit: 4aafb2a3c5 ]
2024-01-24 16:23:53 -08:00
BertanDogancay
404d398bac
Merge remote-tracking branch 'nccl/v2.19' into develop
...
[ROCm/rccl commit: 81ddf9de89 ]
2024-01-24 15:25:33 -08:00
Tim
36fdd214e0
Turned on RCCL signal handler in CI tests ( #1039 )
...
[ROCm/rccl commit: 18a57bac10 ]
2024-01-23 17:07:45 -05:00
Bertan Dogancay
645411f8f4
Use binary search for direct function calls ( #1057 )
...
* Use binary search for direct function calls
* fix scratch mem issue on MI300
[ROCm/rccl commit: 5564d65e71 ]
2024-01-22 17:37:56 -07:00
Bertan Dogancay
56482a8be8
Fix collective trace when rccl is configured ( #1056 )
...
* Fix collective trace when rccl is configured
[ROCm/rccl commit: c4dbf8a914 ]
2024-01-22 09:26:44 -07:00
Wenkai Du
8b8179a689
Use new HIP graph API compatible with CUDA 11030 ( #991 )
...
* Use new HIP graph API compatible with CUDA 11030
* Update dependency to ROCm 6.1
* Fix single stream use case
[ROCm/rccl commit: 7e25d5bc55 ]
2024-01-21 19:00:50 -08:00
Nilesh M Negi
1c9b4fab39
COLLECTIVES: Switch to unroll 2 for MI300 ( #1051 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 8b97a20943 ]
2024-01-19 12:16:05 -06:00
Bertan Dogancay
c91be550bc
Disable LL128 for gfx90a ( #1054 )
...
[ROCm/rccl commit: 1ac800f9fc ]
2024-01-18 18:34:19 -07:00
Bertan Dogancay
4be87126fe
Turn IFC off ( #1053 )
...
[ROCm/rccl commit: 5f365a9957 ]
2024-01-18 15:29:36 -07:00
Bertan Dogancay
11674674fc
[DEV] Configure functions in RCCL ( #986 )
...
* configure functions in rccl
[ROCm/rccl commit: 28d9b170c9 ]
2024-01-18 15:07:16 -07:00
Tim
0343d9ccac
Relaxing default timeout limit, add error log ( #1052 )
...
Signed-off-by: Tim Hu <timhu102@amd.com >
[ROCm/rccl commit: 05850e89f2 ]
2024-01-18 15:09:08 -05:00
Tim
5f7ef6b671
Adding regression test ( #1045 )
...
* adding regression test
Signed-off-by: Tim Hu <timhu102@amd.com >
* modifying regression test
Signed-off-by: Tim Hu <timhu102@amd.com >
* Update StandaloneTests.cpp
---------
Signed-off-by: Tim Hu <timhu102@amd.com >
[ROCm/rccl commit: c2a073a97d ]
2024-01-18 10:46:16 -05:00
Wenkai Du
dbf906d8fa
Only use full MAXCHANNELS for gfx94x ( #1050 )
...
[ROCm/rccl commit: 3325f96c56 ]
2024-01-17 09:00:49 -08:00
Wenkai Du
366cd12bed
topo-expl: fix broken build ( #1048 )
...
[ROCm/rccl commit: 600b44fee5 ]
2024-01-17 08:59:03 -08:00
Tim
245e757b26
Adding timeout functionality/EnvVar to TestBed ( #1044 )
...
* Adding timeout functionality/EnvVar to TestBed
* updating timeout unit to microseconds
Signed-off-by: Tim Hu <timhu102@amd.com >
[ROCm/rccl commit: 9c0ef11ac7 ]
2024-01-17 11:33:01 -05:00
dependabot[bot]
1d62a5f440
Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx ( #1049 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 15f0ccaec7 ]
2024-01-16 11:53:14 -07:00
Pedram Alizadeh
5c349cd729
adding rccl tuning parameters for MI300X gfx942 with 8 GPUs single and multi-node ( #1047 )
...
[ROCm/rccl commit: b08124c85d ]
2024-01-16 13:44:32 -05:00
Wenkai Du
8d38747c65
Add option to force enable network transport on single node ( #1046 )
...
[ROCm/rccl commit: 261707d90a ]
2024-01-16 07:54:18 -08:00
Sam Wu
8e17a75353
Standardize documentation for ReadtheDocs ( #1027 )
...
[ROCm/rccl commit: 246dbd16d7 ]
2024-01-15 09:26:18 -07:00
dependabot[bot]
1204e8de34
Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx ( #1043 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 2a8c632516 ]
2024-01-15 09:10:58 -07:00
PedramAlizadeh
7cc572ecf9
Revert "2.18.5-1"
...
This reverts commit 26b91b9dbb .
[ROCm/rccl commit: 767fde8210 ]
2024-01-12 16:54:19 +00:00
dependabot[bot]
fc73c738ba
Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx ( #1040 )
...
Bumps [jinja2](https://github.com/pallets/jinja ) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases )
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst )
- [Commits](https://github.com/pallets/jinja/compare/3.1.2...3.1.3 )
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: a1ee3e1ba9 ]
2024-01-11 17:02:03 -07:00
Bertan Dogancay
3d54c3fe5c
Add codeowners ( #1041 )
...
[ROCm/rccl commit: ff7c9c4050 ]
2024-01-11 15:41:08 -07:00
Bertan Dogancay
a056463d4d
Addressing the compiler warning ( #988 )
...
[ROCm/rccl commit: cf248d9402 ]
2024-01-10 14:59:40 -07:00
dependabot[bot]
bf4d1c9fb3
Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx ( #1038 )
...
Bumps [gitpython](https://github.com/gitpython-developers/GitPython ) from 3.1.37 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases )
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES )
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.37...3.1.41 )
---
updated-dependencies:
- dependency-name: gitpython
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: ccba6f7a74 ]
2024-01-10 11:30:09 -07:00
Hossein Pourreza
df0dbb887d
cover more gpu/nic mapping cases ( #1037 )
...
[ROCm/rccl commit: 735178c1fe ]
2024-01-10 08:01:37 -08:00
Wenkai Du
64cf812da0
Re-enable L128 on gfx90a if compiler supports it ( #1036 )
...
[ROCm/rccl commit: 5851ae5974 ]
2024-01-10 08:01:11 -08:00
Nilesh M Negi
c1acf97c05
Remove FORCE from AMDGPU_TARGETS and add support in install script ( #989 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 414884c6cb ]
2024-01-09 13:29:47 -06:00
Nilesh M Negi
cec06d59d1
Un-escaped character causes error with address sanitizer builds ( #992 )
...
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com >
Co-authored-by: Jenkins <jenkins-compute@amd.com >
[ROCm/rccl commit: 249e9f7f65 ]
2024-01-09 13:28:32 -06:00
Pedram Alizadeh
d3a47bb387
Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
...
Sync to nccl 2.18.6
[ROCm/rccl commit: aa5c84c997 ]
2024-01-09 13:29:29 -05:00
Wenkai Du
30f744dc35
msccl: use custom reduce function ( #1033 )
...
[ROCm/rccl commit: d9871d171b ]
2024-01-08 14:53:12 -08:00
Wenkai Du
cd7a346297
Doubling buffer size to fix NCCL INFO corruption with increased channels ( #1035 )
...
[ROCm/rccl commit: f7e39fced2 ]
2024-01-08 08:14:33 -08:00
Wenkai Du
626608c172
Increase stack size for gfx906 ( #1034 )
...
Occasionally "Memory access fault by GPU node-8 (Agent handle: 0x23a5640) on address 0x7f461ec00000. Reason: Page not present or supervisor privilege" can be seen from gfx906 CI
[ROCm/rccl commit: e5bf56c6d8 ]
2024-01-07 20:25:02 -08:00
Ziyue Yang
1b39fef32a
Fix MSCCL multi-node ( #1032 )
...
1) Move needsProxy initialization before mscclSetupConnections, since the latter
revises it later.
2) Remove the mscclAvailable check in net.cc, since it is no longer required and caused
a non-shared buffer to be allocated for MSCCL, which is not expected.
[ROCm/rccl commit: 70bbeb4773 ]
2024-01-05 17:03:43 -08:00
Wenkai Du
4eaf90f84c
p2p-latency-tests: fix build by switching to gcnArchName ( #1030 )
...
* p2p-latency-tests: fix build by switching to gcnArchName
* rccl-prim-test: switch to gcnArchName
[ROCm/rccl commit: cfc04a8aef ]
2024-01-04 13:36:48 -08:00
Wenkai Du
13791d7ee3
Rework barriers and adjust scope of atomics ( #1019 )
...
[ROCm/rccl commit: abf265a911 ]
2024-01-04 08:18:48 -08:00
Ziyue Yang
e3d45f9de4
Improve MSCCL algorithms ( #1023 )
...
[ROCm/rccl commit: 0a53077c9c ]
2024-01-03 14:51:34 -08:00
akolliasAMD
0c1f773021
rearranged how the min and max functions are part of msccl ( #1025 )
...
* rearranged how the min and max functions are part of msccl
* added more coverage on in place graph tests
[ROCm/rccl commit: f4858e14b2 ]
2023-12-21 08:58:33 -07:00
dependabot[bot]
08b097d6ae
Bump rocm-docs-core from 0.30.2 to 0.30.3 in /docs/sphinx ( #1024 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.30.2 to 0.30.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.2...v0.30.3 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 7e1cbb440d ]
2023-12-20 10:37:13 -07:00
dependabot[bot]
b94284f000
Bump rocm-docs-core from 0.30.1 to 0.30.2 in /docs/sphinx ( #1021 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.30.1 to 0.30.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.1...v0.30.2 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: d8c53e90d7 ]
2023-12-19 13:34:37 -07:00
PedramAlizadeh
401a104c5a
resolved conflicts, fixed the localNetCount/0 bug
...
[ROCm/rccl commit: 0d515f9388 ]
2023-12-18 08:11:34 +00:00
akolliasAMD
9cb0f98e73
CMake does not allow capital letters to be used in package names ( #1020 )
...
[ROCm/rccl commit: a924454f0f ]
2023-12-15 12:39:17 -07:00
Ziyue Yang
e4b63a8ba0
Fully disable MSCCL when machine is not matched ( #1017 )
...
* Disable MSCCL algorithm meta loading when machine is not matched
* fully disable init
* fix potential segfault
[ROCm/rccl commit: 655742a3a6 ]
2023-12-13 08:36:21 -08:00
Wenkai Du
918ce6c2e2
msccl: disable on multi-node ( #1018 )
...
[ROCm/rccl commit: 53d807a5b9 ]
2023-12-13 07:41:40 -08:00
Wenkai Du
48107b18c9
msccl: fix data corruption with MTYPE_RW ( #1014 )
...
[ROCm/rccl commit: 81602814a7 ]
2023-12-11 20:33:15 -08:00
Bertan Dogancay
fe5a902f97
correct package name ( #1012 )
...
[ROCm/rccl commit: fca459baaf ]
2023-12-11 09:40:29 -07:00
Wenkai Du
481a35bc59
Fix memory fence and use non-temporal store ( #1007 )
...
* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x
[ROCm/rccl commit: 7965c8b53c ]
2023-12-09 12:16:08 -08:00