Bertan Dogancay
8a442faa12
Nvtx support (#1076)
...
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du
5257c753c5
msccl: use relaxed atomics on scratch buffer (#1075)
2024-02-08 12:09:56 -08:00
dependabot[bot]
be45f0effd
Bump rocm-docs-core from 0.33.1 to 0.33.2 in /docs/sphinx (#1073)
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.33.1 to 0.33.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.1...v0.33.2)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-08 09:26:47 -07:00
Wenkai Du
704c9ef0d1
Doubling P2P channels per peer on single node gfx94x only (#1074)
2024-02-07 14:05:57 -08:00
dependabot[bot]
a9214032fc
Bump rocm-docs-core from 0.33.0 to 0.33.1 in /docs/sphinx (#1071)
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.33.0 to 0.33.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.0...v0.33.1)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-06 16:00:30 -07:00
dependabot[bot]
ca007ddad3
Bump cryptography from 41.0.6 to 42.0.0 in /docs/sphinx (#1070)
...
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.6 to 42.0.0.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.6...42.0.0)
---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-06 15:59:52 -07:00
Wenkai Du
1d989f6524
Doubling P2P channels per peer on single node only (#1069)
2024-02-02 12:41:00 -08:00
Nilesh M Negi
2458f158b1
Enable kernarg preloading for ROCm 6.1 (#1068)
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-02-01 12:14:04 -06:00
Bertan Dogancay
01b359027b
Include common.h in enqueue.cc instead (#1067)
2024-01-30 08:24:22 -08:00
Wenkai Du
f7550d83b8
msccl: ensure memory coherence after data receive (#1062)
2024-01-30 08:22:50 -08:00
dependabot[bot]
8949a28502
Bump rocm-docs-core from 0.31.0 to 0.33.0 in /docs/sphinx (#1065)
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.31.0 to 0.33.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.31.0...v0.33.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-29 17:10:16 -07:00
Bertan Dogancay
d75c364864
Do not use LL128 when disabled (#1066)
2024-01-29 14:08:59 -07:00
Shilei Tian
ba9f7917ba
Add a constructor for PtrUnion in case it is not initialized explicitly (#1064)
2024-01-26 08:00:27 -08:00
Pedram Alizadeh
ccfb35fa6d
modifying the tuning table to improve the performance of allreduce for 8MB and 16MB for single-node MI300X (#1063)
2024-01-26 09:05:53 -05:00
Wenkai Du
be8ef4367f
colltrace: fix dropped trace messages (#1059)
...
* colltrace: fix dropped trace messages
* Remove extra space
2024-01-25 13:31:53 -08:00
Wenkai Du
ffde530af5
Increase P2P channels per peer (#1060)
2024-01-25 11:21:58 -08:00
Sam Wu
7d6da4c66b
Add codeowners for documentation (#1061)
...
* Add codeowners for documentation
* Update CODEOWNERS
---------
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
2024-01-25 09:33:28 -07:00
Wenkai Du
7987015a19
Revert "msccl: build same number of kernels as in ROCm 5.7" (#1058)
...
This reverts commit f960174d03be7e5174baa83b256526d388a38842.
2024-01-24 08:43:50 -08:00
Tim
18a57bac10
Turned on RCCL signal handler in CI tests (#1039)
2024-01-23 17:07:45 -05:00
Bertan Dogancay
5564d65e71
Use binary search for direct function calls (#1057)
...
* Use binary search for direct function calls
* fix scratch mem issue on MI300
2024-01-22 17:37:56 -07:00
Bertan Dogancay
c4dbf8a914
Fix collective trace when rccl is configured (#1056)
...
* Fix collective trace when rccl is configured
2024-01-22 09:26:44 -07:00
Wenkai Du
7e25d5bc55
Use new HIP graph API compatible with CUDA 11030 (#991)
...
* Use new HIP graph API compatible with CUDA 11030
* Update dependency to ROCm 6.1
* Fix single stream use case
2024-01-21 19:00:50 -08:00
Nilesh M Negi
8b97a20943
COLLECTIVES: Switch to unroll 2 for MI300 (#1051)
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-19 12:16:05 -06:00
Bertan Dogancay
1ac800f9fc
Disable LL128 for gfx90a (#1054)
2024-01-18 18:34:19 -07:00
Bertan Dogancay
5f365a9957
Turn IFC off (#1053)
2024-01-18 15:29:36 -07:00
Bertan Dogancay
28d9b170c9
[DEV] Configure functions in RCCL (#986)
...
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Tim
05850e89f2
Relaxing default timeout limit, add error log (#1052)
...
Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-18 15:09:08 -05:00
Tim
c2a073a97d
Adding regression test (#1045)
...
* adding regression test
Signed-off-by: Tim Hu <timhu102@amd.com>
* modifying regression test
Signed-off-by: Tim Hu <timhu102@amd.com>
* Update StandaloneTests.cpp
---------
Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-18 10:46:16 -05:00
Wenkai Du
3325f96c56
Only use full MAXCHANNELS for gfx94x (#1050)
2024-01-17 09:00:49 -08:00
Wenkai Du
600b44fee5
topo-expl: fix broken build (#1048)
2024-01-17 08:59:03 -08:00
Tim
9c0ef11ac7
Adding timeout functionality/EnvVar to TestBed (#1044)
...
* Adding timeout functionality/EnvVar to TestBed
* updating timeout unit to microseconds
Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-17 11:33:01 -05:00
dependabot[bot]
15f0ccaec7
Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#1049)
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-16 11:53:14 -07:00
Pedram Alizadeh
b08124c85d
adding rccl tuning parameters for MI300X gfx942 with 8 GPUs single and multi-node (#1047)
2024-01-16 13:44:32 -05:00
Wenkai Du
261707d90a
Add option to force enable network transport on single node (#1046)
2024-01-16 07:54:18 -08:00
Sam Wu
246dbd16d7
Standardize documentation for ReadtheDocs (#1027)
2024-01-15 09:26:18 -07:00
dependabot[bot]
2a8c632516
Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#1043)
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-15 09:10:58 -07:00
Pedram Alizadeh
80726361b9
Merge pull request #1042 from PedramAlizadeh/revert_2.18.5
...
Revert nccl "2.18.5-1" from 2.18.6
2024-01-12 12:00:12 -05:00
PedramAlizadeh
767fde8210
Revert "2.18.5-1"
...
This reverts commit 559b70f86c.
2024-01-12 16:54:19 +00:00
dependabot[bot]
a1ee3e1ba9
Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#1040)
...
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.2...3.1.3)
---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-11 17:02:03 -07:00
Bertan Dogancay
ff7c9c4050
Add codeowners (#1041)
2024-01-11 15:41:08 -07:00
Bertan Dogancay
cf248d9402
Addressing the compiler warning (#988)
2024-01-10 14:59:40 -07:00
dependabot[bot]
ccba6f7a74
Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx (#1038)
...
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.37 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.37...3.1.41)
---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-10 11:30:09 -07:00
Hossein Pourreza
735178c1fe
cover more gpu/nic mapping cases (#1037)
2024-01-10 08:01:37 -08:00
Wenkai Du
5851ae5974
Re-enable LL128 on gfx90a if compiler supports it (#1036)
2024-01-10 08:01:11 -08:00
Nilesh M Negi
414884c6cb
Remove FORCE from AMDGPU_TARGETS and add support in install script (#989)
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-09 13:29:47 -06:00
Nilesh M Negi
249e9f7f65
Un-escaped character causes error with address sanitizer builds (#992)
...
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Jenkins <jenkins-compute@amd.com>
2024-01-09 13:28:32 -06:00
Pedram Alizadeh
aa5c84c997
Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
...
Sync to nccl 2.18.6
2024-01-09 13:29:29 -05:00
Wenkai Du
d9871d171b
msccl: use custom reduce function (#1033)
2024-01-08 14:53:12 -08:00
Wenkai Du
f7e39fced2
Doubling buffer size to fix NCCL INFO corruption with increased channels (#1035)
2024-01-08 08:14:33 -08:00
Wenkai Du
e5bf56c6d8
Increase stack size for gfx906 (#1034)
...
Occasionally, "Memory access fault by GPU node-8 (Agent handle: 0x23a5640) on address 0x7f461ec00000. Reason: Page not present or supervisor privilege" can be seen in gfx906 CI.
2024-01-07 20:25:02 -08:00