Commit Graph

1277 Commits

Author SHA1 Message Date
Bertan Dogancay 8a442faa12 Nvtx support (#1076)
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du 5257c753c5 msccl: use relaxed atomics on scratch buffer (#1075) 2024-02-08 12:09:56 -08:00
dependabot[bot] be45f0effd Bump rocm-docs-core from 0.33.1 to 0.33.2 in /docs/sphinx (#1073)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.33.1 to 0.33.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.1...v0.33.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-08 09:26:47 -07:00
Wenkai Du 704c9ef0d1 Doubling P2P channels per peer on single node gfx94x only (#1074) 2024-02-07 14:05:57 -08:00
dependabot[bot] a9214032fc Bump rocm-docs-core from 0.33.0 to 0.33.1 in /docs/sphinx (#1071)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.33.0 to 0.33.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.33.0...v0.33.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-06 16:00:30 -07:00
dependabot[bot] ca007ddad3 Bump cryptography from 41.0.6 to 42.0.0 in /docs/sphinx (#1070)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.6 to 42.0.0.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.6...42.0.0)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-06 15:59:52 -07:00
Wenkai Du 1d989f6524 Doubling P2P channels per peer on single node only (#1069) 2024-02-02 12:41:00 -08:00
Nilesh M Negi 2458f158b1 Enable kernarg preloading for ROCm 6.1 (#1068)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-02-01 12:14:04 -06:00
Bertan Dogancay 01b359027b Include common.h in enqueue.cc instead (#1067) 2024-01-30 08:24:22 -08:00
Wenkai Du f7550d83b8 msccl: ensure memory coherence after data receive (#1062) 2024-01-30 08:22:50 -08:00
dependabot[bot] 8949a28502 Bump rocm-docs-core from 0.31.0 to 0.33.0 in /docs/sphinx (#1065)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.31.0 to 0.33.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.31.0...v0.33.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-29 17:10:16 -07:00
Bertan Dogancay d75c364864 Do not use LL128 when disabled (#1066) 2024-01-29 14:08:59 -07:00
Shilei Tian ba9f7917ba Add a constructor for PtrUnion in case it is not initialized explicitly (#1064) 2024-01-26 08:00:27 -08:00
Pedram Alizadeh ccfb35fa6d modifying the tuning table to improve the performance of allreduce for 8MB and 16MB for single-node MI300X (#1063) 2024-01-26 09:05:53 -05:00
Wenkai Du be8ef4367f colltrace: fix dropped trace messages (#1059)
* colltrace: fix dropped trace messages

* Remove extra space
2024-01-25 13:31:53 -08:00
Wenkai Du ffde530af5 Increase P2P channels per peer (#1060) 2024-01-25 11:21:58 -08:00
Sam Wu 7d6da4c66b Add codeowners for documentation (#1061)
* Add codeowners for documentation

* Update CODEOWNERS

---------

Co-authored-by: samjwu <samjwu@users.noreply.github.com>
2024-01-25 09:33:28 -07:00
Wenkai Du 7987015a19 Revert "msccl: build same number of kernels as in ROCm 5.7" (#1058)
This reverts commit f960174d03be7e5174baa83b256526d388a38842.
2024-01-24 08:43:50 -08:00
Tim 18a57bac10 Turned on RCCL signal handler in CI tests (#1039) 2024-01-23 17:07:45 -05:00
Bertan Dogancay 5564d65e71 Use binary search for direct function calls (#1057)
* Use binary search for direct function calls

* fix scratch mem issue on MI300
2024-01-22 17:37:56 -07:00
Bertan Dogancay c4dbf8a914 Fix collective trace when rccl is configured (#1056)
* Fix collective trace when rccl is configured
2024-01-22 09:26:44 -07:00
Wenkai Du 7e25d5bc55 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case
2024-01-21 19:00:50 -08:00
Nilesh M Negi 8b97a20943 COLLECTIVES: Switch to unroll 2 for MI300 (#1051)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-19 12:16:05 -06:00
Bertan Dogancay 1ac800f9fc Disable LL128 for gfx90a (#1054) 2024-01-18 18:34:19 -07:00
Bertan Dogancay 5f365a9957 Turn IFC off (#1053) 2024-01-18 15:29:36 -07:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Tim 05850e89f2 Relaxing default timeout limit, add error log (#1052)
Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-18 15:09:08 -05:00
Tim c2a073a97d Adding regression test (#1045)
* adding regression test

Signed-off-by: Tim Hu <timhu102@amd.com>

* modifying regression test

Signed-off-by: Tim Hu <timhu102@amd.com>

* Update StandaloneTests.cpp

---------

Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-18 10:46:16 -05:00
Wenkai Du 3325f96c56 Only use full MAXCHANNELS for gfx94x (#1050) 2024-01-17 09:00:49 -08:00
Wenkai Du 600b44fee5 topo-expl: fix broken build (#1048) 2024-01-17 08:59:03 -08:00
Tim 9c0ef11ac7 Adding timeout functionality/EnvVar to TestBed (#1044)
* Adding timeout functionality/EnvVar to TestBed
* updating timeout unit to microseconds

Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-17 11:33:01 -05:00
dependabot[bot] 15f0ccaec7 Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#1049)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-16 11:53:14 -07:00
Pedram Alizadeh b08124c85d adding rccl tuning parameters for MI300X gfx942 with 8 GPUs single and multi-node (#1047) 2024-01-16 13:44:32 -05:00
Wenkai Du 261707d90a Add option to force enable network transport on single node (#1046) 2024-01-16 07:54:18 -08:00
Sam Wu 246dbd16d7 Standardize documentation for ReadtheDocs (#1027) 2024-01-15 09:26:18 -07:00
dependabot[bot] 2a8c632516 Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#1043)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-15 09:10:58 -07:00
Pedram Alizadeh 80726361b9 Merge pull request #1042 from PedramAlizadeh/revert_2.18.5
Revert nccl "2.18.5-1" from 2.18.6
2024-01-12 12:00:12 -05:00
PedramAlizadeh 767fde8210 Revert "2.18.5-1"
This reverts commit 559b70f86c.
2024-01-12 16:54:19 +00:00
dependabot[bot] a1ee3e1ba9 Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#1040)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-11 17:02:03 -07:00
Bertan Dogancay ff7c9c4050 Add codeowners (#1041) 2024-01-11 15:41:08 -07:00
Bertan Dogancay cf248d9402 Addressing the compiler warning (#988) 2024-01-10 14:59:40 -07:00
dependabot[bot] ccba6f7a74 Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx (#1038)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.37 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.37...3.1.41)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-10 11:30:09 -07:00
Hossein Pourreza 735178c1fe cover more gpu/nic mapping cases (#1037) 2024-01-10 08:01:37 -08:00
Wenkai Du 5851ae5974 Re-enable L128 on gfx90a if compiler supports it (#1036) 2024-01-10 08:01:11 -08:00
Nilesh M Negi 414884c6cb Remove FORCE from AMDGPU_TARGETS and add support in install script (#989)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-09 13:29:47 -06:00
Nilesh M Negi 249e9f7f65 Un-escaped character causes error with address sanitizer builds (#992)
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Jenkins <jenkins-compute@amd.com>
2024-01-09 13:28:32 -06:00
Pedram Alizadeh aa5c84c997 Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
Sync to nccl 2.18.6
2024-01-09 13:29:29 -05:00
Wenkai Du d9871d171b msccl: use custom reduce function (#1033) 2024-01-08 14:53:12 -08:00
Wenkai Du f7e39fced2 Doubling buffer size to fix NCCL INFO corruption with increased channels (#1035) 2024-01-08 08:14:33 -08:00
Wenkai Du e5bf56c6d8 Increase stack size for gfx906 (#1034)
Occasionally "Memory access fault by GPU node-8 (Agent handle: 0x23a5640) on address 0x7f461ec00000. Reason: Page not present or supervisor privilege" can be seen from gfx906 CI
2024-01-07 20:25:02 -08:00