Commit Graph

1160 commits

Author SHA1 Message Date
BertanDogancay c2c9ed2acb Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 9ff53eeeae]
2024-01-30 14:43:43 -08:00
BertanDogancay a2b3e1ab2d correct data type
[ROCm/rccl commit: 31ec5d5cb0]
2024-01-28 19:55:19 -08:00
Wenkai Du 99564b560c Remove enhcompat.cc
[ROCm/rccl commit: 6afabf0d0b]
2024-01-24 17:13:30 -08:00
Wenkai Du a7a60a06ec Fix sendrecv merge
[ROCm/rccl commit: 4aafb2a3c5]
2024-01-24 16:23:53 -08:00
BertanDogancay 404d398bac Merge remote-tracking branch 'nccl/v2.19' into develop
[ROCm/rccl commit: 81ddf9de89]
2024-01-24 15:25:33 -08:00
Tim 36fdd214e0 Turned on RCCL signal handler in CI tests (#1039)
[ROCm/rccl commit: 18a57bac10]
2024-01-23 17:07:45 -05:00
Bertan Dogancay 645411f8f4 Use binary search for direct function calls (#1057)
* Use binary search for direct function calls

* fix scratch mem issue on MI300

[ROCm/rccl commit: 5564d65e71]
2024-01-22 17:37:56 -07:00
Bertan Dogancay 56482a8be8 Fix collective trace when rccl is configured (#1056)
* Fix collective trace when rccl is configured

[ROCm/rccl commit: c4dbf8a914]
2024-01-22 09:26:44 -07:00
Wenkai Du 8b8179a689 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case

[ROCm/rccl commit: 7e25d5bc55]
2024-01-21 19:00:50 -08:00
Nilesh M Negi 1c9b4fab39 COLLECTIVES: Switch to unroll 2 for MI300 (#1051)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 8b97a20943]
2024-01-19 12:16:05 -06:00
Bertan Dogancay c91be550bc Disable LL128 for gfx90a (#1054)
[ROCm/rccl commit: 1ac800f9fc]
2024-01-18 18:34:19 -07:00
Bertan Dogancay 4be87126fe Turn IFC off (#1053)
[ROCm/rccl commit: 5f365a9957]
2024-01-18 15:29:36 -07:00
Bertan Dogancay 11674674fc [DEV] Configure functions in RCCL (#986)
* configure functions in rccl

[ROCm/rccl commit: 28d9b170c9]
2024-01-18 15:07:16 -07:00
Tim 0343d9ccac Relaxing default timeout limit, add error log (#1052)
Signed-off-by: Tim Hu <timhu102@amd.com>

[ROCm/rccl commit: 05850e89f2]
2024-01-18 15:09:08 -05:00
Tim 5f7ef6b671 Adding regression test (#1045)
* adding regression test

Signed-off-by: Tim Hu <timhu102@amd.com>

* modifying regression test

Signed-off-by: Tim Hu <timhu102@amd.com>

* Update StandaloneTests.cpp

---------

Signed-off-by: Tim Hu <timhu102@amd.com>

[ROCm/rccl commit: c2a073a97d]
2024-01-18 10:46:16 -05:00
Wenkai Du dbf906d8fa Only use full MAXCHANNELS for gfx94x (#1050)
[ROCm/rccl commit: 3325f96c56]
2024-01-17 09:00:49 -08:00
Wenkai Du 366cd12bed topo-expl: fix broken build (#1048)
[ROCm/rccl commit: 600b44fee5]
2024-01-17 08:59:03 -08:00
Tim 245e757b26 Adding timeout functionality/EnvVar to TestBed (#1044)
* Adding timeout functionality/EnvVar to TestBed
* updating timeout unit to microseconds

Signed-off-by: Tim Hu <timhu102@amd.com>

[ROCm/rccl commit: 9c0ef11ac7]
2024-01-17 11:33:01 -05:00
dependabot[bot] 1d62a5f440 Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#1049)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 15f0ccaec7]
2024-01-16 11:53:14 -07:00
Pedram Alizadeh 5c349cd729 adding rccl tuning parameters for MI300X gfx942 with 8 GPUs single and multi-node (#1047)
[ROCm/rccl commit: b08124c85d]
2024-01-16 13:44:32 -05:00
Wenkai Du 8d38747c65 Add option to force enable network transport on single node (#1046)
[ROCm/rccl commit: 261707d90a]
2024-01-16 07:54:18 -08:00
Sam Wu 8e17a75353 Standardize documentation for ReadtheDocs (#1027)
[ROCm/rccl commit: 246dbd16d7]
2024-01-15 09:26:18 -07:00
dependabot[bot] 1204e8de34 Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#1043)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 2a8c632516]
2024-01-15 09:10:58 -07:00
PedramAlizadeh 7cc572ecf9 Revert "2.18.5-1"
This reverts commit 26b91b9dbb.


[ROCm/rccl commit: 767fde8210]
2024-01-12 16:54:19 +00:00
dependabot[bot] fc73c738ba Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#1040)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: a1ee3e1ba9]
2024-01-11 17:02:03 -07:00
Bertan Dogancay 3d54c3fe5c Add codeowners (#1041)
[ROCm/rccl commit: ff7c9c4050]
2024-01-11 15:41:08 -07:00
Bertan Dogancay a056463d4d Addressing the compiler warning (#988)
[ROCm/rccl commit: cf248d9402]
2024-01-10 14:59:40 -07:00
dependabot[bot] bf4d1c9fb3 Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx (#1038)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.37 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.37...3.1.41)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: ccba6f7a74]
2024-01-10 11:30:09 -07:00
Hossein Pourreza df0dbb887d cover more gpu/nic mapping cases (#1037)
[ROCm/rccl commit: 735178c1fe]
2024-01-10 08:01:37 -08:00
Wenkai Du 64cf812da0 Re-enable L128 on gfx90a if compiler supports it (#1036)
[ROCm/rccl commit: 5851ae5974]
2024-01-10 08:01:11 -08:00
Nilesh M Negi c1acf97c05 Remove FORCE from AMDGPU_TARGETS and add support in install script (#989)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 414884c6cb]
2024-01-09 13:29:47 -06:00
Nilesh M Negi cec06d59d1 Un-escaped character causes error with address sanitizer builds (#992)
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Jenkins <jenkins-compute@amd.com>

[ROCm/rccl commit: 249e9f7f65]
2024-01-09 13:28:32 -06:00
Pedram Alizadeh d3a47bb387 Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
Sync to nccl 2.18.6

[ROCm/rccl commit: aa5c84c997]
2024-01-09 13:29:29 -05:00
Wenkai Du 30f744dc35 msccl: use custom reduce function (#1033)
[ROCm/rccl commit: d9871d171b]
2024-01-08 14:53:12 -08:00
Wenkai Du cd7a346297 Doubling buffer size to fix NCCL INFO corruption with increased channels (#1035)
[ROCm/rccl commit: f7e39fced2]
2024-01-08 08:14:33 -08:00
Wenkai Du 626608c172 Increase stack size for gfx906 (#1034)
Occasionally "Memory access fault by GPU node-8 (Agent handle: 0x23a5640) on address 0x7f461ec00000. Reason: Page not present or supervisor privilege" can be seen from gfx906 CI

[ROCm/rccl commit: e5bf56c6d8]
2024-01-07 20:25:02 -08:00
Ziyue Yang 1b39fef32a Fix MSCCL multi-node (#1032)
1) Move needsProxy initialization before mscclSetupConnections, since the latter
revises it afterwards.
2) Remove the mscclAvailable check in net.cc, since it is no longer required and caused
an unexpected non-shared buffer to be allocated for MSCCL.

[ROCm/rccl commit: 70bbeb4773]
2024-01-05 17:03:43 -08:00
Wenkai Du 4eaf90f84c p2p-latency-tests: fix build by switching to gcnArchName (#1030)
* p2p-latency-tests: fix build by switching to gcnArchName

* rccl-prim-test: switch to gcnArchName

[ROCm/rccl commit: cfc04a8aef]
2024-01-04 13:36:48 -08:00
Wenkai Du 13791d7ee3 Rework barriers and adjust scope of atomics (#1019)
[ROCm/rccl commit: abf265a911]
2024-01-04 08:18:48 -08:00
Ziyue Yang e3d45f9de4 Improve MSCCL algorithms (#1023)
[ROCm/rccl commit: 0a53077c9c]
2024-01-03 14:51:34 -08:00
akolliasAMD 0c1f773021 rearranged how the min and max functions are part of msccl (#1025)
* rearranged how the min and max functions are part of msccl

* added more coverage on in place graph tests

[ROCm/rccl commit: f4858e14b2]
2023-12-21 08:58:33 -07:00
dependabot[bot] 08b097d6ae Bump rocm-docs-core from 0.30.2 to 0.30.3 in /docs/sphinx (#1024)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.2 to 0.30.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.2...v0.30.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 7e1cbb440d]
2023-12-20 10:37:13 -07:00
dependabot[bot] b94284f000 Bump rocm-docs-core from 0.30.1 to 0.30.2 in /docs/sphinx (#1021)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.1 to 0.30.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.1...v0.30.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: d8c53e90d7]
2023-12-19 13:34:37 -07:00
PedramAlizadeh 401a104c5a resolved conflicts, fixed the localNetCount/0 bug
[ROCm/rccl commit: 0d515f9388]
2023-12-18 08:11:34 +00:00
akolliasAMD 9cb0f98e73 CMake does not allow capital letters to be used in package names (#1020)
[ROCm/rccl commit: a924454f0f]
2023-12-15 12:39:17 -07:00
Ziyue Yang e4b63a8ba0 Fully disable MSCCL when machine is not matched (#1017)
* Disable MSCCL algorithm meta loading when machine is not matched

* fully disable init

* fix potential segfault

[ROCm/rccl commit: 655742a3a6]
2023-12-13 08:36:21 -08:00
Wenkai Du 918ce6c2e2 msccl: disable on multi-node (#1018)
[ROCm/rccl commit: 53d807a5b9]
2023-12-13 07:41:40 -08:00
Wenkai Du 48107b18c9 msccl: fix data corruption with MTYPE_RW (#1014)
[ROCm/rccl commit: 81602814a7]
2023-12-11 20:33:15 -08:00
Bertan Dogancay fe5a902f97 correct package name (#1012)
[ROCm/rccl commit: fca459baaf]
2023-12-11 09:40:29 -07:00
Wenkai Du 481a35bc59 Fix memory fence and use non-temporal store (#1007)
* Fix memory fence and use non-temporal store

* Use amdgcn builtin instead of inline asm

* Move threadfence location

* Revert changes to gfx90a

* Rework gfx90a change

* Apply changes to gfx94x

[ROCm/rccl commit: 7965c8b53c]
2023-12-09 12:16:08 -08:00