1205 Commits

Author SHA1 Message Date
dependabot[bot] e6e99a1ae9 Bump rocm-docs-core from 0.29.0 to 0.30.1 in /docs/sphinx (#1008)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.29.0 to 0.30.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.29.0...v0.30.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 9c3fea1751]
2023-12-07 10:32:06 -07:00
Wen-Heng (Jack) Chung 0266febb31 Let 320KB message size uses LL protocol. (#1006)
[ROCm/rccl commit: 8e8323252a]
2023-12-06 18:14:31 -06:00
Wen-Heng (Jack) Chung 33aa8b67be Use a map to host scratch buffers (#1004)
* Use a map to host scratch buffers

* Address review feedbacks. Deliberately keep mscclSetupScratch function.

[ROCm/rccl commit: 293f0fb752]
2023-12-05 13:15:28 -06:00
Nilesh M Negi 403a91137c Fix gcnArch bug in IFC mix build (#998) (#1002)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: bc44e3faa7]
2023-12-04 16:20:22 -06:00
Bertan Dogancay ae0bdad45c IFC mix build (#998)
[ROCm/rccl commit: 7c0f49a878]
2023-12-02 18:49:52 -07:00
Wenkai Du b38b7fa3a2 Increase max channels to 64 (#993)
[ROCm/rccl commit: 4ba65d1d6a]
2023-12-01 16:01:11 -08:00
pradeep-ramanna bf57487384 Fix GPU to NIC mapping for peertopeer (#994)
[ROCm/rccl commit: 0b53f79196]
2023-12-01 08:00:17 -08:00
Ziyue Yang cef45b8311 Fix mscclAlgoHandle not initialized issue (#995)
[ROCm/rccl commit: e44e112a17]
2023-12-01 07:58:01 -08:00
dependabot[bot] d237322b8d Bump cryptography from 41.0.4 to 41.0.6 in /docs/sphinx (#985)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.4...41.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: ddd5c07b56]
2023-11-30 15:05:50 -07:00
Ziyue Yang f0c47d085e Move MSCCL algorithm loading to initialization to workaround HIP graph conflict (#982)
* MSCCL: pre-specify channels and pre-load algorithms

* add mutex

* fix bug

* clean include

* disable all-gathers temporarily

[ROCm/rccl commit: 4bb0b4a380]
2023-11-30 09:47:20 -08:00
Bertan Dogancay 5efe13655d Renaming unit-tests package (#987)
[ROCm/rccl commit: 20b02af19b]
2023-11-29 15:05:32 -07:00
dependabot[bot] 51e0dd2ab8 Bump rocm-docs-core from 0.28.0 to 0.29.0 in /docs/sphinx (#980)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.28.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.28.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 10a7cb7556]
2023-11-29 09:39:33 -07:00
akolliasAMD bd982864d5 recreated pr 914 to work with current develop branch (#979)
[ROCm/rccl commit: 56ce9ef05f]
2023-11-28 16:33:47 -07:00
akolliasAMD 81cc39899b npkit trace script now syncs the on average difference per rank (#981)
[ROCm/rccl commit: c71bae1608]
2023-11-28 11:03:55 -07:00
gilbertlee-amd d0a194ec16 JitterBench (#975)
[ROCm/rccl commit: 213869a6b4]
2023-11-23 11:14:11 -07:00
Wenkai Du dcf623f2ec Add special handling of gfx940 (#976)
* Add special handling of gfx940

* Update ring base

[ROCm/rccl commit: 50b2dd9fd7]
2023-11-22 15:07:36 -08:00
dependabot[bot] 68ffd1e90d Bump rocm-docs-core from 0.27.0 to 0.28.0 in /docs/sphinx (#969)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.28.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.27.0...v0.28.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: b1c746e7b5]
2023-11-22 08:58:26 -07:00
Wenkai Du 4c2fa05a23 msccl: allocate scratch as ext-scope fine-grained (#968)
[ROCm/rccl commit: 569d3f7d59]
2023-11-16 09:57:25 -06:00
searlmc1 b5642f39ed Update README.md (#955)
Remove references to HCC, which was removed from ROCm ~2yrs ago

[ROCm/rccl commit: 15fa77bb57]
2023-11-15 18:01:45 -08:00
Wenkai Du 7c0920cd62 Fix kernel command line warnings (#961)
* Fix kernel command line warnings

* Remove while loop

[ROCm/rccl commit: bc8661f092]
2023-11-15 18:01:12 -08:00
Ziyue Yang 6ce074d92d Fix MSCCL work FIFO allocation with HIP graph enabled (#967)
[ROCm/rccl commit: 7fc891bc8d]
2023-11-15 16:43:28 -08:00
Bertan Dogancay 9e8eb41337 Check to support older ROCm versions (#963)
[ROCm/rccl commit: 198f14923b]
2023-11-15 12:36:31 -07:00
Ziyue Yang 2351578d5b Optimize MSCCL all-gather algorithms for gfx942 (#964)
[ROCm/rccl commit: 7ae95db5b8]
2023-11-15 08:18:59 -08:00
Ziyue Yang 2c6eededec Optimize MSCCL reduce primitive switching for gfx942 (#962)
* Optimize reduce primitive switching for gfx942

* address comment

[ROCm/rccl commit: df128879a6]
2023-11-15 08:18:44 -08:00
Wenkai Du 534af85d0f msccl: enable basic collective trace (#959)
To avoid increasing number of kernels, colltrace is only enabled with
RCCL_MSCCL_FORCE_FULLOPS=1

[ROCm/rccl commit: 5a800e00cd]
2023-11-08 20:14:28 -08:00
Bertan Dogancay 1eef637d8c Revert "Remove hip::device (#954)" (#956)
This reverts commit 2d04f5390d.

[ROCm/rccl commit: 8e0258a73d]
2023-11-07 13:40:41 -07:00
Wen-Heng (Jack) Chung 270aa41f6b Use send instead of sendWithBarrier. (#727)
[ROCm/rccl commit: efc42d9045]
2023-11-07 13:47:24 -06:00
dependabot[bot] cf60052394 Bump rocm-docs-core from 0.26.0 to 0.27.0 in /docs/sphinx (#951)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.26.0...v0.27.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 7291144c94]
2023-11-07 10:01:16 -07:00
Nusrat Islam 83a36c65c1 Merge pull request #950 from nusislam/msccl-red2
msccl: remove cases from numReduction switch statement

[ROCm/rccl commit: 022735d208]
2023-11-04 02:48:03 -05:00
Bertan Dogancay 2d04f5390d Remove hip::device (#954)
[ROCm/rccl commit: 7edb486154]
2023-11-03 19:31:00 -06:00
Wenkai Du aa02d2b675 Use parallel init of LDS and adjust P2P channels for gfx94x (#943)
* Use parallel init of LDS and adjust P2P channels for gfx94x

* Move another init to parallel

* Fix NCCL_NCHANNELS_PER_PEER setting

[ROCm/rccl commit: dbcba2923b]
2023-11-03 16:06:49 -07:00
Nusrat Islam b6f47bad7c msccl: remove cases from numReduction switch statement
[ROCm/rccl commit: f545b94d4b]
2023-11-03 16:56:51 -05:00
gilbertlee-amd d8471eaddf Adding LaunchBench tool (#952)
[ROCm/rccl commit: d50bab28bf]
2023-11-03 12:04:52 -06:00
Wenkai Du 1557a1f258 msccl: use 32-bit LDS access and add RCCL_MSCCL_FORCE_FULLOPS (#953)
[ROCm/rccl commit: bb84345943]
2023-11-03 10:38:02 -07:00
akolliasAMD 4cd86b185c MSCCL stream fix (#948)
[ROCm/rccl commit: 988efe605a]
2023-11-03 09:10:52 -06:00
Wenkai Du 297023a7a6 msccl: add templated kernel (#945)
* msccl: add templated kernel

* Use defines to improve code readability

* Fix kernel indexing and review feedback

[ROCm/rccl commit: f484ff17b9]
2023-11-02 17:21:53 -07:00
Nusrat Islam 7c24d9970e Merge pull request #946 from nusislam/msccl-redop
msccl: remove dereference of reduce args

[ROCm/rccl commit: 61aed56ca7]
2023-11-02 16:22:24 -05:00
Nusrat Islam 88c8bb1495 msccl: remove dereference of reduce args
It can be removed because the msccl kernel will never execute this code
according to the current msccl setup.


[ROCm/rccl commit: 6b80a0d0d4]
2023-11-02 13:20:00 -05:00
Wenkai Du 3eeaea3f00 msccl: use atomic to set dependency flags (#941)
[ROCm/rccl commit: a7400218a2]
2023-10-31 14:46:57 -07:00
akolliasAMD 691df735a3 Revert "Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931)" (#939)
This reverts commit 769f00db5c.

[ROCm/rccl commit: 9f02ee8dea]
2023-10-30 23:52:58 -06:00
Wenkai Du b736b506c0 NPkit: misc fixes for MSCCL (#936)
* msccl: add xcc_id to timestamp sync

* NPKit: add timestamp for rrc operator

* NPKit: add timestamp for MSCCL init

[ROCm/rccl commit: a497722894]
2023-10-30 10:00:12 -07:00
Nilesh M Negi 8b1254a4f1 Fix gcnArchName bug in topology dump (#937)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 1e5ca6820b]
2023-10-28 12:30:36 -05:00
Ziyue Yang e1dfb82023 Fix MSCCL work FIFO out-of-bound issue (#935)
[ROCm/rccl commit: 4c117e5335]
2023-10-27 11:24:52 -07:00
Nilesh M Negi b4ba7cc79d SRC/INIT: fix typo for ENABLE_PROFILING (#934)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 96ec3ffe2e]
2023-10-26 23:52:46 -05:00
Nilesh M Negi 706750597c remove gcnArch support (#920)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: f22df90e5c]
2023-10-26 12:09:15 -05:00
Wenkai Du 446c8cbf66 msccl: reduce debug output when using NCCL_DEBUG=INFO (#932)
[ROCm/rccl commit: fb0eccb57b]
2023-10-25 08:05:19 -07:00
Wen-Heng (Jack) Chung 769f00db5c Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931)
[ROCm/rccl commit: bfb8642450]
2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung 89a8493ef8 Introduce allgather MSCCL XML specification for MI250X up to 320KB. (#930)
[ROCm/rccl commit: 3f9ffe4788]
2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung fc2a13c077 Introduce 1-shot allreduce for MI250X Hayabusa. (#929)
[ROCm/rccl commit: 72d5fbddfd]
2023-10-24 16:31:18 -05:00
Wenkai Du cc4de02a86 Add missing gfx942 support (#927)
[ROCm/rccl commit: c4e65fd382]
2023-10-23 12:04:37 -07:00