Commit graph

1221 Commits

Author SHA1 Message Date
Wenkai Du cfc04a8aef p2p-latency-tests: fix build by switching to gcnArchName (#1030)
* p2p-latency-tests: fix build by switching to gcnArchName

* rccl-prim-test: switch to gcnArchName
2024-01-04 13:36:48 -08:00
Wenkai Du abf265a911 Rework barriers and adjust scope of atomics (#1019) 2024-01-04 08:18:48 -08:00
Ziyue Yang 0a53077c9c Improve MSCCL algorithms (#1023) 2024-01-03 14:51:34 -08:00
akolliasAMD f4858e14b2 rearranged how the min and max functions are part of msccl (#1025)
* rearranged how the min and max functions are part of msccl

* added more coverage on in place graph tests
2023-12-21 08:58:33 -07:00
dependabot[bot] 7e1cbb440d Bump rocm-docs-core from 0.30.2 to 0.30.3 in /docs/sphinx (#1024)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.2 to 0.30.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.2...v0.30.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-20 10:37:13 -07:00
dependabot[bot] d8c53e90d7 Bump rocm-docs-core from 0.30.1 to 0.30.2 in /docs/sphinx (#1021)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.1 to 0.30.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.1...v0.30.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-19 13:34:37 -07:00
akolliasAMD a924454f0f CMake does not allow for capital letters being used in package names (#1020) 2023-12-15 12:39:17 -07:00
Ziyue Yang 655742a3a6 Fully disable MSCCL when machine is not matched (#1017)
* Disable MSCCL algorithm meta loading when machine is not matched

* fully disable init

* fix potential segfault
2023-12-13 08:36:21 -08:00
Wenkai Du 53d807a5b9 msccl: disable on multi-node (#1018) 2023-12-13 07:41:40 -08:00
Wenkai Du 81602814a7 msccl: fix data corruption with MTYPE_RW (#1014) 2023-12-11 20:33:15 -08:00
Bertan Dogancay fca459baaf correct package name (#1012) 2023-12-11 09:40:29 -07:00
Wenkai Du 7965c8b53c Fix memory fence and use non-temporal store (#1007)
* Fix memory fence and use non-temporal store

* Use amdgcn builtin instead of inline asm

* Move threadfence location

* Revert changes to gfx90a

* Rework gfx90a change

* Apply changes to gfx94x
2023-12-09 12:16:08 -08:00
Ziyue Yang c002f20029 Fix MSCCL scratch allocation (#1010) 2023-12-08 17:47:10 -06:00
Ziyue Yang bb144dcd50 Tune MSCCL all-reduce algorithm (#1009) 2023-12-08 17:47:02 -06:00
Wen-Heng (Jack) Chung baadda4bd8 Relax workgroup barrier implementation for MSCCL send/recv ops. (#997)
* Trim logic.

* Revert "Trim logic."

This reverts commit 8f2dba6c764108acf2bf5428366b9f41d4d206b9.

* Introduce MSCCL template parameters to send / recv.

* Address review feedbacks.
2023-12-08 17:46:53 -06:00
Wenkai Du 12c08fc52a msccl: build same number of kernels as in ROCm 5.7 (#1005)
Removed fullOps kernels from build
2023-12-07 13:36:04 -06:00
dependabot[bot] 9c3fea1751 Bump rocm-docs-core from 0.29.0 to 0.30.1 in /docs/sphinx (#1008)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.29.0 to 0.30.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.29.0...v0.30.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-07 10:32:06 -07:00
Wen-Heng (Jack) Chung 8e8323252a Let 320KB message size use LL protocol. (#1006) 2023-12-06 18:14:31 -06:00
Wen-Heng (Jack) Chung 293f0fb752 Use a map to host scratch buffers (#1004)
* Use a map to host scratch buffers

* Address review feedbacks. Deliberately keep mscclSetupScratch function.
2023-12-05 13:15:28 -06:00
Nilesh M Negi bc44e3faa7 Fix gcnArch bug in IFC mix build (#998) (#1002)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-12-04 16:20:22 -06:00
Bertan Dogancay 7c0f49a878 IFC mix build (#998) 2023-12-02 18:49:52 -07:00
Wenkai Du 4ba65d1d6a Increase max channels to 64 (#993) 2023-12-01 16:01:11 -08:00
pradeep-ramanna 0b53f79196 Fix GPU to NIC mapping for peertopeer (#994) 2023-12-01 08:00:17 -08:00
Ziyue Yang e44e112a17 Fix mscclAlgoHandle not initialized issue (#995) 2023-12-01 07:58:01 -08:00
dependabot[bot] ddd5c07b56 Bump cryptography from 41.0.4 to 41.0.6 in /docs/sphinx (#985)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.4...41.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-30 15:05:50 -07:00
Ziyue Yang 4bb0b4a380 Move MSCCL algorithm loading to initialization to workaround HIP graph conflict (#982)
* MSCCL: pre-specify channels and pre-load algorithms

* add mutex

* fix bug

* clean include

* disable all-gathers temporarily
2023-11-30 09:47:20 -08:00
Bertan Dogancay 20b02af19b Renaming unit-tests package (#987) 2023-11-29 15:05:32 -07:00
dependabot[bot] 10a7cb7556 Bump rocm-docs-core from 0.28.0 to 0.29.0 in /docs/sphinx (#980)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.28.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.28.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 09:39:33 -07:00
akolliasAMD 56ce9ef05f recreated pr 914 to work with current develop branch (#979) 2023-11-28 16:33:47 -07:00
akolliasAMD c71bae1608 npkit trace script now syncs the on average difference per rank (#981) 2023-11-28 11:03:55 -07:00
gilbertlee-amd 213869a6b4 JitterBench (#975) 2023-11-23 11:14:11 -07:00
Wenkai Du 50b2dd9fd7 Add special handling of gfx940 (#976)
* Add special handling of gfx940

* Update ring base
2023-11-22 15:07:36 -08:00
dependabot[bot] b1c746e7b5 Bump rocm-docs-core from 0.27.0 to 0.28.0 in /docs/sphinx (#969)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.28.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.27.0...v0.28.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-22 08:58:26 -07:00
Wenkai Du 569d3f7d59 msccl: allocate scratch as ext-scope fine-grained (#968) 2023-11-16 09:57:25 -06:00
searlmc1 15fa77bb57 Update README.md (#955)
Remove references to HCC, which was removed from ROCm ~2yrs ago
2023-11-15 18:01:45 -08:00
Wenkai Du bc8661f092 Fix kernel command line warnings (#961)
* Fix kernel command line warnings

* Remove while loop
2023-11-15 18:01:12 -08:00
Ziyue Yang 7fc891bc8d Fix MSCCL work FIFO allocation with HIP graph enabled (#967) 2023-11-15 16:43:28 -08:00
Bertan Dogancay 198f14923b Check to support older ROCm versions (#963) 2023-11-15 12:36:31 -07:00
Ziyue Yang 7ae95db5b8 Optimize MSCCL all-gather algorithms for gfx942 (#964) 2023-11-15 08:18:59 -08:00
Ziyue Yang df128879a6 Optimize MSCCL reduce primitive switching for gfx942 (#962)
* Optimize reduce primitive switching for gfx942

* address comment
2023-11-15 08:18:44 -08:00
Wenkai Du 5a800e00cd msccl: enable basic collective trace (#959)
To avoid increasing number of kernels, colltrace is only enabled with
RCCL_MSCCL_FORCE_FULLOPS=1
2023-11-08 20:14:28 -08:00
Bertan Dogancay 8e0258a73d Revert "Remove hip::device (#954)" (#956)
This reverts commit 7edb486154.
2023-11-07 13:40:41 -07:00
Wen-Heng (Jack) Chung efc42d9045 Use send instead of sendWithBarrier. (#727) 2023-11-07 13:47:24 -06:00
dependabot[bot] 7291144c94 Bump rocm-docs-core from 0.26.0 to 0.27.0 in /docs/sphinx (#951)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.26.0...v0.27.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-07 10:01:16 -07:00
Nusrat Islam 022735d208 Merge pull request #950 from nusislam/msccl-red2
msccl: remove cases from numReduction switch statement
2023-11-04 02:48:03 -05:00
Bertan Dogancay 7edb486154 Remove hip::device (#954) 2023-11-03 19:31:00 -06:00
Wenkai Du dbcba2923b Use parallel init of LDS and adjust P2P channels for gfx94x (#943)
* Use parallel init of LDS and adjust P2P channels for gfx94x

* Move another init to parallel

* Fix NCCL_NCHANNELS_PER_PEER setting
2023-11-03 16:06:49 -07:00
Nusrat Islam f545b94d4b msccl: remove cases from numReduction switch statement 2023-11-03 16:56:51 -05:00
gilbertlee-amd d50bab28bf Adding LaunchBench tool (#952) 2023-11-03 12:04:52 -06:00
Wenkai Du bb84345943 msccl: use 32-bit LDS access and add RCCL_MSCCL_FORCE_FULLOPS (#953) 2023-11-03 10:38:02 -07:00