Commit graph

1438 commits

Author SHA1 Message Date
akolliasAMD 63e4d76e23 gfx12 initial enablement (#1219) 2024-07-10 13:32:09 -06:00
akolliasAMD 7e78641dc1 cleaned codeowners file (#1247) 2024-07-09 10:31:23 -06:00
dependabot[bot] 71e0f551e7 Bump certifi from 2024.2.2 to 2024.7.4 in /docs/sphinx (#1241)
Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4.
- [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-08 11:49:35 -06:00
dependabot[bot] 799a8b5e59 Bump rocm-docs-core from 1.4.1 to 1.5.0 in /docs/sphinx (#1240)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.1 to 1.5.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.1...v1.5.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-08 10:59:32 -06:00
corey-derochie-amd 0c36d571ea Enable multi-threading for MSCCL (#1203)
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
2024-07-04 09:34:38 -06:00
Wenkai Du 45f3fbc52f Checking kernel header files only when missing sysfs entry (#1239) 2024-07-03 15:53:15 -07:00
dependabot[bot] aeaaacad26 Bump rocm-docs-core from 1.4.0 to 1.4.1 in /docs/sphinx (#1233)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.0...v1.4.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-03 09:32:49 -06:00
Nilesh M Negi 5be3b713ef [GRAPH] Use channel shuffling only for IB systems (#1228)
* [GRAPH] Use channel shuffling only for IB systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Define channels=48 for gfx94 RoCE systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Increase channels for RoCE gfx94 systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-02 12:20:40 -05:00
Wenkai Du 9d8f68b4ee NPKit: separate time stamps for GPU access from different blocks (#1229)
To avoid races in memory access in GPU
2024-06-28 08:00:22 -07:00
Nusrat Islam b09ea29d66 graph: fix minNchannels for multi-node overwrite (#1230) 2024-06-26 16:56:10 -05:00
Wenkai Du ad31d93f3d Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)
This reverts channel stride change in commit
0948eecbba
2024-06-25 14:03:30 -07:00
saurabhAMD e170f41ddd Unit Tests for testing channels (#1222) 2024-06-25 10:10:10 -05:00
Jack Taylor 5f2b88bc28 Add pytorch rccl/intra node all-reduce benchmark (#1221)
* Add gpt-fast pytorch all reduce benchmark script

* Update readme instructions

* Minor changes
2024-06-25 08:04:38 -07:00
Nusrat Islam 9f2514e5c8 Merge pull request #1223 from nusislam/minNchannels-multinode
graph: fix minNchannels for multi-node
2024-06-25 10:03:35 -05:00
Wenkai Du 5d7078e383 Fix DMABUF support (#1218)
* Fix DMABUF support

* Reduce log output by moving dmabuf allocation details to TRACE

* Enable peer memory GDR support if ib_umem_get_peer is in kernel
2024-06-25 08:00:15 -07:00
Nusrat Islam 05df0f8cea graph: fix minNchannels for multi-node
Multi-node rccl was not correctly setting the minNchannels value. This
PR fixes the bug.
2024-06-24 16:42:44 -05:00
dependabot[bot] 1ddb02c010 Bump urllib3 from 2.2.1 to 2.2.2 in /docs/sphinx (#1215)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-18 14:52:24 -06:00
dependabot[bot] 53dcfcc5e0 Bump rocm-docs-core from 1.1.3 to 1.4.0 in /docs/sphinx (#1213)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.1.3 to 1.4.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.1.3...v1.4.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-17 09:32:27 -06:00
Sam Wu 9f01acc030 Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core (#1190)
* Add doc team as owners of RTD config

* Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core
2024-06-14 12:12:22 -06:00
saurabhAMD 959545dce2 Merge pull request #1211 from saurabhAMD/channel
enable UT to test with channels greater than 64
2024-06-13 14:38:38 -05:00
saurabhAMD 392a73fdef enable UT to test with channels greater than 64 2024-06-13 13:54:08 -05:00
Paul Emberson 435756af02 fix initOnceFunc setting incorrect result code (#1205)
Addresses DMA-BUF support check unexpectedly failing
2024-06-07 16:47:19 -07:00
Nusrat Islam 9660e2e2dc Merge pull request #1200 from nusislam/multi-node-256-fix
graph: fix multi-node channel count
2024-06-07 14:34:20 -05:00
Nilesh M Negi d9661c17e6 Fix min_nchannels bug for gfx94* nranks=4 (#1202)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-06-07 14:31:28 -05:00
gilbertlee-amd 9b94a1052f Disabling NUMA matching for model 79 for some VM configs (#1204) 2024-06-06 17:15:04 -06:00
Nusrat Islam 526cce9bf4 graph: restrict maxChannels to 64 for multi-node and RCCL_ENABLE_INTRANET=1 2024-06-06 10:58:41 -05:00
Nusrat Islam 6ab20a7c6b graph: fix multi-node minChannel count 2024-06-06 10:56:39 -05:00
Wenkai Du 9fcd7b55e1 Allow multiple parameters during selective function generation (#1201)
* Allow multiple parameters during selective function generation

* Remove debug print

* Add examples into Generator.cmake
2024-06-06 07:07:24 -07:00
Nusrat Islam 955347bab4 Merge pull request #1184 from nusislam/256-channel-2
add 256 channels support
2024-06-04 08:25:34 -05:00
Nusrat Islam 9746d8ca3f set MIN_NCHANNEL limit to 64 for multi-node 2024-06-03 13:05:05 -05:00
Nusrat Islam 0634c5c8e1 doubling debug buffer size with increased channels 2024-06-03 13:05:05 -05:00
Nusrat Islam ef442f8f92 set MAXCHANNELS to 128 2024-06-03 13:05:05 -05:00
Nusrat Islam 9f654f6cf5 graph: restrict MAXCHANNELS for certain platforms 2024-06-03 13:05:01 -05:00
Nusrat Islam 48859a97b1 device: update the logic for channelId assignment 2024-06-03 13:03:18 -05:00
Nusrat Islam 506f16c506 add 256 channels support 2024-06-03 13:03:18 -05:00
akolliasAMD 6475da2ed9 fixed typo on BFD linkage (#1192) 2024-06-03 10:05:47 -06:00
gilbertlee-amd 0948eecbba Changing channel stride for MI300X multinode (#1196)
* Shuffling MI300X multi-node channels
* Updating tree channel logic
2024-06-03 10:00:55 -06:00
srawat 3301cdf59a doc organization (#1197)
* doc organization

* removing what is rccl file

* Update index.rst
2024-06-03 18:38:45 +05:30
ClementLinCF cab25f919e Optimize NCHANNELS and MSCCL config for gfx942 80CUs (#1195)
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs

Set appropriately for different NCCL_MIN_NCHANNELS and MSCCL config,
potentially improving communication perf on the MI300x 80CUs

* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml

* Change the factor of gfx94 and update msccl config
2024-06-01 07:07:46 -07:00
Nilesh M Negi 5aaf7121d9 [BUILD] Update install.sh for RCCL build (#1191)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-05-31 17:58:34 -05:00
Nilesh M Negi 1249a6c3fd [MSCCL]: Move scratch buffer debug msgs to TRACE (#1189)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-05-31 17:54:23 -05:00
gilbertlee-amd 354e0b29a6 Addressing possible out-of-bounds mem access during channel duplication (#1193) 2024-05-30 14:02:14 -06:00
Wenkai Du 73221b4230 Add ring simple chunk size tuning (#1180)
* Add ring simple chunk size tuning

* modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning

* modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
2024-05-29 07:59:47 -07:00
Wenkai Du 8f099b1adb Make WSL detection thread safe (#1178)
* Make WSL detection thread safe

* Change static to beginning

* Switch to use atomics
2024-05-28 17:23:50 -07:00
Edgar Gabriel a78c4f5e88 Merge pull request #1175 from edgargabriel/topic/alt_rsmi
add alternative to rocm_smi_lib
2024-05-28 09:36:55 -05:00
Joseph Macaranas e4c10e4438 Merge pull request #1188 from ROCm/amd/jmacaran/externalCIEnablement
Enable external CI pipeline triggers
2024-05-27 15:52:49 -04:00
amd-jmacaran 125f841c5f Enable external CI pipeline triggers 2024-05-23 16:29:05 -04:00
dependabot[bot] b06e617bea --- (#1183)
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-21 09:34:41 -06:00
Wenkai Du eeea3b693b Report error when collective is not enabled in build (#1177)
* Report error when collective is not enabled in build

* Fix typo
2024-05-16 10:11:12 -07:00
Edgar Gabriel 9ad913bfa8 add alternative to rocm_smi_lib 2024-05-14 13:51:41 -07:00