Commit graph

1467 Commits

Author SHA1 Message Date
dependabot[bot] dfd5106a4b Bump rocm-docs-core from 1.5.0 to 1.6.2 in /docs/sphinx (#1287)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.5.0 to 1.6.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.5.0...v1.6.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-09 11:04:57 -06:00
Pedram Alizadeh a25ca9bb90 adding new tuning table for very large number of nodes (#1288) 2024-08-09 10:47:42 -04:00
Tim 4200964202 Adding core binding in info (#1212)
Signed-off-by: AtlantaPepsi <timhu102@amd.com>
2024-08-08 11:36:24 -04:00
Nilesh M Negi a2474846f5 [README] Tips on using less than 8 MI300 GPUs (#1270)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-06 11:12:09 -05:00
Nilesh M Negi 4f31ab85ea [BUILD] Update gfxTargets for ASAN build (#1242)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-06 10:53:51 -05:00
Ziyue Yang 145a13235a Fix number of loops in p2p-latency-test (#1286) 2024-08-05 13:35:56 -07:00
Nilesh M Negi cb2e0615d7 [BUILD] Disable MSCCLPP build by default (#1283)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-02 23:17:51 -05:00
Tim a4793286c7 Adding User Buffer Registration support for Unit test (#1199)
* Adding UBR support for UT SendRecv

Signed-off-by: Tim Hu <timhu102@amd.com>

* Update test/common/TestBedChild.cpp

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: Tim Hu <timhu102@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2024-07-30 13:39:25 -04:00
Wenkai Du ca5341d419 Restore number of parallel linking jobs (#1278)
* Restore number of parallel linking jobs

* Dynamically adjust number of linker jobs with limit of 16 jobs max

* Fix typo

* Add cgroup v1 support
2024-07-30 08:04:14 -07:00
Pedram Alizadeh b005c13292 Adding tuner plugin example for MI300 (#1274) 2024-07-29 15:43:36 -04:00
Richard Barnes d09b152aa0 Remove unused but set variable from all_reduce.h (#1258)
Allows `-Wunused-but-set-variable` to pass
2024-07-29 08:11:24 -07:00
Richard Barnes 86a4ad6e8b Remove unused but set variable from prims_ll128.h (#1257)
Allows `-Wunused-but-set-variable` to pass
2024-07-29 08:11:01 -07:00
Richard Barnes 7ad432ee23 Remove unused but set variable from prims_ll.h (#1256)
Allows `-Wunused-but-set-variable` to pass
2024-07-29 08:10:38 -07:00
akolliasAMD c246e25f8e gfx12 Disable ll protocol (#1268) 2024-07-26 08:59:55 -06:00
Sam Wu 05dca6def9 Double compile timeout for extended ci to 400 min (#1277) 2024-07-26 09:59:36 -05:00
Benjamin Kitor 4bc118336a topo_expl: Update channel masks for >64 channels (#1279) 2024-07-25 17:27:34 -07:00
Joseph Macaranas 00cd4dae1e Merge pull request #1262 from ROCm/amd/jmacaran/externalCImainline
External CI: Add triggers for mainline branch
2024-07-22 20:41:01 -04:00
corey-derochie-amd 69135976d6 Fix bug where the first collective call was using MSCCL instead of MSCCL++ (#1260) 2024-07-22 15:46:47 -06:00
saurabhAMD cf311b71ee Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing (#1265)
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing

* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
2024-07-22 10:21:29 -05:00
corey-derochie-amd b31b4082dd Only initialize MSCCL++ when runtime-enabled. (#1266) 2024-07-22 00:41:31 -06:00
mberenjk 519843d2cf adding rocprof and pytorch parser scripts (#1214)
* adding rocprof parser script

* adding the support for multiple json files

* adding pytorch profiler script

* remove filtering from pytorch log

* addressing the comments and adding the feature to parse all kernels

* completing the report for torch profiler

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2024-07-19 14:51:28 -05:00
Nusrat Islam 6f331b0d43 Enable CPX mode for MI300X (#1259)
* graph: enable cpx mode for MI300X

* graph: tune limits for cpx and cleanup
2024-07-19 11:30:37 -05:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
Nilesh M Negi a1ef217b32 Consistent channel shuffling for MI300X multi-node (#1255)
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"

This reverts commit 5be3b713ef.

* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"

This reverts commit ad31d93f3d.
2024-07-18 10:18:09 -05:00
amd-jmacaran 346fee4c83 External CI: Add triggers for mainline branch 2024-07-17 23:16:49 -04:00
Nilesh M Negi 67e867271f [GRAPH] Disable MSCCL override of no. of channels (#1187)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-15 10:45:21 -05:00
corey-derochie-amd 9cbb3da224 Only enable MSCCL++ AllReduce for message sizes that are multiples of 32 (#1253)
* Only enable MSCCL++ AllReduce for message sizes that are multiples of 32. MSCCL++ does not handle these other sizes.

* Sanitized MSCCL++ logging.
2024-07-12 17:04:23 -07:00
corey-derochie-amd 6dc47eecd7 Integrated RCCL with MSCCL++ for small message sizes (#1231) 2024-07-12 15:32:58 -06:00
Rahul Vaidya c755b9cf93 Improved version reporting in NCCL_DEBUG=VERSION (#1232)
* Improved version reporting in NCCL_DEBUG=VERSION.

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Version reporting changes

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Versioning changes: Initialized char arrays to null and fixed typo.

---------

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
2024-07-12 08:14:29 -05:00
akolliasAMD 63e4d76e23 gfx12 initial enablement (#1219) 2024-07-10 13:32:09 -06:00
akolliasAMD 7e78641dc1 cleaned codeowners file (#1247) 2024-07-09 10:31:23 -06:00
dependabot[bot] 71e0f551e7 Bump certifi from 2024.2.2 to 2024.7.4 in /docs/sphinx (#1241)
Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4.
- [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-08 11:49:35 -06:00
dependabot[bot] 799a8b5e59 Bump rocm-docs-core from 1.4.1 to 1.5.0 in /docs/sphinx (#1240)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.1 to 1.5.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.1...v1.5.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-08 10:59:32 -06:00
corey-derochie-amd 0c36d571ea Enable multi-threading for MSCCL (#1203)
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
2024-07-04 09:34:38 -06:00
Wenkai Du 45f3fbc52f Checking kernel header files only when missing sysfs entry (#1239) 2024-07-03 15:53:15 -07:00
dependabot[bot] aeaaacad26 Bump rocm-docs-core from 1.4.0 to 1.4.1 in /docs/sphinx (#1233)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.0...v1.4.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-03 09:32:49 -06:00
Nilesh M Negi 5be3b713ef [GRAPH] Use channel shuffling only for IB systems (#1228)
* [GRAPH] Use channel shuffling only for IB systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Define channels=48 for gfx94 RoCE systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [GRAPH] Increase channels for RoCE gfx94 systems

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-07-02 12:20:40 -05:00
Wenkai Du 9d8f68b4ee NPKit: separate time stamps for GPU access from different blocks (#1229)
To avoid races in memory access in GPU
2024-06-28 08:00:22 -07:00
Nusrat Islam b09ea29d66 graph: fix minNchannels for multi-node overwrite (#1230) 2024-06-26 16:56:10 -05:00
Wenkai Du ad31d93f3d Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)
This reverts channel stride change in commit
0948eecbba
2024-06-25 14:03:30 -07:00
saurabhAMD e170f41ddd Unit Tests for testing channels (#1222) 2024-06-25 10:10:10 -05:00
Jack Taylor 5f2b88bc28 Add pytorch rccl/intra node all-reduce benchmark (#1221)
* Add gpt-fast pytorch all reduce benchmark script

* Update readme instructions

* Minor changes
2024-06-25 08:04:38 -07:00
Nusrat Islam 9f2514e5c8 Merge pull request #1223 from nusislam/minNchannels-multinode
graph: fix minNchannels for multi-node
2024-06-25 10:03:35 -05:00
Wenkai Du 5d7078e383 Fix DMABUF support (#1218)
* Fix DMABUF support

* Reduce log output by moving dmabuf allocation details to TRACE

* Enable peer memory GDR support if ib_umem_get_peer is in kernel
2024-06-25 08:00:15 -07:00
Nusrat Islam 05df0f8cea graph: fix minNchannels for multi-node
Multi-node rccl was not correctly setting the minNchannels value. This
PR fixes the bug.
2024-06-24 16:42:44 -05:00
dependabot[bot] 1ddb02c010 Bump urllib3 from 2.2.1 to 2.2.2 in /docs/sphinx (#1215)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-18 14:52:24 -06:00
dependabot[bot] 53dcfcc5e0 Bump rocm-docs-core from 1.1.3 to 1.4.0 in /docs/sphinx (#1213)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.1.3 to 1.4.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.1.3...v1.4.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-17 09:32:27 -06:00
Sam Wu 9f01acc030 Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core (#1190)
* Add doc team as owners of RTD config

* Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core
2024-06-14 12:12:22 -06:00
saurabhAMD 959545dce2 Merge pull request #1211 from saurabhAMD/channel
enable UT to test with channels greater than 64
2024-06-13 14:38:38 -05:00
saurabhAMD 392a73fdef enable UT to test with channels greater than 64 2024-06-13 13:54:08 -05:00