Commit graph

1583 Commits

Author SHA1 Message Date
Tim 38f91fa2c8 reverting change to RcclReplayer (#1657)
[ROCm/rccl commit: 45e1c3f3e2]
2025-04-23 15:36:46 -04:00
Jeffrey Novotny fb1fdef8e2 Fix broken link to RCCL Replayer GitHub info (#1655)
[ROCm/rccl commit: df778b4ea1]
2025-04-23 14:17:31 -04:00
gilbertlee-amd 8023be9355 Adding UT_DEBUG_PAUSE to unit tests (#1653)
[ROCm/rccl commit: ee85a70bb4]
2025-04-21 21:15:07 -06:00
Bertan Dogancay aac829125c Fix NPKit for SendRecv (#1651)
[ROCm/rccl commit: ac8ec4c08c]
2025-04-21 12:34:47 -04:00
Tim 58ee618194 RCCL Replayer update (#1603)
RCCL recorder with suggested change and unit test

[ROCm/rccl commit: 9a55ff60a9]
2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar de2b66921a Address nested designator compiler warning issue (#1633)
[ROCm/rccl commit: 52bfdf05dc]
2025-04-18 17:09:50 -04:00
Nusrat Islam 691e98940c Fix MSCCLPP accuracy issue for allreduce7 (#1634)
* ext-src: fix a graph-mode bug in allreduce7

* change MSCCLPP threshold to 16MB

* ext-src: change message size threshold for allreduce7

* ext-src: address review comments

[ROCm/rccl commit: f20c33effd]
2025-04-18 08:54:32 -05:00
AbandiGa acf0bc1c6e added copyright (#1635)
[ROCm/rccl commit: 7a84c5dbb0]
2025-04-14 09:46:18 -05:00
Nilesh M Negi 708c053b21 Update Dockerfile to use CMake-based build (#1630)
* [DOCKER] Update Dockerfile to switch to CMake build

* Fix typo in Dockerfile.ubuntu

* Add README to docker sub-dir

* Update Dockerfile and README

* Modify markdown headings in docker/README

* Update docs

* Fix typo in docs

* Update docs/install/docker-install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/install/docker-install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/install/docker-install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docker/README

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: bd1a5b38b6]
2025-04-10 11:40:10 -05:00
Dingming Wu 63c6180130 Adding #include <dlfcn.h> in profiler.cc to pass build (#1632)
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```

[ROCm/rccl commit: 1786c0268b]
2025-04-10 08:48:18 -07:00
Arm Patinyasakdikul f29d59aa00 Add device synchronization before destroying proxy thread. (#1631)
This commit ensures that the GPU finishes all kernels before the
communicator thread is destroyed.

[ROCm/rccl commit: 52654e2301]
2025-04-10 10:44:16 -05:00
Pedram Alizadeh 93ac2ea61e all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627)
* Enabling LL128 by default on MI300

* Add missing CUDACHECK

* Adjust BW correction factors to fix the Tree->Ring switching point

* Refactor and add ll128 AR logarithmic factor to tuning models

* Move RCCL tuning changes to a separate file 

* Use enum for tunable indexing

* Use explicit indexing in tuning models to avoid mismatch issues

* Place rcclGetSizePerRank in a function

* Remove HIP ifdef for rccl-only call

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

[ROCm/rccl commit: e40ff4f84a]
2025-04-10 11:43:54 -04:00
Pedram Alizadeh b225281747 single-node AR msccl algorithm tuning for MI300 (#1629)
[ROCm/rccl commit: 5b36b68d06]
2025-04-10 10:42:28 -04:00
dependabot[bot] 464d69963d Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx (#1625)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.18.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: b6d97a6176]
2025-04-09 17:00:31 -06:00
Nikhil-Nunna ee9da06c80 Fetching RCCL version from shared object file. (#1569)
* added rccl version using rccl-tests

* Added function to get rccl version from rccl-tests

* removed whitespace

* Added rccl version

* Updated readme and fixed formatting

* removed debug prints

[ROCm/rccl commit: 3dc0478722]
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar 2f4cd5718e Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618)
* Enable LL/LL128 cutoff points in tuning models

* Initializing ll/ll128 model cutoffs for MI300

* Use RCCL_LL_LIMITS_UNDEFINED

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>

[ROCm/rccl commit: 4be06f04d8]
2025-04-02 16:26:23 -04:00
Mustafa Abduljabbar 0a81478bd9 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility 

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm

[ROCm/rccl commit: aace4e27f8]
2025-04-02 09:47:29 -04:00
dependabot[bot] 25dafc0c82 Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx (#1599)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: ffe255d285]
2025-04-01 17:08:53 -06:00
Nusrat Islam f599690ce3 ext-src: fix mscclpp correctness issue (#1615)
* ext-src: fix mscclpp correctness issue

* ext-src: remove white-space warnings

[ROCm/rccl commit: 4a29bba3c6]
2025-04-01 15:02:16 -05:00
Istvan Kiss 858fa4e65d Add documentation for NPS4 and CPX partition modes (#1555)
[ROCm/rccl commit: 28ab8603d2]
2025-03-31 09:25:25 -06:00
Nilesh M Negi 1a2eca1756 Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
* Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)"

This reverts commit cf17cff5b6.

* [DOC] Update Changelog

* [DOC] Update CHANGELOG

[ROCm/rccl commit: b17338d164]
2025-03-28 17:57:06 -05:00
Bertan Dogancay b737d8c222 Merge pull request #1559 from BertanDogancay/2.23
[SYNC] 2.23.4-1

[ROCm/rccl commit: 532f54c244]
2025-03-28 17:06:56 -04:00
Nilesh M Negi a7ec191754 [TEST] Switch to googletest release 1.12.0 (#1621)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 0e2c461c6c]
2025-03-28 12:39:42 -05:00
Nilesh M Negi 210f90ae0f [SRC] Enable unroll=1 for gfx950 (#1602)
* [SRC] Enable unroll=1 for gfx950

* Fix typo from rebase in generate.py

* Support for unroll=1 and gfx90a when building for all GPU targets

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 307bc10781]
2025-03-27 18:21:35 -05:00
BertanDogancay 8ed27fde74 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 0b2062c560]
2025-03-27 12:53:04 -05:00
isaki001 f0c853438c Disable mscclpp (#1614)
* disable mscclpp by default

[ROCm/rccl commit: 9dc23d9265]
2025-03-25 15:21:16 -05:00
Nilesh M Negi 290ca7deca [CI] Fix warnings from pytest in Azure CI (#1617)
* [CI] Fix warnings from pytest

* [CI] Append to LD_LIBRARY_PATH instead of overwrite

---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 565f0ade60]
2025-03-22 23:22:53 -05:00
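
The LD_LIBRARY_PATH item in the entry above (#1617) can be illustrated with a minimal Python sketch. The helper name `append_library_path` and the directories are hypothetical and not taken from the RCCL CI scripts; the sketch only shows the general idea of appending to an existing value instead of replacing it.

```python
import os


def append_library_path(env, new_dir, name="LD_LIBRARY_PATH"):
    """Return a copy of env with new_dir appended to the colon-separated variable."""
    updated = dict(env)
    # Keep whatever the variable already holds; drop empty segments.
    entries = [p for p in updated.get(name, "").split(":") if p]
    if new_dir not in entries:
        entries.append(new_dir)
    updated[name] = ":".join(entries)
    return updated


if __name__ == "__main__":
    # Hypothetical build directory, used only for illustration.
    env = append_library_path(os.environ, "/tmp/rccl/build")
    print(env["LD_LIBRARY_PATH"])
```

Overwriting the variable would hide any library directories that the CI environment had already configured; appending keeps them ahead of the new entry in the search order.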
Nilesh M Negi 8cfbc0fbd1 [UT] Increase stack size for StandaloneTests to 480 (#1616)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: d6b987a53f]
2025-03-21 21:33:32 -05:00
dependabot[bot] 68672b9b3c Bump jinja2 from 3.1.5 to 3.1.6 in /docs/sphinx (#1591)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: fe3f3aae17]
2025-03-21 17:17:32 -06:00
dependabot[bot] 83c59c859a Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx (#1543)
Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/43.0.1...44.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: dd61df75ab]
2025-03-21 17:17:19 -06:00
gilbertlee-amd 4f67522420 Removing the experimental clique kernel files (#1610)
[ROCm/rccl commit: 626dc50ab5]
2025-03-20 18:10:01 -06:00
Wenkai Du e86b217182 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off with FAULT_INJECTION=OFF in CMake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock

[ROCm/rccl commit: 90ad586d94]
2025-03-20 16:11:43 -07:00
gilbertlee-amd 12c1fe8fdf Pseudo-randomly adding zero-byte sends in AllToAllv unit test (#1597)
[ROCm/rccl commit: 9a4e49ff1a]
2025-03-20 17:00:48 -06:00
amd-jmacaran 72454ece3e [Azure CI] Minor environment setup fixes
- Add extra deletion to ensure source workspace is clean for the job.
- pytest expects function names to start with test_


[ROCm/rccl commit: 805448261d]
2025-03-20 15:36:10 -04:00
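
The pytest naming note in the entry above can be sketched as follows. pytest's default collection rules pick up functions whose names begin with `test`; the function names and the `names_pytest_would_collect` helper below are hypothetical, not taken from the RCCL CI scripts, and the helper only approximates the name-prefix part of collection.

```python
def test_world_size_is_positive():
    # Collected by pytest under its default naming rules.
    world_size = 8  # hypothetical value for illustration
    assert world_size > 0


def check_world_size():
    # Not collected: the name lacks the test_ prefix.
    assert True


def names_pytest_would_collect(namespace):
    """Approximate the default function filter by name prefix only."""
    return sorted(
        name for name, obj in namespace.items()
        if callable(obj) and name.startswith("test_")
    )
```

A function that performs valid checks but is named `check_world_size` would be skipped without any failure being reported, which is why the prefix matters.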
corey-derochie-amd e95578ef4c removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>

[ROCm/rccl commit: 6505639cf4]
2025-03-20 09:34:53 -06:00
Joseph Macaranas a95b2c9fb7 Introduce framework to run Azure Pipeline jobs (#1608)
- PR runs targeting develop will require approval from at least one person in a predefined list.
- Nightly runs are separated and will not require approval.
- Sample pytest script is provided for expanding test coverage.


[ROCm/rccl commit: 6403afff4a]
2025-03-18 17:29:44 -04:00
Wenkai Du c6f4c8d17a GDRCOPY support: Off by default (#1605)
[ROCm/rccl commit: bd0092e8f1]
2025-03-18 08:17:01 -07:00
Avinash 5a25b110af Memory leak fix when numIBDevices = 0 (#1429)
* Initial commit for testing

* Fix memory leak in checkOptions

* Fix memory leak in checkOption

* x

* Delete cmake-3.28.2-linux-x86_64.sh

* gcn changes

* gcn memleak fixes

* gcn leak fix

* memory leak fixes for parseRome4P2H and ncclTopoAddGPU

* Keeping only necessary file for fixes

Deleting temporary scripts I created for debugging and testing

* changing to GCN_ARCH_NAME_LEN

* Added sanity check directory

* refactoring scripts

* Updated to sanity checks folder

* Initial fixes

* changes in tools

* pointing RCCL lib build to debug version

* Removed second pthread_detach

* Removing sanity checks

* Keeping only code changes

* addressing memory leaks in ncclIbinit

---------

Co-authored-by: Chao Chen <cchen104@amd.com>

[ROCm/rccl commit: ccb0820743]
2025-03-17 11:21:19 -05:00
mberenjk a3a598efb3 Skipping AllReduce test on more than 8 ranks for FP8 type on Hyabusa (#1598)
* Skipping AllReduce FP8 test on 9 to 16 ranks (gfx90a) because it uses the Tree algorithm, not Ring

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 5f691aaf65]
2025-03-17 10:22:49 -05:00
Mustafa Abduljabbar 74d6537141 Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 (#1604)
* Add reduce_scatter LL and LL128 thresholds

* Always honor user choice for protocol

[ROCm/rccl commit: f67b2cc908]
2025-03-17 11:21:01 -04:00
Wenkai Du 0bd40f5a87 Enable LL128 on gfx942 (#1549)
[ROCm/rccl commit: 245c2de909]
2025-03-16 15:10:05 -07:00
Nilesh M Negi cf17cff5b6 [GRAPH] Increase default nChannels to 112 for gfx950 (#1596)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 1df73e209e]
2025-03-14 14:47:03 -07:00
Wenkai Du afd04a5117 Limit P2P channels per peer to not exceed max channels (#1594)
* Limit P2P channels per peer to not exceed max channels

* [UT] test single GPU cases for all collectives

* [UT] fix out of range root value

[ROCm/rccl commit: 4237caad69]
2025-03-11 09:32:09 -07:00
Tim a79fa36b77 mscclpp compatibility check for ubr (#1573)
[ROCm/rccl commit: f6c6d451a9]
2025-03-09 22:10:47 -04:00
isaki001 14be1c9a7a fix the size of the recv buffer in AllGather UBR test (#1564)
[ROCm/rccl commit: 59c55842f1]
2025-03-05 11:42:15 -06:00
Nusrat Islam e7c90e0a46 misc/msccl: force use of mscclpp (#1581)
[ROCm/rccl commit: ac823818aa]
2025-03-04 12:48:59 -06:00
Bertan Dogancay d1247bbf2a [Transport] Fix IntraNet (#1582)
[ROCm/rccl commit: d88cca3098]
2025-03-04 13:30:36 -05:00
Wenkai Du 086fa823db NPKit: enable reduce scatter profiling (#1580)
[ROCm/rccl commit: f957c4fe22]
2025-03-04 10:03:56 -08:00
Nilesh M Negi 751370bb70 [BUILD] Enable multiple GPU targets in MSCCLPP (#1574)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 063c6cfc11]
2025-03-01 22:28:42 -06:00
dependabot[bot] 977d04cb9a Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx (#1558)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.15.0 to 1.17.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: a6b2ca224e]
2025-02-28 16:32:38 -07:00