Commit Graph

1728 Commits

Author SHA1 Message Date
Avinash c81ea25407 collective trace improvements for debugging (#1661)
[ROCm/rccl commit: c54a0c085a]
2025-05-07 13:37:31 -05:00
Bertan Dogancay c75ebd9147 Merge pull request #1662 from BertanDogancay/2.25
[SYNC] 2.25.1-1

[ROCm/rccl commit: 590ad6acc2]
2025-05-06 09:39:09 -04:00
Mustafa Abduljabbar 750bd73047 Add missing MACRO to topo_expl (#1677)
* Fix header compatibility

[ROCm/rccl commit: fdad89690b]
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar ab4a3eb0c1 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank

[ROCm/rccl commit: f3f3336468]
2025-05-05 15:26:29 -04:00
Siu Chi Chan be0761502d rccl-UnitTests - link to dl library (#1673)
[ROCm/rccl commit: 9525c5b2ef]
2025-05-02 21:20:22 -05:00
Bertan Dogancay b435c75068 [Graph] Try using P2P by default (#1670)
[ROCm/rccl commit: acfac55516]
2025-05-02 11:54:30 -04:00
Nilesh M Negi a6972c0d09 Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
* Revert "[SRC] Enable unroll=1 for gfx950 (#1602)"
This reverts commit 210f90ae0f.

* Update Changelog

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 329e13efff]
2025-04-30 23:33:08 -05:00
deeksha-amd 5580cb7574 Added new tests for improving the code coverage (#1656)
Signed-off-by: Deeksha Goplani <deeksha.goplani@amd.com>

[ROCm/rccl commit: 2486838465]
2025-04-30 18:01:11 -05:00
isaki001 de76d7f649 Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 (#1607)
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024

* add compilation flag for enabling/disabling clipping in mscclpp

* change flag name for consistency, set flag to OFF

* add compilation flag in rccl for enabling clipping in mscclpp

* set 1024 threads for mscclpp allreduce8 only for bfloat16

* fix improper description for ENABLE_MSCCLPP_CLIP flag

* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"

This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.

* update clip remove-clip.patch for rebase

[ROCm/rccl commit: 8145c4f3b8]
2025-04-30 16:42:28 -05:00
BertanDogancay 064062ef70 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: cb6e23ae67]
2025-04-30 13:31:41 -05:00
Tim 33c390b36e minor fix for empty scope (group) (#1666)
[ROCm/rccl commit: dc0c5f9153]
2025-04-30 13:29:13 -04:00
Richard Barnes f2d30a163b Enable -Wall (#1644)
[ROCm/rccl commit: 7961624167]
2025-04-24 10:45:46 -07:00
Bertan Dogancay eb50c947eb Merge pull request #1645 from corey-derochie-amd/nccl-2.24
NCCL Sync 2.24.3-1

[ROCm/rccl commit: f8067a76dc]
2025-04-24 10:08:58 -04:00
Mustafa Abduljabbar a85cfaa680 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML

[ROCm/rccl commit: aa7991dfc8]
2025-04-24 09:02:03 -04:00
BertanDogancay d045d0ca23 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a6bf9bfc9e]
2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 07620c7efd Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool

[ROCm/rccl commit: 82afb2bcfe]
2025-04-23 15:44:56 -04:00
Tim 38f91fa2c8 reverting change to RcclReplayer (#1657)
[ROCm/rccl commit: 45e1c3f3e2]
2025-04-23 15:36:46 -04:00
Jeffrey Novotny fb1fdef8e2 Fix broken link to RCCL Replayer GitHub info (#1655)
[ROCm/rccl commit: df778b4ea1]
2025-04-23 14:17:31 -04:00
gilbertlee-amd 8023be9355 Adding UT_DEBUG_PAUSE to unit tests (#1653)
[ROCm/rccl commit: ee85a70bb4]
2025-04-21 21:15:07 -06:00
Bertan Dogancay aac829125c Fix NPKit for SendRecv (#1651)
[ROCm/rccl commit: ac8ec4c08c]
2025-04-21 12:34:47 -04:00
Tim 58ee618194 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT

[ROCm/rccl commit: 9a55ff60a9]
2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar de2b66921a Address nested designator compiler warning issue (#1633)
[ROCm/rccl commit: 52bfdf05dc]
2025-04-18 17:09:50 -04:00
Nusrat Islam 691e98940c Fix MSCCLPP accuracy issue for allreduce7 (#1634)
* ext-src: fix a graph-mode bug in allreduce7

* change MSCCLPP threshold to 16MB

* ext-src: change message size threshold for allreduce7

* ext-src: address review comments

[ROCm/rccl commit: f20c33effd]
2025-04-18 08:54:32 -05:00
AbandiGa acf0bc1c6e added copyright (#1635)
[ROCm/rccl commit: 7a84c5dbb0]
2025-04-14 09:46:18 -05:00
Nilesh M Negi 708c053b21 Update Dockerfile to use CMake-based build (#1630)
* [DOCKER] Update Dockerfile to switch to CMake build

* Fix typo in Dockerfile.ubuntu

* Add README to docker sub-dir

* Update Dockerfile and README

* Modify markdown headings in docker/README

* Update docs

* Fix typo in docs

* Update docs/install/docker-install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/install/docker-install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/install/docker-install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docker/README

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: bd1a5b38b6]
2025-04-10 11:40:10 -05:00
Dingming Wu 63c6180130 Adding #include <dlfcn.h> in profiler.cc to pass build (#1632)
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```

[ROCm/rccl commit: 1786c0268b]
2025-04-10 08:48:18 -07:00
Arm Patinyasakdikul f29d59aa00 Add device synchronization before destroying proxy thread. (#1631)
This commit ensures that the GPU finishes all kernels before destroying
the communicator thread.

[ROCm/rccl commit: 52654e2301]
2025-04-10 10:44:16 -05:00
Pedram Alizadeh 93ac2ea61e all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627)
* Enabling LL128 by default on MI300

* Add missing CUDACHECK

* Adjust BW correction factors to fix the Tree->Ring switching point

* Refactor and add ll128 AR logarithmic factor to tuning models

* Move RCCL tuning changes to a separate file 

* Use enum for tunable indexing

* Use explicit indexing in tuning models to avoid mismatch issues

* Place rcclGetSizePerRank in a function

* Remove HIP ifdef for rccl-only call

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

[ROCm/rccl commit: e40ff4f84a]
2025-04-10 11:43:54 -04:00
Pedram Alizadeh b225281747 single-node AR msccl algorithm tuning for MI300 (#1629)
[ROCm/rccl commit: 5b36b68d06]
2025-04-10 10:42:28 -04:00
dependabot[bot] 464d69963d Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx (#1625)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.18.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: b6d97a6176]
2025-04-09 17:00:31 -06:00
Nikhil-Nunna ee9da06c80 Fetching RCCL version from shared object file. (#1569)
* added rccl version using rccl-tests

* Added function to get rccl version from rccl-tests

* removed whitespace

* Added rccl version

* Updated readme and fixed formatting

* removed debug prints

[ROCm/rccl commit: 3dc0478722]
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar 2f4cd5718e Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618)
* Enable LL/LL128 cutoff points in tuning models

* Initializing ll/ll128 model cutoffs for MI300

* Use RCCL_LL_LIMITS_UNDEFINED

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>

[ROCm/rccl commit: 4be06f04d8]
2025-04-02 16:26:23 -04:00
Mustafa Abduljabbar 0a81478bd9 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility 

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm

[ROCm/rccl commit: aace4e27f8]
2025-04-02 09:47:29 -04:00
dependabot[bot] 25dafc0c82 Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx (#1599)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: ffe255d285]
2025-04-01 17:08:53 -06:00
Nusrat Islam f599690ce3 ext-src: fix mscclpp correctness issue (#1615)
* ext-src: fix mscclpp correctness issue

* ext-src: remove white-space warnings

[ROCm/rccl commit: 4a29bba3c6]
2025-04-01 15:02:16 -05:00
Istvan Kiss 858fa4e65d Add documentation for NPS4 and CPX partition modes (#1555)
[ROCm/rccl commit: 28ab8603d2]
2025-03-31 09:25:25 -06:00
Nilesh M Negi 1a2eca1756 Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
* Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)"

This reverts commit cf17cff5b6.

* [DOC] Update Changelog

* [DOC] Update CHANGELOG

[ROCm/rccl commit: b17338d164]
2025-03-28 17:57:06 -05:00
Bertan Dogancay b737d8c222 Merge pull request #1559 from BertanDogancay/2.23
[SYNC] 2.23.4-1

[ROCm/rccl commit: 532f54c244]
2025-03-28 17:06:56 -04:00
Nilesh M Negi a7ec191754 [TEST] Switch to googletest release 1.12.0 (#1621)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 0e2c461c6c]
2025-03-28 12:39:42 -05:00
Nilesh M Negi 210f90ae0f [SRC] Enable unroll=1 for gfx950 (#1602)
* [SRC] Enable unroll=1 for gfx950

* Fix typo from rebase in generate.py

* Support for unroll=1 and gfx90a when building for all GPU targets

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 307bc10781]
2025-03-27 18:21:35 -05:00
BertanDogancay 8ed27fde74 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 0b2062c560]
2025-03-27 12:53:04 -05:00
isaki001 f0c853438c Disable mscclpp (#1614)
* disable mscclpp by default

[ROCm/rccl commit: 9dc23d9265]
2025-03-25 15:21:16 -05:00
Nilesh M Negi 290ca7deca [CI] Fix warnings from pytest in Azure CI (#1617)
* [CI] Fix warnings from pytest

* [CI] Append to LD_LIBRARY_PATH instead of overwrite

---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 565f0ade60]
2025-03-22 23:22:53 -05:00
Nilesh M Negi 8cfbc0fbd1 [UT] Increase stack size for StandaloneTests to 480 (#1616)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: d6b987a53f]
2025-03-21 21:33:32 -05:00
dependabot[bot] 68672b9b3c Bump jinja2 from 3.1.5 to 3.1.6 in /docs/sphinx (#1591)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: fe3f3aae17]
2025-03-21 17:17:32 -06:00
dependabot[bot] 83c59c859a Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx (#1543)
Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/43.0.1...44.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: dd61df75ab]
2025-03-21 17:17:19 -06:00
gilbertlee-amd 4f67522420 Removing the experimental clique kernel files (#1610)
[ROCm/rccl commit: 626dc50ab5]
2025-03-20 18:10:01 -06:00
Wenkai Du e86b217182 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off with FAULT_INJECTION=OFF in cmake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock

[ROCm/rccl commit: 90ad586d94]
2025-03-20 16:11:43 -07:00
gilbertlee-amd 12c1fe8fdf Pseudo-randomly adding zero-byte sends in AllToAllv unit test (#1597)
[ROCm/rccl commit: 9a4e49ff1a]
2025-03-20 17:00:48 -06:00
amd-jmacaran 72454ece3e [Azure CI] Minor environment setup fixes
- Add extra deletion to ensure source workspace is clean for the job.
- pytest expects function names to start with test_

[ROCm/rccl commit: 805448261d]
2025-03-20 15:36:10 -04:00