Commit Graph

1593 Commits

Author SHA1 Message Date
isaki001 25150b1f20 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package

[ROCm/rccl commit: d89432e8c8]
2025-01-20 08:06:43 -06:00
corey-derochie-amd 8e6bedeedc Increased gfx90a stack size expectation to 320 to match latest compiler. (#1487)
[ROCm/rccl commit: c68b558ed5]
2025-01-16 17:04:51 -07:00
amd-garydeng 2bffd86dff use rocjenkins xml and change name (#1489)
[ROCm/rccl commit: bca6f1620f]
2025-01-16 16:49:06 -07:00
corey-derochie-amd ebacc24598 Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options (#1418)
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.

* Cast the linger value to `int` sooner to avoid a scope of unknown typed-ness.

* Added CHANGELOG entry for this feature.

[ROCm/rccl commit: 2e35417fe5]
2025-01-14 10:26:04 -07:00
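
The socket-option control described in this commit can be sketched roughly as follows. This is a minimal Python illustration for Linux, not RCCL's code: the environment variable names `EXAMPLE_SOCKET_REUSEADDR` and `EXAMPLE_SOCKET_LINGER` and the helper `configure_socket` are hypothetical. Only the standard `SO_REUSEADDR` and `SO_LINGER` options come from the commit message.

```python
import os
import socket
import struct


def configure_socket(sock, env=os.environ):
    """Apply SO_REUSEADDR / SO_LINGER according to (hypothetical) env vars."""
    # Hypothetical variable names; RCCL's real parameter names are not shown here.
    reuse = int(env.get("EXAMPLE_SOCKET_REUSEADDR", "1"))
    linger_secs = int(env.get("EXAMPLE_SOCKET_LINGER", "-1"))

    if reuse:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    # A negative value means "leave the OS default alone".
    if linger_secs >= 0:
        # Mirrors struct linger { int l_onoff; int l_linger; } on Linux.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, linger_secs))


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
configure_socket(sock, {"EXAMPLE_SOCKET_REUSEADDR": "1",
                        "EXAMPLE_SOCKET_LINGER": "0"})

# Read the options back to confirm what was applied.
reuse_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
l_onoff, l_linger = struct.unpack(
    "ii",
    sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize("ii")))
sock.close()
```

Converting the linger value to `int` as soon as it is read, as the second bullet of the commit describes, keeps every later use of it unambiguous.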
Nusrat Islam cf907dbf61 Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration during the deRegister call, don't fall back to rccl deRegister, which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: e9b6bbca8a]
2025-01-14 08:20:24 -06:00
dependabot[bot] 43776cfcc8 Bump rocm-docs-core from 1.12.0 to 1.13.0 in /docs/sphinx (#1482)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 3fee623d5a]
2025-01-10 17:07:09 -07:00
Dingming Wu b6ae2fd71d improving kernel traces on opCount bits and adding channelId in ncclCollTrace (#1485)
[ROCm/rccl commit: 69d0134ed2]
2025-01-10 07:57:46 -08:00
Nilesh M Negi b9e7e3024b [MSCCLPP] IBVerbs: Check if IBV_ACCESS_RELAXED_ORDERING exists (#1483)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: f0eae84663]
2025-01-08 08:38:51 -06:00
Luna 28c80c9b4e net_ib: fix out of bounds read in ncclIbGdrSupport on non-RDMA kernel (#1470)
Fixes #1469

[ROCm/rccl commit: b24580e3d4]
2025-01-07 16:49:24 -08:00
Jeffrey Novotny ed70fc066a Update license file for 2025 (#1480)
[ROCm/rccl commit: 3fefd31b07]
2025-01-07 14:47:55 -05:00
JhaShweta1 35602f31a1 Merge pull request #1479 from JhaShweta1/patch-1
Update CODEOWNERS: Add a new name

[ROCm/rccl commit: 4ee2ed6e23]
2025-01-07 10:45:21 -06:00
qiwei_ji 2b9394f08a Check nvlink_node instead of xgmi_node in xml.cc (#1407)
It seems the code here is meant to check xgmi_node instead. If it checks the node for "nvlink", it verifies the link_info every time.
If it checks the node for "xgmi" and gets a positive answer, it does not need to check the vsmi topo interface.

[ROCm/rccl commit: f2ee8d9132]
2025-01-06 17:09:27 -08:00
Xeonacid f3e883805c Define wc_store_fence for riscv (#1475)
[ROCm/rccl commit: c6c7b6db98]
2025-01-06 16:59:14 -08:00
dependabot[bot] e61194239d Bump jinja2 from 3.1.4 to 3.1.5 in /docs/sphinx (#1473)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.4...3.1.5)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: e77c3c2975]
2025-01-06 17:09:32 -07:00
corey-derochie-amd 17becdb7f8 [SWDEV-497665] Blocked cudaMemcpyAsync race condition by synchronizing (#1447)
* Switched calls to `cudaMemcpyAsync` to be `cudaMemcpy` in `ncclTransportP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.

* Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.

[ROCm/rccl commit: c158d3a9b4]
2025-01-03 13:06:47 -07:00
JhaShweta1 d9d60404fd Update CODEOWNERS
Added a new user: Shweta Jha

[ROCm/rccl commit: f60fac76e6]
2025-01-03 11:47:22 -06:00
Mustafa Abduljabbar dd3fc22531 Add macro for channel mask offset (#1467)
[ROCm/rccl commit: a9d6e7661c]
2025-01-03 08:41:41 -05:00
mberenjk 300f954185 Initializing all ranks to the same value to avoid failure of UT AllR… (#1459)
* Initializing all ranks to the same value to avoid failure of UT AllReduce for FP8 type

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 39483c55f8]
2025-01-02 11:39:02 -06:00
Nilesh M Negi f1bada26ef [BUILD] Fix ASAN build if GPU targets has xnack+ (#1474)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: fd03b5b6a5]
2024-12-26 12:13:36 -06:00
dependabot[bot] beece84733 Bump rocm-docs-core from 1.9.2 to 1.12.0 in /docs/sphinx (#1466)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 648a58dd27]
2024-12-18 10:04:07 -07:00
Jeffrey Novotny 7c220660ba Change kernel reference to use new terminology (#1462)
[ROCm/rccl commit: 2934bf6fc6]
2024-12-16 13:34:18 -05:00
Mustafa Abduljabbar 4588da1888 Remove unneeded highestTransportType (#1461)
[ROCm/rccl commit: e6b179d627]
2024-12-16 13:28:47 -05:00
Ziyue Yang 987dfc3b5d Fix MSCCL algorithm loading order (#1460)
[ROCm/rccl commit: 83c5eb7378]
2024-12-16 07:41:17 -08:00
akolliasAMD c65d4ab18f changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS (#1440)
[ROCm/rccl commit: 45c1c1a781]
2024-12-12 12:09:30 -07:00
Shilei Tian 791f45733a Improve the handling of CMake deduplication (#1450)
Certain CMake functions deduplicate arguments by default. For example, if we
have two `target_link_options` with both `-Xoffload-linker -opt-A` and then
`-Xoffload-linker -opt-B`, the final link command would be `-Xoffload-linker
-opt-A -opt-B`, which is not what we want.

[ROCm/rccl commit: 7386fac64a]
2024-12-11 13:48:18 -08:00
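
The deduplication effect described in this commit can be reproduced with a small simulation. The sketch below is Python, not CMake, and `dedup_tokens` is a hypothetical helper that only mimics "keep the first occurrence of each argument". It is not CMake's implementation, and it does not show how the commit itself resolves the problem.

```python
def dedup_tokens(tokens):
    """Keep the first occurrence of every token, dropping later repeats."""
    seen = set()
    result = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result


# Two separate option pairs, flattened into one token list.
flat = ["-Xoffload-linker", "-opt-A", "-Xoffload-linker", "-opt-B"]
broken = dedup_tokens(flat)
# The second "-Xoffload-linker" is dropped, so "-opt-B" loses its prefix.

# Treating each pair as a single unit keeps both prefixes intact.
grouped = ["-Xoffload-linker -opt-A", "-Xoffload-linker -opt-B"]
intact = " ".join(dedup_tokens(grouped)).split()
```

CMake documents a `SHELL:` prefix for grouping options in this way. Whether this commit relies on it is not stated in the message above.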
Shilei Tian 699980206e Check -parallel-jobs before use (#1451)
`-parallel-jobs` is not always available, such as upstream LLVM.

[ROCm/rccl commit: 8e9fcf111a]
2024-12-11 11:40:49 -06:00
Hujingbo bea86f6248 increase p2p channels for Intel platform (#1448)
Co-authored-by: hujingbo <hujingbo@kuaishou.com>

[ROCm/rccl commit: ad4c36dc34]
2024-12-10 07:33:37 -08:00
Jeffrey Novotny d7498b88a5 Refactor how to docs and formatting fixes (#1444)
[ROCm/rccl commit: 9aa5b9f02e]
2024-12-10 08:47:24 -05:00
Jeffrey Novotny 531476dacf Add RCCL debugging guide (#1420)
* Add RCCL debugging guide

* Changes from external review

* More edits from internal review

* Additional edits

* Minor correction

* More changes after external review

* Integrate index and ToC changes with incoming merge changes

* Integrate feedback from management review

* Minor edits from the internal review

[ROCm/rccl commit: 6d34fb7632]
2024-12-06 13:25:58 -05:00
Nusrat Islam b19a809788 ext-src: tune TP=8 case on MI308 CPX mode (#1446)
Tune the number of blocks for hierarchical mscclpp allreduce.

[ROCm/rccl commit: 42b6831a39]
2024-12-06 08:16:39 -06:00
Benjamin Kitor fe806d5427 Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P

[ROCm/rccl commit: a05329bd0d]
2024-12-03 13:12:03 -08:00
Jeffrey Novotny 0f4558ea59 Modify cmake instruction in build from source (#1445)
[ROCm/rccl commit: 28594b26b3]
2024-12-03 11:26:02 -05:00
dependabot[bot] 43ff8386a8 Bump rocm-docs-core from 1.8.3 to 1.9.2 in /docs/sphinx (#1441)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.8.3 to 1.9.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.8.3...v1.9.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 1f789d6836]
2024-11-29 15:21:43 -07:00
Jeffrey Novotny 1d1e17b3c9 Refactor RCCL install guide into several pages (#1427)
* Refactor RCCL install guide into several pages

* Changes from code review and new docker guide

* Add missing entries to ToC

* Minor fixes

* Fix help strings

* Edits after review and remove extra white space

[ROCm/rccl commit: bf7c130631]
2024-11-27 15:34:26 -05:00
Jeffrey Novotny cc9209f770 Update rccl changelog for 6.3.1 (#1433)
* Update rccl changelog for 6.3.1

* Fix version number

* Correct RCCL release version

* Added details to 6.3.0 changelog

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: e42f10a361]
2024-11-26 08:46:37 -05:00
gilbertlee-amd be0865f335 Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal (#1431)
* Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal

[ROCm/rccl commit: 000575867c]
2024-11-25 11:24:54 -07:00
Bertan Dogancay df51937f09 Fix typo in ncclGetKernelIndex macro (#1424)
[ROCm/rccl commit: dfe4a3ed81]
2024-11-18 10:40:05 -05:00
corey-derochie-amd 6d61a4e21f Added latest users to CODEOWNERS. (#1422)
[ROCm/rccl commit: 4336a0f3a3]
2024-11-14 16:55:18 -07:00
Bertan Dogancay 9041572e23 Template generic kernel for unroll factor (#1419)
* Template generic kernel for unroll factor

[ROCm/rccl commit: cb175fb0b3]
2024-11-12 18:27:29 -05:00
Jeffrey Novotny 9898395fbe Refactor landing page and move some info to What is RCCL (#1415)
[ROCm/rccl commit: 2d07f18696]
2024-11-12 13:15:27 -05:00
akolliasAMD 891689aab9 removing unused gfx targets (#1411)
[ROCm/rccl commit: 2284101624]
2024-11-06 08:50:08 -07:00
darren-amd 539cc81748 Merge pull request #1406 from ROCm/darren-amd-remove-computeColl-declaration
remove undefined computeColl declaration

[ROCm/rccl commit: 52d5f4cde2]
2024-11-06 10:43:35 -05:00
gilbertlee-amd 71809e8bc0 Updating RCCL Replayer README (#1408)
[ROCm/rccl commit: cb1027de97]
2024-11-05 08:06:11 -07:00
darren-amd 61258b9a9f remove undefined computeColl declaration
[ROCm/rccl commit: ebf0417e90]
2024-11-04 13:42:01 -05:00
saurabhAMD 69d976532b GPU allocation for CPX Unit Tests using PCI bus id (#1403)
* mapping devices wrt pci

* Gpu allocation by using pci mapping

* Passing gpuPriorityOrder in as an argument rather than making the functions non-static.

* Removing redundant testBed instance calling

[ROCm/rccl commit: 69b2b712ab]
2024-11-04 10:51:00 -06:00
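
One way to picture "GPU allocation by using pci mapping" is to order device indices by their PCI bus IDs. The following Python sketch assumes an ascending sort and uses made-up bus IDs. The helper names `parse_pci_bus_id` and `gpu_priority_order` are hypothetical, and the actual unit-test code may order devices differently.

```python
def parse_pci_bus_id(bus_id):
    """Split 'domain:bus:device.function' (hex fields) into a sortable tuple."""
    domain, bus, dev_fn = bus_id.split(":")
    device, function = dev_fn.split(".")
    return (int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


def gpu_priority_order(bus_ids):
    """Return device indices ordered by ascending PCI bus ID (an assumption)."""
    return sorted(range(len(bus_ids)),
                  key=lambda i: parse_pci_bus_id(bus_ids[i]))


# Made-up bus IDs for four devices, listed in enumeration order.
example_bus_ids = ["0000:c1:00.0", "0000:0c:00.0", "0000:83:00.0", "0000:22:00.0"]
order = gpu_priority_order(example_bus_ids)
```

Passing the resulting order into the functions that need it, rather than exposing them globally, matches the third bullet of the commit.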
corey-derochie-amd ad1384bea1 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.

[ROCm/rccl commit: 1c45962273]
2024-11-04 09:46:42 -07:00
Abhishek Kulkarni 595cda2ab9 GDR enablement logic fix for kernel 6.4.0+ (#1378)
[ROCm/rccl commit: 6178556853]
2024-11-03 01:20:07 -05:00
Bertan Dogancay 251df02d42 Increase MAX_STACK_SIZE for UT (#1398)
[ROCm/rccl commit: 984f1e4343]
2024-11-01 13:07:45 -04:00
corey-derochie-amd 8444e5fe7f Set minimum ROCm version for MSCCLPP to 6.2 (#1401)
* Added ROCm version check around setting `ENABLE_MSCCLPP` flag.

[ROCm/rccl commit: 6db2644766]
2024-10-30 16:48:54 -06:00
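
The kind of version gate this commit describes can be sketched as a tuple comparison. The real check lives in the project's build configuration. The Python below is only an illustration, and `parse_version` and `mscclpp_allowed` are hypothetical names. The 6.2 minimum is the only value taken from the commit title.

```python
MIN_ROCM_FOR_MSCCLPP = (6, 2)  # minimum stated in the commit title


def parse_version(text):
    """Turn '6.2.1' into (6, 2, 1); non-numeric suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def mscclpp_allowed(rocm_version):
    """True when the major.minor part of the version meets the minimum."""
    return parse_version(rocm_version)[:2] >= MIN_ROCM_FOR_MSCCLPP
```

Comparing integer tuples avoids the string-comparison pitfall where "6.10" would sort before "6.2".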
Avinash da3887bafb Memory leak fixes in hostside functions (#1388)
memory leak fixes for parseRome4P2H and ncclTopoAddGPU

[ROCm/rccl commit: d6006f0425]
2024-10-30 14:25:56 -05:00