qiwei_ji
2b9394f08a
Check nvlink_node instead of xgmi_node in xml.cc ( #1407 )
...
It seems the code here is meant to check xgmi_node instead. If the node is checked for "nvlink", link_info is verified every time.
If the node is checked for "xgmi", a positive result means the vsmi topo interface does not need to be checked.
[ROCm/rccl commit: f2ee8d9132 ]
2025-01-06 17:09:27 -08:00
Xeonacid
f3e883805c
Define wc_store_fence for riscv ( #1475 )
...
[ROCm/rccl commit: c6c7b6db98 ]
2025-01-06 16:59:14 -08:00
dependabot[bot]
e61194239d
Bump jinja2 from 3.1.4 to 3.1.5 in /docs/sphinx ( #1473 )
...
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.4...3.1.5)
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: e77c3c2975 ]
2025-01-06 17:09:32 -07:00
corey-derochie-amd
17becdb7f8
[SWDEV-497665] Blocked cudaMemcpyAsync race condition by synchronizing ( #1447 )
...
* Switched calls to `cudaMemcpyAsync` to be `cudaMemcpy` in `ncclTransportP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.
* Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.
[ROCm/rccl commit: c158d3a9b4 ]
2025-01-03 13:06:47 -07:00
Mustafa Abduljabbar
dd3fc22531
Add macro for channel mask offset ( #1467 )
...
[ROCm/rccl commit: a9d6e7661c ]
2025-01-03 08:41:41 -05:00
mberenjk
300f954185
Initializing all ranks to the same value to avoid failure of UT AllR… ( #1459 )
...
* Initializing all ranks to the same value to avoid failure of UT AllReduce for FP8 type
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
[ROCm/rccl commit: 39483c55f8 ]
2025-01-02 11:39:02 -06:00
Nilesh M Negi
f1bada26ef
[BUILD] Fix ASAN build if GPU targets have xnack+ ( #1474 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: fd03b5b6a5 ]
2024-12-26 12:13:36 -06:00
dependabot[bot]
beece84733
Bump rocm-docs-core from 1.9.2 to 1.12.0 in /docs/sphinx ( #1466 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.12.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 648a58dd27 ]
2024-12-18 10:04:07 -07:00
Jeffrey Novotny
7c220660ba
Change kernel reference to use new terminology ( #1462 )
...
[ROCm/rccl commit: 2934bf6fc6 ]
2024-12-16 13:34:18 -05:00
Mustafa Abduljabbar
4588da1888
Remove unneeded highestTransportType ( #1461 )
...
[ROCm/rccl commit: e6b179d627 ]
2024-12-16 13:28:47 -05:00
Ziyue Yang
987dfc3b5d
Fix MSCCL algorithm loading order ( #1460 )
...
[ROCm/rccl commit: 83c5eb7378 ]
2024-12-16 07:41:17 -08:00
akolliasAMD
c65d4ab18f
changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS ( #1440 )
...
[ROCm/rccl commit: 45c1c1a781 ]
2024-12-12 12:09:30 -07:00
Shilei Tian
791f45733a
Improve the handling of CMake deduplication ( #1450 )
...
Certain CMake functions deduplicate arguments by default. For example, if two
`target_link_options` calls pass `-Xoffload-linker -opt-A` and
`-Xoffload-linker -opt-B`, the final link command contains
`-Xoffload-linker -opt-A -opt-B`, which is not what we want.
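The effect described above can be reproduced with a small model. This is a sketch only, not CMake's implementation: `dedupTokens`, `dedupGroups`, and `join` are hypothetical helpers. The first mimics de-duplication of individual tokens; the second treats each option pair as one unit, which is the behavior CMake documents for options carrying the `SHELL:` prefix. Whether this commit relies on that prefix is not stated here.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical model of token-level de-duplication: each distinct string is
// kept only the first time it appears, preserving order.
std::vector<std::string> dedupTokens(const std::vector<std::string>& tokens) {
  std::unordered_set<std::string> seen;
  std::vector<std::string> out;
  for (const auto& t : tokens) {
    if (seen.insert(t).second) out.push_back(t);
  }
  return out;
}

// Hypothetical model of group-level de-duplication: the same algorithm, but
// each element is a whole group such as "-Xoffload-linker -opt-A", so the
// repeated prefix inside different groups survives.
std::vector<std::string> dedupGroups(const std::vector<std::string>& groups) {
  return dedupTokens(groups);
}

// Joins items with single spaces to form a command-line fragment.
std::string join(const std::vector<std::string>& items) {
  std::string s;
  for (const auto& item : items) {
    if (!s.empty()) s += ' ';
    s += item;
  }
  return s;
}
```

With token-level de-duplication the second `-Xoffload-linker` disappears, so `-opt-B` is no longer forwarded to the offload linker; with group-level de-duplication both pairs stay intact.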
[ROCm/rccl commit: 7386fac64a ]
2024-12-11 13:48:18 -08:00
Shilei Tian
699980206e
Check -parallel-jobs before use ( #1451 )
...
`-parallel-jobs` is not always available, such as upstream LLVM.
[ROCm/rccl commit: 8e9fcf111a ]
2024-12-11 11:40:49 -06:00
Hujingbo
bea86f6248
increase p2p channels for Intel platform ( #1448 )
...
Co-authored-by: hujingbo <hujingbo@kuaishou.com>
[ROCm/rccl commit: ad4c36dc34 ]
2024-12-10 07:33:37 -08:00
Jeffrey Novotny
d7498b88a5
Refactor how to docs and formatting fixes ( #1444 )
...
[ROCm/rccl commit: 9aa5b9f02e ]
2024-12-10 08:47:24 -05:00
Jeffrey Novotny
531476dacf
Add RCCL debugging guide ( #1420 )
...
* Add RCCL debugging guide
* Changes from external review
* More edits from internal review
* Additional edits
* Minor correction
* More changes after external review
* Integrate index and ToC changes with incoming merge changes
* Integrate feedback from management review
* Minor edits from the internal review
[ROCm/rccl commit: 6d34fb7632 ]
2024-12-06 13:25:58 -05:00
Nusrat Islam
b19a809788
ext-src: tune TP=8 case on MI308 CPX mode ( #1446 )
...
Tune the number of blocks for hierarchical mscclpp allreduce.
[ROCm/rccl commit: 42b6831a39 ]
2024-12-06 08:16:39 -06:00
Benjamin Kitor
fe806d5427
Add Topologies for 16-GPU gfx942 SuperNode ( #1417 )
...
* Add Topologies for 16-GPU gfx942 SuperNode
- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
* Fix bug w/ 1H16P
[ROCm/rccl commit: a05329bd0d ]
2024-12-03 13:12:03 -08:00
Jeffrey Novotny
0f4558ea59
Modify cmake instruction in build from source ( #1445 )
...
[ROCm/rccl commit: 28594b26b3 ]
2024-12-03 11:26:02 -05:00
dependabot[bot]
43ff8386a8
Bump rocm-docs-core from 1.8.3 to 1.9.2 in /docs/sphinx ( #1441 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.8.3 to 1.9.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.8.3...v1.9.2)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 1f789d6836 ]
2024-11-29 15:21:43 -07:00
Jeffrey Novotny
1d1e17b3c9
Refactor RCCL install guide into several pages ( #1427 )
...
* Refactor RCCL install guide into several pages
* Changes from code review and new docker guide
* Add missing entries to ToC
* Minor fixes
* Fix help strings
* Edits after review and remove extra white space
[ROCm/rccl commit: bf7c130631 ]
2024-11-27 15:34:26 -05:00
Jeffrey Novotny
cc9209f770
Update rccl changelog for 6.3.1 ( #1433 )
...
* Update rccl changelog for 6.3.1
* Fix version number
* Correct RCCL release version
* Added details to 6.3.0 changelog
---------
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
[ROCm/rccl commit: e42f10a361 ]
2024-11-26 08:46:37 -05:00
gilbertlee-amd
be0865f335
Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal ( #1431 )
...
* Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal
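A switch like this is usually read from the process environment. The following is a minimal sketch, assuming that any nonzero integer value turns the switch on; `envFlagEnabled` is a hypothetical helper and does not reflect RCCL's actual parameter-handling code or the exact values it accepts.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: returns true when the named environment variable is
// set to a nonzero integer. Unset, empty, or non-numeric values count as off.
bool envFlagEnabled(const char* name) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return false;
  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  if (end == raw) return false;  // no digits were parsed
  return value != 0;
}
```

Under that assumption, launching an application with `RCCL_MODEL_REVERSAL_DISABLE=1` in its environment would make `envFlagEnabled("RCCL_MODEL_REVERSAL_DISABLE")` return true.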
[ROCm/rccl commit: 000575867c ]
2024-11-25 11:24:54 -07:00
Bertan Dogancay
df51937f09
Fix typo in ncclGetKernelIndex macro ( #1424 )
...
[ROCm/rccl commit: dfe4a3ed81 ]
2024-11-18 10:40:05 -05:00
corey-derochie-amd
6d61a4e21f
Added latest users to CODEOWNERS. ( #1422 )
...
[ROCm/rccl commit: 4336a0f3a3 ]
2024-11-14 16:55:18 -07:00
Bertan Dogancay
9041572e23
Template generic kernel for unroll factor ( #1419 )
...
* Template generic kernel for unroll factor
[ROCm/rccl commit: cb175fb0b3 ]
2024-11-12 18:27:29 -05:00
Jeffrey Novotny
9898395fbe
Refactor landing page and move some info to What is RCCL ( #1415 )
...
[ROCm/rccl commit: 2d07f18696 ]
2024-11-12 13:15:27 -05:00
akolliasAMD
891689aab9
removing unused gfx targets ( #1411 )
...
[ROCm/rccl commit: 2284101624 ]
2024-11-06 08:50:08 -07:00
darren-amd
539cc81748
Merge pull request #1406 from ROCm/darren-amd-remove-computeColl-declaration
...
remove undefined computeColl declaration
[ROCm/rccl commit: 52d5f4cde2 ]
2024-11-06 10:43:35 -05:00
gilbertlee-amd
71809e8bc0
Updating RCCL Replayer README ( #1408 )
...
[ROCm/rccl commit: cb1027de97 ]
2024-11-05 08:06:11 -07:00
darren-amd
61258b9a9f
remove undefined computeColl declaration
...
[ROCm/rccl commit: ebf0417e90 ]
2024-11-04 13:42:01 -05:00
saurabhAMD
69d976532b
GPU allocation for CPX Unit Tests using PCI bus id ( #1403 )
...
* Mapping devices with respect to PCI bus ID
* GPU allocation using the PCI mapping
* Passing gpuPriorityOrder in as an argument rather than making the functions non-static.
* Removing redundant testBed instance calling
[ROCm/rccl commit: 69b2b712ab ]
2024-11-04 10:51:00 -06:00
corey-derochie-amd
ad1384bea1
Hide or fix all build warnings ( #1331 )
...
* Changing C-strings to be const.
* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.
* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.
* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.
* Fixed VLA in rccl UT.
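The VLA and `#define` items in the list above follow a common pattern. A minimal sketch with hypothetical function names (`sumOfSquares`, `roundUpToChunk`), not taken from the RCCL sources:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Before (compiler extension, warned about under e.g. -Wvla):
//   int buf[n];
// After: std::vector takes its size at run time and is standard C++.
int sumOfSquares(int n) {
  assert(n >= 0);
  std::vector<int> buf(n);
  for (int i = 0; i < n; ++i) buf[i] = i * i;
  return std::accumulate(buf.begin(), buf.end(), 0);
}

// Before:
//   #define CHUNK 4   // stays defined after the function, may be redefined
// After: a constexpr int is typed and scoped to the function body.
int roundUpToChunk(int value) {
  constexpr int kChunk = 4;
  return (value + kChunk - 1) / kChunk * kChunk;
}
```

Because `kChunk` lives only inside `roundUpToChunk`, another function can declare its own constant with the same name without triggering a redefinition warning.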
[ROCm/rccl commit: 1c45962273 ]
2024-11-04 09:46:42 -07:00
Abhishek Kulkarni
595cda2ab9
GDR enablement logic fix for kernel 6.4.0+ ( #1378 )
...
[ROCm/rccl commit: 6178556853 ]
2024-11-03 01:20:07 -05:00
Bertan Dogancay
251df02d42
Increase MAX_STACK_SIZE for UT ( #1398 )
...
[ROCm/rccl commit: 984f1e4343 ]
2024-11-01 13:07:45 -04:00
corey-derochie-amd
8444e5fe7f
Set minimum ROCm version for MSCCLPP to 6.2 ( #1401 )
...
* Added ROCm version check around setting `ENABLE_MSCCLPP` flag.
[ROCm/rccl commit: 6db2644766 ]
2024-10-30 16:48:54 -06:00
Avinash
da3887bafb
Memory leak fixes in hostside functions ( #1388 )
...
memory leak fixes for parseRome4P2H and ncclTopoAddGPU
[ROCm/rccl commit: d6006f0425 ]
2024-10-30 14:25:56 -05:00
Tim
e346e19065
Adjustment for UT Sendrecv ( #1400 )
...
Enabled UT sendrecv to the same rank and refactored the UBR call
[ROCm/rccl commit: fd9924cfe7 ]
2024-10-30 15:13:53 -04:00
Nusrat Islam
e1c20e7f24
ext-src: Improved allreduce performance in cpx mode for MI308 ( #1393 )
...
To get the improved performance for TP=4, the user needs to use
RCCL_MSCCL_FORCE_ENABLE=1 and MSCCLPP_READ_ALLRED=1. For TP=8, the
user should use MSCCLPP_HIERARCHICAL_ALLRED=1.
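The settings named above can be collected per TP size. A minimal sketch: `mscclppEnvForTp` is a hypothetical helper, and only the variable names and values come from this commit message.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: returns the environment variables this commit message
// recommends for a given tensor-parallel (TP) size. Other sizes get no
// recommendation here, so the map is empty.
std::map<std::string, std::string> mscclppEnvForTp(int tp) {
  std::map<std::string, std::string> env;
  if (tp == 4) {
    env["RCCL_MSCCL_FORCE_ENABLE"] = "1";
    env["MSCCLPP_READ_ALLRED"] = "1";
  } else if (tp == 8) {
    env["MSCCLPP_HIERARCHICAL_ALLRED"] = "1";
  }
  return env;
}
```

A launcher could iterate over the returned map and export each pair before starting the workload.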
[ROCm/rccl commit: 0fb3b5eba9 ]
2024-10-30 08:30:15 -05:00
corey-derochie-amd
af1e36a7ee
Remove MSCCL switch case fall-through by adding break statement. ( #1342 )
...
[ROCm/rccl commit: ea20af698e ]
2024-10-29 15:47:59 -06:00
corey-derochie-amd
f9d38d8858
6.2 final documentation fixes updated for 6.3 ( #1252 ) ( #1399 )
...
* Update CHANGELOG.md
* Update NOTICES.txt
* [DOCS] Note on using fewer than 8 MI300 GPUs
* Update README.md
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 8ac63e7e70 ]
2024-10-29 15:23:45 -06:00
gilbertlee-amd
02bf3a3bf8
Adding support for odd nodes for model_87 ( #1309 )
...
[ROCm/rccl commit: 0cbce2a757 ]
2024-10-24 08:38:12 -06:00
corey-derochie-amd
1c700083b2
Update CHANGELOG to match release branches 6.2 and 6.3 ( #1391 )
...
* [CHANGELOG] Add Known issues for ROCm 6.2.1
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Updated 6.2.1 known issues to match the content in develop.
* Updated CHANGELOG for ROCm 6.3 release. (#1380)
* Updated CHANGELOG for ROCm 6.3 release.
* Update CHANGELOG to new format.
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
[ROCm/rccl commit: 6ed513e1b9 ]
2024-10-23 13:49:40 -06:00
Arm Patinyasakdikul
928414ac06
Increased maximum number of XML nodes to support CPX mode. ( #1386 )
...
[ROCm/rccl commit: 29f87c7191 ]
2024-10-23 11:15:11 -05:00
Wenkai Du
075381ee2e
Fix topology discovery in container with subset of GPUs ( #1384 )
...
* Fix topology discovery in container with subset of GPUs
* Move links counting out of loop
[ROCm/rccl commit: e0780ba4d4 ]
2024-10-22 13:50:23 -07:00
Bertan Dogancay
fcb0b2da3f
[Replayer] Add validation ( #1387 )
...
* Add validation to rccl_replayer
[ROCm/rccl commit: cfecce790f ]
2024-10-22 10:41:08 -04:00
dependabot[bot]
64aead445c
Bump rocm-docs-core from 1.8.2 to 1.8.3 in /docs/sphinx ( #1385 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.8.2 to 1.8.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.8.2...v1.8.3)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 4685d3c546 ]
2024-10-21 10:05:58 -06:00
Bertan Dogancay
57710c1183
Dynamically select unroll factor to build for when targeting local arch ( #1371 )
...
* Dynamically select unroll factor to build for when targeting local arch only
[ROCm/rccl commit: 373f113524 ]
2024-10-21 10:53:11 -04:00
Wenkai Du
5ee84e0353
Increase CQ size to 3*MAX_REQUESTS ( #1374 )
...
* Increase CQ size to 3*MAX_REQUESTS
Suggested by Rukhsana Ansari <rukhsana.ansari@broadcom.com>
* Reword comments based on feedback from Rukhsana
[ROCm/rccl commit: 7c077db307 ]
2024-10-18 11:01:03 -07:00