Commit Graph

1612 commits

Author SHA1 Message Date
Jeffrey Novotny 134f736882 Fix broken link to install instructions (#1515) 2025-02-03 10:14:40 -05:00
Mustafa Abduljabbar dc75209dd7 Add IB verbs logging and enable traces through install.sh (#1511)
* Add IB Verbs logging

* Simplify tracing and undo debug.h changes

* Update debug.h

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Exchange remote comm device index
2025-01-31 12:35:39 -05:00
Wenkai Du caba0bc049 Add HDP flush for gfx940 (#1434)
* Fix collective trace

* Use nontemporal for st_global

* Fix previous commit

* Add HDP flush to data receive path

* Fix previous commit

* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH

* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH

Both are on by default. Turning both off skips all flushes and will likely
result in data errors.

* Enable GDR copy by default

* Remove GDR flush env var because it is disabled by GDC flush

* Output kernel collective trace at comm destroy by default

* Limit kernel timeout messages to 100

* Use system relaxed atomic for loadInt

* Refine timeout messages and use atomic for setting offset from CPU

* Add kernel trace for barrier timeout

* Add backup barrier to avoid race in atomicAdd

* Use different counters for different warps

* Rework barrier implementation

* Fix for other GFX

* Use __hip_atomic_store and __hip_atomic_load

* Fix bug in previous commit

* Don't reset barrier values in running kernel

* Update trace format

* Fix typo

* Switch back to hip_atomic_fetch_add

* Use same barrier implementation for all GFX

* Remove extra threadfence

* Turn off HDP flush by default

Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush

* Remove unnecessary changes from alternative barrier implementation

* Added back __threadfence_block

* Revert back to threadfence for gfx other than gfx94x
2025-01-31 07:51:10 -08:00
dependabot[bot] ad8012f2fc Bump rocm-docs-core from 1.14.1 to 1.15.0 in /docs/sphinx (#1514)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.14.1 to 1.15.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.14.1...v1.15.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-30 17:15:17 -07:00
Bertan Dogancay ecf31da14f Add ncclDataType_t as type to ROCTX (#1512) 2025-01-30 13:46:48 -05:00
Arm Patinyasakdikul 6b2b87c9f8 Make proxy dump print out meaningful information. (#1504)
* Make proxy dump print out meaningful information.

fixed: HPEXA-63

* printout raw data instead.
2025-01-29 16:48:49 -06:00
Bertan Dogancay 35fe9e06f3 [Profiler] Enable ROCTX during build by default (#1506)
* Enable ROCTX during build by default

* Check for roctx support in cmake
2025-01-29 11:29:46 -05:00
corey-derochie-amd bd0f5cccbe Disabled MSCCL++ feature except when building on Ubuntu or CentOS host systems (#1505)
* Added condition for MSCCL++ to only build on an Ubuntu host system.

* Added CentOS to the supported OS list
2025-01-29 08:54:09 -07:00
Nusrat Islam 7ac82248de Tune allreduce performance in CPX mode (single OAM) (#1508) 2025-01-29 08:58:48 -06:00
dependabot[bot] f84625a1cc Bump rocm-docs-core from 1.13.0 to 1.14.1 in /docs/sphinx (#1496)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.13.0 to 1.14.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.13.0...v1.14.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-23 17:10:36 -07:00
Bertan Dogancay dd185f26d2 Fix ROCTX call for MSCCL (#1502) 2025-01-23 16:00:07 -07:00
Bertan Dogancay 27b3921ab0 Merge pull request #1426 from BertanDogancay/nccl-2.22-sync
[SYNC] 2.22.3-1
2025-01-23 13:14:05 -05:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
corey-derochie-amd b6377e0b8c Changed working dir for the submodule command and extended it to the json repo (#1495)
This allows it to work when the sub repos don't exist.
2025-01-23 09:34:25 -07:00
corey-derochie-amd f77308a2fe Removing duplicate definitions of INC_COLL_TRACE and traceData macros (#1500)
They are nearly identical, except the common.h definition sets `collTrace->channelId`.
2025-01-22 16:50:27 -07:00
Bertan Dogancay 5afe900efd Only look for librccl .co files in StackSize test (#1499)
Co-authored-by: BertanDogancay <bertan.dogancay>
2025-01-22 16:48:10 -07:00
isaki001 ff130cce7a fix scatter_perf crash (#1493)
* fix scatter_perf crash

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* More buffsRegisteredNonGraphMode spelling fixes.

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-21 09:24:32 -06:00
isaki001 d89432e8c8 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package
2025-01-20 08:06:43 -06:00
corey-derochie-amd c68b558ed5 Increased gfx90a stack size expectation to 320 to match latest compiler. (#1487) 2025-01-16 17:04:51 -07:00
amd-garydeng bca6f1620f use rocjenkins xml and change name (#1489) 2025-01-16 16:49:06 -07:00
corey-derochie-amd 2e35417fe5 Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options (#1418)
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.

* Casted the linger value to `int` sooner to avoid a scope of unknown typed-ness.

* Added CHANGELOG entry for this feature.
2025-01-14 10:26:04 -07:00
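
As a rough illustration of the two socket options named in commit 2e35417fe5 above, the sketch below sets SO_REUSEADDR and SO_LINGER on a TCP socket from Python. It is not RCCL's code: the function name `make_bootstrap_socket` and its parameters are hypothetical, the commit's actual environment parameter names are not shown here, and the `struct linger` layout of two C ints is an assumption that holds on Linux.

```python
import socket
import struct
from typing import Optional


def make_bootstrap_socket(reuse_addr: bool, linger_seconds: Optional[int]) -> socket.socket:
    """Create a TCP socket with optional SO_REUSEADDR / SO_LINGER settings.

    Hypothetical helper for illustration only; RCCL configures its sockets in C++.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_addr:
        # Allow rebinding to an address that is still in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if linger_seconds is not None:
        # struct linger { int l_onoff; int l_linger; } (Linux layout assumed).
        # l_onoff=1 with l_linger=0 makes close() drop the connection at once
        # instead of leaving it in TIME_WAIT.
        sock.setsockopt(
            socket.SOL_SOCKET,
            socket.SO_LINGER,
            struct.pack("ii", 1, linger_seconds),
        )
    return sock


sock = make_bootstrap_socket(reuse_addr=True, linger_seconds=0)
onoff, seconds = struct.unpack("ii", sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
print("linger enabled:", bool(onoff), "timeout:", seconds)
sock.close()
```

Sockets that skip TIME_WAIT release their descriptors and ports sooner, which is one plausible reading of how these options affect resource usage during bootstrapping.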
Nusrat Islam e9b6bbca8a Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration, during the deRegister call, don't fall back to rccl deRegister, which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-14 08:20:24 -06:00
dependabot[bot] 3fee623d5a Bump rocm-docs-core from 1.12.0 to 1.13.0 in /docs/sphinx (#1482)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-10 17:07:09 -07:00
Dingming Wu 69d0134ed2 improving kernel traces on opCount bits and adding channelId in ncclCollTrace (#1485) 2025-01-10 07:57:46 -08:00
Nilesh M Negi f0eae84663 [MSCCLPP] IBVerbs: Check if IBV_ACCESS_RELAXED_ORDERING exists (#1483)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-01-08 08:38:51 -06:00
Luna b24580e3d4 net_ib: fix out of bounds read in ncclIbGdrSupport on non-RDMA kernel (#1470)
Fixes #1469
2025-01-07 16:49:24 -08:00
Jeffrey Novotny 3fefd31b07 Update license file for 2025 (#1480) 2025-01-07 14:47:55 -05:00
JhaShweta1 4ee2ed6e23 Merge pull request #1479 from JhaShweta1/patch-1
Update CODEOWNERS: Add a new name
2025-01-07 10:45:21 -06:00
qiwei_ji f2ee8d9132 Check nvlink_node instead of xgmi_node in xml.cc (#1407)
It seems the code here is meant to check xgmi_node instead. If it checks the node for "nvlink", it verifies the link_info every time.
If it checks the node for "xgmi" and the answer is yes, it does not need to check the vsmi topo interface.
2025-01-06 17:09:27 -08:00
Xeonacid c6c7b6db98 Define wc_store_fence for riscv (#1475) 2025-01-06 16:59:14 -08:00
dependabot[bot] e77c3c2975 Bump jinja2 from 3.1.4 to 3.1.5 in /docs/sphinx (#1473)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.4...3.1.5)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-06 17:09:32 -07:00
corey-derochie-amd c158d3a9b4 [SWDEV-497665] Blocked cudaMemcpyAsync race condition by synchronizing (#1447)
* Switched calls to `cudaMemcpyAsync` to be `cudaMemcpy` in `ncclTransportP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.

* Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.
2025-01-03 13:06:47 -07:00
JhaShweta1 f60fac76e6 Update CODEOWNERS
Added a new user: Shweta Jha
2025-01-03 11:47:22 -06:00
Mustafa Abduljabbar a9d6e7661c Add macro for channel mask offset (#1467) 2025-01-03 08:41:41 -05:00
mberenjk 39483c55f8 Initializing all ranks to the same value to avoid failure of UT AllR… (#1459)
* Initializing all ranks to the same value to avoid failure of UT AllReduce for FP8 type

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-01-02 11:39:02 -06:00
Nilesh M Negi fd03b5b6a5 [BUILD] Fix ASAN build if GPU targets has xnack+ (#1474)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-12-26 12:13:36 -06:00
dependabot[bot] 648a58dd27 Bump rocm-docs-core from 1.9.2 to 1.12.0 in /docs/sphinx (#1466)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-18 10:04:07 -07:00
Jeffrey Novotny 2934bf6fc6 Change kernel reference to use new terminology (#1462) 2024-12-16 13:34:18 -05:00
Mustafa Abduljabbar e6b179d627 Remove unneeded highestTransportType (#1461) 2024-12-16 13:28:47 -05:00
Ziyue Yang 83c5eb7378 Fix MSCCL algorithm loading order (#1460) 2024-12-16 07:41:17 -08:00
akolliasAMD 45c1c1a781 changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS (#1440) 2024-12-12 12:09:30 -07:00
Shilei Tian 7386fac64a Improve the handling of CMake deduplication (#1450)
Certain CMake functions deduplicate arguments by default. For example, if we
have two `target_link_options` calls, one with `-Xoffload-linker -opt-A` and the
other with `-Xoffload-linker -opt-B`, the final link command would be
`-Xoffload-linker -opt-A -opt-B`, which is not what we want.
2024-12-11 13:48:18 -08:00
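
The deduplication problem described in commit 7386fac64a above can be reproduced with a small Python model. This is only a sketch of the behavior the message describes, not CMake's implementation; `dedupe_keep_first` is a hypothetical helper. It shows why removing repeated tokens one by one breaks flag/value pairs, and how treating each pair as a single unit avoids that (CMake documents a `SHELL:` prefix for grouping options this way; whether the commit uses it is not stated here).

```python
def dedupe_keep_first(items):
    """Drop repeated items, keeping the first occurrence of each (order preserved)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Token-by-token deduplication: the second "-Xoffload-linker" is dropped,
# so "-opt-B" is no longer forwarded to the offload linker.
tokens = ["-Xoffload-linker", "-opt-A", "-Xoffload-linker", "-opt-B"]
print(dedupe_keep_first(tokens))
# prints ['-Xoffload-linker', '-opt-A', '-opt-B']

# Pair-wise deduplication: each (flag, value) pair is one unit, so both survive.
pairs = [("-Xoffload-linker", "-opt-A"), ("-Xoffload-linker", "-opt-B")]
flat = [token for pair in dedupe_keep_first(pairs) for token in pair]
print(flat)
# prints ['-Xoffload-linker', '-opt-A', '-Xoffload-linker', '-opt-B']
```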
Shilei Tian 8e9fcf111a Check -parallel-jobs before use (#1451)
`-parallel-jobs` is not always available, such as upstream LLVM.
2024-12-11 11:40:49 -06:00
Hujingbo ad4c36dc34 increase p2p channels for Intel platform (#1448)
Co-authored-by: hujingbo <hujingbo@kuaishou.com>
2024-12-10 07:33:37 -08:00
Jeffrey Novotny 9aa5b9f02e Refactor how to docs and formatting fixes (#1444) 2024-12-10 08:47:24 -05:00
Jeffrey Novotny 6d34fb7632 Add RCCL debugging guide (#1420)
* Add RCCL debugging guide

* Changes from external review

* More edits from internal review

* Additional edits

* Minor correction

* More changes after external review

* Integrate index and ToC changes with incoming merge changes

* Integrate feedback from management review

* Minor edits from the internal review
2024-12-06 13:25:58 -05:00
Nusrat Islam 42b6831a39 ext-src: tune TP=8 case on MI308 CPX mode (#1446)
Tune the number of blocks for hierarchical mscclpp allreduce.
2024-12-06 08:16:39 -06:00
Benjamin Kitor a05329bd0d Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P
2024-12-03 13:12:03 -08:00
Jeffrey Novotny 28594b26b3 Modify cmake instruction in build from source (#1445) 2024-12-03 11:26:02 -05:00
dependabot[bot] 1f789d6836 Bump rocm-docs-core from 1.8.3 to 1.9.2 in /docs/sphinx (#1441)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.8.3 to 1.9.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.8.3...v1.9.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-29 15:21:43 -07:00