Commit Graph

1535 Commits

Author SHA1 Message Date
Wenkai Du 086fa823db NPKit: enable reduce scatter profiling (#1580)
[ROCm/rccl commit: f957c4fe22]
2025-03-04 10:03:56 -08:00
Nilesh M Negi 751370bb70 [BUILD] Enable multiple GPU targets in MSCCLPP (#1574)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 063c6cfc11]
2025-03-01 22:28:42 -06:00
dependabot[bot] 977d04cb9a Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx (#1558)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.15.0 to 1.17.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: a6b2ca224e]
2025-02-28 16:32:38 -07:00
Nusrat Islam f70f406463 misc/msccl: Read graph capture status for every collective call (#1576)
* misc/msccl: read graphCaptureStatus for every collective call

* fix a bug in checking whether UBR is enabled in MSCCLPP

* cmake: Fix patch reversal order

* misc/msccl: add logging

[ROCm/rccl commit: 23c0b7bd84]
2025-02-28 17:16:07 -06:00
Wenkai Du 3be905ca83 Improve RDMA flushing by writing dummy payload with RO=0 (#1570)
* Improve RDMA flushing by writing dummy payload with RO=0

* Rename env var for disabling this change to RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING

[ROCm/rccl commit: 60c1264d27]
2025-02-27 16:20:32 -08:00
Nilesh M Negi a3f44b4a10 [BUILD] Fix generate.py for gfx950 (#1575)
* [BUILD] Fix generate.py for gfx950

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

* [BUILD] Cleaner way to check for gfx targets

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 02b8e504f0]
2025-02-26 18:43:37 -06:00
Bertan Dogancay a9d09c6551 Use bit reversal based mapping for multi-node (#1572)
[ROCm/rccl commit: 85eb1f16bc]
2025-02-26 09:48:03 -05:00
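
The commit title above names a bit-reversal based mapping, but the log does not show how RCCL applies it to multi-node runs. The sketch below only illustrates the generic bit-reversal permutation. The function names `bit_reverse` and `bit_reversal_order` are hypothetical, and the power-of-two restriction is an assumption made for this example, not something taken from the RCCL change.

```python
def bit_reverse(value: int, num_bits: int) -> int:
    """Return `value` with its lowest `num_bits` bits in reversed order."""
    result = 0
    for _ in range(num_bits):
        # Shift the accumulated result left and append the lowest bit of value.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversal_order(count: int) -> list[int]:
    """Bit-reversal permutation of the indices 0..count-1.

    Assumption for this sketch: `count` is a positive power of two.
    """
    if count <= 0 or count & (count - 1):
        raise ValueError("count must be a positive power of two")
    num_bits = count.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(count)]


if __name__ == "__main__":
    print(bit_reversal_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

In this ordering, indices that are adjacent in the original sequence end up far apart, which is why bit reversal is a common way to spread neighbouring items across a range.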
Pedram Alizadeh acf5822a6c enable building rccl for gfx950 (#1571)
[ROCm/rccl commit: f268553ee4]
2025-02-25 16:13:48 -05:00
Nilesh M Negi c363118589 [BUILD] MSCCLPP: Fix OS check for CentOS (#1568)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <corey.derochie@amd.com>

[ROCm/rccl commit: daaa6e155f]
2025-02-25 13:03:04 -06:00
Nusrat Islam 3fbafef948 ext-src: tuning for allreduce8 kernel (#1560)
This PR tunes the number of threadblocks used for larger (>1MB)
message sizes.

[ROCm/rccl commit: fdf75fd2c1]
2025-02-21 19:34:38 -06:00
Nusrat Islam 4a5ab6cf75 ext-src: fix mscclpp allreduce for non-multiple of 128 message sizes (#1556)
[ROCm/rccl commit: 83f8b191ff]
2025-02-21 11:58:10 -06:00
gilbertlee-amd 4ca7e6873e Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES


[ROCm/rccl commit: ddc5d58b93]
2025-02-20 15:18:29 -07:00
Nilesh M Negi 5f3134db50 [TOOLS] Update rcclDiagnostics script (#1557)
* [TOOLS] Update rcclDiagnostics script

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Fix typo in valid_marketing_names list

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 159587be5c]
2025-02-20 16:11:05 -06:00
akolliasAMD ae0b3c19d2 reverted the syncLDS back to syncthreads (#1554)
[ROCm/rccl commit: aedbc95735]
2025-02-19 10:44:32 -07:00
Wenkai Du 7eff149ceb Insert barrier after loading work items to LDS (#1551)
[ROCm/rccl commit: baaa2ac64d]
2025-02-18 10:17:27 -08:00
Wenkai Du 35987b9170 Enable GDRCopy only on gfx94x (#1550)
* Enable GDRCopy only on gfx94x

* Use cudaFree instead of hipFree

* Add warning if failed to get device property

* Remove extra return

[ROCm/rccl commit: 32dc7ef47c]
2025-02-17 13:28:19 -08:00
Sohaib Nadeem f7602c30f8 Remove COMPILING_TARGETS from CMakeLists.txt (#1533)
COMPILING_TARGETS is not actually used for --offload-arch option,
instead GPU_TARGETS is being used implicitly when we call
find_package(hip REQUIRED) (See hip-config-amd.cmake).

[ROCm/rccl commit: 2f1c0bb213]
2025-02-16 21:46:37 -06:00
Nikhil-Nunna f228a50646 Env conf debug (#1534)
* Initial Script ready for review

* Added RCCL-tests and RCCL versions

* Added output folder and README

* Base format built

* Added ROCm version

* Added function to center titles and Vram information

* Added HIP version

* Cleaned formatting

* UCX version and MPI version

* Added NUMA balancing

* Added rocminfo

* Removed notes

* Changed regex for broadcom Nic

* Removed note by the ACS info

* Added Hostname to summary and details

* Print summary to terminal

* Added argparse

* Added flags and readme

* Added GPU ID

* fixed spelling

* renamed script again

* Added file descriptor and locked mem checks

* Added file descriptor and locked mem checks

* Removed extra spaces from summary table

* printing output file location

* Removed sudo in code and ACS flag

[ROCm/rccl commit: 4ba94d6662]
2025-02-14 17:31:18 -06:00
Pedram Alizadeh 37f46f1669 reverting the (Reduce NPKit latency overhead in MSCCL kernel) PR #893 (#1525)
[ROCm/rccl commit: 0e5f4d0662]
2025-02-14 11:03:43 -05:00
corey-derochie-amd 30eecfdb25 Revert "replacing rccl_float8 with hip_fp8 and address compatibility issue (#…" (#1545)
This reverts commit 9463a79dd9.

[ROCm/rccl commit: 824b81c034]
2025-02-13 10:00:22 -07:00
mberenjk 9463a79dd9 replacing rccl_float8 with hip_fp8 and address compatibility issue (#1538)
* replacing rccl_float8 with hip_fp8 and address compatibility issue with gfx942
---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: d437d6e41c]
2025-02-13 10:34:17 -06:00
Wenkai Du 170cc1afde Print KL/CL/KE events for all warps (#1544)
* Print KL/CL/KE events for all warps

* Fix count off-by-one issue

* Fix opCount in KE and restore CPU thread option

* Simplify count calculation

[ROCm/rccl commit: ebf7e2305e]
2025-02-12 13:36:31 -08:00
Wenkai Du 7fdbcdfdec Move collective trace to HBM and fix log issue (#1542)
[ROCm/rccl commit: f5b15f27a9]
2025-02-11 11:40:14 -08:00
rahulc1984 689725fb9e Make rccl version detection robust. (#1517)
* Accept an EXPLICIT_ROCM_VERSION and use that vs inspecting the environment if provided.
* Use CMake's built in file reading support vs execute_process (without error checking) to avoid silent but deadly later failures.
* Properly quote some comparisons to avoid syntax errors if they happen to have an empty string.
* Guard against ROCM_PATH being an empty string, avoiding stray path extensions to root directories, etc.

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>

[ROCm/rccl commit: 92ac136db5]
2025-02-11 10:48:22 -07:00
corey-derochie-amd 30e7047750 Switched from cmake_host_system_information feature to a manual parse (#1518)
* Switched cmake_host_system_information feature to a manual parse to remain cmake 3.5 compliant.

* Updating minimum cmake to 3.16 to conform with the rest of ROCm. This change still applies.

[ROCm/rccl commit: 42ab425037]
2025-02-11 08:51:39 -07:00
Nilesh M Negi 4ccbaabdc9 [UT] Include iomanip if not defined (#1510)
* [UT] Include iomanip if not defined

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Remove include guards

`iomanip.h` has pre-defined include guards. These are not needed.

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 4e406acc43]
2025-02-11 08:48:47 -07:00
Dingming Wu 3da64c4edf Replace atomicAdd with __hip_atomic_fetch_add in getting colltrace tail position (#1539)
[ROCm/rccl commit: e8fb1335fd]
2025-02-10 08:53:25 -08:00
gilbertlee-amd 94545f827c Updating topology explorer (#1536)
[ROCm/rccl commit: 6cb0599e38]
2025-02-07 08:44:04 -07:00
Vijay Srinivasan b9051c3eca Adding AINIC Network Plugin check (#1528)
- Adding AINIC network plugin check to pass unused parameter to pass the channelId to the network plugin layer

[ROCm/rccl commit: 3494f52d40]
2025-02-06 23:37:53 -06:00
Nikhil-Nunna 60a86a65a1 Added Nikhil-Nunna to codeowners
[ROCm/rccl commit: fd3422afdb]
2025-02-05 14:28:00 -06:00
AbandiGa 236cc66797 Adding @AbandiGa (myself) as code owner (#1532)
Signed-off-by: AbandiGa <galaband@amd.com>

[ROCm/rccl commit: e92a103bad]
2025-02-05 13:23:25 -06:00
Wenkai Du 07d1cad139 Reset barrier and make barrier_next thread local (#1531)
[ROCm/rccl commit: a12bf32475]
2025-02-05 09:06:48 -08:00
Wenkai Du e6b6a37528 Revert "Remove unused code path (#1527)" (#1530)
This reverts commit a7d9bfda6e.

[ROCm/rccl commit: d00e903d72]
2025-02-04 13:14:43 -08:00
Edgar Gabriel cac16e2c96 update CODEOWNERS (#1529)
* update CODEOWNERS
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>


[ROCm/rccl commit: 3646b1de43]
2025-02-04 11:54:42 -07:00
Wenkai Du a7d9bfda6e Remove unused code path (#1527)
[ROCm/rccl commit: 091bf899a1]
2025-02-04 10:24:56 -08:00
Bertan Dogancay b52af8d803 [P2P] Have connIdx for both send and recv (#1524)
[ROCm/rccl commit: 387c973b5d]
2025-02-04 11:53:20 -05:00
isaki001 a40d4eb960 non-hipGraph MSCCL++ tests for allReduce and allGather (#1503)
* working tests for a single message size

* move call_RCCL routine StandaloneUtils, create .cpp file for StandaloneUtils so that it can be included in several tests

* simplify test invocation

* remove unnecessary logs and exit from ncclCommRegister

* set expected results for allGather

* skip test if nranks doesn't match number of gpus, call getAndDistributeNCCLid only from parent process

* fix improper size of expected-results vector

* Removing unused changes.

* Refactored to create a new file for the forked collectives call, as StandaloneUtils is for the Standalone tests. Renamed the functions to be slightly more accurate and follow existing naming conventions.

* Apply suggestions from code review

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: isaki001 <isakioti@banff-pla-r27-38.pla.dcgpu>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

[ROCm/rccl commit: 3398fa78fe]
2025-02-04 09:11:32 -06:00
isaki001 d2b5ba80a7 Update MSCCL++ register/deregister (#1523)
* erase handle key from mscclpp communicator during deregistration

* remove check on buffer size being a multiple of 32 from registration/deregistration routines since these checks are applied during enqueue

* add check for greater than zero buffer size in mscclpp registration

[ROCm/rccl commit: 19105206f6]
2025-02-04 09:09:56 -06:00
Bertan Dogancay e171f59719 [BUILD] Fix unsupported arguments in generator (#1519)
* Fix unsupported arguments in generator

* Get ROCM_PATH as env variable

[ROCm/rccl commit: 5804603632]
2025-02-03 14:51:55 -05:00
Wenkai Du 37409368f9 Add back opCount and channel ID to debug trace (#1520)
[ROCm/rccl commit: a5c6b547a2]
2025-02-03 08:55:27 -08:00
Jeffrey E Erickson 8127eb85b2 modify max memory to use free (#1513)
[ROCm/rccl commit: 7af21dd996]
2025-02-03 09:35:02 -06:00
Jeffrey Novotny ad149a5ee4 Fix broken link to install instructions (#1515)
[ROCm/rccl commit: 134f736882]
2025-02-03 10:14:40 -05:00
Mustafa Abduljabbar f58025185e Add IB verbs logging and enable traces through install.sh (#1511)
* Add IB Verbs logging

* Simplify tracing and undo debug.h changes

* Update debug.h

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Exchange remote comm device index

[ROCm/rccl commit: dc75209dd7]
2025-01-31 12:35:39 -05:00
Wenkai Du f94af0c9ba Add HDP flush for gfx940 (#1434)
* Fix collective trace

* Use nontemporal for st_global

* Fix previous commit

* Add HDP flush to data receive path

* Fix previous commit

* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH

* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH

Both are on by default. Turning both off skips all flushing and will likely
result in data errors.

* Enable GDR copy by default

* Remove GDR flush env var because it is disabled by GDC flush

* Output kernel collective trace at comm destroy by default

* Limit kernel timeout messages to 100

* Use system relaxed atomic for loadInt

* Refine timeout messages and use atomic for setting offset from CPU

* Add kernel trace for barrier timeout

* Add backup barrier to avoid race in atomicAdd

* Use different counters for different warps

* Rework barrier implementation

* Fix for other GFX

* Use __hip_atomic_store and __hip_atomic_load

* Fix bug in previous commit

* Don't reset barrier values in running kernel

* Update trace format

* Fix typo

* Switch back to hip_atomic_fetch_add

* Use same barrier implementation for all GFX

* Remove extra threadfence

* Turn off HDP flush by default

Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush

* Remove unnecessary changes from alternative barrier implementation

* Added back __threadfence_block

* Revert back to threadfence for gfx other than gfx94x

[ROCm/rccl commit: caba0bc049]
2025-01-31 07:51:10 -08:00
dependabot[bot] ffe6030ee6 Bump rocm-docs-core from 1.14.1 to 1.15.0 in /docs/sphinx (#1514)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.14.1 to 1.15.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.14.1...v1.15.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: ad8012f2fc]
2025-01-30 17:15:17 -07:00
Bertan Dogancay eadb5153ba Add ncclDataType_t as type to ROCTX (#1512)
[ROCm/rccl commit: ecf31da14f]
2025-01-30 13:46:48 -05:00
Arm Patinyasakdikul 2e42761ad1 Make proxy dump print out meaningful information. (#1504)
* Make proxy dump print out meaningful information.

fixed: HPEXA-63

* printout raw data instead.

[ROCm/rccl commit: 6b2b87c9f8]
2025-01-29 16:48:49 -06:00
Bertan Dogancay a781a3033b [Profiler] Enable ROCTX during build by default (#1506)
* Enable ROCTX during build by default

* Check for roctx support in cmake

[ROCm/rccl commit: 35fe9e06f3]
2025-01-29 11:29:46 -05:00
corey-derochie-amd d40857a61c Disabled MSCCL++ feature except when building on Ubuntu or CentOS host systems (#1505)
* Added condition for MSCCL++ to only build on an Ubuntu host system.

* Added CentOS to the supported OS list

[ROCm/rccl commit: bd0f5cccbe]
2025-01-29 08:54:09 -07:00
Nusrat Islam 53c927678b Tune allreduce performance in CPX mode (single OAM) (#1508)
[ROCm/rccl commit: 7ac82248de]
2025-01-29 08:58:48 -06:00