Commit graph

2008 Commits

Author SHA1 Message Date
isaki001 da183596cd add back missing proxy-counter updates (#2052) 2025-11-25 15:22:34 -06:00
Kapil S. Pawar 5fd86021a8 [RcclReplayer] JSON <-> BIN log format conversion tool (#2056)
* Add replay log format converter

* Add Log Sanitizer

* Add no timestamp option (nts) to sanitizer
2025-11-24 11:51:36 -06:00
Nilesh M Negi db52690c2a [AzureCI] Increase timeout of per PR and nightly pipeline to 240 mins (#2074) 2025-11-24 10:55:36 -06:00
AbandiGa b14e32c46e Fix rcclNetP2pPolicy issue (#2072)
* fix rcclNetP2pPolicy issue

* change the comment to ncclNetIb
2025-11-21 18:28:10 -06:00
Nilesh M Negi fca0015962 [AzureCI] Increase UT timeout to 3 hours (#2070) 2025-11-21 09:18:49 -06:00
Matt Williams 3495baa6b2 Fix ToC in API Library page (#2053)
* Add intro and remove ToC
2025-11-20 09:35:15 -05:00
Kapil S. Pawar ab1eb6e70e [AzureCI] Add RcclReplayer to CI (#2048) 2025-11-18 12:21:24 -06:00
Kapil S. Pawar c7f400dbff Functional Tests for Ext-Profiler Plugin (#2007)
* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Added tests for ext-profiler

* Updated readme

* Updated readme
2025-11-18 11:20:39 -06:00
Arm Patinyasakdikul 461e61d10e Added install.sh flag to suppress warnings. (#2054) 2025-11-17 00:35:06 -06:00
Pedram Alizadeh fb67e5b467 Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037)
* Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic

* Switching to hip_bf16.h from ROCm 6.0.0
2025-11-13 15:56:18 -05:00
AbandiGa 277b6e9bac Disable Bfloatf16 pipelining for reduction collectives for gfx950 (#2047)
* disable bf16 reduce_copy pipelining for gfx950

* edit CHANGELOG

* Combine unroll and pipeline local arch calculation into single function

* fix multi-node error and disable for gfx950 even if it's not a local build

* removed has_gfx950

* disable pipelining for gfx950 in rcclSetPipelining

---------

Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>
2025-11-13 14:55:09 -06:00
isaki001 0d09f86608 Post thread-block size increase tuning (#2042)
* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank

* dont override direct allgather threshold if set to -1

* restore 2-node AR simple at earlier message sizes than higher multi-node AR

* extend range of LL for single-node RS on gfx950

* update algo/proto for multi-node allreduce on gfx942

* set single-node AR on gfx950 to Tree LL for KB message sizes

* decrease threshold for single node Tree for gfx950 AR
2025-11-13 14:51:04 -06:00
Bertan Dogancay 83ffc82fa7 [Launch] Move cudaEventRecord call to capturing stream only (#2050) 2025-11-13 08:38:09 -06:00
gilbertlee-amd 46b032b760 [GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031) 2025-11-12 19:34:27 -06:00
nawrinsu c488c5307e Add tuner config file (2,4,8 nodes) for gfx950 (#2012)
* Add tuner config file (2,4,8 nodes) for gfx950

* remove alltoall

* Added comment regarding allgather direct
2025-11-12 09:16:36 -08:00
Kapil S. Pawar c8da880dc7 Added Functional Tests for CSV Tuner Plugin (#1968)
* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Updated readme

* Updated readme

* Updated readme
2025-11-11 10:11:19 -06:00
Dingming Wu b811645688 Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027) 2025-11-11 09:46:51 -06:00
Gheorghe-Teodor Bercea 1678bb9ae7 Fix compilation when enabling indirect function calls (#1994)
Fix compilation when enabling indirect function calls.
2025-11-11 09:36:48 -05:00
Mustafa Abduljabbar 52f9526bd6 Reduce LL threshold for a2a (#2032) 2025-11-10 19:14:23 -05:00
Kapil S. Pawar acdafac49f [RcclReplayer] Compile without the need for RCCL to be compiled (#2039) 2025-11-10 15:38:48 -06:00
Dingming Wu 05f914c997 Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x (#2023)
* Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x
Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that.

* Update src/init.cc to use ERROR instead of WARN
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2025-11-10 11:54:35 -06:00
Dingming Wu b00ee4c83c Increment opCount for intra-node comms as well (#2024)
* Enhance logging in NCCL initialization
It's convenient to log comms obj and default channels together for debugging

* Add opCount to collDevWork and update increment logic
Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms)

* Clarify opCount increment logic in enqueue.cc
Updated comment to clarify incrementing opCount for intranode communications.

* Refactor NCCL_INIT logging format
Updated logging format for NCCL_INIT to improve clarity.

* Remove duplicate INFO logging in init.cc
2025-11-10 11:23:49 -06:00
Bertan Dogancay b1e680adc0 [GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030) 2025-11-07 15:15:25 -05:00
Bertan Dogancay a9bb7e9807 [Launch] Enable Implicit order launch with serial mode (#2033) 2025-11-07 13:29:53 -05:00
Avinash 85baa0d113 Empty kernel test enhancements [tools] (#1999)
* Initial commit

* Improvements-1

* Initial commit for PR

* Updates warning, run.sh, decoupled loops

* Forcing seq cst for CPU timing
2025-11-07 12:28:06 -06:00
Ghadeer Ahmed H Alabandi 45991fadad [NET] Enable capping the number of QPs created for send/recv colls (#1998) 2025-11-07 00:47:01 +00:00
alex-breslow-amd 56e0b4e445 [gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A (#2017)
* Internal benchmarking shows nice single-node performance uplift for MI300A and MI350
2025-11-06 12:12:45 -08:00
Arm Patinyasakdikul d6a53d2022 proxy: handle progressOps return code properly. (#2029) 2025-11-04 09:09:50 -06:00
Aravind Ravikumar 07f8f6d6c6 Add S3 upload support for Perf and test reports by run ID and architecture (#2020)
* Commits to enable scp report copy

* Added Post report upload step

* Added extra arg for fetch artifacts

* Moved to a specific commit

* Add write permissions to s3

* Added comment for TheRock sha commit date

---------

Co-authored-by: arravikum <arravikum@amd.com>
2025-11-03 19:09:34 -05:00
nawrinsu 166268d715 Fix protocol and channel override when tuner is used (#1985)
* Fix protocol and channel override when tuner is used

* Added comment

* Fix README for basic tuner implementation
2025-11-03 13:56:34 -08:00
Ahmed Khan caffd013f6 Remove duplicate MaxEmptyLinesToKeep from clang-format (#2016) 2025-11-02 21:44:27 -06:00
Nilesh M Negi 62ab7a22d7 Revert "[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006)" (#2021)
This reverts commit bed7cdf863.
2025-10-31 10:04:12 -05:00
David DeBonis 63d5846452 Single-node AllGather and ReduceScatter Optimization (#2019)
* Single-node performance tuning

* Normalizing value to individual rank
2025-10-31 08:59:46 -06:00
Arm Patinyasakdikul 1ce83d5cc0 Added ERROR message class to handle fatal error messages. (#2002)
* Added ERROR message class to handle fatal error messages.

New ERROR message class will print the message in all debug level,
including none.

Change some of the fatal error message to be in ERROR instead of WARN.

Added new error handler function to print out more meaningful error
message in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separated class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-10-30 16:14:20 -05:00
Arm Patinyasakdikul 84fdcab68a Added copyrights for Palamida scan 7.2. (#2018) 2025-10-30 13:33:20 -05:00
isaki001 641c0eb51c P2p batching hang-fix (#2011)
* prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default
2025-10-30 13:32:01 -05:00
isaki001 72996e4d9f gx950 multi-node tuning for LL/LL128 (#1953)
* increased LL threshold for gfx950 AR to 256KB

* AG/RS proto threshold update
2025-10-30 12:08:12 -05:00
Bertan Dogancay bed7cdf863 [GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006) 2025-10-30 11:45:53 -04:00
Nilesh M Negi 8444b3c6e9 Fix gfx950 gating conditions to match ROCm 7.0.2 (#2003) 2025-10-29 23:27:04 -05:00
Mustafa Abduljabbar 12f51ba8bf [Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978)
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
2025-10-29 23:24:32 -05:00
Bertan Dogancay b703ffdfa4 [Tools/Replayer] Fix prohibited calls during capture mode (#1938) 2025-10-29 12:19:32 -04:00
corey-derochie-amd 561ad2fe05 Updated Changelog with 7.1.1 and 7.2.0 stub sections (#2008)
* Missing ROCm 7.0 & 7.1.0 Changelog entries (#1976)

* Update CHANGELOG.md

* Update CHANGELOG.md

* Apply suggestions from code review

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Added ROCm 7.2.0 section.

* Update CHANGELOG.md

* Apply suggestion from @corey-derochie-amd

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-10-28 13:41:22 -06:00
Atul Kulkarni cc867dbaf2 Removed RCCL_EXPOSE_STATIC duplicate definition. (#1988) 2025-10-28 13:01:48 -05:00
Atul Kulkarni 26dc7abb32 Added ROCM_VERSION restriction to alloc unit tests (#1989) 2025-10-28 12:54:34 -05:00
alex-breslow-amd e69b11eba5 Remove nontemporality from stores, put in casts to global address space (#1982)
* Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores
* Removes nontemporality from some key stores for gfx950.
2025-10-28 10:34:48 -07:00
corey-derochie-amd f290e302d3 Updated CODEOWNERS to instead use RCCL-Reviewers team (#2010)
* Updated CODEOWNERS to instead use RCCL-Reviewers team

* Apply suggestion from @nileshnegi

Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

---------

Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2025-10-28 09:27:26 -06:00
Kapil S. Pawar 912d53caba Fix segmentation fault related to ext-profiler plugin (#1986) 2025-10-23 09:26:35 -05:00
Joseph Macaranas c2e71e83d1 [External CI] Add references to rocm-systems super repo (#1935)
- In order to trigger downstream jobs to verify projects that consume rccl, references to those repos are required.
2025-10-22 16:07:05 -04:00
Aravind Ravikumar 506c2e9878 Adding reservation time for salloc in CI (#1992)
Co-authored-by: arravikum <arravikum@amd.com>
2025-10-22 10:00:01 -04:00
ehsanhosseinzadehKhaligh aec4f0a659 Updating npkit_trace_generator.py to check npkit directory (#1891)
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir on npkit_trace_generator.py

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2025-10-22 02:51:16 -05:00