Commit Graph

852 Commits

Author SHA1 Message Date
Nilesh M Negi 307bc10781 [SRC] Enable unroll=1 for gfx950 (#1602)
* [SRC] Enable unroll=1 for gfx950

* Fix typo from rebase in generate.py

* Support for unroll=1 and gfx90a when building for all GPU targets

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-27 18:21:35 -05:00
isaki001 9dc23d9265 Disable mscclpp (#1614)
* disable mscclpp by default
2025-03-25 15:21:16 -05:00
gilbertlee-amd 626dc50ab5 Removing the experimental clique kernel files (#1610) 2025-03-20 18:10:01 -06:00
Wenkai Du 90ad586d94 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off with FAULT_INJECTION=OFF in cmake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock
2025-03-20 16:11:43 -07:00
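The fault-injection idea in the commit above (random delays right after a barrier) can be illustrated with a host-side analogy. This is only a sketch in Python threads, not the RCCL kernel code: `run_workers`, `max_delay_s`, and `seed` are hypothetical names, `threading.Barrier` stands in for `__syncthreads()`, and the `inject` flag loosely mirrors the FAULT_INJECTION build switch.

```python
import random
import threading
import time


def run_workers(num_workers, inject=True, max_delay_s=0.005, seed=0):
    """Release workers from a barrier, then optionally stagger them with random delays."""
    rng = random.Random(seed)
    # Pre-compute one delay per worker so a given seed is reproducible.
    delays = [rng.uniform(0.0, max_delay_s) if inject else 0.0
              for _ in range(num_workers)]
    barrier = threading.Barrier(num_workers)
    finished = []
    lock = threading.Lock()

    def worker(index):
        barrier.wait()             # stand-in for __syncthreads()
        time.sleep(delays[index])  # injected delay right after the barrier
        with lock:
            finished.append(index)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished, delays
```

Varying the start times this way tends to expose code that silently assumes all workers leave the barrier in lockstep; with `inject=False` every delay is zero and the run behaves as before.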
corey-derochie-amd 6505639cf4 removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
Wenkai Du bd0092e8f1 GDRCOPY support: Off by default (#1605) 2025-03-18 08:17:01 -07:00
Avinash ccb0820743 Memory leak fix when numIBDevices = 0 (#1429)
* Initial commit for testing

* Fix memory leak in checkOptions

* Fix memory leak in checkOption

* x

* Delete cmake-3.28.2-linux-x86_64.sh

* gcn changes

* gcn memleak fixes

* gcn leak fix

* memory leak fixes for parseRome4P2H and ncclTopoAddGPU

* Keeping only necessary file for fixes

Deleting temporary scripts I created for debugging and testing

* changing to GCN_ARCH_NAME_LEN

* Added sanity check directory

* refactoring scripts

* Updated to sanity checks folder

* Initial fixes

* changes in tools

* pointing RCCL lib build to debug version

* Removed second pthread_detach

* Removing sanity checks

* Keeping only code changes

* addressing memory leaks in ncclIbinit

---------

Co-authored-by: Chao Chen <cchen104@amd.com>
2025-03-17 11:21:19 -05:00
Mustafa Abduljabbar f67b2cc908 Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 (#1604)
* Add reduce_scatter LL and LL128 thresholds

* Always honor user choice for protocol
2025-03-17 11:21:01 -04:00
Wenkai Du 245c2de909 Enable LL128 on gfx942 (#1549) 2025-03-16 15:10:05 -07:00
Nilesh M Negi 1df73e209e [GRAPH] Increase default nChannels to 112 for gfx950 (#1596)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-14 14:47:03 -07:00
Wenkai Du 4237caad69 Limit P2P channels per peer to not exceeding max channels (#1594)
* Limit P2P channels per peer to not exceeding max channels

* [UT] test single GPU cases for all collectives

* [UT] fix out of range root value
2025-03-11 09:32:09 -07:00
Tim f6c6d451a9 mscclpp compatibility check for ubr (#1573) 2025-03-09 22:10:47 -04:00
Nusrat Islam ac823818aa misc/msccl: force use of mscclpp (#1581) 2025-03-04 12:48:59 -06:00
Bertan Dogancay d88cca3098 [Transport] Fix IntraNet (#1582) 2025-03-04 13:30:36 -05:00
Wenkai Du f957c4fe22 NPKit: enable reduce scatter profiling (#1580) 2025-03-04 10:03:56 -08:00
Nusrat Islam 23c0b7bd84 misc/msccl: Read graph capture status for every collective call (#1576)
* misc/msccl: read graphCaptureStatus for every collective call

* fix a bug in checking whether UBR is enabled in MSCCLPP

* cmake: Fix patch reversal order

* misc/msccl: add logging
2025-02-28 17:16:07 -06:00
Wenkai Du 60c1264d27 Improve RDMA flushing by writing a dummy payload with RO=0 (#1570)
* Improve RDMA flushing by writing a dummy payload with RO=0

* Rename env var for disabling this change to RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING
2025-02-27 16:20:32 -08:00
Nilesh M Negi 02b8e504f0 [BUILD] Fix generate.py for gfx950 (#1575)
* [BUILD] Fix generate.py for gfx950

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

* [BUILD] Cleaner way to check for gfx targets

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>
2025-02-26 18:43:37 -06:00
Bertan Dogancay 85eb1f16bc Use bit reversal based mapping for multi-node (#1572) 2025-02-26 09:48:03 -05:00
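For readers unfamiliar with the term in the commit above, a bit-reversal mapping is commonly a permutation in which index `i` maps to the value obtained by reversing its binary digits. The sketch below shows that generic permutation only; how RCCL applies it to multi-node mapping is not stated in this log, and `bit_reverse` and `bit_reversal_order` are hypothetical helper names.

```python
def bit_reverse(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # shift in the lowest bit of value
        value >>= 1
    return result


def bit_reversal_order(count: int) -> list:
    """Bit-reversal permutation of range(count); count must be a power of two."""
    if count <= 0 or count & (count - 1):
        raise ValueError("count must be a positive power of two")
    width = count.bit_length() - 1
    return [bit_reverse(i, width) for i in range(count)]


print(bit_reversal_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Neighbouring indices land far apart in the output, which is the usual reason such a mapping is chosen when spreading work across nodes.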
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00
gilbertlee-amd ddc5d58b93 Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES
2025-02-20 15:18:29 -07:00
akolliasAMD aedbc95735 reverted the syncLDS back to syncthreads (#1554) 2025-02-19 10:44:32 -07:00
Wenkai Du baaa2ac64d Insert barrier after loading work items to LDS (#1551) 2025-02-18 10:17:27 -08:00
Wenkai Du 32dc7ef47c Enable GDRCopy only on gfx94x (#1550)
* Enable GDRCopy only on gfx94x

* Use cudaFree instead of hipFree

* Add warning if failed to get device property

* Remove extra return
2025-02-17 13:28:19 -08:00
Pedram Alizadeh 0e5f4d0662 reverting the (Reduce NPKit latency overhead in MSCCL kernel) PR #893 (#1525) 2025-02-14 11:03:43 -05:00
corey-derochie-amd 824b81c034 Revert "replacing rccl_float8 with hip_fp8 and address compatibility issue (#…" (#1545)
This reverts commit d437d6e41c.
2025-02-13 10:00:22 -07:00
mberenjk d437d6e41c replacing rccl_float8 with hip_fp8 and address compatibility issue (#1538)
* replacing rccl_float8 with hip_fp8 and address compatibility issue with gfx942
---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-02-13 10:34:17 -06:00
Wenkai Du ebf7e2305e Print KL/CL/KE events for all warps (#1544)
* Print KL/CL/KE events for all warps

* Fix count off-by-one issue

* Fix opCount in KE and restore CPU thread option

* Simplify count calculation
2025-02-12 13:36:31 -08:00
Wenkai Du f5b15f27a9 Move collective trace to HBM and fix log issue (#1542) 2025-02-11 11:40:14 -08:00
Dingming Wu e8fb1335fd Replace atomicAdd with __hip_atomic_fetch_add in getting colltrace tail position (#1539) 2025-02-10 08:53:25 -08:00
Vijay Srinivasan 3494f52d40 Adding AINIC Network Plugin check (#1528)
- Adding an AINIC network plugin check that uses an otherwise unused parameter to pass the channelId to the network plugin layer
2025-02-06 23:37:53 -06:00
Wenkai Du a12bf32475 Reset barrier and make barrier_next thread local (#1531) 2025-02-05 09:06:48 -08:00
Wenkai Du d00e903d72 Revert "Remove unused code path (#1527)" (#1530)
This reverts commit 091bf899a1.
2025-02-04 13:14:43 -08:00
Wenkai Du 091bf899a1 Remove unused code path (#1527) 2025-02-04 10:24:56 -08:00
Bertan Dogancay 387c973b5d [P2P] Have connIdx for both send and recv (#1524) 2025-02-04 11:53:20 -05:00
isaki001 19105206f6 Update MSCCL++ register/deregister (#1523)
* erase handle key from mscclpp communicator during deregistration

* remove check on buffer size being a multiple of 32 from registration/deregistration routines since these checks are applied during enqueue

* add check for greater than zero buffer size in mscclpp registration
2025-02-04 09:09:56 -06:00
Bertan Dogancay 5804603632 [BUILD] Fix unsupported arguments in generator (#1519)
* Fix unsupported arguments in generator

* Get ROCM_PATH as env variable
2025-02-03 14:51:55 -05:00
Wenkai Du a5c6b547a2 Add back opCount and channel ID to debug trace (#1520) 2025-02-03 08:55:27 -08:00
Mustafa Abduljabbar dc75209dd7 Add IB verbs logging and enable traces through install.sh (#1511)
* Add IB Verbs logging

* Simplify tracing and undo debug.h changes

* Update debug.h

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Exchange remote comm device index
2025-01-31 12:35:39 -05:00
Wenkai Du caba0bc049 Add HDP flush for gfx940 (#1434)
* Fix collective trace

* Use nontemporal for st_global

* Fix previous commit

* Add HDP flush to data receive path

* Fix previous commit

* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH

* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH

Both are on by default. Turning both off skips all flushing and will likely
result in data errors.

* Enable GDR copy by default

* Remove GDR flush env var because it is disabled by GDC flush

* Output kernel collective trace at comm destroy by default

* Limit kernel timeout messages to 100

* Use system relaxed atomic for loadInt

* Refine timeout messages and use atomic for setting offset from CPU

* Add kernel trace for barrier timeout

* Add backup barrier to avoid race in atomicAdd

* Use different counters for different warps

* Rework barrier implementation

* Fix for other GFX

* Use __hip_atomic_store and __hip_atomic_load

* Fix bug in previous commit

* Don't reset barrier values in running kernel

* Update trace format

* Fix typo

* Switch back to hip_atomic_fetch_add

* Use same barrier implementation for all GFX

* Remove extra threadfence

* Turn off HDP flush by default

Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush

* Remove unnecessary changes from alternative barrier implementation

* Added back __threadfence_block

* Revert back to threadfence for gfx other than gfx94x
2025-01-31 07:51:10 -08:00
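According to the final messages of the commit above, HDP flush ends up off by default and is switched on with `RCCL_NET_HDP_FLUSH=1`. A launcher script could prepare the child-process environment along these lines. This is a sketch: `with_hdp_flush` is a hypothetical helper, and dropping the variable to fall back to the default is an assumption rather than documented behaviour.

```python
import os


def with_hdp_flush(base_env=None, enabled=True):
    """Return a copy of the environment with RCCL_NET_HDP_FLUSH set or cleared."""
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env["RCCL_NET_HDP_FLUSH"] = "1"       # value named in the commit message
    else:
        env.pop("RCCL_NET_HDP_FLUSH", None)   # assumption: rely on the off default
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the job, which leaves the parent process environment untouched.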
Bertan Dogancay ecf31da14f Add ncclDataType_t as type to ROCTX (#1512) 2025-01-30 13:46:48 -05:00
Arm Patinyasakdikul 6b2b87c9f8 Make proxy dump print out meaningful information. (#1504)
* Make proxy dump print out meaningful information.

fixed: HPEXA-63

* print out raw data instead.
2025-01-29 16:48:49 -06:00
Bertan Dogancay 35fe9e06f3 [Profiler] Enable ROCTX during build by default (#1506)
* Enable ROCTX during build by default

* Check for roctx support in cmake
2025-01-29 11:29:46 -05:00
Bertan Dogancay dd185f26d2 Fix ROCTX call for MSCCL (#1502) 2025-01-23 16:00:07 -07:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
corey-derochie-amd f77308a2fe Removing duplicate definitions of INC_COLL_TRACE and traceData macros (#1500)
They are nearly identical, except the common.h definition sets `collTrace->channelId`.
2025-01-22 16:50:27 -07:00
isaki001 ff130cce7a fix scatter_perf crash (#1493)
* fix scatter_perf crash

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* More buffsRegisteredNonGraphMode spelling fixes.

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-21 09:24:32 -06:00
isaki001 d89432e8c8 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package
2025-01-20 08:06:43 -06:00
corey-derochie-amd 2e35417fe5 Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options (#1418)
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.

* Casted the linger value to `int` sooner to avoid a scope of unknown typed-ness.

* Added CHANGELOG entry for this feature.
2025-01-14 10:26:04 -07:00
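The two socket options named in the commit above are standard BSD socket options. The sketch below shows how they are typically set, using Python's `socket` module for brevity. The RCCL environment parameter names are not given in this log, so `reuse_addr` and `linger_seconds` are hypothetical arguments, and the `"ii"` layout for `struct linger` assumes a Linux-like platform.

```python
import socket
import struct


def configure_socket(sock, reuse_addr=True, linger_seconds=None):
    """Apply SO_REUSEADDR and, optionally, SO_LINGER to an existing socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR,
                    1 if reuse_addr else 0)
    if linger_seconds is not None:
        # struct linger { int l_onoff; int l_linger; } on Linux-like systems.
        linger = struct.pack("ii", 1, int(linger_seconds))
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
```

A linger time of zero makes `close()` discard the connection immediately instead of leaving it in a lingering state, which is one way such settings can affect how many descriptors stay open during bootstrapping.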
Nusrat Islam e9b6bbca8a Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration, during the deRegister call, don't fall back to rccl deRegister, which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-14 08:20:24 -06:00