821 Commits

Author SHA1 Message Date
Wenkai Du a12bf32475 Reset barrier and make barrier_next thread local (#1531) 2025-02-05 09:06:48 -08:00
Wenkai Du d00e903d72 Revert "Remove unused code path (#1527)" (#1530)
This reverts commit 091bf899a1.
2025-02-04 13:14:43 -08:00
Wenkai Du 091bf899a1 Remove unused code path (#1527) 2025-02-04 10:24:56 -08:00
Bertan Dogancay 387c973b5d [P2P] Have connIdx for both send and recv (#1524) 2025-02-04 11:53:20 -05:00
isaki001 19105206f6 Update MSCCL++ register/deregister (#1523)
* erase handle key from mscclpp communicator during deregistration

* remove check on buffer size being a multiple of 32 from registration/deregistration routines since these checks are applied during enqueue

* add check for greater than zero buffer size in mscclpp registration
2025-02-04 09:09:56 -06:00
Bertan Dogancay 5804603632 [BUILD] Fix unsupported arguments in generator (#1519)
* Fix unsupported arguments in generator

* Get ROCM_PATH as env variable
2025-02-03 14:51:55 -05:00
Wenkai Du a5c6b547a2 Add back opCount and channel ID to debug trace (#1520) 2025-02-03 08:55:27 -08:00
Mustafa Abduljabbar dc75209dd7 Add IB verbs logging and enable traces through install.sh (#1511)
* Add IB Verbs logging

* Simplify tracing and undo debug.h changes

* Update debug.h

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Exchange remote comm device index
2025-01-31 12:35:39 -05:00
Wenkai Du caba0bc049 Add HDP flush for gfx940 (#1434)
* Fix collective trace

* Use nontemporal for st_global

* Fix previous commit

* Add HDP flush to data receive path

* Fix previous commit

* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH

* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH

Both are on by default. Turning both off skips all flushes and will likely
result in data errors.

* Enable GDR copy by default

* Remove GDR flush env var because it is disabled by GDC flush

* Output kernel collective trace at comm destroy by default

* Limit kernel timeout messages to 100

* Use system relaxed atomic for loadInt

* Refine timeout messages and use atomic for setting offset from CPU

* Add kernel trace for barrier timeout

* Add backup barrier to avoid race in atomicAdd

* Use different counters for different warps

* Rework barrier implementation

* Fix for other GFX

* Use __hip_atomic_store and __hip_atomic_load

* Fix bug in previous commit

* Don't reset barrier values in running kernel

* Update trace format

* Fix typo

* Switch back to hip_atomic_fetch_add

* Use same barrier implementation for all GFX

* Remove extra threadfence

* Turn off HDP flush by default

Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush

* Remove unnecessary changes from alternative barrier implementation

* Added back __threadfence_block

* Revert back to threadfence for gfx other than gfx94x
2025-01-31 07:51:10 -08:00
Bertan Dogancay ecf31da14f Add ncclDataType_t as type to ROCTX (#1512) 2025-01-30 13:46:48 -05:00
Arm Patinyasakdikul 6b2b87c9f8 Make proxy dump print out meaningful information. (#1504)
* Make proxy dump print out meaningful information.

fixed: HPEXA-63

* print out raw data instead.
2025-01-29 16:48:49 -06:00
Bertan Dogancay 35fe9e06f3 [Profiler] Enable ROCTX during build by default (#1506)
* Enable ROCTX during build by default

* Check for roctx support in cmake
2025-01-29 11:29:46 -05:00
Bertan Dogancay dd185f26d2 Fix ROCTX call for MSCCL (#1502) 2025-01-23 16:00:07 -07:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
corey-derochie-amd f77308a2fe Removing duplicate definitions of INC_COLL_TRACE and traceData macros (#1500)
They are nearly identical, except the common.h definition sets `collTrace->channelId`.
2025-01-22 16:50:27 -07:00
isaki001 ff130cce7a fix scatter_perf crash (#1493)
* fix scatter_perf crash

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* More buffsRegisteredNonGraphMode spelling fixes.

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-21 09:24:32 -06:00
isaki001 d89432e8c8 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package
2025-01-20 08:06:43 -06:00
corey-derochie-amd 2e35417fe5 Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options (#1418)
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.

* Cast the linger value to `int` sooner to avoid a scope of unknown typed-ness.

* Added CHANGELOG entry for this feature.
2025-01-14 10:26:04 -07:00
Nusrat Islam e9b6bbca8a Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration, during the deRegister call, do not fall back to rccl deRegister, which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-14 08:20:24 -06:00
Dingming Wu 69d0134ed2 improving kernel traces on opCount bits and adding channelId in ncclCollTrace (#1485) 2025-01-10 07:57:46 -08:00
Luna b24580e3d4 net_ib: fix out of bounds read in ncclIbGdrSupport on non-RDMA kernel (#1470)
Fixes #1469
2025-01-07 16:49:24 -08:00
qiwei_ji f2ee8d9132 Check nvlink_node instead of xgmi_node in xml.cc (#1407)
It seems the intent here is to check xgmi_node instead. If the node is checked for "nvlink", link_info is verified every time.
If the node is checked for "xgmi" and the answer is yes, there is no need to check the vsmi topo interface.
2025-01-06 17:09:27 -08:00
Xeonacid c6c7b6db98 Define wc_store_fence for riscv (#1475) 2025-01-06 16:59:14 -08:00
corey-derochie-amd c158d3a9b4 [SWDEV-497665] Blocked cudaMemcpyAsync race condition by synchronizing (#1447)
* Switched calls to `cudaMemcpyAsync` to be `cudaMemcpy` in `ncclTransportP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`.

* Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.
2025-01-03 13:06:47 -07:00
Mustafa Abduljabbar a9d6e7661c Add macro for channel mask offset (#1467) 2025-01-03 08:41:41 -05:00
Mustafa Abduljabbar e6b179d627 Remove unneeded highestTransportType (#1461) 2024-12-16 13:28:47 -05:00
Ziyue Yang 83c5eb7378 Fix MSCCL algorithm loading order (#1460) 2024-12-16 07:41:17 -08:00
Hujingbo ad4c36dc34 increase p2p channels for Intel platform (#1448)
Co-authored-by: hujingbo <hujingbo@kuaishou.com>
2024-12-10 07:33:37 -08:00
Benjamin Kitor a05329bd0d Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P
2024-12-03 13:12:03 -08:00
gilbertlee-amd 000575867c Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal (#1431)
* Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal
2024-11-25 11:24:54 -07:00
Bertan Dogancay dfe4a3ed81 Fix typo in ncclGetKernelIndex macro (#1424) 2024-11-18 10:40:05 -05:00
Bertan Dogancay cb175fb0b3 Template generic kernel for unroll factor (#1419)
* Template generic kernel for unroll factor
2024-11-12 18:27:29 -05:00
darren-amd ebf0417e90 remove undefined computeColl declaration 2024-11-04 13:42:01 -05:00
corey-derochie-amd 1c45962273 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.
2024-11-04 09:46:42 -07:00
Abhishek Kulkarni 6178556853 GDR enablement logic fix for kernel 6.4.0+ (#1378) 2024-11-03 01:20:07 -05:00
Avinash d6006f0425 Memory leak fixes in hostside functions (#1388)
memory leak fixes for parseRome4P2H and ncclTopoAddGPU
2024-10-30 14:25:56 -05:00
corey-derochie-amd ea20af698e Remove MSCCL switch case fall-through by adding break statement. (#1342) 2024-10-29 15:47:59 -06:00
gilbertlee-amd 0cbce2a757 Adding support for odd nodes for model_87 (#1309) 2024-10-24 08:38:12 -06:00
Arm Patinyasakdikul 29f87c7191 Increased maximum number of XML nodes to support CPX mode. (#1386) 2024-10-23 11:15:11 -05:00
Wenkai Du e0780ba4d4 Fix topology discovery in container with subset of GPUs (#1384)
* Fix topology discovery in container with subset of GPUs

* Move links counting out of loop
2024-10-22 13:50:23 -07:00
Bertan Dogancay 373f113524 Dynamically select unroll factor to build for when targeting local arch (#1371)
* Dynamically select unroll factor to build for when targeting local arch only
2024-10-21 10:53:11 -04:00
Wenkai Du 7c077db307 Increase CQ size to 3*MAX_REQUESTS (#1374)
* Increase CQ size to 3*MAX_REQUESTS

Suggested by Rukhsana Ansari <rukhsana.ansari@broadcom.com>

* Reword comments based on feedback from Rukhsana
2024-10-18 11:01:03 -07:00
akolliasAMD af5678641d added atomic acquire for gfx12 on prims_simple (#1382) 2024-10-18 11:26:38 -06:00
Wenkai Du c8d3543d3f Add back missing net flush (#1376) 2024-10-15 08:12:26 -07:00
Wenkai Du 821d2e1f30 Allow zero byte sendrecv in alltoallv (#1349)
* Allow zero byte sendrecv in alltoallv

* Fix previous merge error
2024-10-11 10:40:32 -07:00
Wenkai Du 5c367a21d0 Improve model matching for GPUs with alltoall XGMI connection (#1372) 2024-10-11 09:53:14 -07:00
Arm Patinyasakdikul 133ea201cf Increase default number of channels for MI300A in multi-node scenario. (#1366)
This commit changes the default number of channels for MI300A from 8 up to 24.
This helps bring multi-node performance up to the expected level.
2024-10-11 11:37:48 -05:00
Wenkai Du b55b6be0cb Fix crash when PXN is enabled on some platforms (#1369) 2024-10-11 09:02:59 -07:00
corey-derochie-amd c11f6b1531 Only set minNchannels if we are actually using MSCCL, checked using comm->mscclCompatible. (#1337) 2024-10-08 10:20:55 -06:00
akolliasAMD bc519fd733 disabled wbinvl1 for gfx9x on ll128 (#1365) 2024-10-08 08:43:29 -06:00
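
Commit 2e35417fe5 (#1418) above makes the SO_REUSEADDR and SO_LINGER socket options controllable through environment parameters. The commit message does not name those parameters, so the sketch below uses hypothetical variable names (`EXAMPLE_SOCKET_REUSEADDR`, `EXAMPLE_SOCKET_LINGER`) and assumed defaults. It only illustrates the general pattern of gating socket options on the environment, on a POSIX system, and is not RCCL's implementation.

```python
import os
import socket
import struct

# Hypothetical variable names for illustration only; the real RCCL
# parameter names are not given in the commit message.
REUSEADDR_VAR = "EXAMPLE_SOCKET_REUSEADDR"
LINGER_VAR = "EXAMPLE_SOCKET_LINGER"


def _env_int(env, name, default):
    """Parse an integer setting, falling back to `default` if unset or invalid."""
    try:
        return int(env[name])
    except (KeyError, ValueError):
        return default


def socket_options(env=None):
    """Return the (level, option, value) triples selected by the environment."""
    env = os.environ if env is None else env
    options = []
    # Assumed default: address reuse is on unless explicitly set to 0.
    if _env_int(env, REUSEADDR_VAR, 1) != 0:
        options.append((socket.SOL_SOCKET, socket.SO_REUSEADDR, 1))
    # Assumed default: a negative value leaves the OS linger behaviour alone.
    linger_seconds = _env_int(env, LINGER_VAR, -1)
    if linger_seconds >= 0:
        # POSIX struct linger is two C ints: l_onoff, l_linger.
        value = struct.pack("ii", 1, linger_seconds)
        options.append((socket.SOL_SOCKET, socket.SO_LINGER, value))
    return options


def configure(sock, env=None):
    """Apply the selected options to an already created socket."""
    for level, option, value in socket_options(env):
        sock.setsockopt(level, option, value)
```

Calling `configure(sock)` right after creating a socket applies whatever the environment selects. Keeping the decision logic in `socket_options` separate from `setsockopt` makes the policy easy to inspect without opening a socket.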