rocm-systems

Author	SHA1	Message	Date
Wenkai Du	a12bf32475	Reset barrier and make barrier_next thread local (#1531 )	2025-02-05 09:06:48 -08:00
Wenkai Du	d00e903d72	Revert "Remove unused code path (#1527 )" (#1530 ) This reverts commit `091bf899a1`.	2025-02-04 13:14:43 -08:00
Wenkai Du	091bf899a1	Remove unused code path (#1527 )	2025-02-04 10:24:56 -08:00
Bertan Dogancay	387c973b5d	[P2P] Have connIdx for both send and recv (#1524 )	2025-02-04 11:53:20 -05:00
isaki001	19105206f6	Update MSCCL++ register/deregister (#1523 ) * erase handle key from mscclpp communicator during deregistration * remove check on buffer size being a multiple of 32 from registration/deregistration routines since these checks are applied during enqueue * add check for greater than zero buffer size in mscclpp registration	2025-02-04 09:09:56 -06:00
Bertan Dogancay	5804603632	[BUILD] Fix unsupported arguments in generator (#1519 ) * Fix unsupported arguments in generator * Get ROCM_PATH as env variable	2025-02-03 14:51:55 -05:00
Wenkai Du	a5c6b547a2	Add back opCount and channel ID to debug trace (#1520 )	2025-02-03 08:55:27 -08:00
Mustafa Abduljabbar	dc75209dd7	Add IB verbs logging and enable traces through install.sh (#1511 ) * Add IB Verbs logging * Simplify tracing and undo debug.h changes * Update debug.h * Update CHANGELOG.md * Update CHANGELOG.md * Update CHANGELOG.md * Exchange remote comm device index	2025-01-31 12:35:39 -05:00
Wenkai Du	caba0bc049	Add HDP flush for gfx940 (#1434 ) * Fix collective trace * Use nontemporal for st_global * Fix previous commit * Add HDP flush to data receive path * Fix previous commit * Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH * Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH Both are on by default. Turn both off will skip all flush will likely result in data error. * Enable GDR copy by default * Remove GDR flush env var because it is disabled by GDC flush * Output kernel collective trace at comm destroy by default * Limit kernel timeout messages to 100 * Use system relaxed atomic for loadInt * Refine timeout messages and use atomic for setting offset from CPU * Add kernel trace for barrier timeout * Add backup barrier to avoid race in atomicAdd * Use different counters for different warps * Rework barrier implementation * Fix for other GFX * Use __hip_atomic_store and __hip_atomic_load * Fix bug in previous commit * Don't reset barrier values in running kernel * Update trace format * Fix typo * Switch back to hip_atomic_fetch_add * Use same barrier implementation for all GFX * Remove extra threadfence * Turn off HDP flush by default Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush * Remove unnecessary changes from alternative barrier implementation * Added back __threadfence_block * Revert back to threadfence for gfx other than gfx94x	2025-01-31 07:51:10 -08:00
Bertan Dogancay	ecf31da14f	Add ncclDataType_t as type to ROCTX (#1512 )	2025-01-30 13:46:48 -05:00
Arm Patinyasakdikul	6b2b87c9f8	Make proxy dump print out meaningful information. (#1504 ) * Make proxy dump print out meaningful information. fixed: HPEXA-63 * printout raw data instead.	2025-01-29 16:48:49 -06:00
Bertan Dogancay	35fe9e06f3	[Profiler] Enable ROCTX during build by default (#1506 ) * Enable ROCTX during build by default * Check for roctx support in cmake	2025-01-29 11:29:46 -05:00
Bertan Dogancay	dd185f26d2	Fix ROCTX call for MSCCL (#1502 )	2025-01-23 16:00:07 -07:00
BertanDogancay	36343be84f	Merge remote-tracking branch 'nccl/master' into develop	2025-01-23 12:08:46 -06:00
corey-derochie-amd	f77308a2fe	Removing duplicate definitions of `INC_COLL_TRACE` and `traceData` macros (#1500 ) They are nearly identical, except the common.h definition sets `collTrace->channelId`.	2025-01-22 16:50:27 -07:00
isaki001	ff130cce7a	fix scatter_perf crash (#1493 ) * fix scatter_perf crash * Update src/misc/msccl/msccl_lifecycle.cc Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> * Update src/misc/msccl/msccl_lifecycle.cc Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> * More buffsRegisteredNonGraphMode spelling fixes. --------- Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>	2025-01-21 09:24:32 -06:00
isaki001	d89432e8c8	update mscclpp (#1488 ) * update commit hash for mscclpp submodule * update mscclpp submodule * remove print messages in cmake * add back some print messages, update MSCLPP CMAKE_ARGS * enable MSCCL++ patches regardless of finding mscclpp_nccl package	2025-01-20 08:06:43 -06:00
corey-derochie-amd	2e35417fe5	Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options (#1418 ) * Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping. * Casted the linger value to `int` sooner to avoid a scope of unknown typed-ness. * Added CHANGELOG entry for this feature.	2025-01-14 10:26:04 -07:00
Nusrat Islam	e9b6bbca8a	Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477 ) * ext-src: add MSCCLPP memory registration APIs * update mem-reg patch with mscclpp helper routine to check if buffer is registered * RCCL integration of MSCCL++ user-buffer registration APIs * only include mscclpp_nccl header if ENABLE_MSCCLPP is defined * ext-src: update mscclpp mem-reg patch * add helper routine to patch * check handle before MSCCL++ deregister * fix typo to replace send buff with recv buff * in case of no mscclpp registration, during deRegister call, don't fall back to rccl deRegister which will return an error * Apply suggestions from code review Whitespace suggestions and reducing diffs to avoid future merge conflicts Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> * rename helper functions and change their return type * set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory --------- Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com> Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com> Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>	2025-01-14 08:20:24 -06:00
Dingming Wu	69d0134ed2	improving kernel traces on opCount bits and adding channelId in ncclCollTrace (#1485 )	2025-01-10 07:57:46 -08:00
Luna	b24580e3d4	net_ib: fix out of bounds read in ncclIbGdrSupport on non-RDMA kernel (#1470 ) Fixes #1469	2025-01-07 16:49:24 -08:00
qiwei_ji	f2ee8d9132	Check nvlink_node instead of xgmi_node in xml.cc (#1407 ) It seems like here wants to check xgmi_node instead. If checks node for "nvlink", it will verify the link_info everytime. If checks node for "xgmi", when get yes answer, it won't need check vsmi topo interface.	2025-01-06 17:09:27 -08:00
Xeonacid	c6c7b6db98	Define wc_store_fence for riscv (#1475 )	2025-01-06 16:59:14 -08:00
corey-derochie-amd	c158d3a9b4	[SWDEV-497665] Blocked `cudaMemcpyAsync` race condition by synchronizing (#1447 ) * Switched calls to `cudaMemcpyAsync` to be `cudaMemcpy` in `ncclTransportP2pSetup` to avoid race condition with `cudaIpcOpenMemHandle` inside p2p `connect`. See `ncclP2pImportShareableBuffer`. * Moved synchronize outside of the loop, as it isn't necessary to sync between every iteration of the loop.	2025-01-03 13:06:47 -07:00
Mustafa Abduljabbar	a9d6e7661c	Add macro for channel mask offset (#1467 )	2025-01-03 08:41:41 -05:00
Mustafa Abduljabbar	e6b179d627	Remove unneeded highestTransportType (#1461 )	2024-12-16 13:28:47 -05:00
Ziyue Yang	83c5eb7378	Fix MSCCL algorithm loading order (#1460 )	2024-12-16 07:41:17 -08:00
Hujingbo	ad4c36dc34	increase p2p channels for Intel platform (#1448 ) Co-authored-by: hujingbo <hujingbo@kuaishou.com>	2024-12-10 07:33:37 -08:00
Benjamin Kitor	a05329bd0d	Add Topologies for 16-GPU gfx942 SuperNode (#1417 ) * Add Topologies for 16-GPU gfx942 SuperNode - Add GigaIO topologies to tools/topo_expl for dev and testing - Add GigaIO Columba 16 GPU romeModel and adjust topology matching algorithm in rome_models for 16 GPU system - Fix bug which failed to match Rome Model when using subsets of system resources (i.e. ROCR_VISIBLE_DEVICES is set) - Fixes for topo_expl * Fix bug w/ 1H16P	2024-12-03 13:12:03 -08:00
gilbertlee-amd	000575867c	Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal (#1431 ) * Adding RCCL_MODEL_REVERSAL_DISABLE env var to disable model reversal	2024-11-25 11:24:54 -07:00
Bertan Dogancay	dfe4a3ed81	Fix typo in ncclGetKernelIndex macro (#1424 )	2024-11-18 10:40:05 -05:00
Bertan Dogancay	cb175fb0b3	Template generic kernel for unroll factor (#1419 ) * Template generic kernel for unroll factor	2024-11-12 18:27:29 -05:00
darren-amd	ebf0417e90	remove undefined computeColl declaration	2024-11-04 13:42:01 -05:00
corey-derochie-amd	1c45962273	Hide or fix all build warnings (#1331 ) * Changing C-strings to be const. * Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension. * Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings. * Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check. * Fixed VLA in rccl UT.	2024-11-04 09:46:42 -07:00
Abhishek Kulkarni	6178556853	GDR enablement logic fix for kernel 6.4.0+ (#1378 )	2024-11-03 01:20:07 -05:00
Avinash	d6006f0425	Memory leak fixes in hostside functions (#1388 ) memory leak fixes for parseRome4P2H and ncclTopoAddGPU	2024-10-30 14:25:56 -05:00
corey-derochie-amd	ea20af698e	Remove MSCCL switch case fall-through by adding break statement. (#1342 )	2024-10-29 15:47:59 -06:00
gilbertlee-amd	0cbce2a757	Adding support for odd nodes for model_87 (#1309 )	2024-10-24 08:38:12 -06:00
Arm Patinyasakdikul	29f87c7191	Increased maximum number of XML nodes to support CPX mode. (#1386 )	2024-10-23 11:15:11 -05:00
Wenkai Du	e0780ba4d4	Fix topology discovery in container with subset of GPUs (#1384 ) * Fix topology discovery in container with subset of GPUs * Move links counting out of loop	2024-10-22 13:50:23 -07:00
Bertan Dogancay	373f113524	Dynamically select unroll factor to build for when targeting local arch (#1371 ) * Dynamically select unroll factor to build for when targeting local arch only	2024-10-21 10:53:11 -04:00
Wenkai Du	7c077db307	Increase CQ size to 3*MAX_REQUESTS (#1374 ) Increase CQ size to 3*MAX_REQUESTS Suggested by Rukhsana Ansari <rukhsana.ansari@broadcom.com> Reword comments based on feedback from Rukhsana	2024-10-18 11:01:03 -07:00
akolliasAMD	af5678641d	added atomic acquire for gfx12 on prims_simple (#1382 )	2024-10-18 11:26:38 -06:00
Wenkai Du	c8d3543d3f	Add back missing net flush (#1376 )	2024-10-15 08:12:26 -07:00
Wenkai Du	821d2e1f30	Allow zero byte sendrecv in alltoallv (#1349 ) * Allow zero byte sendrecv in alltoallv * Fix previous merge error	2024-10-11 10:40:32 -07:00
Wenkai Du	5c367a21d0	Improve model matching for GPUs with alltoall XGMI connection (#1372 )	2024-10-11 09:53:14 -07:00
Arm Patinyasakdikul	133ea201cf	Increase default number of channels for MI300A in multi-node scenario. (#1366 ) This commit changed the default of channels of MI300A from 8 upto 24. This helps bring up multi-node performance to the expected level.	2024-10-11 11:37:48 -05:00
Wenkai Du	b55b6be0cb	Fix crash when PXN is enabled on some platforms (#1369 )	2024-10-11 09:02:59 -07:00
corey-derochie-amd	c11f6b1531	Only set `minNchannels` if we are actually using MSCCL, checked using `comm->mscclCompatible`. (#1337 )	2024-10-08 10:20:55 -06:00
akolliasAMD	bc519fd733	disabled wbinvl1 for gfx9x on ll128 (#1365 )	2024-10-08 08:43:29 -06:00

821 Commits