rocm-systems

Author	SHA1	Message	Date
Arm Patinyasakdikul	29f87c7191	Increased maximum number of XML nodes to support CPX mode. (#1386 )	2024-10-23 11:15:11 -05:00
Wenkai Du	e0780ba4d4	Fix topology discovery in container with subset of GPUs (#1384 ) * Fix topology discovery in container with subset of GPUs * Move links counting out of loop	2024-10-22 13:50:23 -07:00
Bertan Dogancay	373f113524	Dynamically select unroll factor to build for when targeting local arch (#1371 ) * Dynamically select unroll factor to build for when targeting local arch only	2024-10-21 10:53:11 -04:00
Wenkai Du	7c077db307	Increase CQ size to 3*MAX_REQUESTS (#1374 ) Increase CQ size to 3*MAX_REQUESTS Suggested by Rukhsana Ansari <rukhsana.ansari@broadcom.com> Reword comments based on feedback from Rukhsana	2024-10-18 11:01:03 -07:00
akolliasAMD	af5678641d	added atomic acquire for gfx12 on prims_simple (#1382 )	2024-10-18 11:26:38 -06:00
Wenkai Du	c8d3543d3f	Add back missing net flush (#1376 )	2024-10-15 08:12:26 -07:00
Wenkai Du	821d2e1f30	Allow zero byte sendrecv in alltoallv (#1349 ) * Allow zero byte sendrecv in alltoallv * Fix previous merge error	2024-10-11 10:40:32 -07:00
Wenkai Du	5c367a21d0	Improve model matching for GPUs with alltoall XGMI connection (#1372 )	2024-10-11 09:53:14 -07:00
Arm Patinyasakdikul	133ea201cf	Increase default number of channels for MI300A in multi-node scenario. (#1366 ) This commit changed the default number of channels for MI300A from 8 up to 24. This helps bring multi-node performance up to the expected level.	2024-10-11 11:37:48 -05:00
Wenkai Du	b55b6be0cb	Fix crash when PXN is enabled on some platforms (#1369 )	2024-10-11 09:02:59 -07:00
corey-derochie-amd	c11f6b1531	Only set `minNchannels` if we are actually using MSCCL, checked using `comm->mscclCompatible`. (#1337 )	2024-10-08 10:20:55 -06:00
akolliasAMD	bc519fd733	disabled wbinvl1 for gfx9x on ll128 (#1365 )	2024-10-08 08:43:29 -06:00
Nilesh M Negi	8ad76f8d10	[TRANSPORT] Add RCCL_FORCE_ENABLE_GDRDMA for debugging (#1356 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-10-06 18:43:49 -05:00
Bertan Dogancay	2dd10c8f17	[BUILD] Move code generation to python from CMake (#1360 ) * Use generate.py for func generation * Convert AddUnroll.cmake to bash	2024-10-03 10:21:19 -04:00
BertanDogancay	84081064a0	Merge remote-tracking branch 'nccl/master' into develop	2024-10-02 09:31:25 -05:00
Wenkai Du	e453f1ced9	Add another Rome model (#1354 )	2024-10-01 17:41:27 -05:00
Ziyue Yang	7830af5844	Fix size matching in MSCCL (#1318 )	2024-10-01 13:32:41 -07:00
Mustafa Abduljabbar	03a3ef3c34	MSCCL Multithreaded regression root cause fix (#1347 ) * Make sure the target device is used for MSCCL * Enable single process mode by default to use MSCCL in MT * Create a per-rank state when GPUs share a thread	2024-09-25 15:24:25 -04:00
Nilesh M Negi	105ff1611f	[TRANSPORT] GDRDMA enablement for linux kernel 6.4.0 or newer (#1328 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-09-25 11:29:52 -05:00
Mustafa Abduljabbar	2fe1e9f7db	Fix MSCCLPP seg-fault when RCCL_MSCCL_ENABLE_SINGLE_PROCESS is enabled (#1338 ) Removing unnecessary changes. rename unique hosts function Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> use updated function name Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com> Missed one instance of `mscclIsMultithreadedComm`.	2024-09-20 11:22:05 -05:00
corey-derochie-amd	853a0586b4	Moved `mscclpp_ncclGetUniqueId` call into `ncclCommInitRankFunc` (#1332 ) * Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid. * Checking `mscclEnabled` for the process and the topology to gate MSCCL++. * Allowed `mscclForceEnable` to enable MSCCL++.	2024-09-16 16:41:40 -06:00
corey-derochie-amd	736a705875	Re-enabled MSCCL++ (#1325 ) * Added restrictions around calling MSCCL++ collectives (#1281) * Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather. * Renamed and refactored some mscclpp types. * Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging. * Disable MSCCL++ when using managed memory buffers as it isn't supported. * Added datatype and op constraints for MSCCL++ AllReduce. * Added documentation on MSCCL++ restrictions to the README. * [BUILD] Support custom CMake flags in MSCCLPP (#1275) * [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [BUILD] CMake flags to support build-id in MSCCLPP Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [BUILD] Fix CMake warnings in MSCCLPP build Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them. --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> Co-authored-by: Corey Derochie <corey.derochie@amd.com> * Link to libmscclpp_nccl statically (#1282) * Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions. * Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled. * `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt. * Removed IBVerbs dependency for integrating with MSCCL++ (#1313) * Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. 
Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294) * Include mscclpp as a git submodule (#1314) * Added the desired mscclpp commit as a git submodule. * Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively. * Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule. * Enabled MSCCL++ feature build. --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>	2024-09-11 09:55:16 -06:00
mberenjk	4ceb672179	replacing nccl/cuda related part of the api_trace.h with rccl/hip (#1326 ) Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2024-09-10 11:05:14 -05:00
corey-derochie-amd	e056fe8f7e	Disable MSCCL for the non-multi-process case by default (#1307 ) * Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread. * Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.	2024-09-04 11:11:50 -06:00
Nusrat Islam	833435be18	graph: fix for MI300X 64 GPU case (#1308 ) PR #1290 introduced a failure for 64 GPU case on MI300X. This PR fixes the failure.	2024-08-26 18:37:58 -05:00
Wenkai Du	532b70afb6	Add new Rome model (#1304 ) * Add another rome model and override * Fix bug * Fix typo * Add ring * Update ring * Fix model matching * Clean up * Clean up * Reverse rings for NCCL_RINGS input * Only reverse NCCL_RINGS for ring graph * Fix mapping issue when using NCCL_RINGS * Add NCCL_RINGS_REMAP to handle inconsistent net names	2024-08-23 08:45:43 +08:00
mberenjk	db840f024e	adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297 ) * adding all nccl apis to api_support to enable rccl tracing by rocprofv3 Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com> Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>	2024-08-22 12:36:07 -05:00
Wenkai Du	d3171b51b7	Fix gfx940 CPX mode (#1290 )	2024-08-16 08:46:06 +08:00
Wenkai Du	eff56735b0	Fix model matching with PXN enable (#1295 )	2024-08-16 06:16:00 +08:00
akolliasAMD	d6c317d6ae	removed hcc mentions (#1291 )	2024-08-14 15:04:13 -06:00
Pedram Alizadeh	a25ca9bb90	adding new tuning table for very large number of nodes (#1288 )	2024-08-09 10:47:42 -04:00
Tim	4200964202	Adding core binding in info (#1212 ) Signed-off-by: AtlantaPepsi <timhu102@amd.com>	2024-08-08 11:36:24 -04:00
Richard Barnes	d09b152aa0	Remove unused but set variable from all_reduce.h (#1258 ) Allows `-Wunused-but-set-variable` to pass	2024-07-29 08:11:24 -07:00
Richard Barnes	86a4ad6e8b	Remove unused but set variable from prims_ll128.h (#1257 ) Allows `-Wunused-but-set-variable` to pass	2024-07-29 08:11:01 -07:00
Richard Barnes	7ad432ee23	Remove unused but set variable from prims_ll.h (#1256 ) Allows `-Wunused-but-set-variable` to pass	2024-07-29 08:10:38 -07:00
akolliasAMD	c246e25f8e	gfx12 Disable ll protocol (#1268 )	2024-07-26 08:59:55 -06:00
corey-derochie-amd	69135976d6	Fix bug where the first collective call was using MSCCL instead of MSCCL++ (#1260 )	2024-07-22 15:46:47 -06:00
saurabhAMD	cf311b71ee	Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing (#1265 ) * Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing * Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing	2024-07-22 10:21:29 -05:00
corey-derochie-amd	b31b4082dd	Only initialize MSCCL++ when runtime-enabled. (#1266 )	2024-07-22 00:41:31 -06:00
Nusrat Islam	6f331b0d43	Enable CPX mode for MI300X (#1259 ) * graph: enable cpx mode for MI300X * graph: tune limits for cpx and cleanup	2024-07-19 11:30:37 -05:00
Wenkai Du	89349f2ce4	Template unroll for RCCL kernels (#1250 ) * Template unroll for RCCL kernels * Adding unroll template arg during CMake hipification * Reduce linking parallel jobs to avoid OOM in CI * Workaround issues with UT tests SWDEV-469533: register spill fix is needed for mainline build LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs Use -parallel-jobs=8 for linking * CI: do not use -j 16 when building * CI: use -j 8 when building * Only reduce parallel linking job for CI extended * Restore original jenkins command. Change parallel linking jobs in cmake * Disable MSCCLPP --------- Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>	2024-07-19 08:15:59 -07:00
Nilesh M Negi	a1ef217b32	Consistent channel shuffling for MI300X multi-node (#1255 ) * Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)" This reverts commit `5be3b713ef`. * Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)" This reverts commit `ad31d93f3d`.	2024-07-18 10:18:09 -05:00
Nilesh M Negi	67e867271f	[GRAPH] Disable MSCCL override of no. of channels (#1187 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-07-15 10:45:21 -05:00
corey-derochie-amd	9cbb3da224	Only enable MSCCL++ AllReduce for message sizes that are multiples of 32 (#1253 ) * Only enable MSCCL++ AllReduce for message sizes that are multiples of 32. MSCCL++ does not handle these other sizes. * Sanitized MSCCL++ logging.	2024-07-12 17:04:23 -07:00
corey-derochie-amd	6dc47eecd7	Integrated RCCL with MSCCL++ for small message sizes (#1231 )	2024-07-12 15:32:58 -06:00
Rahul Vaidya	c755b9cf93	Improved version reporting in NCCL_DEBUG=VERSION (#1232 ) * Improved version reporting in NCCL_DEBUG=VERSION. Signed-off-by: rahulvaidya20 <ravaidya@amd.com> * Version reporting changes Signed-off-by: rahulvaidya20 <ravaidya@amd.com> * Versioning changes: Initialized char arrays to null and fixed typo. --------- Signed-off-by: rahulvaidya20 <ravaidya@amd.com>	2024-07-12 08:14:29 -05:00
akolliasAMD	63e4d76e23	gfx12 initial enablement (#1219 )	2024-07-10 13:32:09 -06:00
corey-derochie-amd	0c36d571ea	Enable multi-threading for MSCCL (#1203 ) MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.	2024-07-04 09:34:38 -06:00
Wenkai Du	45f3fbc52f	Checking kernel header files only when missing sysfs entry (#1239 )	2024-07-03 15:53:15 -07:00
Nilesh M Negi	5be3b713ef	[GRAPH] Use channel shuffling only for IB systems (#1228 ) * [GRAPH] Use channel shuffling only for IB systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [GRAPH] Define channels=48 for gfx94 RoCE systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> * [GRAPH] Increase channels for RoCE gfx94 systems Signed-off-by: nileshnegi <Nilesh.Negi@amd.com> --------- Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2024-07-02 12:20:40 -05:00

782 Commits