Commit Graph

197 Commits

Author SHA1 Message Date
corey-derochie-amd 6dc47eecd7 Integrated RCCL with MSCCL++ for small message sizes (#1231) 2024-07-12 15:32:58 -06:00
Rahul Vaidya c755b9cf93 Improved version reporting in NCCL_DEBUG=VERSION (#1232)
* Improved version reporting in NCCL_DEBUG=VERSION.

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Version reporting changes

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Versioning changes: Initialized char arrays to null and fixed typo.

---------

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
2024-07-12 08:14:29 -05:00
corey-derochie-amd 0c36d571ea Enable multi-threading for MSCCL (#1203)
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
2024-07-04 09:34:38 -06:00
Wenkai Du 5d7078e383 Fix DMABUF support (#1218)
* Fix DMABUF support

* Reduce log output by moving dmabuf allocation details to TRACE

* Enable peer memory GDR support if ib_umem_get_peer is in kernel
2024-06-25 08:00:15 -07:00
Nilesh M Negi d9661c17e6 Fix min_nchannels bug for gfx94* nranks=4 (#1202)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-06-07 14:31:28 -05:00
Nusrat Islam 0634c5c8e1 doubling debug buffer size with increased channels 2024-06-03 13:05:05 -05:00
Nusrat Islam 506f16c506 add 256 channels support 2024-06-03 13:03:18 -05:00
ClementLinCF cab25f919e Optimize NCHANNELS and MSCCL config for gfx942 80CUs (#1195)
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs

Set appropriately for different NCCL_MIN_NCHANNELS and MSCCL config,
potentially improving communication perf on the MI300x 80CUs

* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml

* Change the factor of gfx94 and update msccl config
2024-06-01 07:07:46 -07:00
AtlantaPepsi 67246649ac prevent segfault from npkit-enabled rccl build
Signed-off-by: AtlantaPepsi <timhu102@amd.com>
2024-04-26 10:54:27 -05:00
Wenkai Du 9e0c9b4ed8 Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154) 2024-04-25 07:19:18 -07:00
BertanDogancay e1a835910e Merge remote-tracking branch 'nccl/master' into develop 2024-04-23 13:34:00 -07:00
gilbertlee-amd 4cb62f999a Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
2024-04-15 12:03:57 -06:00
Wenkai Du 137571fa01 Fix buffer overflow when parsing kernel cmdline (#1133) 2024-04-08 11:12:20 -07:00
Wenkai Du 5976f757dd Remove hipEventDisableSystemFence (#1122)
There is no indication that disabling system fence has any latency improvement.
Removing it per recommendation from HIP.
2024-03-25 08:01:57 -07:00
Sylvain Jeaugey 48bb7fec79 2.20.5-1
Fix UDS connection failure when using ncclCommSplit. Issue #1185
2024-02-26 02:52:39 -08:00
Wenkai Du c5ab37211b Update RCCL/MSCCL work FIFO depth to 256K (#1091) 2024-02-21 17:15:11 -08:00
Sylvain Jeaugey b6475625fb 2.20.3-1
Add support for alternating rings, allow for cross-nic rings without
cross-rail communication.
Add support for user buffer registration for network send/recv.
Optimize aggregated operations to better utilize all channels.
Add flattening for BCM PCI gen5 switches.
Add support for inter-node NVLink communication
Add support for port fusion in NET/IB.
Add support for ReduceScatter and AllGather using Collnet.
Update net API to v8.
Fix hang during A2A connection.
2024-02-13 04:22:38 -08:00
BertanDogancay 00fdb1ef51 Clean up 2024-01-31 17:27:15 -08:00
Wenkai Du 1a134b283b Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-01-31 11:53:10 -06:00
BertanDogancay 9ff53eeeae Merge remote-tracking branch 'nccl/master' into develop 2024-01-30 14:43:43 -08:00
Wenkai Du be8ef4367f colltrace: fix dropped trace messages (#1059)
* colltrace: fix dropped trace messages

* Remove extra space
2024-01-25 13:31:53 -08:00
BertanDogancay 81ddf9de89 Merge remote-tracking branch 'nccl/v2.19' into develop 2024-01-24 15:25:33 -08:00
Bertan Dogancay c4dbf8a914 Fix collective trace when rccl is configured (#1056)
* Fix collective trace when rccl is configured
2024-01-22 09:26:44 -07:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Pedram Alizadeh aa5c84c997 Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
Sync to nccl 2.18.6
2024-01-09 13:29:29 -05:00
Wenkai Du f7e39fced2 Doubling buffer size to fix NCCL INFO corruption with increased channels (#1035) 2024-01-08 08:14:33 -08:00
Wenkai Du e5bf56c6d8 Increase stack size for gfx906 (#1034)
Occasionally "Memory access fault by GPU node-8 (Agent handle: 0x23a5640) on address 0x7f461ec00000. Reason: Page not present or supervisor privilege" can be seen from gfx906 CI
2024-01-07 20:25:02 -08:00
PedramAlizadeh 0d515f9388 resolved conflicts, fixed the localNetCount/0 bug 2023-12-18 08:11:34 +00:00
Ziyue Yang 655742a3a6 Fully disable MSCCL when machine is not matched (#1017)
* Disable MSCCL algorithm meta loading when machine is not matched

* fully disable init

* fix potential segfault
2023-12-13 08:36:21 -08:00
Nilesh M Negi bc44e3faa7 Fix gcnArch bug in IFC mix build (#998) (#1002)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-12-04 16:20:22 -06:00
Bertan Dogancay 7c0f49a878 IFC mix build (#998) 2023-12-02 18:49:52 -07:00
Wenkai Du bc8661f092 Fix kernel command line warnings (#961)
* Fix kernel command line warnings

* Remove while loop
2023-11-15 18:01:12 -08:00
Sylvain Jeaugey 88d44d777f 2.19.4-1
Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.
2023-11-13 10:36:12 -08:00
Nilesh M Negi 96ec3ffe2e SRC/INIT: fix typo for ENABLE_PROFILING (#934)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 23:52:46 -05:00
akolliasAMD 28d7fe5629 Dma buf support optin (#905)
* dmaBufSupport Optin added on every part of the code that should invoke it
2023-10-03 03:17:48 -06:00
Sylvain Jeaugey 8c6c595185 2.19.3-1
H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.
2023-09-26 05:57:15 -07:00
Sylvain Jeaugey 3435178b6c Merge remote-tracking branch 'origin/master' into v2.19 2023-09-26 05:55:56 -07:00
Sylvain Jeaugey f9c3dc251e 2.19.1-1
Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.
2023-09-26 05:50:33 -07:00
Kaiming Ouyang 4365458757 Fix cudaMemcpyAsync bug
We are trying to use the copy result of first cudaMemcpyAsync in the
second cudaMemcpyAsync without sync in between. This patch fixes it
by allocating a CPU side array to cache device side addr so that we
can avoid this consecutive cuda mem copy.

Fixes #957
2023-09-20 05:51:14 -07:00
Audrey MP e58ec78d35 Gcn arch name (#886)
We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
Andy li e1dc4d5e42 enable hip graph on multi-node (#884)
* initial checkin

* enable msccl when hip graph is on

* remove the commented out code of msccl enable check

* clean up the code

* remove the msccl HighestTransportType check logic
2023-09-11 15:30:04 -07:00
akolliasAMD d33cd5a233 NCCL_TREES variable and rome model fixes (#856) 2023-08-21 10:35:37 -06:00
Wenkai Du d65c0830c6 Detect HIP_UNCACHED_MEMORY support from HIP version (#842) 2023-08-04 10:17:04 -07:00
Wenkai Du c8085eb704 Improve collective trace (#835) 2023-08-03 07:16:12 -07:00
Wenkai Du a7fcd58a97 Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)
2023-07-21 07:31:27 -07:00
Wenkai Du abd0615351 Merge remote-tracking branch 'nccl/master' into develop 2023-06-26 22:51:56 +00:00
Sylvain Jeaugey ea38312273 2.18.3-1
Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.
2023-06-14 01:29:17 -07:00
Ziyue Yang 7d6e7bcd7d revert npkit (#748) 2023-05-24 07:41:05 -07:00
Wenkai Du 8bb3340fcb Skip checking of some settings in Cray OS (#739) 2023-05-09 07:59:56 -07:00
Wenkai Du 897745a266 Remove references to NVLS functions 2023-05-05 07:55:20 -07:00