
101 Commits

Author SHA1 Message Date
BertanDogancay c2c9ed2acb Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 9ff53eeeae]
2024-01-30 14:43:43 -08:00
BertanDogancay 404d398bac Merge remote-tracking branch 'nccl/v2.19' into develop
[ROCm/rccl commit: 81ddf9de89]
2024-01-24 15:25:33 -08:00
Wenkai Du 8b8179a689 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case

[ROCm/rccl commit: 7e25d5bc55]
2024-01-21 19:00:50 -08:00
Bertan Dogancay 11674674fc [DEV] Configure functions in RCCL (#986)
* configure functions in rccl

[ROCm/rccl commit: 28d9b170c9]
2024-01-18 15:07:16 -07:00
Sylvain Jeaugey b16882a024 2.19.4-1
Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.


[ROCm/rccl commit: 88d44d777f]
2023-11-13 10:36:12 -08:00
Sylvain Jeaugey 69ee68b6d3 2.19.3-1
H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.


[ROCm/rccl commit: 8c6c595185]
2023-09-26 05:57:15 -07:00
Sylvain Jeaugey 506d6c332c 2.19.1-1
Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.


[ROCm/rccl commit: f9c3dc251e]
2023-09-26 05:50:33 -07:00
Wenkai Du 5983f0e371 Use relaxed atomics for LL on GFX11 (#859)
[ROCm/rccl commit: 6a0a6a37d9]
2023-08-21 16:28:39 -07:00
akolliasAMD 56129830a6 NCCL_TREES variable and rome model fixes (#856)
[ROCm/rccl commit: d33cd5a233]
2023-08-21 10:35:37 -06:00
Wenkai Du 6fdb4103b7 gfx11: don't use LL for sendrecv (#853)
* gfx11: don't use LL for sendrecv

* Use builtin instead of inline asm

[ROCm/rccl commit: f70e3e569b]
2023-08-17 08:50:51 -07:00
Bertan Dogancay 0508fa569a Implement RCCL Replayer (#817)
* Implement RCCL Replayer


[ROCm/rccl commit: 8bab4f04b7]
2023-07-24 16:26:22 -06:00
Wenkai Du f98715baea Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: abd0615351]
2023-06-26 22:51:56 +00:00
Bertan Dogancay d411d52b19 Disable Colltrace for --fast option (#778)
* Disable Colltrace for --fast option

* Limit nprocs for CI

[ROCm/rccl commit: 0c77c66221]
2023-06-21 14:16:09 -06:00
Sylvain Jeaugey 2dc2c86ec1 2.18.3-1
Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.


[ROCm/rccl commit: ea38312273]
2023-06-14 01:29:17 -07:00
Wenkai Du 18562abdb2 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 53a1f91857]
2023-04-25 15:38:32 -07:00
Sylvain Jeaugey 902ff02645 2.18.1-1
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.


[ROCm/rccl commit: d97a32fac8]
2023-04-18 03:58:25 -07:00
Wenkai Du 5d9c3b0277 Fix unit test HIP graph error (#712)
[ROCm/rccl commit: b02fd04165]
2023-03-20 15:34:09 -07:00
Sylvain Jeaugey 8dcf8e8720 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.


[ROCm/rccl commit: 5d3ab08b69]
2023-03-01 00:39:04 -08:00
Wenkai Du 1f42d485f8 Fix P2P scheduling (#690)
[ROCm/rccl commit: 86e7b71234]
2023-02-21 07:49:54 -08:00
Wenkai Du c76bc214c8 Merge remote-tracking branch 'nccl/master' into HEAD
[ROCm/rccl commit: e1cb45ff22]
2023-02-04 01:44:43 +00:00
Sylvain Jeaugey 2ce8946622 2.16.2-1
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.


[ROCm/rccl commit: 28189e2df8]
2022-11-30 02:31:59 -08:00
Wenkai Du c15a10a9d2 Move hipify to cmake stage
Add minimal ROCm/HIP version requirements for Graph support


[ROCm/rccl commit: 562dd87036]
2022-11-14 18:10:45 +00:00
Wenkai Du 4c9c1d41ee Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 9a077e6947]
2022-11-03 21:17:42 +00:00
Wenkai Du 4630365f2d Fix P2P scheduling
[ROCm/rccl commit: 72ef100050]
2022-10-31 08:54:34 -07:00
Sylvain Jeaugey 0b20e8b7e9 2.15.5-1
Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fix double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.


[ROCm/rccl commit: cb111f764a]
2022-10-25 00:55:55 -07:00
Wenkai Du 36e5e02e46 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 4f0e223db4]
2022-10-20 15:41:29 +00:00
Edgar Gabriel f2736a4fb3 introduce a hw topology aware bintree
for hayabusa architecture.


[ROCm/rccl commit: e645b02cd8]
2022-10-03 15:26:21 +00:00
Wenkai Du f6da79844a Add LL128 tuning (#630)
[ROCm/rccl commit: 021932b3c8]
2022-09-27 09:39:09 -07:00
Sylvain Jeaugey b4bac0d15a 2.15.1-1
Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.


[ROCm/rccl commit: da8152e57a]
2022-09-27 02:31:13 -07:00
Edgar Gabriel 95d6ed2154 make binary tree work on 2.13.4
[ROCm/rccl commit: 8f3219dbd4]
2022-09-15 00:01:54 +00:00
Edgar Gabriel 4c17f4dcc1 Merge branch 'develop' into 2.13.4
[ROCm/rccl commit: be935d7ce7]
2022-09-13 17:19:04 -05:00
Edgar Gabriel 7148c0aa7b add binary tree
In addition, introduce the ability to have 2 trees at the same time.
Only for allreduce at the moment.


[ROCm/rccl commit: 65e2ae20e5]
2022-09-13 20:52:32 +00:00
Wenkai Du 7874a99c75 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a79d9e3586]
2022-09-09 16:05:38 +00:00
Wenkai Du fe99249cde Enable LL128 protocol support (#605)
* Enable LL128 protocol support

* Use shared memory object directly when possible

[ROCm/rccl commit: 7bbce085cc]
2022-09-08 14:45:27 -07:00
gilbertlee-amd 616cb39a0b Adding opt-in hipGraph support for RCCL via RCCL_ENABLE_HIPGRAPH (#608)
Adding opt-in hipGraph support via RCCL_ENABLE_HIPGRAPH

[ROCm/rccl commit: 47b2fc3a30]
2022-09-06 10:29:46 -06:00
Wenkai Du f18868f439 Use hipExtLaunchKernel when not using graph and not in group mode (#606)
[ROCm/rccl commit: c9f2fe1f65]
2022-08-26 13:40:37 -07:00
akolliasAMD 1d55fe756c Simple tree changes (#599)
changed treebase to create basic balanced tree

[ROCm/rccl commit: 3c1b1ec8c8]
2022-08-19 13:51:49 -06:00
Sylvain Jeaugey f6e1e3d9ed 2.14.3-1
Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.


[ROCm/rccl commit: c4e2aa6c79]
2022-08-18 02:53:17 -07:00
akolliasAMD 6fb5c5d5e3 minor latency tuning (#591)
* minor tuning for tree ll

[ROCm/rccl commit: 4cecdc9be5]
2022-08-03 15:07:44 -06:00
Sylvain Jeaugey 91154e8df9 2.13.4-1
Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.


[ROCm/rccl commit: 19ab67d172]
2022-07-11 08:10:34 -07:00
Wenkai Du 2f4aea93e0 Fix GPU to NIC mapping in tree (#573)
* Fix GPU to NIC mapping in tree

* Update tuning table

[ROCm/rccl commit: 00af1f64e9]
2022-07-03 20:52:52 -07:00
Wenkai Du 11a6cdd52f Fix P2P scheduling (#560)
[ROCm/rccl commit: 5cb2aca3d9]
2022-06-06 13:32:28 -07:00
Aristotelis 0b55e01ef3 Merge remote-tracking branch 'ncclRepo/master' into develop
[ROCm/rccl commit: e0864e7093]
2022-06-02 15:27:24 +00:00
akolliasAMD 22dc8bd246 Added creation of new tree and added switch for using treesplit for specific cases (#551)
[ROCm/rccl commit: 98f0809a39]
2022-05-25 18:55:14 -04:00
Sylvain Jeaugey 1c5734046d 2.12.12-1
Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.


[ROCm/rccl commit: 7aa1c46fd5]
2022-05-13 00:26:57 -07:00
Wenkai Du 347ea354c2 Update tuning parameters
[ROCm/rccl commit: 83fd4f70e7]
2022-04-18 16:04:04 -07:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Sylvain Jeaugey 27130280b2 2.12.10-1
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.


[ROCm/rccl commit: 353e8ba446]
2022-03-30 02:27:01 -07:00
Wenkai Du 828f3d11a0 Update tuning parameters (#518)
* Update tuning parameters

* Respect user algo and topo selections

[ROCm/rccl commit: 7cbbca4da1]
2022-03-29 08:15:37 -07:00
Sylvain Jeaugey f8886d8687 2.12.7-1
Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.


[ROCm/rccl commit: 3c223c105a]
2022-03-02 20:48:56 +01:00