Commit Graph

226 commits

Author SHA1 Message Date
Wenkai Du bf8eb40705 Move HDP flush to CPU 2021-02-12 18:06:19 +00:00
Wenkai Du 9cc3b56166 Fix GDRDMA read and remove unused files 2021-02-09 01:34:39 +00:00
Stanley Tsang d00b7d17bd Update MP UT to support arbitrary # of GPUs; multiple bugfixes (#16)
* Fixing temp file creation/deletion for Clique kernel mode.

* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs

* GroupCall MP UT properly quits when too many devices specified

* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
2021-02-05 16:49:25 -08:00
Wenkai Du ab1e7a0318 Merge remote-tracking branch 'origin/develop' into 2.8.3 2021-02-04 20:02:34 -05:00
gilbertlee-amd 1990ffd76a Tuning some clique-based kernel parameters (#315) 2021-02-03 20:00:08 -07:00
Wenkai Du 5f97122442 Enable GPU direct RDMA read from GPU 2021-02-03 02:48:30 +00:00
gilbertlee-amd 3e62ceddc5 Clique kernel support (#295) (#15)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2021-01-28 09:45:01 -07:00
Wenkai Du 41e47a36e7 Use less unrolling for clique kernels (#313) 2021-01-15 17:48:10 -08:00
Wenkai Du 2ddbe6646b Improve collective trace 2021-01-14 19:28:01 -05:00
Wenkai Du f4d5d3d620 Port alltoall[v] 2021-01-14 19:28:01 -05:00
Wenkai Du 105db19a11 Do not allow GPU as intermediate 2021-01-14 19:28:01 -05:00
Wenkai Du e055229e56 Revert "Changes to topology based on XGMI (#272)"
This reverts commit 01bd2573db.
2021-01-14 19:28:01 -05:00
Wenkai Du d469947641 Merge remote-tracking branch 'nccl/master' into no-target-id 2021-01-14 19:27:53 -05:00
Wenkai Du 373a108516 Fix Rome PCIe 2 node topology generation (#310) 2020-12-15 17:16:17 -08:00
Wenkai Du 975b14dffa Add Rome model and improve search (#305) 2020-11-17 14:55:06 -08:00
Sylvain Jeaugey 920dbe5b35 2.8.3-1
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using fewer GPUs than
described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using
one GPU per node.
2020-11-17 11:08:52 -08:00
Wenkai Du 554729079d Use device's link width and speed if port doesn't report (#304) 2020-11-13 17:58:04 -08:00
Stanley Tsang 2958f7eace Fixing IPC handle leak (#302) 2020-11-13 10:32:42 -07:00
gilbertlee-amd c8d08a7c2f Adding RCCL_CLIQUE_DEBUG to help debug experimental clique feature (#300) 2020-11-13 09:07:11 -07:00
Wenkai Du 4e68229c8b Skip unused peer connection in scatter and gather (#301) 2020-11-12 15:47:34 -08:00
gilbertlee-amd 41bcfb8878 Clique kernel support (#295)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2020-11-10 15:44:10 -07:00
Wenkai Du 2e8b3a0857 Use ncclSend/ncclRecv for alltoall type of collectives as default (#297) 2020-11-09 11:23:17 -08:00
Wenkai Du 709b7e4880 Improve GPU direct RDMA handling on Rome (#294) 2020-11-03 14:29:08 -08:00
Wenkai Du dfa3c41ede Add more Rome models (#292) 2020-10-30 21:26:04 -07:00
xietingwew 084207e685 fix proxyArgs for trace log 2020-10-21 09:18:40 -07:00
Wenkai Du dcad0ef7cb Fix incorrect pointer checking for scatter and gather (#285) 2020-10-19 13:27:09 -07:00
Wenkai Du c835d8263a Merge remote-tracking branch 'nccl/master' into nccl_sync 2020-10-15 18:42:38 -04:00
gilbertlee-amd 84a2541e01 Revert "Initial support for clique-based kernels (#276)" (#280)
This reverts commit 2b8184808d.
2020-10-15 11:30:18 -07:00
Sylvain Jeaugey 0e14394c5f Fix affinity move 2020-10-13 16:58:05 -07:00
Sylvain Jeaugey c6dbdb0084 Make sure proxy threads inherit the CPU affinity. 2020-10-13 16:37:52 -07:00
Wenkai Du 33babcb5e2 Update Rome single node models (#277) 2020-10-13 13:33:09 -07:00
gilbertlee-amd 2b8184808d Initial support for clique-based kernels (#276)
* Initial support for clique-based kernels
2020-10-13 11:22:04 -06:00
Wenkai Du ae008fd2db Rework Rome detection and add multiple network ports models (#274)
* Rework Rome detection and add multiple network ports models

* Remove unused opCount in p2p transport
2020-10-07 13:37:36 -07:00
Wenkai Du b871ea3c0c Add Alltoallv RCCL kernel implementation (#269)
* Add alltoallv API and implementation

* Extend Rome P2P channel limit to multinode and alltoall kernels

* topo_expl: fix compilation and sync up with main

* gtest: use RCCL alltoallv API

* Code review changes
2020-09-30 16:25:36 -07:00
Stanley Tsang acca2ae20a Updating inline asm to not require explicit L1 cache invalidation (#270) 2020-09-25 13:46:26 -06:00
gilbertlee-amd 01bd2573db Changes to topology based on XGMI (#272)
* Alterations to topology search to improve XGMI-enabled nodes
2020-09-25 12:20:09 -06:00
Wenkai Du 44fcde7835 Ensure all ranks on same send/receive or alltoall kernel path (#271) 2020-09-24 08:25:04 -07:00
Wenkai Du d871fceb54 Change network plugin name to librccl-net.so (#266) 2020-09-18 13:23:30 -07:00
Wenkai Du 42955f5f4f Limit P2P channels on Rome 2020-09-17 17:20:32 -07:00
Wenkai Du 60819dcf8d Merge pull request #262 from wenkaidu/alignment
Make data alignment requirements match the ISA manual
2020-09-08 10:40:42 -07:00
Wenkai Du e2042ccf8a Fix broken profiling build (#263) 2020-09-02 15:39:52 -07:00
Wenkai Du 4751992231 Make data alignment requirements match the ISA manual
From https://developer.amd.com/wp-content/resources/Vega_Shader_ISA.pdf

8.1.7. Alignment
For Dword or larger reads or writes, the two LSBs of the byte-address
are ignored, thus forcing Dword alignment.
2020-09-01 21:21:58 +00:00
Wenkai Du 4180e6409e Fix incorrect thread split in sendrecv (#261) 2020-08-31 17:33:22 -07:00
Wenkai Du c5cbece6d0 Increase minimum channels for gfx908 (#259) 2020-08-26 11:40:11 -07:00
Wenkai Du b0919dc46c Only use software barrier for synchronization (#258) 2020-08-25 13:16:34 -07:00
Wenkai Du 391bbf3f1e Add NPS4 support on some models (#256)
* Add NPS4 support on some models

* Add XML models
2020-08-19 11:03:20 -07:00
Wenkai Du a51e4071e3 Add another Rome model (#249)
* Add another Rome model

* Add gfx908 4P3L models and support

* Revert "Use cached value for detecting GDR support only once"

This reverts commit 67c8e72ce3.

* Skip using ibverb for GPU direct RDMA detection

* Fine tune one Rome model
2020-08-17 10:51:02 -07:00
Wenkai Du 7e3d8a31cc Collect gcnArch and hipDeviceArch_t in XML (#252) 2020-08-12 15:48:38 -07:00
Wenkai Du 066223333d Merge pull request #248 from wenkaidu/2.7.8
2.7.8
2020-08-11 08:20:37 -07:00
Wenkai Du 7e3f841fab Merge remote-tracking branch 'nccl/master' into 2.7.8 2020-08-10 16:11:00 +00:00