Commit Graph

293 Commits

Author SHA1 Message Date
Wenkai Du 60ca7484c0 Fix kernel data trace 2021-08-24 14:02:53 -07:00
Wenkai Du d5f93649ff Merge remote-tracking branch 'origin/develop' into 2.10.3 2021-08-24 09:49:47 -07:00
Wenkai Du 5c8380ff5b Implement NIC identification and remapping (#420)
* Add 1H16P GPU model

* Implement NIC identification and remapping

* Revert "Sort IB devices based on device name (#413)"

This reverts commit 2d0ed8dff6.

* Fix permute and check order

* Correction on IB speed reporting

* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"

This reverts commit caf5c9992a.
2021-08-24 09:42:04 -07:00
Wenkai Du 5f15ed6e3e Add gfx908 VM model (#418) 2021-08-10 08:55:11 -07:00
Wenkai Du 707c687090 Use noinline for kernel functions 2021-08-06 09:15:04 -07:00
Wenkai Du 01d3b20a66 Fix incorrect network proxy received bytes reporting 2021-08-05 17:45:48 -07:00
Wenkai Du babbd1047b Merge branch 'develop' into 2.10.3 2021-08-04 09:45:22 -07:00
Wenkai Du 2d0ed8dff6 Sort IB devices based on device name (#413) 2021-08-03 15:32:41 -07:00
Wenkai Du bf2339f93e Merge remote-tracking branch 'nccl/master' into 2.10.3 2021-07-30 16:23:14 -07:00
Wenkai Du 3e27227562 XGMI connection is always prioritized over NET regardless of hops (#412) 2021-07-29 11:12:42 -07:00
Wenkai Du 818cdb16a8 Query XGMI links from xml and adjust gfx906 channel usage (#410) 2021-07-27 17:32:41 -07:00
Wenkai Du 135d47d125 topo_expl: fix build after switching to rocm-smi-lib (#405)
* topo_expl: fix build after switching to rocm-smi-lib

* Use minimal of 4 channels for gfx908
2021-07-27 08:30:08 -07:00
Wenkai Du dfc62d5fbb Skipping unnecessary functions in Doxygen by marking as internal (#353) (#406)
(cherry picked from commit 1c982d819d9c7fe0310b80f9a25808e54c71137e)

Co-authored-by: saadrahim <44449863+saadrahim@users.noreply.github.com>
2021-07-24 11:04:27 -07:00
Lu bd6dbca8fb Add more info to RCCL logging for topo-aware optim. 2021-07-22 09:52:39 -07:00
Ke Wen 7e51592129 2.10.3-1
Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
2021-07-08 14:30:14 -07:00
Wenkai Du 56155ff5b6 Use rocm_smi_lib for getting topology information (#402)
* Use rocm_smi_lib for getting topology information

* Add rocm-smi-lib dependency to RCCL package
2021-07-08 13:23:11 -07:00
Wenkai Du fa6d7e9a63 Fixes for NCCL_MAX_NCHANNELS and topo_expl (#398) 2021-06-22 08:41:49 -07:00
Wenkai Du 6dcae8a459 Select sendrecv path based on collective data size (#391)
* Select sendrecv path based on collective data size

* Add comments on packing and unpacking group field

* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests
2021-06-10 17:51:04 -07:00
Wenkai Du b815a2800f Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet

* Use separate operation counters for coll and p2p
2021-06-09 13:24:26 -07:00
Wenkai Du c2064adcc7 Add support for another Rome model (#385) 2021-06-08 13:58:20 -07:00
Wenkai Du a3a8c2d56b Allow intranode use of network connection (#383)
* Allow intranode use of network connection

* Checking for graph for null pointer
2021-06-08 07:37:59 -07:00
Wenkai Du 961922ea02 Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP

* Extend number of NICs to 16
2021-06-03 19:45:18 -07:00
Wenkai Du e3abf1c2ec Merge remote-tracking branch 'nccl/master' into develop 2021-05-25 20:52:15 -07:00
Wenkai Du 4c83adb75c Update Rome models matching (#376) 2021-05-25 10:12:40 -07:00
gilbertlee-amd 8e817ecd6d Tweak clique channel usage for gfx908 (#374) 2021-05-21 15:36:21 -06:00
Wenkai Du 50da1b48af Correction on max number of groups (#373) 2021-05-20 08:58:45 -07:00
Wenkai Du 8cde34be51 Use fixed segment size for sendrecv (#369) 2021-05-19 08:25:26 -07:00
gilbertlee-amd ddceadc313 Tune clique-based AllReduce for device type 908 (#372)
* Changing switch-over point for clique-based AllReduce
2021-05-18 15:36:07 -06:00
Wenkai Du 87727383fe Merge remote-tracking branch 'nccl/master' into 2.9.8 2021-05-17 10:15:16 -07:00
Stanley Tsang 0b2bfdd6d8 Multiprocess unit test various fixes (#367)
* Re-enabling mp unit tests

* Fixing shared memory leak and other bugs related to shared mem for MP unit tests

* Revert 43bfbfc97bf9edbae1f386d461439091618ff8ed

* Further tightening up unlinks

* Moving test check macros to separate header file

* Tightening up shared memory unlinking for clique kernels, add munmap for host barrier for MP unit tests

* Updating new MP unit test

* Fixing mqueue bug

* Fixing memory leak in MP unit tests
2021-05-14 09:38:49 -06:00
Wenkai Du caf5c9992a Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)
To skip Infiniband, set RCCL_IB_HCA_SKIP_LINK_LAYER=1.
To skip Ethernet, RCCL_IB_HCA_SKIP_LINK_LAYER=2.
2021-05-12 14:14:53 -07:00
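The two values described in commit caf5c9992a can be sketched as a small helper that prepares an environment for a launched process. This is only an illustration: the helper name and the mapping keys are hypothetical, and only the variable name and the values 1 and 2 come from the commit message. A later commit in this log (5c8380ff5b) reverts this change, so the variable may not be honored in later builds.

```python
# Hypothetical helper; only RCCL_IB_HCA_SKIP_LINK_LAYER and the values 1/2
# come from the commit message above.
SKIP_LINK_LAYER = {
    "infiniband": "1",  # per the commit message: 1 skips Infiniband
    "ethernet": "2",    # per the commit message: 2 skips Ethernet
}


def skip_link_layer_env(layer, base=None):
    """Return a copy of `base` with the skip variable set for `layer`."""
    env = dict(base or {})
    env["RCCL_IB_HCA_SKIP_LINK_LAYER"] = SKIP_LINK_LAYER[layer.lower()]
    return env


if __name__ == "__main__":
    print(skip_link_layer_env("Ethernet"))
```

The returned dictionary could be passed as the `env` argument of `subprocess.run` when starting an application; merging with `os.environ` first is left to the caller.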
Sylvain Jeaugey 3fec2fa5ee 2.9.9-1
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)
2021-05-12 11:09:31 -07:00
gilbertlee-amd e796b1645c Clique tuning upgrade (#352) (#19)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu
2021-05-11 08:44:59 -06:00
Sylvain Jeaugey ca8485b0d0 2.9.8-1
Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.
2021-05-10 14:00:03 -07:00
Wenkai Du a4ea1fed5b Merge remote-tracking branch 'nccl/master' into develop 2021-05-05 16:01:01 -07:00
gilbertlee-amd 4f8e788a61 Fixing potential race-condition in env var parameter macro (#359) 2021-04-28 12:04:41 -06:00
Wenkai Du ed237dcaa7 Use better name for kernel collective trace enable (#357)
"NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL" enables collectives API
trace. Adding "RCCL_KERNEL_COLL_TRACE_ENABLE=1" enables kernel traces.
2021-04-26 08:35:53 -07:00
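The settings named in commit ed237dcaa7 can be collected in one place before launching a job. The following is a minimal sketch under stated assumptions: the function name and its parameters are hypothetical, and only the three variable names and their values are taken from the commit message.

```python
import os


def trace_env(kernel_trace=False, base=None):
    """Return a copy of the environment with the trace variables set.

    Hypothetical helper: the variable names and values come from the
    commit message; everything else is illustrative.
    """
    env = dict(os.environ if base is None else base)
    # Collectives API trace, as quoted in the commit message.
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"
    # Kernel traces are enabled separately.
    if kernel_trace:
        env["RCCL_KERNEL_COLL_TRACE_ENABLE"] = "1"
    return env


if __name__ == "__main__":
    env = trace_env(kernel_trace=True, base={})
    for name in sorted(env):
        print(f"{name}={env[name]}")
```

Keeping the kernel trace behind its own flag mirrors the separation introduced in commit 9cc9c3360b, where kernel-side tracing is controlled independently of the API trace.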
Wenkai Du 9cc9c3360b Control collective trace from kernel separately (#356) 2021-04-23 16:36:19 -07:00
Stanley Tsang 70597789d0 Message queue refactor to POSIX implementation and leak fix (#355)
* Fixing message queue leak.

* Using POSIX implementation of Message Queues

* Adding unlink to msgqueue

* MsgQueue update

* Adding timeout check to msgqueue broadcast; tightening up system checks

* Removing unnecessary code

* Removing extra argument from print

* Adding explicit msg queue close call to all other ranks
2021-04-23 11:33:20 -06:00
Wenkai Du 415c7cd3d1 Tune number of channels for gfx90a (#349) 2021-04-19 15:27:01 -07:00
Wenkai Du 9c718ce6d6 Use correct WARP_SIZE for gfx1030 (#348) 2021-04-14 14:09:52 -07:00
Wenkai Du a79f74082e Limit max channels for ring graph on single node Rome (#347)
* Limit max channels for ring graph on single node Rome
* Partially revert "Use non-temporal access for streaming data (#341)"
2021-04-14 10:14:54 -07:00
Wenkai Du 1fe031402a Add gfx90a target (#344)
* Add gfx90a target

* Support gfx90a topology

Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>
2021-04-14 09:29:00 -06:00
Sylvain Jeaugey a46ea10583 2.9.6-1
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.
2021-04-12 16:00:46 -07:00
TomSang 87f12cbb86 Add detection of cooperative multi device launch attribute (#345) 2021-04-11 13:29:24 -07:00
Wenkai Du 9dfc2c183e Use non-temporal access for streaming data (#341)
* Use non-temporal access for streaming data

* Revert to ulong2 after fixing compiling issue
2021-04-07 17:34:35 -07:00
gilbertlee-amd caba0a63d2 Fixing clique-topology detection (#342)
* Fixing clique-topology detection
* Fix to enable multi-process clique-based kernels
2021-04-07 11:29:44 -06:00
Wenkai Du e26ad2995e Cleanup number of channels calculation (#340) 2021-04-05 17:51:56 -07:00
Wenkai Du 17491c918e Fix incorrect net counting (#339)
* Fix incorrect net counting

* Add comments
2021-04-05 12:21:57 -07:00
Wenkai Du 1d2946ee4b Rework network port trimming code (#338)
* Rework network port trimming code

* Move Rome related changes to separate source files
2021-03-31 10:25:59 -07:00