Commit graph

555 commits

Author SHA1 Message Date
Wenkai Du 1670bddea0 Remove hard coded /opt/rocm from cmake (#396)
[ROCm/rccl commit: 59d2867b01]
2021-06-21 08:29:23 -07:00
gilbertlee-amd 01a8efbb76 [TransferBench] Adding ability to specify source data pattern (#394)
* [TransferBench] Adding ability to specify source data pattern

[ROCm/rccl commit: ff413be933]
2021-06-15 08:41:57 -06:00
Eiden Yoshida dbb867942d Move address-sanitizer build above addition of rccl library in CMakeLists (#392)
[ROCm/rccl commit: fb267ea333]
2021-06-11 14:43:54 -06:00
Wenkai Du f82f99f533 Select sendrecv path based on collective data size (#391)
* Select sendrecv path based on collective data size

* Add comments on packing and unpacking group field

* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests

[ROCm/rccl commit: 6dcae8a459]
2021-06-10 17:51:04 -07:00
Stanley Tsang dd98f1762a Fixing bug with ExtractSubDataset function not fully initializing subdataset (#390)
[ROCm/rccl commit: f6f5e16fe6]
2021-06-10 14:35:39 -06:00
Eiden Yoshida a24e180296 Add address sanitizer build option (#389)
[ROCm/rccl commit: eea7b24058]
2021-06-10 09:14:54 -06:00
Wenkai Du 5bebcb0015 Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet

* Use separate operation counters for coll and p2p

[ROCm/rccl commit: b815a2800f]
2021-06-09 13:24:26 -07:00
Wenkai Du ab9b9151d2 Add support for another Rome model (#385)
[ROCm/rccl commit: c2064adcc7]
2021-06-08 13:58:20 -07:00
Stanley Tsang 52ffd67cd6 Updating changelog to show install script fix (#384)
[ROCm/rccl commit: 6842429a14]
2021-06-08 13:00:40 -06:00
Wenkai Du c8a432dc25 Allow intranode use of network connection (#383)
* Allow intranode use of network connection

* Checking graph for null pointer

[ROCm/rccl commit: a3a8c2d56b]
2021-06-08 07:37:59 -07:00
Stanley Tsang b1f41247a2 Fixing install script so that invoking -r alone does not trigger rebuild (#382)
[ROCm/rccl commit: 820a53287f]
2021-06-04 09:46:04 -06:00
Wenkai Du 4b31e521e9 Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP

* Extend number of NICs to 16

[ROCm/rccl commit: 961922ea02]
2021-06-03 19:45:18 -07:00
gilbertlee-amd fd94c55afe ROCm 4.3 changelog update (#379)
* Update CHANGELOG.md (#378)

* Updating CHANGELOG.md for ROCm 4.3

[ROCm/rccl commit: 903c84050d]
2021-06-03 10:56:02 -06:00
Wenkai Du cdf2780687 topo_expl: update to 2.9.9
[ROCm/rccl commit: 13dc80ee14]
2021-05-26 09:24:34 -07:00
Wenkai Du b154b532a2 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: e3abf1c2ec]
2021-05-25 20:52:15 -07:00
Stanley Tsang b1a6561fee Adding support for hipMallocManaged() in unit tests (#375)
* Adding HMM support for unit tests

* Fixing HMM opt-in check

[ROCm/rccl commit: 256403d4f0]
2021-05-25 17:07:12 -06:00
Wenkai Du aa95cc6102 Update Rome models matching (#376)
[ROCm/rccl commit: 4c83adb75c]
2021-05-25 10:12:40 -07:00
gilbertlee-amd ea17b05518 Tweak clique channel usage for gfx908 (#374)
[ROCm/rccl commit: 8e817ecd6d]
2021-05-21 15:36:21 -06:00
Wenkai Du 92bcdcf5b0 Correction on max number of groups (#373)
[ROCm/rccl commit: 50da1b48af]
2021-05-20 08:58:45 -07:00
Wenkai Du b27490d38d Use fixed segment size for sendrecv (#369)
[ROCm/rccl commit: 8cde34be51]
2021-05-19 08:25:26 -07:00
Wenkai Du 83d309354e Running only sum for CI quick test (#370)
[ROCm/rccl commit: 42b080867e]
2021-05-19 08:25:13 -07:00
gilbertlee-amd 87423792fd Tune clique-based AllReduce for device type 908 (#372)
* Changing switch-over point for clique-based AllReduce

[ROCm/rccl commit: ddceadc313]
2021-05-18 15:36:07 -06:00
gilbertlee-amd 08addb85a2 Disabling env var caching for all unit tests (#371)
* Disabling env var caching for all unit tests

[ROCm/rccl commit: 2daadcc834]
2021-05-18 12:56:30 -06:00
Wenkai Du e0edd3d5e4 Merge remote-tracking branch 'nccl/master' into 2.9.8
[ROCm/rccl commit: 87727383fe]
2021-05-17 10:15:16 -07:00
Stanley Tsang 6bce88058a Multiprocess unit test various fixes (#367)
* Re-enabling mp unit tests

* Fixing shared memory leak and other bugs related to shared mem for MP unit tests

* Revert 43bfbfc97bf9edbae1f386d461439091618ff8ed

* Further tightening up unlinks

* Moving test check macros to separate header file

* Tightening up shared memory unlinking for clique kernels, add munmap for host barrier for MP unit tests

* Updating new MP unit test

* Fixing mqueue bug

* Fixing memory leak in MP unit tests

[ROCm/rccl commit: 0b2bfdd6d8]
2021-05-14 09:38:49 -06:00
Wenkai Du fa690c47a0 Allow user to skip link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)
To skip Infiniband, set RCCL_IB_HCA_SKIP_LINK_LAYER=1.
To skip Ethernet, set RCCL_IB_HCA_SKIP_LINK_LAYER=2.

[ROCm/rccl commit: caf5c9992a]
2021-05-12 14:14:53 -07:00
Sylvain Jeaugey 3dd3a1ca66 2.9.9-1
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)


[ROCm/rccl commit: 3fec2fa5ee]
2021-05-12 11:09:31 -07:00
Wenkai Du 129e7f4bfc Merge pull request #366 from ROCmSoftwarePlatform/2.9.6
Sync up to NCCL 2.9.6

[ROCm/rccl commit: abde40197a]
2021-05-11 20:20:42 -07:00
Wenkai Du 778aa5f868 Revert "Sync up to NCCL 2.9.6 (#363)" (#365)
This reverts commit 4f7d5f85ec.

[ROCm/rccl commit: 330b82df3b]
2021-05-11 20:18:17 -07:00
Wenkai Du 4f7d5f85ec Sync up to NCCL 2.9.6 (#363)
* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* Clique tuning upgrade (#352) (#19)

* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu

Co-authored-by: Sylvain Jeaugey <sjeaugey@nvidia.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>

[ROCm/rccl commit: 6021329af0]
2021-05-11 19:40:34 -07:00
gilbertlee-amd e2bf842e85 Update README.md (#364)
- Remove outdated HIP Direct call requirements
- Remove outdated chrpath requirement
- Adding section about HSA_FORCE_FINE_GRAIN_PCIE

[ROCm/rccl commit: b122dcd991]
2021-05-11 13:41:41 -06:00
gilbertlee-amd 071150a1b4 Clique tuning upgrade (#352) (#19)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu

[ROCm/rccl commit: e796b1645c]
2021-05-11 08:44:59 -06:00
Sylvain Jeaugey 780273774a 2.9.8-1
Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.


[ROCm/rccl commit: ca8485b0d0]
2021-05-10 14:00:03 -07:00
gilbertlee-amd f4a12be69b Clique tuning upgrade (#352)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu

[ROCm/rccl commit: 9d7232c091]
2021-05-06 09:50:07 -06:00
Wenkai Du a76bebf8b6 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a4ea1fed5b]
2021-05-05 16:01:01 -07:00
gilbertlee-amd ffdf00a2fa Fixing potential race-condition in env var parameter macro (#359)
[ROCm/rccl commit: 4f8e788a61]
2021-04-28 12:04:41 -06:00
saadrahim 7e30cf002d Expanding CI coverage for 8GPU configurations plus extended tests (#350)
[ROCm/rccl commit: 96782191cf]
2021-04-27 09:57:00 -06:00
Wenkai Du 22e8269864 Add libdl linking option (#358)
[ROCm/rccl commit: ad54a14a5c]
2021-04-26 15:24:58 -07:00
Wenkai Du 185ad0deab Use better name for kernel collective trace enable (#357)
"NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL" enables collectives API
trace. Adding "RCCL_KERNEL_COLL_TRACE_ENABLE=1" enables kernel traces.

[ROCm/rccl commit: ed237dcaa7]
2021-04-26 08:35:53 -07:00
Wenkai Du 60a24aa4db Control collective trace from kernel separately (#356)
[ROCm/rccl commit: 9cc9c3360b]
2021-04-23 16:36:19 -07:00
Stanley Tsang 6680e23c63 Message queue refactor to POSIX implementation and leak fix (#355)
* Fixing message queue leak.

* Using POSIX implementation of Message Queues

* Adding unlink to msgqueue

* MsgQueue update

* Adding timeout check to msgqueue broadcast; tightening up system checks

* Removing unnecessary code

* Removing extra argument from print

* Adding explicit msg queue close call to all other ranks

[ROCm/rccl commit: 70597789d0]
2021-04-23 11:33:20 -06:00
Wenkai Du e28bac31aa Tune number of channels for gfx90a (#349)
[ROCm/rccl commit: 415c7cd3d1]
2021-04-19 15:27:01 -07:00
Wenkai Du 951d89b12f Use correct WARP_SIZE for gfx1030 (#348)
[ROCm/rccl commit: 9c718ce6d6]
2021-04-14 14:09:52 -07:00
Wenkai Du 661b1351a3 Limit max channels for ring graph on single node Rome (#347)
* Limit max channels for ring graph on single node Rome
* Partially revert "Use non-temporal access for streaming data (#341)"

[ROCm/rccl commit: a79f74082e]
2021-04-14 10:14:54 -07:00
Wenkai Du 0f4d497edc Add gfx90a target (#344)
* Add gfx90a target

* Support gfx90a topology

Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>

[ROCm/rccl commit: 1fe031402a]
2021-04-14 09:29:00 -06:00
Sylvain Jeaugey 20da390b96 2.9.6-1
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.


[ROCm/rccl commit: a46ea10583]
2021-04-12 16:00:46 -07:00
Wenkai Du 27f33208e3 Remove link to NUMA lib as it is no longer needed (#346)
[ROCm/rccl commit: 3f18540f50]
2021-04-12 09:53:17 -07:00
TomSang 6105af2dfc Add detection of cooperative multi device launch attribute (#345)
[ROCm/rccl commit: 87f12cbb86]
2021-04-11 13:29:24 -07:00
Wenkai Du 040656ba9b Move RCCL changelog and Copyright out of /usr/share (#343)
[ROCm/rccl commit: def8b4ca0d]
2021-04-09 14:08:40 -07:00
Wenkai Du 6b3389b790 Use non-temporal access for streaming data (#341)
* Use non-temporal access for streaming data

* Revert to ulong2 after fixing compiling issue

[ROCm/rccl commit: 9dfc2c183e]
2021-04-07 17:34:35 -07:00