Commit Graph

564 commits

Author SHA1 Message Date
Wenkai Du bbfc0c85d2 Skipping unnecessary functions in Doxygen by marking as internal (#353) (#406)
(cherry picked from commit 198e17608ef40acf6b9515c6831d4a26786aabd6)

Co-authored-by: saadrahim <44449863+saadrahim@users.noreply.github.com>

[ROCm/rccl commit: dfc62d5fbb]
2021-07-24 11:04:27 -07:00
Lu 2ab26a2ff9 Add more info to RCCL logging for topo-aware optim.
[ROCm/rccl commit: bd6dbca8fb]
2021-07-22 09:52:39 -07:00
Wenkai Du f773636575 Fix unit tests static build (#403)
[ROCm/rccl commit: 215904ee8e]
2021-07-09 09:35:32 -07:00
Wenkai Du 71dfc3978e Use rocm_smi_lib for getting topology information (#402)
* Use rocm_smi_lib for getting topology information

* Add rocm-smi-lib dependency to RCCL package

[ROCm/rccl commit: 56155ff5b6]
2021-07-08 13:23:11 -07:00
Eiden Yoshida 66349efb1d Fix static builds (#393)
[ROCm/rccl commit: 5c3e7d8b67]
2021-06-23 09:19:48 -06:00
gilbertlee-amd f2a72b1e0b [TransferBench] Fixing a typo in TransferBench usage example (#401)
[ROCm/rccl commit: 2b0b608270]
2021-06-22 17:08:57 -06:00
Wenkai Du 929d72b3b9 Deduce ROCM_PATH from CXX unless specified (#400)
[ROCm/rccl commit: e75bc53e06]
2021-06-22 13:29:08 -07:00
Wenkai Du 90ae176437 Fixes for NCCL_MAX_NCHANNELS and topo_expl (#398)
[ROCm/rccl commit: fa6d7e9a63]
2021-06-22 08:41:49 -07:00
gilbertlee-amd 0a636f20a3 [TransferBench] Switching from little-endian fill pattern to big-endian (#399)
[ROCm/rccl commit: 720374a767]
2021-06-21 14:28:51 -06:00
Wenkai Du 1670bddea0 Remove hard coded /opt/rocm from cmake (#396)
[ROCm/rccl commit: 59d2867b01]
2021-06-21 08:29:23 -07:00
gilbertlee-amd 01a8efbb76 [TransferBench] Adding ability to specify source data pattern (#394)
* [TransferBench] Adding ability to specify source data pattern

[ROCm/rccl commit: ff413be933]
2021-06-15 08:41:57 -06:00
Eiden Yoshida dbb867942d Move address-sanitizer build above addition of rccl library in CMakeLists (#392)
[ROCm/rccl commit: fb267ea333]
2021-06-11 14:43:54 -06:00
Wenkai Du f82f99f533 Select sendrecv path based on collective data size (#391)
* Select sendrecv path based on collective data size

* Add comments on packing and unpacking group field

* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests

[ROCm/rccl commit: 6dcae8a459]
2021-06-10 17:51:04 -07:00
Stanley Tsang dd98f1762a Fixing bug with ExtractSubDataset function not fully initializing subdataset (#390)
[ROCm/rccl commit: f6f5e16fe6]
2021-06-10 14:35:39 -06:00
Eiden Yoshida a24e180296 Add address sanitizer build option (#389)
[ROCm/rccl commit: eea7b24058]
2021-06-10 09:14:54 -06:00
Wenkai Du 5bebcb0015 Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet

* Use separate operation counters for coll and p2p

[ROCm/rccl commit: b815a2800f]
2021-06-09 13:24:26 -07:00
Wenkai Du ab9b9151d2 Add support for another Rome model (#385)
[ROCm/rccl commit: c2064adcc7]
2021-06-08 13:58:20 -07:00
Stanley Tsang 52ffd67cd6 Updating changelog to show install script fix (#384)
[ROCm/rccl commit: 6842429a14]
2021-06-08 13:00:40 -06:00
Wenkai Du c8a432dc25 Allow intranode use of network connection (#383)
* Allow intranode use of network connection

* Checking graph for null pointer

[ROCm/rccl commit: a3a8c2d56b]
2021-06-08 07:37:59 -07:00
Stanley Tsang b1f41247a2 Fixing install script so that invoking -r alone does not trigger rebuild (#382)
[ROCm/rccl commit: 820a53287f]
2021-06-04 09:46:04 -06:00
Wenkai Du 4b31e521e9 Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP

* Extend number of NICs to 16

[ROCm/rccl commit: 961922ea02]
2021-06-03 19:45:18 -07:00
gilbertlee-amd fd94c55afe ROCm 4.3 changelog update (#379)
* Update CHANGELOG.md (#378)

* Updating CHANGELOG.md for ROCm 4.3

[ROCm/rccl commit: 903c84050d]
2021-06-03 10:56:02 -06:00
Wenkai Du cdf2780687 topo_expl: update to 2.9.9
[ROCm/rccl commit: 13dc80ee14]
2021-05-26 09:24:34 -07:00
Wenkai Du b154b532a2 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: e3abf1c2ec]
2021-05-25 20:52:15 -07:00
Stanley Tsang b1a6561fee Adding support for hipMallocManaged() in unit tests (#375)
* Adding HMM support for unit tests

* Fixing HMM opt-in check

[ROCm/rccl commit: 256403d4f0]
2021-05-25 17:07:12 -06:00
Wenkai Du aa95cc6102 Update Rome models matching (#376)
[ROCm/rccl commit: 4c83adb75c]
2021-05-25 10:12:40 -07:00
gilbertlee-amd ea17b05518 Tweak clique channel usage for gfx908 (#374)
[ROCm/rccl commit: 8e817ecd6d]
2021-05-21 15:36:21 -06:00
Wenkai Du 92bcdcf5b0 Correction on max number of groups (#373)
[ROCm/rccl commit: 50da1b48af]
2021-05-20 08:58:45 -07:00
Wenkai Du b27490d38d Use fixed segment size for sendrecv (#369)
[ROCm/rccl commit: 8cde34be51]
2021-05-19 08:25:26 -07:00
Wenkai Du 83d309354e Running only sum for CI quick test (#370)
[ROCm/rccl commit: 42b080867e]
2021-05-19 08:25:13 -07:00
gilbertlee-amd 87423792fd Tune clique-based AllReduce for device type 908 (#372)
* Changing switch-over point for clique-based AllReduce

[ROCm/rccl commit: ddceadc313]
2021-05-18 15:36:07 -06:00
gilbertlee-amd 08addb85a2 Disabling env var caching for all unit tests (#371)
* Disabling env var caching for all unit tests

[ROCm/rccl commit: 2daadcc834]
2021-05-18 12:56:30 -06:00
Wenkai Du e0edd3d5e4 Merge remote-tracking branch 'nccl/master' into 2.9.8
[ROCm/rccl commit: 87727383fe]
2021-05-17 10:15:16 -07:00
Stanley Tsang 6bce88058a Multiprocess unit test various fixes (#367)
* Re-enabling mp unit tests

* Fixing shared memory leak and other bugs related to shared mem for MP unit tests

* Revert 43bfbfc97bf9edbae1f386d461439091618ff8ed

* Further tightening up unlinks

* Moving test check macros to separate header file

* Tightening up shared memory unlinking for clique kernels, add munmap for host barrier for MP unit tests

* Updating new MP unit test

* Fixing mqueue bug

* Fixing memory leak in MP unit tests

[ROCm/rccl commit: 0b2bfdd6d8]
2021-05-14 09:38:49 -06:00
Wenkai Du fa690c47a0 Allow user to skip a link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)
To skip InfiniBand, set RCCL_IB_HCA_SKIP_LINK_LAYER=1.
To skip Ethernet, set RCCL_IB_HCA_SKIP_LINK_LAYER=2.

[ROCm/rccl commit: caf5c9992a]
2021-05-12 14:14:53 -07:00
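Note on #361: the message above documents two values for RCCL_IB_HCA_SKIP_LINK_LAYER (1 skips InfiniBand, 2 skips Ethernet). A minimal sketch of preparing a process environment with that variable might look like the following. The helper name `env_with_skipped_link_layer` and the two constants are hypothetical, introduced only for illustration; only the variable name and its two values come from the commit message.

```python
import os

# Values documented in the commit message for #361.
# The constant names are illustrative, not part of RCCL.
SKIP_INFINIBAND = "1"
SKIP_ETHERNET = "2"


def env_with_skipped_link_layer(value, base_env=None):
    """Return a copy of the environment with RCCL_IB_HCA_SKIP_LINK_LAYER set.

    `value` must be one of the two settings named in the commit message.
    `base_env` defaults to the current process environment and is not modified.
    """
    if value not in (SKIP_INFINIBAND, SKIP_ETHERNET):
        raise ValueError("expected '1' (skip InfiniBand) or '2' (skip Ethernet)")
    env = dict(os.environ if base_env is None else base_env)
    env["RCCL_IB_HCA_SKIP_LINK_LAYER"] = value
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching an RCCL application.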
Sylvain Jeaugey 3dd3a1ca66 2.9.9-1
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)


[ROCm/rccl commit: 3fec2fa5ee]
2021-05-12 11:09:31 -07:00
Wenkai Du 129e7f4bfc Merge pull request #366 from ROCmSoftwarePlatform/2.9.6
Sync up to NCCL 2.9.6

[ROCm/rccl commit: abde40197a]
2021-05-11 20:20:42 -07:00
Wenkai Du 778aa5f868 Revert "Sync up to NCCL 2.9.6 (#363)" (#365)
This reverts commit 4f7d5f85ec.

[ROCm/rccl commit: 330b82df3b]
2021-05-11 20:18:17 -07:00
Wenkai Du 4f7d5f85ec Sync up to NCCL 2.9.6 (#363)
* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* Clique tuning upgrade (#352) (#19)

* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu

Co-authored-by: Sylvain Jeaugey <sjeaugey@nvidia.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>

[ROCm/rccl commit: 6021329af0]
2021-05-11 19:40:34 -07:00
gilbertlee-amd e2bf842e85 Update README.md (#364)
- Remove outdated HIP Direct call requirements
- Remove outdated chrpath requirement
- Adding section about HSA_FORCE_FINE_GRAIN_PCIE

[ROCm/rccl commit: b122dcd991]
2021-05-11 13:41:41 -06:00
gilbertlee-amd 071150a1b4 Clique tuning upgrade (#352) (#19)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu

[ROCm/rccl commit: e796b1645c]
2021-05-11 08:44:59 -06:00
Sylvain Jeaugey 780273774a 2.9.8-1
Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.


[ROCm/rccl commit: ca8485b0d0]
2021-05-10 14:00:03 -07:00
gilbertlee-amd f4a12be69b Clique tuning upgrade (#352)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu

[ROCm/rccl commit: 9d7232c091]
2021-05-06 09:50:07 -06:00
Wenkai Du a76bebf8b6 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a4ea1fed5b]
2021-05-05 16:01:01 -07:00
gilbertlee-amd ffdf00a2fa Fixing potential race-condition in env var parameter macro (#359)
[ROCm/rccl commit: 4f8e788a61]
2021-04-28 12:04:41 -06:00
saadrahim 7e30cf002d Expanding CI coverage for 8GPU configurations plus extended tests (#350)
[ROCm/rccl commit: 96782191cf]
2021-04-27 09:57:00 -06:00
Wenkai Du 22e8269864 Add libdl linking option (#358)
[ROCm/rccl commit: ad54a14a5c]
2021-04-26 15:24:58 -07:00
Wenkai Du 185ad0deab Use better name for kernel collective trace enable (#357)
"NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL" enables collectives API
trace. Adding "RCCL_KERNEL_COLL_TRACE_ENABLE=1" enables kernel traces.

[ROCm/rccl commit: ed237dcaa7]
2021-04-26 08:35:53 -07:00
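Note on #357: the message above names three environment variables. NCCL_DEBUG=INFO with NCCL_DEBUG_SUBSYS=INIT,COLL enables the collectives API trace, and RCCL_KERNEL_COLL_TRACE_ENABLE=1 adds kernel traces. A minimal sketch of assembling those settings for a child process follows. The helper name `trace_env` and its parameters are hypothetical; the variable names and values are taken from the commit message.

```python
import os


def trace_env(kernel_traces=False, base_env=None):
    """Return a copy of the environment with the trace settings from #357.

    The collectives API trace is always enabled; kernel traces are added
    only when `kernel_traces` is True. `base_env` defaults to the current
    process environment and is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"
    if kernel_traces:
        env["RCCL_KERNEL_COLL_TRACE_ENABLE"] = "1"
    return env
```

As with any environment dictionary, the result can be handed to `subprocess.run(..., env=...)` to launch an application with tracing turned on.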
Wenkai Du 60a24aa4db Control collective trace from kernel separately (#356)
[ROCm/rccl commit: 9cc9c3360b]
2021-04-23 16:36:19 -07:00
Stanley Tsang 6680e23c63 Message queue refactor to POSIX implementation and leak fix (#355)
* Fixing message queue leak.

* Using POSIX implementation of Message Queues

* Adding unlink to msgqueue

* MsgQueue update

* Adding timeout check to msgqueue broadcast; tightening up system checks

* Removing unnecessary code

* Removing extra argument from print

* Adding explicit msg queue close call to all other ranks

[ROCm/rccl commit: 70597789d0]
2021-04-23 11:33:20 -06:00