Commit Graph

582 Commits

Author SHA1 Message Date
Wenkai Du 9ffeb41fe1 Update tuning table (#424)
[ROCm/rccl commit: ef432e48e1]
2021-09-13 08:39:01 -07:00
Wenkai Du 934885526d Merge pull request #423 from wenkaidu/prim-test
rccl-prim-test: support 8p1h and 16p1h testing

[ROCm/rccl commit: a2421f8b4a]
2021-09-08 17:01:19 -07:00
Wenkai Du d2580c8cf5 Improve barrier implementation
[ROCm/rccl commit: adb8d63352]
2021-09-08 16:14:32 -05:00
Wenkai Du d75504e9dc Remove atomic from profiling
[ROCm/rccl commit: 31bd4236f1]
2021-09-08 14:20:32 -05:00
Wenkai Du 310d51056f rccl-prim-test: enable 8p1h and 16p1h test
[ROCm/rccl commit: 7558b5e2bf]
2021-09-08 11:51:26 -05:00
Wenkai Du 4f610a2239 Revert "rccl-prim-test: add all-to-all benchmark (#185)"
This reverts commit e3e1c6b29c.


[ROCm/rccl commit: b22d097524]
2021-09-07 16:41:46 -05:00
gilbertlee-amd 06b0e1c4e2 [TransferBench] ConfigFile parsing fixes, adding additional info (#422)
* [TransferBench] Adding GPU to NUMA distance detection, parsing fixes, config file generation fix

* [TransferBench] Fixing up NUMA node detection by filtering pools

[ROCm/rccl commit: 51d64894ff]
2021-09-07 15:28:16 -06:00
Wenkai Du b9508a6aba Implement NIC identification and remapping (#420)
* Add 1H16P GPU model

* Implement NIC identification and remapping

* Revert "Sort IB devices based on device name (#413)"

This reverts commit de0c586bad.

* Fix permute and check order

* Correction on IB speed reporting

* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"

This reverts commit fa690c47a0.

[ROCm/rccl commit: 5c8380ff5b]
2021-08-24 09:42:04 -07:00
Wenkai Du 57518da006 Add gfx908 VM model (#418)
[ROCm/rccl commit: 5f15ed6e3e]
2021-08-10 08:55:11 -07:00
gilbertlee-amd b0c3a1790f [TransferBench] Removing dependency on hip_fp16 header, fixing swapped output CSV header (#416)
[ROCm/rccl commit: 1ed272e5f0]
2021-08-04 10:53:41 -06:00
Wenkai Du de0c586bad Sort IB devices based on device name (#413)
[ROCm/rccl commit: 2d0ed8dff6]
2021-08-03 15:32:41 -07:00
Wenkai Du 4b082ceb32 XGMI connection is always prioritized over NET regardless of hops (#412)
[ROCm/rccl commit: 3e27227562]
2021-07-29 11:12:42 -07:00
Eiden Yoshida d4bdf8fab7 Add basic rtest.xml (#411)
[ROCm/rccl commit: 229ca88ee6]
2021-07-28 11:53:03 -06:00
Wenkai Du faea6ead5c Query XGMI links from xml and adjust gfx906 channel usage (#410)
[ROCm/rccl commit: 818cdb16a8]
2021-07-27 17:32:41 -07:00
Liam Wrubleski 765c46dd89 Update changelog with packaging information (#409)
[ROCm/rccl commit: e579de7ec2]
2021-07-27 10:38:36 -06:00
Wenkai Du 8fbeb14175 topo_expl: fix build after switching to rocm-smi-lib (#405)
* topo_expl: fix build after switching to rocm-smi-lib

* Use minimal of 4 channels for gfx908

[ROCm/rccl commit: 135d47d125]
2021-07-27 08:30:08 -07:00
Eiden Yoshida 3d382b5ba3 Extend test stage timeout to 24 hours (#408)
[ROCm/rccl commit: 56801656f3]
2021-07-26 15:29:21 -06:00
Liam Wrubleski 4efbbec091 Setup runtime and development packages (#407)
* changes to enable devel package

* Update rocm-cmake version & build

[ROCm/rccl commit: 97d9cf40e7]
2021-07-26 15:06:17 -06:00
Wenkai Du bbfc0c85d2 Skipping unnecessary functions in Doxygen by marking as internal (#353) (#406)
(cherry picked from commit 198e17608ef40acf6b9515c6831d4a26786aabd6)

Co-authored-by: saadrahim <44449863+saadrahim@users.noreply.github.com>

[ROCm/rccl commit: dfc62d5fbb]
2021-07-24 11:04:27 -07:00
Lu 2ab26a2ff9 Add more info to RCCL logging for topo-aware optim.
[ROCm/rccl commit: bd6dbca8fb]
2021-07-22 09:52:39 -07:00
Wenkai Du f773636575 Fix unit tests static build (#403)
[ROCm/rccl commit: 215904ee8e]
2021-07-09 09:35:32 -07:00
Wenkai Du 71dfc3978e Use rocm_smi_lib for getting topology information (#402)
* Use rocm_smi_lib for getting topology information

* Add rocm-smi-lib dependency to RCCL package

[ROCm/rccl commit: 56155ff5b6]
2021-07-08 13:23:11 -07:00
Eiden Yoshida 66349efb1d Fix static builds (#393)
[ROCm/rccl commit: 5c3e7d8b67]
2021-06-23 09:19:48 -06:00
gilbertlee-amd f2a72b1e0b [TransferBench] Fixing a typo in TransferBench usage example (#401)
[ROCm/rccl commit: 2b0b608270]
2021-06-22 17:08:57 -06:00
Wenkai Du 929d72b3b9 Deduct ROCM_PATH from CXX unless specified (#400)
[ROCm/rccl commit: e75bc53e06]
2021-06-22 13:29:08 -07:00
Wenkai Du 90ae176437 Fixes for NCCL_MAX_NCHANNELS and topo_expl (#398)
[ROCm/rccl commit: fa6d7e9a63]
2021-06-22 08:41:49 -07:00
gilbertlee-amd 0a636f20a3 [TransferBench] Switching from little-endian fill pattern to big-endian (#399)
[ROCm/rccl commit: 720374a767]
2021-06-21 14:28:51 -06:00
Wenkai Du 1670bddea0 Remove hard coded /opt/rocm from cmake (#396)
[ROCm/rccl commit: 59d2867b01]
2021-06-21 08:29:23 -07:00
gilbertlee-amd 01a8efbb76 [TransferBench] Adding ability to specify source data pattern (#394)
* [TransferBench] Adding ability to specify source data pattern

[ROCm/rccl commit: ff413be933]
2021-06-15 08:41:57 -06:00
Eiden Yoshida dbb867942d Move address-sanitizer build above addition of rccl library in CMakeLists (#392)
[ROCm/rccl commit: fb267ea333]
2021-06-11 14:43:54 -06:00
Wenkai Du f82f99f533 Select sendrecv path based on collective data size (#391)
* Select sendrecv path based on collective data size

* Add comments on packing and unpacking group field

* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests

[ROCm/rccl commit: 6dcae8a459]
2021-06-10 17:51:04 -07:00
Stanley Tsang dd98f1762a Fixing bug with ExtractSubDataset function not fully initializing subdataset (#390)
[ROCm/rccl commit: f6f5e16fe6]
2021-06-10 14:35:39 -06:00
Eiden Yoshida a24e180296 Add address sanitizer build option (#389)
[ROCm/rccl commit: eea7b24058]
2021-06-10 09:14:54 -06:00
Wenkai Du 5bebcb0015 Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet

* Use separate operation counters for coll and p2p

[ROCm/rccl commit: b815a2800f]
2021-06-09 13:24:26 -07:00
Wenkai Du ab9b9151d2 Add support for another Rome model (#385)
[ROCm/rccl commit: c2064adcc7]
2021-06-08 13:58:20 -07:00
Stanley Tsang 52ffd67cd6 Updating changelog to show install script fix (#384)
[ROCm/rccl commit: 6842429a14]
2021-06-08 13:00:40 -06:00
Wenkai Du c8a432dc25 Allow intranode use of network connection (#383)
* Allow intranode use of network connection

* Checking for graph for null pointer

[ROCm/rccl commit: a3a8c2d56b]
2021-06-08 07:37:59 -07:00
Stanley Tsang b1f41247a2 Fixing install script so that invoking -r alone does not trigger rebuild (#382)
[ROCm/rccl commit: 820a53287f]
2021-06-04 09:46:04 -06:00
Wenkai Du 4b31e521e9 Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP

* Extend number of NICs to 16

[ROCm/rccl commit: 961922ea02]
2021-06-03 19:45:18 -07:00
gilbertlee-amd fd94c55afe ROCm 4.3 changelog update (#379)
* Update CHANGELOG.md (#378)

* Updating CHANGELOG.md for ROCm 4.3

[ROCm/rccl commit: 903c84050d]
2021-06-03 10:56:02 -06:00
Wenkai Du cdf2780687 topo_expl: update to 2.9.9
[ROCm/rccl commit: 13dc80ee14]
2021-05-26 09:24:34 -07:00
Wenkai Du b154b532a2 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: e3abf1c2ec]
2021-05-25 20:52:15 -07:00
Stanley Tsang b1a6561fee Adding support for hipMallocManaged() in unit tests (#375)
* Adding HMM support for unit tests

* Fixing HMM opt-in check

[ROCm/rccl commit: 256403d4f0]
2021-05-25 17:07:12 -06:00
Wenkai Du aa95cc6102 Update Rome models matching (#376)
[ROCm/rccl commit: 4c83adb75c]
2021-05-25 10:12:40 -07:00
gilbertlee-amd ea17b05518 Tweak clique channel usage for gfx908 (#374)
[ROCm/rccl commit: 8e817ecd6d]
2021-05-21 15:36:21 -06:00
Wenkai Du 92bcdcf5b0 Correction on max number of groups (#373)
[ROCm/rccl commit: 50da1b48af]
2021-05-20 08:58:45 -07:00
Wenkai Du b27490d38d Use fixed segment size for sendrecv (#369)
[ROCm/rccl commit: 8cde34be51]
2021-05-19 08:25:26 -07:00
Wenkai Du 83d309354e Running only sum for CI quick test (#370)
[ROCm/rccl commit: 42b080867e]
2021-05-19 08:25:13 -07:00
gilbertlee-amd 87423792fd Tune clique-based AllReduce for device type 908 (#372)
* Changing switch-over point for clique-based AllReduce

[ROCm/rccl commit: ddceadc313]
2021-05-18 15:36:07 -06:00
gilbertlee-amd 08addb85a2 Disabling env var caching for all unit tests (#371)
* Disabling env var caching for all unit tests

[ROCm/rccl commit: 2daadcc834]
2021-05-18 12:56:30 -06:00