Wenkai Du
f4387b2954
Use relaxed atomics and add sleep and wakeup in barrier loop ( #425 )
...
* Use relaxed atomics and add sleep and wakeup in barrier loop
* atomicAdd in ROCm 4.3 only supports unsigned long long
* Switch to atomicAdd and atomicExch in more places
* Restore LOAD/STORE define to __ATOMIC_SEQ_CST
* Restore atomic for sizes FIFO
[ROCm/rccl commit: 020484bf40 ]
2021-09-13 17:03:49 -07:00
Wenkai Du
9ffeb41fe1
Update tuning table ( #424 )
...
[ROCm/rccl commit: ef432e48e1 ]
2021-09-13 08:39:01 -07:00
Wenkai Du
934885526d
Merge pull request #423 from wenkaidu/prim-test
...
rccl-prim-test: support 8p1h and 16p1h testing
[ROCm/rccl commit: a2421f8b4a ]
2021-09-08 17:01:19 -07:00
Wenkai Du
d2580c8cf5
Improve barrier implementation
...
[ROCm/rccl commit: adb8d63352 ]
2021-09-08 16:14:32 -05:00
Wenkai Du
d75504e9dc
Remove atomic from profiling
...
[ROCm/rccl commit: 31bd4236f1 ]
2021-09-08 14:20:32 -05:00
Wenkai Du
310d51056f
rccl-prim-test: enable 8p1h and 16p1h test
...
[ROCm/rccl commit: 7558b5e2bf ]
2021-09-08 11:51:26 -05:00
Wenkai Du
4f610a2239
Revert "rccl-prim-test: add all-to-all benchmark ( #185 )"
...
This reverts commit e3e1c6b29c.
[ROCm/rccl commit: b22d097524 ]
2021-09-07 16:41:46 -05:00
gilbertlee-amd
06b0e1c4e2
[TransferBench] ConfigFile parsing fixes, adding additional info ( #422 )
...
* [TransferBench] Adding GPU to NUMA distance detection, parsing fixes, config file generation fix
* [TransferBench] Fixing up NUMA node detection by filtering pools
[ROCm/rccl commit: 51d64894ff ]
2021-09-07 15:28:16 -06:00
Wenkai Du
b9508a6aba
Implement NIC identification and remapping ( #420 )
...
* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413)"
This reverts commit de0c586bad.
* Fix permute and check order
* Correction on IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"
This reverts commit fa690c47a0.
[ROCm/rccl commit: 5c8380ff5b ]
2021-08-24 09:42:04 -07:00
Wenkai Du
57518da006
Add gfx908 VM model ( #418 )
...
[ROCm/rccl commit: 5f15ed6e3e ]
2021-08-10 08:55:11 -07:00
gilbertlee-amd
b0c3a1790f
[TransferBench] Removing dependency on hip_fp16 header, fixing swapped output CSV header ( #416 )
...
[ROCm/rccl commit: 1ed272e5f0 ]
2021-08-04 10:53:41 -06:00
Wenkai Du
de0c586bad
Sort IB devices based on device name ( #413 )
...
[ROCm/rccl commit: 2d0ed8dff6 ]
2021-08-03 15:32:41 -07:00
Wenkai Du
4b082ceb32
XGMI connection is always prioritized over NET regardless of hops ( #412 )
...
[ROCm/rccl commit: 3e27227562 ]
2021-07-29 11:12:42 -07:00
Eiden Yoshida
d4bdf8fab7
Add basic rtest.xml ( #411 )
...
[ROCm/rccl commit: 229ca88ee6 ]
2021-07-28 11:53:03 -06:00
Wenkai Du
faea6ead5c
Query XGMI links from xml and adjust gfx906 channel usage ( #410 )
...
[ROCm/rccl commit: 818cdb16a8 ]
2021-07-27 17:32:41 -07:00
Liam Wrubleski
765c46dd89
Update changelog with packaging information ( #409 )
...
[ROCm/rccl commit: e579de7ec2 ]
2021-07-27 10:38:36 -06:00
Wenkai Du
8fbeb14175
topo_expl: fix build after switching to rocm-smi-lib ( #405 )
...
* topo_expl: fix build after switching to rocm-smi-lib
* Use a minimum of 4 channels for gfx908
[ROCm/rccl commit: 135d47d125 ]
2021-07-27 08:30:08 -07:00
Eiden Yoshida
3d382b5ba3
Extend test stage timeout to 24 hours ( #408 )
...
[ROCm/rccl commit: 56801656f3 ]
2021-07-26 15:29:21 -06:00
Liam Wrubleski
4efbbec091
Setup runtime and development packages ( #407 )
...
* changes to enable devel package
* Update rocm-cmake version & build
[ROCm/rccl commit: 97d9cf40e7 ]
2021-07-26 15:06:17 -06:00
Wenkai Du
bbfc0c85d2
Skipping unnecessary functions in Doxygen by marking as internal ( #353 ) ( #406 )
...
(cherry picked from commit 198e17608ef40acf6b9515c6831d4a26786aabd6)
Co-authored-by: saadrahim <44449863+saadrahim@users.noreply.github.com>
[ROCm/rccl commit: dfc62d5fbb ]
2021-07-24 11:04:27 -07:00
Hubert Lu
d3d312f041
Merge pull request #404 from hubertlu-tw/coll_trace
...
Enhancement of RCCL logging information for topology-aware optimization
[ROCm/rccl commit: c3e2e7cb5d ]
2021-07-22 13:50:28 -07:00
Lu
2ab26a2ff9
Add more info to RCCL logging for topo-aware optim.
...
[ROCm/rccl commit: bd6dbca8fb ]
2021-07-22 09:52:39 -07:00
Wenkai Du
f773636575
Fix unit tests static build ( #403 )
...
[ROCm/rccl commit: 215904ee8e ]
2021-07-09 09:35:32 -07:00
Wenkai Du
71dfc3978e
Use rocm_smi_lib for getting topology information ( #402 )
...
* Use rocm_smi_lib for getting topology information
* Add rocm-smi-lib dependency to RCCL package
[ROCm/rccl commit: 56155ff5b6 ]
2021-07-08 13:23:11 -07:00
Eiden Yoshida
66349efb1d
Fix static builds ( #393 )
...
[ROCm/rccl commit: 5c3e7d8b67 ]
2021-06-23 09:19:48 -06:00
gilbertlee-amd
f2a72b1e0b
[TransferBench] Fixing a typo in TransferBench usage example ( #401 )
...
[ROCm/rccl commit: 2b0b608270 ]
2021-06-22 17:08:57 -06:00
Wenkai Du
929d72b3b9
Deduce ROCM_PATH from CXX unless specified ( #400 )
...
[ROCm/rccl commit: e75bc53e06 ]
2021-06-22 13:29:08 -07:00
Wenkai Du
90ae176437
Fixes for NCCL_MAX_NCHANNELS and topo_expl ( #398 )
...
[ROCm/rccl commit: fa6d7e9a63 ]
2021-06-22 08:41:49 -07:00
gilbertlee-amd
0a636f20a3
[TransferBench] Switching from little-endian fill pattern to big-endian ( #399 )
...
[ROCm/rccl commit: 720374a767 ]
2021-06-21 14:28:51 -06:00
Wenkai Du
1670bddea0
Remove hard coded /opt/rocm from cmake ( #396 )
...
[ROCm/rccl commit: 59d2867b01 ]
2021-06-21 08:29:23 -07:00
gilbertlee-amd
01a8efbb76
[TransferBench] Adding ability to specify source data pattern ( #394 )
...
* [TransferBench] Adding ability to specify source data pattern
[ROCm/rccl commit: ff413be933 ]
2021-06-15 08:41:57 -06:00
Eiden Yoshida
dbb867942d
Move address-sanitizer build above addition of rccl library in CMakeLists ( #392 )
...
[ROCm/rccl commit: fb267ea333 ]
2021-06-11 14:43:54 -06:00
Wenkai Du
f82f99f533
Select sendrecv path based on collective data size ( #391 )
...
* Select sendrecv path based on collective data size
* Add comments on packing and unpacking group field
* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests
[ROCm/rccl commit: 6dcae8a459 ]
2021-06-10 17:51:04 -07:00
Stanley Tsang
dd98f1762a
Fixing bug with ExtractSubDataset function not fully initializing subdataset ( #390 )
...
[ROCm/rccl commit: f6f5e16fe6 ]
2021-06-10 14:35:39 -06:00
Eiden Yoshida
a24e180296
Add address sanitizer build option ( #389 )
...
[ROCm/rccl commit: eea7b24058 ]
2021-06-10 09:14:54 -06:00
Wenkai Du
5bebcb0015
Setup collectives threshold for enabling intranet ( #387 )
...
* Setup collectives threshold for enabling intranet
* Use separate operation counters for coll and p2p
[ROCm/rccl commit: b815a2800f ]
2021-06-09 13:24:26 -07:00
Wenkai Du
ab9b9151d2
Add support for another Rome model ( #385 )
...
[ROCm/rccl commit: c2064adcc7 ]
2021-06-08 13:58:20 -07:00
Stanley Tsang
52ffd67cd6
Updating changelog to show install script fix ( #384 )
...
[ROCm/rccl commit: 6842429a14 ]
2021-06-08 13:00:40 -06:00
Wenkai Du
c8a432dc25
Allow intranode use of network connection ( #383 )
...
* Allow intranode use of network connection
* Check graph for null pointer
[ROCm/rccl commit: a3a8c2d56b ]
2021-06-08 07:37:59 -07:00
Stanley Tsang
b1f41247a2
Fixing install script so that invoking -r alone does not trigger rebuild ( #382 )
...
[ROCm/rccl commit: 820a53287f ]
2021-06-04 09:46:04 -06:00
Wenkai Du
4b31e521e9
Add option to enable multiple SAT in SHARP ( #380 )
...
* Add option to enable multiple SAT in SHARP
* Extend number of NICs to 16
[ROCm/rccl commit: 961922ea02 ]
2021-06-03 19:45:18 -07:00
gilbertlee-amd
fd94c55afe
ROCm 4.3 changelog update ( #379 )
...
* Update CHANGELOG.md (#378)
* Updating CHANGELOG.md for ROCm 4.3
[ROCm/rccl commit: 903c84050d ]
2021-06-03 10:56:02 -06:00
Wenkai Du
f7024c67c2
Merge pull request #377 from wenkaidu/2.9.9
...
Sync up with NCCL 2.9.9
[ROCm/rccl commit: 03ac898825 ]
2021-05-26 11:38:19 -07:00
Wenkai Du
cdf2780687
topo_expl: update to 2.9.9
...
[ROCm/rccl commit: 13dc80ee14 ]
2021-05-26 09:24:34 -07:00
Wenkai Du
b154b532a2
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: e3abf1c2ec ]
2021-05-25 20:52:15 -07:00
Stanley Tsang
b1a6561fee
Adding support for hipMallocManaged() in unit tests ( #375 )
...
* Adding HMM support for unit tests
* Fixing HMM opt-in check
[ROCm/rccl commit: 256403d4f0 ]
2021-05-25 17:07:12 -06:00
Wenkai Du
aa95cc6102
Update Rome models matching ( #376 )
...
[ROCm/rccl commit: 4c83adb75c ]
2021-05-25 10:12:40 -07:00
gilbertlee-amd
ea17b05518
Tweak clique channel usage for gfx908 ( #374 )
...
[ROCm/rccl commit: 8e817ecd6d ]
2021-05-21 15:36:21 -06:00
Wenkai Du
92bcdcf5b0
Correction on max number of groups ( #373 )
...
[ROCm/rccl commit: 50da1b48af ]
2021-05-20 08:58:45 -07:00
Wenkai Du
b27490d38d
Use fixed segment size for sendrecv ( #369 )
...
[ROCm/rccl commit: 8cde34be51 ]
2021-05-19 08:25:26 -07:00