Wenkai Du
3d092f32b8
Bump up HCC version for -hc-function-calls switch
2020-02-11 19:37:13 +00:00
Wenkai Du
d1dae2721d
Add ring bandwidth correction factor
2020-01-30 09:52:27 -08:00
Stanley Tsang
20fa04d9b6
Updating copyright notices for 2020.
2020-01-29 15:28:08 -08:00
Wenkai Du
fe6d012eb0
Merge remote-tracking branch 'remotes/rccl/master' into rccl_2.5.6_cleanup
2020-01-29 15:28:03 -08:00
Wenkai Du
486fd436af
Split primitive class to smaller structures
2020-01-29 15:27:23 -08:00
Wenkai Du
1e55645d97
Misc fixes and improvements for 2.5.6
...
1. Fix RCCL unit test
2. Add ROME detection and tuning
3. Change default P2P level
4. Fix search algorithm for XGMI
5. Replace explicit channel duplication with implicit duplication by using half of the link speed
6. Add collective trace support
7. Correct Intel Skylake CPU detection and bandwidth
8. Fix topo connect function
9. Disable GDR read and remove unreachable code
10. Disable LL128 kernels
11. Add tuning parameters
12. Use original clock64() implementation which returns RTC counter value
13. Print out timestamp of collective trace
14. Do not use struct ncclColl in kernel launch parameter
15. Fix abort handling and add tracing
17. Add __launch_bounds__ to kernel functions
18. Remove unused abortCount
19. Unset default MIN_NRINGS and MIN_NCHANNELS
20. Do not allocate shared memory when not using LL128 kernels
21. Correct time print out in tuning log
2020-01-29 15:27:05 -08:00
paulfreddy
15c917244d
Changes for multiple ROCm installation (#164)
...
* Changes for multiple ROCm installation
1. Set version to 2.10.1
2. Add CMAKE_INSTALL_PREFIX to necessary places
3. Cleanup, fix rpath, use prefix in install.sh
* Changes for multiple ROCm installation
1. Set soversion to match release version
2. Add CMAKE_INSTALL_PREFIX to necessary places
3. Cleanup, fix rpath, use prefix in install.sh
2020-01-08 21:28:16 -08:00
Gilbert Lee
e5074ce94d
Changing single sync mode to time all iterations instead of just last
2019-12-20 17:08:39 -08:00
gilbertlee-amd
000bce6f27
Removing OpenMP from unit tests (#163)
2019-12-20 11:41:56 -07:00
gilbertlee-amd
2f4269d06d
Adding new sleep after sync capability for data fabric profiling (#162)
...
Fixing missing header include for ROCM 3.0 changes
2019-12-12 15:20:54 -07:00
saadrahim
0092b35132
Package fix (#161)
...
* Fixing RHEL dependency on rocm-dev
2019-12-06 16:06:50 -07:00
saadrahim
bd59b6f880
Changing package dependency to rocm-dev (#160)
2019-12-06 14:00:25 -07:00
Wenkai Du
9e10cde644
Merge pull request #158 from wenkaidu/p2p
...
Change default P2P level
2019-12-04 16:30:58 -08:00
Wenkai Du
e9ca3a8029
Merge pull request #157 from wenkaidu/readme
...
Change manual build instructions to fit most common usage
2019-12-04 14:50:41 -08:00
Wenkai Du
90e928bcd5
Change default P2P level
2019-12-04 21:05:10 +00:00
Wenkai Du
6648c81dc6
Merge remote-tracking branch 'remotes/nccl/master' into rccl_2.5.6
2019-12-03 15:42:04 -08:00
Wenkai Du
00a910c2da
Change manual build instructions to fit most common usage
2019-11-26 12:40:26 -08:00
Wenkai Du
b1ed4b7fa8
Merge pull request #155 from wenkaidu/direct
...
Disable direct buffers to reduce scratch memory size
2019-11-21 09:39:09 -08:00
Wenkai Du
a0be2b8812
Disable direct buffers to reduce scratch memory size
2019-11-20 13:03:16 -08:00
Sylvain Jeaugey
299c554dcc
2.5.6-1 (#255)
...
Add LL128 Protocol.
Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.
Rework P2P/SHM detection in containers (#155, #248).
Detect duplicated devices and return an error (#231).
Add tuning for GCP.
2019-11-19 14:57:39 -08:00
Wenkai Du
9a70ee2eb1
Merge pull request #154 from wenkaidu/bf16
...
Add bfloat16 support in RCCL
2019-11-19 09:07:51 -08:00
Wenkai Du
4ca05c1297
Support bfloat16 on rest of the unit tests
2019-11-18 14:18:34 -08:00
Wenkai Du
bdac0256a5
Add bfloat16 all reduce unit test
2019-11-18 13:50:29 -08:00
Wenkai Du
5e109ed400
Add bfloat16 support in RCCL
...
Preprocessor symbol RCCL_BFLOAT16 is used as feature indicator
2019-11-18 13:45:53 -08:00
Wenkai Du
58a6e535f6
Merge pull request #153 from wenkaidu/fuji
...
Temporarily disable 0x803 target due to build error
2019-11-14 11:46:21 -08:00
Wenkai Du
cd7ab1425b
Temporarily disable 0x803 target due to build error
2019-11-14 11:17:41 -08:00
Wenkai Du
55c07e4fb7
Merge pull request #151 from wenkaidu/prim_test
...
rccl_prim_test: Generalize ring topology and duplications
2019-11-13 08:17:55 -08:00
Siu Chi Chan
453c735475
Merge pull request #152 from scchan/bump_hcc_version_check_32
...
Bump up HCC version for -hc-function-calls switch
2019-11-13 10:45:40 -05:00
Siu Chi Chan
08ba92f1b0
Bump up HCC version for -hc-function-calls switch
2019-11-12 14:16:35 -05:00
Wenkai Du
07bb6fce8f
rccl_prim_test: Generalize ring topology and duplications
...
Allow a user-specified ring topology on the command line, duplicated
to the requested number of workgroups:
./rccl_prim_test -w 12 -p copy -r "0 1 2 3|3 2 1 0|0 2 1 3|3 1 2 0|0 2 3 1|1 3 2 0"
2019-11-11 15:42:24 -08:00
Wenkai Du
277c72a638
Merge pull request #149 from wenkaidu/rtc
...
Correct RTC frequencies for profiling purpose
2019-11-06 08:02:58 -08:00
gilbertlee-amd
fd94f4fa25
Adding interactive mode for profiling purposes (#150)
2019-11-05 17:10:16 -07:00
Wenkai Du
8995047830
Correct RTC frequencies for profiling purpose
2019-11-05 11:36:45 -08:00
Wenkai Du
c49de785d2
Merge pull request #148 from wenkaidu/fine_grain
...
Check for fine grain support using memory allocation
2019-11-04 10:19:07 -08:00
Wenkai Du
669f1951a4
Check for fine grain support using memory allocation
2019-11-01 15:58:49 -07:00
Wenkai Du
90b2921207
Merge pull request #145 from wenkaidu/prim_test
...
rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup
2019-11-01 13:30:01 -07:00
gilbertlee-amd
2f9edd2432
Single Sync Timing mode (#144)
...
* Adding single sync timing mode to emulate timing reported by rccl-prim-test / rccl-tests
* Adding duration / overhead info
2019-11-01 10:18:25 -06:00
Jeff Daily
5a502955c9
additional check for fine grain support in p2pCanConnect (#146)
2019-10-31 08:58:38 -07:00
Wenkai Du
ab91cdd5c9
rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup
2019-10-30 13:15:02 -07:00
Gilbert Lee
648c1ee7cc
Adding ability to switch between fine/coarse grain destination GPU memory
...
Adding ability to switch between memset/memcpy
2019-10-29 12:00:32 -06:00
Wenkai Du
9be7ae8f0d
Merge pull request #140 from scchan/rocm210_hc_function_calls
...
add -hc-function-calls switch back for HCC ROCm 2.10
2019-10-28 09:56:47 -07:00
mhbliao
d89734234a
Merge pull request #142 from mhbliao/hliao/master/cmake
...
[cmake] Allow GPU targets to be parameterized with `AMDGPU_TARGETS`.
2019-10-28 08:33:30 -04:00
Michael LIAO
ec10a5cf14
[cmake] Allow GPU targets to be parameterized with AMDGPU_TARGETS.
2019-10-25 13:55:27 -04:00
Wenkai Du
b98d334114
Merge pull request #141 from wenkaidu/hdp
...
Disable HDP flush for RDMA
2019-10-24 16:26:01 -07:00
Wenkai Du
296176a4fd
Disable HDP flush for RDMA
2019-10-23 14:40:17 -07:00
Siu Chi Chan
d779eae1d0
add -hc-function-calls switch back for HCC ROCm 2.10
2019-10-21 18:00:02 -04:00
Wenkai Du
998ab83675
Merge pull request #138 from wenkaidu/slice_steps
...
Revert collective chunk and slice steps to avoid drop in throughput
2019-10-18 13:30:27 -07:00
Wenkai Du
df74d12946
Revert collective chunk and slice steps to avoid drop in throughput
2019-10-18 12:54:00 -07:00
saadrahim
a95529a6e2
CI Re-enabled for Ubuntu (#135)
2019-10-18 11:38:51 -06:00
gilbertlee-amd
60279867b3
Merge pull request #137 from gilbertlee-amd/GenericOpFix
...
Fix for GenericOp device primitive bug
2019-10-11 10:46:29 -06:00