Wenkai Du
706de76046
Merge pull request #208 from wenkaidu/perf_xgmi
...
Give preference to path with more XGMI connections
2020-05-15 10:07:22 -07:00
Wenkai Du
b3c9852634
Give preference to path with more XGMI connections
2020-05-14 15:33:16 -07:00
Wenkai Du
f1058b6353
rccl-prim-test: add flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang
2020-05-12 23:54:07 +00:00
Saad Rahim
33c23fdcda
Merge remote-tracking branch 'upstream/master' into develop
2020-04-29 16:12:37 -07:00
Wenkai Du
5743c6b7d2
topo_expl: fix build error
2020-04-27 17:17:05 +00:00
Gilbert Lee
339bf9ff19
Adding option to re-use streams instead of re-creating per topology
2020-04-23 15:53:40 +00:00
Wenkai Du
ef7064ba9b
rccl-prim-test: auto-detect rings in 4P and 8P configurations
2020-04-10 18:17:21 +00:00
Aaron Enye Shi
a95090d981
Fix HIP-Clang build with HSA headers
...
HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.
2020-04-03 17:58:23 -04:00
Wenkai Du
6f54b23503
topo_expl: update to 2.6
2020-04-01 13:37:08 -07:00
Wenkai Du
ebc823e603
rccl-prim-test: add all-to-all benchmark (#185)
...
For gfx908, support simple detection of ring topology.
Call ReduceOrCopyMulti directly from kernel.
Also simplify code by removing the kernel start synchronization option,
which has no effect on throughput measurements.
2020-03-16 10:00:54 -07:00
Wenkai Du
32388d60a9
topo_expl: add a few more single node models
2020-03-02 11:43:03 -08:00
Wenkai Du
498d5029ad
Add topology visualizer tool
2020-02-26 15:23:34 -08:00
Wenkai Du
934b6de557
topo_expl: use bandwidth numbers defined in graph in CPU models
2020-02-26 14:17:36 -08:00
Wenkai Du
d2adc61bf6
Revise PCI BW numbers on Rome
2020-02-26 13:17:49 -08:00
Wenkai Du
55f8e2dec7
Add topology explorer
2020-02-19 14:42:06 -08:00
Stanley Tsang
20fa04d9b6
Updating copyright notices for 2020.
2020-01-29 15:28:08 -08:00
Wenkai Du
fe6d012eb0
Merge remote-tracking branch 'remotes/rccl/master' into rccl_2.5.6_cleanup
2020-01-29 15:28:03 -08:00
Wenkai Du
1e55645d97
Misc fixes and improvements for 2.5.6
...
1. Fix RCCL unit test
2. Add ROME detection and tuning
3. Change default P2P level
4. Fix search algorithm for XGMI
5. Replace explicit channel duplication with implicit duplication by using half of link speed
6. Add collective trace support
7. Correct Intel Skylake CPU detection and bandwidth
8. Fix topo connect function
9. Disable GDR read and remove unreachable code
10. Disable LL128 kernels
11. Add tuning parameters
12. Use original clock64() implementation which returns RTC counter value
13. Print out timestamp of collective trace
14. Do not use struct ncclColl in kernel launch parameter
15. Fix abort handling and add tracing
17. Add __launch_bounds__ to kernel functions
18. Remove unused abortCount
19. Unset default MIN_NRINGS and MIN_NCHANNELS
20. Do not allocate shared memory when not using LL128 kernels
21. Correct time print out in tuning log
2020-01-29 15:27:05 -08:00
Gilbert Lee
e5074ce94d
Changing single sync mode to time all iterations instead of just last
2019-12-20 17:08:39 -08:00
gilbertlee-amd
2f4269d06d
Adding new sleep after sync capability for data fabric profiling (#162)
...
Fixing missing header include for ROCM 3.0 changes
2019-12-12 15:20:54 -07:00
Wenkai Du
07bb6fce8f
rccl_prim_test: Generalize ring topology and duplications
...
Allow a user-specified ring topology from the command line, duplicated
to the requested number of workgroups:
./rccl_prim_test -w 12 -p copy -r "0 1 2 3|3 2 1 0|0 2 1 3|3 1 2 0|0 2 3 1|1 3 2 0"
2019-11-11 15:42:24 -08:00
Wenkai Du
277c72a638
Merge pull request #149 from wenkaidu/rtc
...
Correct RTC frequencies for profiling purpose
2019-11-06 08:02:58 -08:00
gilbertlee-amd
fd94f4fa25
Adding interactive mode for profiling purposes (#150)
2019-11-05 17:10:16 -07:00
Wenkai Du
8995047830
Correct RTC frequencies for profiling purpose
2019-11-05 11:36:45 -08:00
Wenkai Du
90b2921207
Merge pull request #145 from wenkaidu/prim_test
...
rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup
2019-11-01 13:30:01 -07:00
gilbertlee-amd
2f9edd2432
Single Sync Timing mode (#144)
...
* Adding single sync timing mode to emulate timing reported by rccl-prim-test / rccl-tests
* Adding duration / overhead info
2019-11-01 10:18:25 -06:00
Wenkai Du
ab91cdd5c9
rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup
2019-10-30 13:15:02 -07:00
Gilbert Lee
648c1ee7cc
Adding ability to switch between fine/coarse grain destination GPU memory
...
Adding ability to switch between memset/memcpy
2019-10-29 12:00:32 -06:00
rohit pathania
a270ee080e
Read operation throughput
2019-09-03 14:58:40 +05:30
rohit pathania
e5b13d69e5
display each workgroup, links and directions with throughputs
2019-08-30 13:28:23 +05:30
rpathani
40e30b5168
Update rccl_prim_test.cpp
2019-08-19 12:44:11 +05:30
rpathani
deea20d49c
Merge branch 'master' into xgmi_bench
2019-08-16 10:56:56 +05:30
Wenkai Du
f11c8f60cd
RCCL 2.4 update
2019-08-14 10:42:35 -07:00
rohit pathania
65e2f5d87b
Modified the code to use RTC clock frequency based on gpu gcn id
2019-08-14 12:55:12 +05:30
rohit pathania
0f74929dab
Merge branch 'xgmi_bench' of https://github.com/rpathani/rccl into xgmi_bench
...
# Conflicts:
# tools/rccl-prim-test/rccl_prim_test.cpp
2019-08-13 11:36:56 +05:30
rohit pathania
3bbf924ff8
Adding linkinfo and srcGPU to destGPU info
2019-08-13 11:28:50 +05:30
rohit pathania
5a2f74b8d0
Adding linkinfo and srcGPU to destGPU info
2019-08-09 12:44:06 +05:30
gilbertlee-amd
b8cf48fc16
Adding TransferBench tool (#113)
...
* Adding standalone TransferBench tool
2019-08-07 17:21:41 -06:00
Wenkai Du
70804da15b
Refactor primitive test to support multiple GPUs in rings (#94)
...
* Refactor primitive test to support multiple GPUs in rings
* Make GPUs sync before transfer optional
* Use same ring format as RCCL
* Extend to 8 GPUs and report errors if there is no P2P access
* Control GPUs sync before ops from command line with "-s" option
* Change buffer size through command line option "-n"
* Rename iterations command line option to "-i"
2019-07-05 14:29:20 -07:00
Wenkai Du
e6a0da444f
Match primitives unroll counts with latest RCCL (#91)
2019-06-26 15:09:13 -07:00
Wenkai Du
ee14676064
Calculate and print kernel throughput (#78)
...
* rccl-prim-test: print GPU info and set iterations
* Calculate and print kernel throughput
2019-06-07 10:39:30 -07:00
Wenkai Du
42b488507d
rccl-prim-test: print GPU info and set iterations (#77)
2019-06-05 15:16:33 -07:00
Wenkai Du
1bb6d2104c
Add RCCL primitive testing (#70)
2019-05-23 16:52:17 -06:00