rocm-systems

Author	SHA1	Message	Date
Wenkai Du	cfa97eccd3	Add IB/RDMA unit test	2020-06-16 18:29:17 +00:00
Wenkai Du	e80e29573c	Add gather, scatter and alltoall collectives. Introducing 3 new APIs: ncclResult_t ncclGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream); ncclResult_t ncclScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream); ncclResult_t ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream); Only out-of-place operation is supported. Preprocessor symbol RCCL_GATHER_SCATTER=1 indicates API availability. By default the APIs launch the RCCL kernel implementation, which can be disabled with RCCL_ALLTOALL_KERNEL_DISABLE=1; the APIs then use a wrapper around ncclSend and ncclRecv.	2020-06-09 17:44:08 -07:00
Wenkai Du	71ec3e09df	topo_expl: update to 2.7	2020-06-09 17:40:24 -07:00
Wenkai Du	706de76046	Merge pull request #208 from wenkaidu/perf_xgmi Give preference to path with more XGMI connections	2020-05-15 10:07:22 -07:00
Wenkai Du	b3c9852634	Give preference to path with more XGMI connections	2020-05-14 15:33:16 -07:00
Wenkai Du	f1058b6353	rccl-prim-test: add flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang	2020-05-12 23:54:07 +00:00
Saad Rahim	33c23fdcda	Merge remote-tracking branch 'upstream/master' into develop	2020-04-29 16:12:37 -07:00
Wenkai Du	5743c6b7d2	topo_expl: fix build error	2020-04-27 17:17:05 +00:00
Gilbert Lee	339bf9ff19	Adding option to re-use streams instead of re-creating per topology	2020-04-23 15:53:40 +00:00
Wenkai Du	ef7064ba9b	rccl-prim-test: auto-detect rings in 4P and 8P configurations	2020-04-10 18:17:21 +00:00
Aaron Enye Shi	a95090d981	Fix HIP-Clang build with HSA headers HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.	2020-04-03 17:58:23 -04:00
Wenkai Du	6f54b23503	topo_expl: update to 2.6	2020-04-01 13:37:08 -07:00
Wenkai Du	ebc823e603	rccl-prim-test: add all-to-all benchmark (#185). For gfx908, support simple detection of ring topology. Call ReduceOrCopyMulti directly from kernel. Also simplify code by removing the kernel start synchronization option, which has no effect on throughput measurements.	2020-03-16 10:00:54 -07:00
Wenkai Du	32388d60a9	topo_expl: add a few more single node models	2020-03-02 11:43:03 -08:00
Wenkai Du	498d5029ad	Add topology visualizer tool	2020-02-26 15:23:34 -08:00
Wenkai Du	934b6de557	topo_expl: use bandwidth numbers defined in graph in CPU models	2020-02-26 14:17:36 -08:00
Wenkai Du	d2adc61bf6	Revise PCI BW numbers on Rome	2020-02-26 13:17:49 -08:00
Wenkai Du	55f8e2dec7	Add topology explorer	2020-02-19 14:42:06 -08:00
Stanley Tsang	20fa04d9b6	Updating copyright notices for 2020.	2020-01-29 15:28:08 -08:00
Wenkai Du	fe6d012eb0	Merge remote-tracking branch 'remotes/rccl/master' into rccl_2.5.6_cleanup	2020-01-29 15:28:03 -08:00
Wenkai Du	1e55645d97	Misc fixes and improvements for 2.5.6: 1. Fix RCCL unit test 2. Add ROME detection and tuning 3. Change default P2P level 4. Fix search algorithm for XGMI 5. Replace explicit channel duplication with implicit duplication by using half of the link speed 6. Add collective trace support 7. Correct Intel Skylake CPU detection and bandwidth 8. Fix topo connect function 9. Disable GDR read and remove unreachable code 10. Disable LL128 kernels 11. Add tuning parameters 12. Use original clock64() implementation, which returns the RTC counter value 13. Print out timestamp of collective trace 14. Do not use struct ncclColl in kernel launch parameter 15. Fix abort handling and add tracing 17. Add __launch_bounds__ to kernel functions 18. Remove unused abortCount 19. Unset default MIN_NRINGS and MIN_NCHANNELS 20. Do not allocate shared memory when not using LL128 kernels 21. Correct time printout in tuning log	2020-01-29 15:27:05 -08:00
Gilbert Lee	e5074ce94d	Changing single sync mode to time all iterations instead of just last	2019-12-20 17:08:39 -08:00
gilbertlee-amd	2f4269d06d	Adding new sleep after sync capability for data fabric profiling (#162). Fixing missing header include for ROCM 3.0 changes	2019-12-12 15:20:54 -07:00
Wenkai Du	07bb6fce8f	rccl_prim_test: Generalize ring topology and duplications. Allow a user-specified ring topology from the command line, duplicated to the requested number of workgroups: ./rccl_prim_test -w 12 -p copy -r "0 1 2 3|3 2 1 0|0 2 1 3|3 1 2 0|0 2 3 1|1 3 2 0"	2019-11-11 15:42:24 -08:00
Wenkai Du	277c72a638	Merge pull request #149 from wenkaidu/rtc Correct RTC frequencies for profiling purpose	2019-11-06 08:02:58 -08:00
gilbertlee-amd	fd94f4fa25	Adding interactive mode for profiling purposes (#150)	2019-11-05 17:10:16 -07:00
Wenkai Du	8995047830	Correct RTC frequencies for profiling purpose	2019-11-05 11:36:45 -08:00
Wenkai Du	90b2921207	Merge pull request #145 from wenkaidu/prim_test rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup	2019-11-01 13:30:01 -07:00
gilbertlee-amd	2f9edd2432	Single Sync Timing mode (#144) * Adding single sync timing mode to emulate timing reported by rccl-prim-test / rccl-tests * Adding duration / overhead info	2019-11-01 10:18:25 -06:00
Wenkai Du	ab91cdd5c9	rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup	2019-10-30 13:15:02 -07:00
Gilbert Lee	648c1ee7cc	Adding ability to switch between fine/coarse grain destination GPU memory Adding ability to switch between memset/memcpy	2019-10-29 12:00:32 -06:00
rohit pathania	a270ee080e	Read operation throughput	2019-09-03 14:58:40 +05:30
rohit pathania	e5b13d69e5	Display each workgroup, links, and directions with throughputs	2019-08-30 13:28:23 +05:30
rpathani	40e30b5168	Update rccl_prim_test.cpp	2019-08-19 12:44:11 +05:30
rpathani	deea20d49c	Merge branch 'master' into xgmi_bench	2019-08-16 10:56:56 +05:30
Wenkai Du	f11c8f60cd	RCCL 2.4 update	2019-08-14 10:42:35 -07:00
rohit pathania	65e2f5d87b	Modified the code to use RTC clock frequency based on GPU GCN ID	2019-08-14 12:55:12 +05:30
rohit pathania	0f74929dab	Merge branch 'xgmi_bench' of https://github.com/rpathani/rccl into xgmi_bench # Conflicts: # tools/rccl-prim-test/rccl_prim_test.cpp	2019-08-13 11:36:56 +05:30
rohit pathania	3bbf924ff8	Adding linkinfo and srcGPU to destGPU info	2019-08-13 11:28:50 +05:30
rohit pathania	5a2f74b8d0	Adding linkinfo and srcGPU to destGPU info	2019-08-09 12:44:06 +05:30
gilbertlee-amd	b8cf48fc16	Adding TransferBench tool (#113) * Adding standalone TransferBench tool	2019-08-07 17:21:41 -06:00
Wenkai Du	70804da15b	Refactor primitive test to support multiple GPUs in rings (#94) * Refactor primitive test to support multiple GPUs in rings * Make GPUs sync before transfer optional * Use same ring format as RCCL * Extend to 8 GPUs and report errors if there is no P2P access * Control GPUs sync before ops from command line with "-s" option * Change buffer size through command line option "-n" Rename iterations command line option to "-i"	2019-07-05 14:29:20 -07:00
Wenkai Du	e6a0da444f	Match primitives unroll counts with latest RCCL (#91)	2019-06-26 15:09:13 -07:00
Wenkai Du	ee14676064	Calculate and print kernel throughput (#78) * rccl-prim-test: print GPU info and set iterations * Calculate and print kernel throughput	2019-06-07 10:39:30 -07:00
Wenkai Du	42b488507d	rccl-prim-test: print GPU info and set iterations (#77)	2019-06-05 15:16:33 -07:00
Wenkai Du	1bb6d2104c	Add RCCL primitive testing (#70)	2019-05-23 16:52:17 -06:00

46 Commits