Commit Graph

303 Commits

Author SHA1 Message Date
Gilbert Lee e5074ce94d Changing single sync mode to time all iterations instead of just last 2019-12-20 17:08:39 -08:00
gilbertlee-amd 000bce6f27 Removing OpenMP from unit tests (#163) 2019-12-20 11:41:56 -07:00
gilbertlee-amd 2f4269d06d Adding new sleep after sync capability for data fabric profiling (#162)
Fixing missing header include for ROCM 3.0 changes
2019-12-12 15:20:54 -07:00
saadrahim 0092b35132 Package fix (#161)
* Fixing RHEL dependency on rocm-dev
2019-12-06 16:06:50 -07:00
saadrahim bd59b6f880 Changing package dependency to rocm-dev (#160) 2019-12-06 14:00:25 -07:00
Wenkai Du 9e10cde644 Merge pull request #158 from wenkaidu/p2p
Change default P2P level
2019-12-04 16:30:58 -08:00
Wenkai Du e9ca3a8029 Merge pull request #157 from wenkaidu/readme
Change manual build instructions to fit most common usage
2019-12-04 14:50:41 -08:00
Wenkai Du 90e928bcd5 Change default P2P level 2019-12-04 21:05:10 +00:00
Wenkai Du 00a910c2da Change manual build instructions to fit most common usage 2019-11-26 12:40:26 -08:00
Wenkai Du b1ed4b7fa8 Merge pull request #155 from wenkaidu/direct
Disable direct buffers to reduce scratch memory size
2019-11-21 09:39:09 -08:00
Wenkai Du a0be2b8812 Disable direct buffers to reduce scratch memory size 2019-11-20 13:03:16 -08:00
Wenkai Du 9a70ee2eb1 Merge pull request #154 from wenkaidu/bf16
Add bfloat16 support in RCCL
2019-11-19 09:07:51 -08:00
Wenkai Du 4ca05c1297 Support bfloat16 on rest of the unit tests 2019-11-18 14:18:34 -08:00
Wenkai Du bdac0256a5 Add bfloat16 all reduce unit test 2019-11-18 13:50:29 -08:00
Wenkai Du 5e109ed400 Add bfloat16 support in RCCL
Preprocessor symbol RCCL_BFLOAT16 is used as feature indicator
2019-11-18 13:45:53 -08:00
Wenkai Du 58a6e535f6 Merge pull request #153 from wenkaidu/fuji
Temporary disable 0x803 target due to build error
2019-11-14 11:46:21 -08:00
Wenkai Du cd7ab1425b Temporary disable 0x803 target due to build error 2019-11-14 11:17:41 -08:00
Wenkai Du 55c07e4fb7 Merge pull request #151 from wenkaidu/prim_test
rccl_prim_test: Generalize ring topology and duplications
2019-11-13 08:17:55 -08:00
Siu Chi Chan 453c735475 Merge pull request #152 from scchan/bump_hcc_version_check_32
Bump up HCC version for -hc-function-calls switch
2019-11-13 10:45:40 -05:00
Siu Chi Chan 08ba92f1b0 Bump up HCC version for -hc-function-calls switch 2019-11-12 14:16:35 -05:00
Wenkai Du 07bb6fce8f rccl_prim_test: Generalize ring topology and duplications
Allow user specified ring topology from command line and duplicated
to requested number of workgroups:
./rccl_prim_test -w 12 -p copy -r "0 1 2 3|3 2 1 0|0 2 1 3|3 1 2 0|0 2 3 1|1 3 2 0"
2019-11-11 15:42:24 -08:00
Wenkai Du 277c72a638 Merge pull request #149 from wenkaidu/rtc
Correct RTC frequencies for profiling purpose
2019-11-06 08:02:58 -08:00
gilbertlee-amd fd94f4fa25 Adding interactive mode for profiling purposes (#150) 2019-11-05 17:10:16 -07:00
Wenkai Du 8995047830 Correct RTC frequencies for profiling purpose 2019-11-05 11:36:45 -08:00
Wenkai Du c49de785d2 Merge pull request #148 from wenkaidu/fine_grain
Check for fine grain support using memory allocation
2019-11-04 10:19:07 -08:00
Wenkai Du 669f1951a4 Check for fine grain support using memory allocation 2019-11-01 15:58:49 -07:00
Wenkai Du 90b2921207 Merge pull request #145 from wenkaidu/prim_test
rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup
2019-11-01 13:30:01 -07:00
gilbertlee-amd 2f9edd2432 Single Sync Timing mode (#144)
* Adding single sync timing mode to emulate timing reported by rccl-prim-test / rccl-tests
* Adding duration / overhead info
2019-11-01 10:18:25 -06:00
Jeff Daily 5a502955c9 additional check for fine grain support in p2pCanConnect (#146) 2019-10-31 08:58:38 -07:00
Wenkai Du ab91cdd5c9 rccl-prim-test: use hipExtLaunchMultiKernelMultiDevice and minor cleanup 2019-10-30 13:15:02 -07:00
Gilbert Lee 648c1ee7cc Adding ability to switch between fine/coarse grain destination GPU memory
Adding ability to switch between memset/memcpy
2019-10-29 12:00:32 -06:00
Wenkai Du 9be7ae8f0d Merge pull request #140 from scchan/rocm210_hc_function_calls
add -hc-function-calls switch back for HCC ROCm 2.10
2019-10-28 09:56:47 -07:00
mhbliao d89734234a Merge pull request #142 from mhbliao/hliao/master/cmake
[cmake] Allow GPU targets to be parameterized with `AMDGPU_TARGETS`.
2019-10-28 08:33:30 -04:00
Michael LIAO ec10a5cf14 [cmake] Allow GPU targets to be parameterized with AMDGPU_TARGETS. 2019-10-25 13:55:27 -04:00
Wenkai Du b98d334114 Merge pull request #141 from wenkaidu/hdp
Disable HDP flush for RDMA
2019-10-24 16:26:01 -07:00
Wenkai Du 296176a4fd Disable HDP flush for RDMA 2019-10-23 14:40:17 -07:00
Siu Chi Chan d779eae1d0 add -hc-function-calls switch back for HCC ROCm 2.10 2019-10-21 18:00:02 -04:00
Wenkai Du 998ab83675 Merge pull request #138 from wenkaidu/slice_steps
Revert collective chunk and slice steps to avoid drop in throughput
2019-10-18 13:30:27 -07:00
Wenkai Du df74d12946 Revert collective chunk and slice steps to avoid drop in throughput 2019-10-18 12:54:00 -07:00
saadrahim a95529a6e2 CI Re-enabled for Ubuntu (#135) 2019-10-18 11:38:51 -06:00
gilbertlee-amd 60279867b3 Merge pull request #137 from gilbertlee-amd/GenericOpFix
Fix for GenericOp device primitive bug
2019-10-11 10:46:29 -06:00
Gilbert Lee 37603ae6cb Reverting GenericOp bug workaround modifications to slice/chunk steps 2019-10-11 09:20:10 -07:00
Gilbert Lee 1392dd2997 Performing __threadfence_system() with only first thread 2019-10-11 09:16:19 -07:00
Gilbert Lee 8ae1bce3bb Fix for GenericOp device primitive bug 2019-10-10 22:39:45 -07:00
Wenkai Du 062c798c86 Merge pull request #136 from wenkaidu/tree
Enable tree kernels in build
2019-10-09 10:58:52 -07:00
Wenkai Du 662281e599 Merge pull request #134 from changpeng/master
Tuning the inline and unroll to reduce the scratch usage
2019-10-09 10:58:38 -07:00
Wenkai Du 76976c9e2e Enable tree kernels in build
Need to tune and specify NCCL_TREE_THRESHOLD to allow usage
2019-10-08 23:20:11 +00:00
Changpeng Fang eec319038e Tuning the inline and unroll to reduce the scratch usage
Summary:
 1. remove the noinline attribute for AllReduceThreeKernel;
 2. change AUTPUNROLL for tree functions to 1 or 2;
 Combining 1 and 2 will reduce the scratch usage from 1256 to 952
2019-10-08 14:02:25 -07:00
Wenkai Du de25e4cb7c Merge pull request #133 from scchan/hcc_version_detect
detect the hcc version and conditionally add -hc-function-calls
2019-10-03 10:53:45 -07:00
Siu Chi Chan b87ef4f152 detect the hcc version and conditionally add the -hc-function-calls switch 2019-10-03 13:25:25 -04:00