Wenkai Du
4e1b8c1cbb
MSCCL: add support for out-of-place all reduce ( #1156 )
2024-04-28 19:49:09 -07:00
Wenkai Du
9e0c9b4ed8
Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ ( #1154 )
2024-04-25 07:19:18 -07:00
gilbertlee-amd
4cb62f999a
Rail optimization for rings ( #1140 )
...
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
2024-04-15 12:03:57 -06:00
gilbertlee-amd
93982533d7
[topo_expl] Adding -n option to override number of nodes ( #1134 )
2024-04-04 15:11:47 -06:00
Wenkai Du
e8c76fd806
rccl_prim_test: increase max number of workgroups and test iterations ( #1132 )
2024-04-03 11:29:21 -07:00
corey-derochie-amd
503a472a25
Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. ( #1125 )
2024-03-25 16:29:13 -06:00
corey-derochie-amd
9eefc68cb5
Fixes the copyright comment block on each of topo_expl/models/*.xml. The format was not valid XML. ( #1124 )
2024-03-25 16:21:17 -06:00
Pedram Alizadeh
c2fc1d6809
msccl algorithms tuning for alltoall on MI300 ( #1120 )
...
Co-authored-by: PedramAlizadeh <amd@pmohamma.com>
2024-03-21 20:35:29 -04:00
Pedram Alizadeh
50f22e8317
msccl algorithms tuning for allgather on MI300 ( #1110 )
2024-03-14 12:18:26 -04:00
Andy li
6777e65c1d
Enable fp8 support ( #1101 )
...
* initial checkin
* resolve cr comments
* resolve the build issue
* fix the data correctness issue
* update fp8 header file and update the unit test for fp8 support
* remove fp16 from fp8 headers
* fix ut issue and catch up the latest code from develop
* update according to cr comments
* update ut according to cr comments
* update num floats for each SumPostDiv from 4 to 6
* update fp8 header file name
* fix the typo
2024-03-08 15:17:53 -08:00
Wenkai Du
d2224fd3e1
topo_expl: 2.19.4 update and fix build error ( #1098 )
2024-03-07 08:52:50 -08:00
Wenkai Du
df98a6957d
Add another Rome model ( #1095 )
2024-02-28 10:46:05 -08:00
Wenkai Du
74f9e5db64
Add new GPU model ( #1080 )
2024-02-23 12:19:42 -08:00
Pedram Alizadeh
5a0f9990a9
msccl algorithms tuning for allreduce on MI300 ( #1088 )
2024-02-21 11:31:56 -05:00
BertanDogancay
76f83f95ab
Merge remote-tracking branch 'rccl/develop' into 2.19.4
2024-02-15 13:37:14 -08:00
akolliasAMD
16d7f372b7
Npkit updates ( #1084 )
...
* removed warmup runs to be an opt in
2024-02-15 07:48:45 -07:00
Wenkai Du
d1575a1622
topo_expl: 2.19 update
2024-01-31 16:11:14 -06:00
Wenkai Du
600b44fee5
topo-expl: fix broken build ( #1048 )
2024-01-17 08:59:03 -08:00
Wenkai Du
f7e39fced2
Doubling buffer size to fix NCCL INFO corruption with increased channels ( #1035 )
2024-01-08 08:14:33 -08:00
Wenkai Du
cfc04a8aef
p2p-latency-tests: fix build by switching to gcnArchName ( #1030 )
...
* p2p-latency-tests: fix build by switching to gcnArchName
* rccl-prim-test: switch to gcnArchName
2024-01-04 13:36:48 -08:00
Ziyue Yang
0a53077c9c
Improve MSCCL algorithms ( #1023 )
2024-01-03 14:51:34 -08:00
Ziyue Yang
bb144dcd50
Tune MSCCL all-reduce algorithm ( #1009 )
2023-12-08 17:47:02 -06:00
Wen-Heng (Jack) Chung
8e8323252a
Let 320KB message size use LL protocol. ( #1006 )
2023-12-06 18:14:31 -06:00
Ziyue Yang
e44e112a17
Fix mscclAlgoHandle not initialized issue ( #995 )
2023-12-01 07:58:01 -08:00
Ziyue Yang
4bb0b4a380
Move MSCCL algorithm loading to initialization to workaround HIP graph conflict ( #982 )
...
* MSCCL: pre-specify channels and pre-load algorithms
* add mutex
* fix bug
* clean include
* disable all-gathers temporarily
2023-11-30 09:47:20 -08:00
akolliasAMD
c71bae1608
npkit trace script now syncs the on average difference per rank ( #981 )
2023-11-28 11:03:55 -07:00
gilbertlee-amd
213869a6b4
JitterBench ( #975 )
2023-11-23 11:14:11 -07:00
Wenkai Du
50b2dd9fd7
Add special handling of gfx940 ( #976 )
...
* Add special handling of gfx940
* Update ring base
2023-11-22 15:07:36 -08:00
Ziyue Yang
7ae95db5b8
Optimize MSCCL all-gather algorithms for gfx942 ( #964 )
2023-11-15 08:18:59 -08:00
gilbertlee-amd
d50bab28bf
Adding LaunchBench tool ( #952 )
2023-11-03 12:04:52 -06:00
akolliasAMD
9f02ee8dea
Revert "Introduce allgather for MSCCL on 8 sockets up to 320KB. ( #931 )" ( #939 )
...
This reverts commit bfb8642450.
2023-10-30 23:52:58 -06:00
Wen-Heng (Jack) Chung
bfb8642450
Introduce allgather for MSCCL on 8 sockets up to 320KB. ( #931 )
2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung
3f9ffe4788
Introduce allgather MSCCL XML specification for MI250X up to 320KB. ( #930 )
2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung
72d5fbddfd
Introduce 1-shot allreduce for MI250X Hayabusa. ( #929 )
2023-10-24 16:31:18 -05:00
Wen-Heng (Jack) Chung
341926c60a
Introduce 1pass allreduce. Tailor it for very small message sizes <= 20KB. ( #919 )
2023-10-16 16:31:08 -05:00
mberenjk
7e2d905376
adding cuda support for EmptyKernelTest ( #913 )
2023-10-11 14:11:12 -05:00
gilbertlee-amd
7dbf47e07b
Adding a simple EmptyKernelTest to measure launch latency ( #910 )
2023-10-04 17:22:48 -06:00
Pedram Alizadeh
3f6c2b9b32
Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite ( #895 )
2023-09-27 12:44:36 -04:00
akolliasAMD
762a42859e
Fixed topo_expl ( #891 )
2023-09-13 12:05:35 -06:00
Audrey MP
e58ec78d35
Gcn arch name ( #886 )
...
We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
Wenkai Du
c6dd6f6237
Update ll_latency_test and add CUDA version ( #873 )
2023-08-30 16:29:42 -07:00
Wenkai Du
aa95985867
rccl-prim-test: use non-temporal access ( #867 )
2023-08-28 08:28:05 -07:00
Wenkai Du
aeca1af374
Add MSCCL xml files ( #861 )
2023-08-23 14:12:34 -07:00
akolliasAMD
d33cd5a233
NCCL_TREES variable and rome model fixes ( #856 )
2023-08-21 10:35:37 -06:00
Wenkai Du
148e3430f4
p2p/ll-latency-test: convert to single thread tests ( #857 )
2023-08-21 07:48:37 -07:00
Wenkai Du
7044599575
Add new model support ( #847 )
...
* Add new model support
* Update new rings
2023-08-10 17:14:51 -07:00
Wenkai Du
0441eec32e
p2p_latency_test: clean up IPC temp files at exit ( #832 )
2023-07-31 08:11:07 -07:00
Wenkai Du
c424979c14
ll_latency_test: fix time calculation ( #825 )
...
* ll_latency_test: fix time calculation
* Measure time after barrier
* Read time stamp only from thread 0
2023-07-27 09:04:35 -07:00
Wenkai Du
1c1ec096e2
tools: Add LL latency test ( #820 )
...
* Add LL latency test
* Correct name in usage
2023-07-25 20:08:04 -07:00
Bertan Dogancay
8bab4f04b7
Implement RCCL Replayer ( #817 )
...
* Implement RCCL Replayer
2023-07-24 16:26:22 -06:00