Wenkai Du
abf265a911
Rework barriers and adjust scope of atomics ( #1019 )
2024-01-04 08:18:48 -08:00
Ziyue Yang
0a53077c9c
Improve MSCCL algorithms ( #1023 )
2024-01-03 14:51:34 -08:00
akolliasAMD
f4858e14b2
rearranged how the min and max functions are part of msccl ( #1025 )
* rearranged how the min and max functions are part of msccl
* added more coverage on in place graph tests
2023-12-21 08:58:33 -07:00
Ziyue Yang
655742a3a6
Fully disable MSCCL when machine is not matched ( #1017 )
* Disable MSCCL algorithm meta loading when machine is not matched
* fully disable init
* fix potential segfault
2023-12-13 08:36:21 -08:00
Wenkai Du
53d807a5b9
msccl: disable on multi-node ( #1018 )
2023-12-13 07:41:40 -08:00
Wenkai Du
81602814a7
msccl: fix data corruption with MTYPE_RW ( #1014 )
2023-12-11 20:33:15 -08:00
Wenkai Du
7965c8b53c
Fix memory fence and use non-temporal store ( #1007 )
* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x
2023-12-09 12:16:08 -08:00
Ziyue Yang
c002f20029
Fix MSCCL scratch allocation ( #1010 )
2023-12-08 17:47:10 -06:00
Wen-Heng (Jack) Chung
baadda4bd8
Relax workgroup barrier implementation for MSCCL send/recv ops. ( #997 )
* Trim logic.
* Revert "Trim logic."
This reverts commit 8f2dba6c764108acf2bf5428366b9f41d4d206b9.
* Introduce MSCCL template parameters to send / recv.
* Address review feedbacks.
2023-12-08 17:46:53 -06:00
Wenkai Du
12c08fc52a
msccl: build same number of kernels as in ROCm 5.7 ( #1005 )
Removed fullOps kernels from build
2023-12-07 13:36:04 -06:00
Wen-Heng (Jack) Chung
293f0fb752
Use a map to host scratch buffers ( #1004 )
* Use a map to host scratch buffers
* Address review feedbacks. Deliberately keep mscclSetupScratch function.
2023-12-05 13:15:28 -06:00
Nilesh M Negi
bc44e3faa7
Fix gcnArch bug in IFC mix build ( #998 ) ( #1002 )
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-12-04 16:20:22 -06:00
Bertan Dogancay
7c0f49a878
IFC mix build ( #998 )
2023-12-02 18:49:52 -07:00
Wenkai Du
4ba65d1d6a
Increase max channels to 64 ( #993 )
2023-12-01 16:01:11 -08:00
pradeep-ramanna
0b53f79196
Fix GPU to NIC mapping for peertopeer ( #994 )
2023-12-01 08:00:17 -08:00
Ziyue Yang
e44e112a17
Fix mscclAlgoHandle not initialized issue ( #995 )
2023-12-01 07:58:01 -08:00
Ziyue Yang
4bb0b4a380
Move MSCCL algorithm loading to initialization to workaround HIP graph conflict ( #982 )
* MSCCL: pre-specify channels and pre-load algorithms
* add mutex
* fix bug
* clean include
* disable all-gathers temporarily
2023-11-30 09:47:20 -08:00
akolliasAMD
56ce9ef05f
recreated pr 914 to work with current develop branch ( #979 )
2023-11-28 16:33:47 -07:00
Wenkai Du
50b2dd9fd7
Add special handling of gfx940 ( #976 )
* Add special handling of gfx940
* Update ring base
2023-11-22 15:07:36 -08:00
Wenkai Du
569d3f7d59
msccl: allocate scratch as ext-scope fine-grained ( #968 )
2023-11-16 09:57:25 -06:00
Wenkai Du
bc8661f092
Fix kernel command line warnings ( #961 )
* Fix kernel command line warnings
* Remove while loop
2023-11-15 18:01:12 -08:00
Ziyue Yang
7fc891bc8d
Fix MSCCL work FIFO allocation with HIP graph enabled ( #967 )
2023-11-15 16:43:28 -08:00
Bertan Dogancay
198f14923b
Check to support older ROCm versions ( #963 )
2023-11-15 12:36:31 -07:00
Ziyue Yang
7ae95db5b8
Optimize MSCCL all-gather algorithms for gfx942 ( #964 )
2023-11-15 08:18:59 -08:00
Ziyue Yang
df128879a6
Optimize MSCCL reduce primitive switching for gfx942 ( #962 )
* Optimize reduce primitive switching for gfx942
* address comment
2023-11-15 08:18:44 -08:00
Wenkai Du
5a800e00cd
msccl: enable basic collective trace ( #959 )
To avoid increasing number of kernels, colltrace is only enabled with
RCCL_MSCCL_FORCE_FULLOPS=1
2023-11-08 20:14:28 -08:00
Wen-Heng (Jack) Chung
efc42d9045
Use send instead of sendWithBarrier. ( #727 )
2023-11-07 13:47:24 -06:00
Nusrat Islam
022735d208
Merge pull request #950 from nusislam/msccl-red2
msccl: remove cases from numReduction switch statement
2023-11-04 02:48:03 -05:00
Wenkai Du
dbcba2923b
Use parallel init of LDS and adjust P2P channels for gfx94x ( #943 )
* Use parallel init of LDS and adjust P2P channels for gfx94x
* Move another init to parallel
* Fix NCCL_NCHANNELS_PER_PEER setting
2023-11-03 16:06:49 -07:00
Nusrat Islam
f545b94d4b
msccl: remove cases from numReduction switch statement
2023-11-03 16:56:51 -05:00
Wenkai Du
bb84345943
msccl: use 32-bit LDS access and add RCCL_MSCCL_FORCE_FULLOPS ( #953 )
2023-11-03 10:38:02 -07:00
akolliasAMD
988efe605a
MSCCL stream fix ( #948 )
2023-11-03 09:10:52 -06:00
Wenkai Du
f484ff17b9
msccl: add templated kernel ( #945 )
* msccl: add templated kernel
* Use defines to improve code readability
* Fix kernel indexing and review feedback
2023-11-02 17:21:53 -07:00
Nusrat Islam
6b80a0d0d4
msccl: remove dereference of reduce args
It can be removed because the msccl kernel will never execute this code
according to the current msccl setup.
2023-11-02 13:20:00 -05:00
Wenkai Du
a7400218a2
msccl: use atomic to set dependency flags ( #941 )
2023-10-31 14:46:57 -07:00
Wenkai Du
a497722894
NPKit: misc fixes for MSCCL ( #936 )
* msccl: add xcc_id to timestamp sync
* NPKit: add timestamp for rrc operator
* NPKit: add timestamp for MSCCL init
2023-10-30 10:00:12 -07:00
Nilesh M Negi
1e5ca6820b
Fix gcnArchName bug in topology dump ( #937 )
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-28 12:30:36 -05:00
Ziyue Yang
4c117e5335
Fix MSCCL work FIFO out-of-bound issue ( #935 )
2023-10-27 11:24:52 -07:00
Nilesh M Negi
96ec3ffe2e
SRC/INIT: fix typo for ENABLE_PROFILING ( #934 )
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 23:52:46 -05:00
Nilesh M Negi
f22df90e5c
remove gcnArch support ( #920 )
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 12:09:15 -05:00
Wenkai Du
fb0eccb57b
msccl: reduce debug output when using NCCL_DEBUG=INFO ( #932 )
2023-10-25 08:05:19 -07:00
Wenkai Du
c4e65fd382
Add missing gfx942 support ( #927 )
2023-10-23 12:04:37 -07:00
Wenkai Du
dbb5611a3a
Remove LDS based software barriers from MSCCL ( #923 )
2023-10-19 16:39:41 -05:00
Wenkai Du
4278a9918b
Update rome models ( #922 )
2023-10-18 17:28:01 -07:00
Wenkai Du
39812ce757
NPKit: add xcc_id field ( #918 )
2023-10-13 15:24:59 -07:00
Wenkai Du
1b80d041cb
Fix incorrect arch name parsing ( #916 )
2023-10-13 10:01:11 -07:00
Wenkai Du
6d0b5c1e89
Port init_once fix from NCCL ( #915 )
2023-10-13 08:01:12 -07:00
Wen-Heng (Jack) Chung
7ee5c1c28b
Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR ( #911 )
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
* Only build gfx941
* demo
* fine tune malloc
* Fix merge errors
* Fix merge errors
* Disable parallel build
* Adopt --amdgpu-kernarg-preload-count
* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"
This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.
* Revert CMake changes.
* NPKIT changes.
* Remove some license declarations.
* Address code review feedbacks on msccl_kernel_impl.h
* Update CMakeLists.txt
* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count
* Fix NPKIT trace logic.
---------
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2023-10-12 20:17:08 -05:00
Bertan Dogancay
a6ff4618c7
Revert "Remove 2H4P condition from P2P channels adjustment ( #890 )" ( #904 )
This reverts commit 16dd05a58a.
2023-10-04 09:46:11 -06:00
akolliasAMD
28d7fe5629
Dma buf support optin ( #905 )
* dmaBufSupport Optin added on every part of the code that should invoke it
2023-10-03 03:17:48 -06:00