Bertan Dogancay
8a442faa12
Nvtx support ( #1076 )
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du
5257c753c5
msccl: use relaxed atomics on scratch buffer ( #1075 )
2024-02-08 12:09:56 -08:00
Wenkai Du
704c9ef0d1
Doubling P2P channels per peer on single node gfx94x only ( #1074 )
2024-02-07 14:05:57 -08:00
Wenkai Du
1d989f6524
Doubling P2P channels per peer on single node only ( #1069 )
2024-02-02 12:41:00 -08:00
Bertan Dogancay
01b359027b
Include common.h in enqueue.cc instead ( #1067 )
2024-01-30 08:24:22 -08:00
Wenkai Du
f7550d83b8
msccl: ensure memory coherence after data receive ( #1062 )
2024-01-30 08:22:50 -08:00
Pedram Alizadeh
ccfb35fa6d
modifying the tuning table to improve the performance of allreduce for 8MB and 16MB for single-node MI300X ( #1063 )
2024-01-26 09:05:53 -05:00
Wenkai Du
be8ef4367f
colltrace: fix dropped trace messages ( #1059 )
* colltrace: fix dropped trace messages
* Remove extra space
2024-01-25 13:31:53 -08:00
Wenkai Du
ffde530af5
Increase P2P channels per peer ( #1060 )
2024-01-25 11:21:58 -08:00
Wenkai Du
7987015a19
Revert "msccl: build same number of kernels as in ROCm 5.7" ( #1058 )
This reverts commit f960174d03be7e5174baa83b256526d388a38842.
2024-01-24 08:43:50 -08:00
Bertan Dogancay
5564d65e71
Use binary search for direct function calls ( #1057 )
* Use binary search for direct function calls
* fix scratch mem issue on MI300
2024-01-22 17:37:56 -07:00
Bertan Dogancay
c4dbf8a914
Fix collective trace when rccl is configured ( #1056 )
* Fix collective trace when rccl is configured
2024-01-22 09:26:44 -07:00
Wenkai Du
7e25d5bc55
Use new HIP graph API compatible with CUDA 11030 ( #991 )
* Use new HIP graph API compatible with CUDA 11030
* Update dependency to ROCm 6.1
* Fix single stream use case
2024-01-21 19:00:50 -08:00
Nilesh M Negi
8b97a20943
COLLECTIVES: Switch to unroll 2 for MI300 ( #1051 )
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-19 12:16:05 -06:00
Bertan Dogancay
28d9b170c9
[DEV] Configure functions in RCCL ( #986 )
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Wenkai Du
3325f96c56
Only use full MAXCHANNELS for gfx94x ( #1050 )
2024-01-17 09:00:49 -08:00
Pedram Alizadeh
b08124c85d
adding rccl tuning parameters for MI300X gfx942 with 8 GPUs single and multi-node ( #1047 )
2024-01-16 13:44:32 -05:00
Wenkai Du
261707d90a
Add option to force enable network transport on single node ( #1046 )
2024-01-16 07:54:18 -08:00
PedramAlizadeh
767fde8210
Revert "2.18.5-1"
This reverts commit 559b70f86c.
2024-01-12 16:54:19 +00:00
Bertan Dogancay
cf248d9402
Addressing the compiler warning ( #988 )
2024-01-10 14:59:40 -07:00
Hossein Pourreza
735178c1fe
cover more gpu/nic mapping cases ( #1037 )
2024-01-10 08:01:37 -08:00
Wenkai Du
5851ae5974
Re-enable L128 on gfx90a if compiler supports it ( #1036 )
2024-01-10 08:01:11 -08:00
Nilesh M Negi
249e9f7f65
Un-escaped character causes error with address sanitizer builds ( #992 )
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Jenkins <jenkins-compute@amd.com>
2024-01-09 13:28:32 -06:00
Pedram Alizadeh
aa5c84c997
Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6
Sync to nccl 2.18.6
2024-01-09 13:29:29 -05:00
Wenkai Du
d9871d171b
msccl: use custom reduce function ( #1033 )
2024-01-08 14:53:12 -08:00
Wenkai Du
f7e39fced2
Doubling buffer size to fix NCCL INFO corruption with increased channels ( #1035 )
2024-01-08 08:14:33 -08:00
Wenkai Du
e5bf56c6d8
Increase stack size for gfx906 ( #1034 )
Occasionally, "Memory access fault by GPU node-8 (Agent handle: 0x23a5640) on address 0x7f461ec00000. Reason: Page not present or supervisor privilege" can be seen in gfx906 CI
2024-01-07 20:25:02 -08:00
Ziyue Yang
70bbeb4773
Fix MSCCL multi-node ( #1032 )
1) Move needsProxy initialization before mscclSetupConnections, since the latter
revises it later.
2) Remove the mscclAvailable check in net.cc, since it is no longer required and caused
a non-shared buffer to be allocated for MSCCL, which is not expected.
2024-01-05 17:03:43 -08:00
Wenkai Du
abf265a911
Rework barriers and adjust scope of atomics ( #1019 )
2024-01-04 08:18:48 -08:00
Ziyue Yang
0a53077c9c
Improve MSCCL algorithms ( #1023 )
2024-01-03 14:51:34 -08:00
akolliasAMD
f4858e14b2
rearranged how the min and max functions are part of msccl ( #1025 )
* rearranged how the min and max functions are part of msccl
* added more coverage on in place graph tests
2023-12-21 08:58:33 -07:00
PedramAlizadeh
0d515f9388
resolved conflicts, fixed the localNetCount/0 bug
2023-12-18 08:11:34 +00:00
Ziyue Yang
655742a3a6
Fully disable MSCCL when machine is not matched ( #1017 )
* Disable MSCCL algorithm meta loading when machine is not matched
* fully disable init
* fix potential segfault
2023-12-13 08:36:21 -08:00
Wenkai Du
53d807a5b9
msccl: disable on multi-node ( #1018 )
2023-12-13 07:41:40 -08:00
Wenkai Du
81602814a7
msccl: fix data corruption with MTYPE_RW ( #1014 )
2023-12-11 20:33:15 -08:00
Wenkai Du
7965c8b53c
Fix memory fence and use non-temporal store ( #1007 )
* Fix memory fence and use non-temporal store
* Use amdgcn builtin instead of inline asm
* Move threadfence location
* Revert changes to gfx90a
* Rework gfx90a change
* Apply changes to gfx94x
2023-12-09 12:16:08 -08:00
Ziyue Yang
c002f20029
Fix MSCCL scratch allocation ( #1010 )
2023-12-08 17:47:10 -06:00
Wen-Heng (Jack) Chung
baadda4bd8
Relax workgroup barrier implementation for MSCCL send/recv ops. ( #997 )
* Trim logic.
* Revert "Trim logic."
This reverts commit 8f2dba6c764108acf2bf5428366b9f41d4d206b9.
* Introduce MSCCL template parameters to send / recv.
* Address review feedback.
2023-12-08 17:46:53 -06:00
Wenkai Du
12c08fc52a
msccl: build same number of kernels as in ROCm 5.7 ( #1005 )
Removed fullOps kernels from build
2023-12-07 13:36:04 -06:00
Wen-Heng (Jack) Chung
293f0fb752
Use a map to host scratch buffers ( #1004 )
* Use a map to host scratch buffers
* Address review feedback. Deliberately keep mscclSetupScratch function.
2023-12-05 13:15:28 -06:00
Nilesh M Negi
bc44e3faa7
Fix gcnArch bug in IFC mix build ( #998 ) ( #1002 )
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-12-04 16:20:22 -06:00
Bertan Dogancay
7c0f49a878
IFC mix build ( #998 )
2023-12-02 18:49:52 -07:00
Wenkai Du
4ba65d1d6a
Increase max channels to 64 ( #993 )
2023-12-01 16:01:11 -08:00
pradeep-ramanna
0b53f79196
Fix GPU to NIC mapping for peertopeer ( #994 )
2023-12-01 08:00:17 -08:00
Ziyue Yang
e44e112a17
Fix mscclAlgoHandle not initialized issue ( #995 )
2023-12-01 07:58:01 -08:00
Ziyue Yang
4bb0b4a380
Move MSCCL algorithm loading to initialization to workaround HIP graph conflict ( #982 )
* MSCCL: pre-specify channels and pre-load algorithms
* add mutex
* fix bug
* clean include
* disable all-gathers temporarily
2023-11-30 09:47:20 -08:00
akolliasAMD
56ce9ef05f
recreated pr 914 to work with current develop branch ( #979 )
2023-11-28 16:33:47 -07:00
Wenkai Du
50b2dd9fd7
Add special handling of gfx940 ( #976 )
* Add special handling of gfx940
* Update ring base
2023-11-22 15:07:36 -08:00
Wenkai Du
569d3f7d59
msccl: allocate scratch as ext-scope fine-grained ( #968 )
2023-11-16 09:57:25 -06:00
Wenkai Du
bc8661f092
Fix kernel command line warnings ( #961 )
* Fix kernel command line warnings
* Remove while loop
2023-11-15 18:01:12 -08:00