Rahul Vaidya
c755b9cf93
Improved version reporting in NCCL_DEBUG=VERSION ( #1232 )
...
* Improved version reporting in NCCL_DEBUG=VERSION.
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
* Version reporting changes
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
* Versioning changes: Initialized char arrays to null and fixed typo.
---------
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
2024-07-12 08:14:29 -05:00
akolliasAMD
63e4d76e23
gfx12 initial enablement ( #1219 )
2024-07-10 13:32:09 -06:00
akolliasAMD
6475da2ed9
fixed typo on BFD linkage ( #1192 )
2024-06-03 10:05:47 -06:00
Nilesh M Negi
5aaf7121d9
[BUILD] Update install.sh for RCCL build ( #1191 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-05-31 17:58:34 -05:00
Edgar Gabriel
9ad913bfa8
add alternative to rocm_smi_lib
2024-05-14 13:51:41 -07:00
Wenkai Du
a64aab5f63
Use rocm-smi thread only mutex when available ( #1169 )
2024-05-08 14:32:24 -07:00
Wenkai Du
b18784d8b8
Add compiler warning for uninitialized variable and fix ( #1163 )
...
* Add compiler warning for uninitialized variable and fix
* Add -Wsometimes-uninitialized
* Convert warning to error
2024-05-08 07:00:25 -07:00
BertanDogancay
e1a835910e
Merge remote-tracking branch 'nccl/master' into develop
2024-04-23 13:34:00 -07:00
Bertan Dogancay
3caad91f32
Add unique files to source list ( #1144 )
2024-04-15 09:46:53 -06:00
mberenjk
428837ffe4
replacing rccl_bfloat16 with hip_bfloat16 ( #1126 )
...
Co-authored-by: mberenjk <mberenjk@amd.com >
2024-04-11 11:30:37 -05:00
arvindcheru
c1b8eab8e1
Update Depends with correct HIP Runtime package name ( #1130 )
2024-04-09 19:27:07 -04:00
arvindcheru
c0a51dc84b
Static Build update - Moved all cmake install() to rocm-cmake APIs, static build update ( #1123 )
2024-03-26 11:11:09 -04:00
corey-derochie-amd
503a472a25
Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. ( #1125 )
2024-03-25 16:29:13 -06:00
Wenkai Du
5976f757dd
Remove hipEventDisableSystemFence ( #1122 )
...
There is no indication that disabling system fence has any latency improvement.
Removing it per recommendation from HIP.
2024-03-25 08:01:57 -07:00
Nilesh M Negi
53fad75001
BUILD: Enable RCCL static build ( #1114 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-03-15 12:18:18 -05:00
Andy li
6777e65c1d
Enable fp8 support ( #1101 )
...
* initial checkin
* resolve cr comments
* resolve the build issue
* fix the data correctness issue
* update fp8 header file and update the unit test for fp8 support
* remove fp16 from fp8 headers
* fix ut issue and catch up the latest code from develop
* update according to cr comments
* update ut according to cr comments
* update num floats for each SumPostDiv from 4 to 6
* update fp8 header file name
* fix the typo
2024-03-08 15:17:53 -08:00
Wenkai Du
cbd955627e
Add support for using contiguous memory for GPU direct RDMA ( #1096 )
...
Enabled by env var RCCL_NET_CONTIGUOUS_MEM=1
2024-02-29 10:06:43 -08:00
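The commit above says the feature is enabled by the environment variable RCCL_NET_CONTIGUOUS_MEM=1. A minimal sketch of setting it for a child process follows. The helper name `rccl_env` and the application path in the comment are hypothetical and not part of RCCL; only the variable name and the value 1 come from the commit message.

```python
import os


def rccl_env(base=None, contiguous_mem=True):
    """Return a copy of an environment with RCCL_NET_CONTIGUOUS_MEM set or cleared.

    `rccl_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base is None else base)
    if contiguous_mem:
        # Value taken from the commit message: RCCL_NET_CONTIGUOUS_MEM=1
        env["RCCL_NET_CONTIGUOUS_MEM"] = "1"
    else:
        # Remove the variable so the library falls back to its default.
        env.pop("RCCL_NET_CONTIGUOUS_MEM", None)
    return env


# Usage (hypothetical binary name):
#   import subprocess
#   subprocess.run(["./my_rccl_app"], env=rccl_env(), check=True)
```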
Bertan Dogancay
b617aecc31
Implement ROCTX ( #1094 )
...
* Implement roctx
2024-02-27 15:46:15 -07:00
BertanDogancay
76f83f95ab
Merge remote-tracking branch 'rccl/develop' into 2.19.4
2024-02-15 13:37:14 -08:00
Bertan Dogancay
dc2d486ba0
Add stack size UT ( #1081 )
...
* Add stack size UT
2024-02-12 17:56:15 -07:00
Wenkai Du
d999d9ad21
Merge remote-tracking branch 'rccl/develop' into 2.19.4
2024-02-09 11:31:03 -06:00
Bertan Dogancay
8a442faa12
Nvtx support ( #1076 )
...
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du
e64324a64a
Merge remote-tracking branch 'rccl/develop' into HEAD
2024-02-01 12:17:09 -06:00
Nilesh M Negi
2458f158b1
Enable kernarg preloading for ROCm 6.1 ( #1068 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-02-01 12:14:04 -06:00
Wenkai Du
6afabf0d0b
Remove enhcompat.cc
2024-01-24 17:13:30 -08:00
BertanDogancay
81ddf9de89
Merge remote-tracking branch 'nccl/v2.19' into develop
2024-01-24 15:25:33 -08:00
Wenkai Du
7e25d5bc55
Use new HIP graph API compatible with CUDA 11030 ( #991 )
...
* Use new HIP graph API compatible with CUDA 11030
* Update dependency to ROCm 6.1
* Fix single stream use case
2024-01-21 19:00:50 -08:00
Bertan Dogancay
5f365a9957
Turn IFC off ( #1053 )
2024-01-18 15:29:36 -07:00
Bertan Dogancay
28d9b170c9
[DEV] Configure functions in RCCL ( #986 )
...
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Wenkai Du
5851ae5974
Re-enable LL128 on gfx90a if compiler supports it ( #1036 )
2024-01-10 08:01:11 -08:00
Nilesh M Negi
414884c6cb
Remove FORCE from AMDGPU_TARGETS and add support in install script ( #989 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-01-09 13:29:47 -06:00
akolliasAMD
a924454f0f
CMake does not allow capital letters to be used in package names ( #1020 )
2023-12-15 12:39:17 -07:00
Bertan Dogancay
fca459baaf
correct package name ( #1012 )
2023-12-11 09:40:29 -07:00
Bertan Dogancay
7c0f49a878
IFC mix build ( #998 )
2023-12-02 18:49:52 -07:00
Bertan Dogancay
20b02af19b
Renaming unit-tests package ( #987 )
2023-11-29 15:05:32 -07:00
Bertan Dogancay
198f14923b
Check to support older ROCm versions ( #963 )
2023-11-15 12:36:31 -07:00
Bertan Dogancay
8e0258a73d
Revert "Remove hip::device ( #954 )" ( #956 )
...
This reverts commit 7edb486154 .
2023-11-07 13:40:41 -07:00
Bertan Dogancay
7edb486154
Remove hip::device ( #954 )
2023-11-03 19:31:00 -06:00
Nilesh M Negi
f22df90e5c
remove gcnArch support ( #920 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2023-10-26 12:09:15 -05:00
searlmc1
f59de10524
Remove quotes causing asan build breakage
...
The quotes around "-fsanitize=address -shared-libasan" cause the BUILD_ADDRESS_SANITIZER build to fail; remove the quotes
2023-10-19 16:13:39 -07:00
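The quoting problem described in the commit above can be illustrated by analogy with shell-style tokenization: a quoted string stays one argument, so a tool receives a single unknown flag instead of two valid ones. This sketch uses Python's `shlex` only as a stand-in; it is an assumption that the CMake build failed for the same reason, and the commit does not show the exact mechanism.

```python
import shlex

# With the surrounding quotes, both flags arrive as ONE token,
# which a compiler would not recognize as a valid option.
quoted = shlex.split('"-fsanitize=address -shared-libasan"')

# Without the quotes, each flag is its own token.
unquoted = shlex.split('-fsanitize=address -shared-libasan')

print(quoted)
print(unquoted)
```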
Wen-Heng (Jack) Chung
7ee5c1c28b
Change MSCCL kernel signature to allow kernel arguments to be preloaded via SGPR ( #911 )
...
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895 )
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu >
* Only build gfx941
* demo
* fine tune malloc
* Fix merge errors
* Fix merge errors
* Disable parallel build
* Adopt --amdgpu-kernarg-preload-count
* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895 )"
This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.
* Revert CMake changes.
* NPKIT changes.
* Remove some license declarations.
* Address code review feedback on msccl_kernel_impl.h
* Update CMakeLists.txt
* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count
* Fix NPKIT trace logic.
---------
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com >
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu >
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com >
2023-10-12 20:17:08 -05:00
Edgar Gabriel
88a55cef83
turn bfd compilation off by default
...
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.
2023-09-29 20:25:33 +00:00
Cen Zhao
fb57a438d7
Update install.sh to take "--static" option ( #894 )
...
* Update install.sh to take "--static" option
* Fix static build errors
---------
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com >
2023-09-27 12:45:21 -04:00
Audrey MP
e58ec78d35
Gcn arch name ( #886 )
...
We use CMake to determine whether we're compiling against a version of ROCm that supports gcnArchName, and handle architecture checking accordingly. This change includes a few helper functions as drop-in replacements for the functionality we previously used gcnArch for: sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
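The architecture check described in the commit above can be sketched as follows. HIP reports gcnArchName as a string such as "gfx90a:sramecc+:xnack-", where the part before the first colon is the base architecture. The helpers `base_arch` and `is_arch` are hypothetical and written in Python for illustration only; they are not the helper functions this commit added to RCCL.

```python
def base_arch(gcn_arch_name: str) -> str:
    """Return the base architecture, dropping feature suffixes after ':'."""
    return gcn_arch_name.split(":", 1)[0]


def is_arch(gcn_arch_name: str, target: str) -> bool:
    """Check whether a reported arch name matches a target such as 'gfx90a'."""
    return base_arch(gcn_arch_name) == target


# Example: decide whether to enable an arch-specific option.
enable_gfx90a_path = is_arch("gfx90a:sramecc+:xnack-", "gfx90a")
```

Comparing the base name rather than the full string keeps the check independent of the feature suffixes, which can differ between devices of the same architecture.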
Wenkai Du
2baca3a55a
Remove --hipcc-func-supp with recent compilers ( #874 )
...
* Remove --hipcc-func-supp with recent compilers
* Remove HIP_UNCACHED_MEMORY detection from header file
2023-09-01 07:53:18 -07:00
arvindcheru
6ee758382e
366827 - Disable file reorg backward compatibility support by default ( #849 )
...
* Disable file reorg backward compatibility support by default
- File Reorg backward compatibility option set to OFF
* Update install.sh
2023-08-22 09:14:49 -04:00
Bertan Dogancay
da107ff2bc
Fix mscclLoadAlgo error ( #846 )
2023-08-09 11:39:21 -06:00
Wenkai Du
d65c0830c6
Detect HIP_UNCACHED_MEMORY support from HIP version ( #842 )
2023-08-04 10:17:04 -07:00
Bertan Dogancay
64c32d1c5b
Disable MSCCL kernels at compile time ( #834 )
...
* Disable MSCCL kernels at compile time
2023-08-02 09:45:18 -06:00
Wenkai Du
3db371c9a5
Revert "Enable Ll128 on gfx90a ( #823 )" ( #829 )
...
This reverts commit 420f8af6a0 .
Also increase number of parallel jobs for linking
2023-07-27 20:25:18 -07:00