Commit Graph

168 Commits

Author SHA1 Message Date
mberenjk db840f024e adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297)
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2024-08-22 12:36:07 -05:00
akolliasAMD d6c317d6ae removed hcc mentions (#1291) 2024-08-14 15:04:13 -06:00
Nilesh M Negi 4f31ab85ea [BUILD] Update gfxTargets for ASAN build (#1242)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-06 10:53:51 -05:00
Nilesh M Negi cb2e0615d7 [BUILD] Disable MSCCLPP build by default (#1283)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-08-02 23:17:51 -05:00
Wenkai Du ca5341d419 Restore number of parallel linking jobs (#1278)
* Restore number of parallel linking jobs

* Dynamically adjust number of linker jobs with limit of 16 jobs max

* Fix typo

* Add cgroup v1 support
2024-07-30 08:04:14 -07:00
corey-derochie-amd b31b4082dd Only initialize MSCCL++ when runtime-enabled. (#1266) 2024-07-22 00:41:31 -06:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
corey-derochie-amd 6dc47eecd7 Integrated RCCL with MSCCL++ for small message sizes (#1231) 2024-07-12 15:32:58 -06:00
Rahul Vaidya c755b9cf93 Improved version reporting in NCCL_DEBUG=VERSION (#1232)
* Improved version reporting in NCCL_DEBUG=VERSION.

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Version reporting changes

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

* Versioning changes: Initialized char arrays to null and fixed typo.

---------

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
2024-07-12 08:14:29 -05:00
akolliasAMD 63e4d76e23 gfx12 initial enablement (#1219) 2024-07-10 13:32:09 -06:00
akolliasAMD 6475da2ed9 fixed typo on BFD linkage (#1192) 2024-06-03 10:05:47 -06:00
Nilesh M Negi 5aaf7121d9 [BUILD] Update install.sh for RCCL build (#1191)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-05-31 17:58:34 -05:00
Edgar Gabriel 9ad913bfa8 add alternative to rocm_smi_lib 2024-05-14 13:51:41 -07:00
Wenkai Du a64aab5f63 Use rocm-smi thread only mutex when available (#1169) 2024-05-08 14:32:24 -07:00
Wenkai Du b18784d8b8 Add compiler warning for uninitialized variable and fix (#1163)
* Add compiler warning for uninitialized variable and fix

* Add -Wsometimes-uninitialized

* Convert warning to error
2024-05-08 07:00:25 -07:00
BertanDogancay e1a835910e Merge remote-tracking branch 'nccl/master' into develop 2024-04-23 13:34:00 -07:00
Bertan Dogancay 3caad91f32 Add unique files to source list (#1144) 2024-04-15 09:46:53 -06:00
mberenjk 428837ffe4 replacing rccl_bfloat16 with hip_bfloat16 (#1126)
Co-authored-by: mberenjk <mberenjk@amd.com>
2024-04-11 11:30:37 -05:00
arvindcheru c1b8eab8e1 Update Depends with correct HIP Runtime package name (#1130) 2024-04-09 19:27:07 -04:00
arvindcheru c0a51dc84b Static Build update - Moved all cmake install() to rocm-cmake APIs, static build update (#1123) 2024-03-26 11:11:09 -04:00
corey-derochie-amd 503a472a25 Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. (#1125) 2024-03-25 16:29:13 -06:00
Wenkai Du 5976f757dd Remove hipEventDisableSystemFence (#1122)
There is no indication that disabling system fence has any latency improvement.
Removing it per recommendation from HIP.
2024-03-25 08:01:57 -07:00
Nilesh M Negi 53fad75001 BUILD: Enable RCCL static build (#1114)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-03-15 12:18:18 -05:00
Andy li 6777e65c1d Enable fp8 support (#1101)
* initial checkin

* resolve cr comments

* resolve the build issue

* fix the data correctness issue

* update fp8 header file and update the unit test for fp8 support

* remove fp16 from fp8 headers

* fix ut issue and catch up the latest code from develop

* update according to cr comments

* update ut according to cr comments

* update num floats for each SumPostDiv from 4 to 6

* update fp8 header file name

* fix the typo
2024-03-08 15:17:53 -08:00
Wenkai Du cbd955627e Add support for using contiguous for GPU direct RDMA (#1096)
Enabled by env var RCCL_NET_CONTIGUOUS_MEM=1
2024-02-29 10:06:43 -08:00
Bertan Dogancay b617aecc31 Implement ROCTX (#1094)
* Implement roctx
2024-02-27 15:46:15 -07:00
BertanDogancay 76f83f95ab Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-15 13:37:14 -08:00
Bertan Dogancay dc2d486ba0 Add stack size UT (#1081)
* Add stack size UT
2024-02-12 17:56:15 -07:00
Wenkai Du d999d9ad21 Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-09 11:31:03 -06:00
Bertan Dogancay 8a442faa12 Nvtx support (#1076)
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du e64324a64a Merge remote-tracking branch 'rccl/develop' into HEAD 2024-02-01 12:17:09 -06:00
Nilesh M Negi 2458f158b1 Enable kernarg preloading for ROCm 6.1 (#1068)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-02-01 12:14:04 -06:00
Wenkai Du 6afabf0d0b Remove enhcompat.cc 2024-01-24 17:13:30 -08:00
BertanDogancay 81ddf9de89 Merge remote-tracking branch 'nccl/v2.19' into develop 2024-01-24 15:25:33 -08:00
Wenkai Du 7e25d5bc55 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case
2024-01-21 19:00:50 -08:00
Bertan Dogancay 5f365a9957 Turn IFC off (#1053) 2024-01-18 15:29:36 -07:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Wenkai Du 5851ae5974 Re-enable L128 on gfx90a if compiler supports it (#1036) 2024-01-10 08:01:11 -08:00
Nilesh M Negi 414884c6cb Remove FORCE from AMDGPU_TARGETS and add support in install script (#989)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-09 13:29:47 -06:00
akolliasAMD a924454f0f CMake does not allow for capital letters being used in package names (#1020) 2023-12-15 12:39:17 -07:00
Bertan Dogancay fca459baaf correct package name (#1012) 2023-12-11 09:40:29 -07:00
Bertan Dogancay 7c0f49a878 IFC mix build (#998) 2023-12-02 18:49:52 -07:00
Bertan Dogancay 20b02af19b Renaming unit-tests package (#987) 2023-11-29 15:05:32 -07:00
Bertan Dogancay 198f14923b Check to support older ROCm versions (#963) 2023-11-15 12:36:31 -07:00
Bertan Dogancay 8e0258a73d Revert "Remove hip::device (#954)" (#956)
This reverts commit 7edb486154.
2023-11-07 13:40:41 -07:00
Bertan Dogancay 7edb486154 Remove hip::device (#954) 2023-11-03 19:31:00 -06:00
Nilesh M Negi f22df90e5c remove gcnArch support (#920)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 12:09:15 -05:00
searlmc1 f59de10524 Remove quotes causing asan build breakage
The quotes around "-fsanitize=address -shared-libasan" cause the BUILD_ADDRESS_SANITIZER build to fail; remove the quotes
2023-10-19 16:13:39 -07:00
Wen-Heng (Jack) Chung 7ee5c1c28b Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)

Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>

* Only build gfx941

* demo

* fine tune malloc

* Fix merge errors

* Fix merge errors

* Disable parallel build

* Adopt --amdgpu-kernarg-preload-count

* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"

This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.

* Revert CMake changes.

* NPKIT changes.

* Remove some license declarations.

* Address code review feedbacks on msccl_kernel_impl.h

* Update CMakeLists.txt

* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count

* Fix NPKIT trace logic.

---------

Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2023-10-12 20:17:08 -05:00
Edgar Gabriel 88a55cef83 turn bfd compilation off by default
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.
2023-09-29 20:25:33 +00:00