Commit Graph

143 Commits

Author SHA1 Message Date
Bertan Dogancay b617aecc31 Implement ROCTX (#1094)
* Implement roctx
2024-02-27 15:46:15 -07:00
BertanDogancay 76f83f95ab Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-15 13:37:14 -08:00
Bertan Dogancay dc2d486ba0 Add stack size UT (#1081)
* Add stack size UT
2024-02-12 17:56:15 -07:00
Wenkai Du d999d9ad21 Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-09 11:31:03 -06:00
Bertan Dogancay 8a442faa12 Nvtx support (#1076)
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du e64324a64a Merge remote-tracking branch 'rccl/develop' into HEAD 2024-02-01 12:17:09 -06:00
Nilesh M Negi 2458f158b1 Enable kernarg preloading for ROCm 6.1 (#1068)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-02-01 12:14:04 -06:00
Wenkai Du 6afabf0d0b Remove enhcompat.cc 2024-01-24 17:13:30 -08:00
BertanDogancay 81ddf9de89 Merge remote-tracking branch 'nccl/v2.19' into develop 2024-01-24 15:25:33 -08:00
Wenkai Du 7e25d5bc55 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case
2024-01-21 19:00:50 -08:00
Bertan Dogancay 5f365a9957 Turn IFC off (#1053) 2024-01-18 15:29:36 -07:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Wenkai Du 5851ae5974 Re-enable LL128 on gfx90a if compiler supports it (#1036) 2024-01-10 08:01:11 -08:00
Nilesh M Negi 414884c6cb Remove FORCE from AMDGPU_TARGETS and add support in install script (#989)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-09 13:29:47 -06:00
akolliasAMD a924454f0f CMake does not allow capital letters to be used in package names (#1020) 2023-12-15 12:39:17 -07:00
Bertan Dogancay fca459baaf correct package name (#1012) 2023-12-11 09:40:29 -07:00
Bertan Dogancay 7c0f49a878 IFC mix build (#998) 2023-12-02 18:49:52 -07:00
Bertan Dogancay 20b02af19b Renaming unit-tests package (#987) 2023-11-29 15:05:32 -07:00
Bertan Dogancay 198f14923b Check to support older ROCm versions (#963) 2023-11-15 12:36:31 -07:00
Bertan Dogancay 8e0258a73d Revert "Remove hip::device (#954)" (#956)
This reverts commit 7edb486154.
2023-11-07 13:40:41 -07:00
Bertan Dogancay 7edb486154 Remove hip::device (#954) 2023-11-03 19:31:00 -06:00
Nilesh M Negi f22df90e5c remove gcnArch support (#920)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2023-10-26 12:09:15 -05:00
searlmc1 f59de10524 Remove quotes causing asan build breakage
The quotes around "-fsanitize=address -shared-libasan" cause the BUILD_ADDRESS_SANITIZER build to fail; remove the quotes
2023-10-19 16:13:39 -07:00
Wen-Heng (Jack) Chung 7ee5c1c28b Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911)
* Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)

Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>

* Only build gfx941

* demo

* fine tune malloc

* Fix merge errors

* Fix merge errors

* Disable parallel build

* Adopt --amdgpu-kernarg-preload-count

* Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)"

This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304.

* Revert CMake changes.

* NPKIT changes.

* Remove some license declarations.

* Address code review feedbacks on msccl_kernel_impl.h

* Update CMakeLists.txt

* Add CMake logic to check the existence of --amdgpu-kernarg-preload-count

* Fix NPKIT trace logic.

---------

Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
2023-10-12 20:17:08 -05:00
Edgar Gabriel 88a55cef83 turn bfd compilation off by default
revert the logic to ensure that we are not accidentally creating
a dependency on the bfd libraries when deploying rccl binaries.
2023-09-29 20:25:33 +00:00
Cen Zhao fb57a438d7 Update install.sh to take "--static" option (#894)
* Update install.sh to take "--static" option

* Fix static build errors

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2023-09-27 12:45:21 -04:00
Audrey MP e58ec78d35 Gcn arch name (#886)
We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.
2023-09-12 15:34:40 -04:00
Wenkai Du 2baca3a55a Remove --hipcc-func-supp with recent compilers (#874)
* Remove --hipcc-func-supp with recent compilers

* Remove HIP_UNCACHED_MEMORY detection from header file
2023-09-01 07:53:18 -07:00
arvindcheru 6ee758382e 366827 - Disable file reorg backward compatibility support by default (#849)
* Disable file reorg backward compatibility support by default

- File Reorg backward compatibility option set to OFF

* Update install.sh
2023-08-22 09:14:49 -04:00
Bertan Dogancay da107ff2bc Fix mscclLoadAlgo error (#846) 2023-08-09 11:39:21 -06:00
Wenkai Du d65c0830c6 Detect HIP_UNCACHED_MEMORY support from HIP version (#842) 2023-08-04 10:17:04 -07:00
Bertan Dogancay 64c32d1c5b Disable MSCCL kernels at compile time (#834)
* Disable MSCCL kernels at compile time
2023-08-02 09:45:18 -06:00
Wenkai Du 3db371c9a5 Revert "Enable Ll128 on gfx90a (#823)" (#829)
This reverts commit 420f8af6a0.

Also increase number of parallel jobs for linking
2023-07-27 20:25:18 -07:00
Wenkai Du a7fcd58a97 Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)
2023-07-21 07:31:27 -07:00
Wenkai Du ce6a2ffac8 Merge pull request #782 from ROCmSoftwarePlatform/2.18.3
Sync up with NCCL 2.18.3
2023-06-29 15:04:16 -07:00
Wenkai Du abd0615351 Merge remote-tracking branch 'nccl/master' into develop 2023-06-26 22:51:56 +00:00
arvindcheru bd14ac8b59 ASAN build excluding additional files, Algodir support for share folder
* ASAN build excluding additional files, Algodir support for share folder (#786)
* Algodir support for share folder
2023-06-23 10:57:20 -04:00
Bertan Dogancay f35777e9b0 improve compilation time and create timetrace plot (#773)
* improve compilation time and create time-trace plot

* set default value for nproc
2023-06-14 09:17:51 -06:00
Bertan Dogancay d52b6c0d24 add DMA_BUF support (#763)
* add DMA_BUF support

* remove unused libraries in src/init.cc

* change NCCL_ALL to NCCL_INIT

* remove extra pointer functions in transport/net.cc
2023-06-01 12:46:42 -06:00
gilbertlee-amd c62aebe882 Removing init_nvtx.cc from source list (#762) 2023-05-31 14:44:55 -06:00
gilbertlee-amd 777d8747a5 Refactoring CMakeFiles (#755) 2023-05-25 16:08:54 -06:00
Wenkai Du 53a1f91857 Merge remote-tracking branch 'nccl/master' into develop 2023-04-25 15:38:32 -07:00
Ziyue Yang e3b2342f39 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message
2023-03-14 14:34:25 -07:00
Wenkai Du f7a456122c Remove workaround and use indirect function call (#684) 2023-02-14 13:59:48 -08:00
Wenkai Du 9461a43168 Merge pull request #681 from wenkaidu/gfx9
Add HIP event optimization and remove special code for gfx90a
2023-02-13 08:04:59 -08:00
Pedram Alizadeh f525b8e1e6 Adding -pthread flag into CMakeLists.txt (#682)
Adding -pthread flag for linking issues into CMakeLists.txt
2023-02-10 17:22:30 -05:00
Wenkai Du 39534e8724 Add HIP event optimization and remove special code for gfx90a 2023-02-10 16:46:01 +00:00
Wenkai Du e1cb45ff22 Merge remote-tracking branch 'nccl/master' into HEAD 2023-02-04 01:44:43 +00:00
Ziyue Yang adafc0f759 Add MSCCL Support (#658)
* Add MSCCL support

* Add alignment and message size checking

* Fix nRanks checking, in-place and out-of-place tests and group call handling

* Fix hipGraph unit test

* Change MSCCL init warning to INFO

* Revise license info
2022-12-12 15:51:04 -08:00
Wenkai Du aebed537a5 Reduce linking time through more parallel jobs (#657) 2022-11-30 16:06:03 -08:00