Commit graph

135 commits

Author SHA1 Message Date
mberenjk eb65dadfc5 replacing rccl_bfloat16 with hip_bfloat16 (#70)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2024-04-23 17:00:20 -05:00
Nilesh M Negi 990f88cbaa Ammend use of CUSTOM_RCCL_LIB to avoid build error (#71)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-04-12 12:01:32 -05:00
mberenjk 3f7f7859bf adding git version to rccl-tests (#69)
Co-authored-by: mberenjk <mberenjk@amd.com>
2024-03-28 14:03:59 -05:00
akolliasAMD 91609be0ef Revert "adding git version to rccl-test (#66)"
This reverts commit a31679775c.
2024-03-22 10:21:37 -06:00
mberenjk a31679775c adding git version to rccl-test (#66)
* adding git version to rccl-test

---------

Co-authored-by: mberenjk <mberenjk@banff-cyxtera-s74-2.ctr.dcgpu>
2024-03-20 10:04:12 -05:00
Andy li e447c17382 update the fp8 header file name (#65)
* update the fp8 header name
2024-03-08 10:02:40 -08:00
Andy li 21e59fb283 Enable fp8 support (#63)
* initial checkin

* rename the fp8 datatype name

* update based on cr comments

* resolve the build issue

* resolve fp8 campability issue

* fix minior bug and catch up to reflex latest develop branch change

* add fp8 + operatior support

* update fp8 header file

* resolve merge issue from develop branch
2024-03-07 16:54:41 -08:00
Bertan Dogancay 88cf7dbf45 Add hipify steps prior to build (#62)
* Add hipify steps prior to build
2024-03-05 09:47:18 -07:00
Wenkai Du 621dde544d Merge remote-tracking branch 'nccl-tests/master' into HEAD 2024-03-01 18:34:44 +00:00
Wenkai Du 7715a0cf1f Fix typo in rank assignment (#59) 2024-02-15 12:04:38 -08:00
David Addison c6afef0b6f Added missing MPI_Comm_free() call before MPI_Finalize() 2024-02-05 08:53:54 -08:00
Nusrat Islam a2bec5d2f6 Add option to disable out-of-place 2024-01-04 16:43:50 -06:00
Lauren Wrubleski e1a816b869 Offload arch linking (#54)
* Update CMakeLists.txt

* Update CMakeLists.txt

* Link rccl_common object against hip::device

Previously the tests were compiled with `--amdgpu-target` to compile for multiple architectures. As rccl_common was not compiled against those architectures, this didn't work. Linking it against hip::device automatically links against all architectures in `AMDGPU_TARGETS`, and so are the test executables.
2023-12-05 19:20:46 -06:00
Wenkai Du 5ee7a08994 Warm up both out-of-place and in-place collectives (#51) 2023-10-16 12:13:50 -07:00
David Addison 1292b25553 Added an MPI_Barrier() call after MPI_Bcast() for HCOLL issue 2023-10-12 16:53:32 -07:00
David Addison 6c46206a47 Make the -c option be a datacheck iteration count parameter
Default is 1
2023-09-13 14:03:38 -07:00
arvindcheru a6593375bc Update Makefile - HIPCC Path Updated to latest (#45) 2023-08-04 19:33:39 -04:00
Wenkai Du fcd0888d53 Remove hardcoded number of GPUs limit for alltoallv (#41) 2023-06-18 18:07:29 -07:00
Wenkai Du 652a24d38d Fix merge error 2023-06-14 20:26:33 +00:00
Wenkai Du bb0f15d407 Merge remote-tracking branch 'nccl/master' into develop 2023-06-14 08:21:02 -07:00
Wenkai Du 469225bcaf Merge remote-tracking branch 'origin/master' into develop 2023-06-14 08:01:50 -07:00
Pedram Alizadeh d16d1fb16b fixing the error message for mpirun when number of requested GPUs exceeds the limits (#37) 2023-04-27 14:06:17 -04:00
Pedram Alizadeh e856fa720f Revert "fixing the error message for mpirun when number of requested GPUs exceeds the limits (#33)" (#36)
This reverts commit e146460810.
2023-04-25 13:44:43 -04:00
Pedram Alizadeh e146460810 fixing the error message for mpirun when number of requested GPUs exceeds the limits (#33) 2023-04-03 11:37:13 -04:00
alan.souza 7ccda3c97b fix handling of variable NVCC. Permit overriding the variable using environment variables 2023-03-25 16:56:16 -03:00
Pedram Alizadeh 255750b094 Adding -pthread flag for linking issues into CMakeLists.txt and src/Makefile (#31) 2023-03-02 11:05:25 -05:00
Pedram Alizadeh 5275aa5715 Adding -pthread flag for linking issues into src/Makefile (#30)
* Adding -pthread flag for linking issues into src/Makefile

* Adding -pthread flag for linking issues into CMakeLists.txt
2023-02-24 21:39:04 -05:00
David Addison 0b4c4cb99f Add boot_id to the hostname hash due to collisions on Azure
Fixes #60
2022-12-12 01:16:46 -08:00
Jithin Jose 0aeba157db Use DJB2a hash algorithm in getHostHash() 2022-12-12 01:16:38 -08:00
David Addison 24fcf64ed1 Call cudaFreeHost() on wrongPerGpu not cudaFree() 2022-11-22 11:18:37 -08:00
David Addison 3bd2bd292b Add fflush(stdout) before perf output 2022-11-22 11:16:47 -08:00
akolliasAMD 9d3a53dfa3 added std::max to avoid buffer overflow in printing (#25) 2022-11-01 11:34:55 -06:00
Edgar Gabriel 377b28e5fb make cmake stage also pass in CI
the subdir entry is not actually required for the compilation.
2022-10-31 22:07:15 +00:00
Edgar Gabriel 9c9746739a add the rccl/lib directory to the link path 2022-10-31 19:01:22 +00:00
Edgar Gabriel 8a754f15ad fix a messing endif statement
error introduced with the web merger-resolution tool :-(
2022-10-25 16:31:57 +00:00
Edgar Gabriel 4d7cd871c1 Merge branch 'develop' into topic/v2.13.4-sync 2022-10-21 17:12:45 -05:00
Wenkai Du 9a89c300b6 Allow more precise measurements of single operation (#20) 2022-10-21 22:07:41 +00:00
Edgar Gabriel 641e93e99c make rccl-test compile again.
all files compile now.
mpi tests also pass
2022-10-21 22:07:33 +00:00
Edgar Gabriel 3ae371cce7 Merge remote-tracking branch 'nccl-tests/master' into topic/v2.13.4-sync 2022-10-14 16:02:54 -05:00
Wenkai Du d22281cb3f Allow more precise measurements of single operation (#20) 2022-10-12 17:28:04 -07:00
Sylvain Jeaugey 365b92a1ea Fix build on RHEL7 with GCC 4.8
Add -std=c++11 to CXXFLAGS.
Fixes #116.
2022-10-12 01:24:14 -07:00
akolliasAMD 3fbd3280ce removed hypercube from Makefile (#19) 2022-09-29 15:36:39 -06:00
Sylvain Jeaugey d313d20a26 Update NCCL tests 2022-09-23 01:13:29 -07:00
David Addison 749573f2d6 Fix preprocessor version check for ncclGetLastError()
ncclGetLastError() was added in NCCL 2.13.0
2022-09-07 16:10:41 -07:00
David Addison afa4c56b6a Fix an issue with the last commit when data checking is disabled 2022-09-07 11:23:49 -07:00
David Addison a0a14911ee Display N/A for error count in AlltoAll in-place test
AlltoAll does not support in-place buffers
2022-09-06 13:17:15 -07:00
John Bachan 51af5572bf Resync with NCCL 2.13
* Added "verifiable", a suite of kernels for generating and verifying reduction
  input and output arrays in a bit-precise way.
* Data corruption errors now reported in number of wrong elements instead of max
  deviation.
* Use ncclGetLastError.
* Don't run hypercube on non-powers of 2 ranks.
* Fix to hypercube data verification.
* Use "thread local" as the defaut CUDA capture mode.
* Replaced pthread_yield -> sched_yield()
* Bugfix to the cpu-side barrier/allreduce implementations.
2022-08-22 17:51:06 -07:00
Wenkai Du 45ec598ac4 Fix typo from previous merge 2022-08-12 14:42:17 +00:00
gilbertlee-amd f6f3c44a7a Enabling hipGraph codepath for future support (#18) 2022-08-09 16:45:27 -06:00
Wenkai Du 9025051bbb Fix missing error checking for AllocateBuffs due to merge (#17) 2022-08-09 11:04:38 -07:00