Commit Graph

141 Commits

Author SHA1 Message Date
Edgar Gabriel a80fbba12b Merge pull request #23 from edgargabriel/pr/link-fix
add the rccl/lib directory to the link path
2022-10-31 15:54:55 -05:00
Edgar Gabriel 9c9746739a add the rccl/lib directory to the link path 2022-10-31 19:01:22 +00:00
Edgar Gabriel fb0d339c1b Merge pull request #22 from edgargabriel/pr/compile-fix
fix a missing endif statement
2022-10-25 12:19:25 -05:00
Edgar Gabriel 8a754f15ad fix a missing endif statement
error introduced with the web merger-resolution tool :-(
2022-10-25 16:31:57 +00:00
Edgar Gabriel 84e8be8e65 Merge pull request #21 from ROCmSoftwarePlatform/topic/v2.13.4-sync
Topic/v2.13.4 sync
2022-10-21 17:17:27 -05:00
Edgar Gabriel 4d7cd871c1 Merge branch 'develop' into topic/v2.13.4-sync 2022-10-21 17:12:45 -05:00
Wenkai Du 9a89c300b6 Allow more precise measurements of single operation (#20) 2022-10-21 22:07:41 +00:00
Edgar Gabriel 641e93e99c make rccl-test compile again.
all files compile now.
mpi tests also pass
2022-10-21 22:07:33 +00:00
Edgar Gabriel 3ae371cce7 Merge remote-tracking branch 'nccl-tests/master' into topic/v2.13.4-sync 2022-10-14 16:02:54 -05:00
Wenkai Du d22281cb3f Allow more precise measurements of single operation (#20) 2022-10-12 17:28:04 -07:00
akolliasAMD 3fbd3280ce removed hypercube from Makefile (#19) 2022-09-29 15:36:39 -06:00
Sylvain Jeaugey d313d20a26 Update NCCL tests 2022-09-23 01:13:29 -07:00
David Addison 749573f2d6 Fix preprocessor version check for ncclGetLastError()
ncclGetLastError() was added in NCCL 2.13.0
2022-09-07 16:10:41 -07:00
David Addison afa4c56b6a Fix an issue with the last commit when data checking is disabled 2022-09-07 11:23:49 -07:00
David Addison a0a14911ee Display N/A for error count in AlltoAll in-place test
AlltoAll does not support in-place buffers
2022-09-06 13:17:15 -07:00
John Bachan bc5f7cfb0a Changed top-level Makefile behavior so that BUILDDIR is interpreted
as relative to the top-level directory. This is done by abspath'ing it before
passing it to the subdirectory Makefiles.

The old behavior had two cases: with and without BUILDDIR being set by
the user. With BUILDDIR not set, the build dir would be named "build"
in the top-level directory. If BUILDDIR was set, then the build dir
would be placed at "src/${BUILDDIR}".

The new behavior is simpler: if BUILDDIR is not set then it defaults
to "build", and the directory holding the final build is always at just
"${BUILDDIR}" in the top level.
2022-08-23 10:08:49 -07:00
John Bachan 51af5572bf Resync with NCCL 2.13
* Added "verifiable", a suite of kernels for generating and verifying reduction
  input and output arrays in a bit-precise way.
* Data corruption errors now reported in number of wrong elements instead of max
  deviation.
* Use ncclGetLastError.
* Don't run hypercube on non-powers of 2 ranks.
* Fix to hypercube data verification.
* Use "thread local" as the default CUDA capture mode.
* Replaced pthread_yield -> sched_yield()
* Bugfix to the cpu-side barrier/allreduce implementations.
2022-08-22 17:51:06 -07:00
Wenkai Du 45ec598ac4 Fix typo from previous merge 2022-08-12 14:42:17 +00:00
gilbertlee-amd f6f3c44a7a Enabling hipGraph codepath for future support (#18) 2022-08-09 16:45:27 -06:00
Wenkai Du 9025051bbb Fix missing error checking for AllocateBuffs due to merge (#17) 2022-08-09 11:04:38 -07:00
Liam Wrubleski d704668bf7 Add CMake files to build & package (#15)
* Add CMake files to build & package

* Change build technique on CI

* Correct CI build command
2022-08-09 11:17:07 -06:00
Eiden Yoshida 2af4f6bc3a Allow gpu config override in CI (#14) 2022-07-28 09:19:16 -06:00
akolliasAMD 9925195afc updated alltoallV test to not have any zero values (#12)
updated alltoallV test to not have any zero values between ranks
2022-07-21 10:28:53 -06:00
Edgar Gabriel 2a18737dc6 Merge pull request #11 from edgargabriel/ci-fix
update pytest before running CI
2022-06-13 09:52:40 -05:00
Edgar 67544e2c34 update pytest before running CI
There seems to be an incompatibility between the Python installation
used in the CI and pytest. Update pytest before running CI.
2022-06-13 10:20:33 -04:00
Edgar Gabriel 937ea1926e Merge pull request #10 from edgargabriel/multi-rank
Multi rank support
2022-06-10 14:03:33 -05:00
Edgar 0500f2f132 implementation of multi-rank support in rccl-tests. 2022-06-10 14:54:10 -04:00
Edgar 5cd2374edb create branch up-to-date with rccl-test 2022-06-10 12:41:56 -04:00
amdkila 3d6f70659a Check for error code in install script (#2) 2022-06-10 12:37:53 -04:00
David Addison 8274cb47b6 Merge pull request #96 from NVIDIA/nersc-linkage-fix
Add option to statically link cudart
2022-05-26 16:54:44 -07:00
Wenkai Du 6156759a40 Print GPU's full PCI bus ID 2022-04-06 16:46:17 +00:00
Wenkai Du 47238336d9 Update include path for custom RCCL build 2022-03-31 13:18:02 -04:00
Ziyue Yang 698524e42e move to a2a api (#9) 2022-02-18 08:31:40 -08:00
Wenkai Du 602b745ff4 Add missing hipStreamDestroy at test exit 2021-11-16 07:50:18 -08:00
David Addison de3ddbe261 Add option to statically link cudart
Build with CUDARTLIB=cudart_static to remove dynamic linkage

Also removed unused curand and nvToolsExt dependencies

BUG 95
2021-11-10 10:02:41 -08:00
David Addison 7130fa6096 Add MPI_IBM build option 2021-10-25 16:30:57 -07:00
Wenkai Du 8b35847d36 Use rccl_bfloat16 class 2021-09-23 16:39:11 -07:00
Wenkai Du dc1ad4853d Fix divide by zero error 2021-09-22 08:43:01 -07:00
Wenkai Du 213abee002 Merge remote-tracking branch 'nccl/master' into develop 2021-09-20 14:01:22 -07:00
David Addison f773748b46 Resync with NCCL 2.11
New operator: mulsum
New test: gather
2021-09-17 09:02:45 -07:00
Wenkai Du cc34c54509 Use ROCM_PATH instead of ROCM_HOME 2021-07-21 14:19:48 -07:00
Wenkai Du 2d9be62621 Merge remote-tracking branch 'nccl/master' 2021-07-15 13:54:43 -07:00
David Addison 1f8f541686 Add CUDA graph support only for CUDA 11.3 and later builds
Fixes #90
2021-07-13 10:47:47 -07:00
Wenkai Du 9f8ddadcdf Merge remote-tracking branch 'nccl/master' into develop 2021-07-13 08:11:44 -07:00
David Addison b9f90d12a9 Removed MPI_SUPPORT conditional compilation of average flag 2021-07-12 11:43:57 -07:00
David Addison 547e119d35 Fix issues with MPI_Allreduce and multi-threaded tests 2021-07-08 16:42:40 -07:00
David Addison 11cff17a04 Updated with new command line arguments 2021-07-06 16:27:45 -07:00
David Addison f476f4a17a Merge branch 'bfloat16' 2021-07-06 10:20:32 -07:00
David Addison 1dfc76eccc Added new option to report average iteration time 2021-06-30 19:36:07 -07:00
David Addison 1ae8cdc315 Resync with changes in gitilab-master code 2021-06-30 13:16:04 -07:00