40 Commits

Author SHA1 Message Date
Gilbert Lee f1a9ce3fa5 Using GTEST_SKIP() to skip unit tests that have insufficient devices. Skipping out earlier 2021-02-09 03:54:04 +00:00
Stanley Tsang d00b7d17bd Update MP UT to support arbitrary # of GPUs; multiple bugfixes (#16)
* Fixing temp file creation/deletion for Clique kernel mode.

* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs

* GroupCall MP UT properly quits when too many devices specified

* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
2021-02-05 16:49:25 -08:00
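The "set NCCL_COMM_ID if not specified" behaviour described in the entry above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `EnsureCommId` and its default port are hypothetical, and the `host:port` format is assumed from the `NCCL_COMM_ID=$HOSTNAME:12345` run instruction elsewhere in this log.

```cpp
#include <cstdlib>    // std::getenv, setenv (POSIX)
#include <string>
#include <unistd.h>   // gethostname (POSIX)

// Hypothetical helper (not part of RCCL): returns the value of NCCL_COMM_ID,
// setting it to "<hostname>:<port>" first if the caller did not provide one.
std::string EnsureCommId(int port = 12345)
{
  // Respect a value that was already exported by the user.
  if (const char* existing = std::getenv("NCCL_COMM_ID"))
    return std::string(existing);

  // Otherwise build a default from the local hostname.
  char host[256] = {0};
  std::string name = "localhost";  // assumed fallback if the lookup fails
  if (gethostname(host, sizeof(host) - 1) == 0 && host[0] != '\0')
    name = host;

  std::string id = name + ":" + std::to_string(port);

  // Final argument 0: do not overwrite a value set in the meantime.
  setenv("NCCL_COMM_ID", id.c_str(), 0);
  return id;
}
```

A test binary could call `EnsureCommId()` once at startup, before any processes are forked, so that every child inherits the same value.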
Wenkai Du ab1e7a0318 Merge remote-tracking branch 'origin/develop' into 2.8.3 2021-02-04 20:02:34 -05:00
Gilbert Lee 01a998b17c Removing in-place tests from Combined calls (no support for send/recv) 2021-01-28 20:09:03 +00:00
gilbertlee-amd 3e62ceddc5 Clique kernel support (#295) (#15)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2021-01-28 09:45:01 -07:00
Stanley Tsang d3fa257682 Adding multiprocess unit tests (#312)
Adding multiprocess unit tests for collectives.  

To run, NCCL_COMM_ID=$HOSTNAME:12345 build/release/test/UnitTestsMultiProcess
2021-01-15 16:34:36 -07:00
Wenkai Du b33a2cac8b gtest: add scatter to combined calls and use loops (#303)
* gtest: add scatter to combined calls and use loops

* gtest: run validation inside loop

* gtest: revert small element count to 2520

* gtest: fix memory leak in validation

(cherry picked from commit b0853ccd51)

* Fix combined call UT

* Fix memory leak

* Fix alltoallv test
2021-01-14 19:28:01 -05:00
Wenkai Du b0853ccd51 gtest: add scatter to combined calls and use loops (#303)
* gtest: add scatter to combined calls and use loops

* gtest: run validation inside loop

* gtest: revert small element count to 2520

* gtest: fix memory leak in validation
2020-11-13 17:57:44 -08:00
gilbertlee-amd 41bcfb8878 Clique kernel support (#295)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2020-11-10 15:44:10 -07:00
Wenkai Du b871ea3c0c Add Alltoallv RCCL kernel implementation (#269)
* Add alltoallv API and implementation

* Extend Rome P2P channel limit to multinode and alltoall kernels

* topo_expl: fix compilation and sync up with main

* gtest: use RCCL alltoallv API

* Code review changes
2020-09-30 16:25:36 -07:00
Wenkai Du 60819dcf8d Merge pull request #262 from wenkaidu/alignment
Make data alignment requirements match the ISA manual
2020-09-08 10:40:42 -07:00
Aaron Enye Shi 958b213428 Add RCCL Static Lib Creation with -fgpu-rdc
RCCL uses -fgpu-rdc to compile its source objects. When linking
the RCCL static library, the link and archive step must go through
hipcc and use the flag --emit-static-lib. When compiling
UnitTests, the librccl.a must be consumed through -l and -L.
2020-09-03 11:25:41 -04:00
Wenkai Du b163a8898f gtest: add alltoallv test 2020-09-02 21:28:32 +00:00
Wenkai Du 7e3f841fab Merge remote-tracking branch 'nccl/master' into 2.7.8 2020-08-10 16:11:00 +00:00
saadrahim 0dc019e35f Download GTest if not found in system (#237)
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
2020-08-06 09:36:58 -06:00
Stanley Tsang 684f3e6af4 Adding better naming to unit tests for filtering; adding short and full unit test suites (#235) 2020-07-21 12:19:47 -06:00
saadrahim 99a491273f Changing GTest inclusion in cmake to use find_package (#234)
* GTest is used via find_package. No longer downloaded in cmake.

* Adding error handling
2020-07-15 20:51:48 -06:00
gilbertlee-amd f87ba17737 Removing UnitTest as install, removing unused env var (#231) 2020-07-10 09:30:28 -06:00
Wenkai Du 8db0aa8f4c gtest: extend testing up to 8 GPUs 2020-06-29 09:32:31 -07:00
Wenkai Du fee1a20b74 gtest: add scatter, gather and all to all unit tests 2020-06-09 17:44:15 -07:00
Stanley Tsang 20fa04d9b6 Updating copyright notices for 2020. 2020-01-29 15:28:08 -08:00
Wenkai Du fe6d012eb0 Merge remote-tracking branch 'remotes/rccl/master' into rccl_2.5.6_cleanup 2020-01-29 15:28:03 -08:00
Wenkai Du 1e55645d97 Misc fixes and improvements for 2.5.6
1. Fix RCCL unit test
2. Add ROME detection and tuning
3. Change default P2P level
4. Fix search algorithm for XGMI
5. Replace explicit channel duplication with implicit duplication by using half of the link speed
6. Add collective trace support
7. Correct Intel Skylake CPU detection and bandwidth
8. Fix topo connect function
9. Disable GDR read and remove unreachable code
10. Disable LL128 kernels
11. Add tuning parameters
12. Use original clock64() implementation which returns RTC counter value
13. Print out timestamp of collective trace
14. Do not use struct ncclColl in kernel launch parameter
15. Fix abort handling and add tracing
17. Add __launch_bounds__ to kernel functions
18. Remove unused abortCount
19. Unset default MIN_NRINGS and MIN_NCHANNELS
20. Do not allocate shared memory when not using LL128 kernels
21. Correct time print out in tuning log
2020-01-29 15:27:05 -08:00
gilbertlee-amd 000bce6f27 Removing OpenMP from unit tests (#163) 2019-12-20 11:41:56 -07:00
Wenkai Du 4ca05c1297 Support bfloat16 on rest of the unit tests 2019-11-18 14:18:34 -08:00
Wenkai Du bdac0256a5 Add bfloat16 all reduce unit test 2019-11-18 13:50:29 -08:00
Akila Premachandra f48ae5c98d Added hip-clang options to install script, and openmp/pthread options to CMakeLists.txt 2019-08-23 22:02:42 +00:00
Wenkai Du f11c8f60cd RCCL 2.4 update 2019-08-14 10:42:35 -07:00
Sylvain Jeaugey f93fe9bfd9 2.3.5-5
Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests, as those are now at github.com/nvidia/nccl-tests.
2018-09-25 14:12:01 -07:00
sclarkson 680a35c6b7 fix tests on maxwell 2017-11-11 19:22:06 -08:00
Sylvain Jeaugey 1093821c33 Replace min BW by average BW in tests 2016-12-01 15:16:35 -08:00
Sylvain Jeaugey ca330b110a Add scan tests 2016-09-22 11:58:33 -07:00
Sylvain Jeaugey 6c77476cc1 Make tests check for deltas and report bandwidth 2016-09-22 11:58:28 -07:00
Sylvain Jeaugey 75bad643bd Updated LICENCE.txt 2016-08-26 15:08:20 -07:00
Nathan Luehr 55c42ad681 Fixed redundant contexts in multi-process apps
Change-Id: If787014450fd281304f0c7baf01d25963e40905d
2016-07-25 10:10:30 -07:00
Sylvain Jeaugey bd3cf73e6e Changed CURAND generator to work on a wider set of platforms. 2016-06-06 14:34:03 -07:00
Nathan Luehr 03df4c7759 Moved no-as-needed flag to link rule.
Avoids link errors for tests linked with nvcc.
2016-04-19 14:51:03 -07:00
Sylvain Jeaugey 9de361a1b9 Fix MPI test usage
Only display usage from rank 0 and exit instead of continuing (and segfaulting).
2016-04-19 10:43:38 -07:00
Nathan Luehr 2758353380 Added NCCL error checking to tests.
Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-29 11:09:05 -08:00
Sylvain Jeaugey c05312f151 Moved tests to separate dir and improved MPI test
Test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 12:56:36 -08:00