Commit Graph

471 Commits

Author SHA1 Message Date
Wenkai Du e2042ccf8a Fix broken profiling build (#263) 2020-09-02 15:39:52 -07:00
Wenkai Du 4180e6409e Fix incorrect threads split in sendrecv (#261) 2020-08-31 17:33:22 -07:00
Wenkai Du c5cbece6d0 Increase minimal channels for gfx908 (#259) 2020-08-26 11:40:11 -07:00
Wenkai Du b0919dc46c Only use software barrier for synchronization (#258) 2020-08-25 13:16:34 -07:00
Wenkai Du 391bbf3f1e Add NPS4 support on some models (#256)
* Add NPS4 support on some models

* Add XML models
2020-08-19 11:03:20 -07:00
gilbertlee-amd ec9af40fcd Upgrading various TransferBench features (#257) 2020-08-19 09:47:19 -06:00
Wenkai Du a51e4071e3 Add another Rome model (#249)
* Add another Rome model

* Add gfx908 4P3L models and support

* Revert "Use cached value for detecting GDR support only once"

This reverts commit 67c8e72ce3.

* Skip using ibverb for GPU direct RDMA detection

* Fine tune one Rome model
2020-08-17 10:51:02 -07:00
gilbertlee-amd c985478133 Fixes to make TransferBench compile for hipclang (#254) 2020-08-13 12:25:28 -06:00
saadrahim 6d8e19929c Adding gfx908 to CI (#253) 2020-08-13 11:07:33 -06:00
Wenkai Du 7e3d8a31cc Collect gcnArch and hipDeviceArch_t in XML (#252) 2020-08-12 15:48:38 -07:00
saadrahim 50af2e9b66 Cleaning up CI code by removing overrides (#251) 2020-08-12 12:38:10 -06:00
Wenkai Du 066223333d Merge pull request #248 from wenkaidu/2.7.8
2.7.8
2020-08-11 08:20:37 -07:00
Wenkai Du 7e3f841fab Merge remote-tracking branch 'nccl/master' into 2.7.8 2020-08-10 16:11:00 +00:00
Wenkai Du 3c46cb8ad4 Merge pull request #247 from wenkaidu/rome
Additional Rome models support
2020-08-07 10:56:12 -07:00
MurtadhaAldallal 390c63cf0d Update rccl_prim_test.cpp (#246)
Adding doublelocalcopy operation and freeing buffer memory at end.
DoubleLocalCopy Patch Added
2020-08-07 08:20:14 -07:00
Wenkai Du 09ef75656a Add more Rome 4P2H models 2020-08-06 18:20:02 +00:00
Stanley Tsang c5d4d9eb76 Adding static library building option. (#244)
* Adding static library building option.

* Disabling running tests for static build

* Removing static packaging in CI

Co-authored-by: Saad Rahim <saad.rahim@amd.com>
2020-08-06 11:19:43 -06:00
saadrahim 0dc019e35f Download GTest if not found in system (#237)
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
2020-08-06 09:36:58 -06:00
Jack Snyder de49a77074 Setting type when gpu sub node is discovered 2020-08-05 13:39:23 -07:00
Sylvain Jeaugey 3d63f89068 Merge pull request #364 from badgerious/net-class
Add GPUs and NICs based on XML sub tags instead of PCI class.
2020-08-05 12:52:38 -07:00
Eric Badger 700c0e0f24 Don't require NIC devices to have specific PCI class
If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.
2020-08-05 12:46:29 -07:00
Wenkai Du 5b03132ace Allow setup ring through NCCL_RINGS to facilitate testing 2020-08-04 21:07:00 +00:00
Wenkai Du d1e20b4c5e Improve 4P2H topology on Rome (#243)
1. Use bi-directional rings
2. GPU search is sorted by PCI device ID to get consistent results
2020-07-28 14:21:44 -07:00
David Addison 033d799524 2.7.8-1
Fix collective mismatch error when using ncclSend/ncclRecv
2020-07-27 16:34:09 -07:00
Wenkai Du e7a10aa0e4 Topology tuning for 4P2H on Rome (#242)
* Topology tuning for 4P2H on Rome

* Use ncclTopoIdToIndex
2020-07-27 11:53:57 -07:00
Wenkai Du 8d5fb920b6 ib-test: support multiple channels (#241) 2020-07-27 11:03:12 -07:00
Sourav Chakraborty fe3d520601 Merge pull request #240 from ROCmSoftwarePlatform/sourav/topo-expl-1
simplify model definitions in topo expl
2020-07-22 12:35:17 -05:00
Sourav Chakraborty 2475daafee add 4 node 8P6L 1 NIC 2nd Hive model 2020-07-22 16:27:15 +00:00
Sourav Chakraborty db55afb014 simplify model definitions in topo expl 2020-07-22 16:05:53 +00:00
Wenkai Du d5f90e19b5 Add 8P6L multi-node models (#239) 2020-07-21 14:10:36 -07:00
Stanley Tsang 684f3e6af4 Adding better naming to unit tests for filtering; adding short and full unit test suites (#235) 2020-07-21 12:19:47 -06:00
Wenkai Du 35c5a7fe45 Fix RCCL build package name (#236) 2020-07-20 14:43:00 -07:00
saadrahim 99a491273f Changing GTest inclusion in cmake to use find_package (#234)
* GTest is used via find_package. No longer downloaded in cmake.

* Adding error handling
2020-07-15 20:51:48 -06:00
saadrahim 7f93aa7e53 Changing dependency to hip-rocclr (#228) 2020-07-14 17:49:56 -06:00
Wenkai Du ab787c767e Change default channels duplication for chordal ring (#233) 2020-07-14 15:16:50 -07:00
gilbertlee-amd f87ba17737 Removing UnitTest as install, removing unused env var (#231) 2020-07-10 09:30:28 -06:00
Wenkai Du 5215130168 Revert "Split primitive class to smaller structures" (#230)
This reverts commit 486fd436af.
2020-07-08 11:06:50 -07:00
Wenkai Du 1addf4f196 Match RCCL package name to API version (#229) 2020-07-07 13:30:39 -07:00
Riatre Foo 2d8601701d Fix build action order
Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.
2020-07-07 10:20:51 -07:00
Stanley Tsang 9bd4c14603 Adding appropriate references in rccl-prim-test (#227)
Adding appropriate references to rccl-prim-test.
2020-07-06 10:15:03 -06:00
Wenkai Du ecae1cd76a Merge pull request #226 from wenkaidu/develop
Sync up to NCCL 2.7.6
2020-07-06 09:10:09 -07:00
Wenkai Du da3b197d6c Merge remote-tracking branch 'nccl/master' into develop 2020-07-01 16:51:25 -07:00
Wenkai Du d3548cc474 topo_expl: each rank needs to have its own memory for graphs (#225) 2020-07-01 15:11:02 -07:00
Wenkai Du a6be82f5ab topo_expl: fix broken build (#224) 2020-06-30 11:11:23 -07:00
Wenkai Du a144a85465 Merge pull request #223 from wenkaidu/sendrecv
Use separate threads for send and receive
2020-06-30 10:50:06 -07:00
Wenkai Du 8db0aa8f4c gtest: extend testing up to 8 GPUs 2020-06-29 09:32:31 -07:00
Wenkai Du 964c4c2061 Merge sendrecv kernel from NCCL 2.7.3
This commit was cherry-picked and modified from
https://github.com/NVIDIA/nccl/commit/5949d96f36d050e59d05872f8bbffd2549318e95
2020-06-29 08:47:46 -07:00
Wenkai Du b90735c935 Use separate threads for send and receive 2020-06-29 08:47:15 -07:00
Sylvain Jeaugey 1952325569 2.7.6-1
Fix crash when NVswitch is not visible inside a VM.
2020-06-26 16:35:54 -07:00
Sylvain Jeaugey 01afd20a77 2.7.5-1
Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.
2020-06-26 14:39:49 -07:00