Commit Graph

1007 Commits

Author SHA1 Message Date
Wenkai Du 8bb3340fcb Skip checking of some settings in Cray OS (#739) 2023-05-09 07:59:56 -07:00
Wenkai Du b6542c9b82 Merge pull request #735 from wenkaidu/nvls
Remove references to NVLS functions
2023-05-05 15:09:58 -07:00
Wenkai Du 897745a266 Remove references to NVLS functions 2023-05-05 07:55:20 -07:00
Wenkai Du e21be4acaf Merge pull request #732 from ROCmSoftwarePlatform/2.17.1
Sync up to NCCL 2.17.1
2023-05-02 08:30:36 -07:00
Wenkai Du 53a1f91857 Merge remote-tracking branch 'nccl/master' into develop 2023-04-25 15:38:32 -07:00
Wenkai Du 36e453c61e Ensure memory copy integrity during transport setup (#731) 2023-04-25 14:41:43 -07:00
dependabot[bot] 79df139f0b Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx (#728)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-04-16 18:20:17 -06:00
Saad Rahim a78ff46861 Standardizing documentation homepage message (#726) 2023-04-16 18:14:56 -06:00
David Addison 9b7d5edbfc Merge pull request #822 from KaimingOuyang/github/pytorch-hang-fix
Shutdown socket before close in ncclSocketClose()
2023-04-14 19:52:45 -07:00
Wenkai Du 4b09ffba43 msccl: print stack and memory usage (#723)
* msccl: print stack and memory usage

* Update number of kernels calculation
2023-04-14 14:59:03 -07:00
Pedram Alizadeh 53c1c38f0e Disabled hipgraph tests! (#725) 2023-04-13 17:42:05 -04:00
Kaiming Ouyang 006b6bc7dc Add a comment to shutdown() in ncclSocketClose 2023-04-13 09:13:44 -07:00
Kaiming Ouyang 367e9b61c3 Shutdown socket before close in ncclSocketClose() 2023-04-13 09:11:52 -07:00
Ziyue Yang 7289c05146 MSCCL: Fix memcpy bug (#721) 2023-04-11 14:46:53 -07:00
Sam Wu dc149a9fbd pin rocm-docs-core and add dependabot config (#722) 2023-04-11 10:01:24 -06:00
akolliasAMD 2ce7d971e5 lessened the amount of child processes to active ones (#720) 2023-04-11 08:59:56 -06:00
gilbertlee-amd 27e0cb43c2 Unit test performance refactor (#700)
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests
2023-04-06 12:28:53 -06:00
akolliasAMD 9fe5a349f1 added npkit_enable on CI tests (#698) 2023-04-05 08:05:23 -06:00
Wenkai Du addbf4bd90 rccl-prim-test: minor update (#718) 2023-04-03 07:30:04 -07:00
Ziyue Yang c8e33b1232 fix msccl stream usage (#717) 2023-03-24 10:59:36 -07:00
akolliasAMD caf7a3d47d fixed jenkins LD_LIBRARY_PATH (#714) 2023-03-21 11:05:43 -06:00
gilbertlee-amd 00c3d8d850 Adding interactive mode for unit tests (UT_INTERACTIVE) (#715) 2023-03-21 10:58:24 -06:00
akolliasAMD 9a0d4a07a6 Test Fixes (#710)
* splitting CI tests in running SP first and MP second
* set device before hipStreamSynchronize on tests
2023-03-21 08:48:39 -06:00
Wenkai Du b02fd04165 Fix unit test HIP graph error (#712) 2023-03-20 15:34:09 -07:00
Sam Wu 8c56e6c892 Fix Docs static analysis (#708)
* remove ref to old script in static analysis

* update README with doc build instructions

update gitignore with doc artifacts
2023-03-16 13:12:43 -06:00
Ziyue Yang e3b2342f39 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message
2023-03-14 14:34:25 -07:00
Saad Rahim 6e48e518d9 Standard template implementation (#703) 2023-03-13 11:00:57 -06:00
Wenkai Du 22b81fbaae Fix XGMI detection (#699)
* Fix XGMI detection

* Increase stack size

* Temporarily disable signal handler in CI

[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
2023-03-08 14:08:07 -08:00
Wenkai Du 79a2031951 Warn user on incorrect system settings (#696)
* Warn user on incorrect system settings

* Fix typo

* Add possible impact

* Ignore iommu settings in VM
2023-03-06 08:17:06 -08:00
Sylvain Jeaugey 5d3ab08b69 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
2023-03-01 00:39:04 -08:00
gilbertlee-amd 80ed608a9d Multi stream unit test (#693)
* Adding multi-stream support to unit tests
2023-02-23 13:28:50 -07:00
Wenkai Du d601c4909c Merge pull request #685 from ROCmSoftwarePlatform/2.16.5
Sync up to NCCL 2.16.5
2023-02-22 10:29:02 -08:00
gilbertlee-amd f63d3b1978 Adding UnitTest timing summary (UT_SHOW_TIMING) (#692) 2023-02-22 08:57:13 -07:00
akolliasAMD d119c0886e UnitTests: made reduceScatter run a smaller amount of tests (#691) 2023-02-21 16:21:24 -07:00
Wenkai Du 86e7b71234 Fix P2P scheduling (#690) 2023-02-21 07:49:54 -08:00
gilbertlee-amd a640c6983f Unit test fail check (#689)
* Adding fall-through on unit test failure

* Workaround for hipGraph validity check issue
2023-02-18 08:50:46 -08:00
Wenkai Du 1c166046a2 Add back __syncthreads() in barrier and adjust stack size (#688) 2023-02-18 08:50:31 -08:00
Ziyue Yang f4bf47f325 NPKit: improve clock calibration and fix GPU clock API (#683)
* Improve clock calibration in NPKit

* Improve gfx macro

* Fix macro
2023-02-17 12:26:57 -07:00
Wenkai Du aee7b42bb8 Merge remote-tracking branch 'nccl/master' into HEAD 2023-02-14 17:14:13 -08:00
Wenkai Du f7a456122c Remove workaround and use indirect function call (#684) 2023-02-14 13:59:48 -08:00
Wenkai Du 9461a43168 Merge pull request #681 from wenkaidu/gfx9
Add HIP event optimization and remove special code for gfx90a
2023-02-13 08:04:59 -08:00
Pedram Alizadeh f525b8e1e6 Adding -pthread flag into CMakeLists.txt (#682)
Adding -pthread flag for linking issues into CMakeLists.txt
2023-02-10 17:22:30 -05:00
Pedram Alizadeh e05560ea82 Updated RCCL introduction at docs/source/library.rst (#680)
* Updated RCCL introduction at docs/source/library.rst

* space after period

---------

Co-authored-by: Saad Rahim <44449863+saadrahim@users.noreply.github.com>
2023-02-10 17:20:54 -05:00
Wenkai Du 39534e8724 Add HIP event optimization and remove special code for gfx90a 2023-02-10 16:46:01 +00:00
gilbertlee-amd df46645ff8 Switching to relaxed capture for unit tests (#679) 2023-02-08 11:28:58 -07:00
Pedram Alizadeh 0df82bd8a3 Update library.rst (#678) 2023-02-06 16:28:59 -05:00
Wenkai Du 11567e5157 Merge pull request #671 from ROCmSoftwarePlatform/2.16.2
Sync up with NCCL 2.16.2
2023-02-04 06:54:30 -08:00
Wenkai Du e1cb45ff22 Merge remote-tracking branch 'nccl/master' into HEAD 2023-02-04 01:44:43 +00:00
Pedram Alizadeh fddb5e6be8 UnitTest: add test cases for 2.14 API (ncclCommInitRankConfig and ncclCommFinalize for non-blocking communicator) (#674) 2023-02-03 17:36:30 -05:00
Sylvain Jeaugey f3d5166783 2.16.5-1
Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit
2023-02-02 12:52:47 -08:00