Wenkai Du
d677c5fa7b
Merge pull request #735 from wenkaidu/nvls
...
Remove references to NVLS functions
[ROCm/rccl commit: b6542c9b82 ]
2023-05-05 15:09:58 -07:00
Wenkai Du
a8f43b496c
Remove references to NVLS functions
...
[ROCm/rccl commit: 897745a266 ]
2023-05-05 07:55:20 -07:00
Wenkai Du
757ce8c554
Merge pull request #732 from ROCmSoftwarePlatform/2.17.1
...
Sync up to NCCL 2.17.1
[ROCm/rccl commit: e21be4acaf ]
2023-05-02 08:30:36 -07:00
Wenkai Du
18562abdb2
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 53a1f91857 ]
2023-04-25 15:38:32 -07:00
Wenkai Du
0c0c35927a
Ensure memory copy integrity during transport setup ( #731 )
...
[ROCm/rccl commit: 36e453c61e ]
2023-04-25 14:41:43 -07:00
dependabot[bot]
b9f89624be
Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx ( #728 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 79df139f0b ]
2023-04-16 18:20:17 -06:00
Saad Rahim
99e407ba76
Standardizing documentation homepage message ( #726 )
...
[ROCm/rccl commit: a78ff46861 ]
2023-04-16 18:14:56 -06:00
David Addison
19e361cc45
Merge pull request #822 from KaimingOuyang/github/pytorch-hang-fix
...
Shutdown socket before close in ncclSocketClose()
[ROCm/rccl commit: 9b7d5edbfc ]
2023-04-14 19:52:45 -07:00
Wenkai Du
1af52edd6a
msccl: print stack and memory usage ( #723 )
...
* msccl: print stack and memory usage
* Update number of kernels calculation
[ROCm/rccl commit: 4b09ffba43 ]
2023-04-14 14:59:03 -07:00
Pedram Alizadeh
aeffdf872b
Disabled hipgraph tests! ( #725 )
...
[ROCm/rccl commit: 53c1c38f0e ]
2023-04-13 17:42:05 -04:00
Kaiming Ouyang
5eba5f178d
Add a comment to shutdown() in ncclSocketClose
...
[ROCm/rccl commit: 006b6bc7dc ]
2023-04-13 09:13:44 -07:00
Kaiming Ouyang
e35b05a872
Shutdown socket before close in ncclSocketClose()
...
[ROCm/rccl commit: 367e9b61c3 ]
2023-04-13 09:11:52 -07:00
Ziyue Yang
29c3da724c
MSCCL: Fix memcpy bug ( #721 )
...
[ROCm/rccl commit: 7289c05146 ]
2023-04-11 14:46:53 -07:00
Sam Wu
382306e2e8
pin rocm-docs-core and add dependabot config ( #722 )
...
[ROCm/rccl commit: dc149a9fbd ]
2023-04-11 10:01:24 -06:00
akolliasAMD
af8c2194a7
Reduced the number of child processes to the active ones ( #720 )
...
[ROCm/rccl commit: 2ce7d971e5 ]
2023-04-11 08:59:56 -06:00
gilbertlee-amd
ff2c1c5d0f
Unit test performance refactor ( #700 )
...
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests
[ROCm/rccl commit: 27e0cb43c2 ]
2023-04-06 12:28:53 -06:00
akolliasAMD
7c15eeb38a
added npkit_enable on CI tests ( #698 )
...
[ROCm/rccl commit: 9fe5a349f1 ]
2023-04-05 08:05:23 -06:00
Wenkai Du
6f3d035396
rccl-prim-test: minor update ( #718 )
...
[ROCm/rccl commit: addbf4bd90 ]
2023-04-03 07:30:04 -07:00
Ziyue Yang
aa605b0989
fix msccl stream usage ( #717 )
...
[ROCm/rccl commit: c8e33b1232 ]
2023-03-24 10:59:36 -07:00
akolliasAMD
1025c41f7f
fixed jenkins LD_LIBRARY_PATH ( #714 )
...
[ROCm/rccl commit: caf7a3d47d ]
2023-03-21 11:05:43 -06:00
gilbertlee-amd
b859549866
Adding interactive mode for unit tests (UT_INTERACTIVE) ( #715 )
...
[ROCm/rccl commit: 00c3d8d850 ]
2023-03-21 10:58:24 -06:00
akolliasAMD
9daf0bc3d1
Test Fixes ( #710 )
...
* splitting CI tests in running SP first and MP second
* set device before hipStreamSynchronize on tests
[ROCm/rccl commit: 9a0d4a07a6 ]
2023-03-21 08:48:39 -06:00
Wenkai Du
5d9c3b0277
Fix unit test HIP graph error ( #712 )
...
[ROCm/rccl commit: b02fd04165 ]
2023-03-20 15:34:09 -07:00
Sam Wu
1e1be0c808
Fix Docs static analysis ( #708 )
...
* remove ref to old script in static analysis
* update README with doc build instructions
update gitignore with doc artifacts
[ROCm/rccl commit: 8c56e6c892 ]
2023-03-16 13:12:43 -06:00
Ziyue Yang
f7f669e7f0
MSCCL: Improve executor and integrate scheduler ( #694 )
...
* MSCCL: improve executor and add scheduler for testing
* Use external scheduler
* Fix cmake error
* Address comments
* Fix thread safe issue
* Make MSCCL lifecycle APIs thread safe
* Make MSCCL internal scheduler aware of topology hint
* Revise error message
[ROCm/rccl commit: e3b2342f39 ]
2023-03-14 14:34:25 -07:00
Saad Rahim
8fdc4795fd
Standard template implementation ( #703 )
...
[ROCm/rccl commit: 6e48e518d9 ]
2023-03-13 11:00:57 -06:00
Wenkai Du
216c83c39d
Fix XGMI detection ( #699 )
...
* Fix XGMI detection
* Increase stack size
* Temporarily disable signal handler in CI
[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
[ROCm/rccl commit: 22b81fbaae ]
2023-03-08 14:08:07 -08:00
Wenkai Du
62f5e6a82f
Warn user on incorrect system settings ( #696 )
...
* Warn user on incorrect system settings
* Fix typo
* Add possible impact
* Ignore iommu settings in VM
[ROCm/rccl commit: 79a2031951 ]
2023-03-06 08:17:06 -08:00
Sylvain Jeaugey
8dcf8e8720
2.17.1-1
...
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
[ROCm/rccl commit: 5d3ab08b69 ]
2023-03-01 00:39:04 -08:00
gilbertlee-amd
0da1d6a6cd
Multi stream unit test ( #693 )
...
* Adding multi-stream support to unit tests
[ROCm/rccl commit: 80ed608a9d ]
2023-02-23 13:28:50 -07:00
Wenkai Du
8aaf6f8e0d
Merge pull request #685 from ROCmSoftwarePlatform/2.16.5
...
Sync up to NCCL 2.16.5
[ROCm/rccl commit: d601c4909c ]
2023-02-22 10:29:02 -08:00
gilbertlee-amd
80ce7507c7
Adding UnitTest timing summary (UT_SHOW_TIMING) ( #692 )
...
[ROCm/rccl commit: f63d3b1978 ]
2023-02-22 08:57:13 -07:00
akolliasAMD
3b6b34705a
UnitTests: made reduceScatter run a smaller amount of tests ( #691 )
...
[ROCm/rccl commit: d119c0886e ]
2023-02-21 16:21:24 -07:00
Wenkai Du
1f42d485f8
Fix P2P scheduling ( #690 )
...
[ROCm/rccl commit: 86e7b71234 ]
2023-02-21 07:49:54 -08:00
gilbertlee-amd
7d860d3642
Unit test fail check ( #689 )
...
* Adding fall-through on unit test failure
* Workaround for hipGraph validity check issue
[ROCm/rccl commit: a640c6983f ]
2023-02-18 08:50:46 -08:00
Wenkai Du
393d0ba7f8
Add back __syncthreads() in barrier and adjust stack size ( #688 )
...
[ROCm/rccl commit: 1c166046a2 ]
2023-02-18 08:50:31 -08:00
Ziyue Yang
7c1290f995
NPKit: improve clock calibration and fix GPU clock API ( #683 )
...
* Improve clock calibration in NPKit
* Improve gfx macro
* Fix macro
[ROCm/rccl commit: f4bf47f325 ]
2023-02-17 12:26:57 -07:00
Wenkai Du
0c7a94c462
Merge remote-tracking branch 'nccl/master' into HEAD
...
[ROCm/rccl commit: aee7b42bb8 ]
2023-02-14 17:14:13 -08:00
Wenkai Du
4fb1ebcf4b
Remove workaround and use indirect function call ( #684 )
...
[ROCm/rccl commit: f7a456122c ]
2023-02-14 13:59:48 -08:00
Wenkai Du
cb7e2e8eeb
Merge pull request #681 from wenkaidu/gfx9
...
Add HIP event optimization and remove special code for gfx90a
[ROCm/rccl commit: 9461a43168 ]
2023-02-13 08:04:59 -08:00
Pedram Alizadeh
29a0f74de1
Adding -pthread flag into CMakeLists.txt ( #682 )
...
Adding -pthread flag for linking issues into CMakeLists.txt
[ROCm/rccl commit: f525b8e1e6 ]
2023-02-10 17:22:30 -05:00
Pedram Alizadeh
93fa62b654
Updated RCCL introduction at docs/source/library.rst ( #680 )
...
* Updated RCCL introduction at docs/source/library.rst
* space after period
---------
Co-authored-by: Saad Rahim <44449863+saadrahim@users.noreply.github.com>
[ROCm/rccl commit: e05560ea82 ]
2023-02-10 17:20:54 -05:00
Wenkai Du
8a0d254c69
Add HIP event optimization and remove special code for gfx90a
...
[ROCm/rccl commit: 39534e8724 ]
2023-02-10 16:46:01 +00:00
gilbertlee-amd
e5795cf101
Switching to relaxed capture for unit tests ( #679 )
...
[ROCm/rccl commit: df46645ff8 ]
2023-02-08 11:28:58 -07:00
Pedram Alizadeh
186aff690d
Update library.rst ( #678 )
...
[ROCm/rccl commit: 0df82bd8a3 ]
2023-02-06 16:28:59 -05:00
Wenkai Du
86d9ab47a3
Merge pull request #671 from ROCmSoftwarePlatform/2.16.2
...
Sync up with NCCL 2.16.2
[ROCm/rccl commit: 11567e5157 ]
2023-02-04 06:54:30 -08:00
Wenkai Du
c76bc214c8
Merge remote-tracking branch 'nccl/master' into HEAD
...
[ROCm/rccl commit: e1cb45ff22 ]
2023-02-04 01:44:43 +00:00
Pedram Alizadeh
f7982e9bed
UnitTest: add test cases for 2.14 API (ncclCommInitRankConfig and ncclCommFinalize for non-blocking communicator) ( #674 )
...
[ROCm/rccl commit: fddb5e6be8 ]
2023-02-03 17:36:30 -05:00
Sylvain Jeaugey
e6e8f2555c
2.16.5-1
...
Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit
[ROCm/rccl commit: f3d5166783 ]
2023-02-02 12:52:47 -08:00
Pedram Alizadeh
847899dd7e
removed the wrapper script so that the old name is no longer referenced ( #676 )
...
[ROCm/rccl commit: fbe52b6caa ]
2023-01-31 11:11:02 -05:00