Wen-Heng (Jack) Chung
1bafe9c240
Merge pull request #742 from whchung/skip_done_event_msccl
...
Allow skipping doneEvent inside MSCCL.
[ROCm/rccl commit: eba4e9e100 ]
2023-05-18 10:17:20 -05:00
Wenkai Du
f119977daa
Fix merge error ( #744 )
...
[ROCm/rccl commit: 403cda6322 ]
2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung
932822fc46
Address review feedback and disable the flag by default.
...
[ROCm/rccl commit: ca4a1dfd67 ]
2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung
a7fbe68535
Skip doneEvent inside MSCCL by default.
...
Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default.
The env var controls whether doneEvent is used when invoking MSCCL
kernels.
Skipping doneEvent causes the firmware to skip the L2 cache flush,
which improves overall performance.
[ROCm/rccl commit: 12dba425de ]
2023-05-17 16:49:42 +00:00
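The RCCL_MSCCL_ENABLE_DONE_EVENT flag introduced in a7fbe68535 is read from the environment and is off (0) unless set. The log does not show how RCCL itself parses the value, so the following is only a minimal sketch of the "default 0, any non-zero integer enables" behaviour. The helper name done_event_enabled and the handling of non-numeric values are assumptions made for illustration.

```python
import os

ENV_NAME = "RCCL_MSCCL_ENABLE_DONE_EVENT"  # name taken from the commit message


def done_event_enabled(env=None):
    """Return True if the flag is set to a non-zero integer.

    Hypothetical helper: mirrors the documented default of 0, but the
    treatment of malformed values is an assumption, not RCCL behaviour.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME, "0")
    try:
        return int(raw) != 0
    except ValueError:
        # Assumed fallback: treat unparsable values as the default (disabled).
        return False
```

A launcher script could call done_event_enabled() before starting a job to log which mode is in effect; to opt back in, export RCCL_MSCCL_ENABLE_DONE_EVENT=1 in the job environment.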
Wenkai Du
f7184aef3f
Revert "Ensure memory copy integrity during transport setup ( #731 )" ( #741 )
...
* Revert "Ensure memory copy integrity during transport setup (#731)"
This reverts commit 0c0c35927a.
Add stream synchronization in ncclStrongStreamRelease.
* Use event record and wait
[ROCm/rccl commit: 4ca7742c61 ]
2023-05-16 10:34:47 -07:00
akolliasAMD
a26ca6d766
added modified npkit_trace_generator.py to scripts ( #738 )
...
* added modified npkit_trace_generator.py to scripts
[ROCm/rccl commit: c88475462b ]
2023-05-09 10:11:35 -06:00
Wenkai Du
3c7cf5b110
Skip checking of some settings in Cray OS ( #739 )
...
[ROCm/rccl commit: 8bb3340fcb ]
2023-05-09 07:59:56 -07:00
Wenkai Du
a8f43b496c
Remove references to NVLS functions
...
[ROCm/rccl commit: 897745a266 ]
2023-05-05 07:55:20 -07:00
Wenkai Du
18562abdb2
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 53a1f91857 ]
2023-04-25 15:38:32 -07:00
Wenkai Du
0c0c35927a
Ensure memory copy integrity during transport setup ( #731 )
...
[ROCm/rccl commit: 36e453c61e ]
2023-04-25 14:41:43 -07:00
dependabot[bot]
b9f89624be
Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx ( #728 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 79df139f0b ]
2023-04-16 18:20:17 -06:00
Saad Rahim
99e407ba76
Standardizing documentation homepage message ( #726 )
...
[ROCm/rccl commit: a78ff46861 ]
2023-04-16 18:14:56 -06:00
Wenkai Du
1af52edd6a
msccl: print stack and memory usage ( #723 )
...
* msccl: print stack and memory usage
* Update number of kernels calculation
[ROCm/rccl commit: 4b09ffba43 ]
2023-04-14 14:59:03 -07:00
Pedram Alizadeh
aeffdf872b
Disabled hipgraph tests! ( #725 )
...
[ROCm/rccl commit: 53c1c38f0e ]
2023-04-13 17:42:05 -04:00
Kaiming Ouyang
5eba5f178d
Add a comment to shutdown() in ncclSocketClose
...
[ROCm/rccl commit: 006b6bc7dc ]
2023-04-13 09:13:44 -07:00
Kaiming Ouyang
e35b05a872
Shutdown socket before close in ncclSocketClose()
...
[ROCm/rccl commit: 367e9b61c3 ]
2023-04-13 09:11:52 -07:00
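Commit e35b05a872 shuts the socket down before closing it in ncclSocketClose(). The general pattern can be sketched with Python's standard socket module. This is not the RCCL implementation (which is C/C++); the helper name close_socket is hypothetical, and ignoring OSError from shutdown() is an assumption chosen so the helper is safe on sockets that are not connected.

```python
import socket


def close_socket(sock):
    """Shut down both directions of the connection, then release the descriptor."""
    try:
        # Signals end-of-stream to the peer and unblocks pending reads/writes.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # Assumed handling: the socket was never connected or is already shut down.
        pass
    finally:
        sock.close()


if __name__ == "__main__":
    a, b = socket.socketpair()
    close_socket(a)
    # The peer observes end-of-stream as an empty read.
    print(b.recv(1))
    close_socket(b)
```

Calling shutdown() first makes the end-of-stream explicit to the peer even when another reference to the same descriptor is still open, which close() alone does not guarantee.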
Ziyue Yang
29c3da724c
MSCCL: Fix memcpy bug ( #721 )
...
[ROCm/rccl commit: 7289c05146 ]
2023-04-11 14:46:53 -07:00
Sam Wu
382306e2e8
pin rocm-docs-core and add dependabot config ( #722 )
...
[ROCm/rccl commit: dc149a9fbd ]
2023-04-11 10:01:24 -06:00
akolliasAMD
af8c2194a7
reduced the number of child processes to active ones ( #720 )
...
[ROCm/rccl commit: 2ce7d971e5 ]
2023-04-11 08:59:56 -06:00
gilbertlee-amd
ff2c1c5d0f
Unit test performance refactor ( #700 )
...
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests
[ROCm/rccl commit: 27e0cb43c2 ]
2023-04-06 12:28:53 -06:00
akolliasAMD
7c15eeb38a
added npkit_enable on CI tests ( #698 )
...
[ROCm/rccl commit: 9fe5a349f1 ]
2023-04-05 08:05:23 -06:00
Wenkai Du
6f3d035396
rccl-prim-test: minor update ( #718 )
...
[ROCm/rccl commit: addbf4bd90 ]
2023-04-03 07:30:04 -07:00
Ziyue Yang
aa605b0989
fix msccl stream usage ( #717 )
...
[ROCm/rccl commit: c8e33b1232 ]
2023-03-24 10:59:36 -07:00
akolliasAMD
1025c41f7f
fixed jenkins LD_LIBRARY_PATH ( #714 )
...
[ROCm/rccl commit: caf7a3d47d ]
2023-03-21 11:05:43 -06:00
gilbertlee-amd
b859549866
Adding interactive mode for unit tests (UT_INTERACTIVE) ( #715 )
...
[ROCm/rccl commit: 00c3d8d850 ]
2023-03-21 10:58:24 -06:00
akolliasAMD
9daf0bc3d1
Test Fixes ( #710 )
...
* splitting CI tests in running SP first and MP second
* set device before hipStreamSynchronize on tests
[ROCm/rccl commit: 9a0d4a07a6 ]
2023-03-21 08:48:39 -06:00
Wenkai Du
5d9c3b0277
Fix unit test HIP graph error ( #712 )
...
[ROCm/rccl commit: b02fd04165 ]
2023-03-20 15:34:09 -07:00
Sam Wu
1e1be0c808
Fix Docs static analysis ( #708 )
...
* remove ref to old script in static analysis
* update README with doc build instructions
update gitignore with doc artifacts
[ROCm/rccl commit: 8c56e6c892 ]
2023-03-16 13:12:43 -06:00
Ziyue Yang
f7f669e7f0
MSCCL: Improve executor and integrate scheduler ( #694 )
...
* MSCCL: improve executor and add scheduler for testing
* Use external scheduler
* Fix cmake error
* Address comments
* Fix thread safe issue
* Make MSCCL lifecycle APIs thread safe
* Make MSCCL internal scheduler aware of topology hint
* Revise error message
[ROCm/rccl commit: e3b2342f39 ]
2023-03-14 14:34:25 -07:00
Saad Rahim
8fdc4795fd
Standard template implementation ( #703 )
...
[ROCm/rccl commit: 6e48e518d9 ]
2023-03-13 11:00:57 -06:00
Wenkai Du
216c83c39d
Fix XGMI detection ( #699 )
...
* Fix XGMI detection
* Increase stack size
* Temporarily disable signal handler in CI
[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
[ROCm/rccl commit: 22b81fbaae ]
2023-03-08 14:08:07 -08:00
Wenkai Du
62f5e6a82f
Warn user on incorrect system settings ( #696 )
...
* Warn user on incorrect system settings
* Fix typo
* Add possible impact
* Ignore iommu settings in VM
[ROCm/rccl commit: 79a2031951 ]
2023-03-06 08:17:06 -08:00
Sylvain Jeaugey
8dcf8e8720
2.17.1-1
...
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
[ROCm/rccl commit: 5d3ab08b69 ]
2023-03-01 00:39:04 -08:00
gilbertlee-amd
0da1d6a6cd
Multi stream unit test ( #693 )
...
* Adding multi-stream support to unit tests
[ROCm/rccl commit: 80ed608a9d ]
2023-02-23 13:28:50 -07:00
Wenkai Du
8aaf6f8e0d
Merge pull request #685 from ROCmSoftwarePlatform/2.16.5
...
Sync up to NCCL 2.16.5
[ROCm/rccl commit: d601c4909c ]
2023-02-22 10:29:02 -08:00
gilbertlee-amd
80ce7507c7
Adding UnitTest timing summary (UT_SHOW_TIMING) ( #692 )
...
[ROCm/rccl commit: f63d3b1978 ]
2023-02-22 08:57:13 -07:00
akolliasAMD
3b6b34705a
UnitTests: made reduceScatter run fewer tests ( #691 )
...
[ROCm/rccl commit: d119c0886e ]
2023-02-21 16:21:24 -07:00
Wenkai Du
1f42d485f8
Fix P2P scheduling ( #690 )
...
[ROCm/rccl commit: 86e7b71234 ]
2023-02-21 07:49:54 -08:00
gilbertlee-amd
7d860d3642
Unit test fail check ( #689 )
...
* Adding fall-through on unit test failure
* Workaround for hipGraph validity check issue
[ROCm/rccl commit: a640c6983f ]
2023-02-18 08:50:46 -08:00
Wenkai Du
393d0ba7f8
Add back __syncthreads() in barrier and adjust stack size ( #688 )
...
[ROCm/rccl commit: 1c166046a2 ]
2023-02-18 08:50:31 -08:00
Ziyue Yang
7c1290f995
NPKit: improve clock calibration and fix GPU clock API ( #683 )
...
* Improve clock calibration in NPKit
* Improve gfx macro
* Fix macro
[ROCm/rccl commit: f4bf47f325 ]
2023-02-17 12:26:57 -07:00
Wenkai Du
0c7a94c462
Merge remote-tracking branch 'nccl/master' into HEAD
...
[ROCm/rccl commit: aee7b42bb8 ]
2023-02-14 17:14:13 -08:00
Wenkai Du
4fb1ebcf4b
Remove workaround and use indirect function call ( #684 )
...
[ROCm/rccl commit: f7a456122c ]
2023-02-14 13:59:48 -08:00
Wenkai Du
cb7e2e8eeb
Merge pull request #681 from wenkaidu/gfx9
...
Add HIP event optimization and remove special code for gfx90a
[ROCm/rccl commit: 9461a43168 ]
2023-02-13 08:04:59 -08:00
Pedram Alizadeh
29a0f74de1
Adding -pthread flag into CMakeLists.txt ( #682 )
...
Adding -pthread flag for linking issues into CMakeLists.txt
[ROCm/rccl commit: f525b8e1e6 ]
2023-02-10 17:22:30 -05:00
Pedram Alizadeh
93fa62b654
Updated RCCL introduction at docs/source/library.rst ( #680 )
...
* Updated RCCL introduction at docs/source/library.rst
* space after period
---------
Co-authored-by: Saad Rahim <44449863+saadrahim@users.noreply.github.com>
[ROCm/rccl commit: e05560ea82 ]
2023-02-10 17:20:54 -05:00
Wenkai Du
8a0d254c69
Add HIP event optimization and remove special code for gfx90a
...
[ROCm/rccl commit: 39534e8724 ]
2023-02-10 16:46:01 +00:00
gilbertlee-amd
e5795cf101
Switching to relaxed capture for unit tests ( #679 )
...
[ROCm/rccl commit: df46645ff8 ]
2023-02-08 11:28:58 -07:00
Pedram Alizadeh
186aff690d
Update library.rst ( #678 )
...
[ROCm/rccl commit: 0df82bd8a3 ]
2023-02-06 16:28:59 -05:00
Wenkai Du
c76bc214c8
Merge remote-tracking branch 'nccl/master' into HEAD
...
[ROCm/rccl commit: e1cb45ff22 ]
2023-02-04 01:44:43 +00:00