Bertan Dogancay
a32eeae7b5
add DMA_BUF support ( #763 )
...
* add DMA_BUF support
* remove unused libraries in src/init.cc
* change NCCL_ALL to NCCL_INIT
* remove extra pointer functions in transport/net.cc
[ROCm/rccl commit: d52b6c0d24 ]
2023-06-01 12:46:42 -06:00
gilbertlee-amd
05dac01df5
Removing init_nvtx.cc from source list ( #762 )
...
[ROCm/rccl commit: c62aebe882 ]
2023-05-31 14:44:55 -06:00
Wenkai Du
61e30182b2
Rework barrier and event code ( #761 )
...
* Rework barrier and event code
* Switch to inline asm
[ROCm/rccl commit: 5a38ff192b ]
2023-05-31 13:36:51 -07:00
Nusrat Islam
f4d63d21ca
Merge pull request #757 from nusislam/unroll
...
device: change unroll factor
[ROCm/rccl commit: 7519ecb476 ]
2023-05-30 14:19:32 -05:00
akolliasAMD
fd64cb242d
added time results on npkit generator ( #749 )
...
[ROCm/rccl commit: 2b1efa9e9a ]
2023-05-30 12:57:25 -06:00
gilbertlee-amd
d2c1295f79
Refactoring CMakeFiles ( #755 )
...
[ROCm/rccl commit: 777d8747a5 ]
2023-05-25 16:08:54 -06:00
Nusrat Islam
325ff1dc11
device: change unroll factor
...
The default value of unroll factor is 2. Changing the unroll
factor to 4 provides better performance for most of the collectives.
[ROCm/rccl commit: 4d1cfb17c8 ]
2023-05-25 15:42:35 -05:00
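For readers unfamiliar with the term, the unroll factor mentioned in commit 325ff1dc11 is the number of elements a loop body handles per iteration. The following is a minimal host-side C++ sketch of that idea, not RCCL's device code. `sumUnrolled` and its template parameter are hypothetical names chosen for illustration, and the sketch makes no claim about the performance effect reported in the commit.

```cpp
#include <cstddef>

// Hypothetical illustration only; this is not RCCL code.
// Sums n floats, handling Unroll elements per main-loop iteration.
template <int Unroll>
float sumUnrolled(const float* data, std::size_t n) {
  float acc[Unroll] = {};  // one independent accumulator per unrolled lane
  std::size_t i = 0;

  // Main loop: the inner trip count is a compile-time constant,
  // so the compiler is free to unroll it completely.
  for (; i + Unroll <= n; i += Unroll) {
    for (int u = 0; u < Unroll; ++u) {
      acc[u] += data[i + u];
    }
  }

  // Combine the per-lane accumulators.
  float total = 0.0f;
  for (int u = 0; u < Unroll; ++u) {
    total += acc[u];
  }

  // Tail loop: leftover elements when n is not a multiple of Unroll.
  for (; i < n; ++i) {
    total += data[i];
  }
  return total;
}
```

Instantiating `sumUnrolled<2>` and `sumUnrolled<4>` on the same input changes only how much work each iteration does. For general floating-point data the two may differ in the last bits because the additions are grouped differently.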
akolliasAMD
f371afe6fd
updated install script to enable all of npkit ( #754 )
...
[ROCm/rccl commit: 58db1cb96d ]
2023-05-24 14:44:01 -06:00
Ziyue Yang
a7557cf7b0
revert npkit ( #748 )
...
[ROCm/rccl commit: 7d6e7bcd7d ]
2023-05-24 07:41:05 -07:00
Ziyue Yang
4430e4448f
Limit MSCCL reduce unrolling to pow-2 cases to shrink kernel size ( #746 )
...
[ROCm/rccl commit: ed252c30f4 ]
2023-05-19 11:46:36 -07:00
Ziyue Yang
73c6d51454
fix min, max and avg ( #745 )
...
[ROCm/rccl commit: 11676267b5 ]
2023-05-18 11:02:59 -07:00
Sam Wu
2707b6c6c1
Update documentation requirements ( #743 )
...
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
[ROCm/rccl commit: edb2ee1a44 ]
2023-05-18 10:52:35 -06:00
Wen-Heng (Jack) Chung
1bafe9c240
Merge pull request #742 from whchung/skip_done_event_msccl
...
Allow skipping doneEvent inside MSCCL.
[ROCm/rccl commit: eba4e9e100 ]
2023-05-18 10:17:20 -05:00
Wenkai Du
f119977daa
Fix merge error ( #744 )
...
[ROCm/rccl commit: 403cda6322 ]
2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung
932822fc46
Address review feedback and disable the flag by default.
...
[ROCm/rccl commit: ca4a1dfd67 ]
2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung
a7fbe68535
Skip doneEvent inside MSCCL by default.
...
Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default.
The env var controls whether doneEvent is used when invoking MSCCL
kernels.
Skipping doneEvent causes the firmware to skip the L2 cache flush,
which improves overall performance.
[ROCm/rccl commit: 12dba425de ]
2023-05-17 16:49:42 +00:00
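The default-off behaviour described for RCCL_MSCCL_ENABLE_DONE_EVENT in commit a7fbe68535 can be pictured with a small sketch. This is an assumption about how such a flag could be read, not RCCL's actual parameter handling. `envFlagEnabled` and `useDoneEvent` are hypothetical helpers, and only the variable name and its default of 0 come from the commit message.

```cpp
#include <cstdlib>

// Hypothetical helper, not RCCL's parser.
// Returns true if the variable holds a non-zero integer, false if it holds 0,
// and defaultValue if it is unset, empty, or not a number.
bool envFlagEnabled(const char* name, bool defaultValue) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') {
    return defaultValue;
  }
  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  if (end == raw) {
    return defaultValue;  // no digits were parsed
  }
  return value != 0;
}

// Mirrors the default stated in the commit message: doneEvent is skipped
// unless the variable is explicitly set to a non-zero value.
bool useDoneEvent() {
  return envFlagEnabled("RCCL_MSCCL_ENABLE_DONE_EVENT", false);
}
```

Under this sketch, exporting RCCL_MSCCL_ENABLE_DONE_EVENT=1 before launching the application would make `useDoneEvent()` return true, and leaving it unset keeps the skip-by-default path.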
Wenkai Du
f7184aef3f
Revert "Ensure memory copy integrity during transport setup ( #731 )" ( #741 )
...
* Revert "Ensure memory copy integrity during transport setup (#731)"
This reverts commit 0c0c35927a.
Add stream synchronization in ncclStrongStreamRelease.
* Use event record and wait
[ROCm/rccl commit: 4ca7742c61 ]
2023-05-16 10:34:47 -07:00
akolliasAMD
a26ca6d766
added modified npkit_trace_generator.py to scripts ( #738 )
...
* added modified npkit_trace_generator.py to scripts
[ROCm/rccl commit: c88475462b ]
2023-05-09 10:11:35 -06:00
Wenkai Du
3c7cf5b110
Skip checking of some settings in Cray OS ( #739 )
...
[ROCm/rccl commit: 8bb3340fcb ]
2023-05-09 07:59:56 -07:00
Wenkai Du
a8f43b496c
Remove references to NVLS functions
...
[ROCm/rccl commit: 897745a266 ]
2023-05-05 07:55:20 -07:00
Wenkai Du
18562abdb2
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 53a1f91857 ]
2023-04-25 15:38:32 -07:00
Wenkai Du
0c0c35927a
Ensure memory copy integrity during transport setup ( #731 )
...
[ROCm/rccl commit: 36e453c61e ]
2023-04-25 14:41:43 -07:00
dependabot[bot]
b9f89624be
Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx ( #728 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core ) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases )
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 79df139f0b ]
2023-04-16 18:20:17 -06:00
Saad Rahim
99e407ba76
Standardizing documentation homepage message ( #726 )
...
[ROCm/rccl commit: a78ff46861 ]
2023-04-16 18:14:56 -06:00
Wenkai Du
1af52edd6a
msccl: print stack and memory usage ( #723 )
...
* msccl: print stack and memory usage
* Update number of kernels calculation
[ROCm/rccl commit: 4b09ffba43 ]
2023-04-14 14:59:03 -07:00
Pedram Alizadeh
aeffdf872b
Disabled hipgraph tests! ( #725 )
...
[ROCm/rccl commit: 53c1c38f0e ]
2023-04-13 17:42:05 -04:00
Kaiming Ouyang
5eba5f178d
Add a comment to shutdown() in ncclSocketClose
...
[ROCm/rccl commit: 006b6bc7dc ]
2023-04-13 09:13:44 -07:00
Kaiming Ouyang
e35b05a872
Shutdown socket before close in ncclSocketClose()
...
[ROCm/rccl commit: 367e9b61c3 ]
2023-04-13 09:11:52 -07:00
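The ordering named in commit e35b05a872, shutting a socket down before closing it, can be sketched with the plain POSIX calls. This is an illustration under stated assumptions rather than the body of ncclSocketClose(). `shutdownThenClose` is a hypothetical helper, and ignoring the result of shutdown() is a choice made here for brevity.

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper; not the ncclSocketClose() implementation.
// Returns 0 on success and -1 if fd is invalid or close() fails.
int shutdownThenClose(int fd) {
  if (fd < 0) {
    return -1;
  }
  // Disable further sends and receives first, so the peer sees end-of-stream.
  // The result is ignored: shutdown() can fail with ENOTCONN on a socket that
  // was never connected, and the descriptor still has to be released.
  (void)shutdown(fd, SHUT_RDWR);

  // Release the file descriptor.
  return close(fd);
}
```

On a connected stream socket, a peer blocked in a read would typically observe end-of-file once this helper has run on the other end.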
Ziyue Yang
29c3da724c
MSCCL: Fix memcpy bug ( #721 )
...
[ROCm/rccl commit: 7289c05146 ]
2023-04-11 14:46:53 -07:00
Sam Wu
382306e2e8
pin rocm-docs-core and add dependabot config ( #722 )
...
[ROCm/rccl commit: dc149a9fbd ]
2023-04-11 10:01:24 -06:00
akolliasAMD
af8c2194a7
reduced the number of child processes to active ones ( #720 )
...
[ROCm/rccl commit: 2ce7d971e5 ]
2023-04-11 08:59:56 -06:00
gilbertlee-amd
ff2c1c5d0f
Unit test performance refactor ( #700 )
...
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests
[ROCm/rccl commit: 27e0cb43c2 ]
2023-04-06 12:28:53 -06:00
akolliasAMD
7c15eeb38a
added npkit_enable on CI tests ( #698 )
...
[ROCm/rccl commit: 9fe5a349f1 ]
2023-04-05 08:05:23 -06:00
Wenkai Du
6f3d035396
rccl-prim-test: minor update ( #718 )
...
[ROCm/rccl commit: addbf4bd90 ]
2023-04-03 07:30:04 -07:00
Ziyue Yang
aa605b0989
fix msccl stream usage ( #717 )
...
[ROCm/rccl commit: c8e33b1232 ]
2023-03-24 10:59:36 -07:00
akolliasAMD
1025c41f7f
fixed jenkins LD_LIBRARY_PATH ( #714 )
...
[ROCm/rccl commit: caf7a3d47d ]
2023-03-21 11:05:43 -06:00
gilbertlee-amd
b859549866
Adding interactive mode for unit tests (UT_INTERACTIVE) ( #715 )
...
[ROCm/rccl commit: 00c3d8d850 ]
2023-03-21 10:58:24 -06:00
akolliasAMD
9daf0bc3d1
Test Fixes ( #710 )
...
* splitting CI tests in running SP first and MP second
* set device before hipStreamSynchronize on tests
[ROCm/rccl commit: 9a0d4a07a6 ]
2023-03-21 08:48:39 -06:00
Wenkai Du
5d9c3b0277
Fix unit test HIP graph error ( #712 )
...
[ROCm/rccl commit: b02fd04165 ]
2023-03-20 15:34:09 -07:00
Sam Wu
1e1be0c808
Fix Docs static analysis ( #708 )
...
* remove ref to old script in static analysis
* update README with doc build instructions
* update gitignore with doc artifacts
[ROCm/rccl commit: 8c56e6c892 ]
2023-03-16 13:12:43 -06:00
Ziyue Yang
f7f669e7f0
MSCCL: Improve executor and integrate scheduler ( #694 )
...
* MSCCL: improve executor and add scheduler for testing
* Use external scheduler
* Fix cmake error
* Address comments
* Fix thread safe issue
* Make MSCCL lifecycle APIs thread safe
* Make MSCCL internal scheduler aware of topology hint
* Revise error message
[ROCm/rccl commit: e3b2342f39 ]
2023-03-14 14:34:25 -07:00
Saad Rahim
8fdc4795fd
Standard template implementation ( #703 )
...
[ROCm/rccl commit: 6e48e518d9 ]
2023-03-13 11:00:57 -06:00
Wenkai Du
216c83c39d
Fix XGMI detection ( #699 )
...
* Fix XGMI detection
* Increase stack size
* Temporarily disable signal handler in CI
[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
[ROCm/rccl commit: 22b81fbaae ]
2023-03-08 14:08:07 -08:00
Wenkai Du
62f5e6a82f
Warn user on incorrect system settings ( #696 )
...
* Warn user on incorrect system settings
* Fix typo
* Add possible impact
* Ignore iommu settings in VM
[ROCm/rccl commit: 79a2031951 ]
2023-03-06 08:17:06 -08:00
Sylvain Jeaugey
8dcf8e8720
2.17.1-1
...
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
[ROCm/rccl commit: 5d3ab08b69 ]
2023-03-01 00:39:04 -08:00
gilbertlee-amd
0da1d6a6cd
Multi stream unit test ( #693 )
...
* Adding multi-stream support to unit tests
[ROCm/rccl commit: 80ed608a9d ]
2023-02-23 13:28:50 -07:00
Wenkai Du
8aaf6f8e0d
Merge pull request #685 from ROCmSoftwarePlatform/2.16.5
...
Sync up to NCCL 2.16.5
[ROCm/rccl commit: d601c4909c ]
2023-02-22 10:29:02 -08:00
gilbertlee-amd
80ce7507c7
Adding UnitTest timing summary (UT_SHOW_TIMING) ( #692 )
...
[ROCm/rccl commit: f63d3b1978 ]
2023-02-22 08:57:13 -07:00
akolliasAMD
3b6b34705a
UnitTests: made reduceScatter run fewer tests ( #691 )
...
[ROCm/rccl commit: d119c0886e ]
2023-02-21 16:21:24 -07:00
Wenkai Du
1f42d485f8
Fix P2P scheduling ( #690 )
...
[ROCm/rccl commit: 86e7b71234 ]
2023-02-21 07:49:54 -08:00