Cory Bloor
8df2b752cd
Fix build on additional architectures ( #740 )
...
* Fix build on additional architectures
Instead of directly wrapping a platform-specific operation with a
preprocessor check against a gfx macro, it can be more flexible to
check a macro that can be overridden by the user. The gfx macro can then
just provide the default value for the macro, resulting in the same
default behaviour as if the gfx macro was checked directly but with
more control at build-time.
For example, to build rccl without using buffer_wbinvl1_vol on
gfx902, but still use the default on other archs, a user could
export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=0' before
configuring the build. This flexibility isn't always necessary, but
it's nicer to have it and not need it than to need it and not have it.
* Define WARP_SIZE using warpSize builtin
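The overridable-macro pattern described in the first bullet can be sketched as follows. This is a minimal illustration, not the code from the commit: the use of `__GFX9__` as the default condition and the helper `cacheInvalidatePath` are assumptions made for the example; only the macro name `RCCL_USE_WBINVL1_VOL` comes from the message above.

```cpp
#include <cassert>
#include <cstring>

// The user may predefine the macro on the command line, for example with
// -DRCCL_USE_WBINVL1_VOL=0. Only when it is absent does the architecture
// macro supply the default. The __GFX9__ condition is an assumption for
// this sketch; the condition used in rccl may differ.
#ifndef RCCL_USE_WBINVL1_VOL
#  if defined(__GFX9__)
#    define RCCL_USE_WBINVL1_VOL 1
#  else
#    define RCCL_USE_WBINVL1_VOL 0
#  endif
#endif

static_assert(RCCL_USE_WBINVL1_VOL == 0 || RCCL_USE_WBINVL1_VOL == 1,
              "RCCL_USE_WBINVL1_VOL must be 0 or 1");

// Hypothetical helper that reports which path was compiled in. Real code
// would wrap the platform-specific operation itself in the same #if.
inline const char* cacheInvalidatePath() {
#if RCCL_USE_WBINVL1_VOL
  return "buffer_wbinvl1_vol";
#else
  return "fallback";
#endif
}
```

With no `-D` flag the result follows the architecture default; passing `-DRCCL_USE_WBINVL1_VOL=0` forces the fallback path in this sketch regardless of the target.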
[ROCm/rccl commit: b1a65afd58 ]
2023-06-06 16:45:50 -06:00
dependabot[bot]
184a6365a2
Bump rocm-docs-core from 0.11.0 to 0.13.3 in /docs/.sphinx ( #765 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.11.0 to 0.13.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.11.0...v0.13.3)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: a029d34628 ]
2023-06-06 16:45:15 -06:00
Pedram Alizadeh
a34679b2e6
resolving the pthread-gtest linking issue for rccl-UnitTests ( #768 )
...
[ROCm/rccl commit: 520f15e61b ]
2023-06-06 14:21:40 -04:00
Wenkai Du
90cbef7042
Add NCCL_NCHANNELS_PER_PEER override ( #767 )
...
Also fix topo_expl build issue
[ROCm/rccl commit: 3af90902c8 ]
2023-06-06 08:41:38 -07:00
Bertan Dogancay
a32eeae7b5
add DMA_BUF support ( #763 )
...
* add DMA_BUF support
* remove unused libraries in src/init.cc
* change NCCL_ALL to NCCL_INIT
* remove extra pointer functions in transport/net.cc
[ROCm/rccl commit: d52b6c0d24 ]
2023-06-01 12:46:42 -06:00
gilbertlee-amd
05dac01df5
Removing init_nvtx.cc from source list ( #762 )
...
[ROCm/rccl commit: c62aebe882 ]
2023-05-31 14:44:55 -06:00
Wenkai Du
61e30182b2
Rework barrier and event code ( #761 )
...
* Rework barrier and event code
* Switch to inline asm
[ROCm/rccl commit: 5a38ff192b ]
2023-05-31 13:36:51 -07:00
Nusrat Islam
f4d63d21ca
Merge pull request #757 from nusislam/unroll
...
device: change unroll factor
[ROCm/rccl commit: 7519ecb476 ]
2023-05-30 14:19:32 -05:00
akolliasAMD
fd64cb242d
added time results on npkit generator ( #749 )
...
[ROCm/rccl commit: 2b1efa9e9a ]
2023-05-30 12:57:25 -06:00
gilbertlee-amd
d2c1295f79
Refactoring CMakeFiles ( #755 )
...
[ROCm/rccl commit: 777d8747a5 ]
2023-05-25 16:08:54 -06:00
Nusrat Islam
325ff1dc11
device: change unroll factor
...
The default unroll factor is 2. Changing the unroll
factor to 4 provides better performance for most collectives.
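As a rough illustration of what an unroll factor controls, the sketch below processes `Unroll` elements per outer iteration and handles the remainder separately. The function `addInto` and its template parameter are hypothetical and written for host C++; they are not the rccl device code, and no performance claim is implied by the example itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element-wise accumulate with a compile-time unroll factor.
// The inner loop has a constant trip count, so a compiler can expand it
// into Unroll independent operations per outer iteration.
template <int Unroll>
void addInto(float* dst, const float* src, std::size_t n) {
  static_assert(Unroll > 0, "Unroll must be positive");
  assert(n == 0 || (dst != nullptr && src != nullptr));

  std::size_t i = 0;
  // Main body: full groups of Unroll elements.
  for (; i + Unroll <= n; i += Unroll) {
    for (int u = 0; u < Unroll; ++u) {
      dst[i + u] += src[i + u];
    }
  }
  // Tail: the leftover elements when n is not a multiple of Unroll.
  for (; i < n; ++i) {
    dst[i] += src[i];
  }
}
```

Calling `addInto<2>` and `addInto<4>` on the same input produces the same values; only the amount of work per loop iteration changes.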
[ROCm/rccl commit: 4d1cfb17c8 ]
2023-05-25 15:42:35 -05:00
akolliasAMD
f371afe6fd
updated install script to enable all of npkit ( #754 )
...
[ROCm/rccl commit: 58db1cb96d ]
2023-05-24 14:44:01 -06:00
Ziyue Yang
a7557cf7b0
revert npkit ( #748 )
...
[ROCm/rccl commit: 7d6e7bcd7d ]
2023-05-24 07:41:05 -07:00
Ziyue Yang
4430e4448f
Limit MSCCL reduce unrolling to pow-2 cases to shrink kernel size ( #746 )
...
[ROCm/rccl commit: ed252c30f4 ]
2023-05-19 11:46:36 -07:00
Ziyue Yang
73c6d51454
fix min, max and avg ( #745 )
...
[ROCm/rccl commit: 11676267b5 ]
2023-05-18 11:02:59 -07:00
Sam Wu
2707b6c6c1
Update documentation requirements ( #743 )
...
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
[ROCm/rccl commit: edb2ee1a44 ]
2023-05-18 10:52:35 -06:00
Wen-Heng (Jack) Chung
1bafe9c240
Merge pull request #742 from whchung/skip_done_event_msccl
...
Allow skipping doneEvent inside MSCCL.
[ROCm/rccl commit: eba4e9e100 ]
2023-05-18 10:17:20 -05:00
Wenkai Du
f119977daa
Fix merge error ( #744 )
...
[ROCm/rccl commit: 403cda6322 ]
2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung
932822fc46
Address review feedback and disable the flag by default.
...
[ROCm/rccl commit: ca4a1dfd67 ]
2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung
a7fbe68535
Skip doneEvent inside MSCCL by default.
...
Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default.
The env var controls whether doneEvent is used when invoking MSCCL
kernels.
Skipping doneEvent causes the firmware to skip the L2 cache flush,
which improves overall performance.
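A minimal sketch of how an integer environment variable with a default of 0 can gate such a feature is shown below. The helpers `envOrDefault` and `doneEventEnabled` are hypothetical names for illustration and are not how rccl parses its parameters; only the variable name comes from the message above.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: read an integer environment variable, returning
// `fallback` when it is unset, empty, or does not start with a number.
inline long envOrDefault(const char* name, long fallback) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') return fallback;

  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  return (end == value) ? fallback : parsed;
}

// Hypothetical gate: doneEvent is used only when the variable is non-zero.
inline bool doneEventEnabled() {
  return envOrDefault("RCCL_MSCCL_ENABLE_DONE_EVENT", 0) != 0;
}
```

Under this sketch, leaving the variable unset keeps doneEvent skipped, and exporting `RCCL_MSCCL_ENABLE_DONE_EVENT=1` turns it back on.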
[ROCm/rccl commit: 12dba425de ]
2023-05-17 16:49:42 +00:00
Wenkai Du
f7184aef3f
Revert "Ensure memory copy integrity during transport setup ( #731 )" ( #741 )
...
* Revert "Ensure memory copy integrity during transport setup (#731)"
This reverts commit 0c0c35927a.
Add stream synchronization in ncclStrongStreamRelease.
* Use event record and wait
[ROCm/rccl commit: 4ca7742c61 ]
2023-05-16 10:34:47 -07:00
akolliasAMD
a26ca6d766
added modified npkit_trace_generator.py to scripts ( #738 )
...
* added modified npkit_trace_generator.py to scripts
[ROCm/rccl commit: c88475462b ]
2023-05-09 10:11:35 -06:00
Wenkai Du
3c7cf5b110
Skip checking of some settings in Cray OS ( #739 )
...
[ROCm/rccl commit: 8bb3340fcb ]
2023-05-09 07:59:56 -07:00
Wenkai Du
d677c5fa7b
Merge pull request #735 from wenkaidu/nvls
...
Remove references to NVLS functions
[ROCm/rccl commit: b6542c9b82 ]
2023-05-05 15:09:58 -07:00
Wenkai Du
a8f43b496c
Remove references to NVLS functions
...
[ROCm/rccl commit: 897745a266 ]
2023-05-05 07:55:20 -07:00
Wenkai Du
757ce8c554
Merge pull request #732 from ROCmSoftwarePlatform/2.17.1
...
Sync up to NCCL 2.17.1
[ROCm/rccl commit: e21be4acaf ]
2023-05-02 08:30:36 -07:00
Wenkai Du
18562abdb2
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 53a1f91857 ]
2023-04-25 15:38:32 -07:00
Wenkai Du
0c0c35927a
Ensure memory copy integrity during transport setup ( #731 )
...
[ROCm/rccl commit: 36e453c61e ]
2023-04-25 14:41:43 -07:00
dependabot[bot]
b9f89624be
Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx ( #728 )
...
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 79df139f0b ]
2023-04-16 18:20:17 -06:00
Saad Rahim
99e407ba76
Standardizing documentation homepage message ( #726 )
...
[ROCm/rccl commit: a78ff46861 ]
2023-04-16 18:14:56 -06:00
David Addison
19e361cc45
Merge pull request #822 from KaimingOuyang/github/pytorch-hang-fix
...
Shutdown socket before close in ncclSocketClose()
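The shutdown-then-close ordering named in this change can be sketched with plain POSIX calls. The helper `closeSocketGracefully` is hypothetical and is not the body of ncclSocketClose(); the usual motivation for this ordering, assumed here, is that `shutdown()` ends the connection in both directions so that a peer or a thread blocked on the socket is notified before the descriptor is released.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: shut the connection down in both directions, then
// release the descriptor. Returns the result of close(), or -1 for an
// invalid descriptor.
inline int closeSocketGracefully(int fd) {
  if (fd < 0) return -1;

  // shutdown() can fail (for example with ENOTCONN on a socket that was
  // never connected); the descriptor must still be closed, so the result
  // is deliberately ignored in this sketch.
  (void)::shutdown(fd, SHUT_RDWR);
  return ::close(fd);
}
```

After the call, a peer reading from the other end of a stream socket sees end-of-file rather than blocking.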
[ROCm/rccl commit: 9b7d5edbfc ]
2023-04-14 19:52:45 -07:00
Wenkai Du
1af52edd6a
msccl: print stack and memory usage ( #723 )
...
* msccl: print stack and memory usage
* Update number of kernels calculation
[ROCm/rccl commit: 4b09ffba43 ]
2023-04-14 14:59:03 -07:00
Pedram Alizadeh
aeffdf872b
Disabled hipgraph tests! ( #725 )
...
[ROCm/rccl commit: 53c1c38f0e ]
2023-04-13 17:42:05 -04:00
Kaiming Ouyang
5eba5f178d
Add a comment to shutdown() in ncclSocketClose
...
[ROCm/rccl commit: 006b6bc7dc ]
2023-04-13 09:13:44 -07:00
Kaiming Ouyang
e35b05a872
Shutdown socket before close in ncclSocketClose()
...
[ROCm/rccl commit: 367e9b61c3 ]
2023-04-13 09:11:52 -07:00
Ziyue Yang
29c3da724c
MSCCL: Fix memcpy bug ( #721 )
...
[ROCm/rccl commit: 7289c05146 ]
2023-04-11 14:46:53 -07:00
Sam Wu
382306e2e8
pin rocm-docs-core and add dependabot config ( #722 )
...
[ROCm/rccl commit: dc149a9fbd ]
2023-04-11 10:01:24 -06:00
akolliasAMD
af8c2194a7
reduced the number of child processes to active ones ( #720 )
...
[ROCm/rccl commit: 2ce7d971e5 ]
2023-04-11 08:59:56 -06:00
gilbertlee-amd
ff2c1c5d0f
Unit test performance refactor ( #700 )
...
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests
[ROCm/rccl commit: 27e0cb43c2 ]
2023-04-06 12:28:53 -06:00
akolliasAMD
7c15eeb38a
added npkit_enable on CI tests ( #698 )
...
[ROCm/rccl commit: 9fe5a349f1 ]
2023-04-05 08:05:23 -06:00
Wenkai Du
6f3d035396
rccl-prim-test: minor update ( #718 )
...
[ROCm/rccl commit: addbf4bd90 ]
2023-04-03 07:30:04 -07:00
Ziyue Yang
aa605b0989
fix msccl stream usage ( #717 )
...
[ROCm/rccl commit: c8e33b1232 ]
2023-03-24 10:59:36 -07:00
akolliasAMD
1025c41f7f
fixed jenkins LD_LIBRARY_PATH ( #714 )
...
[ROCm/rccl commit: caf7a3d47d ]
2023-03-21 11:05:43 -06:00
gilbertlee-amd
b859549866
Adding interactive mode for unit tests (UT_INTERACTIVE) ( #715 )
...
[ROCm/rccl commit: 00c3d8d850 ]
2023-03-21 10:58:24 -06:00
akolliasAMD
9daf0bc3d1
Test Fixes ( #710 )
...
* splitting CI tests to run SP first and MP second
* set device before hipStreamSynchronize on tests
[ROCm/rccl commit: 9a0d4a07a6 ]
2023-03-21 08:48:39 -06:00
Wenkai Du
5d9c3b0277
Fix unit test HIP graph error ( #712 )
...
[ROCm/rccl commit: b02fd04165 ]
2023-03-20 15:34:09 -07:00
Sam Wu
1e1be0c808
Fix Docs static analysis ( #708 )
...
* remove ref to old script in static analysis
* update README with doc build instructions
update gitignore with doc artifacts
[ROCm/rccl commit: 8c56e6c892 ]
2023-03-16 13:12:43 -06:00
Ziyue Yang
f7f669e7f0
MSCCL: Improve executor and integrate scheduler ( #694 )
...
* MSCCL: improve executor and add scheduler for testing
* Use external scheduler
* Fix cmake error
* Address comments
* Fix thread safety issue
* Make MSCCL lifecycle APIs thread safe
* Make MSCCL internal scheduler aware of topology hint
* Revise error message
[ROCm/rccl commit: e3b2342f39 ]
2023-03-14 14:34:25 -07:00
Saad Rahim
8fdc4795fd
Standard template implementation ( #703 )
...
[ROCm/rccl commit: 6e48e518d9 ]
2023-03-13 11:00:57 -06:00
Wenkai Du
216c83c39d
Fix XGMI detection ( #699 )
...
* Fix XGMI detection
* Increase stack size
* Temporarily disable signal handler in CI
[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
[ROCm/rccl commit: 22b81fbaae ]
2023-03-08 14:08:07 -08:00