Commit Graph

1013 Commits

Author SHA1 Message Date
Wen-Heng (Jack) Chung 1bafe9c240 Merge pull request #742 from whchung/skip_done_event_msccl
Allow skipping doneEvent inside MSCCL.

[ROCm/rccl commit: eba4e9e100]
2023-05-18 10:17:20 -05:00
Wenkai Du f119977daa Fix merge error (#744)
[ROCm/rccl commit: 403cda6322]
2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung 932822fc46 Address review feedback and make the flag disabled by default.
[ROCm/rccl commit: ca4a1dfd67]
2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung a7fbe68535 Skip doneEvent inside MSCCL by default.
Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default.

The env var controls whether doneEvent is used when invoking MSCCL
kernels.

Skipping doneEvent would cause the firmware to skip L2 cache flush,
resulting in overall performance improvement.


[ROCm/rccl commit: 12dba425de]
2023-05-17 16:49:42 +00:00
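
The commit above describes an environment variable, RCCL_MSCCL_ENABLE_DONE_EVENT, that defaults to 0. As an illustration only, here is a minimal sketch of how such a default-off flag could be read in C++. The helper names envFlagEnabled and useDoneEvent are hypothetical and are not taken from the RCCL sources, which may parse and cache the value differently.

```cpp
#include <cstdlib>

// Hypothetical helper (not RCCL code): returns true only when the variable
// holds a non-zero integer. Unset, empty, or non-numeric values fall back
// to the supplied default.
static bool envFlagEnabled(const char* name, bool defaultValue = false) {
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') return defaultValue;
  char* end = nullptr;
  long parsed = std::strtol(value, &end, 10);
  if (end == value) return defaultValue;  // no digits were consumed
  return parsed != 0;
}

// Hypothetical wrapper: doneEvent is skipped unless the user opts in,
// matching the "0 by default" behaviour stated in the commit message.
static bool useDoneEvent() {
  return envFlagEnabled("RCCL_MSCCL_ENABLE_DONE_EVENT");
}
```

Under this sketch, exporting RCCL_MSCCL_ENABLE_DONE_EVENT=1 before launching the application would turn the flag on, and leaving it unset keeps it off.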
Wenkai Du f7184aef3f Revert "Ensure memory copy integrity during transport setup (#731)" (#741)
* Revert "Ensure memory copy integrity during transport setup (#731)"

This reverts commit 0c0c35927a.

Add stream synchronization in ncclStrongStreamRelease.

* Use event record and wait

[ROCm/rccl commit: 4ca7742c61]
2023-05-16 10:34:47 -07:00
akolliasAMD a26ca6d766 added modified npkit_trace_generator.py to scripts (#738)
* added modified npkit_trace_generator.py to scripts

[ROCm/rccl commit: c88475462b]
2023-05-09 10:11:35 -06:00
Wenkai Du 3c7cf5b110 Skip checking of some settings in Cray OS (#739)
[ROCm/rccl commit: 8bb3340fcb]
2023-05-09 07:59:56 -07:00
Wenkai Du d677c5fa7b Merge pull request #735 from wenkaidu/nvls
Remove references to NVLS functions

[ROCm/rccl commit: b6542c9b82]
2023-05-05 15:09:58 -07:00
Wenkai Du a8f43b496c Remove references to NVLS functions
[ROCm/rccl commit: 897745a266]
2023-05-05 07:55:20 -07:00
Wenkai Du 757ce8c554 Merge pull request #732 from ROCmSoftwarePlatform/2.17.1
Sync up to NCCL 2.17.1

[ROCm/rccl commit: e21be4acaf]
2023-05-02 08:30:36 -07:00
Wenkai Du 18562abdb2 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 53a1f91857]
2023-04-25 15:38:32 -07:00
Wenkai Du 0c0c35927a Ensure memory copy integrity during transport setup (#731)
[ROCm/rccl commit: 36e453c61e]
2023-04-25 14:41:43 -07:00
dependabot[bot] b9f89624be Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx (#728)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 79df139f0b]
2023-04-16 18:20:17 -06:00
Saad Rahim 99e407ba76 Standardizing documentation homepage message (#726)
[ROCm/rccl commit: a78ff46861]
2023-04-16 18:14:56 -06:00
David Addison 19e361cc45 Merge pull request #822 from KaimingOuyang/github/pytorch-hang-fix
Shutdown socket before close in ncclSocketClose()

[ROCm/rccl commit: 9b7d5edbfc]
2023-04-14 19:52:45 -07:00
Wenkai Du 1af52edd6a msccl: print stack and memory usage (#723)
* msccl: print stack and memory usage

* Update number of kernels calculation

[ROCm/rccl commit: 4b09ffba43]
2023-04-14 14:59:03 -07:00
Pedram Alizadeh aeffdf872b Disabled hipgraph tests! (#725)
[ROCm/rccl commit: 53c1c38f0e]
2023-04-13 17:42:05 -04:00
Kaiming Ouyang 5eba5f178d Add a comment to shutdown() in ncclSocketClose
[ROCm/rccl commit: 006b6bc7dc]
2023-04-13 09:13:44 -07:00
Kaiming Ouyang e35b05a872 Shutdown socket before close in ncclSocketClose()
[ROCm/rccl commit: 367e9b61c3]
2023-04-13 09:11:52 -07:00
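
The commit above changes ncclSocketClose() to shut the socket down before closing it. The general POSIX pattern can be sketched as follows. The function name shutdownThenClose is hypothetical, and this is not the actual ncclSocketClose() implementation, which operates on its own socket structure and error codes.

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not the RCCL/NCCL implementation): disable further
// sends and receives on the socket, then release the file descriptor.
// Returns 0 on success and -1 if the descriptor is invalid or close() fails.
static int shutdownThenClose(int fd) {
  if (fd < 0) return -1;
  // shutdown() may fail (for example with ENOTCONN on an unconnected
  // socket); the descriptor still has to be closed, so the result is ignored.
  (void)shutdown(fd, SHUT_RDWR);
  return close(fd);
}
```

After the call, the peer of a connected stream socket sees end-of-file on its next read.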
Ziyue Yang 29c3da724c MSCCL: Fix memcpy bug (#721)
[ROCm/rccl commit: 7289c05146]
2023-04-11 14:46:53 -07:00
Sam Wu 382306e2e8 pin rocm-docs-core and add dependabot config (#722)
[ROCm/rccl commit: dc149a9fbd]
2023-04-11 10:01:24 -06:00
akolliasAMD af8c2194a7 reduced the number of child processes to active ones (#720)
[ROCm/rccl commit: 2ce7d971e5]
2023-04-11 08:59:56 -06:00
gilbertlee-amd ff2c1c5d0f Unit test performance refactor (#700)
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests

[ROCm/rccl commit: 27e0cb43c2]
2023-04-06 12:28:53 -06:00
akolliasAMD 7c15eeb38a added npkit_enable on CI tests (#698)
[ROCm/rccl commit: 9fe5a349f1]
2023-04-05 08:05:23 -06:00
Wenkai Du 6f3d035396 rccl-prim-test: minor update (#718)
[ROCm/rccl commit: addbf4bd90]
2023-04-03 07:30:04 -07:00
Ziyue Yang aa605b0989 fix msccl stream usage (#717)
[ROCm/rccl commit: c8e33b1232]
2023-03-24 10:59:36 -07:00
akolliasAMD 1025c41f7f fixed jenkins LD_LIBRARY_PATH (#714)
[ROCm/rccl commit: caf7a3d47d]
2023-03-21 11:05:43 -06:00
gilbertlee-amd b859549866 Adding interactive mode for unit tests (UT_INTERACTIVE) (#715)
[ROCm/rccl commit: 00c3d8d850]
2023-03-21 10:58:24 -06:00
akolliasAMD 9daf0bc3d1 Test Fixes (#710)
* splitting CI tests in running SP first and MP second
* set device before hipStreamSynchronize on tests

[ROCm/rccl commit: 9a0d4a07a6]
2023-03-21 08:48:39 -06:00
Wenkai Du 5d9c3b0277 Fix unit test HIP graph error (#712)
[ROCm/rccl commit: b02fd04165]
2023-03-20 15:34:09 -07:00
Sam Wu 1e1be0c808 Fix Docs static analysis (#708)
* remove ref to old script in static analysis

* update README with doc build instructions

update gitignore with doc artifacts

[ROCm/rccl commit: 8c56e6c892]
2023-03-16 13:12:43 -06:00
Ziyue Yang f7f669e7f0 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message

[ROCm/rccl commit: e3b2342f39]
2023-03-14 14:34:25 -07:00
Saad Rahim 8fdc4795fd Standard template implementation (#703)
[ROCm/rccl commit: 6e48e518d9]
2023-03-13 11:00:57 -06:00
Wenkai Du 216c83c39d Fix XGMI detection (#699)
* Fix XGMI detection

* Increase stack size

* Temporarily disable signal handler in CI

[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)

[ROCm/rccl commit: 22b81fbaae]
2023-03-08 14:08:07 -08:00
Wenkai Du 62f5e6a82f Warn user on incorrect system settings (#696)
* Warn user on incorrect system settings

* Fix typo

* Add possible impact

* Ignore iommu settings in VM

[ROCm/rccl commit: 79a2031951]
2023-03-06 08:17:06 -08:00
Sylvain Jeaugey 8dcf8e8720 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.


[ROCm/rccl commit: 5d3ab08b69]
2023-03-01 00:39:04 -08:00
gilbertlee-amd 0da1d6a6cd Multi stream unit test (#693)
* Adding multi-stream support to unit tests

[ROCm/rccl commit: 80ed608a9d]
2023-02-23 13:28:50 -07:00
Wenkai Du 8aaf6f8e0d Merge pull request #685 from ROCmSoftwarePlatform/2.16.5
Sync up to NCCL 2.16.5

[ROCm/rccl commit: d601c4909c]
2023-02-22 10:29:02 -08:00
gilbertlee-amd 80ce7507c7 Adding UnitTest timing summary (UT_SHOW_TIMING) (#692)
[ROCm/rccl commit: f63d3b1978]
2023-02-22 08:57:13 -07:00
akolliasAMD 3b6b34705a UnitTests: made reduceScatter run fewer tests (#691)
[ROCm/rccl commit: d119c0886e]
2023-02-21 16:21:24 -07:00
Wenkai Du 1f42d485f8 Fix P2P scheduling (#690)
[ROCm/rccl commit: 86e7b71234]
2023-02-21 07:49:54 -08:00
gilbertlee-amd 7d860d3642 Unit test fail check (#689)
* Adding fall-through on unit test failure

* Workaround for hipGraph validity check issue

[ROCm/rccl commit: a640c6983f]
2023-02-18 08:50:46 -08:00
Wenkai Du 393d0ba7f8 Add back __syncthreads() in barrier and adjust stack size (#688)
[ROCm/rccl commit: 1c166046a2]
2023-02-18 08:50:31 -08:00
Ziyue Yang 7c1290f995 NPKit: improve clock calibration and fix GPU clock API (#683)
* Improve clock calibration in NPKit

* Improve gfx macro

* Fix macro

[ROCm/rccl commit: f4bf47f325]
2023-02-17 12:26:57 -07:00
Wenkai Du 0c7a94c462 Merge remote-tracking branch 'nccl/master' into HEAD
[ROCm/rccl commit: aee7b42bb8]
2023-02-14 17:14:13 -08:00
Wenkai Du 4fb1ebcf4b Remove workaround and use indirect function call (#684)
[ROCm/rccl commit: f7a456122c]
2023-02-14 13:59:48 -08:00
Wenkai Du cb7e2e8eeb Merge pull request #681 from wenkaidu/gfx9
Add HIP event optimization and remove special code for gfx90a

[ROCm/rccl commit: 9461a43168]
2023-02-13 08:04:59 -08:00
Pedram Alizadeh 29a0f74de1 Adding -pthread flag into CMakeLists.txt (#682)
Adding -pthread flag for linking issues into CMakeLists.txt

[ROCm/rccl commit: f525b8e1e6]
2023-02-10 17:22:30 -05:00
Pedram Alizadeh 93fa62b654 Updated RCCL introduction at docs/source/library.rst (#680)
* Updated RCCL introduction at docs/source/library.rst

* space after period

---------

Co-authored-by: Saad Rahim <44449863+saadrahim@users.noreply.github.com>

[ROCm/rccl commit: e05560ea82]
2023-02-10 17:20:54 -05:00
Wenkai Du 8a0d254c69 Add HIP event optimization and remove special code for gfx90a
[ROCm/rccl commit: 39534e8724]
2023-02-10 16:46:01 +00:00