Commit graph

1029 commits

Author SHA1 Message Date
Cory Bloor 8df2b752cd Fix build on additional architectures (#740)
* Fix build on additional architectures

Instead of directly wrapping a platform-specific operation with a
preprocessor check against a gfx macro, it can be more flexible to
check a macro that can be overridden by the user. The gfx macro can then
just provide the default value for the macro, resulting in the same
default behaviour as if the gfx macro was checked directly but with
more control at build-time.

For example, to build rccl without using buffer_wbinvl1_vol on
gfx902, but still use the default on other archs, a user could
export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=0' before
configuring the build. This flexibility isn't always necessary, but
it's nicer to have it and not need it than to need it and not have it.

* Define WARP_SIZE using warpSize builtin

[ROCm/rccl commit: b1a65afd58]
2023-06-06 16:45:50 -06:00
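The overridable-macro pattern described in commit 8df2b752cd could look roughly like the minimal sketch below. Only RCCL_USE_WBINVL1_VOL comes from the commit message; the __gfx902__ check, the default values, and the helper usesWbinvl1Vol() are assumptions made for illustration and are not taken from the RCCL source.

```cpp
#include <cassert>

// A user-supplied definition (e.g. -DRCCL_USE_WBINVL1_VOL=0) wins.
// Only when the macro is undefined does the architecture check pick a default.
#ifndef RCCL_USE_WBINVL1_VOL
#  if defined(__gfx902__)  // assumption: stand-in for the gfx macros the real code checks
#    define RCCL_USE_WBINVL1_VOL 1
#  else
#    define RCCL_USE_WBINVL1_VOL 0
#  endif
#endif

// Hypothetical helper: call sites test the overridable macro,
// never the gfx macro directly.
inline bool usesWbinvl1Vol() {
#if RCCL_USE_WBINVL1_VOL
  return true;
#else
  return false;
#endif
}
```

With this layout, the default behaviour is the same as checking the gfx macro directly, while a per-architecture compiler flag can still flip the choice at build time.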
dependabot[bot] 184a6365a2 Bump rocm-docs-core from 0.11.0 to 0.13.3 in /docs/.sphinx (#765)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.11.0 to 0.13.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.11.0...v0.13.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: a029d34628]
2023-06-06 16:45:15 -06:00
Pedram Alizadeh a34679b2e6 resolving the pthread-gtest linking issue for rccl-UnitTests (#768)
[ROCm/rccl commit: 520f15e61b]
2023-06-06 14:21:40 -04:00
Wenkai Du 90cbef7042 Add NCCL_NCHANNELS_PER_PEER override (#767)
Also fix topol_expl build issue

[ROCm/rccl commit: 3af90902c8]
2023-06-06 08:41:38 -07:00
Bertan Dogancay a32eeae7b5 add DMA_BUF support (#763)
* add DMA_BUF support

* remove unused libraries in src/init.cc

* change NCCL_ALL to NCCL_INIT

* remove extra pointer functions in transport/net.cc

[ROCm/rccl commit: d52b6c0d24]
2023-06-01 12:46:42 -06:00
gilbertlee-amd 05dac01df5 Removing init_nvtx.cc from source list (#762)
[ROCm/rccl commit: c62aebe882]
2023-05-31 14:44:55 -06:00
Wenkai Du 61e30182b2 Rework barrier and event code (#761)
* Rework barrier and event code

* Switch to inline asm

[ROCm/rccl commit: 5a38ff192b]
2023-05-31 13:36:51 -07:00
Nusrat Islam f4d63d21ca Merge pull request #757 from nusislam/unroll
device: change unroll factor

[ROCm/rccl commit: 7519ecb476]
2023-05-30 14:19:32 -05:00
akolliasAMD fd64cb242d added time results on npkit generator (#749)
[ROCm/rccl commit: 2b1efa9e9a]
2023-05-30 12:57:25 -06:00
gilbertlee-amd d2c1295f79 Refactoring CMakeFiles (#755)
[ROCm/rccl commit: 777d8747a5]
2023-05-25 16:08:54 -06:00
Nusrat Islam 325ff1dc11 device: change unroll factor
The default unroll factor is 2. Changing it to 4 improves
performance for most collectives.

[ROCm/rccl commit: 4d1cfb17c8]
2023-05-25 15:42:35 -05:00
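As a rough illustration of what an unroll factor controls in commit 325ff1dc11, the sketch below processes Unroll elements per loop iteration. The function sumUnrolled() is hypothetical host-side code, not an RCCL device kernel, and it makes no claim about which factor is faster.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not RCCL code: sums n floats, handling Unroll
// elements per loop iteration with one independent accumulator each.
template <int Unroll>
float sumUnrolled(const float* data, std::size_t n) {
  static_assert(Unroll > 0, "Unroll must be positive");
  float acc[Unroll] = {};
  std::size_t i = 0;
  // Main loop: the inner bound is a compile-time constant, so the
  // compiler can flatten it into Unroll straight-line additions.
  for (; i + Unroll <= n; i += Unroll) {
    for (int u = 0; u < Unroll; ++u) acc[u] += data[i + u];
  }
  // Tail loop: leftover elements when n is not a multiple of Unroll.
  float total = 0.0f;
  for (; i < n; ++i) total += data[i];
  for (int u = 0; u < Unroll; ++u) total += acc[u];
  return total;
}
```

Calling sumUnrolled<2>(v, n) and sumUnrolled<4>(v, n) visits the same elements; only the amount of work per iteration differs. A larger factor trades code size and register use for fewer loop iterations, so the best value depends on the kernel and hardware.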
akolliasAMD f371afe6fd updated install script to enable all of npkit (#754)
[ROCm/rccl commit: 58db1cb96d]
2023-05-24 14:44:01 -06:00
Ziyue Yang a7557cf7b0 revert npkit (#748)
[ROCm/rccl commit: 7d6e7bcd7d]
2023-05-24 07:41:05 -07:00
Ziyue Yang 4430e4448f Limit MSCCL reduce unrolling to pow-2 cases to shrink kernel size (#746)
[ROCm/rccl commit: ed252c30f4]
2023-05-19 11:46:36 -07:00
Ziyue Yang 73c6d51454 fix min, max and avg (#745)
[ROCm/rccl commit: 11676267b5]
2023-05-18 11:02:59 -07:00
Sam Wu 2707b6c6c1 Update documentation requirements (#743)
Co-authored-by: samjwu <samjwu@users.noreply.github.com>

[ROCm/rccl commit: edb2ee1a44]
2023-05-18 10:52:35 -06:00
Wen-Heng (Jack) Chung 1bafe9c240 Merge pull request #742 from whchung/skip_done_event_msccl
Allow skipping doneEvent inside MSCCL.

[ROCm/rccl commit: eba4e9e100]
2023-05-18 10:17:20 -05:00
Wenkai Du f119977daa Fix merge error (#744)
[ROCm/rccl commit: 403cda6322]
2023-05-18 08:09:27 -07:00
Wen-Heng (Jack) Chung 932822fc46 Address review feedback and disable the flag by default.
[ROCm/rccl commit: ca4a1dfd67]
2023-05-17 17:50:25 +00:00
Wen-Heng (Jack) Chung a7fbe68535 Skip doneEvent inside MSCCL by default.
Added an RCCL_MSCCL_ENABLE_DONE_EVENT env var, set to 0 by default.

The env var controls whether doneEvent is used when invoking MSCCL
kernels.

Skipping doneEvent causes the firmware to skip the L2 cache flush,
resulting in an overall performance improvement.

[ROCm/rccl commit: 12dba425de]
2023-05-17 16:49:42 +00:00
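A default-off environment switch like the one in commit a7fbe68535 could be read as in the minimal sketch below. Only the variable name RCCL_MSCCL_ENABLE_DONE_EVENT and its default of 0 come from the commit message; envIntOr() and doneEventEnabled() are hypothetical helpers, and RCCL's own parameter handling may work differently.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper, not RCCL's own parameter machinery: returns the
// integer value of an environment variable, or fallback when it is
// unset, empty, or does not start with a number.
inline long envIntOr(const char* name, long fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  return (end == raw) ? fallback : value;
}

// doneEvent is used only when the variable is set to a non-zero value;
// the default of 0 matches the commit message.
inline bool doneEventEnabled() {
  return envIntOr("RCCL_MSCCL_ENABLE_DONE_EVENT", 0) != 0;
}
```

Under these assumptions, running with RCCL_MSCCL_ENABLE_DONE_EVENT=1 in the environment would turn doneEvent back on, and leaving the variable unset keeps it skipped.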
Wenkai Du f7184aef3f Revert "Ensure memory copy integrity during transport setup (#731)" (#741)
* Revert "Ensure memory copy integrity during transport setup (#731)"

This reverts commit 0c0c35927a.

Add stream synchronization in ncclStrongStreamRelease.

* Use event record and wait

[ROCm/rccl commit: 4ca7742c61]
2023-05-16 10:34:47 -07:00
akolliasAMD a26ca6d766 added modified npkit_trace_generator.py to scripts (#738)
* added modified npkit_trace_generator.py to scripts

[ROCm/rccl commit: c88475462b]
2023-05-09 10:11:35 -06:00
Wenkai Du 3c7cf5b110 Skip checking of some settings in Cray OS (#739)
[ROCm/rccl commit: 8bb3340fcb]
2023-05-09 07:59:56 -07:00
Wenkai Du d677c5fa7b Merge pull request #735 from wenkaidu/nvls
Remove references to NVLS functions

[ROCm/rccl commit: b6542c9b82]
2023-05-05 15:09:58 -07:00
Wenkai Du a8f43b496c Remove references to NVLS functions
[ROCm/rccl commit: 897745a266]
2023-05-05 07:55:20 -07:00
Wenkai Du 757ce8c554 Merge pull request #732 from ROCmSoftwarePlatform/2.17.1
Sync up to NCCL 2.17.1

[ROCm/rccl commit: e21be4acaf]
2023-05-02 08:30:36 -07:00
Wenkai Du 18562abdb2 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 53a1f91857]
2023-04-25 15:38:32 -07:00
Wenkai Du 0c0c35927a Ensure memory copy integrity during transport setup (#731)
[ROCm/rccl commit: 36e453c61e]
2023-04-25 14:41:43 -07:00
dependabot[bot] b9f89624be Bump rocm-docs-core from 0.2.0 to 0.5.0 in /docs/.sphinx (#728)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.2.0 to 0.5.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/commits/v0.5.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: 79df139f0b]
2023-04-16 18:20:17 -06:00
Saad Rahim 99e407ba76 Standardizing documentation homepage message (#726)
[ROCm/rccl commit: a78ff46861]
2023-04-16 18:14:56 -06:00
David Addison 19e361cc45 Merge pull request #822 from KaimingOuyang/github/pytorch-hang-fix
Shutdown socket before close in ncclSocketClose()

[ROCm/rccl commit: 9b7d5edbfc]
2023-04-14 19:52:45 -07:00
Wenkai Du 1af52edd6a msccl: print stack and memory usage (#723)
* msccl: print stack and memory usage

* Update number of kernels calculation

[ROCm/rccl commit: 4b09ffba43]
2023-04-14 14:59:03 -07:00
Pedram Alizadeh aeffdf872b Disabled hipgraph tests! (#725)
[ROCm/rccl commit: 53c1c38f0e]
2023-04-13 17:42:05 -04:00
Kaiming Ouyang 5eba5f178d Add a comment to shutdown() in ncclSocketClose
[ROCm/rccl commit: 006b6bc7dc]
2023-04-13 09:13:44 -07:00
Kaiming Ouyang e35b05a872 Shutdown socket before close in ncclSocketClose()
[ROCm/rccl commit: 367e9b61c3]
2023-04-13 09:11:52 -07:00
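The shutdown-before-close ordering named in commit e35b05a872 can be sketched with plain POSIX calls as below. shutdownThenClose() is a hypothetical helper, not the real ncclSocketClose(). The commit message does not state the reason for the change; a common motivation for this ordering is that shutdown() signals end-of-stream to the peer and to any thread still blocked on the socket, which close() alone may not do.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper, not the real ncclSocketClose(): shuts down both
// directions of the socket, then releases the descriptor.
inline int shutdownThenClose(int fd) {
  if (fd < 0) return -1;
  // Failures of shutdown() (for example ENOTCONN on an unconnected
  // socket) are deliberately ignored so that close() always runs.
  (void)::shutdown(fd, SHUT_RDWR);
  return ::close(fd);
}
```

After shutdownThenClose() on one end of a connected pair, a read() on the other end returns 0 (end of stream) instead of blocking.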
Ziyue Yang 29c3da724c MSCCL: Fix memcpy bug (#721)
[ROCm/rccl commit: 7289c05146]
2023-04-11 14:46:53 -07:00
Sam Wu 382306e2e8 pin rocm-docs-core and add dependabot config (#722)
[ROCm/rccl commit: dc149a9fbd]
2023-04-11 10:01:24 -06:00
akolliasAMD af8c2194a7 reduced the number of child processes to only the active ones (#720)
[ROCm/rccl commit: 2ce7d971e5]
2023-04-11 08:59:56 -06:00
gilbertlee-amd ff2c1c5d0f Unit test performance refactor (#700)
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests

[ROCm/rccl commit: 27e0cb43c2]
2023-04-06 12:28:53 -06:00
akolliasAMD 7c15eeb38a added npkit_enable on CI tests (#698)
[ROCm/rccl commit: 9fe5a349f1]
2023-04-05 08:05:23 -06:00
Wenkai Du 6f3d035396 rccl-prim-test: minor update (#718)
[ROCm/rccl commit: addbf4bd90]
2023-04-03 07:30:04 -07:00
Ziyue Yang aa605b0989 fix msccl stream usage (#717)
[ROCm/rccl commit: c8e33b1232]
2023-03-24 10:59:36 -07:00
akolliasAMD 1025c41f7f fixed jenkins LD_LIBRARY_PATH (#714)
[ROCm/rccl commit: caf7a3d47d]
2023-03-21 11:05:43 -06:00
gilbertlee-amd b859549866 Adding interactive mode for unit tests (UT_INTERACTIVE) (#715)
[ROCm/rccl commit: 00c3d8d850]
2023-03-21 10:58:24 -06:00
akolliasAMD 9daf0bc3d1 Test Fixes (#710)
* split CI tests to run SP first and MP second
* set device before hipStreamSynchronize on tests

[ROCm/rccl commit: 9a0d4a07a6]
2023-03-21 08:48:39 -06:00
Wenkai Du 5d9c3b0277 Fix unit test HIP graph error (#712)
[ROCm/rccl commit: b02fd04165]
2023-03-20 15:34:09 -07:00
Sam Wu 1e1be0c808 Fix Docs static analysis (#708)
* remove ref to old script in static analysis

* update README with doc build instructions

update gitignore with doc artifacts

[ROCm/rccl commit: 8c56e6c892]
2023-03-16 13:12:43 -06:00
Ziyue Yang f7f669e7f0 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message

[ROCm/rccl commit: e3b2342f39]
2023-03-14 14:34:25 -07:00
Saad Rahim 8fdc4795fd Standard template implementation (#703)
[ROCm/rccl commit: 6e48e518d9]
2023-03-13 11:00:57 -06:00
Wenkai Du 216c83c39d Fix XGMI detection (#699)
* Fix XGMI detection

* Increase stack size

* Temporarily disable signal handler in CI

[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)

[ROCm/rccl commit: 22b81fbaae]
2023-03-08 14:08:07 -08:00