Commit Graph

494 Commits

Author SHA1 Message Date
Wenkai Du 8bb3340fcb Skip checking of some settings in Cray OS (#739) 2023-05-09 07:59:56 -07:00
Wenkai Du 897745a266 Remove references to NVLS functions 2023-05-05 07:55:20 -07:00
Wenkai Du 53a1f91857 Merge remote-tracking branch 'nccl/master' into develop 2023-04-25 15:38:32 -07:00
Wenkai Du 36e453c61e Ensure memory copy integrity during transport setup (#731) 2023-04-25 14:41:43 -07:00
Wenkai Du 4b09ffba43 msccl: print stack and memory usage (#723)
* msccl: print stack and memory usage

* Update number of kernels calculation
2023-04-14 14:59:03 -07:00
Kaiming Ouyang 006b6bc7dc Add a comment to shutdown() in ncclSocketClose 2023-04-13 09:13:44 -07:00
Kaiming Ouyang 367e9b61c3 Shutdown socket before close in ncclSocketClose() 2023-04-13 09:11:52 -07:00
Ziyue Yang 7289c05146 MSCCL: Fix memcpy bug (#721) 2023-04-11 14:46:53 -07:00
Ziyue Yang c8e33b1232 fix msccl stream usage (#717) 2023-03-24 10:59:36 -07:00
Wenkai Du b02fd04165 Fix unit test HIP graph error (#712) 2023-03-20 15:34:09 -07:00
Ziyue Yang e3b2342f39 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message
2023-03-14 14:34:25 -07:00
Wenkai Du 22b81fbaae Fix XGMI detection (#699)
* Fix XGMI detection

* Increase stack size

* Temporarily disable signal handler in CI

[Process: 17281] Inside handler function signal: Segmentation fault (11)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
2023-03-08 14:08:07 -08:00
Wenkai Du 79a2031951 Warn user on incorrect system settings (#696)
* Warn user on incorrect system settings

* Fix typo

* Add possible impact

* Ignore iommu settings in VM
2023-03-06 08:17:06 -08:00
Sylvain Jeaugey 5d3ab08b69 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
2023-03-01 00:39:04 -08:00
Wenkai Du d601c4909c Merge pull request #685 from ROCmSoftwarePlatform/2.16.5
Sync up to NCCL 2.16.5
2023-02-22 10:29:02 -08:00
Wenkai Du 86e7b71234 Fix P2P scheduling (#690) 2023-02-21 07:49:54 -08:00
Wenkai Du 1c166046a2 Add back __syncthreads() in barrier and adjust stack size (#688) 2023-02-18 08:50:31 -08:00
Ziyue Yang f4bf47f325 NPKit: improve clock calibration and fix GPU clock API (#683)
* Improve clock calibration in NPKit

* Improve gfx macro

* Fix macro
2023-02-17 12:26:57 -07:00
Wenkai Du aee7b42bb8 Merge remote-tracking branch 'nccl/master' into HEAD 2023-02-14 17:14:13 -08:00
Wenkai Du f7a456122c Remove workaround and use indirect function call (#684) 2023-02-14 13:59:48 -08:00
Wenkai Du 39534e8724 Add HIP event optimization and remove special code for gfx90a 2023-02-10 16:46:01 +00:00
Wenkai Du e1cb45ff22 Merge remote-tracking branch 'nccl/master' into HEAD 2023-02-04 01:44:43 +00:00
Sylvain Jeaugey f3d5166783 2.16.5-1
Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit
2023-02-02 12:52:47 -08:00
Wenkai Du 2288e9ae80 Switch to hipLaunchHostFunc for HIP graph (#667) 2022-12-15 10:16:46 -08:00
Ziyue Yang adafc0f759 Add MSCCL Support (#658)
* Add MSCCL support

* Add alignment and message size checking

* Fix nRanks checking, in-place and out-of-place tests and group call handling

* Fix hipGraph unit test

* Change MSCCL init warning to INFO

* Revise license info
2022-12-12 15:51:04 -08:00
Wenkai Du b953544a59 Fix typo in detecting Intel platforms (#661) 2022-12-07 13:36:11 -08:00
akolliasAMD eca623df07 decreased warp size for gfx110x (#655) 2022-12-01 12:19:21 -07:00
Wenkai Du fb9938cffa Query DMABuf support through HSA runtime API (#654) 2022-11-30 08:53:03 -08:00
Sylvain Jeaugey 28189e2df8 2.16.2-1
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
2022-11-30 02:31:59 -08:00
Wenkai Du 9594bbee3b Adjust P2P channels on Intel platform (#653) 2022-11-29 13:57:10 -08:00
Wenkai Du 57764f8152 Fix incorrect rocm-smi ID conversion (#648) 2022-11-21 19:44:39 -08:00
Wenkai Du 9cb72a3d0f Fix collective trace timestamp format (#647) 2022-11-21 08:11:12 -08:00
Wenkai Du cf3c32a626 Fix typo in previous hipify change (#645) 2022-11-15 11:51:47 -08:00
Wenkai Du 562dd87036 Move hipify to cmake stage
Add minimal ROCm/HIP version requirements for Graph support
2022-11-14 18:10:45 +00:00
Wenkai Du 94ad7f6f51 Update tuning table and fix topo_expl 2022-11-07 18:24:24 +00:00
Wenkai Du 9a077e6947 Merge remote-tracking branch 'nccl/master' into develop 2022-11-03 21:17:42 +00:00
Wenkai Du 72ef100050 Fix P2P scheduling 2022-10-31 08:54:34 -07:00
Sylvain Jeaugey 2f4cb874ba Merge tag 'v2.15.5-1' 2022-10-25 01:15:22 -07:00
Sylvain Jeaugey cb111f764a 2.15.5-1
Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.
2022-10-25 00:55:55 -07:00
Wenkai Du 4f0e223db4 Merge remote-tracking branch 'nccl/master' into develop 2022-10-20 15:41:29 +00:00
Wenkai Du bc8ef779df Fix missing initialization due to merge error (#640) 2022-10-19 21:20:11 -07:00
Wenkai Du 9ddf0e0649 Support P2P with invisible devices (#636)
* Support P2P with invisible devices

* Update copyright year
2022-10-17 10:24:59 -07:00
Wenkai Du 9916a09818 Merge pull request #634 from yzygitzh/ziyyang/npkit-fix
Apply several fixes to NPKit
2022-10-17 08:01:24 -07:00
gilbertlee-amd ebb8b5bf63 Updating files for missing licenses (#637) 2022-10-14 13:49:16 -06:00
Ziyue Yang 7d6bbc19d4 apply npkit 2022-10-14 01:28:17 +00:00
Sylvain Jeaugey d128d62238 Merge tag 'v2.15.1-1' 2022-10-07 11:00:26 -07:00
John Bachan 2401f4a918 Fixes a double-free in the error path of ncclCommInitAll.
Fixes https://github.com/NVIDIA/nccl/issues/726
2022-10-03 17:12:32 -07:00
Edgar Gabriel e645b02cd8 introduce a hw topology aware bintree
for hayabusa architecture.
2022-10-03 15:26:21 +00:00
akolliasAMD ef71550738 Added new gpu targets (#631) 2022-09-29 14:53:55 -06:00
Wenkai Du a523b37ac7 Another threadfence and flags rework (#629) 2022-09-28 16:49:29 -07:00