Commit graph

673 commits

Author SHA1 Message Date
Andy li 6777e65c1d Enable fp8 support (#1101)
* initial checkin

* resolve cr comments

* resolve the build issue

* fix the data correctness issue

* update fp8 header file and update the unit test for fp8 support

* remove fp16 from fp8 headers

* fix ut issue and catch up the latest code from develop

* update according to cr comments

* update ut according to cr comments

* update num floats for each SumPostDiv from 4 to 6

* update fp8 header file name

* fix the typo
2024-03-08 15:17:53 -08:00
Wenkai Du ff951e607d Improve debug messages of memory allocations (#1107) 2024-03-08 10:55:10 -08:00
Wenkai Du 77615cce28 msccl: fix scratch memory allocation after API change (#1103) 2024-03-06 11:11:04 -08:00
Wenkai Du cbd955627e Add support for using contiguous for GPU direct RDMA (#1096)
Enabled by env var RCCL_NET_CONTIGUOUS_MEM=1
2024-02-29 10:06:43 -08:00
Wenkai Du df98a6957d Add another Rome model (#1095) 2024-02-28 10:46:05 -08:00
Bertan Dogancay b617aecc31 Implement ROCTX (#1094)
* Implement roctx
2024-02-27 15:46:15 -07:00
Wenkai Du 74f9e5db64 Add new GPU model (#1080) 2024-02-23 12:19:42 -08:00
Wenkai Du c5ab37211b Update RCCL/MSCCL work FIFO depth to 256K (#1091) 2024-02-21 17:15:11 -08:00
Bertan Dogancay b275ed0b56 LL128 check if all XGMI (#1089) 2024-02-21 09:41:40 -07:00
Bertan Dogancay 2fb12a9358 Merge pull request #1079 from BertanDogancay/2.19.4-sync
2.19.4 Sync
2024-02-16 09:50:11 -07:00
akolliasAMD bac57421c7 Allow bus id to be null (#1085)
* Allow bus id to be null
2024-02-15 16:36:51 -07:00
BertanDogancay 6f3310605c Disable unsupported ld/st instructions 2024-02-15 13:58:16 -08:00
BertanDogancay 76f83f95ab Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-15 13:37:14 -08:00
Wenkai Du 51003c9980 Use native half without conversion (#1083) 2024-02-13 16:57:34 -08:00
Wenkai Du 1f0af90206 Fix undefined symbol when nvtx is not enabled (#1082) 2024-02-13 14:03:43 -08:00
BertanDogancay 32cca51894 Fix docs 2024-02-11 22:32:55 -08:00
Wenkai Du d999d9ad21 Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-09 11:31:03 -06:00
Wenkai Du 5669b0d7b6 2.18.5 fix (#1077)
* Revert "Revert "2.18.5-1""

This reverts commit 767fde8210.

* Fix initial net device value
2024-02-09 09:18:38 -08:00
Bertan Dogancay 8a442faa12 Nvtx support (#1076)
* NVTX support
2024-02-08 14:08:24 -07:00
Wenkai Du 5257c753c5 msccl: use relaxed atomics on scratch buffer (#1075) 2024-02-08 12:09:56 -08:00
Wenkai Du 704c9ef0d1 Doubling P2P channels per peer on single node gfx94x only (#1074) 2024-02-07 14:05:57 -08:00
Wenkai Du 1d989f6524 Doubling P2P channels per peer on single node only (#1069) 2024-02-02 12:41:00 -08:00
BertanDogancay 12ac20ade5 Revert re-usage of connect and listen ports 2024-02-01 10:03:13 -08:00
BertanDogancay 00fdb1ef51 Clean up 2024-01-31 17:27:15 -08:00
BertanDogancay da85abab54 Fix stack size 2024-01-31 17:09:07 -08:00
Wenkai Du 95f87232c4 Fix transport merge 2024-01-31 17:35:12 -06:00
Wenkai Du 1a134b283b Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-01-31 11:53:10 -06:00
BertanDogancay 9ff53eeeae Merge remote-tracking branch 'nccl/master' into develop 2024-01-30 14:43:43 -08:00
Bertan Dogancay 01b359027b Include common.h in enqueue.cc instead (#1067) 2024-01-30 08:24:22 -08:00
Wenkai Du f7550d83b8 msccl: ensure memory coherence after data receive (#1062) 2024-01-30 08:22:50 -08:00
BertanDogancay 31ec5d5cb0 correct data type 2024-01-28 19:55:19 -08:00
Pedram Alizadeh ccfb35fa6d modifying the tuning table to improve the performance of allreduce for 8MB and 16MB for single-node MI300X (#1063) 2024-01-26 09:05:53 -05:00
Wenkai Du be8ef4367f colltrace: fix dropped trace messages (#1059)
* colltrace: fix dropped trace messages

* Remove extra space
2024-01-25 13:31:53 -08:00
Wenkai Du ffde530af5 Increase P2P channels per peer (#1060) 2024-01-25 11:21:58 -08:00
Wenkai Du 4aafb2a3c5 Fix sendrecv merge 2024-01-24 16:23:53 -08:00
BertanDogancay 81ddf9de89 Merge remote-tracking branch 'nccl/v2.19' into develop 2024-01-24 15:25:33 -08:00
Wenkai Du 7987015a19 Revert "msccl: build same number of kernels as in ROCm 5.7" (#1058)
This reverts commit f960174d03be7e5174baa83b256526d388a38842.
2024-01-24 08:43:50 -08:00
Bertan Dogancay 5564d65e71 Use binary search for direct function calls (#1057)
* Use binary search for direct function calls

* fix scratch mem issue on MI300
2024-01-22 17:37:56 -07:00
Bertan Dogancay c4dbf8a914 Fix collective trace when rccl is configured (#1056)
* Fix collective trace when rccl is configured
2024-01-22 09:26:44 -07:00
Wenkai Du 7e25d5bc55 Use new HIP graph API compatible with CUDA 11030 (#991)
* Use new HIP graph API compatible with CUDA 11030

* Update dependency to ROCm 6.1

* Fix single stream use case
2024-01-21 19:00:50 -08:00
Nilesh M Negi 8b97a20943 COLLECTIVES: Switch to unroll 2 for MI300 (#1051)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-01-19 12:16:05 -06:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Wenkai Du 3325f96c56 Only use full MAXCHANNELS for gfx94x (#1050) 2024-01-17 09:00:49 -08:00
Pedram Alizadeh b08124c85d adding rccl tuning parameters for MI300X gfx942 with 8 GPUs single and multi-node (#1047) 2024-01-16 13:44:32 -05:00
Wenkai Du 261707d90a Add option to force enable network transport on single node (#1046) 2024-01-16 07:54:18 -08:00
PedramAlizadeh 767fde8210 Revert "2.18.5-1"
This reverts commit 559b70f86c.
2024-01-12 16:54:19 +00:00
Bertan Dogancay cf248d9402 Addressing the compiler warning (#988) 2024-01-10 14:59:40 -07:00
Hossein Pourreza 735178c1fe cover more gpu/nic mapping cases (#1037) 2024-01-10 08:01:37 -08:00
Wenkai Du 5851ae5974 Re-enable L128 on gfx90a if compiler supports it (#1036) 2024-01-10 08:01:11 -08:00
Nilesh M Negi 249e9f7f65 Un-escaped character causes error with address sanitizer builds (#992)
Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Jenkins <jenkins-compute@amd.com>
2024-01-09 13:28:32 -06:00