Commit Graph

58 Commits

Author SHA1 Message Date
Nilesh M Negi 4e406acc43 [UT] Include iomanip if not defined (#1510)
* [UT] Include iomanip if not defined

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Remove include guards

`iomanip.h` has pre-defined include guards. These are not needed.

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-02-11 08:48:47 -07:00
isaki001 3398fa78fe non-hipGraph MSCCL++ tests for allReduce and allGather (#1503)
* working tests for a single message size

* move call_RCCL routine to StandaloneUtils, create .cpp file for StandaloneUtils so that it can be included in several tests

* simplify test invocation

* remove unnecessary logs and exit from ncclCommRegister

* set expected results for allGather

* skip test if nranks doesn't match number of gpus, call getAndDistributeNCCLid only from parent process

* fix improper size of expected-results vector

* Removing unused changes.

* Refactored to create a new file for the forked collectives call, as StandaloneUtils is for the Standalone tests. Renamed the functions to be slightly more accurate and follow existing naming conventions.

* Apply suggestions from code review

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: isaki001 <isakioti@banff-pla-r27-38.pla.dcgpu>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>
2025-02-04 09:11:32 -06:00
corey-derochie-amd c68b558ed5 Increased gfx90a stack size expectation to 320 to match latest compiler. (#1487) 2025-01-16 17:04:51 -07:00
mberenjk 39483c55f8 Initializing all ranks to the same value to avoid failure of UT AllR… (#1459)
* Initializing all ranks to the same value to avoid failure of UT AllReduce for FP8 type

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-01-02 11:39:02 -06:00
saurabhAMD 69b2b712ab GPU allocation for CPX Unit Tests using PCI bus id (#1403)
* mapping devices wrt pci

* Gpu allocation by using pci mapping

* Passing gpuPriorityOrder in as an argument rather than making the functions non-static.

* Removing redundant testBed instance calling
2024-11-04 10:51:00 -06:00
Bertan Dogancay 984f1e4343 Increase MAX_STACK_SIZE for UT (#1398) 2024-11-01 13:07:45 -04:00
Tim fd9924cfe7 Adjustment for UT Sendrecv (#1400)
Enabled UT sendrecv to same rank and refactor UBR call
2024-10-30 15:13:53 -04:00
saurabhAMD 4856309413 Making variable names consistent in EnvVars.cpp (#1327)
* Making variable names consistent in EnvVars.cpp
2024-09-11 09:23:31 -05:00
saurabhAMD 289a80c4e9 Enabling Unit Tests for CPX mode (#1324)
* Unit Tests for RCCL in CPX mode

* override pow2gpus set by cpx mode by user argument

* Adding comment for UT_POW2_GPUS

* Additional comment on why using pow2gpus for cpx mode.
2024-09-09 10:12:33 -05:00
Tim 757d1891e9 Update EnvVars.cpp 2024-09-04 16:55:36 -04:00
mberenjk db840f024e adding all nccl apis to api_support to enable rccl tracing by rocprofv3 (#1297)
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
2024-08-22 12:36:07 -05:00
Tim a4793286c7 Adding User Buffer Registration support for Unit test (#1199)
* Adding UBR support for UT SendRecv

Signed-off-by: Tim Hu <timhu102@amd.com>

* Update test/common/TestBedChild.cpp

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: Tim Hu <timhu102@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2024-07-30 13:39:25 -04:00
akolliasAMD c246e25f8e gfx12 Disable ll protocol (#1268) 2024-07-26 08:59:55 -06:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
corey-derochie-amd 0c36d571ea Enable multi-threading for MSCCL (#1203)
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
2024-07-04 09:34:38 -06:00
saurabhAMD e170f41ddd Unit Tests for testing channels (#1222) 2024-06-25 10:10:10 -05:00
saurabhAMD 392a73fdef enable UT to test with channels greater than 64 2024-06-13 13:54:08 -05:00
Bertan Dogancay 0ec41f1386 [UT] Start supporting multiple group calls and graphs (#1151)
* Start supporting multiple group calls UT
2024-04-25 11:11:16 -06:00
mberenjk 428837ffe4 replacing rccl_bfloat16 with hip_bfloat16 (#1126)
Co-authored-by: mberenjk <mberenjk@amd.com>
2024-04-11 11:30:37 -05:00
Andy li 6777e65c1d Enable fp8 support (#1101)
* initial checkin

* resolve cr comments

* resolve the build issue

* fix the data correctness issue

* update fp8 header file and update the unit test for fp8 support

* remove fp16 from fp8 headers

* fix ut issue and catch up the latest code from develop

* update according to cr comments

* update ut according to cr comments

* update num floats for each SumPostDiv from 4 to 6

* update fp8 header file name

* fix the typo
2024-03-08 15:17:53 -08:00
Tim 0d06b0f1de Adding FP16 cases to unit tests (#1093)
Signed-off-by: Tim Hu <timhu102@amd.com>
2024-02-26 12:08:04 -05:00
BertanDogancay b098120c40 Increase max stack size when ll128 enabled 2024-02-15 15:56:59 -08:00
Bertan Dogancay dc2d486ba0 Add stack size UT (#1081)
* Add stack size UT
2024-02-12 17:56:15 -07:00
Shilei Tian ba9f7917ba Add a constructor for PtrUnion in case it is not initialized explicitly (#1064) 2024-01-26 08:00:27 -08:00
Bertan Dogancay 28d9b170c9 [DEV] Configure functions in RCCL (#986)
* configure functions in rccl
2024-01-18 15:07:16 -07:00
Tim 05850e89f2 Relaxing default timeout limit, add error log (#1052)
Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-18 15:09:08 -05:00
Tim 9c0ef11ac7 Adding timeout functionality/EnvVar to TestBed (#1044)
* Adding timeout functionality/EnvVar to TestBed
* updating timeout unit to microseconds

Signed-off-by: Tim Hu <timhu102@amd.com>
2024-01-17 11:33:01 -05:00
Bertan Dogancay 9d11cd092f Add ncclCommSplit test (#852)
Add ncclSplitCommTest
2023-08-25 16:26:45 -06:00
Wenkai Du ce6a2ffac8 Merge pull request #782 from ROCmSoftwarePlatform/2.18.3
Sync up with NCCL 2.18.3
2023-06-29 15:04:16 -07:00
gilbertlee-amd f7c553edad Report unit test environment variable values as part of output (#789) 2023-06-29 07:13:05 -06:00
Wenkai Du abd0615351 Merge remote-tracking branch 'nccl/master' into develop 2023-06-26 22:51:56 +00:00
akolliasAMD 2ce7d971e5 reduced the number of child processes to active ones (#720) 2023-04-11 08:59:56 -06:00
gilbertlee-amd 27e0cb43c2 Unit test performance refactor (#700)
* Refactoring unit tests to improve performance
* Spawning child processes during InitComms instead of on TestBed construction
* Temporarily disabling graph unit tests
2023-04-06 12:28:53 -06:00
gilbertlee-amd 00c3d8d850 Adding interactive mode for unit tests (UT_INTERACTIVE) (#715) 2023-03-21 10:58:24 -06:00
akolliasAMD 9a0d4a07a6 Test Fixes (#710)
* splitting CI tests into running SP first and MP second
* set device before hipStreamSynchronize on tests
2023-03-21 08:48:39 -06:00
gilbertlee-amd 80ed608a9d Multi stream unit test (#693)
* Adding multi-stream support to unit tests
2023-02-23 13:28:50 -07:00
gilbertlee-amd f63d3b1978 Adding UnitTest timing summary (UT_SHOW_TIMING) (#692) 2023-02-22 08:57:13 -07:00
gilbertlee-amd a640c6983f Unit test fail check (#689)
* Adding fall-through on unit test failure

* Workaround for hipGraph validity check issue
2023-02-18 08:50:46 -08:00
gilbertlee-amd df46645ff8 Switching to relaxed capture for unit tests (#679) 2023-02-08 11:28:58 -07:00
Pedram Alizadeh fddb5e6be8 UnitTest: add test cases for 2.14 API (ncclCommInitRankConfig and ncclCommFinalize for non-blocking communicator) (#674) 2023-02-03 17:36:30 -05:00
Pedram Alizadeh fbe52b6caa removed the wrapper script so that the old name is no longer referenced (#676) 2023-01-31 11:11:02 -05:00
akolliasAMD 24aa8bd802 added a different way for getting device count, by running it in a child process (#665) 2022-12-14 16:10:14 -07:00
Pedram Alizadeh 54a3da04eb Revert "UnitTest: add test cases for 2.14 API (ncclCommInitRankConfig and ncclCommFinalize for non-blocking communicator) (#662)" (#666)
This reverts commit 8250092367.
2022-12-14 11:28:40 -05:00
PedramAlizadeh 45872d170f Changed the name of UnitTests to rccl-UnitTests (wrapper executable included). 2022-12-13 21:45:57 +00:00
Pedram Alizadeh 8250092367 UnitTest: add test cases for 2.14 API (ncclCommInitRankConfig and ncclCommFinalize for non-blocking communicator) (#662) 2022-12-13 16:05:09 -05:00
gilbertlee-amd faed69f9fc Graph unit tests (#656)
* Adding hipGraph unit tests
2022-12-01 10:28:42 -07:00
gilbertlee-amd ebb8b5bf63 Updating files for missing licenses (#637) 2022-10-14 13:49:16 -06:00
akolliasAMD 06bce9d0c9 added stream synch after hipMemset (#609) 2022-08-30 16:18:37 -06:00
Edgar Gabriel f6e00dec13 introduce support for ncclFloat16/half in UT 2022-08-24 15:28:24 +00:00
akolliasAMD 8b9291eb47 moved default number of max ranks per gpu to 1 2022-06-22 17:37:49 +00:00