Rahul Vaidya
0abe3c80bb
Ensure backward compatibility for fp8 datatypes ( #126 )
...
* Ensure backward compatibility for fp8 datatypes
Signed-off-by: ravaidya <ravaidya@amd.com >
* Update code comments
Signed-off-by: ravaidya <ravaidya@amd.com >
---------
Signed-off-by: ravaidya <ravaidya@amd.com >
2025-05-15 13:56:40 -05:00
mberenjk
4b2b635766
Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues ( #109 )
...
* addressing hip_fp8 support compatibility issue
* skipping mulsum and avg test for fp8, using hip_fp8 for product
* syncing with nccl-tests
removing the fp8 filter for pre-hopper gpus and resolving the merge conflict
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
2025-05-14 15:30:07 -05:00
Wenkai Du
cac33a8c2f
Automatically set in-place option from out-of-place ( #123 )
2025-05-09 16:48:42 -05:00
Nilesh M Negi
41b383a0d4
[BUILD] Add options to install script for compiler and GPU targets ( #121 )
...
* [BUILD] Add options to install script for compiler and GPU targets
* Fix GPU_TARGETS field and add option for custom ROCm path
* Check for ROCM_PATH
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2025-05-07 13:19:10 -05:00
Marius Brehler
5b27b961b2
Link Threads::Threads ( #119 )
...
`pthread.h` is included in `src/common.h`, but the threads library is not
linked, so the build fails at link time with unresolved symbols.
2025-04-29 16:18:51 -05:00
Nilesh M Negi
c96deb13cd
[BUILD] Fix rccl-tests version string for packaging ( #117 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2025-04-29 08:51:43 -05:00
Rahul Vaidya
a4fd8f4667
Fix build issues caused by 2.24.3 sync ( #118 )
2025-04-28 10:22:38 -05:00
Grant Pinkert
f611dbd49a
Fix message size logging ( #115 )
...
Previously, the logger recorded the number of bytes each node was expected to receive.
This differs from the stdout logging, where the reported message size is the total size of a message.
Signed-off-by: Grant Pinkert <gpinkert@amd.com >
2025-04-25 11:05:21 -05:00
Nilesh M Negi
83d38d91b6
Merge pull request #116 from nileshnegi/sync/nccl-tests/02-28-2025
...
[SYNC] NCCL-Tests v2.14.1
2025-04-21 19:53:35 -05:00
nileshnegi
5625599dda
Merge remote-tracking branch 'nccl-tests/master' into develop
2025-04-21 19:46:10 -05:00
mberenjk
5e838ad9df
skipping the prod test for FP8 types in reduce and reduce-scatter ( #111 )
...
* skipping the prod test for FP8 types in reduce and reduce-scatter
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
2025-04-15 09:38:33 -05:00
Alex Breslow
284ff2ac84
Add instructions to README regarding benchmarking on pre ROCm 6.4.x versions with HSA_NO_SCRATCH_RECLAIM=1 ( #114 )
2025-04-08 09:59:57 -07:00
David Addison
b4300cc79d
Add PCI domain and device ID for GPU device BDF display
2025-02-28 13:25:51 -08:00
Sylvain Jeaugey
903918fc54
Add NCCL_TESTS_SPLIT documentation in the README
2025-02-06 14:10:07 +01:00
Junyu Ma
a89cf07fe8
Perftests: Introduce NCCL_TESTS_SPLIT env
...
`NCCL_TESTS_SPLIT` provides a new way of computing the color used for splitting communicators.
It is overridden by `NCCL_TESTS_SPLIT_MASK`.
Examples:
NCCL_TESTS_SPLIT_MASK="0x7" # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
NCCL_TESTS_SPLIT="AND 0x7" # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7
NCCL_TESTS_SPLIT="MOD 72" # color = rank % 72. One GPU per NVLink domain on an NVL72 system.
NCCL_TESTS_SPLIT="DIV 72" # color = rank / 72. Intra NVLink domain on NVL72.
You can also use "%", "&", "|", and "/" as shorthand.
Extra spaces in the middle are ignored automatically.
Operator names are not case sensitive.
The following are all equivalent:
NCCL_TESTS_SPLIT="&0x7"
NCCL_TESTS_SPLIT="&0b111"
NCCL_TESTS_SPLIT="AND 7"
NCCL_TESTS_SPLIT="and 0x7"
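The color rule described above can be sketched in a few lines of Python. This is an illustration only, not the nccl-tests implementation: the function name `split_color` is hypothetical, the `OR` keyword is assumed to be the long form of "|", and the short forms are assumed to follow their usual C operator meanings ("%" for MOD, "&" for AND, "/" for DIV).

```python
import re

# Hypothetical operator table: long names and short forms map to the same function.
_SPLIT_OPS = {
    "AND": lambda rank, value: rank & value,
    "&": lambda rank, value: rank & value,
    "OR": lambda rank, value: rank | value,   # assumed long form of "|"
    "|": lambda rank, value: rank | value,
    "MOD": lambda rank, value: rank % value,
    "%": lambda rank, value: rank % value,
    "DIV": lambda rank, value: rank // value,
    "/": lambda rank, value: rank // value,
}

# Operator, optional spaces, then a decimal, 0x..., or 0b... literal.
_SPLIT_PATTERN = re.compile(r"\s*(AND|OR|MOD|DIV|[&|%/])\s*(\w+)\s*", re.IGNORECASE)


def split_color(spec, rank):
    """Return the communicator color for `rank` under a spec such as "MOD 72"."""
    match = _SPLIT_PATTERN.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    op = match.group(1).upper()        # operator names are case-insensitive
    value = int(match.group(2), 0)     # base 0 accepts 7, 0x7, and 0b111
    return _SPLIT_OPS[op](rank, value)


print(split_color("AND 0x7", 10))  # 10 & 7 -> 2
print(split_color("MOD 72", 100))  # 100 % 72 -> 28
print(split_color("DIV 72", 100))  # 100 // 72 -> 1
```

Ranks that produce the same color end up in the same communicator, so `MOD 72` groups every 72nd rank together while `DIV 72` groups each block of 72 consecutive ranks.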
2025-02-04 15:18:09 -08:00
David Addison
cb6a46fdd6
Update CUDA gencodes
...
Add support for Blackwell sm100 and sm120 from CUDA 12.8
Add support for Hopper sm90 from CUDA 12.0
2025-01-25 17:32:16 -08:00
Nilesh M Negi
448c4c7269
[GIT] Add CODEOWNERS and PR Template ( #102 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2025-01-16 17:05:48 -07:00
David Sidler
46152785f0
Use find_package for MPI ( #92 )
...
* Use find_package for MPI
* Minor fixes
2025-01-14 11:49:20 -06:00
David Sidler
959cc19920
Add option to output results to a file ( #93 )
...
* Use find_package for MPI
* Add functionality to output results to file
* fix compilation
* report num gpus
* Revert "Use find_package for MPI"
This reverts commit c8fa253724ef4d0beac0d9c72f968062fbc6908e.
* Change inplace key
* remove dependency on json library
* Print "ranks, ranksPerNode, gpusPerRank"
* Add "nodes" field
---------
Co-authored-by: nileshnegi <Nilesh.Negi@amd.com >
2025-01-13 17:28:29 -06:00
saurabhAMD
fc9917e0da
Updating to use hipDeviceMallocUncached ( #95 )
...
Use hipDeviceMallocUncached instead of hipDeviceMallocFinegrained on newer ROCm versions.
2025-01-11 23:25:24 -06:00
Mustafa Abduljabbar
f2a48983ae
Memset to fix inflated performance when GPU is reset ( #94 )
...
* Memset to fix inflated performance when GPU is reset
* use hipMemset for both memsets
2025-01-11 23:25:24 -06:00
Sam Wu
5c41a915c8
[CI] Clone rccl and build from tip of develop ( #99 )
...
- Set cron to weekly
- Remove unused properties
- Try rccl install as sudo
- Clear existing rccl repo
- Run install with sudo and env vars
- Fix path
- Add rccl to path
- Attempt to fix build and install of rccl during compile stage.
- Remove existing clone from workspace
- Fix path when install rccl
- Fix path for install rccl-tests
- Install rccl local only
- Set RCCL_DIR
- Build rccl and rccl-tests with cmake
- Add extra env vars
- Use installer instead of cmake for rccl
- Update .jenkins/common.groovy
- Get librccl.so from rccl/build/release
- Switching job command to build rccl and rccl-tests using install.sh because those work properly together.
2025-01-11 23:23:06 -06:00
Sam Wu
df26b32687
Remove precheckin steps from staticanalysis ( #101 )
2025-01-10 16:41:43 -07:00
Tim
f7a5df7fc4
hot fixing ncclMemFree for mscclpp ( #100 )
2025-01-09 12:03:52 -05:00
mberenjk
77ae744c18
removing FP8 product from allReduce test cases ( #97 )
...
* removing FP8 product from allReduce test cases
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
2025-01-06 14:05:38 -06:00
John Bachan
29f4114f02
Fixes to all tests that divide buffers by nranks so that they trim buffer sizes to be multiples of 16 bytes.
...
This ensures that runs with a non-power-of-two number of ranks have buffer addresses aligned suitably for performance.
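A minimal sketch of the trimming idea, assuming the per-rank size is simply rounded down to a multiple of 16 bytes. The helper name `per_rank_bytes` is hypothetical and the real tests may compute sizes differently.

```python
ALIGN_BYTES = 16  # alignment named in the commit message


def per_rank_bytes(total_bytes, nranks, align=ALIGN_BYTES):
    """Split a buffer across ranks, rounding each share down to a multiple of `align`."""
    share = total_bytes // nranks
    return share - share % align


# With 3 ranks, 1000 bytes would give 333 bytes each; trimming yields 320,
# so every rank's offset (rank * 320) stays 16-byte aligned.
size = per_rank_bytes(1000, 3)
offsets = [rank * size for rank in range(3)]
print(size, offsets)  # 320 [0, 320, 640]
```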
2024-12-18 11:20:28 -08:00
Sylvain Jeaugey
8dfeab9eb9
Merge pull request #259 from NVIDIA/fix-ncclstringtotype
...
Future-proof ncclstringtotype
2024-10-24 10:28:02 -07:00
Kamil Iskra
34d6d53910
Future-proof ncclstringtotype
...
Ensure that ncclstringtotype iterates only over data types known to
nccl-tests (as indicated by test_typenum), not over a potentially larger
set of all NCCL types.
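The bounded lookup can be illustrated as follows. This is a sketch under stated assumptions, not the actual function: the type names and the `LIBRARY_NUM_TYPES` value are made up, and only the idea of looping up to the test suite's own type count comes from the message above.

```python
# Hypothetical table of types the test suite knows about.
TEST_TYPENAMES = ["int8", "uint8", "int32", "float"]
TEST_TYPENUM = len(TEST_TYPENAMES)

# Hypothetical: the library may define more types than the tests have names for.
LIBRARY_NUM_TYPES = 6


def string_to_type(name):
    """Return the index of `name` among the types known to the tests, or -1."""
    # Looping to LIBRARY_NUM_TYPES instead would index past TEST_TYPENAMES.
    for index in range(TEST_TYPENUM):
        if TEST_TYPENAMES[index] == name:
            return index
    return -1


print(string_to_type("float"))    # 3
print(string_to_type("unknown"))  # -1
```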
2024-10-24 09:21:37 -07:00
Tim
ae3e6357cb
Scaling tests to #ngpus ( #81 )
...
* scaling tests to #ngpus
Signed-off-by: AtlantaPepsi <hyj1999110@gmail.com >
* switching to rocminfo
---------
Signed-off-by: AtlantaPepsi <hyj1999110@gmail.com >
2024-09-10 19:05:22 -04:00
Tim
52aee698fa
Merge pull request #86 from AtlantaPepsi/UBR_merge
...
Registered Buffer option from nccl-tests merged
2024-07-31 11:05:49 -04:00
AtlantaPepsi
71355df959
Fixing typo in readme
...
Signed-off-by: AtlantaPepsi <timhu102@amd.com >
2024-07-31 14:59:47 +00:00
AtlantaPepsi
afd5ca10ae
Merge -R option for memory allocation
...
Signed-off-by: AtlantaPepsi <timhu102@amd.com >
2024-07-31 14:57:20 +00:00
David Addison
9d26b8422b
Merge pull request #226 from netgroup/master
...
improve parsing of stepbytes (increment size) argument
2024-07-30 14:58:54 -07:00
David Addison
0d86b5a6e7
Added some missing command line options to README.md
...
Also updated single and multi-node examples.
2024-07-30 14:50:45 -07:00
David Addison
d2d40cc824
Added -N,--run_cycles option
2024-07-25 22:00:23 -07:00
David Addison
3a3f790efd
Merge pull request #240 from OrenLeung/patch-1
...
doc: add all2all factor
2024-07-25 22:00:06 -07:00
Oren
c6eb15875f
doc: add all2all factor
2024-07-24 22:55:00 -04:00
Nilesh M Negi
e635e9c9be
[CI] Add static analysis CI ( #85 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
2024-07-23 22:21:26 -05:00
Rahul Vaidya
c5cae38bb8
Fix --root all issue. ( #83 )
...
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
2024-06-14 11:46:08 -05:00
Stefano Salsano
746549b28d
improve parsing of stepbytes (increment size) argument
2024-06-14 11:28:55 +02:00
Kaiming Ouyang
d028efcf35
Change ncclCommRegister size to maxBytes in serial comm init
2024-06-06 06:54:48 -07:00
saurabhAMD
073d56f6e2
Merge pull request #79 from saurabhAMD/rotating_tensor
...
Rotating tensor -R (default:off)
2024-06-04 17:51:26 -05:00
saurabhAMD
36a2c372ac
Rotating tensor -R (default:off)
2024-06-04 11:35:39 -05:00
Giuseppe Congiu
a1efb427e7
Add -R option to register user buffers
2024-06-03 01:04:58 -07:00
saurabhAMD
6378438c27
Merge pull request #76 from saurabhAMD/develop
...
Enable cache flush after every -F iteration. Default : 0 (No cache flush)
2024-05-13 10:41:31 -05:00
saurabhAMD
74c4177f58
updating cache flush on functionality
2024-05-10 08:46:13 -07:00
saurabhAMD
699478dadf
Enable cache flush after every -F iteration. Default : 0 (No cache flush)
2024-05-07 11:32:30 -05:00
saurabhAMD
3c0728e8eb
Cache flush
2024-05-07 11:09:32 -05:00
Wenkai Du
16dfeaf89b
Fix incorrect device ordinal with limited device visibility ( #74 )
2024-05-02 11:14:57 -07:00
corey-derochie-amd
605fa143ad
Merge pull request #75 from corey-derochie-amd/develop
...
Wrapped the warmup iters in captures when doing graph mode to do a proper warmup.
2024-05-02 09:18:52 -06:00