Commit Graph

302 Commits

Author SHA1 Message Date
Tim ba44f170ad Update RCCL Replayer README.md (#1870)
* Update Replayer README.md
2025-09-19 17:57:48 -04:00
isaki001 9c36439354 add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889) 2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar 6e45eaf75e Use add_unroll.sh in topo_expl makefile (#1886) 2025-09-03 09:43:16 -04:00
Mustafa Abduljabbar 7ccc6f268f Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md
2025-09-03 08:54:13 -04:00
BertanDogancay 08a7be231b Merge remote-tracking branch 'nccl/master' into develop 2025-08-28 15:46:28 -05:00
Mustafa Abduljabbar dfad51e3c9 Support gfx950 in topo_expl and resolve dependency on FMT (#1829)
* Support gfx950 in topo_expl

* Fix dependencies and fetch fmt from sources

* Remove third_party folder in make clean

* Add empty target when fmt is found

* Add MI350 example

* Update README.md

---------

Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
2025-08-26 10:11:38 -04:00
Atul Kulkarni 81ec6bff4c Added new unit tests for src/transport/p2p.cc (#1774) 2025-07-25 12:57:57 -05:00
Wenkai Du 9a4213356d Support fused all reduce and elementwise operations (#1729)
* Support fused all reduce and elementwise operations

Add additional "acc" parameter to RCCL Replayer logs

Add flag which indicates availability of new API

* Fix Recorder json parsing

* Remove unreachable code

* Remove extra acc pointer check

* .

* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"

This reverts commit 9d72be7b2f.

* Use noinline to reduce kernels linking time

* Don't use noinline for gfx942 and gfx950 to avoid perf regression

---------

Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2025-07-23 09:04:17 -07:00
Atul Kulkarni 275fdd43c1 Code coverage improvements (#1665)
* Increased max stack size to 640

* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
2025-07-17 11:20:49 -05:00
Nikhil-Nunna 7abc7538ea topo_explorer initial readme (#1797)
* topo_explorer initial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-07-11 11:28:20 -05:00
Nilesh M Negi f839e4edef [TOOLS] Update p2p-latency-test for gfx950 (#1730) 2025-07-10 12:13:29 -05:00
Arm Patinyasakdikul 35024ca1cb [topo-expl] update header file location. (#1769) 2025-06-27 15:29:37 -05:00
gilbertlee-amd 16101e654f Fixing HelloRccl include path to RCCL, fixing some warnings (#1778) 2025-06-27 09:12:59 -06:00
Mustafa Abduljabbar fb4ad82d0d Fix topo_explorer compatibility and capture WarpSize (#1743) 2025-06-16 08:18:35 -04:00
Tim ba97c9c18b replayer update v0 (#1733)
* First version of new replayer, with comments on future TODOs

* plus minor fixes for UT

* Updated format of recorder, especially in binary department, according to replayer's need
2025-06-13 15:05:34 -04:00
Arm Patinyasakdikul 6c37ae9470 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.
2025-06-12 09:58:01 -05:00
Mustafa Abduljabbar d665547eef Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support
2025-05-13 17:07:03 -05:00
Mustafa Abduljabbar fdad89690b Add missing MACRO to topo_expl (#1677)
* Fix header compatibility
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar f3f3336468 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank
2025-05-05 15:26:29 -04:00
Bertan Dogancay f8067a76dc Merge pull request #1645 from corey-derochie-amd/nccl-2.24
NCCL Sync 2.24.3-1
2025-04-24 10:08:58 -04:00
Mustafa Abduljabbar aa7991dfc8 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML
2025-04-24 09:02:03 -04:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 82afb2bcfe Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
2025-04-23 15:44:56 -04:00
Tim 45e1c3f3e2 reverting change to RcclReplayer (#1657) 2025-04-23 15:36:46 -04:00
Tim 9a55ff60a9 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT
2025-04-19 00:21:27 -04:00
AbandiGa 7a84c5dbb0 added copyright (#1635) 2025-04-14 09:46:18 -05:00
Pedram Alizadeh 5b36b68d06 single-node AR msccl algorithm tuning for MI300 (#1629) 2025-04-10 10:42:28 -04:00
Nikhil-Nunna 3dc0478722 Fetching RCCL version from shared object file. (#1569)
* added rccl version using rccl-tests

* Added function to get rccl version from rccl-tests

* removed whitespace

* Added rccl version

* Updated readme and fixed formatting

* removed debug prints
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar aace4e27f8 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility 

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm
2025-04-02 09:47:29 -04:00
gilbertlee-amd 626dc50ab5 Removing the experimental clique kernel files (#1610) 2025-03-20 18:10:01 -06:00
corey-derochie-amd 6505639cf4 removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00
gilbertlee-amd ddc5d58b93 Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES
2025-02-20 15:18:29 -07:00
Nilesh M Negi 159587be5c [TOOLS] Update rcclDiagnostics script (#1557)
* [TOOLS] Update rcclDiagnostics script

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Fix typo in valid_marketing_names list

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-02-20 16:11:05 -06:00
Nikhil-Nunna 4ba94d6662 Env conf debug (#1534)
* Initial Script ready for review

* Added RCCL-tests and RCCL versions

* Added output folder and README

* Base format built

* Added ROCm version

* Added function to center titles and Vram information

* Added HIP version

* Cleaned formatting

* UCX version and MPI version

* Added NUMA balancing

* Added rocminfo

* Removed notes

* Changed regex for broadcom Nic

* Removed note by the ACS info

* Added Hostname to summary and details

* Print summary to terminal

* Added argparse

* Added flags and readme

* Added GPU ID

* fixed spelling

* renamed script again

* Added file descriptor and locked mem checks

* Added file descriptor and locked mem checks

* Removed extra spaces from summary table

* printing output file location

* Removed sudo in code and ACS flag
2025-02-14 17:31:18 -06:00
gilbertlee-amd 6cb0599e38 Updating topology explorer (#1536) 2025-02-07 08:44:04 -07:00
Benjamin Kitor a05329bd0d Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P
2024-12-03 13:12:03 -08:00
gilbertlee-amd cb1027de97 Updating RCCL Replayer README (#1408) 2024-11-05 08:06:11 -07:00
Bertan Dogancay cfecce790f [Replayer] Add validation (#1387)
* Add validation to rccl_replayer
2024-10-22 10:41:08 -04:00
Wenkai Du 62d10fdc25 msccl: disable 1-shot xmls (#1375)
MSCCL 1-shot xmls may cause different output values on different ranks.
Disabling them for now to avoid undefined behavior in applications.
2024-10-14 15:10:53 -07:00
Wenkai Du a680e329e6 Temporarily disable MSCCL all gather XMLs due to UT failure (#1373) 2024-10-12 08:43:16 -07:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Wenkai Du 1a48e19b18 topo_expl: update sm fields in topo xml files (#1310) 2024-08-29 12:03:51 -07:00
Edgar Gabriel bba3559334 Merge pull request #1299 from edgargabriel/topic/remove-multirank-examples
Remove MultiRank examples
2024-08-23 08:32:16 -05:00
Wenkai Du 532b70afb6 Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
Edgar Gabriel 8953a26bcd Remove MultiRank examples
Remove the MultiRank examples. The feature was never released (because
it didn't work reliably), and it might just cause confusion if somebody
sees it. In addition, the location in tools was suboptimal.
2024-08-14 14:11:16 -07:00
Ziyue Yang 145a13235a Fix number of loops in p2p-latency-test (#1286) 2024-08-05 13:35:56 -07:00
Benjamin Kitor 4bc118336a topo_expl: Update channel masks for >64 channels (#1279) 2024-07-25 17:27:34 -07:00
saurabhAMD cf311b71ee Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing (#1265)
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing

* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
2024-07-22 10:21:29 -05:00