15 Commits

Author SHA1 Message Date
Aravind Ravikumar f336ad5133 Enable Robust Multi-Node RCCL Testing with cvs-sbatch and Improve CI Reliability (#2123)
* sbatch changes and TheRock SHA update

* Move tests location from /home to /apps/cvs_tests

* Add comments and move credential.ini file to /apps/cvs_tests

* Changed salloc reservation to rccl reservation

---------

Co-authored-by: Aravind Ravikumar <arravikum@amd.com>

[ROCm/rccl commit: 239d62f545]
2026-01-16 23:13:06 -05:00
Geo Min dfdb64572c [TheRock CI] Adding working single node tests (#2142)
* Adding working single node tests

* Revert to old docker sha

* adding back no perf tests

---------

Co-authored-by: Aravind Ravikumar <arravikum@amd.com>

[ROCm/rccl commit: 4b295c9893]
2026-01-13 08:35:58 -08:00
Geo Min c199df6b96 Revert "Adding org var and dynamic runner selection (#2106)" (#2114)
This reverts commit 4f7698c27e.

[ROCm/rccl commit: 4f474a7389]
2025-12-19 12:53:09 -08:00
Geo Min 4f7698c27e Adding org var and dynamic runner selection (#2106)
[ROCm/rccl commit: 2e193aed68]
2025-12-16 10:41:57 -08:00
Geo Min 1b4eef8f86 Correct runner name (#2098)
[ROCm/rccl commit: 5384a8abb2]
2025-12-10 11:44:48 -08:00
Geo Min 2e0abab81a [ci] Bumping TheRock CI commit hash (#2097)
* Bumping TheRock CI commit hash

* fixing artifact group

[ROCm/rccl commit: 6af9087b0c]
2025-12-09 16:25:57 -08:00
Aravind Ravikumar 4babb01f4d Add S3 upload support for Perf and test reports by run ID and architecture (#2020)
* Commits to enable scp report copy

* Added Post report upload step

* Added extra arg for fetch artifacts

* Moved to a specific commit

* Add write permissions to s3

* Added comment for TheRock sha commit date

---------

Co-authored-by: arravikum <arravikum@amd.com>

[ROCm/rccl commit: 07f8f6d6c6]
2025-11-03 19:09:34 -05:00
Aravind Ravikumar a7a1647926 Adding reservation time for salloc in CI (#1992)
Co-authored-by: arravikum <arravikum@amd.com>

[ROCm/rccl commit: 506c2e9878]
2025-10-22 10:00:01 -04:00
JC 08d93e763e [CI] Enable ccache w/ namespace for external use (#1966)
* Enable ccache w/ namespace for external use

* Remove TheRock from setup_tools.py command line

* Bump TheRock commit to use health_status.py

Resolves https://github.com/ROCm/rccl/pull/1966/files/f9d6d76440b88ecf67d08765ee0e9bac00b55b40#r2420734710

* Bump TheRock to older commit with health_status.py

* Add git safe directory for working directory

* Move install python deps

* Remove pip freeze

[ROCm/rccl commit: b1589a5786]
2025-10-20 08:44:42 -07:00
Geo Min 3ead4ca4a1 fixing group id (#1975)
[ROCm/rccl commit: 97f2665da2]
2025-10-10 16:40:44 -07:00
Aravind Ravikumar 45abdcfe62 Enable Presubmit CI Gating for develop Branch (TheRock CI for RCCL) (#1954)
* Trigger CI run on pull request

* Enabling CI run on different PR types

---------

Co-authored-by: arravikum <arravikum@amd.com>

[ROCm/rccl commit: 1858a31c41]
2025-10-07 09:11:50 -04:00
Sai Enduri 15628819e2 Enable multi node rccl tests on MI350x slurm cluster. (#1900)
* Add tests on slurm cluster

* Integrate slurm.

* Add flags.

* Added dynamic selection of runners for tests and cleanup for slurm reservation

* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"

This reverts commit fdd5a6cc968c764d3d1039f0897fb11f11422928.

* Refactor so tests run on both architectures.

* continue on error

* fail fast false on matrix

* remove scancel

* skip all single node tests

* fix pattern matching for pytest

* switch to always skip github job

* Update to latest allocation.

* Clean up workflows and update docker image.

* Updated container image published from PR #1517

* Switch back to TheRock main branch sha.

---------

Co-authored-by: arravikum <arravikum@amd.com>

[ROCm/rccl commit: 01d16d4139]
2025-09-23 22:00:26 -07:00
Geo Min 6db483845d [TheRock CI] Adding single node tests for RCCL (#1876)
* Add single-node testing

* Adding single node test

* Adding quotes

* fix typo

* Adding test flag

* No MPI

* Adding openmpi install

* Adding comment

* PR comments

* Missing proj

* Adding half

* Adding rocr runtime

* Adding them all

* new sha

* Fixing script

* Removing confusing skip test case

* Adding docs

* Update .github/workflows/therock-test-packages-single-node.yml

Co-authored-by: Marius Brehler <marius.brehler@amd.com>

---------

Co-authored-by: Marius Brehler <marius.brehler@amd.com>

[ROCm/rccl commit: f404624d9e]
2025-08-27 08:13:10 -07:00
Marius Brehler 5277457f21 Bump TheRock version used for testing (#1885)
[ROCm/rccl commit: 221205ebd4]
2025-08-27 16:22:27 +02:00
Geo Min bec7d58b04 [TheRock CI] Adding TheRock RCCL tests (#1873)
* First commit for rccl multi node test workflow

* Adding workflow dispatch

* Added branch based pull trigger

* Changed typo in branch name

* Add input variables to push

* Removed input variables to push

* Added self hosted runner for Vultr cloud

* Skipping build and only running test

* Changed test runner label name

* Made changes to executable paths in test script

* Made changes to run

* Made changes to cd into cvs dir

* This is a dummy commit

* Added cmake options

* Modified build options

* Committing build changes

* Adding rccl and rccl-tests

* Re-ordering rccl and rccl-tests

* adding --global command

* modified cmake command

* modified script paths

* Testing OIDC for rccl repo

* Testing OIDC for rccl repo

* Testing build and upload workflow

* use default env variable for AMDGPU families on push workflow trigger

* Adding cleanup and correct role

* Adding additional yml files

* Fixing typo

* Adding new sha

* Adding correct gpu target

* Adding back venv bin activate

* Adding workflow dispatch for tests

* Testing

* Adding cat

* Adding cat

* Adding rocm dir change

* Adding checkout

* cat with sudo

* rccl checkout

* correcting branch

* removing sudo

* trying to adjust correct path

* Adding output dir path

* Use docker container with pre-installed MPI

* Adding back build steps

* Fixing SHA

* Adding exclusion logic:

* Adding test

* Adding CI check

* Removing testing

* Limit to build only rccl, rccl-tests and required dependencies

* Adding test

* Removing test

* Removing quote

* Reverting test

* PR comments

---------

Co-authored-by: arravikum <arravikum@amd.com>
Co-authored-by: Marius Brehler <marius.brehler@amd.com>

[ROCm/rccl commit: f9a957bbab]
2025-08-20 15:07:23 -07:00