Sai Enduri 15628819e2 Enable multi node rccl tests on MI350x slurm cluster. (#1900)
* Add tests on slurm cluster

* Integrate slurm.

* Add flags.

* Added dynamic selection of runners for tests and cleanup for slurm reservation

* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"

This reverts commit fdd5a6cc968c764d3d1039f0897fb11f11422928.

* Refactor so tests run on both architectures.

* continue on error

* fail fast false on matrix

* remove scancel

* skip all single node tests

* fix pattern matching for pytest

* switch to always skip github job

* Update to latest allocation.

* Clean up workflows and update docker image.

* Updated container image published from PR #1517

* Switch back to TheRock main branch sha.

---------

Co-authored-by: arravikum <arravikum@amd.com>

[ROCm/rccl commit: 01d16d4139]
2025-09-23 22:00:26 -07:00
282 MiB
Languages
C++ 67.5%
C 20.6%
Python 6.6%
CMake 3.4%
Shell 0.6%
Other 1.1%