* Adding working single node tests
* Revert to old docker sha
* adding back no perf tests
---------
Co-authored-by: Aravind Ravikumar <arravikum@amd.com>
* Commits to enable scp report copy
* Added Post report upload step
* Added extra arg for fetch artifacts
* Moved to a specific commit
* Add write permissions to s3
* Added comment for TheRock sha commit date
---------
Co-authored-by: arravikum <arravikum@amd.com>
* Updated CODEOWNERS to instead use RCCL-Reviewers team
* Apply suggestion from @nileshnegi
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
---------
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
* Enable ccache w/ namespace for external use
* Remove TheRock from setup_tools.py command line
* Bump TheRock commit to use health_status.py
Resolves https://github.com/ROCm/rccl/pull/1966/files/c6d2e8ce5c14a2c94bfb47e21d3e2d466f25c9b4#r2420734710
* Bump TheRock to older commit with health_status.py
* Add git safe directory for working directory
* Move install python deps
* Remove pip freeze
* Add tests on slurm cluster
* Integrate slurm.
* Add flags.
* Added dynamic selection of runners for tests and cleanup for slurm reservation
* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"
This reverts commit d5350ff6e4f563ddd56ad81e4bc2a393ed55ba00.
* Refactor so tests run on both architectures.
* continue on error
* fail fast false on matrix
* remove scancel
* skip all single node tests
* fix pattern matching for pytest
* switch to always skip github job
* Update to latest allocation.
* Clean up workflows and update docker image.
* Updated container image published from PR #1517
* Switch back to TheRock main branch sha.
---------
Co-authored-by: arravikum <arravikum@amd.com>
* First commit for rccl multi node test workflow
* Adding workflow dispatch
* Added branch based pull trigger
* Fixed typo in branch name
* Add input variables to push
* Removed input variables to push
* Added self hosted runner for Vultr cloud
* Skipping build and only running test
* Changed test runner label name
* Made changes to executable paths in test script
* Made changes to run
* Made changes to cd into cvs dir
* This is a dummy commit
* Added cmake options
* Modified build options
* Committing build changes
* Adding rccl and rccl-tests
* Re-ordering rccl and rccl-tests
* adding --global command
* modified cmake command
* modified script paths
* Testing OIDC for rccl repo
* Testing OIDC for rccl repo
* Testing build and upload workflow
* use default env variable for AMDGPU families on push workflow trigger
* Adding cleanup and correct role
* Adding additional yml files
* Fixing typo
* Adding new sha
* Adding correct gpu target
* Adding back venv bin activate
* Adding workflow dispatch for tests
* Testing
* Adding cat
* Adding cat
* Adding rocm dir change
* Adding checkout
* cat with sudo
* rccl checkout
* correcting branch
* removing sudo
* trying to adjust correct path
* Adding output dir path
* Use docker container with pre-installed MPI
* Adding back build steps
* Fixing SHA
* Adding exclusion logic
* Adding test
* Adding CI check
* Removing testing
* Limit to build only rccl, rccl-tests and required dependencies
* Adding test
* Removing test
* Removing quote
* Reverting test
* PR comments
---------
Co-authored-by: arravikum <arravikum@amd.com>
Co-authored-by: Marius Brehler <marius.brehler@amd.com>
* update documentation
add version number to documentation
rename .sphinx/.doxygen to sphinx/doxygen
enable htmlzip, pdf, epub formats when publishing on Read the Docs
* add noCI label for dependabot PRs
since RTD CI is separate from math lib CI
* update rocm-docs-core to v0.13.4
* update README with link to rocm.docs.amd.com