* update AG direct and single node LL threshold
* update thresholds based on MI350 experimental results
* disable using LL for direct AG
* enable direct AG for lower GPU counts
* direct AG single node tuning
* fix in-place buffer allocation for AG unit test
* whitespace fix
* gate direct AG for gfx950 and gfx942
---------
Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>
In this commit, context tracking is disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` on both CDNA and RDNA.
Original PR: https://github.com/ROCm/rccl/pull/1927
* Revert disabling of context tracking for Radeon
Original commit 6fc228e2
`Disable context tracking for the current version. (#1839)`
* Add env variable for disabling context tracking on Radeon
`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force-disable context tracking
* Update docs/how-to/rccl-usage-tips.rst
Fix grammar, thanks @amd-jnovotny
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING
* Revert changes in includes and rename util function
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
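A minimal sketch of how the opt-in switch described above might be set for a launched process. The variable name `RCCL_ENABLE_CONTEXT_TRACKING` comes from this commit message; the helper `context_tracking_env` is hypothetical and is not part of RCCL.

```python
import os


def context_tracking_env(enable, base_env=None):
    """Return a copy of the environment with context tracking toggled.

    Hypothetical helper for illustration only; RCCL itself just reads the
    environment variable at runtime.
    """
    env = dict(os.environ if base_env is None else base_env)
    if enable:
        # Opt in, since tracking is disabled by default in this commit.
        env["RCCL_ENABLE_CONTEXT_TRACKING"] = "1"
    else:
        # Leaving the variable unset keeps the default (disabled) behavior.
        env.pop("RCCL_ENABLE_CONTEXT_TRACKING", None)
    return env


env = context_tracking_env(True, base_env={"PATH": "/usr/bin"})
```

The resulting `env` could then be passed to `subprocess.run(..., env=env)` when starting the application that uses RCCL.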
* Add tests on slurm cluster
* Integrate slurm.
* Add flags.
* Added dynamic selection of runners for tests and cleanup for slurm reservation
* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"
This reverts commit d5350ff6e4f563ddd56ad81e4bc2a393ed55ba00.
* Refactor so tests run on both architectures.
* continue on error
* fail fast false on matrix
* remove scancel
* skip all single node tests
* fix pattern matching for pytest
* switch to always skip github job
* Update to latest allocation.
* Clean up workflows and update docker image.
* Updated container image published from PR #1517
* Switch back to TheRock main branch sha.
---------
Co-authored-by: arravikum <arravikum@amd.com>
* Batch P2P operations (2 per CU/channel) and update channel-part mapping
- Revert bit reversal and fix channel mapping to be compatible with P2P batching and avoid hangs
- P2P batching is used only for more than 2 nodes, to avoid aggregating intra-node traffic when it is dominant for fewer than 2 nodes
* Address single node regression and channel per net peer
* Add batching threshold
* Add enable switch for batching
* Update CHANGELOG.md
* Add minor comment change
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
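The batching rule described in this change can be illustrated with a rough sketch. It assumes only what the message states: two P2P operations per channel, an enable switch, and use only for more than 2 nodes. The function names and the `enabled` parameter are hypothetical, and the separate batching threshold is not modeled because its meaning is not stated here.

```python
def use_p2p_batching(n_nodes, enabled=True):
    """Decide whether to batch P2P operations (hypothetical helper).

    Batching applies only for more than 2 nodes, so that intra-node
    traffic is not aggregated on small jobs.
    """
    return enabled and n_nodes > 2


def batch_ops(ops, batch_size=2):
    """Group operations into batches of `batch_size` (2 per channel here)."""
    return [ops[i:i + batch_size] for i in range(0, len(ops), batch_size)]


ops = ["send0", "recv0", "send1"]
batches = batch_ops(ops) if use_p2p_batching(n_nodes=4) else [[op] for op in ops]
```

With 4 nodes the three operations are grouped into one batch of two and one batch of one; with 2 nodes each operation would stay in its own batch.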
The minimum required CMake version in test/CMakeLists.txt is bumped from 2.8
to 3.16. This aligns with the version used in CMakeLists.txt and will
enable building with CMake 4.
* [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0
* [rocm_regression] Check firmware version
* [rocm_regression] Resolve review comments
* [rocm_regression] Move hsa env checking into init once func
* [rocm_regression] Prevent hot fix version in firmware
* [rocm_regression] Improve unit tests