* Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores
* Removes nontemporality from some key stores for gfx950.
[ROCm/rccl commit: e69b11eba5]
* Updated CODEOWNERS to instead use RCCL-Reviewers team
* Apply suggestion from @nileshnegi
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
---------
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: f290e302d3]
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir on npkit_trace_generator.py
---------
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
[ROCm/rccl commit: aec4f0a659]
* Adding more path to the kernel load and an environment variable to force enable DMABUF
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
[ROCm/rccl commit: b58f234539]
- Earlier fix PR1788 is no longer necessary after ROCr fix and pre-ROCr fix workaround
- Inserts an s_waitcnt vmcnt(0), which fixes a data corruption issue in MSCCL
[ROCm/rccl commit: 154350baaf]
* Gate based on ROCM version, safe for ROCm 7.0.2 and beyond.
* Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950. Thanks Nilesh.
* Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.
[ROCm/rccl commit: c70f5b4621]
Adds --kernel-resource-use flag to install.sh to allow dumping per-GPU kernel resource use at compile time (e.g., VGPRs, LDS, SGPRs, scratch, etc.)
[ROCm/rccl commit: ff209e5b19]
* update AG direct and single node LL threshold
* update thresholds based on MI350 expeirmental results
* disable using LL for direct AG
* enable direct AG for lower GPU counts
* direct AG single node tuning
* fix in-place buffer allocation for AG unit test
* whitespace fix
* gate direct AG for gfx950 and gfx942
---------
Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>
[ROCm/rccl commit: d22a39e954]
In this commit it disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA)
Original PR https://github.com/ROCm/rccl/pull/1927
[ROCm/rccl commit: 00a42c80f3]
* Trigger CI run on pull request
* Enabling CI run on different PR types
---------
Co-authored-by: arravikum <arravikum@amd.com>
[ROCm/rccl commit: 1858a31c41]
* Revert disabling of context tracking for Radeon
Original commit df3b7e47
`Disable context tracking for the current version. (#1839)`
* Add env variable for disabling of context tracking for Radeon
`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking
* Update docs/how-to/rccl-usage-tips.rst
Fix grammar, thanks @amd-jnovotny
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING
* Revert changes in includes and rename util function
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
[ROCm/rccl commit: 07925ec027]
* Add tests on slurm cluster
* Integrate slurm.
* Add flags.
* Added dynamic selection of runners for tests and cleanup for slurm reservation
* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"
This reverts commit fdd5a6cc968c764d3d1039f0897fb11f11422928.
* Refactor so tests run on both architectures.
* continue on error
* fail fast false on matrix
* remove scancel
* skip all single node tests
* fix pattern matching for pytest
* switch to always skip github job
* Update to latest allocation.
* Clean up workflows and update docker image.
* Updated container image published from PR #1517
* Switch back to TheRock main branch sha.
---------
Co-authored-by: arravikum <arravikum@amd.com>
[ROCm/rccl commit: 01d16d4139]