* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
* Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores
* Removes nontemporality from some key stores for gfx950.
* Updated CODEOWNERS to instead use RCCL-Reviewers team
* Apply suggestion from @nileshnegi
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
---------
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir on npkit_trace_generator.py
---------
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
* Adding more path to the kernel load and an environment variable to force enable DMABUF
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
* Enable ccache w/ namespace for external use
* Remove TheRock from setup_tools.py command line
* Bump TheRock commit to use health_status.py
Resolves https://github.com/ROCm/rccl/pull/1966/files/c6d2e8ce5c14a2c94bfb47e21d3e2d466f25c9b4#r2420734710
* Bump TheRock to older commit with health_status.py
* Add git safe directory for working directory
* Move install python deps
* Remove pip freeze
- Earlier fix PR1788 is no longer necessary after ROCr fix and pre-ROCr fix workaround
- Inserts an s_waitcnt vmcnt(0), which fixes a data corruption issue in MSCCL
* Gate based on ROCM version, safe for ROCm 7.0.2 and beyond.
* Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950. Thanks Nilesh.
* Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.
* update AG direct and single node LL threshold
* update thresholds based on MI350 expeirmental results
* disable using LL for direct AG
* enable direct AG for lower GPU counts
* direct AG single node tuning
* fix in-place buffer allocation for AG unit test
* whitespace fix
* gate direct AG for gfx950 and gfx942
---------
Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>
In this commit it disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA)
Original PR https://github.com/ROCm/rccl/pull/1927
* Revert disabling of context tracking for Radeon
Original commit 6fc228e2
`Disable context tracking for the current version. (#1839)`
* Add env variable for disabling of context tracking for Radeon
`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking
* Update docs/how-to/rccl-usage-tips.rst
Fix grammar, thanks @amd-jnovotny
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING
* Revert changes in includes and rename util function
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>