* Use one side stream per process
* Handle multiple GPUs per process
* Reset stream when not found
* Address review comments
* Fix missing mutex initializer
[ROCm/rccl commit: 185e78a8f0]
* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank
* don't override direct allgather threshold if set to -1
* restore 2-node AR simple at earlier message sizes than higher multi-node AR
* extend range of LL for single-node RS on gfx950
* update algo/proto for multi-node allreduce on gfx942
* set single-node AR on gfx950 to Tree LL for KB message sizes
* decrease threshold for single node Tree for gfx950 AR
[ROCm/rccl commit: 0d09f86608]
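
A minimal sketch of how the upper size limits named in this commit for multi-node gfx950 could be expressed as a lookup. The enum, the function name, the reading of "MB"/"KB" as binary units, and the reading of "per rank" as a limit that scales with the rank count are all assumptions made for illustration; this is not RCCL's tuning code.

```cpp
#include <cstddef>

// Hypothetical collective identifiers (not RCCL's own enum).
enum class Coll { AllReduce, ReduceScatter, AllGather };

constexpr std::size_t kKiB = 1024;
constexpr std::size_t kMiB = 1024 * kKiB;

// Largest message size (in bytes) for which the protocol named in the
// commit message would still be chosen on multi-node gfx950:
//   AllReduce     -> LL128 up to 256MB
//   ReduceScatter -> LL128 up to 8MB per rank
//   AllGather     -> LL    up to 64KB per rank
// Assumption: "per rank" means the limit is multiplied by nRanks.
std::size_t protoUpperBoundBytes(Coll coll, std::size_t nRanks) {
  switch (coll) {
    case Coll::AllReduce:     return 256 * kMiB;
    case Coll::ReduceScatter: return 8 * kMiB * nRanks;
    case Coll::AllGather:     return 64 * kKiB * nRanks;
  }
  return 0;  // unreachable for valid enum values
}
```

With 16 ranks, for example, this sketch would keep reduce-scatter on LL128 for messages up to 128MB in total.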
* Fail the job if the compiler flag HIP_HOST_UNCACHED_MEMORY is not enabled on MI350X
Place the check after initTransportsRank, because the GPU arch info in comm->topo->nodes is populated only after that call.
* Update src/init.cc to use ERROR instead of WARN
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 05f914c997]
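
The check described in this commit might look roughly like the sketch below. It assumes that HIP_HOST_UNCACHED_MEMORY is visible to the code as a preprocessor macro; `InitResult`, `checkUncachedHostMemory`, and the `isMi350x` parameter are hypothetical names, and the real check in src/init.cc reads the architecture from the topology rather than taking a flag.

```cpp
#include <cstdio>

// Assumption: the compiler flag shows up as a preprocessor definition.
#ifdef HIP_HOST_UNCACHED_MEMORY
constexpr bool kUncachedHostMemoryEnabled = true;
#else
constexpr bool kUncachedHostMemoryEnabled = false;
#endif

// Hypothetical result type standing in for the library's error codes.
enum class InitResult { Success, Error };

// Intended to run only after the GPU architecture is known, which is why
// the commit places the real check after initTransportsRank.
InitResult checkUncachedHostMemory(bool isMi350x) {
  if (isMi350x && !kUncachedHostMemoryEnabled) {
    // Reported as an error (not a warning) so the job fails visibly.
    std::fprintf(stderr,
                 "ERROR: HIP_HOST_UNCACHED_MEMORY must be enabled on this GPU\n");
    return InitResult::Error;
  }
  return InitResult::Success;
}
```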
* Enhance logging in NCCL initialization
It's convenient to log the comm object and default channels together for debugging
* Add opCount to collDevWork and update increment logic
Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms)
* Clarify opCount increment logic in enqueue.cc
Updated comment to clarify incrementing opCount for intranode communications.
* Refactor NCCL_INIT logging format
Updated logging format for NCCL_INIT to improve clarity.
* Remove duplicate INFO logging in init.cc
[ROCm/rccl commit: b00ee4c83c]
* Commits to enable scp report copy
* Added Post report upload step
* Added extra arg for fetch artifacts
* Moved to a specific commit
* Add write permissions to s3
* Added comment for TheRock sha commit date
---------
Co-authored-by: arravikum <arravikum@amd.com>
[ROCm/rccl commit: 07f8f6d6c6]
* Added ERROR message class to handle fatal error messages.
The new ERROR message class prints its message at every debug level,
including NONE.
Changed some of the fatal error messages to use ERROR instead of WARN.
Added a new error handler function so that more meaningful error
messages can be printed in the future.
* Added CHANGELOG entry.
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Change to no longer reuse NONE as ERROR. ERROR is now a separate class.
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
[ROCm/rccl commit: 1ce83d5cc0]
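
The filtering rule described in this commit can be sketched as follows: a message of class ERROR is printed regardless of the configured debug level, while other classes remain gated by it. `DebugLevel`, `MsgClass`, and `shouldPrint` are hypothetical names used for illustration only and are not RCCL's logging API.

```cpp
// Hypothetical configured verbosity, lowest to highest.
enum class DebugLevel { None = 0, Warn = 1, Info = 2 };

// Hypothetical message classes. Error is its own class and is not
// an alias of the None debug level.
enum class MsgClass { Error, Warn, Info };

// Decide whether a message of the given class is printed under the
// configured debug level.
bool shouldPrint(MsgClass cls, DebugLevel configured) {
  switch (cls) {
    case MsgClass::Error: return true;  // always printed, even at None
    case MsgClass::Warn:  return configured >= DebugLevel::Warn;
    case MsgClass::Info:  return configured >= DebugLevel::Info;
  }
  return false;  // unreachable for valid enum values
}
```

Keeping ERROR separate from NONE means the "print nothing" setting and the "fatal message" class no longer share a value, which is the change the later bullet in this commit describes.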
* prevent batching when send/recv bytes don't match, restore bit reversal for channel to part mapping, prevent batching beyond 32 nodes
* correct computation for channel to part mapping
* update changelog
* disabling p2p-batching by default
[ROCm/rccl commit: 641c0eb51c]
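
As a rough illustration of the two ideas in this commit, the sketch below shows a guard that refuses to batch under the listed conditions and a helper that reverses the low bits of an index. All names are hypothetical, the guard treats batching as an explicit opt-in (reflecting "disabled by default"), and the bit reversal assumes a power-of-two number of parts; how RCCL actually maps channels to parts is not shown here.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int kMaxBatchNodes = 32;  // from the commit: no batching beyond 32 nodes

// Hypothetical guard: batch only when explicitly enabled, when send and
// receive sizes match, and when the job spans at most kMaxBatchNodes nodes.
bool canBatchP2p(bool enabled, std::size_t sendBytes, std::size_t recvBytes,
                 int nNodes) {
  return enabled && sendBytes == recvBytes && nNodes <= kMaxBatchNodes;
}

// Reverse the lowest `nbits` bits of `value` (e.g. 0b001 -> 0b100 for nbits = 3).
// Consecutive inputs land far apart, which spreads neighbours across the range.
std::uint32_t reverseLowBits(std::uint32_t value, unsigned nbits) {
  std::uint32_t out = 0;
  for (unsigned i = 0; i < nbits; ++i) {
    out = (out << 1) | ((value >> i) & 1u);
  }
  return out;
}
```

With 8 parts (3 bits), inputs 0 through 7 map to 0, 4, 2, 6, 1, 5, 3, 7.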
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Add a constant to the barrier_generic logic to keep it from using the fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
[ROCm/rccl commit: 12f51ba8bf]