* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank
* don't override direct allgather threshold if set to -1
* restore 2-node AR simple at earlier message sizes than higher multi-node AR
* extend range of LL for single-node RS on gfx950
* update algo/proto for multi-node allreduce on gfx942
* set single-node AR on gfx950 to Tree LL for KB message sizes
* decrease threshold for single node Tree for gfx950 AR
* Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not enabled on mi350x
Place the check after initTransportsRank, because the GPU arch info in comm->topo->nodes is populated only after that call.
* Update src/init.cc to use ERROR instead of WARN
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
* Enhance logging in NCCL initialization
It's convenient to log the comm object and default channels together for debugging
* Add opCount to collDevWork and update increment logic
Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms)
* Clarify opCount increment logic in enqueue.cc
Updated comment to clarify incrementing opCount for intranode communications.
* Refactor NCCL_INIT logging format
Updated logging format for NCCL_INIT to improve clarity.
* Remove duplicate INFO logging in init.cc
* Commits to enable scp report copy
* Added Post report upload step
* Added extra arg for fetch artifacts
* Moved to a specific commit
* Add write permissions to s3
* Added comment for TheRock sha commit date
---------
Co-authored-by: arravikum <arravikum@amd.com>
* Added ERROR message class to handle fatal error messages.
The new ERROR message class prints the message at every debug level,
including none.
Changed some of the fatal error messages to use ERROR instead of WARN.
Added a new error handler function so that more meaningful error
messages can be printed in the future.
* Added CHANGELOG entry.
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Change to no longer reuse NONE as ERROR. ERROR is now a separate class.
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
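
The ERROR class behavior described above (printed at every debug level, including none, and no longer aliased to NONE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual RCCL code: `DebugLevel`, `MsgClass`, `shouldPrint`, and `logMessage` are hypothetical names chosen for the example.

```cpp
#include <cstdio>
#include <string>

// Hypothetical names for illustration only; not the RCCL/NCCL identifiers.
enum class DebugLevel { None = 0, Warn = 1, Info = 2, Trace = 3 };
enum class MsgClass { Error, Warn, Info, Trace };

// Returns true if a message of class `cls` should be printed when the
// configured debug level is `level`. ERROR is a separate class that
// bypasses the level filter, so it prints even when the level is None.
bool shouldPrint(DebugLevel level, MsgClass cls) {
  switch (cls) {
    case MsgClass::Error: return true;
    case MsgClass::Warn:  return level >= DebugLevel::Warn;
    case MsgClass::Info:  return level >= DebugLevel::Info;
    case MsgClass::Trace: return level >= DebugLevel::Trace;
  }
  return false;
}

// Prints `text` to stderr if the filter allows it; returns whether it printed.
bool logMessage(DebugLevel level, MsgClass cls, const std::string& text) {
  if (!shouldPrint(level, cls)) return false;
  std::fprintf(stderr, "%s\n", text.c_str());
  return true;
}
```

Keeping ERROR out of the level ordering is the point of the sketch: a fatal message is never silenced by a quiet debug setting, while WARN and below still follow the configured level.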
* prevent batching when send/recv bytes don't match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes
* correct computation for channel to part mapping
* update changelog
* disabling p2p-batching by default
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Add a constant that keeps the barrier_generic logic from using the fallback path when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.x
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
* Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores
* Removes nontemporality from some key stores for gfx950.
* Updated CODEOWNERS to instead use RCCL-Reviewers team
* Apply suggestion from @nileshnegi
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
---------
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
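
For readers unfamiliar with the bit-reversal channel-to-part mapping mentioned in the p2p-batching entries above, the general technique can be sketched as below. This is an illustrative sketch only, not the RCCL implementation: `reverseBits` and `channelToPart` are hypothetical helpers, and the example assumes the number of parts is a power of two (`1 << log2Parts`, with `log2Parts` below 32).

```cpp
#include <cstdint>

// Hypothetical helper: reverse the low `nbits` bits of `x`.
uint32_t reverseBits(uint32_t x, int nbits) {
  uint32_t r = 0;
  for (int i = 0; i < nbits; ++i) {
    r = (r << 1) | (x & 1u);  // move the current lowest bit of x into r
    x >>= 1;
  }
  return r;
}

// Hypothetical mapping: assumes the part count is 1 << log2Parts.
// Consecutive channel indices land on parts that are far apart,
// which spreads adjacent channels across the available parts.
uint32_t channelToPart(uint32_t channel, int log2Parts) {
  uint32_t mask = (1u << log2Parts) - 1u;
  return reverseBits(channel & mask, log2Parts);
}
```

With four parts (`log2Parts == 2`), channels 0, 1, 2, 3 map to parts 0, 2, 1, 3 under this sketch.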
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir in npkit_trace_generator.py
---------
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>