* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"
This reverts commit 824b81c034.
* [UT] Modify max stack size to 496
* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION
* addressing the CI failure
* Adding the device tag
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
* Update LL128 elems per thread
* Precompute ix[g] in LL128 prim
* Make Threadthreshold part of tuning models
* Ignore channel tuning when channels are env controlled
* Tune LL128 max limit for AG
* Tune LL128 max limit for RS
* Retune AR LL128 limits due to changes
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()
For ROCm versions older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect at runtime whether it is set.
* check hip runtime ver via hipRuntimeGetVersion
* move the detection to ncclinit func
* correct rocm version integer
* update warning message
* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
* Fix when more than 64 channels are used for multi-collective group calls
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
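The HSA_NO_SCRATCH_RECLAIM detection described above can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not the RCCL code: the names `ll128NeedsScratchReclaimWarning`, `checkFromEnvironment`, and `kRocm64Version` are hypothetical, and the version encoding (major * 10000000 + minor * 100000 + patch, as used by `HIP_VERSION`) is assumed to match what `hipRuntimeGetVersion` reports on AMD platforms. The runtime version is passed in as a parameter so the sketch compiles without HIP.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical constant: version 6.4.0 in the HIP_VERSION encoding
// (major * 10000000 + minor * 100000 + patch). Assumed, not taken from RCCL.
constexpr int kRocm64Version = 6 * 10000000 + 4 * 100000;

// Returns true when a warning should be logged: the runtime is older than
// 6.4 and HSA_NO_SCRATCH_RECLAIM is not set to "1".
// `envValue` may be null, meaning the variable is unset.
bool ll128NeedsScratchReclaimWarning(int runtimeVersion, const char* envValue) {
  if (runtimeVersion >= kRocm64Version) return false;  // no workaround needed
  return envValue == nullptr || std::strcmp(envValue, "1") != 0;
}

// Convenience wrapper that reads the variable from the process environment.
bool checkFromEnvironment(int runtimeVersion) {
  return ll128NeedsScratchReclaimWarning(
      runtimeVersion, std::getenv("HSA_NO_SCRATCH_RECLAIM"));
}
```

In a real build the version argument would come from `hipRuntimeGetVersion`, and the check would run once during initialization, after the environment has been read.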
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.
* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABUF or peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate flushEnabled
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```
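For reference, `RTLD_NOW`, `RTLD_LOCAL`, `dlerror`, `dlsym`, and `dlclose` are all declared in the POSIX header `<dlfcn.h>`, so that is presumably the header in question. The sketch below is a generic example of these calls, not code from RCCL; the helper name `lookupSymbolInProcess` is hypothetical.

```cpp
#include <dlfcn.h>  // declares dlopen, dlsym, dlerror, dlclose, RTLD_NOW, RTLD_LOCAL

// Hypothetical helper: looks up `name` among the symbols visible to the
// running program. Returns nullptr if the lookup fails.
void* lookupSymbolInProcess(const char* name) {
  // A null path asks dlopen for a handle to the main program itself.
  void* handle = dlopen(nullptr, RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) return nullptr;

  dlerror();  // clear any stale error state before the lookup
  void* sym = dlsym(handle, name);
  if (dlerror() != nullptr) sym = nullptr;  // lookup failed

  dlclose(handle);
  return sym;
}
```

On older glibc versions this may need to be linked with `-ldl`.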
* Enabling LL128 by default on MI300
* Add missing CUDACHECK
* Adjust BW correction factors to fix the Tree->Ring switching point
* Refactor and add ll128 AR logarithmic factor to tuning models
* Move RCCL tuning changes to a separate file
* Use enum for tunable indexing
* Use explicit indexing in tuning models to avoid mismatch issues
* Place rcclGetSizePerRank in a function
* Remove HIP ifdef for rccl-only call
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
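The enum-based, explicit indexing mentioned above can be illustrated with a small sketch. All names (`Tunable`, `TuningModel`, `makeExampleModel`) and all numeric values here are hypothetical and chosen for illustration only; the actual RCCL tuning tables may be organized differently.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tunable names; the real RCCL enum may differ.
enum Tunable : std::size_t {
  kLl128MaxBytes = 0,
  kThreadThreshold,
  kBwCorrection,
  kNumTunables  // must stay last: used as the array size
};

using TuningModel = std::array<double, kNumTunables>;

// Each entry is assigned by name, so reordering or extending the enum
// cannot silently shift a value into the wrong slot.
TuningModel makeExampleModel() {
  TuningModel m{};                // zero-initialized
  m[kLl128MaxBytes] = 4194304.0;  // illustrative value only
  m[kThreadThreshold] = 64.0;     // illustrative value only
  m[kBwCorrection] = 0.85;        // illustrative value only
  return m;
}
```

Compared with a positional initializer list, this costs a few more lines per model but makes a mismatch between the enum order and the table order impossible to introduce by accident.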
* [SRC] Enable unroll=1 for gfx950
* Fix typo from rebase in generate.py
* Support for unroll=1 and gfx90a when building for all GPU targets
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Add fault injection of starting warps with random variations
This is done by inserting random delays after __syncthreads().
The feature can be turned off by setting FAULT_INJECTION=OFF in CMake.
* Remove manually introduced bug for demo purpose
* Use only one thread per warp for checking wall clock
* removed gfx940 and gfx941
* Update "gfx94" to "gfx942" in init.cc
* Updated remaining "gfx94" updates to "gfx942"
* Update filenames and variables from gfx940 to gfx942
---------
Co-authored-by: akolliasAMD <akollias@amd.com>
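One reason to match "gfx942" rather than "gfx94" is that a "gfx94" prefix test also accepts the removed gfx940 and gfx941 targets. The sketch below shows a stricter check; the helper name `isGfx942` is hypothetical, and the assumption that architecture names may carry feature flags after a colon (for example "gfx942:sramecc+:xnack-") is not taken from these commits.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: true only for gfx942, optionally followed by
// feature flags after ':' (assumed format, e.g. "gfx942:sramecc+:xnack-").
bool isGfx942(const char* archName) {
  static const char kTarget[] = "gfx942";
  const std::size_t n = sizeof(kTarget) - 1;  // length without the terminator
  if (archName == nullptr || std::strncmp(archName, kTarget, n) != 0) {
    return false;
  }
  // Reject longer names that merely start with "gfx942".
  return archName[n] == '\0' || archName[n] == ':';
}
```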
* misc/msccl: read graphCaptureStatus for every collective call
* fix a bug in checking whether UBR is enabled in MSCCLPP
* cmake: Fix patch reversal order
* misc/msccl: add logging