* Updated 6.5 release to be 7.0
* Corrected the RCCL version for 6.4.1
* Moved items to the correct releases
* Added NCCL 2.25.1 compatibility item
* Fixed wording
* Added entry for `ManagedMem` and `ManagedMemGraph` test fix
* Make offload-compress the default
* Add guard for --offload-compress since it was introduced in ROCm 6.2
* Address some of Nilesh's feedback.
* Reorganize for code cleanliness
* Improve comment
* Compress GPU code at link and compile time
* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"
This reverts commit 824b81c034.
* [UT] Modify max stack size to 496
* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION
* addressing the CI failure
* Adding the device tag
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
* Update LL128 elems per thread
* Precompute ix[g] in LL128 prim
* Make Threadthreshold part of tuning models
* Ignore channel tuning when channels are env controlled
* Tune LL128 max limit for AG
* Tune LL128 max limit for RS
* Retune AR LL128 limits due to changes
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()
For ROCm versions older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime.
* check hip runtime ver via hipRuntimeGetVersion
* move the detection to ncclinit func
* correct rocm version integer
* update warning message
* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
* Fix when more than 64 channels are used for multi-collective group calls
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
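The HSA_NO_SCRATCH_RECLAIM detection described above could look roughly like the following. This is a minimal sketch, not the RCCL implementation: the helper names and the threshold constant are hypothetical, and it assumes the HIP runtime version integer (as returned by `hipRuntimeGetVersion`) is packed as `major*10000000 + minor*100000 + patch` and tracks the ROCm version. The version is passed in as a plain integer so the logic can be exercised without a GPU.

```cpp
#include <cstdlib>
#include <string>

// Assumption: HIP packs its version as major*10000000 + minor*100000 + patch,
// so 6.4.0 becomes 60400000. The constant name is hypothetical.
constexpr int kHipVersion64 = 6 * 10000000 + 4 * 100000;

// Returns true when HSA_NO_SCRATCH_RECLAIM is set to "1" in the environment.
inline bool scratchReclaimDisabled() {
  const char* value = std::getenv("HSA_NO_SCRATCH_RECLAIM");
  return value != nullptr && std::string(value) == "1";
}

// Hypothetical helper: a warning is only useful on runtimes older than 6.4
// where the variable has not been set. In real code, hipRuntimeVersion would
// come from hipRuntimeGetVersion().
inline bool shouldWarnAboutLL128(int hipRuntimeVersion, bool reclaimDisabled) {
  return hipRuntimeVersion < kHipVersion64 && !reclaimDisabled;
}
```

A caller would evaluate `shouldWarnAboutLL128(version, scratchReclaimDisabled())` once during initialization and log a warning only when it returns true, which avoids an unnecessary message on newer runtimes.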
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.
* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABUF or peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate flushEnabled
* mscclpp patch: apply clip patch and raise allreduce8 blocks from 512 to 1024
* add compilation flag for enabling/disabling clipping in mscclpp
* change flag name for consistency, set flag to OFF
* add compilation flag in rccl for enabling clipping in mscclpp
* set 1024 threads for mscclpp allreduce8 only for bfloat16
* fix improper description for ENABLE_MSCCLPP_CLIP flag
* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"
This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.
* update clip remove-clip.patch for rebase
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
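One way macros such as `rccl_static` and `rccl_static_inline` can expose otherwise file-local functions is sketched below. The macro bodies and the `exampleClamp` function are assumptions made for illustration; the actual definitions in RCCL may differ.

```cpp
// Sketch only: assumed macro definitions, not copied from RCCL.
#ifdef RCCL_EXPOSE_STATIC
// Exposed build: drop internal linkage so other translation units can link
// against the function.
#define rccl_static
#define rccl_static_inline
#else
// Default build: keep the functions private to their translation unit.
#define rccl_static static
#define rccl_static_inline static inline
#endif

// Hypothetical internal helper that would normally be declared `static`.
rccl_static int exampleClamp(int value, int limit) {
  return value > limit ? limit : value;
}
```

With this pattern, existing call sites are unchanged and only the linkage of the marked functions depends on whether `RCCL_EXPOSE_STATIC` is defined at compile time.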
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```
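The identifiers in these errors are all declared by the POSIX header `<dlfcn.h>`. As a minimal illustration, independent of the RCCL sources, the hypothetical helper below uses each of them; on older glibc versions, linking may additionally require `-ldl`.

```cpp
#include <dlfcn.h>  // declares dlopen, dlsym, dlerror, dlclose, RTLD_NOW, RTLD_LOCAL

// Hypothetical helper: looks up `symbol` in the running program and the
// shared objects loaded with it. Returns nullptr if the lookup fails.
inline void* lookupSymbol(const char* symbol) {
  // A null filename yields a handle for the main program.
  void* handle = dlopen(nullptr, RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) return nullptr;
  dlerror();  // clear any stale error state before the lookup
  void* address = dlsym(handle, symbol);
  const char* error = dlerror();
  dlclose(handle);
  return error == nullptr ? address : nullptr;
}
```

Removing the `#include <dlfcn.h>` line from this sketch produces the same class of "undeclared identifier" errors.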