When the `roc-obj-ls` executable fails, it sometimes does not return. Because the `execute_process` command waits until the executable finishes, the build can hang indefinitely, with no error message and no indication that anything is wrong. This commit fixes that by adding timeouts and improving error reporting.
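The general pattern of bounding an external tool with a timeout and reporting why it failed can be sketched as follows. This is an illustration in Python, not the CMake code from this commit; the helper name `run_with_timeout` and the message wording are hypothetical. CMake's `execute_process` does accept a `TIMEOUT` option, but whether this commit uses it is not stated here.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, message) instead of blocking forever.

    Hypothetical helper: on timeout the child is killed and a readable
    error is returned, so the caller can fail fast with a diagnostic.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, f"{cmd[0]} did not finish within {timeout_s} s"
    except FileNotFoundError:
        return False, f"{cmd[0]} not found"
    if result.returncode != 0:
        return False, (
            f"{cmd[0]} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return True, result.stdout


if __name__ == "__main__":
    # Stand-in for a tool that never returns: sleep far longer than the limit.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    ok, message = run_with_timeout(hung, 1)
    print(ok, message)
```

The caller sees a failure with a reason in every bad case (timeout, missing tool, non-zero exit), which is the behavior the paragraph above describes as missing before the fix.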
* [Azure CI] rccl nightly pipeline that runs on slurm
- Login node will be set up as a self-hosted agent on Azure Pipelines.
- Login node will run this job nightly.
- Login node will check out the latest develop source, run the build and tests through sbatch calls, and wait for the jobs to complete. When the jobs are complete, it prints the logs.
* First version of new replayer, with comments on future TODOs
* plus minor fixes for UT
* Updated the recorder format, especially the binary format, to meet the replayer's needs
* Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c.
* Decreasing NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8
* Updated 6.5 release to be 7.0
* Corrected the RCCL version for 6.4.1
* Moved items to the correct releases
* Added NCCL 2.25.1 compatibility item
* Fixed wording
* Added entry for `ManagedMem` and `ManagedMemGraph` test fix
* Make offload-compress the default
* Add guard for --offload-compress since it was introduced in ROCm 6.2
* Address some of Nilesh's feedback.
* Reorganize for code cleanliness
* Improve comment
* Compress gpu code at link and compile time
Fix profiler_v2 compatibility layer
* Removing trafficBytes in profiler_v3 breaks casting to ncclProfilerEventDescr_v2_t
in the compatibility layer for profiler_v2 interface. This patch fixes the issue
by making the conversion between the two descriptors explicit.
* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"
This reverts commit 824b81c034.
* [UT] Modify max stack size to 496
* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION
* addressing the ci failure
* Adding the device tag
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
* Update LL128 elems per thread
* Precompute ix[g] in LL128 prim
* Make Threadthreshold part of tuning models
* Ignore channel tuning when channels are env controlled
* Tune LL128 max limit for AG
* Tune LL128 max limit for RS
* Retune AR LL128 limits due to changes
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()
For ROCm versions older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime.
* check hip runtime ver via hipRuntimeGetVersion
* move the detection to ncclinit func
* correct rocm version integer
* update warning message
* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
* Fix when more than 64 channels are used for multi-collective group calls
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
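The HSA_NO_SCRATCH_RECLAIM detection described in this entry can be sketched as follows. This is a minimal illustration in Python, not RCCL's C++ implementation: the function name, the warning text, and the packed version encoding (major * 10000000 + minor * 100000 + patch) are assumptions. In RCCL the version would come from `hipRuntimeGetVersion`, as the entry notes.

```python
import os

# Assumption: the runtime version is packed as
# major * 10_000_000 + minor * 100_000 + patch, so 6.4.0 -> 60_400_000.
ROCM_6_4 = 6 * 10_000_000 + 4 * 100_000


def scratch_reclaim_warning(runtime_version, env=None):
    """Return a warning string if LL128 prerequisites look unmet, else None.

    Hypothetical helper: runtime_version stands in for the value reported
    by the HIP runtime; env defaults to the process environment.
    """
    env = os.environ if env is None else env
    if runtime_version >= ROCM_6_4:
        # Newer runtimes do not need the variable for LL128.
        return None
    if env.get("HSA_NO_SCRATCH_RECLAIM") == "1":
        # Variable is set as required; nothing to report.
        return None
    return (
        "HSA_NO_SCRATCH_RECLAIM=1 is not set; "
        "LL128 requires it on ROCm older than 6.4"
    )
```

Returning `None` in the satisfied cases mirrors the later bullet about avoiding an unnecessary info message: the caller logs only when there is something to warn about.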
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.
* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABUF or peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate flushEnabled