Add nccl_debug variable and values (#2756)

* nccl-debug variables table test

* spacing

* spacing

* RCCL variable edits from SME

* Update projects/rccl/docs/api-reference/env-variables.rst

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Matt Williams <Matt.Williams+amdeng@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Matt Williams <matt.williams@amd.com>
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
systems-assistant[bot]
2026-01-30 12:00:18 -05:00
committed by GitHub
parent 00e8a67165
commit 327778ef18
@@ -19,10 +19,10 @@ in the following table.
.. list-table::
:header-rows: 1
:widths: 70,30
:widths: 40,60
* - **Environment variable**
- **Value**
- **Values**
* - | ``NCCL_CONF_FILE``
| Specifies the path to the RCCL configuration file.
@@ -42,20 +42,62 @@ in the following table.
.. list-table::
:header-rows: 1
:widths: 70,30
:widths: 35,65
* - **Environment variable**
- **Value**
- **Values**
* - | ``RCCL_LOG_LEVEL``
| Controls RCCL logging verbosity.
- | Integer value (default: ``1``)
| Higher values increase logging detail
* - | ``NCCL_DEBUG``
| Controls debug logging in RCCL for troubleshooting and monitoring collective communication operations.
- | Sets the RCCL logging level. Each level also includes the output of all lower levels. The default level is ``ERROR``.
|
| ``NONE``: No logging is printed.
| ``ERROR``: Prints messages when a fatal condition has occurred in RCCL and the operation can't continue.
| ``VERSION``: Prints ``librccl`` version information during the initialization phase.
| ``WARN``: Prints warnings about unusual conditions that could lead to unexpected results.
| ``INFO``: Prints standard logging messages about status and operations performed.
| ``ABORT``: Unused.
| ``TRACE``: Prints trace-level logging of function calls and parameters. Only active when ``librccl`` is built using ``ENABLE_TRACE``.
* - | ``NCCL_DEBUG_SUBSYS``
| Controls which subsystems generate debug output.
- | Comma-separated list of subsystems (e.g., ``INIT,COLL``)
| Prefix with ``^`` to invert selection
- | Comma-separated list of logging subsystems. The selection can be inverted using the ``^`` prefix. The default subsystem set is ``INIT``, ``BOOTSTRAP``, and ``ENV``.
|
| ``INIT``: Prints during the initialization phase.
| ``COLL``: Prints during execution of collectives.
| ``P2P``: Prints logs related to peer-to-peer setup or communication.
| ``SHM``: Prints logs related to shared memory.
| ``NET``: Prints logs related to network setup or communication.
| ``GRAPH``: Prints logs related to parsing the topology of the network.
| ``TUNING``: Prints logs related to the tuner plugin.
| ``ENV``: Prints logs related to environment variables.
| ``ALLOC``: Prints logs related to memory allocation.
| ``CALL``: Prints logs for function calls (``TRACE`` only).
| ``PROXY``: Prints logs related to the proxy thread.
| ``NVLS``: Not valid for AMD/RCCL.
| ``BOOTSTRAP``: Prints logs related to the bootstrapping phase of initialization.
| ``REG``: Prints logs related to registration and deregistration of transport initialization.
| ``PROFILE``: Prints logs related to profiling and timing information.
| ``RAS``: Prints logs related to RAS.
| ``VERBS``: Prints logs related to IB/Verbs.
| ``ALL``: Activates all logging subsystems.
* - | ``NCCL_WARN_ENABLE_DEBUG_INFO``
| Converts all ``WARN`` level logs to ``INFO`` level logs.
- | ``0``: Default value. Variable is not enabled.
| ``1``: Enable the variable.
* - | ``NCCL_DEBUG_TIMESTAMP_LEVELS``
| The timestamp levels for ``NCCL_DEBUG``.
- | Comma-separated list of ``NCCL_DEBUG`` levels whose messages get a timestamp prepended. The list can be inverted using the ``^`` prefix. The default set is ``WARN``.
* - | ``NCCL_DEBUG_TIMESTAMP_FORMAT``
| The timestamp format for ``NCCL_DEBUG``.
- | Sets the timestamp format using ``strftime``-style conversion specifiers. The default format is ``"[%F %T] "``.
* - | ``NCCL_DEBUG_FILE``
| Writes logs to a file rather than ``stdout``.
- | The filename can contain ``%h`` for the hostname, ``%p`` for the process ID (PID), and ``%%`` for a literal ``%`` character. Using ``%p`` is recommended so that each process writes to its own file, which avoids mixed or overwritten output. Example usage: ``NCCL_DEBUG_FILE=debugfile.%h.%p``
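The logging variables in the table above can be combined when launching a job. The following is a minimal sketch, not RCCL's implementation: the helper names ``expand_debug_file`` and ``debug_env`` are hypothetical, and the handling of unknown ``%`` specifiers is an assumption. It only illustrates how a ``NCCL_DEBUG_FILE`` pattern is described to expand and how the variables might be set for a child process.

```python
import os
import socket


def expand_debug_file(pattern: str, hostname: str, pid: int) -> str:
    """Expand %h, %p and %% as the table describes (illustrative only)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "h":
                out.append(hostname)
            elif nxt == "p":
                out.append(str(pid))
            elif nxt == "%":
                out.append("%")
            else:
                # Assumption: unknown specifiers are kept verbatim in this sketch.
                out.append(ch + nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def debug_env(level="INFO", subsys=("INIT", "COLL"), log_file="debugfile.%h.%p"):
    """Return a copy of the current environment with the logging variables set."""
    env = dict(os.environ)
    env["NCCL_DEBUG"] = level
    env["NCCL_DEBUG_SUBSYS"] = ",".join(subsys)
    env["NCCL_DEBUG_FILE"] = log_file
    return env


if __name__ == "__main__":
    # Preview the per-process log file name that the pattern would produce here.
    print(expand_debug_file("debugfile.%h.%p", socket.gethostname(), os.getpid()))
```

The dictionary returned by ``debug_env()`` can be passed as the ``env`` argument of ``subprocess.run`` when starting an application that uses RCCL.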
Algorithm and protocol control
==============================
@@ -65,10 +107,10 @@ collected in the following table.
.. list-table::
:header-rows: 1
:widths: 70,30
:widths: 40,60
* - **Environment variable**
- **Value**
- **Values**
* - | ``NCCL_ALGO``
| Forces specific algorithm selection for collectives.
@@ -88,10 +130,10 @@ in the following table.
.. list-table::
:header-rows: 1
:widths: 70,30
:widths: 40,60
* - **Environment variable**
- **Value**
- **Values**
* - | ``NCCL_IB_HCA``
| Specifies InfiniBand device:port to use.
@@ -149,10 +191,10 @@ intended for debugging and development purposes.
.. list-table::
:header-rows: 1
:widths: 70,30
:widths: 40,60
* - **Environment variable**
- **Value**
- **Values**
* - | ``CUDA_LAUNCH_BLOCKING``
| Controls CUDA kernel launch blocking behavior.