diff --git a/projects/rccl/docs/api-reference/env-variables.rst b/projects/rccl/docs/api-reference/env-variables.rst
index 165f3f0816..6d4b36eb49 100644
--- a/projects/rccl/docs/api-reference/env-variables.rst
+++ b/projects/rccl/docs/api-reference/env-variables.rst
@@ -19,10 +19,10 @@ in the following table.
 
 .. list-table::
     :header-rows: 1
-    :widths: 70,30
+    :widths: 40,60
 
     * - **Environment variable**
-      - **Value**
+      - **Values**
 
     * - | ``NCCL_CONF_FILE``
         | Specifies the path to the RCCL configuration file.
@@ -42,20 +42,62 @@ in the following table.
 
 .. list-table::
     :header-rows: 1
-    :widths: 70,30
+    :widths: 35,65
 
     * - **Environment variable**
-      - **Value**
+      - **Values**
 
-    * - | ``RCCL_LOG_LEVEL``
-        | Controls RCCL logging verbosity.
-      - | Integer value (default: ``1``)
-        | Higher values increase logging detail
+    * - | ``NCCL_DEBUG``
+        | Controls debug logging in RCCL for troubleshooting and monitoring collective communication operations.
+      - | These are the logging levels in RCCL set via ``NCCL_DEBUG``. Each logging level contains all logging for levels below it. The default logging level is ``ERROR``.
+        |
+        | ``NONE``: No logging is printed.
+        | ``ERROR``: These messages report when a fatal condition has occurred in RCCL and the operation can't continue.
+        | ``VERSION``: ``librccl`` version info is printed during the initialization phase.
+        | ``WARN``: Prints warnings about unusual conditions that could lead to unexpected results.
+        | ``INFO``: Prints standard logging messages about status and operations performed.
+        | ``ABORT``: Unused.
+        | ``TRACE``: Prints trace-level logging of function calls and parameters. Only active when ``librccl`` is built using ``ENABLE_TRACE``.
 
     * - | ``NCCL_DEBUG_SUBSYS``
         | Controls which subsystems generate debug output.
-        | Comma-separated list of subsystems (e.g., ``INIT,COLL``)
-        | Prefix with ``^`` to invert selection
+      - | These are the logging subsystems set via ``NCCL_DEBUG_SUBSYS``. These can be set as a comma-separated list, and can be inverted using the ``^`` prefix. The default subsystem set is ``INIT``, ``BOOTSTRAP``, and ``ENV``.
+        |
+        | ``INIT``: Prints during the initialization phase.
+        | ``COLL``: Prints during execution of collectives.
+        | ``P2P``: Prints logs related to peer-to-peer setup or communication.
+        | ``SHM``: Prints logs related to shared memory.
+        | ``NET``: Prints logs related to network setup or communication.
+        | ``GRAPH``: Prints logs related to parsing the topology of the network.
+        | ``TUNING``: Prints logs related to the tuner plugin.
+        | ``ENV``: Prints logs related to environment variables.
+        | ``ALLOC``: Prints logs related to memory allocation.
+        | ``CALL``: Prints logs for function calls (``TRACE`` only).
+        | ``PROXY``: Prints logs related to the proxy thread.
+        | ``NVLS``: Not valid for AMD/RCCL.
+        | ``BOOTSTRAP``: Prints logs related to the bootstrapping phase of initialization.
+        | ``REG``: Prints logs related to registration and deregistration of transport initialization.
+        | ``PROFILE``: Prints logs related to the profiling/timing info.
+        | ``RAS``: Prints logs related to RAS.
+        | ``VERBS``: Prints logs related to IB/Verbs.
+        | ``ALL``: Activates all logging subsystems.
+
+    * - | ``NCCL_WARN_ENABLE_DEBUG_INFO``
+        | Converts all ``WARN`` level logs to ``INFO`` level logs.
+      - | ``0``: Default value. Variable is not enabled.
+        | ``1``: Enable the variable.
+
+    * - | ``NCCL_DEBUG_TIMESTAMP_LEVELS``
+        | The timestamp levels for ``NCCL_DEBUG``.
+      - | A set of ``NCCL_DEBUG`` levels can have a timestamp prepended set as a comma-separated list which can be inverted using the ``^`` prefix. The default set is ``WARN``.
+
+    * - | ``NCCL_DEBUG_TIMESTAMP_FORMAT``
+        | The timestamp format for ``NCCL_DEBUG``.
+      - | Set the format of the timestamp in ``printf`` style. The default format is ``"[%F %T] "``.
+
+    * - | ``NCCL_DEBUG_FILE``
+        | Write logs to a file rather than ``stdout``.
+      - | The filename can be formatted using ``%h`` for hostname, ``%p`` for pid, and ``%%`` to escape the ``%`` character. It is recommended to use ``%p`` to output to individual files per pid to avoid mixing or potentially overwriting the output. Example usage: ``NCCL_DEBUG_FILE=debugfile.%h.%p``
 
 Algorithm and protocol control
 ==============================
@@ -65,10 +107,10 @@ collected in the following table.
 
 .. list-table::
     :header-rows: 1
-    :widths: 70,30
+    :widths: 40,60
 
     * - **Environment variable**
-      - **Value**
+      - **Values**
 
     * - | ``NCCL_ALGO``
         | Forces specific algorithm selection for collectives.
@@ -88,10 +130,10 @@ in the following table.
 
 .. list-table::
     :header-rows: 1
-    :widths: 70,30
+    :widths: 40,60
 
     * - **Environment variable**
-      - **Value**
+      - **Values**
 
     * - | ``NCCL_IB_HCA``
         | Specifies InfiniBand device:port to use.
@@ -149,10 +191,10 @@ intended for debugging and development purposes.
 
 .. list-table::
     :header-rows: 1
-    :widths: 70,30
+    :widths: 40,60
 
     * - **Environment variable**
-      - **Value**
+      - **Values**
 
     * - | ``CUDA_LAUNCH_BLOCKING``
         | Controls CUDA kernel launch blocking behavior.