Added ERROR message class to handle fatal error messages. (#2002)

* Added ERROR message class to handle fatal error messages.

New ERROR message class will print the message in all debug level,
including none.

Change some of the fatal error message to be in ERROR instead of WARN.

Added new error handler function to print out more meaningful error
message in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separated class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
This commit is contained in:
Arm Patinyasakdikul
2025-10-30 14:14:20 -07:00
committed by GitHub
szülő 84fdcab68a
commit 1ce83d5cc0
14 fájl változott, egészen pontosan 49 új sor hozzáadva és 20 régi sor törölve
+1 -1
Fájl megtekintése
@@ -7,7 +7,7 @@
#ifndef COMMON_H_
#define COMMON_H_
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_ERROR=1, NCCL_LOG_VERSION=2, NCCL_LOG_WARN=3, NCCL_LOG_INFO=4, NCCL_LOG_ABORT=5, NCCL_LOG_TRACE=6} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
@@ -62,7 +62,8 @@ void mock_logger(ncclDebugLogLevel level, unsigned long flags,
// Convert log level to string
const char* level_str;
switch(level) {
case NCCL_LOG_NONE: level_str = "NONE"; break;
case NCCL_LOG_NONE: level_str = "NONE"; break;
case NCCL_LOG_ERROR: level_str = "ERROR"; break;
case NCCL_LOG_VERSION: level_str = "VERSION"; break;
case NCCL_LOG_WARN: level_str = "WARN"; break;
case NCCL_LOG_INFO: level_str = "INFO"; break;