Atul Kulkarni 142860442a Enable MPI support to execute MPI specific unit/functional tests (#1996)
* Added MPI support to execute unit/functional tests

Update node and process validation
Updated node detection count and modified validation method
Update validation logic to include max procs and nodes

* Address review comments

* Fix warnings

* Added a new NET transport test and clean up

* Added MPI test logging mechanism

* Decoupled GTest framework

* Added Net IB functional tests

* Updated with resource guards

* Added NET IB tests and refactored code

* Update P2pWorkflow test

* Update documentation

* Add MPI_TESTS_ENABLED guard to the file

* Fix Shm and NetIB tests

* Applied refactoring and cleanup

* Replaced BufferGuard with AutoGuard

* Modified test debug logging

* Use macro to reduce NcclTypeTraits code duplication

- Replace repetitive template specializations with a single
  DEFINE_NCCL_TYPE_TRAIT macro
- Use stringification operator (#) to auto-generate type name strings
- Add #undef to keep macro from polluting namespace
- Makes adding new type mappings trivial
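
The macro approach described above can be sketched roughly as follows. This is a minimal illustration, assuming C++17; `DemoDataType`, `DemoTypeTraits`, `DEFINE_DEMO_TYPE_TRAIT`, and `checkDemoTraits` are hypothetical stand-ins, not the actual `NcclTypeTraits` or `DEFINE_NCCL_TYPE_TRAIT` definitions, which are not shown in this file.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the NCCL data type enum, so the sketch
// compiles without RCCL headers.
enum class DemoDataType { Int32, Float, Double };

// Primary template is intentionally left undefined: using an unmapped
// type becomes a compile-time error.
template <typename T>
struct DemoTypeTraits;

// One macro invocation replaces one hand-written specialization.
// #CPP_TYPE stringifies the type token to produce the name string.
#define DEFINE_DEMO_TYPE_TRAIT(CPP_TYPE, ENUM_VALUE)                    \
    template <>                                                         \
    struct DemoTypeTraits<CPP_TYPE>                                     \
    {                                                                   \
        static constexpr DemoDataType value = DemoDataType::ENUM_VALUE; \
        static constexpr const char* name = #CPP_TYPE;                  \
    };

DEFINE_DEMO_TYPE_TRAIT(int32_t, Int32)
DEFINE_DEMO_TYPE_TRAIT(float, Float)
DEFINE_DEMO_TYPE_TRAIT(double, Double)

// Remove the macro once the mappings exist so it does not leak further.
#undef DEFINE_DEMO_TYPE_TRAIT

// Small self-check showing how the generated traits are queried.
inline void checkDemoTraits()
{
    assert(DemoTypeTraits<double>::value == DemoDataType::Double);
    assert(std::string(DemoTypeTraits<int32_t>::name) == "int32_t");
}
```

With this shape, adding a mapping is a single extra `DEFINE_DEMO_TYPE_TRAIT(...)` line placed before the `#undef`.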

* Unify buffer initialization with generic pattern function

- Remove initializeBufferWithCustomPattern
- Make initializeBufferWithPattern generic with PatternFunc template param
- Now single function handles all patterns via lambda injection
- Updated all test files to use lambdas for pattern generation
- Pattern logic now visible at call site (self-documenting)

* Unify buffer verification with pluggable pattern function

- Remove verifyBufferWithCustomCheck
- Make verifyBufferData generic with PatternFunc template param
- Single function handles all verification patterns via lambda injection
- Updated all test files to use lambdas
- Better defaults: num_samples=0 means verify all elements
- Pattern logic now visible at call site (self-documenting)
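
The lambda-injection idea behind the two unification items above can be sketched on the host side as below. This is only an illustration under stated assumptions: `fillWithPattern`, `verifyWithPattern`, and `demoPatternRoundTrip` are hypothetical names operating on `std::vector`, whereas the real `initializeBufferWithPattern` and `verifyBufferData` work on device buffers and their signatures are not shown here. The evenly spaced sampling used when `num_samples` is nonzero is also an assumption; the commit only states that `num_samples=0` verifies all elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only analogue of a generic initializer: the caller
// supplies the pattern as a callable mapping an index to a value.
template <typename T, typename PatternFunc>
void fillWithPattern(std::vector<T>& buffer, PatternFunc pattern)
{
    for (std::size_t i = 0; i < buffer.size(); ++i)
    {
        buffer[i] = pattern(i);
    }
}

// Hypothetical host-only analogue of a generic verifier.
// num_samples == 0 checks every element; otherwise evenly spaced
// elements are checked (the sampling scheme is an assumption).
// Exact comparison is used, so this sketch suits integer types only.
template <typename T, typename PatternFunc>
bool verifyWithPattern(const std::vector<T>& buffer,
                       PatternFunc pattern,
                       std::size_t num_samples = 0)
{
    if (buffer.empty())
    {
        return true;
    }
    const std::size_t count =
        (num_samples == 0 || num_samples > buffer.size()) ? buffer.size() : num_samples;
    const std::size_t stride = buffer.size() / count; // always >= 1
    for (std::size_t s = 0; s < count; ++s)
    {
        const std::size_t i = s * stride;
        if (buffer[i] != pattern(i))
        {
            return false;
        }
    }
    return true;
}

// The same lambda drives both initialization and verification, so the
// pattern logic is visible at the call site.
inline bool demoPatternRoundTrip()
{
    std::vector<int> buffer(64);
    const int rank = 3; // stand-in for an MPI rank
    auto pattern = [rank](std::size_t i) { return rank * 1000 + static_cast<int>(i); };

    fillWithPattern(buffer, pattern);
    const bool intact = verifyWithPattern(buffer, pattern);

    buffer[10] = -1; // corrupt one element
    const bool corruption_detected = !verifyWithPattern(buffer, pattern);

    return intact && corruption_detected;
}
```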

* Docs: Add DeviceBufferHelpers section to MPITestRunner.md

- Document new refactored buffer initialization/verification API
- Explain pluggable pattern functions with lambda examples
- Show type mapping and automatic float/int comparison
- Include migration guide from old API to new unified functions
- Demonstrate best practices with real-world examples
- Reference recent refactoring commits (macro-based type traits)

* Docs: Update documentation and examples

- Update on DeviceBufferHelpers
- Update examples using DeviceBufferHelpers methods, e.g. data verification

* Address review comment.

- Replace manual pattern generation loop with initializeBufferWithPattern call
- Use downloadBuffer to get host copy instead of manual hipMemcpy

* Remove non-existent dependency

* Remove duplicate testcase

* Code cleanup in test files

* Moved common constants to base class

[ROCm/rccl commit: 29e1567b95]
2025-12-06 16:05:37 -06:00

188 lines
5.4 KiB
C++

/*************************************************************************
* Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
/**
* @file MPIHelpers.hpp
* @brief Shared MPI utility functions for both GTest and standalone tests
*
* Provides common functionality for MPI test initialization, GPU setup,
* and per-rank logging that can be used by both GTest-based tests and
* standalone tests (performance benchmarks, etc.).
*/
#ifndef MPI_HELPERS_HPP
#define MPI_HELPERS_HPP
#ifdef MPI_TESTS_ENABLED
#include <array>
#include <atomic>
#include <memory>
#include <optional>
#include <string>
#include <thread>
/**
* @namespace MPIHelpers
* @brief Shared MPI utilities for test infrastructure
*/
namespace MPIHelpers
{
/**
* @struct MPIContext
* @brief MPI environment context information
*/
struct MPIContext
{
    int world_rank;     ///< MPI rank in MPI_COMM_WORLD
    int world_size;     ///< Total number of MPI processes
    int thread_support; ///< MPI thread support level provided
};
/**
* @brief Initialize MPI with thread support
*
* Initializes MPI requesting MPI_THREAD_MULTIPLE support and returns context
* info, including the thread support level actually provided.
*
* @param argc Pointer to argc from main()
* @param argv Pointer to argv from main()
* @return MPIContext with rank, size, and thread support info
*
* @note Must be called before any other MPI operations
* @note Automatically sets MPIEnvironment static variables
*/
MPIContext initializeMPI(int* argc, char*** argv);
/**
* @brief Set up the GPU device for this MPI rank
*
* Assigns GPU device based on local rank (ranks on same node).
* Uses MPI_COMM_TYPE_SHARED to detect node topology and assigns
* GPUs in round-robin fashion.
*
* @param world_rank MPI rank in MPI_COMM_WORLD
*
* @note Handles multiple ranks per node automatically
* @note Uses hipSetDevice() to assign GPU
*/
void setupGPU(int world_rank);
/**
* @class FileDescriptor
* @brief RAII wrapper for POSIX file descriptors
*
* Automatically closes file descriptor on destruction.
* Move-only semantics prevent accidental duplication.
*/
class FileDescriptor
{
public:
    explicit FileDescriptor(int fd = -1) noexcept;
    ~FileDescriptor();

    // Move-only semantics
    FileDescriptor(FileDescriptor&& other) noexcept;
    FileDescriptor& operator=(FileDescriptor&& other) noexcept;

    // Delete copy operations
    FileDescriptor(const FileDescriptor&) = delete;
    FileDescriptor& operator=(const FileDescriptor&) = delete;

    [[nodiscard]] int get() const noexcept;
    [[nodiscard]] bool is_valid() const noexcept;
    int release() noexcept;

private:
    int fd_;
};
/**
* @class TeeThread
* @brief Thread for duplicating output to console and log file
*
* Used by rank 0 when per-rank logging is enabled to send output
* to both console and log file simultaneously.
*/
class TeeThread
{
public:
    TeeThread(int read_fd, int console_fd, int log_fd);
    ~TeeThread();

    // Delete copy/move operations
    TeeThread(const TeeThread&) = delete;
    TeeThread& operator=(const TeeThread&) = delete;
    TeeThread(TeeThread&&) = delete;
    TeeThread& operator=(TeeThread&&) = delete;

private:
    void tee_loop();

    int read_fd_;
    int console_fd_;
    int log_fd_;
    std::atomic<bool> running_;
    std::thread thread_;
};
/**
* @struct RankLogConfig
* @brief Per-rank logging configuration and state
*
* Manages file descriptors and threads for per-rank logging when
* RCCL_MPI_LOG_ALL_RANKS=1 environment variable is set.
*/
struct RankLogConfig
{
    std::optional<FileDescriptor> log_fd;        ///< Log file descriptor
    std::optional<FileDescriptor> saved_stdout;  ///< Saved stdout for restoration
    std::optional<FileDescriptor> saved_stderr;  ///< Saved stderr for restoration
    std::optional<FileDescriptor> pipe_read_fd;  ///< Pipe read end (rank 0 only)
    std::optional<FileDescriptor> pipe_write_fd; ///< Pipe write end (rank 0 only)
    std::unique_ptr<TeeThread> tee_thread;       ///< Tee thread (rank 0 only)
    bool logging_enabled{false};                 ///< Is per-rank logging enabled?
    bool is_rank_zero{false};                    ///< Is this rank 0?
};
/**
* @brief Set up per-rank logging if RCCL_MPI_LOG_ALL_RANKS=1
*
* Configures output redirection for MPI ranks:
* - Rank 0: Output to BOTH console AND log file (tee behavior)
* - Rank 1-N: Output redirected to rccl_test_rank_<N>.log
*
* If RCCL_MPI_LOG_ALL_RANKS is not set:
* - Rank 0: Normal console output
* - Rank 1-N: Output suppressed (redirected to /dev/null)
*
* @param rank MPI rank in MPI_COMM_WORLD
* @return Optional RankLogConfig if logging was configured, std::nullopt otherwise
*
* @note Call before any test output
* @note Call restoreRankLogging() at the end to clean up
*/
std::optional<RankLogConfig> setupRankLogging(int rank);
/**
* @brief Restore original stdout/stderr after per-rank logging
*
* Cleans up per-rank logging configuration and restores original
* stdout/stderr file descriptors.
*
* @param config RankLogConfig to clean up
*
* @note Safe to call multiple times
* @note Flushes pending output before restoration
*/
void restoreRankLogging(RankLogConfig& config);
} // namespace MPIHelpers
#endif // MPI_TESTS_ENABLED
#endif // MPI_HELPERS_HPP
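
The header only declares `FileDescriptor`; its definitions live elsewhere and are not shown here. As a rough guide to the documented behavior (close on destruction, move-only ownership), one possible implementation is sketched below. It assumes a POSIX system. The `fd_sketch` namespace, the private `reset()` helper, and `demoClosesOnScopeExit()` are hypothetical additions for illustration and are not part of the real source.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>
#include <utility>

namespace fd_sketch // hypothetical namespace, kept apart from MPIHelpers
{
class FileDescriptor
{
public:
    explicit FileDescriptor(int fd = -1) noexcept : fd_(fd) {}
    ~FileDescriptor() { reset(); }

    // Moving transfers ownership and leaves the source invalid (-1).
    FileDescriptor(FileDescriptor&& other) noexcept : fd_(other.release()) {}
    FileDescriptor& operator=(FileDescriptor&& other) noexcept
    {
        if (this != &other)
        {
            reset();               // close whatever we currently own
            fd_ = other.release(); // then take over the other descriptor
        }
        return *this;
    }

    FileDescriptor(const FileDescriptor&) = delete;
    FileDescriptor& operator=(const FileDescriptor&) = delete;

    [[nodiscard]] int get() const noexcept { return fd_; }
    [[nodiscard]] bool is_valid() const noexcept { return fd_ >= 0; }

    // Give up ownership without closing; the caller must close the fd.
    int release() noexcept { return std::exchange(fd_, -1); }

private:
    // Hypothetical helper, not part of the declared interface.
    void reset() noexcept
    {
        if (fd_ >= 0)
        {
            ::close(fd_);
            fd_ = -1;
        }
    }

    int fd_;
};

// Returns true if a descriptor is closed once its final owner goes out
// of scope, and if a moved-from wrapper reports itself as invalid.
inline bool demoClosesOnScopeExit()
{
    int fds[2];
    if (::pipe(fds) != 0)
    {
        return false;
    }
    FileDescriptor write_end(fds[1]); // closed when this function returns
    const int raw_read = fds[0];
    {
        FileDescriptor read_end(raw_read);
        FileDescriptor moved(std::move(read_end));
        if (read_end.is_valid() || !moved.is_valid())
        {
            return false;
        }
    } // `moved` closes raw_read here

    // fcntl on a closed descriptor fails and returns -1.
    return ::fcntl(raw_read, F_GETFD) == -1;
}
} // namespace fd_sketch
```

Wrapping members such as `saved_stdout` in `std::optional<FileDescriptor>`, as `RankLogConfig` does, then lets the struct release every descriptor it holds when it is destroyed, without explicit `close()` calls.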