142860442a
* Added MPI support to execute unit/functional tests
Update node and process validation
Updated node detection count and modified validation method
Update validation logic to include max procs and nodes
* Address review comments
* Fix warnings
* Added a new NET transport test and clean up
* Added MPI test logging mechanism
* Decoupled GTest framework
* Added Net IB functional tests
* Updated with resource guards
* Added NET IB tests and refactored code
* Update P2pWorkflow test
* Update documentation
* Add MPI_TESTS_ENABLED guard to the file
* Fix Shm and NetIB tests
* Applied refactoring and cleanup
* Replaced BufferGuard with AutoGuard
* Modified test debug logging
* Use macro to reduce NcclTypeTraits code duplication
- Replace repetitive template specializations with a single
DEFINE_NCCL_TYPE_TRAIT macro
- Use stringification operator (#) to auto-generate type name strings
- Add #undef to keep macro from polluting namespace
- Makes adding new type mappings trivial
* Unify buffer initialization with generic pattern function
- Remove initializeBufferWithCustomPattern
- Make initializeBufferWithPattern generic with PatternFunc template param
- Now single function handles all patterns via lambda injection
- Updated all test files to use lambdas for pattern generation
- Pattern logic now visible at call site (self-documenting)
* Unify buffer verification with pluggable pattern function
- Remove verifyBufferWithCustomCheck
- Make verifyBufferData generic with PatternFunc template param
- Single function handles all verification patterns via lambda injection
- Updated all test files to use lambdas
- Better defaults: num_samples=0 means verify all elements
- Pattern logic now visible at call site (self-documenting)
* Docs: Add DeviceBufferHelpers section to MPITestRunner.md
- Document new refactored buffer initialization/verification API
- Explain pluggable pattern functions with lambda examples
- Show type mapping and automatic float/int comparison
- Include migration guide from old API to new unified functions
- Demonstrate best practices with real-world examples
- Reference recent refactoring commits (macro-based type traits)
* Docs: Update documentation and examples
- Update the DeviceBufferHelpers documentation
- Update examples to use DeviceBufferHelpers methods, e.g. for data verification
* Address review comment.
- Replace manual pattern generation loop with initializeBufferWithPattern call
- Use downloadBuffer to get host copy instead of manual hipMemcpy
* Remove non-existent dependency
* Remove duplicate testcase
* Code cleanup in test files
* Moved common constants to base class
[ROCm/rccl commit: 29e1567b95]
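The two refactorings described in the commit message (macro-generated type traits and pattern functions injected as lambdas) can be illustrated with a small host-only sketch. Everything here is hypothetical: `DemoTypeTraits`, `DEFINE_DEMO_TYPE_TRAIT`, `initPatternSketch`, and `verifyPatternSketch` are stand-ins, not the actual `NcclTypeTraits` or DeviceBufferHelpers API, and the buffers are plain `std::vector`s rather than device memory. The sketch also assumes that a non-zero `num_samples` checks only the first `num_samples` elements; the real sampling strategy may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait mapping a C++ type to a printable name.
template <typename T> struct DemoTypeTraits;

// One macro replaces a hand-written specialization per type.
// The stringification operator (#) turns the type token into its name.
#define DEFINE_DEMO_TYPE_TRAIT(T)                    \
    template <> struct DemoTypeTraits<T> {           \
        static const char* name() { return #T; }     \
    }

DEFINE_DEMO_TYPE_TRAIT(float);
DEFINE_DEMO_TYPE_TRAIT(int);

// Keep the helper macro from leaking into the rest of the translation unit.
#undef DEFINE_DEMO_TYPE_TRAIT

// Fill a host buffer; the pattern is supplied by the caller as a callable.
template <typename T, typename PatternFunc>
void initPatternSketch(std::vector<T>& buf, PatternFunc pattern) {
    for (std::size_t i = 0; i < buf.size(); ++i) {
        buf[i] = static_cast<T>(pattern(i));
    }
}

// Verify a host buffer against the same kind of callable.
// num_samples == 0 means "verify all elements" (assumption: otherwise the
// first num_samples elements are checked).
template <typename T, typename PatternFunc>
bool verifyPatternSketch(const std::vector<T>& buf, PatternFunc pattern,
                         std::size_t num_samples = 0) {
    const std::size_t n =
        (num_samples == 0 || num_samples > buf.size()) ? buf.size() : num_samples;
    for (std::size_t i = 0; i < n; ++i) {
        const T expected = static_cast<T>(pattern(i));
        if constexpr (std::is_floating_point_v<T>) {
            // Floating-point values are compared with a small tolerance.
            if (std::fabs(buf[i] - expected) > static_cast<T>(1e-6)) return false;
        } else {
            // Integral values are compared exactly.
            if (buf[i] != expected) return false;
        }
    }
    return true;
}
```

With this shape, the pattern is visible at the call site: a caller passes something like `[](std::size_t i) { return 0.5f * i; }` to both the fill and the check, so no separate "custom pattern" variants are needed.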
150 lines
4.4 KiB
C++
/*************************************************************************
 * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
 *
 * See LICENSE.txt for license information
 ************************************************************************/

/**
 * @file MPIEnvironment.hpp
 * @brief Global MPI environment and error checking macros for RCCL testing
 *
 * Provides a Google Test Environment for managing MPI initialization/finalization
 * and error checking macros for MPI, NCCL, and HIP operations in tests.
 */

#ifndef RCCL_MPI_ENVIRONMENT_HPP
#define RCCL_MPI_ENVIRONMENT_HPP

#include <gtest/gtest.h>

// Conditionally include MPI headers for MPI-based tests
#ifdef MPI_TESTS_ENABLED

#include "rccl/rccl.h"
#include <hip/hip_runtime.h>
#include <mpi.h>

#include "TestChecks.hpp"
#include "ResourceGuards.hpp"

/**
 * @class MPIEnvironment
 * @brief Google Test Environment for global MPI setup and teardown
 *
 * Manages the global MPI state for all MPI-based tests:
 * - One-time MPI initialization (MPI_Init_thread)
 * - GPU device initialization and assignment
 * - MPI finalization and result aggregation across ranks
 *
 * @note MPI_Init can only be called once, so this uses static flags
 * @note Each MPI rank is assigned to a unique GPU
 * @see MPITestBase for test-level functionality
 */
class MPIEnvironment : public ::testing::Environment
{
public:
    /**
     * @brief Current MPI rank in MPI_COMM_WORLD
     *
     * Valid after MPI initialization. Each rank corresponds to one GPU.
     */
    inline static int world_rank{0};

    /**
     * @brief Total number of MPI processes in MPI_COMM_WORLD
     *
     * Valid after MPI initialization. Must not exceed number of available GPUs.
     */
    inline static int world_size{0};

    /**
     * @brief Aggregated return code for test results
     *
     * Set to non-zero on test failure. Aggregated across all ranks during cleanup.
     */
    inline static int retCode{0};

    /**
     * @brief Flag indicating MPI has been initialized
     *
     * Prevents multiple MPI_Init calls (only allowed once per process).
     */
    inline static bool mpi_initialized{false};

    /**
     * @brief Cached result of multi-node detection
     *
     * Computed once during SetUp() using MPI_Comm_split_type().
     * -1 = not computed, 0 = single node, 1 = multi-node
     *
     * @note MUST be initialized before any TEST_* macros are called
     * @note Prevents nested MPI collective operations in isMultiNodeTest()
     */
    inline static int cached_multi_node_result{-1};

    /**
     * @brief Flag indicating GPU devices have been initialized
     *
     * Prevents redundant device setup across multiple test runs.
     */
    inline static bool devices_initialized{false};

    /**
     * @brief Initialize MPI with thread support
     *
     * Calls MPI_Init_thread() with MPI_THREAD_MULTIPLE support and sets
     * world_rank and world_size. Safe to call multiple times (idempotent).
     *
     * @note Should be called before any MPI operations
     * @see mpi_initialized flag
     */
    static void initialize_mpi();

    /**
     * @brief Initialize and assign GPU devices to MPI ranks
     *
     * Performs the following:
     * 1. Queries available GPU count
     * 2. Validates sufficient GPUs for all ranks
     * 3. Assigns one GPU per rank (rank N → GPU N)
     * 4. Resets and sets HIP device context
     * 5. Synchronizes all ranks
     *
     * @note Requires world_size ≤ number of available GPUs
     * @see devices_initialized flag
     */
    static void initialize_devices();

    /**
     * @brief Clean up MPI resources and finalize
     *
     * Performs the following cleanup:
     * 1. Synchronizes all ranks with MPI_Barrier
     * 2. Aggregates test results across ranks with MPI_Allreduce
     * 3. Prints final results from rank 0
     * 4. Calls MPI_Finalize()
     *
     * @note Uses static guard to prevent multiple cleanup attempts
     * @note Safe to call from signal handlers or error paths
     */
    static void cleanup_mpi();

    /**
     * @brief Google Test SetUp hook - called once before all tests
     *
     * Initializes MPI and GPU devices for the entire test suite.
     */
    void SetUp() override;

    /**
     * @brief Google Test TearDown hook - called once after all tests
     *
     * Synchronizes all ranks and calls cleanup_mpi() to finalize MPI.
     */
    void TearDown() override;
};

#endif // MPI_TESTS_ENABLED

#endif // RCCL_MPI_ENVIRONMENT_HPP
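The static flags documented in the header (an "initialized" guard and a tri-state multi-node cache) follow a common pattern that can be sketched without MPI or HIP. This is only an illustration under stated assumptions: `EnvSketch` and its members are hypothetical names, no MPI call is made, and the multi-node decision is assumed to compare a node-local process count against the world size, which may not match how the real `SetUp()` interprets the result of `MPI_Comm_split_type()`.

```cpp
#include <cassert>

// Hypothetical stand-in for the static state kept by MPIEnvironment.
struct EnvSketch {
    inline static bool initialized{false};
    inline static int init_calls{0};
    // -1 = not computed, 0 = single node, 1 = multi-node
    inline static int cached_multi_node{-1};

    // Idempotent initialization: the guarded work runs only on the first call.
    static void initialize() {
        if (initialized) return;
        ++init_calls;  // placeholder for the one-time MPI_Init_thread() call
        initialized = true;
    }

    // Compute the answer once up front so later queries are plain reads
    // and never need to issue a collective operation themselves.
    static void cache_multi_node(int world_size, int local_size) {
        if (cached_multi_node == -1) {
            cached_multi_node = (local_size < world_size) ? 1 : 0;
        }
    }

    // Query used by tests; requires the cache to be filled beforehand.
    static bool is_multi_node() {
        assert(cached_multi_node != -1 && "cache must be filled first");
        return cached_multi_node == 1;
    }
};
```

Caching the result up front is what lets a per-test check avoid starting a new collective while another may be in flight, which appears to be the concern behind the `cached_multi_node_result` note.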