Atul Kulkarni 7e10267dfd Added a Process Isolated Test Runner (#1993 )

* Added single process isolation support to execute tests

* Address review comments

* Update README

* Removed requirement of explicit call to clear method

* Added macros for simplified usage

* Updated tests to use process isolation framework

* Adjust summary output format for isolated tests

* Updated rccl_wrap tests

* Used process isolation in AllocTests

* Used process isolation and fixed failing tests

* Modified test output, added signal handling

Updated macros to handle lambdas

* Convert argcheck tests to isolated tests

* Convert proxy tests to isolated tests

* Remove non-supported test

* Fixed file descriptor handling and clearing env vars for tests

2025-12-08 10:36:05 -06:00

Process Isolated Test Runner

A lightweight C++ testing framework for running Google Test cases in isolated processes with clean environment settings.

Overview
Why Use Process Isolation?
Quick Start
Core Concepts
API Reference
Examples
Best Practices
Troubleshooting

Overview

ProcessIsolatedTestRunner is a framework that executes tests in separate processes using fork(). This ensures complete isolation between tests, particularly useful when testing code with static variables or environment-dependent behavior.

Key Features:

✅ Process-based test isolation (each test runs in its own process)
✅ Per-test environment variable management
✅ Configurable timeouts
✅ Sequential or stop-on-failure execution
✅ Thread-safe test registration
✅ Detailed test result reporting

Location: test/common/ProcessIsolatedTestRunner.hpp

Why Use Process Isolation?

Problem: Static Variable Pollution

Consider this RCCL code with static variables:

void rcclSetP2pNetChunkSize(struct ncclComm* comm, int& chunkSize) {
  static int p2pNetChunkSize = RCCL_VALUE_UNSET;  // ← Static variable!

  if (p2pNetChunkSize == RCCL_VALUE_UNSET) {
    const char* inputStr = getenv("NCCL_P2P_NET_CHUNKSIZE");
    if (inputStr) {
      // Parse the environment variable value
      p2pNetChunkSize = parseValue(inputStr);  // e.g., "12345" → 12345
    } else {
      // No env var set, calculate value based on architecture...
      p2pNetChunkSize = calculateValue();
    }
  }
  chunkSize = p2pNetChunkSize;
}

How the static variable gets set:

First time called: p2pNetChunkSize == RCCL_VALUE_UNSET is true
Code reads environment variable with getenv("NCCL_P2P_NET_CHUNKSIZE")
If env var exists → parse its value (e.g., "12345" string) and assign to static variable
If env var doesn't exist → calculate default value and assign to static variable
Static variable is now set and persists for the lifetime of the process

Without Process Isolation:

TEST(MyTest, FirstTest) {
  setenv("NCCL_P2P_NET_CHUNKSIZE", "12345", 1);
  rcclSetP2pNetChunkSize(comm, chunkSize);
  // ✓ getenv() returns "12345"
  // ✓ Static variable p2pNetChunkSize gets set to 12345
  // ✓ chunkSize is now 12345
}

TEST(MyTest, SecondTest) {
  unsetenv("NCCL_P2P_NET_CHUNKSIZE");
  rcclSetP2pNetChunkSize(comm, chunkSize);
  // ❌ getenv() returns nullptr (env var cleared)
  // ❌ BUT: p2pNetChunkSize != RCCL_VALUE_UNSET (still 12345 from FirstTest!)
  // ❌ Code skips the if-block, never reads env var or recalculates
  // ❌ chunkSize is STILL 12345 from previous test!
  // This test will fail or produce incorrect results
}

The Problem: Static variables are initialized once per process and persist across multiple tests. Even if you change or clear environment variables, the static variable retains its old value.

With Process Isolation:

// Each test runs in a separate process
// Static variables are reset for each test
// ✅ Tests are truly independent
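
The effect can be reproduced without RCCL. The sketch below is a standalone POSIX example, not the framework's implementation; `cachedSetting`, `cachedSettingInChild`, and the `DEMO_SETTING` variable are hypothetical names chosen for illustration. Each call forks a child, sets or clears the variable there, and reads the cached value back through a pipe.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a function that caches an env var in a static.
int cachedSetting() {
  static int cached = -1;  // -1 plays the role of RCCL_VALUE_UNSET
  if (cached == -1) {
    const char* s = std::getenv("DEMO_SETTING");
    cached = s ? std::atoi(s) : 7;  // 7 is an arbitrary default
  }
  return cached;
}

// Runs cachedSetting() in a forked child with DEMO_SETTING set to `value`
// (or unset when value is nullptr). Returns the child's result, or -1 on error.
int cachedSettingInChild(const char* value) {
  int fds[2];
  if (pipe(fds) != 0) return -1;
  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  if (pid == 0) {  // child: private copy of the static and the environment
    close(fds[0]);
    if (value) {
      setenv("DEMO_SETTING", value, 1);
    } else {
      unsetenv("DEMO_SETTING");
    }
    int result = cachedSetting();
    ssize_t n = write(fds[1], &result, sizeof(result));
    _exit(n == static_cast<ssize_t>(sizeof(result)) ? 0 : 1);
  }
  close(fds[1]);  // parent: read the value the child computed
  int result = -1;
  ssize_t n = read(fds[0], &result, sizeof(result));
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  return n == static_cast<ssize_t>(sizeof(result)) ? result : -1;
}
```

Note that fork() copies the parent's memory, so the children only start with the static unset if the parent itself has not called `cachedSetting()` beforehand.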

Common Use Cases

Testing environment variable behavior - When code reads env vars into static variables
Testing architecture-specific logic - Different GPU architectures with cached state
Testing initialization code - One-time initialization patterns
Testing configuration changes - When config is cached statically

Quick Start

Basic Example (Using Macros)

The simplest way to use ProcessIsolatedTestRunner is with the macros:

#include "common/ProcessIsolatedTestRunner.hpp"

TEST(Rcclwrap, MyIsolatedTest) {
  // Single test with environment variables - all in one call!
  RUN_ISOLATED_TEST_WITH_ENV("TestWithCleanEnvironment",
    []() {
      // This runs in a separate process
      const char* value = getenv("MY_VARIABLE");
      EXPECT_STREQ(value, "test_value");
      EXPECT_TRUE(someFunction());
    },
    {{"MY_VARIABLE", "test_value"}}
  );
}

TEST(Rcclwrap, MyIsolatedTests) {
  // Multiple tests with different configurations
  RUN_ISOLATED_TESTS(
    ProcessIsolatedTestRunner::TestConfig("Test1", []() {
      EXPECT_TRUE(checkCondition1());
    }),
    ProcessIsolatedTestRunner::TestConfig("Test2", []() {
      EXPECT_TRUE(checkCondition2());
    }).withEnvironment({{"VAR", "value"}}),
    ProcessIsolatedTestRunner::TestConfig("Test3", []() {
      EXPECT_TRUE(checkCondition3());
    }).withTimeout(std::chrono::seconds(60))
  );
}

Manual API (For Advanced Use Cases)

You can also use the API directly for more control:

#include "common/ProcessIsolatedTestRunner.hpp"

TEST(Rcclwrap, MyIsolatedTests) {
  // Register a test with environment variables
  ProcessIsolatedTestRunner::registerTest(
      ProcessIsolatedTestRunner::TestConfig(
          "TestWithCleanEnvironment",
          []() {
            // This runs in a separate process
            const char* value = getenv("MY_VARIABLE");
            EXPECT_STREQ(value, "test_value");

            // Your test logic here
            EXPECT_TRUE(someFunction());
          })
          .withEnvironment({{"MY_VARIABLE", "test_value"}})
  );

  // Execute all registered tests
  bool allTestsPassed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(allTestsPassed);
}

Core Concepts

1. Test Configuration (`TestConfig`)

Defines how a test should be executed:

TestConfig config(
    "TestName",           // Test name (for reporting)
    []() { /* logic */ }  // Test function (lambda or function pointer)
);

// Optional configurations
config.withEnvironment({{"VAR1", "value1"}, {"VAR2", "value2"}})
      .withTimeout(std::chrono::seconds(60))
      .withCleanEnvironment(false);  // Inherit parent environment
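
The chained calls work because each setter returns a reference to the object it was called on. A minimal sketch of that pattern follows; `DemoConfig` is a hypothetical type whose layout is an assumption and is not taken from ProcessIsolatedTestRunner.hpp. It also assumes that `withEnvironment()` replaces any previously set variables.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical configuration type; method names mirror the usage shown above.
struct DemoConfig {
  std::string name;
  std::function<void()> body;
  std::unordered_map<std::string, std::string> environment;
  std::chrono::seconds timeout{30};  // default timeout stated in this README

  DemoConfig(std::string n, std::function<void()> b)
      : name(std::move(n)), body(std::move(b)) {}

  // Each setter returns *this so that calls can be chained.
  DemoConfig& withEnvironment(
      const std::unordered_map<std::string, std::string>& env) {
    environment = env;  // assumption: replaces, does not merge
    return *this;
  }

  DemoConfig& withTimeout(std::chrono::seconds t) {
    timeout = t;
    return *this;
  }
};
```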

2. Test Registration

Tests must be registered before execution:

// Method 1: Full configuration
ProcessIsolatedTestRunner::registerTest(config);

// Method 2: Simple (name + logic only)
ProcessIsolatedTestRunner::registerTest("SimpleTest", []() {
  EXPECT_TRUE(true);
});

// Method 3: With environment
ProcessIsolatedTestRunner::registerTest(
    "EnvTest",
    []() { /* logic */ },
    {{"ENV_VAR", "value"}}
);

3. Test Execution

⚠️ IMPORTANT: Tests do NOT run automatically after registration. You MUST explicitly call executeAllTests() to run them.

Execute all registered tests:

// Default options (continue on failure, no verbose logging)
bool passed = ProcessIsolatedTestRunner::executeAllTests();

// Custom options
ProcessIsolatedTestRunner::ExecutionOptions options;
options.stopOnFirstFailure = true;   // Stop after first failure
options.verboseLogging = true;       // Print detailed logs

bool passedWithOptions = ProcessIsolatedTestRunner::executeAllTests(options);

Common Mistake:

// ❌ BAD: Tests registered but never executed!
TEST(MyTest, IsolatedTests) {
  ProcessIsolatedTestRunner::registerTest("Test1", []() { /* ... */ });
  ProcessIsolatedTestRunner::registerTest("Test2", []() { /* ... */ });
  // Missing executeAllTests() - tests will NOT run!
}

// ✅ GOOD: Tests registered and executed
TEST(MyTest, IsolatedTests) {
  ProcessIsolatedTestRunner::registerTest("Test1", []() { /* ... */ });
  ProcessIsolatedTestRunner::registerTest("Test2", []() { /* ... */ });
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);
}

4. Test Results

Each test produces a TestResult:

struct TestResult {
  std::string testName;               // Name of the test
  bool passed;                        // Whether the test passed
  bool skipped;                       // Whether the test was skipped
  int exitCode;                       // Process exit code
  pid_t processId;                    // Process ID that ran the test
  std::chrono::milliseconds duration; // Execution duration
  std::string errorMessage;           // Error message if failed
  std::unordered_map<std::string, std::string> environment;  // Env used
};
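
A vector of such results can be reduced to a short summary. The sketch below uses a hypothetical, trimmed-down `DemoResult` that mirrors only some of the fields above; `DemoSummary` and `summarize` are illustrative names, and counting skipped tests separately from failures is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, trimmed-down mirror of the TestResult fields listed above.
struct DemoResult {
  std::string testName;
  bool passed;
  bool skipped;
  int exitCode;
};

struct DemoSummary {
  std::size_t passed = 0;
  std::size_t failed = 0;
  std::size_t skipped = 0;
  std::vector<std::string> failedNames;
};

// Tallies results; a skipped test counts as neither passed nor failed.
DemoSummary summarize(const std::vector<DemoResult>& results) {
  DemoSummary s;
  for (const auto& r : results) {
    if (r.skipped) {
      ++s.skipped;
    } else if (r.passed) {
      ++s.passed;
    } else {
      ++s.failed;
      s.failedNames.push_back(r.testName);
    }
  }
  return s;
}
```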

API Reference

Macros (Recommended)

These macros provide the simplest way to use ProcessIsolatedTestRunner with minimal boilerplate.

`RUN_ISOLATED_TEST(test_name, test_body)`

RUN_ISOLATED_TEST("MySimpleTest", []() {
  EXPECT_TRUE(someFunction());
});

`RUN_ISOLATED_TEST_WITH_ENV(test_name, test_body, ...)`

Uses variadic macros (... and __VA_ARGS__) to automatically handle commas in initializer lists without requiring extra parentheses.

RUN_ISOLATED_TEST_WITH_ENV("MyEnvTest",
  []() {
    const char* value = getenv("MY_VAR");
    EXPECT_STREQ(value, "expected_value");
  },
  {{"MY_VAR", "expected_value"}}
);

// Multiple environment variables work naturally:
RUN_ISOLATED_TEST_WITH_ENV("MultiEnvTest",
  []() { /* test code */ },
  {{"VAR1", "val1"}, {"VAR2", "val2"}, {"VAR3", "val3"}}  // Commas handled automatically
);

Note: The macro uses __VA_ARGS__ internally, which automatically handles commas in the environment variable initializer list. Users don't need to worry about preprocessor comma issues.
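
To see why the variadic form matters, consider the following sketch. `DEMO_RUN_WITH_ENV` and `demoRunWithEnv` are hypothetical names, not the framework's macro; the example only illustrates how `...` and `__VA_ARGS__` let a braced initializer list pass through the preprocessor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

using EnvMap = std::unordered_map<std::string, std::string>;

// Hypothetical stand-in for the runner: calls the body and reports how many
// environment entries it received.
inline std::size_t demoRunWithEnv(const std::string& name,
                                  const std::function<void()>& body,
                                  const EnvMap& env) {
  (void)name;  // unused in this sketch
  body();
  return env.size();
}

// The preprocessor splits {{"A", "1"}, {"B", "2"}} at its commas, because
// braces (unlike parentheses) do not group macro arguments. Collecting the
// pieces with "..." and re-emitting them through __VA_ARGS__ restores the
// original initializer list.
#define DEMO_RUN_WITH_ENV(name, body, ...) \
  demoRunWithEnv(name, body, __VA_ARGS__)
```

The same splitting applies to the lambda argument: a comma that is not enclosed in parentheses inside the lambda (for example in a capture list such as `[a, b]`) would be treated as an argument separator in this simplified macro.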

`RUN_ISOLATED_TESTS(...)`

RUN_ISOLATED_TESTS(
  ProcessIsolatedTestRunner::TestConfig("Test1", []() { ... }),
  ProcessIsolatedTestRunner::TestConfig("Test2", []() { ... })
    .withEnvironment({{"VAR", "value"}}),
  ProcessIsolatedTestRunner::TestConfig("Test3", []() { ... })
    .withTimeout(std::chrono::seconds(60))
);

`RUN_ISOLATED_TESTS_WITH_OPTIONS(options, ...)`

ProcessIsolatedTestRunner::ExecutionOptions opts;
opts.stopOnFirstFailure = true;
opts.verboseLogging = true;

RUN_ISOLATED_TESTS_WITH_OPTIONS(opts,
  ProcessIsolatedTestRunner::TestConfig("Test1", []() { ... }),
  ProcessIsolatedTestRunner::TestConfig("Test2", []() { ... })
);

Main Methods (For Manual Use)

`registerTest()`

// Variant 1: Full configuration
static void registerTest(const TestConfig& config);

// Variant 2: Simple registration
static void registerTest(
    const std::string& name,
    std::function<void()> testLogic
);

// Variant 3: With environment
static void registerTest(
    const std::string& name,
    std::function<void()> testLogic,
    const std::unordered_map<std::string, std::string>& env
);

`executeAllTests()`

Execute all registered tests sequentially.

static bool executeAllTests(
    const ExecutionOptions& options = ExecutionOptions()
);

Returns: true if all tests passed, false if any failed.

Note: This method automatically clears all test registrations and results after execution, ensuring a clean state for the next test suite. Users do not need to call clear() manually.

`getTestResults()`

Retrieve detailed results from the last execution.

static std::vector<TestResult> getTestResults();

`clear()`

Clear all registered tests and results.

static void clear();

Note: Calling this method manually is typically not necessary, as executeAllTests() automatically clears registrations after execution. This method is primarily useful for advanced use cases or when tests are registered but not executed.

⚠️ Automatic Warning: If clear() is called when tests have been registered but not fully executed, it will automatically print a warning to stderr:

⚠️  WARNING: ProcessIsolatedTestRunner::clear() called with 2 unexecuted test(s)!
   Registered: 2 test(s)
   Executed:   0 test(s)
   Did you forget to call executeAllTests()?

`getTestCount()`

Get the number of currently registered tests (before execution).

static size_t getTestCount();

Use case: Verify that tests were actually registered and executed.

TEST(MyTest, VerifyExecution) {
  ProcessIsolatedTestRunner::clear();

  // Register tests
  ProcessIsolatedTestRunner::registerTest("Test1", []() { /* ... */ });
  ProcessIsolatedTestRunner::registerTest("Test2", []() { /* ... */ });

  // Check registration count
  size_t registeredCount = ProcessIsolatedTestRunner::getTestCount();
  EXPECT_EQ(registeredCount, 2) << "Expected 2 tests to be registered";

  // Execute
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);

  // Verify execution count
  auto results = ProcessIsolatedTestRunner::getTestResults();
  EXPECT_EQ(results.size(), registeredCount)
      << "Registered " << registeredCount << " tests but only "
      << results.size() << " executed";
}

TestConfig Methods

`withEnvironment()`

Set environment variables for the test.

TestConfig& withEnvironment(
    const std::unordered_map<std::string, std::string>& env
);

Note: Variables are set in the child process only.
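
The following standalone sketch shows what "child process only" means in practice. It is not the framework's code; `childSeesEnvironment` is a hypothetical helper that applies a map of variables after fork() and checks them inside the child, leaving the parent's environment untouched.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <unordered_map>

using EnvMap = std::unordered_map<std::string, std::string>;

// Returns true if a forked child, after applying `env`, sees every variable
// with the expected value. The parent's environment is never modified.
bool childSeesEnvironment(const EnvMap& env) {
  pid_t pid = fork();
  if (pid < 0) return false;
  if (pid == 0) {  // child
    for (const auto& kv : env) {
      setenv(kv.first.c_str(), kv.second.c_str(), 1);
    }
    for (const auto& kv : env) {
      const char* v = std::getenv(kv.first.c_str());
      if (v == nullptr || kv.second != v) _exit(1);
    }
    _exit(0);
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return false;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```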

`withTimeout()`

Set a timeout for test execution.

TestConfig& withTimeout(std::chrono::seconds timeoutSeconds);

Default: 30 seconds
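
This README does not describe how the timeout is enforced. One common approach, sketched below under that caveat, is for the parent to poll `waitpid()` with `WNOHANG` and send `SIGKILL` once the deadline passes. `Outcome` and `runWithTimeout` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

enum class Outcome { Passed, Failed, TimedOut };

// Runs `body` in a forked child and kills it if it exceeds `timeout`.
Outcome runWithTimeout(const std::function<void()>& body,
                       std::chrono::milliseconds timeout) {
  pid_t pid = fork();
  if (pid < 0) return Outcome::Failed;
  if (pid == 0) {  // child: run the body, then exit successfully
    body();
    _exit(0);
  }
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  int status = 0;
  for (;;) {
    pid_t done = waitpid(pid, &status, WNOHANG);  // non-blocking check
    if (done == pid) break;
    if (done < 0) return Outcome::Failed;
    if (std::chrono::steady_clock::now() >= deadline) {
      kill(pid, SIGKILL);        // deadline passed: terminate the child
      waitpid(pid, &status, 0);  // reap it so no zombie is left behind
      return Outcome::TimedOut;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? Outcome::Passed
                                                         : Outcome::Failed;
}
```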

`withCleanEnvironment()`

Control whether to inherit parent process environment.

TestConfig& withCleanEnvironment(bool inherit = true);

Default: true (inherits parent environment)

Examples

Note: The examples below use helper functions from RcclWrapTests.cpp:

// Helper to create a mock NCCL communicator with specified architecture and ranks
static void CreateMockComm(ncclComm_t &mockComm,
                           struct ncclTopoSystem &mockTopo,
                           struct ncclTopoNode &mockGpuNode,
                           const char *arch,
                           int nRanks);

// Helper to cleanup a mock communicator
static void CleanupMockComm(ncclComm_t &mockComm);

Example 1: Testing Environment Variable Behavior

TEST(Rcclwrap, EnvironmentVariableTests) {
  // Test 1: With environment variable set
  ProcessIsolatedTestRunner::registerTest(
      ProcessIsolatedTestRunner::TestConfig(
          "WithEnvVarSet",
          []() {
            ncclComm_t mockComm = nullptr;
            struct ncclTopoSystem mockTopo;
            struct ncclTopoNode mockGpuNode;
            CreateMockComm(mockComm, mockTopo, mockGpuNode, "gfx942", 128);

            int chunkSize = RCCL_VALUE_UNSET;
            rcclSetP2pNetChunkSize(mockComm, chunkSize);

            // Should use default architecture-based value
            EXPECT_EQ(chunkSize, 1 << 19);

            CleanupMockComm(mockComm);
          })
          .withEnvironment({{"NCCL_P2P_NET_CHUNKSIZE", "999999"}})
  );

  // Test 2: Without environment variable (clean state)
  ProcessIsolatedTestRunner::registerTest(
      ProcessIsolatedTestRunner::TestConfig(
          "WithoutEnvVar",
          []() {
            // Verify environment is clean
            const char* value = getenv("NCCL_P2P_NET_CHUNKSIZE");
            EXPECT_EQ(value, nullptr);

            // Test default behavior
            ncclComm_t mockComm = nullptr;
            struct ncclTopoSystem mockTopo;
            struct ncclTopoNode mockGpuNode;
            CreateMockComm(mockComm, mockTopo, mockGpuNode, "gfx942", 32);

            int chunkSize = RCCL_VALUE_UNSET;
            rcclSetP2pNetChunkSize(mockComm, chunkSize);
            EXPECT_EQ(chunkSize, 1 << 17);  // Default for < 64 ranks

            CleanupMockComm(mockComm);
          })
  );

  // Execute both tests in isolated processes
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);
}

Example 2: Testing Multiple Architectures

TEST(Rcclwrap, ArchitectureTests) {
  struct TestCase {
    std::string name;
    std::string arch;
    int ranks;
    int expectedChunkSize;
  };

  std::vector<TestCase> testCases = {
    {"GFX942_SmallRanks", "gfx942", 32, 1 << 17},
    {"GFX942_LargeRanks", "gfx942", 128, 1 << 19},
    {"GFX950_SmallRanks", "gfx950", 8, 1 << 17},
    {"GFX950_MediumRanks", "gfx950", 24, 1 << 18},
    {"GFX950_LargeRanks", "gfx950", 64, 1 << 19},
  };

  for (const auto& tc : testCases) {
    ProcessIsolatedTestRunner::registerTest(
        ProcessIsolatedTestRunner::TestConfig(
            tc.name,
            [tc]() {
              ncclComm_t mockComm = nullptr;
              struct ncclTopoSystem mockTopo;
              struct ncclTopoNode mockGpuNode;
              CreateMockComm(mockComm, mockTopo, mockGpuNode, tc.arch.c_str(), tc.ranks);

              int chunkSize = RCCL_VALUE_UNSET;
              rcclSetP2pNetChunkSize(mockComm, chunkSize);

              EXPECT_EQ(chunkSize, tc.expectedChunkSize)
                  << "Failed for " << tc.arch << " with " << tc.ranks << " ranks";

              CleanupMockComm(mockComm);
            })
    );
  }

  ProcessIsolatedTestRunner::ExecutionOptions options;
  options.verboseLogging = true;
  options.stopOnFirstFailure = false;  // Run all tests even if one fails

  bool passed = ProcessIsolatedTestRunner::executeAllTests(options);
  EXPECT_TRUE(passed);
}

Example 3: Testing with Timeouts

TEST(Rcclwrap, TimeoutHandling) {
  // Test that completes quickly
  ProcessIsolatedTestRunner::registerTest(
      ProcessIsolatedTestRunner::TestConfig(
          "FastTest",
          []() {
            EXPECT_TRUE(true);
          })
          .withTimeout(std::chrono::seconds(5))
  );

  // Test with longer timeout for complex operations
  ProcessIsolatedTestRunner::registerTest(
      ProcessIsolatedTestRunner::TestConfig(
          "SlowTest",
          []() {
            // Simulate slow operation
            std::this_thread::sleep_for(std::chrono::seconds(2));
            EXPECT_TRUE(true);
          })
          .withTimeout(std::chrono::seconds(10))
  );

  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);
}

Example 4: Stop on First Failure

TEST(Rcclwrap, CriticalTests) {
  // Register multiple critical tests
  ProcessIsolatedTestRunner::registerTest(
      "CriticalTest1", []() { EXPECT_TRUE(checkCriticalCondition1()); });

  ProcessIsolatedTestRunner::registerTest(
      "CriticalTest2", []() { EXPECT_TRUE(checkCriticalCondition2()); });

  ProcessIsolatedTestRunner::registerTest(
      "CriticalTest3", []() { EXPECT_TRUE(checkCriticalCondition3()); });

  // Stop on first failure - don't waste time if critical tests fail
  ProcessIsolatedTestRunner::ExecutionOptions options;
  options.stopOnFirstFailure = true;

  bool passed = ProcessIsolatedTestRunner::executeAllTests(options);
  EXPECT_TRUE(passed) << "Critical test suite failed";
}

Best Practices

1. Use Macros for Simple Cases

// ✅ GOOD: Simple and clean using macros
TEST(MyTest, SimpleIsolatedTest) {
  RUN_ISOLATED_TEST("CheckSomething", []() {
    EXPECT_TRUE(checkSomething());
  });
}

// ❌ MORE VERBOSE: Manual registration (still valid for complex cases)
TEST(MyTest, SimpleIsolatedTest) {
  ProcessIsolatedTestRunner::registerTest("CheckSomething", []() {
    EXPECT_TRUE(checkSomething());
  });
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);
}

2. Always Execute Registered Tests (When Using Manual API)

TEST(MyTest, IsolatedTests) {
  // Register tests
  ProcessIsolatedTestRunner::registerTest(/* ... */);

  // ✅ IMPORTANT: Don't forget to execute!
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);
}

When Using Manual API (Optional Verification):

You can verify that tests were registered and executed:

TEST(MyTest, IsolatedTests) {
  // Register tests
  ProcessIsolatedTestRunner::registerTest("Test1", []() { /* ... */ });
  ProcessIsolatedTestRunner::registerTest("Test2", []() { /* ... */ });

  // Get count of registered tests
  size_t registeredCount = ProcessIsolatedTestRunner::getTestCount();
  EXPECT_EQ(registeredCount, 2) << "Expected 2 tests to be registered";

  // Execute all tests (automatically clears after execution)
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);

  // Optional: Verify execution count matches registration count
  auto results = ProcessIsolatedTestRunner::getTestResults();
  EXPECT_EQ(results.size(), registeredCount)
      << "Registered " << registeredCount << " but executed " << results.size();
}

3. Use Descriptive Test Names

// ❌ BAD: Vague name
RUN_ISOLATED_TEST("Test1", []() { /* ... */ });

// ✅ GOOD: Descriptive name
RUN_ISOLATED_TEST("GFX942_LargeRanks_P2PChunkSize_ExpectHighValue",
  []() { /* ... */ }
);

4. Group Related Tests

TEST(Rcclwrap, AllP2PChunkSizeTests) {
  // Using macros to group related tests
  RUN_ISOLATED_TESTS(
    ProcessIsolatedTestRunner::TestConfig("GFX942_Test1", []() { ... }),
    ProcessIsolatedTestRunner::TestConfig("GFX942_Test2", []() { ... }),
    ProcessIsolatedTestRunner::TestConfig("GFX950_Test1", []() { ... }),
    ProcessIsolatedTestRunner::TestConfig("GFX950_Test2", []() { ... })
  );
}

5. Use Options for Better Control

// For debugging: verbose + stop on failure
ProcessIsolatedTestRunner::ExecutionOptions debugOptions;
debugOptions.stopOnFirstFailure = true;
debugOptions.verboseLogging = true;

RUN_ISOLATED_TESTS_WITH_OPTIONS(debugOptions,
  ProcessIsolatedTestRunner::TestConfig("Test1", []() { ... }),
  ProcessIsolatedTestRunner::TestConfig("Test2", []() { ... })
);

// For CI: run all tests, collect all failures
ProcessIsolatedTestRunner::ExecutionOptions ciOptions;
ciOptions.stopOnFirstFailure = false;
ciOptions.verboseLogging = false;

RUN_ISOLATED_TESTS_WITH_OPTIONS(ciOptions,
  ProcessIsolatedTestRunner::TestConfig("Test1", []() { ... }),
  ProcessIsolatedTestRunner::TestConfig("Test2", []() { ... })
);

6. Set Appropriate Timeouts

// ✅ GOOD: Different timeouts for different test types
RUN_ISOLATED_TESTS(
  ProcessIsolatedTestRunner::TestConfig("QuickTest", []() { ... })
    .withTimeout(std::chrono::seconds(5)),
  ProcessIsolatedTestRunner::TestConfig("NormalTest", []() { ... })
    .withTimeout(std::chrono::seconds(30)),
  ProcessIsolatedTestRunner::TestConfig("SlowTest", []() { ... })
    .withTimeout(std::chrono::seconds(120))
);

// ❌ BAD: Same long timeout for everything
RUN_ISOLATED_TESTS(
  ProcessIsolatedTestRunner::TestConfig("Test1", []() { ... })
    .withTimeout(std::chrono::seconds(300)),
  ProcessIsolatedTestRunner::TestConfig("Test2", []() { ... })
    .withTimeout(std::chrono::seconds(300))
);

7. Clean Up Resources in Tests

RUN_ISOLATED_TEST("ResourceTest", []() {
  ncclComm_t comm = nullptr;
  struct ncclTopoSystem topo;
  struct ncclTopoNode gpuNode;
  CreateMockComm(comm, topo, gpuNode, "gfx942", 32);

  try {
    // Your test logic
    EXPECT_TRUE(someTest(comm));

    // ✅ GOOD: Clean up in all paths
    CleanupMockComm(comm);
  } catch (...) {
    CleanupMockComm(comm);
    throw;
  }
});

8. Use RAII for GPU Resource Management

When tests allocate GPU memory, use RAII wrappers to ensure cleanup:

// ✅ GOOD: RAII ensures cleanup even on failure
struct GPUBuffer {
  void* ptr = nullptr;
  size_t size;

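  explicit GPUBuffer(size_t s) : size(s) {
    hipError_t err = hipMalloc(&ptr, size);
    // Use EXPECT_*, not ASSERT_*: fatal assertions expand to a return
    // statement and do not compile inside a constructor.
    EXPECT_EQ(err, hipSuccess);
    if (err != hipSuccess) ptr = nullptr;
  }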

  ~GPUBuffer() {
    if (ptr) {
      hipFree(ptr);
      ptr = nullptr;
    }
  }

  // Prevent copying
  GPUBuffer(const GPUBuffer&) = delete;
  GPUBuffer& operator=(const GPUBuffer&) = delete;
};

RUN_ISOLATED_TEST("GPUTest", []() {
  GPUBuffer buffer(1024);  // Automatically cleaned up
  // ... test logic ...
  // No manual cleanup needed - destructor handles it
});

// ❌ BAD: Manual cleanup can be forgotten
RUN_ISOLATED_TEST("GPUTest", []() {
  void* buffer;
  hipMalloc(&buffer, 1024);
  // ... test logic ...
  // If test fails before this line, buffer leaks!
  hipFree(buffer);
});

9. Avoid GPU Initialization in Test Fixtures

When using process isolation, avoid initializing GPU resources in test fixture SetUp() methods:

// ❌ BAD: GPU initialization in fixture (runs in parent process)
class GPUTests : public ::testing::Test {
protected:
  void SetUp() override {
    hipMalloc(&gpuBuffer, 1024);  // Parent process - will pollute fork()!
  }
  void* gpuBuffer;
};

// ✅ GOOD: GPU initialization inside isolated test
class GPUTests : public ::testing::Test {
  // Empty fixture or only CPU resources in SetUp()
};

TEST_F(GPUTests, MyTest) {
  RUN_ISOLATED_TEST("GPUOperation", []() {
    void* gpuBuffer;
    hipMalloc(&gpuBuffer, 1024);  // Child process only - safe!
    // ... test logic ...
    hipFree(gpuBuffer);
  });
}

// ✅ EVEN BETTER: Use RAII + helper structure
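struct GPUTestEnvironment {
  void* buffer = nullptr;
  void setup() { hipMalloc(&buffer, 1024); }
  void cleanup() {
    if (buffer) {
      hipFree(buffer);
      buffer = nullptr;  // A second cleanup() call (e.g. from the destructor) becomes a no-op
    }
  }
  ~GPUTestEnvironment() { cleanup(); }
};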

TEST_F(GPUTests, MyTest) {
  RUN_ISOLATED_TEST("GPUOperation", []() {
    GPUTestEnvironment env;
    env.setup();
    // ... test logic ...
    env.cleanup();  // Explicit + destructor cleanup
  });
}

Troubleshooting

Test Hangs / Times Out

Symptom: Test never completes, eventually times out.

Solutions:

Increase timeout: .withTimeout(std::chrono::seconds(120))
Check for deadlocks in test logic
Enable verbose logging to see where it hangs:
```
options.verboseLogging = true;
```

Environment Variables Not Being Set

Symptom: getenv() returns nullptr in test.

Solutions:

Verify environment variable name is correct

Check that you're calling withEnvironment():

config.withEnvironment({{"VAR_NAME", "value"}})

Verify the test is actually executing (check test name)

Tests Pass Individually but Fail Together

Symptom: Individual tests pass, but fail when run in a suite.

Cause: This is the exact problem that ProcessIsolatedTestRunner solves!

Solution: Already solved, since each test runs in an isolated process. If you're still seeing this, check:

Are you using executeAllTests() correctly?
Are there shared external resources (files, network, etc.)?

Fork Failures

Symptom: Error messages about fork() failing.

Solutions:

Check system resource limits: ulimit -u (max processes)
Reduce number of tests or run in smaller batches
Check for resource leaks in parent process
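
The process limit can also be read programmatically. The sketch below assumes Linux, where `RLIMIT_NPROC` is available; `readProcessLimit` is a hypothetical helper name.

```cpp
#include <cassert>
#include <sys/resource.h>

// Reads the soft limit on the number of processes for the current user,
// the same value that `ulimit -u` reports. Returns false if the query fails.
bool readProcessLimit(rlim_t& softLimit) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NPROC, &lim) != 0) return false;
  softLimit = lim.rlim_cur;  // equals RLIM_INFINITY when unlimited
  return true;
}
```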

Test Results Not Available

Symptom: getTestResults() returns empty vector.

Solution:

// Call executeAllTests() first
ProcessIsolatedTestRunner::executeAllTests();

// Then get results
auto results = ProcessIsolatedTestRunner::getTestResults();

Tests Registered but Never Executed

Symptom: Tests pass but you suspect they didn't actually run.

Cause: Forgot to call executeAllTests() after registration.

Detection:

TEST(MyTest, IsolatedTests) {
  // Register tests
  ProcessIsolatedTestRunner::registerTest("Test1", []() { EXPECT_TRUE(true); });
  ProcessIsolatedTestRunner::registerTest("Test2", []() { EXPECT_TRUE(true); });

  // ❌ FORGOT TO CALL executeAllTests()!

  // Later, when the test ends, registered tests are lost
}

Solution:

TEST(MyTest, IsolatedTests) {
  // Register tests
  ProcessIsolatedTestRunner::registerTest("Test1", []() { EXPECT_TRUE(true); });
  ProcessIsolatedTestRunner::registerTest("Test2", []() { EXPECT_TRUE(true); });

  // ✅ ALWAYS execute registered tests
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);

  // ✅ Optionally verify execution count
  auto results = ProcessIsolatedTestRunner::getTestResults();
  EXPECT_EQ(results.size(), 2) << "Expected 2 tests to execute";
}

Prevention: Always verify that getTestResults().size() matches your expected number of tests:

// After execution
auto results = ProcessIsolatedTestRunner::getTestResults();
EXPECT_EQ(results.size(), expectedTestCount)
    << "Test count mismatch - some tests may not have executed";

Implementation Details

How It Works

Registration Phase:
- Tests are registered into a static vector
- Each test gets a TestConfig with name, logic, and environment
Execution Phase:
- Parent process iterates through registered tests
- For each test:
  - fork() creates a child process
  - Child applies environment variables
  - Child executes test logic
  - Parent waits for child to complete
  - Result is collected and stored
Result Collection:
- Exit codes are captured from child processes
- Timing information is recorded
- All results stored in static vector
Automatic Cleanup:
- After execution completes, executeAllTests() automatically clears all test registrations and results
- This ensures a clean state for the next test suite without manual intervention
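
The execution phase described above can be sketched as a simple loop. This is an illustration under stated assumptions rather than the framework's code: `DemoRun` and `runSequentially` are hypothetical names, test bodies return `bool` instead of using Google Test assertions, and timeouts and environment handling are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

struct DemoRun {
  std::size_t executed = 0;
  std::size_t failed = 0;
  bool allPassed() const { return failed == 0; }
};

// Runs each body in its own forked child, one at a time. A body returning
// false, a child that dies abnormally, or a failed fork() counts as a failure.
DemoRun runSequentially(const std::vector<std::function<bool()>>& bodies,
                        bool stopOnFirstFailure) {
  DemoRun run;
  for (const auto& body : bodies) {
    pid_t pid = fork();
    if (pid == 0) {
      _exit(body() ? 0 : 1);  // child: run the test, report via exit code
    }
    ++run.executed;
    bool ok = false;
    if (pid > 0) {  // parent: wait for the child and inspect its status
      int status = 0;
      ok = waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
           WEXITSTATUS(status) == 0;
    }
    if (!ok) {
      ++run.failed;
      if (stopOnFirstFailure) break;
    }
  }
  return run;
}
```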

Exit Codes

enum RcclTestCode {
  RCCL_TEST_SUCCESS = 0,           // Test passed
  RCCL_TEST_FAILURE = 1,           // Test failed (assertion)
  RCCL_TEST_UNKNOWN_EXCEPTION = 2, // Uncaught exception
  RCCL_TEST_TIMEOUT = 3,           // Test timed out
  RCCL_TEST_SKIPPED = 4            // Test was skipped
};
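
A parent process typically recovers such a code from the status filled in by `waitpid()`. The sketch below copies the enum above and adds two hypothetical helpers, `statusAfter` and `describeStatus`; how the framework itself reports signals or unexpected codes is not covered by this README, so those branches are assumptions.

```cpp
#include <cassert>
#include <functional>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Exit codes copied from the enum above.
enum RcclTestCode {
  RCCL_TEST_SUCCESS = 0,
  RCCL_TEST_FAILURE = 1,
  RCCL_TEST_UNKNOWN_EXCEPTION = 2,
  RCCL_TEST_TIMEOUT = 3,
  RCCL_TEST_SKIPPED = 4
};

// Hypothetical helper: runs `body` in a child and returns the waitpid() status.
int statusAfter(const std::function<void()>& body) {
  pid_t pid = fork();
  if (pid == 0) {
    body();
    _exit(RCCL_TEST_SUCCESS);
  }
  int status = -1;
  if (pid > 0) waitpid(pid, &status, 0);
  return status;
}

// Hypothetical helper: turns a waitpid() status into a readable label.
std::string describeStatus(int status) {
  if (WIFSIGNALED(status)) {
    return "killed by signal " + std::to_string(WTERMSIG(status));
  }
  if (!WIFEXITED(status)) return "unknown";
  switch (WEXITSTATUS(status)) {
    case RCCL_TEST_SUCCESS: return "passed";
    case RCCL_TEST_FAILURE: return "failed";
    case RCCL_TEST_UNKNOWN_EXCEPTION: return "uncaught exception";
    case RCCL_TEST_TIMEOUT: return "timed out";
    case RCCL_TEST_SKIPPED: return "skipped";
    default:
      return "unexpected exit code " + std::to_string(WEXITSTATUS(status));
  }
}
```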

Thread Safety

The framework uses mutexes for thread-safe operations:

Test registration (write)
Result recording (write)
Result retrieval (read)
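
A mutex-guarded registry along these lines might look as follows. `DemoRegistry` is a hypothetical class written for illustration; the real framework's data members and locking granularity may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class DemoRegistry {
 public:
  using Entry = std::pair<std::string, std::function<void()>>;

  // Write path: adds a test while holding the lock.
  void registerTest(std::string name, std::function<void()> body) {
    std::lock_guard<std::mutex> lock(mutex_);
    tests_.emplace_back(std::move(name), std::move(body));
  }

  // Read path: takes the same lock so it never observes a partial update.
  std::size_t count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return tests_.size();
  }

  // Hands the registered tests to the caller and leaves the registry empty.
  std::vector<Entry> takeAll() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<Entry> out;
    out.swap(tests_);
    return out;
  }

 private:
  mutable std::mutex mutex_;  // mutable so that const readers can lock it
  std::vector<Entry> tests_;
};
```

Returning a copy from `takeAll()` rather than a reference means callers can iterate over the tests without holding the lock.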

Limitations

Process Overhead: Each test creates a new process (fork overhead)
Sequential Execution: Tests run one at a time (not parallel)
Linux/Unix Only: Uses fork() - not available on Windows
Memory Duplication: Each forked process duplicates memory
No Shared State: Tests cannot share data between processes

FAQ

Q: When should I use ProcessIsolatedTestRunner vs regular Google Test?

A: Use ProcessIsolatedTestRunner when:

Testing code with static variables
Testing environment variable behavior
Testing one-time initialization
Need guaranteed clean state between tests

Use regular Google Test when:

Tests are truly independent
No static state concerns
Need parallel execution
Testing simple units

Q: Can I use this with MPI tests?

A: Not directly. The Process Isolated Test Runner is for single-process tests. For MPI tests, use the MPI Test Runner instead. The Process Isolated Test Runner is currently hooked into the rccl-UnitTestsFixtures binary, and the MPI Test Runner is hooked into the rccl-UnitTestsMPI binary. These are two independent implementations.

Q: How do I debug a test that's running in an isolated process?

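A: A few approaches help:

Enable verbose logging (options.verboseLogging = true)
Add print statements in your test lambda
Temporarily run the test logic outside the framework
Use GDB and have it follow the forked child (set follow-fork-mode child)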

Q: Can I run tests in parallel?

A: No, the current implementation only supports sequential execution.

Q: Does this work with CTest/CMake?

A: Yes! The tests are still Google Test cases, so they work with standard test runners.

Q: Should I use the macros or the manual API?

A: Use the macros (RUN_ISOLATED_TEST, RUN_ISOLATED_TESTS, etc.) for most cases - they're simpler and less error-prone. Use the manual API (registerTest() + executeAllTests()) only when you need more control over the registration/execution flow, such as:

Dynamically generating test configurations at runtime
Sharing test registration logic across multiple TEST blocks
Advanced control flow scenarios

Q: Do tests run automatically after registration, or do I need to call executeAllTests()?

A: You MUST call executeAllTests() explicitly. Tests do NOT run automatically. If you forget to call it, your tests will be silently ignored. Always follow this pattern:

TEST(MyTest, IsolatedTests) {
  ProcessIsolatedTestRunner::registerTest("MyTest", []() { /* ... */ });

  // ✅ REQUIRED: Execute the tests
  bool passed = ProcessIsolatedTestRunner::executeAllTests();
  EXPECT_TRUE(passed);
}

Q: How can I detect if I forgot to execute registered tests?

A: After executeAllTests(), verify that getTestResults().size() matches your expected test count:

// Register N tests
ProcessIsolatedTestRunner::registerTest("Test1", []() { /* ... */ });
ProcessIsolatedTestRunner::registerTest("Test2", []() { /* ... */ });

// Execute
bool passed = ProcessIsolatedTestRunner::executeAllTests();

// Verify count
auto results = ProcessIsolatedTestRunner::getTestResults();
EXPECT_EQ(results.size(), 2) << "Expected 2 tests to run";

Q: Do I need to call clear() manually?

A: No. The clear() method is only useful for advanced use cases where you need to clear tests that were registered but never executed. If you manually call clear() when tests were registered but not executed, it will warn you:

⚠️  WARNING: ProcessIsolatedTestRunner::clear() called with 2 unexecuted test(s)!
   Registered: 2 test(s)
   Executed:   0 test(s)
   Did you forget to call executeAllTests()?
