Added Functional Tests for CSV Tuner Plugin (#1968)

* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Updated readme

* Updated readme

* Updated readme

[ROCm/rccl commit: c8da880dc7]
Этот коммит содержится в:
Kapil S. Pawar
2025-11-11 10:11:19 -06:00
коммит произвёл GitHub
родитель 0d3fba9a22
Коммит c4d7680749
17 изменённых файлов: 2824 добавлений и 0 удалений
+20
Просмотреть файл
@@ -0,0 +1,20 @@
# Ignore Python cache and virtual environment folders
__pycache__/
*.pyc
*.pyo
*.pyd
# Ignore pytest cache
.pytest_cache/
.cache/
# Ignore log folders
logs/
log/
*.log
# Ignore virtual environment folders
venv/
# Ignore build artifacts
build/
+159
Просмотреть файл
@@ -0,0 +1,159 @@
# RCCL CSV Tuner Plugin Tests
## Description
This directory contains automated tests for the RCCL (ROCm Communication Collectives Library) CSV Tuner Plugin. The test suite validates the functionality of the CSV-based tuning plugin across different collective operations (AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter) and various configuration scenarios.
The tests are written in Python using the pytest framework, making it easy to run, maintain, and extend the test coverage.
## Directory Structure
```
ext-plugins/
├── README.md # This file - documentation for the test suite
├── requirements.txt # Python dependencies required for testing
├── pytest.ini # Pytest configuration and test markers
├── .gitignore # Git ignore rules for Python/pytest artifacts
├── venv/ # Python virtual environment (created after setup)
├── logs/ # Test execution logs and output files
├── assets/ # Test configuration files and assets
│ └── csv_confs/ # CSV configuration files for testing
│ ├── incorrect_values_config.conf
│ ├── multinode_config.conf
│ ├── no_matching_config.conf
│ ├── singlenode_config.conf
│ ├── unsupported_algo_proto_config.conf
│ ├── valid_config_with_wildcards.conf
│ └── valid_config_without_wildcards.conf
└── tests/ # Test suite directory
├── conftest.py # Pytest fixtures and shared test configuration
└── ext-tuner/ # CSV Tuner Plugin specific tests
├── test_allreduce.py
├── test_broadcast.py
├── test_reduce.py
├── test_allgather.py
└── test_reducescatter.py
```
## Installation & Setup
### Prerequisites
- Python 3.6 or higher
- RCCL library installed
- ROCm environment configured
- **Important**: Update the installation paths in `tests/conftest.py` to match your environment:
```python
RCCL_INSTALL_DIR = "path/to/rccl"
OMPI_INSTALL_DIR = "path/to/ompi"
RCCL_TESTS_DIR = "path/to/rccl-tests"
```
Replace these placeholder paths with your actual installation directories before running the tests.
### Building the RCCL CSV Tuner Plugin
Before running the tests, you need to build the RCCL CSV tuner plugin library. The plugin is located in the `ext-tuner/example` directory.
#### Step 1: Navigate to the plugin directory
```bash
cd rccl/ext-tuner/example
```
#### Step 2: Build the plugin
```bash
make
```
This will compile the plugin and create `libnccl-tuner-example.so` in the same directory.
### Step 1: Create Virtual Environment
Create a Python virtual environment to isolate the test dependencies:
```bash
python3 -m venv venv
```
### Step 2: Activate Virtual Environment
Activate the virtual environment using the appropriate command for your shell:
**On Linux:**
```bash
source venv/bin/activate
```
Once activated, you should see `(venv)` at the beginning of your command prompt.
### Step 3: Install Dependencies
Install the required Python packages:
```bash
pip install -r requirements.txt
```
## Running Tests
### Run All Tests
To run the entire test suite:
```bash
pytest --cache-clear
```
### Run Tests with Verbose Output
For more detailed test output:
```bash
pytest -v --cache-clear
```
### Run Tests by Marker
Tests are organized using pytest markers. You can run specific groups of tests:
**Run CSV Plugin tests:**
```bash
pytest -m mark.ext_tuner --cache-clear
```
**Run tests for specific collective operations:**
```bash
pytest -m allreduce --cache-clear # AllReduce tests
pytest -m broadcast --cache-clear # Broadcast tests
pytest -m reduce --cache-clear # Reduce tests
pytest -m allgather --cache-clear # AllGather tests
pytest -m reducescatter --cache-clear # ReduceScatter tests
```
### Run Tests with Log Output
To see live log output during test execution:
```bash
pytest -v -s --cache-clear
```
### Generate Test Report
To generate a detailed test report:
```bash
pytest --verbose --tb=short
```
## Additional Notes
- **Deactivating Virtual Environment**: When you're done testing, deactivate the virtual environment by running:
```bash
deactivate
```
- **Log Files**: Test execution logs are stored in the `logs/` directory for later review and debugging.
+61
Просмотреть файл
@@ -0,0 +1,61 @@
# Test configuration with invalid/incorrect values to test plugin robustness - 8B-128M
# Invalid collective type (should default to allreduce)
wrongtype,8,65536,ring,simple,-1,-1,-1,-1,-1
# Invalid algorithm (should default to ring)
allreduce,65537,16777216,badring,simple,-1,-1,-1,-1,-1
# Invalid protocol (should default to simple)
allreduce,16777217,67108864,tree,badproto,-1,-1,-1,-1,-1
# Invalid numbers (letters instead of numbers)
allreduce,abc,def,ring,simple,xyz,-1,-1,-1,-1
# Wrong field count (should be ignored completely)
allreduce,8,128M,ring
# Mixed valid and invalid
allreduce,67108865,134217728,ring,simple,2,-1,-1,-1,-1
# Broadcast configurations with invalid/incorrect values - 8B-128M
# Invalid algorithm (should default to ring for broadcast)
broadcast,8,65536,badring,simple,-1,-1,-1,-1,-1
# Invalid protocol (should default to simple)
broadcast,65537,16777216,ring,badproto,-1,-1,-1,-1,-1
# Invalid numbers (letters instead of numbers)
broadcast,abc,def,ring,simple,xyz,-1,-1,-1,-1
# Wrong field count (should be ignored completely)
broadcast,8,128M,ring
# Mixed valid and invalid
broadcast,16777217,134217728,ring,simple,4,-1,-1,-1,-1
# AllGather configurations with invalid/incorrect values - 8B-128M
# Invalid algorithm (should default to ring for allgather)
allgather,8,65536,badring,simple,-1,-1,-1,-1,-1
# Invalid protocol (should default to simple)
allgather,65537,16777216,ring,badproto,-1,-1,-1,-1,-1
# Invalid numbers (letters instead of numbers)
allgather,abc,def,ring,simple,xyz,-1,-1,-1,-1
# Wrong field count (should be ignored completely)
allgather,8,128M,ring
# Mixed valid and invalid
allgather,16777217,134217728,ring,simple,4,-1,-1,-1,-1
# Reduce configurations with invalid/incorrect values - 8B-128M
# Invalid algorithm (should default to ring for reduce)
reduce,8,65536,badring,simple,-1,-1,-1,-1,-1
# Invalid protocol (should default to simple)
reduce,65537,16777216,ring,badproto,-1,-1,-1,-1,-1
# Invalid numbers (letters instead of numbers)
reduce,abc,def,ring,simple,xyz,-1,-1,-1,-1
# Wrong field count (should be ignored completely)
reduce,8,128M,ring
# Mixed valid and invalid
reduce,16777217,134217728,ring,simple,4,-1,-1,-1,-1
# ReduceScatter configurations with invalid/incorrect values - 8B-128M
# Invalid algorithm (should default to ring for reducescatter)
reducescatter,8,65536,badring,simple,-1,-1,-1,-1,-1
# Invalid protocol (should default to simple)
reducescatter,65537,16777216,ring,badproto,-1,-1,-1,-1,-1
# Invalid numbers (letters instead of numbers)
reducescatter,abc,def,ring,simple,xyz,-1,-1,-1,-1
# Wrong field count (should be ignored completely)
reducescatter,8,128M,ring
# Mixed valid and invalid
reducescatter,16777217,134217728,ring,simple,4,-1,-1,-1,-1
+104
Просмотреть файл
@@ -0,0 +1,104 @@
# AllReduce configurations for multi-node setups - 8B-128M
# 2 nodes, 16 ranks total - 8B-128M
allreduce,8,65536,tree,ll,4,2,16,-1,-1 # Small: tree (8B to 64K)
allreduce,65537,16777216,ring,ll128,6,2,16,-1,-1 # Medium: ring (64K to 16M)
allreduce,16777217,134217728,ring,simple,8,2,16,-1,-1 # Large: ring/simple (16M to 128M)
# 3 nodes, 24 ranks total - 8B-128M
allreduce,8,65536,tree,ll,4,2,24,-1,-1 # Small: tree (8B to 64K)
allreduce,65537,16777216,ring,ll128,6,2,24,-1,-1 # Medium: ring (64K to 16M)
allreduce,16777217,134217728,ring,simple,8,2,24,-1,-1 # Large: ring/simple (16M to 128M)
# 4 nodes, 32 ranks total - 8B-128M
allreduce,8,65536,tree,ll,4,2,32,-1,-1 # Small: tree (8B to 64K)
allreduce,65537,16777216,ring,ll128,6,2,32,-1,-1 # Medium: ring (64K to 16M)
allreduce,16777217,134217728,ring,simple,8,2,32,-1,-1 # Large: ring/simple (16M to 128M)
# 5 nodes, 40 ranks total - 8B-128M
allreduce,8,65536,tree,ll,4,2,40,-1,-1 # Small: tree (8B to 64K)
allreduce,65537,16777216,ring,ll128,6,2,40,-1,-1 # Medium: ring (64K to 16M)
allreduce,16777217,134217728,ring,simple,8,2,40,-1,-1 # Large: ring/simple (16M to 128M)
# Broadcast configurations for multi-node setups - 8B-128M
# 2 nodes, 16 ranks total
broadcast,8,65536,ring,ll,4,2,16,-1,-1 # Small: ring (8B to 64K)
broadcast,65537,16777216,ring,ll128,6,2,16,-1,-1 # Medium: ring (64K to 16M)
broadcast,16777217,134217728,ring,simple,8,2,16,-1,-1 # Large: ring/simple (16M to 128M)
# 3 nodes, 24 ranks total
broadcast,8,65536,ring,ll,4,2,24,-1,-1 # Small: ring (8B to 64K)
broadcast,65537,16777216,ring,ll128,6,2,24,-1,-1 # Medium: ring (64K to 16M)
broadcast,16777217,134217728,ring,simple,8,2,24,-1,-1 # Large: ring/simple (16M to 128M)
# 4 nodes, 32 ranks total
broadcast,8,65536,ring,ll,4,2,32,-1,-1 # Small: ring (8B to 64K)
broadcast,65537,16777216,ring,ll128,6,2,32,-1,-1 # Medium: ring (64K to 16M)
broadcast,16777217,134217728,ring,simple,8,2,32,-1,-1 # Large: ring/simple (16M to 128M)
# 5 nodes, 40 ranks total
broadcast,8,65536,ring,ll,4,2,40,-1,-1 # Small: ring (8B to 64K)
broadcast,65537,16777216,ring,ll128,6,2,40,-1,-1 # Medium: ring (64K to 16M)
broadcast,16777217,134217728,ring,simple,8,2,40,-1,-1 # Large: ring/simple (16M to 128M)
# AllGather configurations for multi-node setups - 8B-128M
# 2 nodes, 16 ranks total
allgather,8,65536,ring,ll,4,2,16,-1,-1 # Small: ring (8B to 64K)
allgather,65537,16777216,ring,ll128,6,2,16,-1,-1 # Medium: ring (64K to 16M)
allgather,16777217,134217728,ring,simple,8,2,16,-1,-1 # Large: ring/simple (16M to 128M)
# 3 nodes, 24 ranks total
allgather,8,65536,ring,ll,4,2,24,-1,-1 # Small: ring (8B to 64K)
allgather,65537,16777216,ring,ll128,6,2,24,-1,-1 # Medium: ring (64K to 16M)
allgather,16777217,134217728,ring,simple,8,2,24,-1,-1 # Large: ring/simple (16M to 128M)
# 4 nodes, 32 ranks total
allgather,8,65536,ring,ll,4,2,32,-1,-1 # Small: ring (8B to 64K)
allgather,65537,16777216,ring,ll128,6,2,32,-1,-1 # Medium: ring (64K to 16M)
allgather,16777217,134217728,ring,simple,8,2,32,-1,-1 # Large: ring/simple (16M to 128M)
# 5 nodes, 40 ranks total
allgather,8,65536,ring,ll,4,2,40,-1,-1 # Small: ring (8B to 64K)
allgather,65537,16777216,ring,ll128,6,2,40,-1,-1 # Medium: ring (64K to 16M)
allgather,16777217,134217728,ring,simple,8,2,40,-1,-1 # Large: ring/simple (16M to 128M)
# Reduce configurations for multi-node setups - 8B-128M
# 2 nodes, 16 ranks total
reduce,8,65536,ring,ll,4,2,16,-1,-1 # Small: ring (8B to 64K)
reduce,65537,16777216,ring,ll128,6,2,16,-1,-1 # Medium: ring (64K to 16M)
reduce,16777217,134217728,ring,simple,8,2,16,-1,-1 # Large: ring/simple (16M to 128M)
# 3 nodes, 24 ranks total
reduce,8,65536,ring,ll,4,2,24,-1,-1 # Small: ring (8B to 64K)
reduce,65537,16777216,ring,ll128,6,2,24,-1,-1 # Medium: ring (64K to 16M)
reduce,16777217,134217728,ring,simple,8,2,24,-1,-1 # Large: ring/simple (16M to 128M)
# 4 nodes, 32 ranks total
reduce,8,65536,ring,ll,4,2,32,-1,-1 # Small: ring (8B to 64K)
reduce,65537,16777216,ring,ll128,6,2,32,-1,-1 # Medium: ring (64K to 16M)
reduce,16777217,134217728,ring,simple,8,2,32,-1,-1 # Large: ring/simple (16M to 128M)
# 5 nodes, 40 ranks total
reduce,8,65536,ring,ll,4,2,40,-1,-1 # Small: ring (8B to 64K)
reduce,65537,16777216,ring,ll128,6,2,40,-1,-1 # Medium: ring (64K to 16M)
reduce,16777217,134217728,ring,simple,8,2,40,-1,-1 # Large: ring/simple (16M to 128M)
# ReduceScatter configurations for multi-node setups - 8B-128M
# 2 nodes, 16 ranks total
reducescatter,8,65536,ring,ll,4,2,16,-1,-1 # Small: ring (8B to 64K)
reducescatter,65537,16777216,ring,ll128,6,2,16,-1,-1 # Medium: ring (64K to 16M)
reducescatter,16777217,134217728,ring,simple,8,2,16,-1,-1 # Large: ring/simple (16M to 128M)
# 3 nodes, 24 ranks total
reducescatter,8,65536,ring,ll,4,2,24,-1,-1 # Small: ring (8B to 64K)
reducescatter,65537,16777216,ring,ll128,6,2,24,-1,-1 # Medium: ring (64K to 16M)
reducescatter,16777217,134217728,ring,simple,8,2,24,-1,-1 # Large: ring/simple (16M to 128M)
# 4 nodes, 32 ranks total
reducescatter,8,65536,ring,ll,4,2,32,-1,-1 # Small: ring (8B to 64K)
reducescatter,65537,16777216,ring,ll128,6,2,32,-1,-1 # Medium: ring (64K to 16M)
reducescatter,16777217,134217728,ring,simple,8,2,32,-1,-1 # Large: ring/simple (16M to 128M)
# 5 nodes, 40 ranks total
reducescatter,8,65536,ring,ll,4,2,40,-1,-1 # Small: ring (8B to 64K)
reducescatter,65537,16777216,ring,ll128,6,2,40,-1,-1 # Medium: ring (64K to 16M)
reducescatter,16777217,134217728,ring,simple,8,2,40,-1,-1 # Large: ring/simple (16M to 128M)
+44
Просмотреть файл
@@ -0,0 +1,44 @@
# Invalid configuration that should NOT match the test environment
# Wrong collective type (broadcast instead of allreduce for allreduce tests) - 8B-128M
broadcast,0,7,ring,simple,-1,-1,-1,-1,-1 # Size too small (broadcast test starts at 8B)
broadcast,134217729,268435456,ring,simple,-1,-1,-1,-1,-1 # Size too large (broadcast test ends at 128M)
# Wrong collective type (allreduce instead of broadcast for broadcast tests) - 8B-128M
allreduce,0,7,tree,simple,-1,-1,-1,-1,-1 # Size too small (doesn't include 8B+)
allreduce,134217729,268435456,tree,simple,-1,-1,-1,-1,-1 # Size too large (doesn't include 8B-128M)
# Wrong collective type (allgather instead of other tests) - 8B-128M
allgather,0,7,ring,simple,-1,-1,-1,-1,-1 # Size too small (allgather test starts at 8B)
allgather,134217729,268435456,ring,simple,-1,-1,-1,-1,-1 # Size too large (allgather test ends at 128M)
# Wrong collective type (reduce instead of other tests) - 8B-128M
reduce,0,7,ring,simple,-1,-1,-1,-1,-1 # Size too small (reduce test starts at 8B)
reduce,134217729,268435456,ring,simple,-1,-1,-1,-1,-1 # Size too large (reduce test ends at 128M)
# Wrong collective type (reducescatter instead of other tests) - 8B-128M
reducescatter,0,7,ring,simple,-1,-1,-1,-1,-1 # Size too small (reducescatter test starts at 8B)
reducescatter,134217729,268435456,ring,simple,-1,-1,-1,-1,-1 # Size too large (reducescatter test ends at 128M)
# Wrong size ranges (tests use 8 bytes to 128M, these are way outside)
allreduce,2000000000,4000000000,ring,simple,-1,-1,-1,-1,-1
allreduce,5000000000,6000000000,tree,ll128,-1,-1,-1,-1,-1
broadcast,2000000000,4000000000,ring,simple,-1,-1,-1,-1,-1
broadcast,5000000000,6000000000,ring,ll128,-1,-1,-1,-1,-1
allgather,2000000000,4000000000,ring,simple,-1,-1,-1,-1,-1
allgather,5000000000,6000000000,ring,ll128,-1,-1,-1,-1,-1
reduce,2000000000,4000000000,ring,simple,-1,-1,-1,-1,-1
reduce,5000000000,6000000000,ring,ll128,-1,-1,-1,-1,-1
reducescatter,2000000000,4000000000,ring,simple,-1,-1,-1,-1,-1
reducescatter,5000000000,6000000000,ring,ll128,-1,-1,-1,-1,-1
# Wrong node/rank counts (broadcast tests use 1 node, 4/8 ranks - these specify different, size outside 8B-128M)
broadcast,134217729,268435456,ring,simple,-1,4,16,-1,-1 # 4 nodes, 16 ranks, size too large
broadcast,268435457,536870912,ring,simple,-1,2,4,-1,-1 # 2 nodes, 4 ranks, size too large
# Wrong node/rank counts (allreduce tests use 1 node, 4/8 ranks - these specify different, size outside 8B-128M)
allreduce,134217729,268435456,ring,simple,-1,8,16,-1,-1 # 8 nodes, 16 ranks, size too large
allreduce,268435457,536870912,tree,simple,-1,2,8,-1,-1 # 2 nodes, 8 ranks, size too large
# Wrong node/rank counts (allgather tests use 1 node, 8 ranks - these specify different, size outside 8B-128M)
allgather,134217729,268435456,ring,simple,-1,4,16,-1,-1 # 4 nodes, 16 ranks, size too large
allgather,268435457,536870912,ring,simple,-1,2,4,-1,-1 # 2 nodes, 4 ranks, size too large
# Wrong node/rank counts (reduce tests use 1 node, 8 ranks - these specify different, size outside 8B-128M)
reduce,134217729,268435456,ring,simple,-1,4,16,-1,-1 # 4 nodes, 16 ranks, size too large
reduce,268435457,536870912,ring,simple,-1,2,4,-1,-1 # 2 nodes, 4 ranks, size too large
# Wrong node/rank counts (reducescatter tests use 1 node, 8 ranks - these specify different, size outside 8B-128M)
reducescatter,134217729,268435456,ring,simple,-1,4,16,-1,-1 # 4 nodes, 16 ranks, size too large
reducescatter,268435457,536870912,ring,simple,-1,2,4,-1,-1 # 2 nodes, 4 ranks, size too large
+39
Просмотреть файл
@@ -0,0 +1,39 @@
# Single-node allreduce configuration for 8 ranks, 1 node - use working algorithm/protocol combinations - 8B-128M
# Small allreduce: tree/ll (8 bytes to 64K)
allreduce,8,65536,tree,ll,2,1,8,-1,-1
# Medium allreduce: ring/ll128 (64K to 16M)
allreduce,65537,16777216,ring,ll128,4,1,8,-1,-1
# Large allreduce: ring/simple (16M to 128M)
allreduce,16777217,134217728,ring,simple,6,1,8,-1,-1
# Single-node broadcast configuration for 8 ranks, 1 node - use ring algorithm (supported for broadcast) - 8B-128M
# Small broadcast: ring/ll (8 bytes to 64K)
broadcast,8,65536,ring,ll,2,1,8,-1,-1
# Medium broadcast: ring/ll128 (64K to 16M)
broadcast,65537,16777216,ring,ll128,4,1,8,-1,-1
# Large broadcast: ring/simple (16M to 128M)
broadcast,16777217,134217728,ring,simple,6,1,8,-1,-1
# Single-node allgather configuration for 8 ranks, 1 node - use ring algorithm (supported for allgather) - 8B-128M
# Small allgather: ring/ll (8 bytes to 64K)
allgather,8,65536,ring,ll,2,1,8,-1,-1
# Medium allgather: ring/ll128 (64K to 16M)
allgather,65537,16777216,ring,ll128,4,1,8,-1,-1
# Large allgather: ring/simple (16M to 128M)
allgather,16777217,134217728,ring,simple,6,1,8,-1,-1
# Single-node reduce configuration for 8 ranks, 1 node - use ring algorithm (supported for reduce) - 8B-128M
# Small reduce: ring/ll (8 bytes to 64K)
reduce,8,65536,ring,ll,2,1,8,-1,-1
# Medium reduce: ring/ll128 (64K to 16M)
reduce,65537,16777216,ring,ll128,4,1,8,-1,-1
# Large reduce: ring/simple (16M to 128M)
reduce,16777217,134217728,ring,simple,6,1,8,-1,-1
# Single-node reducescatter configuration for 8 ranks, 1 node - use ring algorithm (supported for reducescatter) - 8B-128M
# Small reducescatter: ring/ll (8 bytes to 64K)
reducescatter,8,65536,ring,ll,2,1,8,-1,-1
# Medium reducescatter: ring/ll128 (64K to 16M)
reducescatter,65537,16777216,ring,ll128,4,1,8,-1,-1
# Large reducescatter: ring/simple (16M to 128M)
reducescatter,16777217,134217728,ring,simple,6,1,8,-1,-1
+40
Просмотреть файл
@@ -0,0 +1,40 @@
# AllReduce configurations with unsupported algorithm/protocol combinations - 8B-128M
# These have valid size ranges and collective types but unsupported algorithms
allreduce,8,65536,collnet_direct,simple,2,-1,-1,-1,-1
allreduce,65537,8388608,collnet_chain,ll,4,-1,-1,-1,-1
allreduce,8388609,33554432,nvls,simple,-1,-1,-1,-1,-1
allreduce,33554433,67108864,nvls_tree,ll128,-1,-1,-1,-1,-1
allreduce,67108865,100663296,pat,simple,-1,-1,-1,-1,-1
allreduce,100663297,134217728,collnet_direct,ll128,-1,-1,-1,-1,-1
# Broadcast configurations with unsupported algorithm/protocol combinations - 8B-128M
broadcast,8,65536,collnet_direct,simple,2,-1,-1,-1,-1
broadcast,65537,8388608,collnet_chain,ll,4,-1,-1,-1,-1
broadcast,8388609,33554432,nvls,simple,-1,-1,-1,-1,-1
broadcast,33554433,67108864,nvls_tree,ll128,-1,-1,-1,-1,-1
broadcast,67108865,100663296,pat,simple,-1,-1,-1,-1,-1
broadcast,100663297,134217728,collnet_direct,ll128,-1,-1,-1,-1,-1
# AllGather configurations with unsupported algorithm/protocol combinations - 8B-128M
allgather,8,65536,collnet_direct,simple,2,-1,-1,-1,-1
allgather,65537,8388608,collnet_chain,ll,4,-1,-1,-1,-1
allgather,8388609,33554432,nvls,simple,-1,-1,-1,-1,-1
allgather,33554433,67108864,nvls_tree,ll128,-1,-1,-1,-1,-1
allgather,67108865,100663296,pat,simple,-1,-1,-1,-1,-1
allgather,100663297,134217728,collnet_direct,ll128,-1,-1,-1,-1,-1
# Reduce configurations with unsupported algorithm/protocol combinations - 8B-128M
reduce,8,65536,collnet_direct,simple,2,-1,-1,-1,-1
reduce,65537,8388608,collnet_chain,ll,4,-1,-1,-1,-1
reduce,8388609,33554432,nvls,simple,-1,-1,-1,-1,-1
reduce,33554433,67108864,nvls_tree,ll128,-1,-1,-1,-1,-1
reduce,67108865,100663296,pat,simple,-1,-1,-1,-1,-1
reduce,100663297,134217728,collnet_direct,ll128,-1,-1,-1,-1,-1
# ReduceScatter configurations with unsupported algorithm/protocol combinations - 8B-128M
reducescatter,8,65536,collnet_direct,simple,2,-1,-1,-1,-1
reducescatter,65537,8388608,collnet_chain,ll,4,-1,-1,-1,-1
reducescatter,8388609,33554432,nvls,simple,-1,-1,-1,-1,-1
reducescatter,33554433,67108864,nvls_tree,ll128,-1,-1,-1,-1,-1
reducescatter,67108865,100663296,pat,simple,-1,-1,-1,-1,-1
reducescatter,100663297,134217728,collnet_direct,ll128,-1,-1,-1,-1,-1
+39
Просмотреть файл
@@ -0,0 +1,39 @@
# AllReduce configurations with supported algo/proto combinations - 8B-128M
# Small allreduce: tree/ll (8 bytes to 64K)
allreduce,8,65536,tree,ll,2,-1,-1,-1,-1
# Medium allreduce: ring/ll128 (64K to 16M)
allreduce,65537,16777216,ring,ll128,4,-1,-1,-1,-1
# Large allreduce: ring/simple (16M to 128M)
allreduce,16777217,134217728,ring,simple,-1,-1,-1,-1,-1
# Broadcast configurations with supported algo/proto combinations - 8B-128M
# Small broadcast: ring/ll (8 bytes to 64K)
broadcast,8,65536,ring,ll,2,-1,-1,-1,-1
# Medium broadcast: ring/ll128 (64K to 16M)
broadcast,65537,16777216,ring,ll128,4,-1,-1,-1,-1
# Large broadcast: ring/simple (16M to 128M)
broadcast,16777217,134217728,ring,simple,-1,-1,-1,-1,-1
# AllGather configurations with supported algo/proto combinations - 8B-128M
# Small allgather: ring/ll (8 bytes to 64K)
allgather,8,65536,ring,ll,2,-1,-1,-1,-1
# Medium allgather: ring/ll128 (64K to 16M)
allgather,65537,16777216,ring,ll128,4,-1,-1,-1,-1
# Large allgather: ring/simple (16M to 128M)
allgather,16777217,134217728,ring,simple,-1,-1,-1,-1,-1
# Reduce configurations with supported algo/proto combinations - 8B-128M
# Small reduce: ring/ll (8 bytes to 64K)
reduce,8,65536,ring,ll,2,-1,-1,-1,-1
# Medium reduce: ring/ll128 (64K to 16M)
reduce,65537,16777216,ring,ll128,4,-1,-1,-1,-1
# Large reduce: ring/simple (16M to 128M)
reduce,16777217,134217728,ring,simple,-1,-1,-1,-1,-1
# ReduceScatter configurations with supported algo/proto combinations - 8B-128M
# Small reducescatter: ring/ll (8 bytes to 64K)
reducescatter,8,65536,ring,ll,2,-1,-1,-1,-1
# Medium reducescatter: ring/ll128 (64K to 16M)
reducescatter,65537,16777216,ring,ll128,4,-1,-1,-1,-1
# Large reducescatter: ring/simple (16M to 128M)
reducescatter,16777217,134217728,ring,simple,-1,-1,-1,-1,-1
+40
Просмотреть файл
@@ -0,0 +1,40 @@
# AllReduce configurations with specific nodes/ranks values (no wildcards for topology) - 8B-128M
# But use wildcards for pipeOps/regBuff since they vary at runtime
# Small allreduce: tree/ll, 2 channels, 1 node, 4 ranks, any pipeOps, any regBuff (8 bytes to 64K)
allreduce,8,65536,tree,ll,2,1,4,-1,-1
# Medium allreduce: ring/ll128, 4 channels, 1 node, 4 ranks, any pipeOps, any regBuff (64K to 16M)
allreduce,65537,16777216,ring,ll128,4,1,4,-1,-1
# Large allreduce: ring/simple, 6 channels, 1 node, 4 ranks, any pipeOps, any regBuff (16M to 128M)
allreduce,16777217,134217728,ring,simple,6,1,4,-1,-1
# Broadcast configurations with specific nodes/ranks values (no wildcards for topology) - 8B-128M
# Small broadcast: ring/ll, 2 channels, 1 node, 4 ranks, any pipeOps, any regBuff (8 bytes to 64K)
broadcast,8,65536,ring,ll,2,1,4,-1,-1
# Medium broadcast: ring/ll128, 4 channels, 1 node, 4 ranks, any pipeOps, any regBuff (64K to 16M)
broadcast,65537,16777216,ring,ll128,4,1,4,-1,-1
# Large broadcast: ring/simple, 6 channels, 1 node, 4 ranks, any pipeOps, any regBuff (16M to 128M)
broadcast,16777217,134217728,ring,simple,6,1,4,-1,-1
# AllGather configurations with specific nodes/ranks values (no wildcards for topology) - 8B-128M
# Small allgather: ring/ll, 2 channels, 1 node, 4 ranks, any pipeOps, any regBuff (8 bytes to 64K)
allgather,8,65536,ring,ll,2,1,4,-1,-1
# Medium allgather: ring/ll128, 4 channels, 1 node, 4 ranks, any pipeOps, any regBuff (64K to 16M)
allgather,65537,16777216,ring,ll128,4,1,4,-1,-1
# Large allgather: ring/simple, 6 channels, 1 node, 4 ranks, any pipeOps, any regBuff (16M to 128M)
allgather,16777217,134217728,ring,simple,6,1,4,-1,-1
# Reduce configurations with specific nodes/ranks values (no wildcards for topology) - 8B-128M
# Small reduce: ring/ll, 2 channels, 1 node, 4 ranks, any pipeOps, any regBuff (8 bytes to 64K)
reduce,8,65536,ring,ll,2,1,4,-1,-1
# Medium reduce: ring/ll128, 4 channels, 1 node, 4 ranks, any pipeOps, any regBuff (64K to 16M)
reduce,65537,16777216,ring,ll128,4,1,4,-1,-1
# Large reduce: ring/simple, 6 channels, 1 node, 4 ranks, any pipeOps, any regBuff (16M to 128M)
reduce,16777217,134217728,ring,simple,6,1,4,-1,-1
# ReduceScatter configurations with specific nodes/ranks values (no wildcards for topology) - 8B-128M
# Small reducescatter: ring/ll, 2 channels, 1 node, 4 ranks, any pipeOps, any regBuff (8 bytes to 64K)
reducescatter,8,65536,ring,ll,2,1,4,-1,-1
# Medium reducescatter: ring/ll128, 4 channels, 1 node, 4 ranks, any pipeOps, any regBuff (64K to 16M)
reducescatter,65537,16777216,ring,ll128,4,1,4,-1,-1
# Large reducescatter: ring/simple, 6 channels, 1 node, 4 ranks, any pipeOps, any regBuff (16M to 128M)
reducescatter,16777217,134217728,ring,simple,6,1,4,-1,-1
+12
Просмотреть файл
@@ -0,0 +1,12 @@
# pytest.ini
[pytest]
markers =
ext_tuner: marks tests related to CSV Tuner Plugin
allreduce: marks tests related to AllReduce collective
broadcast: marks tests related to Broadcast collective
reduce: marks tests related to Reduce collective
allgather: marks tests related to AllGather collective
reducescatter: marks tests related to ReduceScatter collective
multinode: marks tests related to multi-node configuration
testpaths = tests
+2
Просмотреть файл
@@ -0,0 +1,2 @@
# Testing dependencies
pytest
+109
Просмотреть файл
@@ -0,0 +1,109 @@
# *************************************************************************
# * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
# *
# * See LICENSE.txt for license information
# ************************************************************************
import os
import pytest
import subprocess
import re
from types import SimpleNamespace
WORKDIR = os.getcwd()
RCCL_INSTALL_DIR = "path/to/rccl"
OMPI_INSTALL_DIR = "path/to/ompi/install"
RCCL_TESTS_DIR = "path/to/rccl-tests"
# Plugin Paths
PLUGIN_DIR = f"{RCCL_INSTALL_DIR}/ext-tuner/example"
PLUGIN_SO = f"{PLUGIN_DIR}/libnccl-tuner-example.so"
# CSV Configs
VALID_CONFIG_WITH_WILDCARDS = os.path.join(WORKDIR, "assets/csv_confs/valid_config_with_wildcards.conf")
VALID_CONFIG_WITHOUT_WILDCARDS = os.path.join(WORKDIR, "assets/csv_confs/valid_config_without_wildcards.conf")
NO_MATCHING_CONFIG = os.path.join(WORKDIR, "assets/csv_confs/no_matching_config.conf")
INCORRECT_VALUES_CONFIG = os.path.join(WORKDIR, "assets/csv_confs/incorrect_values_config.conf")
UNSUPPORTED_ALGO_PROTO_CONFIG = os.path.join(WORKDIR, "assets/csv_confs/unsupported_algo_proto_config.conf")
SINGLENODE_CONFIG = os.path.join(WORKDIR, "assets/csv_confs/singlenode_config.conf")
MULTINODE_CONFIG = os.path.join(WORKDIR, "assets/csv_confs/multinode_config.conf")
LOGDIR = os.path.join(WORKDIR, "logs")
os.makedirs(LOGDIR, exist_ok=True)
# Helper Functions
def get_avg_bus_bandwidth(log_content: str):
"""Extract average bus bandwidth from RCCL test log"""
pattern = r'#\s*Avg bus bandwidth\s*:\s*([\d.]+)'
match = re.search(pattern, log_content, re.IGNORECASE)
return float(match.group(1)) if match else None
def check_node_interface(node: str, interface: str) -> bool:
"""Check if a node has the specified interface with an IP address"""
try:
cmd = ["ssh", "-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no",
node, f"ip addr show {interface} | grep 'inet ' | wc -l"]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
return result.returncode == 0 and int(result.stdout.strip()) > 0
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, ValueError):
return False
def find_common_interface(nodelist):
"""Find a common network interface across all nodes"""
interfaces_to_check = ["eth0", "eth1"]
for interface in interfaces_to_check:
all_nodes_have_interface = True
for node in nodelist:
if not check_node_interface(node, interface):
all_nodes_have_interface = False
break
if all_nodes_have_interface:
return interface
return None
def get_available_nodes():
"""Get available nodes from SLURM environment"""
try:
# Get available nodes
result = subprocess.run(
["scontrol", "show", "hostnames"],
capture_output=True,
text=True,
check=True
)
nodelist = result.stdout.strip().split('\n')
nodelist = [node.strip() for node in nodelist if node.strip()]
return nodelist
except (subprocess.CalledProcessError, FileNotFoundError):
return []
# Pytest Fixture
@pytest.fixture(scope="session")
def paths():
return SimpleNamespace(
# Paths
WORKDIR=WORKDIR,
RCCL_INSTALL_DIR=RCCL_INSTALL_DIR,
OMPI_INSTALL_DIR=OMPI_INSTALL_DIR,
PLUGIN_DIR=PLUGIN_DIR,
PLUGIN_SO=PLUGIN_SO,
RCCL_TESTS_DIR=RCCL_TESTS_DIR,
# CSV Configs
VALID_CONFIG_WITH_WILDCARDS=VALID_CONFIG_WITH_WILDCARDS,
VALID_CONFIG_WITHOUT_WILDCARDS=VALID_CONFIG_WITHOUT_WILDCARDS,
NO_MATCHING_CONFIG=NO_MATCHING_CONFIG,
INCORRECT_VALUES_CONFIG=INCORRECT_VALUES_CONFIG,
UNSUPPORTED_ALGO_PROTO_CONFIG=UNSUPPORTED_ALGO_PROTO_CONFIG,
SINGLENODE_CONFIG=SINGLENODE_CONFIG,
MULTINODE_CONFIG=MULTINODE_CONFIG,
LOGDIR=LOGDIR,
# Helper Functions
get_avg_bus_bandwidth=get_avg_bus_bandwidth,
check_node_interface=check_node_interface,
find_common_interface=find_common_interface,
get_available_nodes=get_available_nodes,
)
+440
Просмотреть файл
@@ -0,0 +1,440 @@
# *************************************************************************
# * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
# *
# * See LICENSE.txt for license information
# ************************************************************************
import os
import subprocess
import pytest
@pytest.mark.ext_tuner
@pytest.mark.allgather
def test_valid_config_with_wildcards(paths):
"""Test CSV plugin with wildcard values for matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITH_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_valid_config_with_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITH_WILDCARDS}"
# Check that plugin applied valid configurations (test fails if no configs applied)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied valid configurations, but none were applied. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allgather
def test_valid_config_without_wildcards(paths):
"""Test CSV plugin with specific values (no wildcards -1) for precise matching"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITHOUT_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_valid_config_without_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}"
# With specific values, plugin should either apply matching configs or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
# Test should fail if no config is applied - we expect specific configs to match
assert plugin_applied, \
f"Plugin should have applied at least one configuration from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allgather
def test_no_matching_config(paths):
"""Test CSV plugin behavior with no matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.NO_MATCHING_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_no_matching_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.NO_MATCHING_CONFIG}"
# Check that NO configurations were applied (they should not match the test environment)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert not plugin_applied, \
f"Plugin should NOT have applied any configurations from {paths.NO_MATCHING_CONFIG} as they don't match the test environment. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allgather
def test_incorrect_values_config(paths):
"""Test CSV plugin behavior with invalid/incorrect values in configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.INCORRECT_VALUES_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_incorrect_values_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded some configurations (plugin should handle invalid values gracefully)
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.INCORRECT_VALUES_CONFIG}"
# Plugin should still function despite invalid values (using defaults)
# It might apply configs with default values or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
"Plugin should either apply configurations (with defaults) or report no matches"
@pytest.mark.ext_tuner
@pytest.mark.allgather
def test_unsupported_algo_proto_config(paths):
"""Test that plugin handles unsupported algorithm/protocol combinations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.UNSUPPORTED_ALGO_PROTO_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "64",
"-e", "1M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_unsupported_algo_proto.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.UNSUPPORTED_ALGO_PROTO_CONFIG}"
# Check for unsupported combinations - should see IGNORE or out of bounds messages
ignored_combinations = "Algorithm/protocol combination" in log_content and "is marked as IGNORE" in log_content
out_of_bounds = "out of bounds" in log_content
assert ignored_combinations or out_of_bounds, \
f"Plugin should report unsupported algorithm/protocol combinations as IGNORE or out of bounds. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allgather
def test_singlenode_config(paths):
"""Test CSV plugin with single-node configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.SINGLENODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_singlenode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Single-node CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.SINGLENODE_CONFIG}"
# Check that configurations were applied for single-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied single-node configurations from {paths.SINGLENODE_CONFIG}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allgather
@pytest.mark.multinode
def test_multinode_config(paths):
"""Test CSV plugin with multi-node configuration"""
try:
# Get available nodes
result = subprocess.run(
["scontrol", "show", "hostnames"],
capture_output=True,
text=True,
check=True
)
nodelist = result.stdout.strip().split('\n')
nodelist = [node.strip() for node in nodelist if node.strip()]
print(f"Available nodes: {nodelist}")
# Skip test if less than 2 nodes available
if len(nodelist) < 2:
pytest.skip(f"Multinode allgather test requires at least 2 nodes, but only {len(nodelist)} available: {nodelist}")
except (subprocess.CalledProcessError, FileNotFoundError):
pytest.skip("Multinode allgather test requires SLURM environment (scontrol command not available)")
# Check for common network interface across all nodes
common_interface = paths.find_common_interface(nodelist)
if common_interface is None:
pytest.skip(f"Multinode allgather test requires all nodes to have the same network interface (eth0 or eth1).")
# Build host specification string (4 processes per node)
host_spec = ",".join([f"{node}:8" for node in nodelist])
total_processes = len(nodelist) * 8
print(f"Using host specification: {host_spec}")
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_IGNORE_CPU_AFFINITY": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.MULTINODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
"NCCL_SOCKET_IFNAME": common_interface,
"NCCL_DMABUF_ENABLE": "1",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", f"{total_processes}",
"--host", host_spec,
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_gather_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allgather_log_dir = os.path.join(paths.LOGDIR, "allgather_csv_plugin_test_logs")
os.makedirs(allgather_log_dir, exist_ok=True)
log_file = os.path.join(allgather_log_dir, "test_allgather_multinode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Multi-node CSV Plugin allgather test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.MULTINODE_CONFIG}"
# Check that configurations were applied for multi-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied multi-node configurations from {paths.MULTINODE_CONFIG}. Check {log_file} for details"
+431
Просмотреть файл
@@ -0,0 +1,431 @@
# *************************************************************************
# * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
# *
# * See LICENSE.txt for license information
# ************************************************************************
import os
import subprocess
import pytest
@pytest.mark.ext_tuner
@pytest.mark.allreduce
def test_valid_config_with_wildcards(paths):
"""Test CSV plugin with wildcard values for matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITH_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_valid_config_with_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITH_WILDCARDS}"
# Check that plugin applied valid configurations (test fails if no configs applied)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied valid configurations, but none were applied. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allreduce
def test_valid_config_without_wildcards(paths):
"""Test CSV plugin with specific values (no wildcards -1) for precise matching"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITHOUT_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_valid_config_without_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}"
# With specific values, plugin should either apply matching configs or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
# Test should fail if no config is applied - we expect specific configs to match
assert plugin_applied, \
f"Plugin should have applied at least one configuration from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allreduce
def test_no_matching_config(paths):
"""Test CSV plugin behavior with no matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.NO_MATCHING_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--bind-to", "none",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_no_matching_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.NO_MATCHING_CONFIG}"
# Check that NO configurations were applied (they should not match the test environment)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert not plugin_applied, \
f"Plugin should NOT have applied any configurations from {paths.NO_MATCHING_CONFIG} as they don't match the test environment. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allreduce
def test_incorrect_values_config(paths):
"""Test CSV plugin behavior with invalid/incorrect values in configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.INCORRECT_VALUES_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_incorrect_values_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded some configurations (plugin should handle invalid values gracefully)
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.INCORRECT_VALUES_CONFIG}"
# Plugin should still function despite invalid values (using defaults)
# It might apply configs with default values or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
"Plugin should either apply configurations (with defaults) or report no matches"
@pytest.mark.ext_tuner
@pytest.mark.allreduce
def test_unsupported_algo_proto_config(paths):
"""Test that plugin handles unsupported algorithm/protocol combinations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.UNSUPPORTED_ALGO_PROTO_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_unsupported_algo_proto.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.UNSUPPORTED_ALGO_PROTO_CONFIG}"
# Check for unsupported combinations - should see IGNORE or out of bounds messages
ignored_combinations = "Algorithm/protocol combination" in log_content and "is marked as IGNORE" in log_content
out_of_bounds = "out of bounds" in log_content
assert ignored_combinations or out_of_bounds, \
f"Plugin should report unsupported algorithm/protocol combinations as IGNORE or out of bounds. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allreduce
def test_singlenode_config(paths):
"""Test CSV plugin with single-node configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.SINGLENODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--bind-to", "none",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_singlenode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Single-node CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.SINGLENODE_CONFIG}"
# Check that configurations were applied for single-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied single-node configurations from {paths.SINGLENODE_CONFIG}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.allreduce
@pytest.mark.multinode
def test_multinode_config(paths):
"""Test CSV plugin with multi-node configuration"""
# Get available nodes using the shared function
nodelist = paths.get_available_nodes()
# Skip test if no nodes available (SLURM not available) or less than 2 nodes
if len(nodelist) == 0:
pytest.skip("No nodes available")
elif len(nodelist) < 2:
pytest.skip(f"Multinode allreduce test requires at least 2 nodes, but only {len(nodelist)} available: {nodelist}")
# Check for common network interface across all nodes
common_interface = paths.find_common_interface(nodelist)
if common_interface is None:
pytest.skip(f"Multinode allreduce test requires all nodes to have the same network interface (eth0 or eth1).")
# Build host specification string (4 processes per node)
host_spec = ",".join([f"{node}:8" for node in nodelist])
total_processes = len(nodelist) * 8
print(f"Using host specification: {host_spec}")
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_IGNORE_CPU_AFFINITY": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.MULTINODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
"NCCL_SOCKET_IFNAME": common_interface,
"NCCL_DMABUF_ENABLE": "1",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", f"{total_processes}",
"--host", host_spec,
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/all_reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
allreduce_log_dir = os.path.join(paths.LOGDIR, "allreduce_csv_plugin_test_logs")
os.makedirs(allreduce_log_dir, exist_ok=True)
log_file = os.path.join(allreduce_log_dir, "test_allreduce_multinode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Multi-node CSV Plugin test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.MULTINODE_CONFIG}"
# Check that configurations were applied for multi-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied multi-node configurations from {paths.MULTINODE_CONFIG}. Check {log_file} for details"
+428
Просмотреть файл
@@ -0,0 +1,428 @@
# *************************************************************************
# * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
# *
# * See LICENSE.txt for license information
# ************************************************************************
import os
import subprocess
import pytest
@pytest.mark.ext_tuner
@pytest.mark.broadcast
def test_valid_config_with_wildcards(paths):
"""Test CSV plugin with wildcard values for matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITH_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_valid_config_with_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITH_WILDCARDS}"
# Check that plugin applied valid configurations (test fails if no configs applied)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied valid configurations, but none were applied. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.broadcast
def test_valid_config_without_wildcards(paths):
"""Test CSV plugin with specific values (no wildcards -1) for precise matching"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITHOUT_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_valid_config_without_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}"
# With specific values, plugin should either apply matching configs or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
# Test should fail if no config is applied - we expect specific configs to match
assert plugin_applied, \
f"Plugin should have applied at least one configuration from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.broadcast
def test_no_matching_config(paths):
"""Test CSV plugin behavior with no matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.NO_MATCHING_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_no_matching_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.NO_MATCHING_CONFIG}"
# Check that NO configurations were applied (they should not match the test environment)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert not plugin_applied, \
f"Plugin should NOT have applied any configurations from {paths.NO_MATCHING_CONFIG} as they don't match the test environment. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.broadcast
def test_incorrect_values_config(paths):
"""Test CSV plugin behavior with invalid/incorrect values in configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.INCORRECT_VALUES_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_incorrect_values_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded some configurations (plugin should handle invalid values gracefully)
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.INCORRECT_VALUES_CONFIG}"
# Plugin should still function despite invalid values (using defaults)
# It might apply configs with default values or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
"Plugin should either apply configurations (with defaults) or report no matches"
@pytest.mark.ext_tuner
@pytest.mark.broadcast
def test_unsupported_algo_proto_config(paths):
"""Test that plugin handles unsupported algorithm/protocol combinations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.UNSUPPORTED_ALGO_PROTO_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_unsupported_algo_proto.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.UNSUPPORTED_ALGO_PROTO_CONFIG}"
# Check for unsupported combinations - should see IGNORE or out of bounds messages
ignored_combinations = "Algorithm/protocol combination" in log_content and "is marked as IGNORE" in log_content
out_of_bounds = "out of bounds" in log_content
assert ignored_combinations or out_of_bounds, \
f"Plugin should report unsupported algorithm/protocol combinations as IGNORE or out of bounds. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.broadcast
def test_singlenode_config(paths):
"""Test CSV plugin with single-node configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.SINGLENODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_singlenode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Single-node CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.SINGLENODE_CONFIG}"
# Check that configurations were applied for single-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied single-node configurations from {paths.SINGLENODE_CONFIG}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.broadcast
@pytest.mark.multinode
def test_multinode_config(paths):
"""Test CSV plugin with multi-node configuration"""
# Get available nodes using the shared function
nodelist = paths.get_available_nodes()
# Skip test if no nodes available (SLURM not available) or less than 2 nodes
if len(nodelist) == 0:
pytest.skip("No nodes available")
elif len(nodelist) < 2:
pytest.skip(f"Multinode broadcast test requires at least 2 nodes, but only {len(nodelist)} available: {nodelist}")
# Check for common network interface across all nodes
common_interface = paths.find_common_interface(nodelist)
if common_interface is None:
pytest.skip(f"Multinode broadcast test requires all nodes to have the same network interface (eth0 or eth1).")
# Build host specification string (4 processes per node)
host_spec = ",".join([f"{node}:8" for node in nodelist])
total_processes = len(nodelist) * 8
print(f"Using host specification: {host_spec}")
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_IGNORE_CPU_AFFINITY": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.MULTINODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
"NCCL_SOCKET_IFNAME": common_interface,
"NCCL_DMABUF_ENABLE": "1",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", f"{total_processes}",
"--host", host_spec,
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/broadcast_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
broadcast_log_dir = os.path.join(paths.LOGDIR, "broadcast_csv_plugin_test_logs")
os.makedirs(broadcast_log_dir, exist_ok=True)
log_file = os.path.join(broadcast_log_dir, "test_broadcast_multinode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Multi-node CSV Plugin broadcast test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.MULTINODE_CONFIG}"
# Check that configurations were applied for multi-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied multi-node configurations from {paths.MULTINODE_CONFIG}. Check {log_file} for details"
+428
Просмотреть файл
@@ -0,0 +1,428 @@
# *************************************************************************
# * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
# *
# * See LICENSE.txt for license information
# ************************************************************************
import os
import subprocess
import pytest
@pytest.mark.ext_tuner
@pytest.mark.reduce
def test_valid_config_with_wildcards(paths):
"""Test CSV plugin with wildcard values for matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITH_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_valid_config_with_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITH_WILDCARDS}"
# Check that plugin applied valid configurations (test fails if no configs applied)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied valid configurations, but none were applied. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reduce
def test_valid_config_without_wildcards(paths):
"""Test CSV plugin with specific values (no wildcards -1) for precise matching"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITHOUT_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_valid_config_without_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}"
# With specific values, plugin should either apply matching configs or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
# Test should fail if no config is applied - we expect specific configs to match
assert plugin_applied, \
f"Plugin should have applied at least one configuration from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reduce
def test_no_matching_config(paths):
"""Test CSV plugin behavior with no matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.NO_MATCHING_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_no_matching_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.NO_MATCHING_CONFIG}"
# Check that NO configurations were applied (they should not match the test environment)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert not plugin_applied, \
f"Plugin should NOT have applied any configurations from {paths.NO_MATCHING_CONFIG} as they don't match the test environment. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reduce
def test_incorrect_values_config(paths):
"""Test CSV plugin behavior with invalid/incorrect values in configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.INCORRECT_VALUES_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_incorrect_values_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded some configurations (plugin should handle invalid values gracefully)
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.INCORRECT_VALUES_CONFIG}"
# Plugin should still function despite invalid values (using defaults)
# It might apply configs with default values or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
"Plugin should either apply configurations (with defaults) or report no matches"
@pytest.mark.ext_tuner
@pytest.mark.reduce
def test_unsupported_algo_proto_config(paths):
"""Test that plugin handles unsupported algorithm/protocol combinations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.UNSUPPORTED_ALGO_PROTO_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_unsupported_algo_proto.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.UNSUPPORTED_ALGO_PROTO_CONFIG}"
# Check for unsupported combinations - should see IGNORE or out of bounds messages
ignored_combinations = "Algorithm/protocol combination" in log_content and "is marked as IGNORE" in log_content
out_of_bounds = "out of bounds" in log_content
assert ignored_combinations or out_of_bounds, \
f"Plugin should report unsupported algorithm/protocol combinations as IGNORE or out of bounds. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reduce
def test_singlenode_config(paths):
"""Test CSV plugin with single-node configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.SINGLENODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_singlenode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Single-node CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.SINGLENODE_CONFIG}"
# Check that configurations were applied for single-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied single-node configurations from {paths.SINGLENODE_CONFIG}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reduce
@pytest.mark.multinode
def test_multinode_config(paths):
"""Test CSV plugin with multi-node configuration"""
# Get available nodes using the shared function
nodelist = paths.get_available_nodes()
# Skip test if no nodes available (SLURM not available) or less than 2 nodes
if len(nodelist) == 0:
pytest.skip("No nodes available")
elif len(nodelist) < 2:
pytest.skip(f"Multinode reduce test requires at least 2 nodes, but only {len(nodelist)} available: {nodelist}")
# Check for common network interface across all nodes
common_interface = paths.find_common_interface(nodelist)
if common_interface is None:
pytest.skip(f"Multinode reduce test requires all nodes to have the same network interface (eth0 or eth1).")
# Build host specification string (4 processes per node)
host_spec = ",".join([f"{node}:8" for node in nodelist])
total_processes = len(nodelist) * 8
print(f"Using host specification: {host_spec}")
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_IGNORE_CPU_AFFINITY": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.MULTINODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
"NCCL_SOCKET_IFNAME": common_interface,
"NCCL_DMABUF_ENABLE": "1",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", f"{total_processes}",
"--host", host_spec,
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_perf",
"-b", "8", # 64 bytes start (faster than 1 byte)
"-e", "128M", # 64MB end (much smaller than 16GB)
"-f", "2", # factor 4 (fewer size steps)
"-g", "1", # gap 1
]
reduce_log_dir = os.path.join(paths.LOGDIR, "reduce_csv_plugin_test_logs")
os.makedirs(reduce_log_dir, exist_ok=True)
log_file = os.path.join(reduce_log_dir, "test_reduce_multinode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Multi-node CSV Plugin reduce test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.MULTINODE_CONFIG}"
# Check that configurations were applied for multi-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied multi-node configurations from {paths.MULTINODE_CONFIG}. Check {log_file} for details"
+428
Просмотреть файл
@@ -0,0 +1,428 @@
# *************************************************************************
# * Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
# *
# * See LICENSE.txt for license information
# ************************************************************************
import os
import subprocess
import pytest
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
def test_valid_config_with_wildcards(paths):
"""Test CSV plugin with wildcard values for matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITH_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_valid_config_with_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITH_WILDCARDS}"
# Check that plugin applied valid configurations (test fails if no configs applied)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied valid configurations, but none were applied. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
def test_valid_config_without_wildcards(paths):
"""Test CSV plugin with specific values (no wildcards -1) for precise matching"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.VALID_CONFIG_WITHOUT_WILDCARDS,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_valid_config_without_wildcards.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}"
# With specific values, plugin should either apply matching configs or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
# Test should fail if no config is applied - we expect specific configs to match
assert plugin_applied, \
f"Plugin should have applied at least one configuration from {paths.VALID_CONFIG_WITHOUT_WILDCARDS}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
def test_no_matching_config(paths):
"""Test CSV plugin behavior with no matching configurations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.NO_MATCHING_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_no_matching_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.NO_MATCHING_CONFIG}"
# Check that NO configurations were applied (they should not match the test environment)
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert not plugin_applied, \
f"Plugin should NOT have applied any configurations from {paths.NO_MATCHING_CONFIG} as they don't match the test environment. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
def test_incorrect_values_config(paths):
"""Test CSV plugin behavior with invalid/incorrect values in configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.INCORRECT_VALUES_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_incorrect_values_config.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded some configurations (plugin should handle invalid values gracefully)
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.INCORRECT_VALUES_CONFIG}"
# Plugin should still function despite invalid values (using defaults)
# It might apply configs with default values or report no matches
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
"Plugin should either apply configurations (with defaults) or report no matches"
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
def test_unsupported_algo_proto_config(paths):
"""Test that plugin handles unsupported algorithm/protocol combinations"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.UNSUPPORTED_ALGO_PROTO_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "4",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_unsupported_algo_proto.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.UNSUPPORTED_ALGO_PROTO_CONFIG}"
# Check for unsupported combinations - should see IGNORE or out of bounds messages
ignored_combinations = "Algorithm/protocol combination" in log_content and "is marked as IGNORE" in log_content
out_of_bounds = "out of bounds" in log_content
assert ignored_combinations or out_of_bounds, \
f"Plugin should report unsupported algorithm/protocol combinations as IGNORE or out of bounds. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
def test_singlenode_config(paths):
"""Test CSV plugin with single-node configuration"""
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.SINGLENODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", "8",
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_singlenode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Single-node CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.SINGLENODE_CONFIG}"
# Check that configurations were applied for single-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied single-node configurations from {paths.SINGLENODE_CONFIG}. Check {log_file} for details"
@pytest.mark.ext_tuner
@pytest.mark.reducescatter
@pytest.mark.multinode
def test_multinode_config(paths):
"""Test CSV plugin with multi-node configuration"""
# Get available nodes using the shared function
nodelist = paths.get_available_nodes()
# Skip test if no nodes available (SLURM not available) or less than 2 nodes
if len(nodelist) == 0:
pytest.skip("No nodes available")
elif len(nodelist) < 2:
pytest.skip(f"Multinode reducescatter test requires at least 2 nodes, but only {len(nodelist)} available: {nodelist}")
# Check for common network interface across all nodes
common_interface = paths.find_common_interface(nodelist)
if common_interface is None:
pytest.skip(f"Multinode reducescatter test requires all nodes to have the same network interface (eth0 or eth1).")
# Build host specification string (4 processes per node)
host_spec = ",".join([f"{node}:8" for node in nodelist])
total_processes = len(nodelist) * 8
print(f"Using host specification: {host_spec}")
env = os.environ.copy()
env.update({
"PATH": f"{paths.OMPI_INSTALL_DIR}/bin:{env.get('PATH', '')}",
"LD_LIBRARY_PATH": f"{paths.RCCL_INSTALL_DIR}:{paths.OMPI_INSTALL_DIR}/lib:{paths.PLUGIN_DIR}:{env.get('LD_LIBRARY_PATH', '')}",
"HSA_NO_SCRATCH_RECLAIM": "1",
"NCCL_IGNORE_CPU_AFFINITY": "1",
"NCCL_TUNER_PLUGIN": paths.PLUGIN_SO,
"NCCL_TUNER_CONFIG_FILE": paths.MULTINODE_CONFIG,
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "TUNING",
"NCCL_SOCKET_IFNAME": common_interface,
"NCCL_DMABUF_ENABLE": "1",
})
args = [
f"{paths.OMPI_INSTALL_DIR}/bin/mpirun", "-np", f"{total_processes}",
"--host", host_spec,
"--mca", "pml", "ucx",
"--mca", "btl", "^vader,openib",
f"{paths.RCCL_TESTS_DIR}/build/reduce_scatter_perf",
"-b", "8",
"-e", "128M",
"-f", "2",
"-g", "1",
]
reducescatter_log_dir = os.path.join(paths.LOGDIR, "reducescatter_csv_plugin_test_logs")
os.makedirs(reducescatter_log_dir, exist_ok=True)
log_file = os.path.join(reducescatter_log_dir, "test_reducescatter_multinode.log")
with open(log_file, "w") as logfile:
rccl_test = subprocess.run(
args,
env=env,
stdout=logfile,
stderr=subprocess.STDOUT,
universal_newlines=True
)
assert rccl_test.returncode == 0, f"Multi-node CSV Plugin reducescatter test failed, see {log_file}"
# Read and validate log content
with open(log_file, "r") as logfile:
log_content = logfile.read()
# Check that plugin loaded configurations
assert "TUNER/ExamplePlugin: Loaded" in log_content and "tuning configurations" in log_content, \
f"Plugin should have loaded configurations from {paths.MULTINODE_CONFIG}"
# Check that configurations were applied for multi-node setup
plugin_applied = "TUNER/ExamplePlugin: Applied config for collType=" in log_content
assert plugin_applied, \
f"Plugin should have applied multi-node configurations from {paths.MULTINODE_CONFIG}. Check {log_file} for details"