topo_explorer initial readme (#1797)
* topo_explorer initial readme
* topo_explorer readme update
* topo_explorer readme update
* Added sample output to README
* Update README.md
* Update README.md
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
[ROCm/rccl commit: 7abc7538ea]
@@ -0,0 +1,98 @@
# RCCL Topology Explorer (topo_expl)

The RCCL Topology Explorer is a tool for analyzing and exploring network topologies for RCCL (ROCm Communication Collectives Library) collective operations. It simulates various hardware configurations and displays the algorithm/protocol combinations that RCCL would select for each one.

## Building

### Prerequisites
- ROCm/HIP development environment
- RCCL source code
- hipify-perl (for source transformation)

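Before building, it can help to confirm that the required tools are on `PATH`. The sketch below is one way to do that; the function name `require_tools` is hypothetical, and the tool names in the usage comment (`hipcc`, `hipify-perl`, `make`) are an assumption about what this build needs rather than a list taken from the Makefile.

```bash
# Hypothetical helper: report which of the named tools are missing from PATH.
# Returns 0 if every tool is found, 1 otherwise.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed tool names for this build; adjust to your environment.
# require_tools hipcc hipify-perl make && make
```
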
### Build Instructions

```bash
cd tools/topo_expl
make
```

## Usage

```bash
./topo_expl -m model_id [-n numNodes=1]
```

### Parameters

- `-m model_id`: Specifies the topology model to use (required)
- `-n numNodes`: Number of nodes to simulate (default: 1)

### Available Models

Run `./topo_expl` without arguments to list the available models. Each model represents a different hardware configuration, and [each model file](./models) describes a particular GPU model and node configuration. RCCL can write such a file when the environment variable `NCCL_TOPO_DUMP_FILE` is set. The model XML files are provided pre-generated for simplicity.

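As a rough sketch of how a topology file could be captured on a real system, the wrapper below runs an arbitrary command with `NCCL_TOPO_DUMP_FILE` set. The function name `dump_topology` and the application name in the usage comment are hypothetical; only the environment variable comes from the text above.

```bash
# Hypothetical helper: run a command with NCCL_TOPO_DUMP_FILE pointing at
# the given output path.
# Usage: dump_topology <output.xml> <command> [args...]
dump_topology() {
  if [ "$#" -lt 2 ]; then
    echo "usage: dump_topology <output.xml> <command> [args...]" >&2
    return 2
  fi
  local out="$1"
  shift
  # The variable is set only for this one command, not for the calling shell.
  NCCL_TOPO_DUMP_FILE="$out" "$@"
}

# Example (./my_rccl_app is a placeholder for any RCCL-based program):
# dump_topology my_topo.xml ./my_rccl_app
```
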
## Example Usage: Print RCCL's algorithm/protocol selections

The tool is typically run with the `NCCL_DEBUG=INFO` environment variable. To print only the algorithm/protocol table, this example uses `NCCL_DEBUG=version` instead, which avoids printing the topology details.

```bash
# List available models
./topo_expl

# Test MI300 configuration (model 55)
NCCL_DEBUG=version ./topo_expl -m 55

# Test a multi-node MI300 configuration with 8 nodes
NCCL_DEBUG=version ./topo_expl -m 55 -n 8

# Test MI250 configuration (model 42)
NCCL_DEBUG=version ./topo_expl -m 42

# Test a multi-node MI250 configuration with 4 nodes
NCCL_DEBUG=version ./topo_expl -m 42 -n 4
```

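To compare how the selections change with scale, the commands above can be wrapped in a loop over node counts. This is a minimal sketch: the function name `sweep_nodes` and the `TOPO_EXPL` override variable are hypothetical conveniences, not part of the tool.

```bash
# Hypothetical helper: run one model across several node counts.
# TOPO_EXPL may be set to the path of the binary (defaults to ./topo_expl).
# Usage: sweep_nodes <model_id> <numNodes>...
sweep_nodes() {
  local model="$1"
  shift
  local n
  for n in "$@"; do
    echo "=== model ${model}, ${n} node(s) ==="
    NCCL_DEBUG=version "${TOPO_EXPL:-./topo_expl}" -m "$model" -n "$n"
  done
}

# Example: MI300 (model 55) on 1, 2, 4, and 8 nodes
# sweep_nodes 55 1 2 4 8
```
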
## Sample output

```bash
# Command used
NCCL_DEBUG=version ./topo_expl -m 55 -n 8
```

```bash
Running fp32 production choices for algorithm/protocol/maxChannels
| Max Size(B)     | Count           | Collective      | Algorithm  | Protocol   | Max Channels |
|-----------------|-----------------|-----------------|------------|------------|--------------|
| 32              | 8               | AllReduce       | Tree       | LL         | 1            |
| 64              | 16              | AllReduce       | Tree       | LL         | 1            |
| 128             | 32              | AllReduce       | Tree       | LL         | 1            |
| 256             | 64              | AllReduce       | Tree       | LL         | 1            |
| 512             | 128             | AllReduce       | Tree       | LL         | 1            |
| 1024            | 256             | AllReduce       | Tree       | LL         | 1            |
| 2048            | 512             | AllReduce       | Tree       | LL         | 1            |
| 4096            | 1024            | AllReduce       | Tree       | LL         | 2            |
| 8192            | 2048            | AllReduce       | Tree       | LL         | 4            |
| 16384           | 4096            | AllReduce       | Tree       | LL         | 8            |
| 32768           | 8192            | AllReduce       | Tree       | LL         | 16           |
| 65536           | 16384           | AllReduce       | Tree       | LL         | 32           |
| 131072          | 32768           | AllReduce       | Tree       | LL         | 64           |
| 262144          | 65536           | AllReduce       | Tree       | LL         | 64           |
| 524288          | 131072          | AllReduce       | Tree       | LL         | 64           |
| 1048576         | 262144          | AllReduce       | Tree       | LL         | 64           |
| 2097152         | 524288          | AllReduce       | Tree       | LL128      | 64           |
| 4194304         | 1048576         | AllReduce       | Tree       | LL128      | 64           |
| 8388608         | 2097152         | AllReduce       | Tree       | LL128      | 64           |
| 16777216        | 4194304         | AllReduce       | Tree       | LL128      | 64           |
| 33554432        | 8388608         | AllReduce       | Tree       | LL128      | 64           |
| 67108864        | 16777216        | AllReduce       | Tree       | Simple     | 64           |
| 134217728       | 33554432        | AllReduce       | Tree       | Simple     | 64           |
| 268435456       | 67108864        | AllReduce       | Tree       | Simple     | 64           |
| 536870912       | 134217728       | AllReduce       | Ring       | Simple     | 64           |
| 1073741824      | 268435456       | AllReduce       | Ring       | Simple     | 64           |
| 2147483648      | 536870912       | AllReduce       | Ring       | Simple     | 64           |
| 4294967296      | 1073741824      | AllReduce       | Ring       | Simple     | 64           |
...
```

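The interesting parts of this table are usually the message sizes at which the algorithm or protocol changes. The filter below is a minimal sketch for extracting those rows; the function name `show_transitions` is hypothetical, and it assumes the table is written to standard output in the pipe-delimited layout shown above.

```bash
# Hypothetical helper: read topo_expl table output on stdin and print only the
# rows where the collective/algorithm/protocol differs from the previous row.
show_transitions() {
  awk -F'|' '
    # Data rows have a numeric first column; skip headers and separators.
    $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      size = $2; coll = $4; algo = $5; proto = $6
      gsub(/[ \t]/, "", size); gsub(/[ \t]/, "", coll)
      gsub(/[ \t]/, "", algo); gsub(/[ \t]/, "", proto)
      key = coll "/" algo "/" proto
      if (key != prev) print size, coll, algo, proto
      prev = key
    }'
}

# Example:
# NCCL_DEBUG=version ./topo_expl -m 55 -n 8 | show_transitions
```

For the AllReduce rows in the sample, this would report transitions at 32 (Tree/LL), 2097152 (Tree/LL128), 67108864 (Tree/Simple), and 536870912 (Ring/Simple).
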