From bf4031276cd510562e6363661155e96ae014fa88 Mon Sep 17 00:00:00 2001
From: Nikhil-Nunna <49376782+Nikhil-Nunna@users.noreply.github.com>
Date: Fri, 11 Jul 2025 11:28:20 -0500
Subject: [PATCH] topo_explorer initial readme (#1797)

* topo_explorer intial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar

[ROCm/rccl commit: 7abc7538ea87b485b5e19f73930b0a26a048d3d6]
---
 projects/rccl/tools/topo_expl/README.md | 98 +++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 projects/rccl/tools/topo_expl/README.md

diff --git a/projects/rccl/tools/topo_expl/README.md b/projects/rccl/tools/topo_expl/README.md
new file mode 100644
index 0000000000..cdc742d547
--- /dev/null
+++ b/projects/rccl/tools/topo_expl/README.md
@@ -0,0 +1,98 @@
+# RCCL Topology Explorer (topo_expl)
+
+The RCCL Topology Explorer is a tool for analyzing and exploring network topologies for RCCL (ROCm Communication Collectives Library) collective operations. It simulates various hardware configurations and displays the algorithm/protocol combinations that RCCL would select.
+
+## Building
+
+### Prerequisites
+- ROCm/HIP development environment
+- RCCL source code
+- hipify-perl (for source transformation)
+
+### Build Instructions
+
+```bash
+cd tools/topo_expl
+make
+```
+
+## Usage
+
+```bash
+./topo_expl -m model_id [-n numNodes=1]
+```
+
+### Parameters
+
+- `-m model_id`: Specifies the topology model to use (required)
+- `-n numNodes`: Number of nodes to simulate (default: 1)
+
+### Available Models
+
+Run `./topo_expl` without arguments to see the list of available models. Each model represents a different hardware configuration. [Each model file](./models) describes a particular GPU model and node configuration. RCCL can write such a file when the environment variable `NCCL_TOPO_DUMP_FILE` is set. Model XMLs have been generated for simplicity.
+
+## Example Usage: Print RCCL's algorithm/protocol selections
+
+The tool is typically run with the `NCCL_DEBUG=INFO` environment variable. To print only the algorithm/protocol table, the examples below use `NCCL_DEBUG=version`, which avoids printing topology details.
+
+```bash
+# List available models
+./topo_expl
+
+# Test MI300 configuration (model 55)
+NCCL_DEBUG=version ./topo_expl -m 55
+
+# Test a multi-node MI300 configuration with 8 nodes
+NCCL_DEBUG=version ./topo_expl -m 55 -n 8
+
+# Test MI250 configuration (model 42)
+NCCL_DEBUG=version ./topo_expl -m 42
+
+# Test a multi-node MI250 configuration with 4 nodes
+NCCL_DEBUG=version ./topo_expl -m 42 -n 4
+```
+
+
+## Sample output
+
+```bash
+# cmd used
+NCCL_DEBUG=version ./topo_expl -m 55 -n 8
+```
+
+```bash
+
+Running fp32 production choices for algorithm/protocol/maxChannels
+| Max Size(B) | Count | Collective | Algorithm | Protocol | Max Channels |
+|-----------------|-----------------|-----------------|------------|------------|--------------|
+| 32 | 8 | AllReduce | Tree | LL | 1 |
+| 64 | 16 | AllReduce | Tree | LL | 1 |
+| 128 | 32 | AllReduce | Tree | LL | 1 |
+| 256 | 64 | AllReduce | Tree | LL | 1 |
+| 512 | 128 | AllReduce | Tree | LL | 1 |
+| 1024 | 256 | AllReduce | Tree | LL | 1 |
+| 2048 | 512 | AllReduce | Tree | LL | 1 |
+| 4096 | 1024 | AllReduce | Tree | LL | 2 |
+| 8192 | 2048 | AllReduce | Tree | LL | 4 |
+| 16384 | 4096 | AllReduce | Tree | LL | 8 |
+| 32768 | 8192 | AllReduce | Tree | LL | 16 |
+| 65536 | 16384 | AllReduce | Tree | LL | 32 |
+| 131072 | 32768 | AllReduce | Tree | LL | 64 |
+| 262144 | 65536 | AllReduce | Tree | LL | 64 |
+| 524288 | 131072 | AllReduce | Tree | LL | 64 |
+| 1048576 | 262144 | AllReduce | Tree | LL | 64 |
+| 2097152 | 524288 | AllReduce | Tree | LL128 | 64 |
+| 4194304 | 1048576 | AllReduce | Tree | LL128 | 64 |
+| 8388608 | 2097152 | AllReduce | Tree | LL128 | 64 |
+| 16777216 | 4194304 | AllReduce | Tree | LL128 | 64 |
+| 33554432 | 8388608 | AllReduce | Tree | LL128 | 64 |
+| 67108864 | 16777216 | AllReduce | Tree | Simple | 64 |
+| 134217728 | 33554432 | AllReduce | Tree | Simple | 64 |
+| 268435456 | 67108864 | AllReduce | Tree | Simple | 64 |
+| 536870912 | 134217728 | AllReduce | Ring | Simple | 64 |
+| 1073741824 | 268435456 | AllReduce | Ring | Simple | 64 |
+| 2147483648 | 536870912 | AllReduce | Ring | Simple | 64 |
+| 4294967296 | 1073741824 | AllReduce | Ring | Simple | 64 |
+...
+```
+