diff --git a/rdma_cluster/cluster.png b/rdma_cluster/cluster.png
new file mode 100644
index 0000000..853d28d
Binary files /dev/null and b/rdma_cluster/cluster.png differ
diff --git a/rdma_cluster/concepts.png b/rdma_cluster/concepts.png
new file mode 100644
index 0000000..2f81354
Binary files /dev/null and b/rdma_cluster/concepts.png differ
diff --git a/rdma_cluster/setup_guide.md b/rdma_cluster/setup_guide.md
index 9e40569..bd45053 100644
--- a/rdma_cluster/setup_guide.md
+++ b/rdma_cluster/setup_guide.md
@@ -45,6 +45,8 @@ This guide details how to configure a two-node **AMD Strix Halo** cluster linked
 ## 2. Concepts & Architecture
+![concepts](concepts.png)
+
 To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:
 * **vLLM**: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using **Tensor Parallelism (TP)**.
@@ -55,15 +57,20 @@ To fully utilize the Strix Halo cluster, it is helpful to understand the technol
 * **With RDMA**: Latency is ~5µs.
 * **Why it matters**: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.
+
 ---
 ## 3. Hardware Prerequisites
+![cluster](cluster.png)
+
+
 * **Nodes**: 2x [Framework Desktop Mainboards](https://frame.work/gb/en/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0006) with AMD Ryzen AI MAX+ "Strix Halo", 128GB of Unified Memory.
 * **Network Cards**: [Intel Ethernet Controller E810-CQDA1](https://www.intel.com/content/www/us/en/products/sku/192558/intel-ethernet-network-adapter-e810cqda1/specifications.html) (or similar 100GbE QSFP28).
 * **Connection**: Direct Attach Copper (DAC) cable (e.g., [QSFPTEK 100G QSFP28 DAC](https://www.amazon.co.uk/dp/B09F32F7VK)). No switch required for 2 nodes.
 * **PCIe Note**: The Framework motherboard PCIe slot is physically **x4**, so a riser is required to plug in a 16x card (e.g., [CY PCI-E Express 4x to 16x Extender](https://www.amazon.co.uk/dp/B0837FZFJ6)). **Test Setup Note:** One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. **This is not recommended for users.** Risers are the cheaper, safer, and easier solution. Performance is identical (~50Gbps bandwidth, ~5µs latency).
+
 ---
 ## 4. Host Configuration (Fedora)