feat: Add comprehensive RDMA cluster setup guide, enforce eager mode in cluster benchmarks, and update documentation with cluster details.

2026-02-02 19:34:33 +00:00
@@ -3,6 +3,14 @@
 A **Fedora 43** Docker/Podman container that is **Toolbx-compatible** (usable as a Fedora toolbox) for serving LLMs with **vLLM** on **AMD Ryzen AI Max “Strix Halo” (gfx1151)**. Built on the **TheRock nightly builds** for ROCm.


+---
+
+## 🚀 High-Performance Clustering Support (New!)
+
+**Update:** This toolbox now ships with a **custom build of ROCm/RCCL** that enables **native RDMA/RoCE v2 support for Strix Halo (gfx1151)**. You can connect two nodes over a low-latency interconnect (e.g., Intel E810) and run vLLM with Tensor Parallelism (TP=2), so the pair effectively acts as a single GPU with 256GB of unified memory.
+
+👉 **[Read the Full RDMA Cluster Setup Guide](rdma_cluster/setup_guide.md)** for hardware requirements and configuration instructions.
+
 ---

 ## Table of Contents
@@ -14,6 +22,7 @@ An **Fedora 43** Docker/Podman container that is **Toolbx-compatible** (usable a
 * [4) Testing the API](#4-testing-the-api)
 * [5) Use a Web UI for Chatting](#5-use-a-web-ui-for-chatting)
 * [6) Host Configuration](#6-host-configuration)
+* [7) Distributed Clustering (RDMA/RoCE)](#7-distributed-clustering-rdmaroce)


 ## Tested Models (Benchmarks)
@@ -165,17 +174,17 @@ This should work on any Strix Halo. For a complete list of available hardware, s
 | **CPU**           | Ryzen AI MAX+ 395 "Strix Halo"                              |
 | **System Memory** | 128 GB RAM                                                  |
 | **GPU Memory**    | 512 MB allocated in BIOS                                    |
-| **Host OS**       | Fedora 42, Linux 6.18.0-0.rc6.vanilla.fc42.x86_64           |
+| **Host OS**       | Fedora 43 (Rawhide), Linux 6.18.5-200.fc43.x86_64            |

 ### 6.2 Kernel Parameters (tested on Fedora 42)

 Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):

-amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
+iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856
 
 | Parameter                   | Purpose                                                                                    |
 |-----------------------------|--------------------------------------------------------------------------------------------|
-| `amd_iommu=off`             | Disables IOMMU for lower latency                                                           |
+| `iommu=pt`                  | Sets IOMMU to pass-through mode; reduces DMA overhead for better performance               |
 | `amdgpu.gttsize=126976`     | Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB                            |
 | `ttm.pages_limit=32505856`  | Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB                     |

@@ -188,4 +197,15 @@ Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?co
 # Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
 sudo grub2-mkconfig -o /boot/grub2/grub.cfg
 sudo reboot
-```
+```
+
+## 7) Distributed Clustering (RDMA/RoCE)
+
+This toolbox supports high-performance clustering of multiple Strix Halo nodes over InfiniBand or RoCE v2 (e.g., Intel E810). This enables **Tensor Parallelism** across machines with an interconnect latency of about 5µs.
+
+**Detailed Documentation:** [RDMA Cluster Setup Guide](rdma_cluster/setup_guide.md)
+
+**Key Features:**
+*   **Custom RCCL Patch:** A custom-built `librccl.so` adds RDMA support for `gfx1151`.
+*   **Easy Setup:** `refresh_toolbox.sh` automatically detects and exposes RDMA devices.
+*   **Cluster Management:** The included `start-vllm-cluster` TUI manages Ray and vLLM.
@@ -158,7 +158,8 @@ def get_model_args(model):
        
    if config.get("trust_remote"): cmd.append("--trust-remote-code")
    
-    if config.get("enforce_eager"): cmd.append("--enforce-eager")
+    # Force eager mode for cluster stability
+    cmd.append("--enforce-eager")
    
    return cmd

@@ -266,6 +267,7 @@ if __name__ == "__main__":

        
    log("Ray Cluster Detected. Starting Benchmarks (Dual Backend)...")
+    log("Note: Eager Mode (--enforce-eager) is ENABLED for cluster stability.")
    
    for m in MODELS_TO_RUN:
        run_cluster_throughput(m)
@@ -336,6 +336,26 @@
            margin-left: 12px;
        }

+        /* Info Box for Cluster */
+        .info-box {
+            background: #eff6ff;
+            border: 1px solid #bfdbfe;
+            border-radius: 8px;
+            padding: 16px;
+            margin-bottom: 24px;
+            display: flex;
+            align-items: flex-start;
+            gap: 12px;
+            color: #1e40af;
+            font-size: 0.95rem;
+        }
+
+        .info-box a {
+            color: #1d4ed8;
+            font-weight: 600;
+            text-decoration: underline;
+        }
+
        /* Footer */
        footer {
            margin-top: 60px;
@@ -432,12 +452,16 @@
            <div style="font-weight: 600; margin-bottom: 8px;">System Configuration</div>
            <div class="sys-config">
                <div class="sys-item">
-                    <span class="sys-label">System</span>
-                    <span>Framework Desktop · AMD Ryzen AI MAX 395+ · 128GB unified RAM</span>
+                    <span class="sys-label">Node 1 & 2</span>
+                    <span>Framework Desktop Mainboard · AMD Ryzen AI MAX 395+ (Strix Halo) · 128GB Unified RAM</span>
                </div>
                <div class="sys-item">
                    <span class="sys-label">OS/Kernel</span>
-                    <span>Fedora 42 · Linux 6.18.0-0.rc6.243.vanilla.fc42.x86_64</span>
+                    <span>Fedora 43 (Rawhide) · Linux 6.18.5-200.fc43.x86_64</span>
+                </div>
+                <div class="sys-item">
+                    <span class="sys-label">Interconnect</span>
+                    <span>RDMA (RoCE v2) via Intel E810 (Direct Attach) · ~5µs Latency</span>
                </div>
            </div>
        </footer>
@@ -633,6 +657,25 @@
            // Render Active Tab Content
            const test = activeTest;

+            // Cluster Info Box Logic
+            // If test name implies Tensor Parallelism > 1 (e.g. "Cluster", "TP=2", etc.)
+            // We default to checking if it's the "Throughput (Cluster)" tab or similar
+            if (test.name.toLowerCase().includes("tp=2") || test.name.toLowerCase().includes("cluster")) {
+                const infoBox = document.createElement('div');
+                infoBox.className = 'info-box';
+                infoBox.innerHTML = `
+                    <div style="font-size:1.2rem;">ℹ️</div>
+                    <div>
+                        <div style="font-weight:600; margin-bottom:4px;">Distributed Cluster (Tensor Parallelism = 2)</div>
+                        This benchmark runs on <b>2x Strix Halo nodes</b> connected via <b>Low-Latency RDMA (RoCE v2)</b>.
+                        The model is split across both APUs, effectively using 256GB of Unified Memory.
+                        <br><br>
+                        <a href="https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md" target="_blank" rel="noopener noreferrer">View Cluster Setup Guide &rarr;</a>
+                    </div>
+                `;
+                container.appendChild(infoBox);
+            }
+
            // Filter models within this test
            const models = test.models.filter(m => {
                const s = state.search;
@@ -645,7 +688,7 @@
            });

            if (models.length === 0) {
-                container.innerHTML = '<div id="loading">No models match current filters in this category.</div>';
+                container.innerHTML += '<div id="loading">No models match current filters in this category.</div>';
                return;
            }

@@ -0,0 +1,320 @@
+# AMD Strix Halo RDMA Cluster Setup Guide
+
+This guide details how to configure a two-node **AMD Strix Halo** cluster linked via **Intel E810 (RoCE v2)** for distributed vLLM inference using Tensor Parallelism.
+
+## Table of Contents
+
+1. [TL;DR (Quick Start)](#1-tldr-quick-start)
+2. [Concepts & Architecture](#2-concepts--architecture)
+3. [Hardware Prerequisites](#3-hardware-prerequisites)
+4. [Host Configuration (Fedora)](#4-host-configuration-fedora)
+    *   [4.1 Install Packages](#41-install-packages)
+    *   [4.2 Check Native Firmware](#42-check-native-firmware)
+    *   [4.3 Network Configuration](#43-network-configuration)
+    *   [4.4 BIOS & Kernel Configuration](#44-bios--kernel-configuration)
+    *   [4.5 Firewall Rules](#45-firewall-rules)
+5. [Toolbox Installation & Network Verification](#5-toolbox-installation--network-verification)
+    *   [5.1 Prerequisites: Passwordless SSH](#51-prerequisites-passwordless-ssh)
+    *   [5.2 Installation](#52-installation)
+    *   [5.3 Verify RDMA Connection](#53-verify-rdma-connection)
+6. [Running the Cluster](#6-running-the-cluster)
+    *   [6.1 Setup & Verify](#61-setup--verify)
+    *   [6.2 Launching vLLM](#62-launching-vllm)
+7. [Troubleshooting](#7-troubleshooting)
+
+---
+
+## 1. TL;DR (Quick Start)
+
+**On Both Nodes:**
+1.  **Preparation**:
+    *   **Install/Update Fedora 43** and fit the E810 NICs (check firmware with `ethtool -i <iface>`).
+    *   **BIOS/Kernel**: Set the iGPU memory allocation to 512MB and apply the kernel parameters (`iommu=pt`, `pci=realloc`, etc.).
+    *   **SSH**: Configure **passwordless SSH** between nodes.
+2.  **Networking**: Assign static IPs (`192.168.100.1` & `.2`), set MTU 9000, and trust the interface in firewall.
+3.  **Install Toolbox**: Run `./refresh_toolbox.sh` (this automatically installs the container with RDMA support and the custom `librccl.so` patch).
+4.  **Run Cluster**:
+    *   Run `start-vllm-cluster`.
+    *   Select **"2. Start Ray Cluster"** (Follow prompts using the TUI).
+    *   Select **"4. Launch VLLM Serve"** and choose your model. (Export `HF_TOKEN` first for gated models!)
+
+**Key Note**: The `refresh_toolbox.sh` script detects your InfiniBand/RDMA devices and automatically configures the container to expose them.
+
+---
+
+## 2. Concepts & Architecture
+
+To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:
+
+*   **vLLM**: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using **Tensor Parallelism (TP)**.
+*   **Ray**: A distributed computing framework. vLLM uses Ray to **orchestrate** the cluster, manage the "worker" processes on each node, and ensure they start up correctly. Ray handles the *control plane* (issuing commands).
+*   **RCCL (ROCm Collective Communication Library)**: The AMD equivalent of NVIDIA's NCCL. This library handles the **data plane**: the fast synchronization of tensor data between GPUs. With TP=2, the two nodes must exchange partial results within *every single layer* of the neural network on each forward pass, which adds up to thousands of exchanges per second.
+*   **RoCE v2 (RDMA over Converged Ethernet)**: The protocol that allows RCCL to write data directly from one node's memory to the other's, bypassing the CPU and OS kernel.
+    *   **Without RDMA**: Latency is ~70-100µs (TCP/IP overhead).
+    *   **With RDMA**: Latency is ~5µs.
+    *   **Why it matters**: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.
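
To make the latency argument concrete, here is a rough back-of-envelope sketch. It assumes the common tensor-parallel scheme of two all-reduce operations per transformer layer (one after attention, one after the MLP) and a hypothetical 32-layer model, and it counts only link latency, ignoring transfer time and compute. The function name and defaults are illustrative and are not part of this repository.

```python
def allreduce_latency_per_token_ms(num_layers: int,
                                   link_latency_us: float,
                                   allreduces_per_layer: int = 2) -> float:
    """Latency-only cost of tensor-parallel synchronization for one decoded token.

    Assumption: every all-reduce pays one link latency; bandwidth is ignored.
    """
    return num_layers * allreduces_per_layer * link_latency_us / 1000.0


if __name__ == "__main__":
    layers = 32  # assumption: a 32-layer model
    for label, latency_us in [("TCP/IP (~70 us)", 70.0), ("RDMA (~5 us)", 5.0)]:
        cost_ms = allreduce_latency_per_token_ms(layers, latency_us)
        print(f"{label}: {cost_ms:.2f} ms of link latency per token")
```

Under these assumptions the TCP/IP path spends several milliseconds per token on link latency alone, while the RDMA path stays well under one millisecond.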
+
+---
+
+## 3. Hardware Prerequisites
+
+*   **Nodes**: 2x [Framework Desktop Mainboards](https://frame.work/gb/en/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0006) with AMD Ryzen AI MAX+ "Strix Halo", 128GB of Unified Memory.
+*   **Network Cards**: [Intel Ethernet Controller E810-CQDA1](https://www.intel.com/content/www/us/en/products/sku/192558/intel-ethernet-network-adapter-e810cqda1/specifications.html) (or similar 100GbE QSFP28).
+*   **Connection**: Direct Attach Copper (DAC) cable (e.g., [QSFPTEK 100G QSFP28 DAC](https://www.amazon.co.uk/dp/B09F32F7VK)). No switch required for 2 nodes.
+*   **PCIe Note**: The Framework mainboard's PCIe slot is physically **x4**, so a riser is required to fit an x16 card (e.g., [CY PCI-E Express 4x to 16x Extender](https://www.amazon.co.uk/dp/B0837FZFJ6)). **Test Setup Note:** One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. **This is not recommended for users.** A riser is the cheaper, safer, and easier solution, and performance is identical (~50Gbps bandwidth, ~5µs latency).
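
The ~50Gbps figure on a 100GbE card is consistent with the x4 slot being the bottleneck. The sketch below estimates the raw link ceiling, assuming the slot runs at PCIe 4.0 speeds (16 GT/s per lane with 128b/130b encoding); the guide itself only states that the slot is x4, so treat the generation as an assumption. Packet and protocol overhead reduce the achievable rate further.

```python
def pcie_link_gbps(lanes: int,
                   gt_per_s: float = 16.0,          # assumption: PCIe 4.0 signalling rate
                   encoding: float = 128 / 130) -> float:
    """Raw PCIe link bandwidth in Gbps after line encoding, before protocol overhead."""
    return lanes * gt_per_s * encoding


if __name__ == "__main__":
    print(f"x4 link ceiling:  {pcie_link_gbps(4):.1f} Gbps")
    print(f"x16 link ceiling: {pcie_link_gbps(16):.1f} Gbps")
```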
+
+---
+
+## 4. Host Configuration (Fedora)
+
+Perform these steps on the **Host OS** (Fedora 43) of **both nodes**.
+
+**Tested Host Configuration:**
+
+| Node | Kernel | OS | IP (RDMA Interface) |
+| :--- | :--- | :--- | :--- |
+| **Node 1** | `6.18.5-200.fc43.x86_64` | Fedora Linux 43 (Rawhide) | `192.168.100.1/30` |
+| **Node 2** | `6.18.6-200.fc43.x86_64` | Fedora Linux 43 (Rawhide) | `192.168.100.2/30` |
+
+> **Note:** These specific kernel versions were verified to work. Fedora 43 is recommended.
+
+### 4.1 Install Packages
+
+Install the core RDMA userspace tools. You do **not** need proprietary Intel drivers; the in-kernel drivers work perfectly.
+*   **Ethernet Driver:** `ice`
+*   **RDMA Driver:** `irdma` (Unified driver for RoCE v2 & iWARP)
+
+```bash
+sudo dnf install rdma-core libibverbs-utils perftest
+```
+
+*   `rdma-core`: The userspace components for the RDMA subsystem (libraries, daemons, and configuration tools).
+*   `libibverbs-utils`: Utilities for querying RDMA devices (e.g., `ibv_devinfo`).
+*   `perftest`: A suite of benchmarks (e.g., `ib_write_bw`, `ib_send_lat`) to verify RDMA bandwidth and latency.
+
+### 4.2 Check Native Firmware
+
+Use `ethtool` to check the current firmware version of your Intel E810 card.
+
+```bash
+ethtool -i enp194s0np0
+```
+
+**Recommended Firmware:**
+Ensure your firmware is at least as new as the version shown below (Firmware `4.91...`). If your firmware is older, please update it using the [Intel® Ethernet NVM Update Tool for E810 Series](https://www.intel.com/content/www/us/en/download/19624/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-e810-series-linux.html).
+
+**Example Output:**
+```text
+driver: ice
+version: 6.18.5-200.fc43.x86_64
+firmware-version: 4.91 0x800214b5 1.3909.0
+expansion-rom-version: 
+bus-info: 0000:c2:00.0
+supports-statistics: yes
+supports-test: yes
+supports-eeprom-access: yes
+supports-register-dump: yes
+supports-priv-flags: yes
+```
+
+### 4.3 Network Configuration
+
+This guide assumes a subnet of `192.168.100.0/30`.
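
A `/30` subnet has exactly two usable host addresses, which is why it suits a direct two-node link. A quick check with Python's standard `ipaddress` module (the helper name is illustrative):

```python
import ipaddress


def usable_hosts(cidr: str) -> list:
    """Return the usable host addresses of a subnet as strings."""
    return [str(host) for host in ipaddress.ip_network(cidr).hosts()]


if __name__ == "__main__":
    # The two addresses assigned to Node 1 and Node 2 in this guide.
    print(usable_hosts("192.168.100.0/30"))
```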
+
+**Identify your interface**:
+Run `ip link` to find your 100GbE card (e.g., `enp194s0np0`).
+
+**Node 1 (Head - 192.168.100.1):**
+```bash
+# Bring link up
+sudo ip link set enp194s0np0 up
+
+# Assign IP
+sudo ip addr add 192.168.100.1/30 dev enp194s0np0
+
+# Set MTU (jumbo frames). Assumes a NetworkManager profile named "rdma0" already exists for this interface.
+sudo nmcli connection modify "rdma0" ethernet.mtu 9000
+sudo nmcli connection up "rdma0"
+```
+
+**Node 2 (Worker - 192.168.100.2):**
+```bash
+# Bring link up
+sudo ip link set enp194s0np0 up
+
+# Assign IP
+sudo ip addr add 192.168.100.2/30 dev enp194s0np0
+
+# Set MTU (jumbo frames). Assumes a NetworkManager profile named "rdma0" already exists for this interface.
+sudo nmcli connection modify "rdma0" ethernet.mtu 9000
+sudo nmcli connection up "rdma0"
+```
+
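+**Verify Routing:**
+Check that the route exists on both nodes (the kernel normally adds it when the IP is assigned), and add it only if it is missing:
+```bash
+ip route show dev enp194s0np0                         # should list 192.168.100.0/30
+sudo ip route add 192.168.100.0/30 dev enp194s0np0    # only if the route is missing
+```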
+
+**Verify Link:**
+```bash
+rdma link
+# Output should show: state ACTIVE physical_state LINK_UP netdev enp194s0np0
+```
+
+### 4.4 BIOS & Kernel Configuration
+
+**1. BIOS Settings:**
+Set the **iGPU Memory Allocation** to the **minimum possible (512MB)**. We will use the GTT (Graphics Translation Table) to dynamically allocate system memory as "Unified Memory" for the GPU.
+
+**2. Kernel Parameters:**
+Update GRUB to enable unified memory, optimize RDMA performance, and fix PCI resource allocation.
+
+Edit `/etc/default/grub` and append to `GRUB_CMDLINE_LINUX`:
+```text
+iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
+```
+
+**Explanation of Parameters:**
+*   `iommu=pt`: Sets IOMMU to "Pass-Through" mode. This is critical for performance, reducing overhead for both the RDMA NIC and the iGPU unified memory access.
+*   `pci=realloc`: Reallocates PCI BARs. Often needed on consumer platforms to properly map large address spaces for devices like the E810 or Strix Halo.
+*   `pcie_aspm=off`: Disables PCIe Active State Power Management. Prevents latency spikes and link negotiation issues on the 100GbE connection.
+*   `amdgpu.gttsize=126976`: Caps the GPU GTT size at 124GiB (126976MiB). This defines how much system RAM the GPU can address as its own "VRAM".
+*   `ttm.pages_limit=32505856`: Caps the memory that TTM (the kernel's GPU memory manager) can allocate at 124GiB (32505856 pages × 4KiB), matching the GTT size.
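
If you want to reserve a different amount of memory for the GPU, the two memory values can be derived from a target size in GiB. This is a minimal sketch of the arithmetic, assuming 4KiB pages; the helper name is illustrative.

```python
def unified_memory_params(gpu_gib: int, page_kib: int = 4) -> dict:
    """Derive amdgpu.gttsize (MiB) and ttm.pages_limit (pages) from a GiB target."""
    gtt_mib = gpu_gib * 1024                  # GiB -> MiB
    pages = gtt_mib * 1024 // page_kib        # MiB -> KiB -> pages
    return {"amdgpu.gttsize": gtt_mib, "ttm.pages_limit": pages}


if __name__ == "__main__":
    params = unified_memory_params(124)
    print(" ".join(f"{key}={value}" for key, value in params.items()))
```

For a 124GiB target this reproduces the values used in this guide.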
+
+**3. Apply Changes:**
+```bash
+sudo grub2-mkconfig -o /boot/grub2/grub.cfg
+sudo reboot
+```
+
+### 4.5 Firewall Rules
+
+Applications like Ray and RCCL use random high ports. It is easiest to trust the internal RDMA interface completely.
+
+```bash
+# Assign the interface to the trusted zone permanently
+sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0
+
+# Reload firewall
+sudo firewall-cmd --reload
+```
+
+---
+
+## 5. Toolbox Installation & Network Verification
+
+### 5.1 Prerequisites: Passwordless SSH
+
+The cluster management and verification scripts rely on SSH to execute commands on remote nodes. You must configure **passwordless SSH** between both nodes (root or sudo-enabled user).
+
+*   **Guide:** [How to Set Up SSH Keys on Linux (DigitalOcean)](https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys-on-ubuntu-20-04)
+*   **Quick Check:** Run `ssh <other-node-ip> date` from each node. It should print the date without asking for a password.
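
The quick check can also be scripted so that it fails instead of prompting. The sketch below relies on the standard OpenSSH options `BatchMode=yes` (never ask for a password) and `ConnectTimeout`; the function names and the example IP are illustrative and are not part of the provided scripts.

```python
import subprocess


def ssh_check_command(host: str, timeout_s: int = 5) -> list:
    """Build an ssh command that fails rather than prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes",
            "-o", f"ConnectTimeout={timeout_s}", host, "date"]


def passwordless_ssh_works(host: str) -> bool:
    """Return True if `ssh <host> date` succeeds without any interaction."""
    result = subprocess.run(ssh_check_command(host),
                            capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    peer = "192.168.100.2"  # run from Node 1; use 192.168.100.1 on Node 2
    print(f"{peer}: {'OK' if passwordless_ssh_works(peer) else 'FAILED'}")
```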
+
+### 5.2 Installation
+
+The toolbox container provided in this repo includes a **critical patch**: a custom-built `librccl.so` that enables RDMA on `gfx1151` (Strix Halo), which is currently missing from upstream ROCm packages. The patched source lives at https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl. The library is compiled by the [`build-rccl`](../.github/workflows/build-rccl.yml) GitHub Action in this repository, and the resulting artifact is bundled into the container image.
+
+To install the toolbox on **both nodes**, run:
+
+```bash
+./refresh_toolbox.sh
+```
+
+**What this does:**
+1.  Pulls the latest `kyuz0/vllm-therock-gfx1151` image.
+2.  Detects if `/dev/infiniband` exists on your host.
+3.  Creates the toolbox with flags to expose:
+    *   **iGPU Access**: `/dev/dri`, `/dev/kfd` (Required for ROCm)
+    *   **RDMA Access**: `/dev/infiniband`, `--group-add rdma`
+    *   **Memory Pinning**: `--ulimit memlock=-1` (Required so RDMA can pin memory for DMA)
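
If the cluster fails to start, it can help to confirm from inside the toolbox that the device nodes listed above are actually visible. This is a minimal sketch of such a check; it is not how `refresh_toolbox.sh` itself is implemented, and the helper name is illustrative.

```python
import os

# Device nodes named in this guide and what they are needed for.
REQUIRED_DEVICES = {
    "/dev/kfd": "ROCm compute",
    "/dev/dri": "iGPU access",
    "/dev/infiniband": "RDMA verbs",
}


def missing_devices(paths=tuple(REQUIRED_DEVICES), exists=os.path.exists) -> list:
    """Return the device paths that are not present on this system."""
    return [path for path in paths if not exists(path)]


if __name__ == "__main__":
    missing = missing_devices()
    for path in missing:
        print(f"missing {path} ({REQUIRED_DEVICES[path]})")
    if not missing:
        print("all expected device nodes are present")
```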
+
+### 5.3 Verify RDMA Connection
+
+Before running the cluster, verify that RDMA is active and delivering low latency (~5µs, versus ~70µs over TCP/IP on the same link).
+
+Run the provided verification script from the **Head Node**:
+
+```bash
+# Inside toolbox
+/opt/compare_eth_vs_rdma.sh
+```
+
+**Expected Results:**
+```text
+Path                 Latency      Bandwidth   
+------------------------------------------------
+Ethernet (1G LAN)    0.074 ms     0.94 Gbps   
+Ethernet (RoCE NIC)  0.068 ms     55.70 Gbps  
+RDMA (RoCE)          5.23 us      50.64 Gbps  
+```
+*Note the latency drop for RDMA: from about 68µs over TCP/IP on the same NIC to about 5µs, roughly 13x lower.*
+
+---
+
+## 6. Running the Cluster
+
+A TUI utility, `start-vllm-cluster`, is provided to manage the Ray cluster and vLLM.
+
+### 6.1 Setup & Verify
+
+1.  **Enter the toolbox**:
+    ```bash
+    toolbox enter vllm
+    ```
+2.  **Run the Cluster Manager**:
+    ```bash
+    start-vllm-cluster
+    ```
+3.  **Configure IPs** (Option 1):
+    *   Ensure Head is `192.168.100.1` and Worker is `192.168.100.2`.
+4.  **Start Ray Cluster** (Option 2):
+    *   **On Node 1**: Select **"Head"** when prompted.
+    *   **On Node 2**: Select **"Worker"** when prompted.
+    *   The script effectively runs:
+        ```bash
+        # Head
+        export NCCL_SOCKET_IFNAME=<rdma_iface>
+        ray start --head --node-ip-address=192.168.100.1 ...
+        
+        # Worker
+        ray start --address=192.168.100.1:6379 ...
+        ```
+5.  **Check Status** (Option 3):
+    *   Ensure you see **2 nodes** and adequate GPU resources (e.g., `2.0 GPU`).
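
The same status check can be done programmatically from inside the toolbox on the head node. The sketch below uses Ray's `ray.nodes()` and `ray.cluster_resources()` calls; the `cluster_ready` helper and its thresholds are illustrative and are not part of `start-vllm-cluster`.

```python
def cluster_ready(resources: dict, nodes: list,
                  want_nodes: int = 2, want_gpus: float = 2.0) -> bool:
    """Return True if enough alive nodes and GPUs are registered with Ray."""
    alive = [node for node in nodes if node.get("Alive")]
    return len(alive) >= want_nodes and resources.get("GPU", 0.0) >= want_gpus


if __name__ == "__main__":
    import ray  # imported here so the helper above can be used without Ray installed

    ray.init(address="auto")  # attach to the already running cluster
    print("cluster ready:", cluster_ready(ray.cluster_resources(), ray.nodes()))
```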
+
+### 6.2 Launching vLLM
+
+Once the cluster is active (checked via Option 3):
+
+1.  Select **"4. Launch VLLM Serve"** in the TUI.
+2.  Choose a model (e.g., `Meta-Llama-3.1-8B-Instruct`).
+3.  **Configuration Menu**:
+    *   **Tensor Parallelism**: Set to `2` (one GPU per node).
+    *   **Context Length**: Auto or custom (e.g., `131072`).
+    *   **Erase vLLM Cache**: Select `YES` if you are restarting after a crash.
+    *   **Force Eager Mode**: Select `YES`.
+        *   *Why?* CUDA Graphs (HIP Graphs on ROCm) can be unstable on distributed APU clusters and cause deadlocks. Eager mode is safer; disabling it may gain 1-3% performance at the risk of hangs.
+4.  **Launch**: Select "LAUNCH SERVER".
+
+**Important Gotchas:**
+*   **First Run Download**: When running a model for the first time, each node in the cluster must download the weights independently. This may take some time depending on your internet connection.
+*   **Gated Models (e.g., Gemma)**:
+    *   Models like `google/gemma-2-27b-it` are "gated" and require you to request access on Hugging Face.
+    *   You must export your Hugging Face token before running the cluster script:
+        ```bash
+        export HF_TOKEN=your_token_here
+        start-vllm-cluster
+        ```
+    *   If you don't provide a token or haven't accepted the license on Hugging Face, the download will fail.
+
+---
+
+## 7. Troubleshooting
+
+### vLLM Deadlocks / Hangs
+*   **Cause**: CUDA Graph capture can freeze on distributed APU nodes.
+*   **Fix**: Enable **"Force Eager Mode"** in the start menu.
+
+### Firmware
+If you see link issues, ensure your Intel E810 firmware is up to date using the Intel NVM update tool linked in section 4.2.
@@ -1,5 +1,12 @@
 # Issue Report: vLLM Tensor Parallelism over RDMA on AMD Strix Halo

+> **✅ RESOLVED (Feb 2, 2026)**
+> This issue is **SOLVED**. The root cause was indeed missing `gfx1151` support in the upstream RCCL library.
+>
+> I have patched and built a custom version of RCCL with native `gfx1151` support. This patched library is **now included** in the toolbox container provided by this repository (`kyuz0/vllm-therock-gfx1151`).
+>
+> See the [RDMA Cluster Setup Guide](setup_guide.md) for instructions on how to run the cluster using the fixed container.
+
 ## TL;DR
 I am attempting to run vLLM with Tensor Parallelism across two AMD Strix Halo (Ryzen AI MAX+ 395) nodes connected directly via Intel E810-CQDA1 RDMA (RoCE v2).

@@ -180,7 +180,8 @@ def configure_and_launch_vllm(model_idx, head_ip):
    current_util = verified["util"]
    
    clear_cache = False
-    use_eager = True # Default True for cluster as per request ("enforce-eager")
+    # Default to eager mode for stability in cluster situations, especially at high concurrency
+    use_eager = True
    trust_remote = True # Default True as per request
    
    while True:
@@ -315,6 +316,8 @@ def configure_and_launch_vllm(model_idx, head_ip):
    print(f" Launching VLLM Cluster on Head: {head_ip}")
    print(f" Model:     {name}")
    print(f" Config:    TP={current_tp} | Seqs={current_seqs} | Ctx={current_ctx}")
+    if use_eager:
+        print(" Note:      Eager Mode Enabled (Recommended for Cluster Stability)")
    print(f" Command:   {' '.join(cmd)}")
    print("="*60 + "\n")