feat: set LD_LIBRARY_PATH to include ROCm library directories.

chore: remove fix_block_size.py script and its execution from the Dockerfile.
fix: Remove vLLM block size validation checks by adding and running a new patching script in the Dockerfile.
2026-03-14 13:41:09 +00:00 · 2026-03-14 13:18:56 +00:00 · 2026-03-13 16:29:01 +00:00 · 2026-03-09 14:07:43 +00:00 · 2026-03-09 12:13:27 +00:00 · 2026-03-09 11:08:11 +00:00
@@ -1,2 +1,3 @@
 *.pyc
 __pycache__/
+settings.json
@@ -18,7 +18,7 @@ RUN chmod +x /tmp/install_rocm_sdk.sh && \
  /tmp/install_rocm_sdk.sh

 # 4. Python Venv Setup
-RUN /usr/bin/python3.13 -m venv /opt/venv
+RUN /usr/bin/python3.12 -m venv /opt/venv
 ENV VIRTUAL_ENV=/opt/venv
 ENV PATH=/opt/venv/bin:$PATH
 ENV PIP_NO_CACHE_DIR=1
@@ -34,6 +34,7 @@ WORKDIR /opt

 # Flash-Attention
 ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
+ENV LD_LIBRARY_PATH="/opt/rocm/lib:/opt/rocm/lib64:$LD_LIBRARY_PATH"

 RUN git clone https://github.com/ROCm/flash-attention.git &&\ 
  cd flash-attention &&\
@@ -46,28 +47,8 @@ RUN git clone https://github.com/vllm-project/vllm.git /opt/vllm
 WORKDIR /opt/vllm

 # --- PATCHING ---
-RUN echo "import sys, re" > patch_strix.py && \
-  echo "from pathlib import Path" >> patch_strix.py && \
-  # Patch 1: __init__.py
-  echo "p = Path('vllm/platforms/__init__.py')" >> patch_strix.py && \
-  echo "txt = p.read_text()" >> patch_strix.py && \
-  echo "txt = txt.replace('import amdsmi', '# import amdsmi')" >> patch_strix.py && \
-  echo "txt = re.sub(r'is_rocm = .*', 'is_rocm = True', txt)" >> patch_strix.py && \
-  echo "txt = re.sub(r'if len\(amdsmi\.amdsmi_get_processor_handles\(\)\) > 0:', 'if True:', txt)" >> patch_strix.py && \
-  echo "txt = txt.replace('amdsmi.amdsmi_init()', 'pass')" >> patch_strix.py && \
-  echo "txt = txt.replace('amdsmi.amdsmi_shut_down()', 'pass')" >> patch_strix.py && \
-  echo "p.write_text(txt)" >> patch_strix.py && \
-  # Patch 2: rocm.py
-  echo "p = Path('vllm/platforms/rocm.py')" >> patch_strix.py && \
-  echo "txt = p.read_text()" >> patch_strix.py && \
-  echo "header = 'import sys\nfrom unittest.mock import MagicMock\nsys.modules[\"amdsmi\"] = MagicMock()\n'" >> patch_strix.py && \
-  echo "txt = header + txt" >> patch_strix.py && \
-  echo "txt = re.sub(r'device_type = .*', 'device_type = \"rocm\"', txt)" >> patch_strix.py && \  
-  echo "txt = re.sub(r'device_name = .*', 'device_name = \"gfx1151\"', txt)" >> patch_strix.py && \ 
-  echo "txt += '\n    def get_device_name(self, device_id: int = 0) -> str:\n        return \"AMD-gfx1151\"\n'" >> patch_strix.py && \
-  echo "p.write_text(txt)" >> patch_strix.py && \
-  echo "print('Successfully patched vLLM for Strix Halo')" >> patch_strix.py && \
-  python patch_strix.py && \
+COPY scripts/patch_strix.py /opt/vllm/patch_strix.py
+RUN python /opt/vllm/patch_strix.py && \
  sed -i 's/gfx1200;gfx1201/gfx1151/' CMakeLists.txt  

 # 7. Build vLLM (Wheel Method) with CLANG Host Compiler
@@ -127,10 +108,12 @@ COPY scripts/99-toolbox-banner.sh /etc/profile.d/99-toolbox-banner.sh
 COPY scripts/zz-venv-last.sh /etc/profile.d/zz-venv-last.sh
 COPY scripts/start_vllm.py /opt/start-vllm
 COPY scripts/start_vllm_cluster.py /opt/start-vllm-cluster
+COPY scripts/measure_bandwidth.sh /opt/measure_bandwidth.sh
 COPY scripts/cluster_manager.py /opt/cluster_manager.py
 COPY scripts/models.py /opt/models.py

 COPY benchmarks/max_context_results.json /opt/max_context_results.json
+COPY benchmarks/bench_utils.py /opt/bench_utils.py
 COPY benchmarks/run_vllm_bench.py /opt/run_vllm_bench.py
 COPY benchmarks/vllm_cluster_bench.py /opt/vllm_cluster_bench.py
 COPY benchmarks/find_max_context.py /opt/find_max_context.py
@@ -13,6 +13,16 @@ An **Fedora 43** Docker/Podman container that is **Toolbx-compatible** (usable a

 ---

+### 📦 Project Context
+
+This repository is part of the **[Strix Halo AI Toolboxes](https://strix-halo-toolboxes.com)** project. Check out the website for an overview of all toolboxes, tutorials, and host configuration guides.
+
+### ❤️ Support
+
+This is a hobby project maintained in my spare time. If you find these toolboxes and tutorials useful, you can **[buy me a coffee](https://buymeacoffee.com/dcapitella)** to support the work! ☕
+
+---
+
 ## Table of Contents

 * [Tested Models (Benchmarks)](#tested-models-benchmarks)
@@ -0,0 +1,14 @@
+import subprocess
+import tempfile
+
+def run_dialog(args):
+    """Runs dialog and returns stderr (selection line). Returns None if user cancelled."""
+    with tempfile.NamedTemporaryFile(mode="w+") as tf:
+        cmd = ["dialog"] + args
+        try:
+            # We don't trap stdout since dialog renders to TTY and writes choice to stderr
+            subprocess.run(cmd, stderr=tf, check=True)
+            tf.seek(0)
+            return tf.read().strip()
+        except subprocess.CalledProcessError:
+            return None # User cancelled/pressed ESC
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 524.2037815230142,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.3815310134141399,
+    "tokens_per_second": 280.05330212131406
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 485.412814248004,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.4120204373051785,
+    "tokens_per_second": 302.43330149293365
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 421.75657659699937,
+    "elapsed_time": 424.04632396099623,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.4742071875054738,
-    "tokens_per_second": 348.0799308087054
+    "requests_per_second": 0.4716465836369236,
+    "tokens_per_second": 346.2003835540928
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 868.8101008250001,
+    "elapsed_time": 918.187000697013,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.2301999019234296,
-    "tokens_per_second": 168.9724830093454
+    "requests_per_second": 0.21782055272855774,
+    "tokens_per_second": 159.8857312165796
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 456.08530166203855,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.4385144604993234,
+    "tokens_per_second": 321.88057686801585
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 503.28860085096676,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.397386310084984,
+    "tokens_per_second": 291.6914862601304
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 457.7749735690013,
+    "elapsed_time": 458.737264430034,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.4368958801760569,
-    "tokens_per_second": 320.69249844623016
+    "requests_per_second": 0.4359794058773347,
+    "tokens_per_second": 320.0197833991106
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 644.1538858940003,
+    "elapsed_time": 686.8188757880125,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.3104848148551126,
-    "tokens_per_second": 227.90361622402403
+    "requests_per_second": 0.29119758796747197,
+    "tokens_per_second": 213.74630950782364
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 534.8865945799625,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.3739110346503573,
+    "tokens_per_second": 274.46004720922855
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 571.4193902639672,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.35000562355367393,
+    "tokens_per_second": 256.91287782898553
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 534.4193308840004,
+    "elapsed_time": 524.8208868440124,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.3742379596733028,
-    "tokens_per_second": 274.7000183491961
+    "requests_per_second": 0.38108239403864297,
+    "tokens_per_second": 279.7240042842149
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 733.5017090729998,
+    "elapsed_time": 789.1420173590304,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.2726646680247824,
-    "tokens_per_second": 200.1426829468909
+    "requests_per_second": 0.2534398062712803,
+    "tokens_per_second": 186.03115379827653
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 805.9022228560061,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.24816906360082697,
+    "tokens_per_second": 182.16229690959702
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 824.4905019259895,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.2425740497104635,
+    "tokens_per_second": 178.05541683872298
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 879.0596038709991,
+    "elapsed_time": 748.1414223780157,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.22751585799106944,
-    "tokens_per_second": 167.00232766189475
+    "requests_per_second": 0.2673291359329993,
+    "tokens_per_second": 196.2262690032198
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 1109.9732099440007,
+    "elapsed_time": 1168.3619703819859,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.18018452896722634,
-    "tokens_per_second": 132.2599488751683
+    "requests_per_second": 0.17117982703135376,
+    "tokens_per_second": 125.65027253668944
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 510.63144373201067,
+    "num_requests": 200,
+    "total_num_tokens": 148857,
+    "requests_per_second": 0.391671923958063,
+    "tokens_per_second": 291.5155379231269
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 572.292031740013,
+    "num_requests": 200,
+    "total_num_tokens": 148857,
+    "requests_per_second": 0.3494719285046033,
+    "tokens_per_second": 260.10671430704866
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 504.69023761399876,
+    "elapsed_time": 520.7929677469656,
    "num_requests": 200,
    "total_num_tokens": 148857,
-    "requests_per_second": 0.39628268013570256,
-    "tokens_per_second": 294.9472545848014
+    "requests_per_second": 0.3840297630462106,
+    "tokens_per_second": 285.8275921888489
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 876.911706677,
+    "elapsed_time": 930.6109793490032,
    "num_requests": 200,
    "total_num_tokens": 148857,
-    "requests_per_second": 0.22807313265081958,
-    "tokens_per_second": 169.75141153501525
+    "requests_per_second": 0.2149125729635249,
+    "tokens_per_second": 159.95620436815713
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 237.61095946098794,
+    "num_requests": 200,
+    "total_num_tokens": 145877,
+    "requests_per_second": 0.8417120172137385,
+    "tokens_per_second": 613.9321196754427
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 284.23000320699066,
+    "num_requests": 200,
+    "total_num_tokens": 145877,
+    "requests_per_second": 0.7036554823325597,
+    "tokens_per_second": 513.235753981134
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 244.51837097500174,
+    "elapsed_time": 247.22850671299966,
    "num_requests": 200,
    "total_num_tokens": 145877,
-    "requests_per_second": 0.8179344529513773,
-    "tokens_per_second": 596.5891209659404
+    "requests_per_second": 0.8089681997399035,
+    "tokens_per_second": 590.0492703672895
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 380.55349342600005,
+    "elapsed_time": 395.08209386101225,
    "num_requests": 200,
    "total_num_tokens": 145877,
-    "requests_per_second": 0.5255502930730307,
-    "tokens_per_second": 383.3285005130725
+    "requests_per_second": 0.5062239041143659,
+    "tokens_per_second": 369.23212230245684
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1361.426551499986,
+    "num_requests": 200,
+    "total_num_tokens": 146523,
+    "requests_per_second": 0.14690473002722398,
+    "tokens_per_second": 107.62460878889469
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1474.255295659008,
+    "num_requests": 200,
+    "total_num_tokens": 146523,
+    "requests_per_second": 0.13566171380825723,
+    "tokens_per_second": 99.38780646163637
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1482.2689266130328,
+    "num_requests": 200,
+    "total_num_tokens": 146523,
+    "requests_per_second": 0.13492828218223374,
+    "tokens_per_second": 98.85048345093716
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1724.1368565150187,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.11600007229371459,
+    "tokens_per_second": 85.2809331488931
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1338.8605944840237,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.1493807501871223,
+    "tokens_per_second": 109.82173992256857
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 1307.2402118169994,
+    "elapsed_time": 1199.1163451180328,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.15299406963775225,
-    "tokens_per_second": 112.4781801162827
+    "requests_per_second": 0.16678948695367285,
+    "tokens_per_second": 122.62029501860121
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 1886.751298176,
+    "elapsed_time": 1959.4152568069985,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.10600231211890418,
-    "tokens_per_second": 77.93077982357597
+    "requests_per_second": 0.10207126810164463,
+    "tokens_per_second": 75.0407548829671
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 243.98866786801955,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.819710201082723,
+    "tokens_per_second": 602.6345456319963
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 282.0738571010297,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.7090341588386423,
+    "tokens_per_second": 521.2677328949931
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 247.62527259899798,
+    "elapsed_time": 242.14750060701044,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.8076720033495051,
-    "tokens_per_second": 593.7843034224891
+    "requests_per_second": 0.825942863331829,
+    "tokens_per_second": 607.216674264294
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 341.2666312900001,
+    "elapsed_time": 357.72086531698005,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.5860520240258851,
-    "tokens_per_second": 430.8537270233502
+    "requests_per_second": 0.5590951476167821,
+    "tokens_per_second": 411.03557062490586
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 486.3392907420057,
+    "num_requests": 200,
+    "total_num_tokens": 146278,
+    "requests_per_second": 0.41123553824915293,
+    "tokens_per_second": 300.773560320048
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 455.7690629530116,
+    "num_requests": 200,
+    "total_num_tokens": 146278,
+    "requests_per_second": 0.4388187269758136,
+    "tokens_per_second": 320.9476287228403
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 422.7612150579989,
+    "elapsed_time": 398.827027003048,
    "num_requests": 200,
    "total_num_tokens": 146278,
-    "requests_per_second": 0.47308029420949094,
-    "tokens_per_second": 346.0061963818796
+    "requests_per_second": 0.5014705284716613,
+    "tokens_per_second": 366.77052981888835
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 594.5536415039987,
+    "elapsed_time": 610.5734472059994,
    "num_requests": 200,
    "total_num_tokens": 146278,
-    "requests_per_second": 0.33638680522429343,
-    "tokens_per_second": 246.02994547299596
+    "requests_per_second": 0.32756091984544267,
+    "tokens_per_second": 239.57478116575834
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 497.111974740983,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.40232384284085837,
+    "tokens_per_second": 295.31575874126105
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 471.133652363962,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.4245079904534079,
+    "tokens_per_second": 311.59947769256274
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 395.26841144900027,
+    "elapsed_time": 399.3928133630543,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.5059852854591319,
-    "tokens_per_second": 371.4058491591393
+    "requests_per_second": 0.5007601371589951,
+    "tokens_per_second": 367.5704596781314
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 769.1666062429999,
+    "elapsed_time": 813.6141017450136,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.260021688898978,
-    "tokens_per_second": 190.86242019407229
+    "requests_per_second": 0.24581678165489804,
+    "tokens_per_second": 180.43566315423652
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 456.45958357997006,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.4381548929949473,
+    "tokens_per_second": 321.6166453306162
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 490.5911466999678,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.40767144157681784,
+    "tokens_per_second": 299.24102990342374
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 464.71097393700256,
+    "elapsed_time": 440.66104900900973,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.43037503139986644,
-    "tokens_per_second": 315.906032423287
+    "requests_per_second": 0.4538635771184551,
+    "tokens_per_second": 333.147212194374
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 638.3282979609994,
+    "elapsed_time": 683.9224744850071,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.31331839844615444,
-    "tokens_per_second": 229.9835374194385
+    "requests_per_second": 0.29243080533447857,
+    "tokens_per_second": 214.65152188564062
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 517.5916094129789,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.38640502736670695,
+    "tokens_per_second": 283.6309502128471
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 548.4156070559984,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.3646869225214776,
+    "tokens_per_second": 267.6893183038276
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 502.6907218439992,
+    "elapsed_time": 497.59323585999664,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.3978589444944367,
-    "tokens_per_second": 292.0384117325289
+    "requests_per_second": 0.4019347241614679,
+    "tokens_per_second": 295.0301359026215
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 721.7994779089986,
+    "elapsed_time": 780.1687226030044,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.2770852655357769,
-    "tokens_per_second": 203.38751203489863
+    "requests_per_second": 0.2563548040386794,
+    "tokens_per_second": 188.17083503449163
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 802.5698999410379,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.24919947784572202,
+    "tokens_per_second": 182.9186467257061
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 839.8958681730437,
+    "num_requests": 200,
+    "total_num_tokens": 146805,
+    "requests_per_second": 0.23812475757863116,
+    "tokens_per_second": 174.78952518165474
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 886.8526372269989,
+    "elapsed_time": 757.2171181479935,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.2255166096425645,
-    "tokens_per_second": 165.5348293928834
+    "requests_per_second": 0.2641250378612165,
+    "tokens_per_second": 193.87438091607942
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 1084.3601952080007,
+    "elapsed_time": 1144.2253085140255,
    "num_requests": 200,
    "total_num_tokens": 146805,
-    "requests_per_second": 0.18444055848217136,
-    "tokens_per_second": 135.3839809398758
+    "requests_per_second": 0.1747907501362075,
+    "tokens_per_second": 128.30078036872973
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 373.92354663898004,
+    "num_requests": 200,
+    "total_num_tokens": 148857,
+    "requests_per_second": 0.5348686965496139,
+    "tokens_per_second": 398.09474781142933
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 434.26602390100015,
+    "num_requests": 200,
+    "total_num_tokens": 148857,
+    "requests_per_second": 0.46054719686197254,
+    "tokens_per_second": 342.7783704164132
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 369.2837602610016,
+    "elapsed_time": 374.03978066996206,
    "num_requests": 200,
    "total_num_tokens": 148857,
-    "requests_per_second": 0.5415889392445647,
-    "tokens_per_second": 403.09652364564084
+    "requests_per_second": 0.5347024844303181,
+    "tokens_per_second": 397.9710386242193
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 509.0738683320001,
+    "elapsed_time": 555.4390292470343,
    "num_requests": 200,
    "total_num_tokens": 148857,
-    "requests_per_second": 0.39287029337276264,
-    "tokens_per_second": 292.4074663029466
+    "requests_per_second": 0.36007552488906747,
+    "tokens_per_second": 267.99881204205957
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 213.75922767800512,
+    "num_requests": 200,
+    "total_num_tokens": 145877,
+    "requests_per_second": 0.9356321229849724,
+    "tokens_per_second": 682.4360360233941
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 264.80451649799943,
+    "num_requests": 200,
+    "total_num_tokens": 145877,
+    "requests_per_second": 0.7552741269105618,
+    "tokens_per_second": 550.8856190566602
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 224.76228898300178,
+    "elapsed_time": 224.3753512299736,
    "num_requests": 200,
    "total_num_tokens": 145877,
-    "requests_per_second": 0.8898289873490544,
-    "tokens_per_second": 649.02791593759
+    "requests_per_second": 0.8913635071929533,
+    "tokens_per_second": 650.1471716939323
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 322.171811016,
+    "elapsed_time": 336.45260514499387,
    "num_requests": 200,
    "total_num_tokens": 145877,
-    "requests_per_second": 0.620786776376495,
-    "tokens_per_second": 452.7925628873698
+    "requests_per_second": 0.5944373648520577,
+    "tokens_per_second": 433.5736973626181
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1484.8385301349917,
+    "num_requests": 200,
+    "total_num_tokens": 146523,
+    "requests_per_second": 0.1346947805710681,
+    "tokens_per_second": 98.67941666807306
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1493.5249834020506,
+    "num_requests": 200,
+    "total_num_tokens": 146523,
+    "requests_per_second": 0.13391138562973798,
+    "tokens_per_second": 98.10548978313048
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1707.9124416089617,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.11710202181769186,
+    "tokens_per_second": 86.0910643999307
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 1320.7432732739835,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.1514298834959962,
+    "tokens_per_second": 111.32822174858649
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 1315.035868578001,
+    "elapsed_time": 1242.463667072996,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.15208710635115047,
-    "tokens_per_second": 111.8113988472388
+    "requests_per_second": 0.16097050183460196,
+    "tokens_per_second": 118.34229353876268
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 1923.4690410719995,
+    "elapsed_time": 1966.935257990961,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.10397879858182421,
-    "tokens_per_second": 76.44313314138553
+    "requests_per_second": 0.10168102848706935,
+    "tokens_per_second": 74.75385852312364
 }
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 299.5004001749912,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.6677787404729495,
+    "tokens_per_second": 490.93757442090305
+}
@@ -0,0 +1,7 @@
+{
+    "elapsed_time": 282.87595599895576,
+    "num_requests": 200,
+    "total_num_tokens": 147036,
+    "requests_per_second": 0.707023682142636,
+    "tokens_per_second": 519.7896706376232
+}
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 246.0529060009976,
+    "elapsed_time": 244.54776988498634,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.8128333180474167,
-    "tokens_per_second": 597.5787987620997
+    "requests_per_second": 0.8178361229548825,
+    "tokens_per_second": 601.2567608739705
 }
@@ -1,7 +1,7 @@
 {
-    "elapsed_time": 333.59849170300004,
+    "elapsed_time": 362.9645123449736,
    "num_requests": 200,
    "total_num_tokens": 147036,
-    "requests_per_second": 0.5995230943012126,
-    "tokens_per_second": 440.75738846836555
+    "requests_per_second": 0.5510180560294371,
+    "tokens_per_second": 405.0974544317216
 }
@@ -15,18 +15,21 @@ except ImportError:
    print("Error: 'transformers' not found. Please install it or run in vLLM environment.")
    sys.exit(1)

+
 # Import path handling for scripts/models.py
 try:
    import sys, os
    sys.path.append(str(Path(__file__).parent.parent / "scripts"))
    import models
+    import cluster_manager # Import shared cluster logic
 except ImportError:
-    print("Error: Could not import scripts/models.py.")
+    print("Error: Could not import scripts/models.py or cluster_manager.py.")
    sys.exit(1)

 # Import Utils from run_vllm_bench (keep utils shared)
 try:
-    from run_vllm_bench import get_gpu_count, kill_vllm
+    from run_vllm_bench import kill_vllm
+    # We do NOT import get_gpu_count because we are overriding it for cluster awareness
 except ImportError:
    print("Error: Could not import run_vllm_bench.py.")
    sys.exit(1)
@@ -65,7 +68,30 @@ CONCURRENCY_STEPS = [1, 4, 8, 16]

 def log(msg):    print(f"[MAX-CTX] {msg}", flush=True)

+def get_gpu_count():
+    """
+    Returns total GPUs. 
+    If Ray Cluster is active, returns TOTAL cluster GPUs (e.g., 2).
+    Otherwise returns local AMD GPUs.
+    """
+    if cluster_manager.check_ray_status():
+        # Ideally we'd query Ray for total resources, but for this specific 2-node setup:
+        # If cluster is up, we assume 2 nodes x 1 GPU = 2 GPUs.
+        # Constructing a Ray client just to count is slow/complex here.
+        log("Ray Cluster Detected: Assuming 2 GPUs available.")
+        return 2
+        
+    # Local Fallback
+    try:
+        res = subprocess.run("rocm-smi --showid", shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+        if res.returncode == 0:
+            return res.stdout.count("GPU")
+    except: pass
+    return 1
+
+
 def get_hf_context_limit(model_name, trust_remote=False):
+    # ... (Keep existing implementation)
    try:
        cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote)

@@ -95,6 +121,7 @@ def get_hf_context_limit(model_name, trust_remote=False):
 def get_vllm_server_cmd(model, tp_size, util, max_len, max_seqs):
    """
    Constructs the vLLM serve command.
+    Using Ray Backend if tp_size > 1 (Cluster Mode).
    """
    config = MODEL_TABLE[model]
    
@@ -105,16 +132,47 @@ def get_vllm_server_cmd(model, tp_size, util, max_len, max_seqs):
        "--tensor-parallel-size", str(tp_size),
        "--max-num-seqs", str(max_seqs),
        "--dtype", "auto",
-        # "--disable-log-stats" # Cleaner output, but user managed without it
+        # "--disable-log-stats" 
    ]
    
+    # Env Setup
+    env = os.environ.copy()
+    env["VLLM_DISABLE_COMPILE_CACHE"] = "1"
+    env.update(config.get("env", {}))
+
+    # CLUSTER / RAY LOGIC
+    # Only if we need more than 1 GPU do we engage the cluster machinery
+    if tp_size > 1:
+        log(f"TP={tp_size} > 1: Using Ray Distributed Backend")
+        cmd.extend(["--distributed-executor-backend", "ray"])
+        
+        # Inject Cluster Env Vars (similar to start_vllm_cluster.py)
+        # We need to know Head IP and RDMA Interface
+        rdma_iface = cluster_manager.get_net_iface()
+        head_ip = cluster_manager.get_local_ip(rdma_iface) # Assuming we run this ON HEAD
+        
+        # IMPORTANT: vLLM needs to bind to the Head IP for Ray workers to reach it? 
+        # Or at least we should be explicit.
+        cmd.extend(["--host", head_ip])
+        
+        # Update our own process env so verify_context knows where to look?
+        # No, verify_context runs in THIS process. We need to export it or pass it.
+        # Simplest is to set it in os.environ for OUR process too, but that might be messy.
+        # Better: We rely on standard PORT.
+        
+        env["RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"] = "1"
+        env["VLLM_HOST_IP"] = head_ip
+        env["NCCL_SOCKET_IFNAME"] = rdma_iface
+        env["NCCL_IB_GID_INDEX"] = "1"
+        env["NCCL_IB_DISABLE"] = "0"
+        env["NCCL_NET_GDR_LEVEL"] = "0"
+    else:
+        # Default Localhost bind for single node safety
+        cmd.extend(["--host", "127.0.0.1"])
+        
    if config.get("trust_remote"): cmd.append("--trust-remote-code")
    if config.get("enforce_eager"): cmd.append("--enforce-eager")
    
-    # Add model specific env vars
-    env = os.environ.copy()
-    env.update(config.get("env", {}))
-    
    return cmd, env

 def is_port_free(port):
@@ -300,7 +358,14 @@ def verify_context(model, context_len):
    """
    Sends a request to the server with length ~context_len to verify stability.
    """
-    url = f"http://{HOST}:{PORT}/v1/completions"
+    # Use dynamic host if set (by cluster logic), else localhost
+    # But wait, the env var is set for the SERVER process, not necessarily us?
+    # Actually, we (the client script) need to know where to send requests.
+    # If we are on Head, localhost is fine for Head-based server. 
+    # But if we use Ray, vLLM head usually binds to HOST IP.
+    
+    target_host = os.getenv("VLLM_HOST_IP", "127.0.0.1")
+    url = f"http://{target_host}:{PORT}/v1/completions"
    
    # We use a simple "A " * N prompt.
    # Llama 3 tokenizer: "A" is usually 1 token.
@@ -529,9 +594,22 @@ def main():
            continue
            
        config = MODEL_TABLE[model]
-        valid_tps = [t for t in config["valid_tp"] if t <= gpu_count]
        
-        for tp in valid_tps:
+        # KEY CHANGES:
+        # We only want to test the MINIMUM required TP.
+        # If model supports 1 and 2, we ONLY test 1 (local is faster/easier).
+        # We only test 2 if model VALID_TP *starts* with 2 (or higher).
+        
+        valid_tps = config.get("valid_tp", [1])
+        min_tp = min(valid_tps)
+        
+        if min_tp > gpu_count:
+            log(f"Skipping {model}: Requires TP={min_tp} but only {gpu_count} GPUs available.")
+            continue
+            
+        tps_to_test = [min_tp]
+        
+        for tp in tps_to_test:
            # Track successful seqs for this TP to skip lower utils
            # effectively: {seqs_count: max_working_util}
            # Since we iterate high-util -> low-util, if we succeeded already for this 'seqs', we skip.
@@ -2,6 +2,12 @@
 import subprocess, time, json, sys, os, requests, argparse
 from pathlib import Path

+try:
+    import bench_utils
+except ImportError:
+    sys.path.append(str(Path(__file__).parent))
+    import bench_utils
+

 # =========================
 # ⚙️ GLOBAL SETTINGS
@@ -89,38 +95,43 @@ def get_dataset():



-def get_model_args(model, tp_size):
+def get_model_args(model, tp_size, overrides=None):
    config = MODEL_TABLE.get(model, {"max_num_seqs": "32"})
+    overrides = overrides or {}
    
    # Allow per-model GPU utilization override
-    util = config.get("gpu_util", GPU_UTIL)
+    util = overrides.get("gpu_util", config.get("gpu_util", GPU_UTIL))
+    max_seq_override = overrides.get("max_num_seqs", config.get("max_num_seqs", "32"))

    cmd = [
        "--model", model,
-        "--gpu-memory-utilization", util,
+        "--gpu-memory-utilization", str(util),
        "--dtype", "auto",
        "--tensor-parallel-size", str(tp_size),
-        "--max-num-seqs", config["max_num_seqs"]
+        "--max-num-seqs", str(max_seq_override)
    ]
    
    # Optional: if a model really needs a hard limit, we can still support "ctx" in config,
    # but by default we rely on auto.
-    if "ctx" in config:
-        cmd.extend(["--max-model-len", config["ctx"]])
+    if "ctx" in overrides or "ctx" in config:
+        cmd.extend(["--max-model-len", str(overrides.get("ctx", config.get("ctx")))])
        
    if config.get("trust_remote"): cmd.append("--trust-remote-code")
    if config.get("enforce_eager"): cmd.append("--enforce-eager")
    
    return cmd

-def run_throughput(model, tp_size, backend_name="Default", output_dir=RESULTS_DIR, extra_env=None):
+def run_throughput(model, tp_size, backend_name="Default", output_dir=RESULTS_DIR, extra_env=None, overrides=None):
    if tp_size not in MODEL_TABLE[model]["valid_tp"]: return
+    overrides = overrides or {}
    
    model_safe = model.replace("/", "_")
    output_dir_path = Path(output_dir)
    output_dir_path.mkdir(parents=True, exist_ok=True)
    
-    output_file = output_dir_path / f"{model_safe}_tp{tp_size}_throughput.json"
+    tag = overrides.get("tag", "").strip()
+    tag_suffix = f"_{tag}" if tag else ""
+    output_file = output_dir_path / f"{model_safe}_tp{tp_size}{tag_suffix}_throughput.json"
    
    if output_file.exists():
        log(f"SKIP {model} (TP={tp_size} | {backend_name})")
@@ -130,13 +141,13 @@ def run_throughput(model, tp_size, backend_name="Default", output_dir=RESULTS_DI
    dataset_args = ["--dataset-name", "sharegpt", "--dataset-path", dataset_path] if dataset_path else ["--input-len", "1024"]
    
    # Retrieve Model-Specific Batch Tokens
-    batch_tokens = MODEL_TABLE[model].get("max_tokens", DEFAULT_BATCH_TOKENS)
+    batch_tokens = str(overrides.get("max_tokens", MODEL_TABLE[model].get("max_tokens", DEFAULT_BATCH_TOKENS)))

    log(f"START {model} (TP={tp_size} | {backend_name}) [Batch: {batch_tokens}]...")
    kill_vllm()
    nuke_vllm_cache()

-    cmd = ["vllm", "bench", "throughput"] + get_model_args(model, tp_size)
+    cmd = ["vllm", "bench", "throughput"] + get_model_args(model, tp_size, overrides)
    cmd.extend([
        "--num-prompts", str(OFF_NUM_PROMPTS),
        "--max-num-batched-tokens", batch_tokens,
@@ -152,6 +163,7 @@ def run_throughput(model, tp_size, backend_name="Default", output_dir=RESULTS_DI

    # ENV Setup: Global + Model Specific
    env = os.environ.copy()
+    env["VLLM_DISABLE_COMPILE_CACHE"] = "1"
    
    # Inject model specific env vars (e.g. for AWQ)
    model_env = MODEL_TABLE[model].get("env", {})
@@ -168,35 +180,64 @@ def run_throughput(model, tp_size, backend_name="Default", output_dir=RESULTS_DI


 def print_summary(tps):
-    print(f"\n{'MODEL':<40} | {'TP':<2} | {'Triton':<8} | {'ROCm':<8}")
-    print("-" * 75)
+    print(f"\n{'MODEL':<40} | {'TP':<2} | {'Tag':<15} | {'Triton':<8} | {'ROCm':<8}")
+    print("-" * 92)
    
    for m in MODELS_TO_RUN:
        msafe = m.replace("/", "_")
+        name_cell = m.split('/')[-1]
+        
        for tp in tps:
            if tp not in MODEL_TABLE[m]["valid_tp"]: continue
            
-            # Default
-            try: 
-                p1 = RESULTS_DIR / f"{msafe}_tp{tp}_throughput.json"
-                d1 = json.loads(p1.read_text())
-                val1 = f"{d1.get('tokens_per_second', 0):.1f}"
-            except: val1 = "N/A"
+            prefix = f"{msafe}_tp{tp}"
            
-            # ROCm
-            try:
-                p2 = Path("benchmark_results_rocm") / f"{msafe}_tp{tp}_throughput.json"
-                d2 = json.loads(p2.read_text())
-                val2 = f"{d2.get('tokens_per_second', 0):.1f}"
-            except: val2 = "N/A"
+            tags = set()
+            for p in RESULTS_DIR.glob(f"{prefix}*_throughput.json"):
+                name_part = p.name[len(prefix):-len("_throughput.json")]
+                tag = name_part.lstrip("_")
+                tags.add(tag)
+                
+            for p in Path("benchmark_results_rocm").glob(f"{prefix}*_throughput.json"):
+                name_part = p.name[len(prefix):-len("_throughput.json")]
+                tag = name_part.lstrip("_")
+                tags.add(tag)
+                
+            if not tags:
+                tags.add("") # Default empty tag if no files found
+                
+            for tag in sorted(list(tags)):
+                tag_suffix = f"_{tag}" if tag else ""
+                
+                # Default
+                try: 
+                    p1 = RESULTS_DIR / f"{prefix}{tag_suffix}_throughput.json"
+                    if p1.exists():
+                        d1 = json.loads(p1.read_text())
+                        val1 = f"{d1.get('tokens_per_second', 0):.1f}"
+                    else:
+                        val1 = "N/A"
+                except: val1 = "N/A"
+                
+                # ROCm
+                try:
+                    p2 = Path("benchmark_results_rocm") / f"{prefix}{tag_suffix}_throughput.json"
+                    if p2.exists():
+                        d2 = json.loads(p2.read_text())
+                        val2 = f"{d2.get('tokens_per_second', 0):.1f}"
+                    else:
+                        val2 = "N/A"
+                except: val2 = "N/A"

-            name_cell = m.split('/')[-1]
-            print(f"{name_cell:<40} | {tp:<2} | {val1:<8} | {val2:<8}")
-    print("-" * 75)
+                display_tag = tag if tag else "(Default)"
+                print(f"{name_cell:<40} | {tp:<2} | {display_tag:<15} | {val1:<8} | {val2:<8}")
+                
+    print("-" * 92)

 if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--tp", type=int, nargs="+", default=[1])
+    parser.add_argument("--tui", action="store_true", help="Launch interactive configuration UI")
    args = parser.parse_args()
    
    gpu_count = get_gpu_count()
@@ -207,17 +248,86 @@ if __name__ == "__main__":
        log(f"Requested TP={args.tp} but only {gpu_count} GPU(s) detected. Nothing to run.")
        sys.exit(0)

+    selected_models = MODELS_TO_RUN
+    
+    if args.tui:
+        # TUI Model Selection
+        checklist_args = [
+            "--clear", "--backtitle", "AMD vLLM Benchmark Launcher",
+            "--title", "Model Selection",
+            "--checklist", "Select models to benchmark:", "20", "65", "10"
+        ]
+        
+        for m in MODELS_TO_RUN:
+            m_name = m.split("/")[-1]
+            # All selected "on" by default
+            checklist_args.extend([m, m_name, "on"])
+            
+        choice = bench_utils.run_dialog(checklist_args)
+        
+        if choice is None:
+            subprocess.run(["clear"])
+            print("Cancelled by user.")
+            sys.exit(0)
+            
+        # Parse space-separated quoted output from dialog checklist
+        import shlex
+        selected_models = [m for m in shlex.split(choice)]
+        
+        if not selected_models:
+            subprocess.run(["clear"])
+            print("No models selected. Exiting.")
+            sys.exit(0)
+
    kill_vllm()
    for tp in valid_tp_args:
-        for m in MODELS_TO_RUN:
+        for m in selected_models:
+            overrides = {}
+            if args.tui:
+                config = MODEL_TABLE.get(m, {})
+                default_seqs = config.get("max_num_seqs", "32")
+                default_tokens = config.get("max_tokens", DEFAULT_BATCH_TOKENS)
+                default_util = config.get("gpu_util", GPU_UTIL)
+                default_ctx = config.get("ctx", "auto")
+                
+                form_args = [
+                    "--clear", "--backtitle", f"AMD vLLM Benchmark Configuration (TP: {tp})",
+                    "--title", f"Tune Parameters: {m.split('/')[-1]}",
+                    "--form", "Edit the options below. Leave tag empty for no suffix.",
+                    "15", "70", "5",
+                    "Max Concurrent Seqs:", "1", "1",  str(default_seqs), "1", "25", "15", "0",
+                    "Max Batched Tokens:", "2", "1", str(default_tokens), "2", "25", "15", "0",
+                    "GPU Utilization (0-1):", "3", "1", str(default_util), "3", "25", "15", "0",
+                    "Max Context Length:", "4", "1", str(default_ctx), "4", "25", "15", "0",
+                    "Filename Tag (Optional):", "5", "1", "", "5", "25", "15", "0"
+                ]
+                
+                form_res = bench_utils.run_dialog(form_args)
+                if form_res is None:
+                    subprocess.run(["clear"])
+                    print(f"Skipping {m} (TP={tp}) due to user cancellation.")
+                    continue
+                    
+                lines = form_res.splitlines()
+                if len(lines) >= 5:
+                    overrides["max_num_seqs"] = lines[0].strip()
+                    overrides["max_tokens"] = lines[1].strip()
+                    overrides["gpu_util"] = lines[2].strip()
+                    
+                    ctx_val = lines[3].strip()
+                    if ctx_val and ctx_val.lower() != "auto":
+                        overrides["ctx"] = ctx_val
+                        
+                    overrides["tag"] = lines[4].strip()
+            
            # 1. Default (Triton)
-            run_throughput(m, tp, "Default", RESULTS_DIR)
+            run_throughput(m, tp, "Default", RESULTS_DIR, overrides=overrides)
            
            # 2. ROCm Attention 
            # We force this via CLI argument --attention-backend ROCM_ATTN below
            # No specific env vars needed if forcing backend.
            rocm_env = {}
            print(f"[DEBUG] Forcing ROCm Env: {rocm_env} + CLI: --attention-backend ROCM_ATTN")
-            run_throughput(m, tp, "ROCm-Attn", "benchmark_results_rocm", rocm_env)
+            run_throughput(m, tp, "ROCm-Attn", "benchmark_results_rocm", rocm_env, overrides=overrides)
            
    print_summary(valid_tp_args)
@@ -2,6 +2,12 @@
 import subprocess, time, json, sys, os, requests, argparse, re
 from pathlib import Path

+try:
+    import bench_utils
+except ImportError:
+    sys.path.append(str(Path(__file__).parent))
+    import bench_utils
+
 # Import models immediately to access globals
 try:
    import models
@@ -23,6 +29,8 @@ except ImportError:
 # User requested specifically to test with TP=2 on the cluster.
 CLUSTER_TP = 2
 GPU_UTIL = "0.90" 
+FORCE_ETH = False
+FORCE_DEBUG_NCCL = False

 # THROUGHPUT CONFIG (Imported from models.py)
 OFF_NUM_PROMPTS      = models.OFF_NUM_PROMPTS
@@ -66,6 +74,15 @@ def log(msg): print(f"\n[CLUSTER-BENCH] {msg}")
 def restart_cluster():
    log("Restarting Ray Cluster (Clean State)...")
    
+    # Push config to env so cluster_manager picks it up for daemon injection
+    os.environ["NCCL_IB_DISABLE"] = "1" if FORCE_ETH else "0"
+    if FORCE_DEBUG_NCCL:
+        os.environ["NCCL_DEBUG"] = "INFO"
+        os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
+    else:
+        os.environ.pop("NCCL_DEBUG", None)
+        os.environ.pop("NCCL_DEBUG_SUBSYS", None)
+        
    # 1. Stop Cluster (Best Effort)
    cluster_manager.stop_cluster()
    
@@ -89,7 +106,8 @@ def restart_cluster():
    log("Cluster Ready.")

 def get_net_iface():
-    return cluster_manager.get_net_iface()
+    prefix = ".".join(HEAD_IP.split('.')[:3])
+    return cluster_manager.get_net_iface(prefix)

 def get_local_ip(iface):
    return cluster_manager.get_local_ip(iface)
@@ -122,6 +140,7 @@ def get_cluster_env():
    host_ip = get_local_ip(rdma_iface)
    
    env = os.environ.copy()
+    env["VLLM_DISABLE_COMPILE_CACHE"] = "1"
    
    # Critical Cluster Envs (Match start_vllm_cluster.py)
    env["RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"] = "1"
@@ -130,31 +149,37 @@ def get_cluster_env():
    env["GLOO_SOCKET_IFNAME"] = rdma_iface
    # RCCL specific
    env["NCCL_IB_GID_INDEX"] = "1"
-    env["NCCL_IB_DISABLE"] = "0"
+    env["NCCL_IB_DISABLE"] = "1" if FORCE_ETH else "0"
    env["NCCL_NET_GDR_LEVEL"] = "0"
    
    # Stability for RDMA (Fix for high-throughput models like Gemma 3)
    env["NCCL_IB_TIMEOUT"] = "23"  # ~32 seconds (default is 18/~1s)
    env["NCCL_IB_RETRY_CNT"] = "7" # Default is 3, increase for lossy networks
    
+    if FORCE_DEBUG_NCCL:
+        env["NCCL_DEBUG"] = "INFO"
+        env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
+    
    return env

-def get_model_args(model):
+def get_model_args(model, overrides=None):
    config = MODEL_TABLE.get(model, {"max_num_seqs": "32"})
-    util = config.get("gpu_util", GPU_UTIL)
+    overrides = overrides or {}
+    util = overrides.get("gpu_util", config.get("gpu_util", GPU_UTIL))
+    max_seq_override = overrides.get("max_num_seqs", config.get("max_num_seqs", "32"))

    cmd = [
        "--model", model,
-        "--gpu-memory-utilization", util,
+        "--gpu-memory-utilization", str(util),
        "--dtype", "auto",
        "--tensor-parallel-size", str(CLUSTER_TP),
-        "--max-num-seqs", config["max_num_seqs"],
+        "--max-num-seqs", str(max_seq_override),
        "--distributed-executor-backend", "ray"
    ]
    
    # Optional ctx
-    if "ctx" in config:
-        cmd.extend(["--max-model-len", config["ctx"]])
+    if "ctx" in overrides or "ctx" in config:
+        cmd.extend(["--max-model-len", str(overrides.get("ctx", config.get("ctx")))])
        
    if config.get("trust_remote"): cmd.append("--trust-remote-code")
    
@@ -163,16 +188,20 @@ def get_model_args(model):
    
    return cmd

-def get_benchmark_output_file(model, output_dir):
+def get_benchmark_output_file(model, output_dir, tag=""):
    model_safe = model.replace("/", "_")
    output_dir_path = Path(output_dir)
-    return output_dir_path / f"{model_safe}_cluster_tp{CLUSTER_TP}_throughput.json"
+    eth_suffix = "_eth" if FORCE_ETH else ""
+    tag_suffix = f"_{tag}" if tag else ""
+    return output_dir_path / f"{model_safe}_cluster_tp{CLUSTER_TP}{eth_suffix}{tag_suffix}_throughput.json"

-def run_bench_set(model, backend_name, output_dir, extra_env=None):
+def run_bench_set(model, backend_name, output_dir, extra_env=None, overrides=None):
    output_dir_path = Path(output_dir)
    output_dir_path.mkdir(parents=True, exist_ok=True)
+    overrides = overrides or {}
    
-    output_file = get_benchmark_output_file(model, output_dir)
+    tag = overrides.get("tag", "").strip()
+    output_file = get_benchmark_output_file(model, output_dir, tag)
    
    if output_file.exists():
        log(f"SKIP {model} [{backend_name}] (Result exists)")
@@ -181,13 +210,13 @@ def run_bench_set(model, backend_name, output_dir, extra_env=None):
    dataset_path = get_dataset()
    dataset_args = ["--dataset-name", "sharegpt", "--dataset-path", dataset_path] if dataset_path else ["--input-len", "1024"]
    
-    batch_tokens = MODEL_TABLE[model].get("max_tokens", DEFAULT_BATCH_TOKENS)
+    batch_tokens = str(overrides.get("max_tokens", MODEL_TABLE.get(model, {}).get("max_tokens", DEFAULT_BATCH_TOKENS)))

    log(f"START {model} [TP={CLUSTER_TP} | {backend_name}]...")
    
    nuke_vllm_cache()

-    cmd = ["vllm", "bench", "throughput"] + get_model_args(model)
+    cmd = ["vllm", "bench", "throughput"] + get_model_args(model, overrides)
    cmd.extend([
        "--num-prompts", str(OFF_NUM_PROMPTS),
        "--max-num-batched-tokens", batch_tokens,
@@ -218,20 +247,24 @@ def run_bench_set(model, backend_name, output_dir, extra_env=None):
    except Exception as e:
        log(f"ERROR: System error: {e}")

-def run_cluster_throughput(model):
+def run_cluster_throughput(model, overrides=None):
+    overrides = overrides or {}
+    tag = overrides.get("tag", "").strip()
+    
    # 1. Default Run (Triton)
-    if get_benchmark_output_file(model, RESULTS_DIR).exists():
+    if get_benchmark_output_file(model, RESULTS_DIR, tag).exists():
        log(f"SKIP {model} [Default] (Result exists)")
    else:
        restart_cluster()
        run_bench_set(
            model, 
            "Default", 
-            RESULTS_DIR
+            RESULTS_DIR,
+            overrides=overrides
        )
    
    # 2. ROCm Attention Run
-    if get_benchmark_output_file(model, "benchmark_results_rocm").exists():
+    if get_benchmark_output_file(model, "benchmark_results_rocm", tag).exists():
        log(f"SKIP {model} [ROCm-Attn] (Result exists)")
    else:
        restart_cluster()
@@ -239,47 +272,186 @@ def run_cluster_throughput(model):
            model,
            "ROCm-Attn",
            "benchmark_results_rocm",
-            extra_env={}
+            extra_env={},
+            overrides=overrides
        )


 def print_summary():
-    print(f"\n{'MODEL (TP=2)':<50} | {'Triton':<8} | {'ROCm':<8}")
-    print("-" * 75)
+    eth_suffix = "_eth" if FORCE_ETH else ""
+    title_suffix = " (Ethernet ONLY)" if FORCE_ETH else ""
+    print(f"\n{f'MODEL (TP={CLUSTER_TP}){title_suffix}':<50} | {'Tag':<15} | {'Triton':<8} | {'ROCm':<8}")
+    print("-" * 92)
    
    for m in MODELS_TO_RUN:
        msafe = m.replace("/", "_")
-        
-        # Default
-        try: 
-            p1 = RESULTS_DIR / f"{msafe}_cluster_tp{CLUSTER_TP}_throughput.json"
-            d1 = json.loads(p1.read_text())
-            val1 = f"{d1.get('tokens_per_second', 0):.1f}"
-        except: val1 = "N/A"
-        
-        # ROCm
-        try:
-            p2 = Path("benchmark_results_rocm") / f"{msafe}_cluster_tp{CLUSTER_TP}_throughput.json"
-            d2 = json.loads(p2.read_text())
-            val2 = f"{d2.get('tokens_per_second', 0):.1f}"
-        except: val2 = "N/A"
-
        name_cell = m.split('/')[-1]
-        print(f"{name_cell:<50} | {val1:<8} | {val2:<8}")
-    print("-" * 75)
+        
+        # Find all tags used for this model by looking at the files in RESULTS_DIR
+        prefix = f"{msafe}_cluster_tp{CLUSTER_TP}{eth_suffix}"
+        
+        # Gather all unique tags from both directories
+        tags = set()
+        for p in RESULTS_DIR.glob(f"{prefix}*_throughput.json"):
+            # Extract tag: {prefix}_{tag}_throughput.json or {prefix}_throughput.json
+            name_part = p.name[len(prefix):-len("_throughput.json")]
+            tag = name_part.lstrip("_")
+            tags.add(tag)
+            
+        for p in Path("benchmark_results_rocm").glob(f"{prefix}*_throughput.json"):
+            name_part = p.name[len(prefix):-len("_throughput.json")]
+            tag = name_part.lstrip("_")
+            tags.add(tag)
+            
+        if not tags:
+            tags.add("") # Default empty tag if no files found
+            
+        # Sort so empty tag (Default) comes first
+        for tag in sorted(list(tags)):
+            tag_suffix = f"_{tag}" if tag else ""
+            
+            # Default (Triton)
+            try: 
+                p1 = RESULTS_DIR / f"{prefix}{tag_suffix}_throughput.json"
+                if p1.exists():
+                    d1 = json.loads(p1.read_text())
+                    val1 = f"{d1.get('tokens_per_second', 0):.1f}"
+                else:
+                    val1 = "N/A"
+            except: val1 = "N/A"
+            
+            # ROCm
+            try:
+                p2 = Path("benchmark_results_rocm") / f"{prefix}{tag_suffix}_throughput.json"
+                if p2.exists():
+                    d2 = json.loads(p2.read_text())
+                    val2 = f"{d2.get('tokens_per_second', 0):.1f}"
+                else:
+                    val2 = "N/A"
+            except: val2 = "N/A"
+
+            display_tag = tag if tag else "(Default)"
+            print(f"{name_cell:<50} | {display_tag:<15} | {val1:<8} | {val2:<8}")
+            
+    print("-" * 92)

 if __name__ == "__main__":
-    # if not check_ray_status():
-    #     log("ERROR: Ray Cluster not ready. Please start it with 'start-vllm-cluster' first.")
-    #     sys.exit(1)
-    # We now handle this by restarting the cluster ourselves.
-    pass
+    parser = argparse.ArgumentParser(description="VLLM Cluster Benchmark")
+    parser.add_argument("--eth-only", action="store_true", help="Run benchmark using only Ethernet (disable RDMA/RoCE)")
+    parser.add_argument("--debug-nccl", action="store_true", help="Enable NCCL Debug logging (INFO level for Transport tracking)")
+    parser.add_argument("--tui", action="store_true", help="Launch interactive configuration UI")
+    args = parser.parse_args()
+    
+    FORCE_ETH = args.eth_only
+    FORCE_DEBUG_NCCL = args.debug_nccl
+
+    selected_models = MODELS_TO_RUN
+    
+    if args.tui:
+        # 1. Cluster IPs Configuration
+        form_args = [
+            "--clear", "--backtitle", "AMD VLLM Cluster Configuration",
+            "--title", "Cluster Network Details",
+            "--form", "Verify Head and Worker IPs for this run:",
+            "10", "60", "2",
+            "Head Node IP:", "1", "1", HEAD_IP, "1", "20", "20", "0",
+            "Worker Node IP:", "2", "1", WORKER_IP, "2", "20", "20", "0"
+        ]
+        res = bench_utils.run_dialog(form_args)
+        if res is None:
+            subprocess.run(["clear"])
+            print("Cancelled by user.")
+            sys.exit(0)
+            
+        lines = res.splitlines()
+        if len(lines) >= 2:
+            HEAD_IP = lines[0].strip()
+            WORKER_IP = lines[1].strip()
+            os.environ["VLLM_HEAD_IP"] = HEAD_IP
+            os.environ["VLLM_WORKER_IP"] = WORKER_IP
+            
+        # 2. Network Options (ETH / Debug)
+        eth_status = "on" if FORCE_ETH else "off"
+        debug_status = "on" if FORCE_DEBUG_NCCL else "off"
+        check_args = [
+            "--title", "Network Overrides",
+            "--checklist", "Select custom backend flags:", "10", "60", "2",
+            "ETH_ONLY", "Force Ethernet (Disable RDMA/RoCE)", eth_status,
+            "DEBUG_NCCL", "Enable NCCL debug logs", debug_status
+        ]
+        flags_res = bench_utils.run_dialog(check_args)
+        if flags_res is not None:
+            FORCE_ETH = "ETH_ONLY" in flags_res
+            FORCE_DEBUG_NCCL = "DEBUG_NCCL" in flags_res
+
+        # 3. Model Selection
+        checklist_args = [
+            "--title", "Model Selection",
+            "--checklist", "Select models to benchmark:", "20", "65", "10"
+        ]
+        for m in MODELS_TO_RUN:
+            m_name = m.split("/")[-1]
+            checklist_args.extend([m, m_name, "on"])
+            
+        choice = bench_utils.run_dialog(checklist_args)
+        if choice is None:
+            subprocess.run(["clear"])
+            print("Cancelled by user.")
+            sys.exit(0)
+            
+        import shlex
+        selected_models = [m for m in shlex.split(choice)]
+        if not selected_models:
+            subprocess.run(["clear"])
+            print("No models selected. Exiting.")
+            sys.exit(0)

-        
    log("Ray Cluster Detected. Starting Benchmarks (Dual Backend)...")
+    if FORCE_ETH:
+        log("Note: Ethernet ONLY mode enabled. RDMA/RoCE disabled.")
+    if FORCE_DEBUG_NCCL:
+        log("Note: NCCL Debug mode enabled (Transport Logging).")
    log("Note: Eager Mode (--enforce-eager) is ENABLED for cluster stability.")
    
-    for m in MODELS_TO_RUN:
-        run_cluster_throughput(m)
+    for m in selected_models:
+        overrides = {}
+        if args.tui:
+            config = MODEL_TABLE.get(m, {})
+            default_seqs = config.get("max_num_seqs", "32")
+            default_tokens = config.get("max_tokens", DEFAULT_BATCH_TOKENS)
+            default_util = config.get("gpu_util", GPU_UTIL)
+            default_ctx = config.get("ctx", "auto")
+            
+            form_args = [
+                "--clear", "--backtitle", f"AMD VLLM Cluster Benchmark Configuration (TP: {CLUSTER_TP})",
+                "--title", f"Tune Parameters: {m.split('/')[-1]}",
+                "--form", "Edit cluster model options. Leave tag empty for no suffix.",
+                "15", "70", "5",
+                "Max Concurrent Seqs:", "1", "1",  str(default_seqs), "1", "25", "15", "0",
+                "Max Batched Tokens:", "2", "1", str(default_tokens), "2", "25", "15", "0",
+                "GPU Utilization (0-1):", "3", "1", str(default_util), "3", "25", "15", "0",
+                "Max Context Length:", "4", "1", str(default_ctx), "4", "25", "15", "0",
+                "Filename Tag (Optional):", "5", "1", "", "5", "25", "15", "0"
+            ]
+            
+            form_res = bench_utils.run_dialog(form_args)
+            if form_res is None:
+                subprocess.run(["clear"])
+                print(f"Skipping {m} due to user cancellation.")
+                continue
+                
+            lines = form_res.splitlines()
+            if len(lines) >= 5:
+                overrides["max_num_seqs"] = lines[0].strip()
+                overrides["max_tokens"] = lines[1].strip()
+                overrides["gpu_util"] = lines[2].strip()
+                
+                ctx_val = lines[3].strip()
+                if ctx_val and ctx_val.lower() != "auto":
+                    overrides["ctx"] = ctx_val
+                    
+                overrides["tag"] = lines[4].strip()
+
+        run_cluster_throughput(m, overrides=overrides)
        
    print_summary()
@@ -469,6 +469,14 @@
                    style="font-size: 0.9rem; font-weight: 500; display: flex; align-items: center; gap: 4px; cursor: pointer;">
                    <input type="checkbox" id="toggleTP2" checked> TP2
                </label>
+                <label
+                    style="font-size: 0.9rem; font-weight: 500; display: flex; align-items: center; gap: 4px; cursor: pointer;">
+                    <input type="checkbox" id="toggleTP2Eth"> TP2 (Eth)
+                </label>
+                <label
+                    style="font-size: 0.9rem; font-weight: 500; display: flex; align-items: center; gap: 4px; cursor: pointer;">
+                    <input type="checkbox" id="toggleTP2Usb"> TP2 (Thunderbolt)
+                </label>
            </div>

            <!-- Attention Group -->
@@ -544,6 +552,8 @@
            activeTab: "Throughput",
            showTP1: true,
            showTP2: true,
+            showTP2Eth: false,
+            showTP2Usb: false,
            showTriton: true,
            showRocm: false
        };
@@ -615,6 +625,8 @@
            // Toggles
            $('toggleTP1').addEventListener('change', e => { state.showTP1 = e.target.checked; render(); });
            $('toggleTP2').addEventListener('change', e => { state.showTP2 = e.target.checked; render(); });
+            $('toggleTP2Eth').addEventListener('change', e => { state.showTP2Eth = e.target.checked; render(); });
+            $('toggleTP2Usb').addEventListener('change', e => { state.showTP2Usb = e.target.checked; render(); });
            $('toggleTriton').addEventListener('change', e => { state.showTriton = e.target.checked; render(); });
            $('toggleRocm').addEventListener('change', e => { state.showRocm = e.target.checked; render(); });
        }
@@ -636,13 +648,23 @@
                        params: run.params_b || run.name_params_b,
                        results: {
                            1: { triton: null, rocm: null },
-                            2: { triton: null, rocm: null }
+                            2: { triton: null, rocm: null },
+                            "2_eth": { triton: null, rocm: null },
+                            "2_usb": { triton: null, rocm: null }
                        }
                    };
                }

                const m = testGroups[testName].models[modelName];
-                const tp = run.tp || 1;
+                let tp = run.tp || 1;
+                if (tp === 2) {
+                    if (run.network === "Ethernet") {
+                        if (run.tag === "usb") tp = "2_usb";
+                        else tp = "2_eth";
+                    } else if (run.tag === "usb") {
+                        tp = "2_usb";
+                    }
+                }

                if (!m.results[tp]) m.results[tp] = { triton: null, rocm: null };

@@ -749,8 +771,16 @@
                if (state.showRocm) cols.push({ id: "tp1_rocm", label: "TP1 ROCm" });
            }
            if (state.showTP2) {
-                if (state.showTriton) cols.push({ id: "tp2_triton", label: "TP2 Triton" });
-                if (state.showRocm) cols.push({ id: "tp2_rocm", label: "TP2 ROCm" });
+                if (state.showTriton) cols.push({ id: "tp2_triton", label: "TP2 RoCE Triton" });
+                if (state.showRocm) cols.push({ id: "tp2_rocm", label: "TP2 RoCE ROCm" });
+            }
+            if (state.showTP2Eth) {
+                if (state.showTriton) cols.push({ id: "tp2_eth_triton", label: "TP2 Eth Triton" });
+                if (state.showRocm) cols.push({ id: "tp2_eth_rocm", label: "TP2 Eth ROCm" });
+            }
+            if (state.showTP2Usb) {
+                if (state.showTriton) cols.push({ id: "tp2_usb_triton", label: "TP2 TB Triton" });
+                if (state.showRocm) cols.push({ id: "tp2_usb_rocm", label: "TP2 TB ROCm" });
            }

            // Thead
@@ -790,11 +820,7 @@

                // Data Cells
                cols.forEach(c => {
-                    let val = null;
-                    if (c.id === "tp1_triton") val = m.results[1]?.triton;
-                    if (c.id === "tp1_rocm") val = m.results[1]?.rocm;
-                    if (c.id === "tp2_triton") val = m.results[2]?.triton;
-                    if (c.id === "tp2_rocm") val = m.results[2]?.rocm;
+                    let val = getVal(m, c.id);

                    const bg = c.id.startsWith("tp2") ? 'style="background:#fbfdff;"' : "";
                    rowHtml += `<td class="col-data" ${bg}>${formatVal(val, unit)}</td>`;
@@ -823,6 +849,10 @@
            if (colId === "tp1_rocm") return m.results[1]?.rocm;
            if (colId === "tp2_triton") return m.results[2]?.triton;
            if (colId === "tp2_rocm") return m.results[2]?.rocm;
+            if (colId === "tp2_eth_triton") return m.results["2_eth"]?.triton;
+            if (colId === "tp2_eth_rocm") return m.results["2_eth"]?.rocm;
+            if (colId === "tp2_usb_triton") return m.results["2_usb"]?.triton;
+            if (colId === "tp2_usb_rocm") return m.results["2_usb"]?.rocm;
            return null;
        }

@@ -66,6 +66,30 @@ def parse_logs():
            if not tp_match: continue
            tp = int(tp_match.group(1))
            
+            # Network
+            network = "RoCE"
+            network_prefix = ""
+            if "_eth" in rest:
+                network = "Ethernet"
+                network_prefix = "_eth"
+                
+            # Tag Extraction
+            tag = ""
+            test_type_str = ""
+            if "throughput" in fname:
+                test_type_str = "_throughput.json"
+            elif "latency" in fname:
+                qps_match = re.search(r"(_qps[\d\.]+)_latency\.json$", rest)
+                if qps_match:
+                    test_type_str = qps_match.group(0)
+                else:
+                    test_type_str = "_latency.json"
+            
+            raw_prefix = f"{tp}{network_prefix}"
+            if rest.endswith(test_type_str):
+                tag_part = rest[len(raw_prefix):-len(test_type_str)]
+                tag = tag_part.lstrip("_")
+            
            # Model Name
            if "_" in model_part:
                model_display = model_part.replace("_", "/", 1)
@@ -87,6 +111,8 @@ def parse_logs():
                "params_b": params_b,
                "name_params_b": params_b,
                "backend": backend_name, # "Triton" or "ROCm"
+                "network": network,
+                "tag": tag,
                "error": False
            }

@@ -3,62 +3,140 @@
 # -------- dynamic config --------
 HOST_ROCE="192.168.100.2"
 HOST_ETH="192.168.1.127"
+HOST_TB="192.168.2.2"

-# Automatically detect local and remote RDMA device names
-RDMA_DEV_LOCAL=$(ibv_devices | awk 'NR==3 {print $1}')
-RDMA_DEV_REMOTE=$(ssh "$HOST_ROCE" "toolbox run -c vllm -- ibv_devices | awk 'NR==3 {print \$1}'")
+# Parse args
+RUN_ETH=true
+RUN_ROCE=true
+RUN_TB=true
+RUN_RDMA=true
+
+# If any flags are provided, turn off defaults and only run requested
+if [ "$#" -gt 0 ]; then
+    RUN_ETH=false
+    RUN_ROCE=false
+    RUN_TB=false
+    RUN_RDMA=false
+fi
+
+while getopts "ertih" opt; do
+    case ${opt} in
+        e ) RUN_ETH=true ;;
+        r ) RUN_ROCE=true ;;
+        t ) RUN_TB=true ;;
+        i ) RUN_RDMA=true ;;
+        h ) echo "Usage: $0 [-e (Ethernet LAN)] [-r (RoCE Ethernet/TCP)] [-t (Thunderbolt)] [-i (RDMA/Infiniband)]"
+            echo
+            echo "Options:"
+            echo "  -e    Run benchmarking for standard Ethernet (1G LAN)."
+            echo "  -r    Run benchmarking for RoCE NIC (via Ethernet/TCP)."
+            echo "  -t    Run benchmarking for Thunderbolt link."
+            echo "  -i    Run benchmarking for RDMA (RoCE v2)."
+            echo "  -h    Print this help message and exit."
+            echo
+            echo "If no arguments are provided, all benchmarks are executed."
+            exit 0
+            ;;
+        \? ) echo "Usage: cmd [-e (Ethernet LAN)] [-r (RoCE Ethernet/TCP)] [-t (Thunderbolt)] [-i (RDMA/Infiniband)] [-h (Help)]"
+             exit 1
+             ;;
+    esac
+done
+
+# Automatically detect local and remote RDMA device names if needed
+if [ "$RUN_RDMA" = true ]; then
+    RDMA_DEV_LOCAL=$(ibv_devices | awk 'NR==3 {print $1}')
+    RDMA_DEV_REMOTE=$(ssh "$HOST_ROCE" "toolbox run -c vllm -- ibv_devices | awk 'NR==3 {print \$1}'")
+fi

 WORKDIR="/tmp/rdma_bench"
 mkdir -p "$WORKDIR"

 # -------- helpers --------
 parse_ping_avg() {
-    grep rtt "$1" | awk -F'/' '{print $5}'
+    if [ -f "$1" ]; then
+        grep rtt "$1" | awk -F'/' '{print $5}'
+    else
+        echo "0"
+    fi
 }

 parse_iperf_gbps() {
-    grep receiver "$1" | tail -n1 | awk '
-    {
-        val=$(NF-2);
-        unit=$(NF-1);
-        if (unit=="Mbits/sec") printf "%.2f", val/1000;
-        else if (unit=="Gbits/sec") printf "%.2f", val;
-        else print "N/A";
-    }'
+    if [ -f "$1" ]; then
+        grep receiver "$1" | tail -n1 | awk '
+        {
+            val=$(NF-2);
+            unit=$(NF-1);
+            if (unit=="Mbits/sec") printf "%.2f", val/1000;
+            else if (unit=="Gbits/sec") printf "%.2f", val;
+            else print "0.00";
+        }'
+    else
+        echo "0.00"
+    fi
 }

 parse_rdma_lat_us() {
-    val=$(grep -E '^[[:space:]]*[0-9]+' "$1" | tail -n1 | awk '{print $6}')
-    echo "${val:-0}"
+    if [ -f "$1" ]; then
+        val=$(grep -E '^[[:space:]]*[0-9]+' "$1" | tail -n1 | awk '{print $6}')
+        echo "${val:-0}"
+    else
+        echo "0"
+    fi
 }

 parse_rdma_bw_mib() {
-    val=$(grep -E '^[[:space:]]*[0-9]+' "$1" | tail -n1 | awk '{print $4}')
-    echo "${val:-0}"
+    if [ -f "$1" ]; then
+        val=$(grep -E '^[[:space:]]*[0-9]+' "$1" | tail -n1 | awk '{print $4}')
+        echo "${val:-0}"
+    else
+        echo "0"
+    fi
 }

-# -------- normal ethernet --------
-ping -c 10 "$HOST_ETH" > "$WORKDIR/ping_eth.txt"
-ssh "$HOST_ROCE" "toolbox run -c vllm -- iperf3 -s -1" >/dev/null 2>&1 &
-sleep 1
-iperf3 -c "$HOST_ETH" -P 8 -t 10 > "$WORKDIR/iperf_eth.txt"
+# Clear old results
+rm -f "$WORKDIR"/*.txt

-# -------- roce ethernet (tcp) --------
-ping -c 10 "$HOST_ROCE" > "$WORKDIR/ping_roce.txt"
-ssh "$HOST_ROCE" "toolbox run -c vllm -- iperf3 -s -1" >/dev/null 2>&1 &
-sleep 1
-iperf3 -c "$HOST_ROCE" -P 8 -t 10 > "$WORKDIR/iperf_roce.txt"
+if [ "$RUN_ETH" = true ]; then
+    # -------- normal ethernet --------
+    echo "[*] Benchmarking Ethernet (1G LAN)..."
+    ping -c 10 "$HOST_ETH" > "$WORKDIR/ping_eth.txt"
+    ssh "$HOST_ROCE" "toolbox run -c vllm -- iperf3 -s -1" >/dev/null 2>&1 &
+    sleep 1
+    iperf3 -c "$HOST_ETH" -P 8 -t 10 > "$WORKDIR/iperf_eth.txt"
+fi

-# -------- rdma latency --------
-ssh "$HOST_ROCE" "toolbox run -c vllm -- ib_send_lat --rdma_cm -d $RDMA_DEV_REMOTE" > "$WORKDIR/rdma_lat_srv.txt" 2>&1 &
-sleep 2
-ib_send_lat --rdma_cm -d "$RDMA_DEV_LOCAL" "$HOST_ROCE" > "$WORKDIR/rdma_lat_cli.txt" 2>&1
+if [ "$RUN_ROCE" = true ]; then
+    # -------- roce ethernet (tcp) --------
+    echo "[*] Benchmarking RoCE NIC (Ethernet/TCP)..."
+    ping -c 10 "$HOST_ROCE" > "$WORKDIR/ping_roce.txt"
+    ssh "$HOST_ROCE" "toolbox run -c vllm -- iperf3 -s -1" >/dev/null 2>&1 &
+    sleep 1
+    iperf3 -c "$HOST_ROCE" -P 8 -t 10 > "$WORKDIR/iperf_roce.txt"
+fi

-# -------- rdma bandwidth (maximized) --------
-# We use -x 1 because show_gids confirmed RoCE v2 is at Index 1
-ssh "$HOST_ROCE" "toolbox run -c vllm -- ib_write_bw -a -x 1 -q 8 -m 4096" > "$WORKDIR/rdma_bw_srv.txt" 2>&1 &
-sleep 2
-ib_write_bw -a -x 1 -q 8 -m 4096 "$HOST_ROCE" > "$WORKDIR/rdma_bw_cli.txt" 2>&1
+if [ "$RUN_TB" = true ]; then
+    # -------- thunderbolt ethernet (tcp) --------
+    echo "[*] Benchmarking Thunderbolt..."
+    ping -c 10 "$HOST_TB" > "$WORKDIR/ping_tb.txt"
+    ssh "$HOST_TB" "toolbox run -c vllm -- iperf3 -s -1" >/dev/null 2>&1 &
+    sleep 1
+    iperf3 -c "$HOST_TB" -P 8 -t 10 > "$WORKDIR/iperf_tb.txt"
+fi
+
+if [ "$RUN_RDMA" = true ]; then
+    # -------- rdma latency --------
+    echo "[*] Benchmarking RDMA (RoCE v2)..."
+    ssh "$HOST_ROCE" "toolbox run -c vllm -- ib_send_lat --rdma_cm -d $RDMA_DEV_REMOTE" > "$WORKDIR/rdma_lat_srv.txt" 2>&1 &
+    sleep 2
+    ib_send_lat --rdma_cm -d "$RDMA_DEV_LOCAL" "$HOST_ROCE" > "$WORKDIR/rdma_lat_cli.txt" 2>&1
+
+    # -------- rdma bandwidth (maximized) --------
+    # We use -x 1 because show_gids confirmed RoCE v2 is at Index 1
+    ssh "$HOST_ROCE" "toolbox run -c vllm -- ib_write_bw -a -x 1 -q 8 -m 4096" > "$WORKDIR/rdma_bw_srv.txt" 2>&1 &
+    sleep 2
+    ib_write_bw -a -x 1 -q 8 -m 4096 "$HOST_ROCE" > "$WORKDIR/rdma_bw_cli.txt" 2>&1
+fi

 # -------- parse --------
 ETH_LAT_MS=$(parse_ping_avg "$WORKDIR/ping_eth.txt")
@@ -67,13 +145,17 @@ ETH_BW=$(parse_iperf_gbps "$WORKDIR/iperf_eth.txt")
 ROCE_LAT_MS=$(parse_ping_avg "$WORKDIR/ping_roce.txt")
 ROCE_BW=$(parse_iperf_gbps "$WORKDIR/iperf_roce.txt")

+TB_LAT_MS=$(parse_ping_avg "$WORKDIR/ping_tb.txt")
+TB_BW=$(parse_iperf_gbps "$WORKDIR/iperf_tb.txt")
+
 RDMA_LAT_US=$(parse_rdma_lat_us "$WORKDIR/rdma_lat_cli.txt")
 RDMA_BW_MIB=$(parse_rdma_bw_mib "$WORKDIR/rdma_bw_cli.txt")

 # Convert units for dual display
-ETH_LAT_US=$(python3 -c "print(f'{float(${ETH_LAT_MS:-0}) * 1000:.2f}')")
-ROCE_LAT_US=$(python3 -c "print(f'{float(${ROCE_LAT_MS:-0}) * 1000:.2f}')")
-RDMA_LAT_MS=$(python3 -c "print(f'{float(${RDMA_LAT_US:-0}) / 1000:.3f}')")
+ETH_LAT_US=$(python3 -c "print(f'{float(${ETH_LAT_MS:-0}) * 1000:.2f}')" 2>/dev/null || echo "0.00")
+ROCE_LAT_US=$(python3 -c "print(f'{float(${ROCE_LAT_MS:-0}) * 1000:.2f}')" 2>/dev/null || echo "0.00")
+TB_LAT_US=$(python3 -c "print(f'{float(${TB_LAT_MS:-0}) * 1000:.2f}')" 2>/dev/null || echo "0.00")
+RDMA_LAT_MS=$(python3 -c "print(f'{float(${RDMA_LAT_US:-0}) / 1000:.3f}')" 2>/dev/null || echo "0.00")

 RDMA_BW_GBPS=$(python3 - <<EOF
 import sys
@@ -88,9 +170,18 @@ EOF
 echo
 echo "=== Network Comparison ==="
 echo
-printf "%-20s %-15s %-15s %-12s\n" "Path" "Latency (ms)" "Latency (us)" "Bandwidth"
-echo "----------------------------------------------------------------"
-printf "%-20s %-15s %-15s %-12s\n" "Ethernet (1G LAN)" "${ETH_LAT_MS} ms" "${ETH_LAT_US} us" "${ETH_BW} Gbps"
-printf "%-20s %-15s %-15s %-12s\n" "Ethernet (RoCE NIC)" "${ROCE_LAT_MS} ms" "${ROCE_LAT_US} us" "${ROCE_BW} Gbps"
-printf "%-20s %-15s %-15s %-12s\n" "RDMA (RoCE)" "${RDMA_LAT_MS} ms" "${RDMA_LAT_US} us" "${RDMA_BW_GBPS} Gbps"
+printf "%-25s %-15s %-15s %-12s\n" "Path" "Latency (ms)" "Latency (us)" "Bandwidth"
+echo "-----------------------------------------------------------------------"
+if [ "$RUN_ETH" = true ]; then
+    printf "%-25s %-15s %-15s %-12s\n" "Ethernet (1G LAN)" "${ETH_LAT_MS:-0.00} ms" "${ETH_LAT_US:-0.00} us" "${ETH_BW:-0.00} Gbps"
+fi
+if [ "$RUN_ROCE" = true ]; then
+    printf "%-25s %-15s %-15s %-12s\n" "Ethernet (RoCE NIC)" "${ROCE_LAT_MS:-0.00} ms" "${ROCE_LAT_US:-0.00} us" "${ROCE_BW:-0.00} Gbps"
+fi
+if [ "$RUN_TB" = true ]; then
+    printf "%-25s %-15s %-15s %-12s\n" "Ethernet (Thunderbolt)" "${TB_LAT_MS:-0.00} ms" "${TB_LAT_US:-0.00} us" "${TB_BW:-0.00} Gbps"
+fi
+if [ "$RUN_RDMA" = true ]; then
+    printf "%-25s %-15s %-15s %-12s\n" "RDMA (RoCE)" "${RDMA_LAT_MS:-0.00} ms" "${RDMA_LAT_US:-0.00} us" "${RDMA_BW_GBPS:-0.00} Gbps"
+fi
 echo
@@ -45,6 +45,8 @@ This guide details how to configure a two-node **AMD Strix Halo** cluster linked

 ## 2. Concepts & Architecture

+![concepts](concepts.png)
+
 To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:

 *   **vLLM**: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using **Tensor Parallelism (TP)**.
@@ -55,15 +57,20 @@ To fully utilize the Strix Halo cluster, it is helpful to understand the technol
    *   **With RDMA**: Latency is ~5µs.
    *   **Why it matters**: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.

+
 ---

 ## 3. Hardware Prerequisites

+![cluster](cluster.png)
+
+
 *   **Nodes**: 2x [Framework Desktop Mainboards](https://frame.work/gb/en/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0006) with AMD Ryzen AI MAX+ "Strix Halo", 128GB of Unified Memory.
 *   **Network Cards**: [Intel Ethernet Controller E810-CQDA1](https://www.intel.com/content/www/us/en/products/sku/192558/intel-ethernet-network-adapter-e810cqda1/specifications.html) (or similar 100GbE QSFP28).
 *   **Connection**: Direct Attach Copper (DAC) cable (e.g., [QSFPTEK 100G QSFP28 DAC](https://www.amazon.co.uk/dp/B09F32F7VK)). No switch required for 2 nodes.
 *   **PCIe Note**: The Framework motherboard PCIe slot is physically **x4**, so a riser is required to plug in a 16x card (e.g., [CY PCI-E Express 4x to 16x Extender](https://www.amazon.co.uk/dp/B0837FZFJ6)). **Test Setup Note:** One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. **This is not recommended for users.** Risers are the cheaper, safer, and easier solution. Performance is identical (~50Gbps bandwidth, ~5µs latency).

+
 ---

 ## 4. Host Configuration (Fedora)
@@ -326,3 +333,61 @@ If you see link issues, ensure your Intel E810 firmware is up to date using the

 *   **Reddit - Strix Halo Batching with Tensor Parallel**: [Thread by Hungry_Elk_3276](https://www.reddit.com/r/LocalLLaMA/comments/1p8nped/strix_halo_batching_with_tensor_parallel_and/)
    *   Special thanks to user **Hungry_Elk_3276** for their initial experiments with vLLM RDMA, which highlighted the missing `gfx1151` support in upstream RCCL.
+
+---
+
+## 9. Alternative: Thunderbolt Networking
+
+If you do not have dedicated 100GbE RDMA network cards, you can directly connect the two nodes using a high-quality **Thunderbolt 4 / USB4 cable**. This will create a `thunderbolt0` network interface.
+
+While it lacks the ultra-low microprocessor-level latency of RDMA, it provides significantly more bandwidth than standard 1GbE/5GbE Ethernet and is easier to configure.
+
+>**Note**: `thunderbolt-net` relies on standard OS kernel TCP/IP stacks.
+
+### 9.1 Thunderbolt Configuration
+
+**1. Establish Connection:**
+Connect the nodes directly using a certified Thunderbolt 4 or USB4 cable. Verify the link is active:
+```bash
+ip link show thunderbolt0
+```
+
+**2. Network Configuration (Head - Node 1):**
+Configure a persistent connection using `nmcli` with a static IP and Jumbo Frames (reduces CPU overhead).
+*Note: Jumbo Frames may be unsupported on some Thunderbolt host controllers.*
+```bash
+sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.1/24 mtu 9000
+sudo nmcli connection up thunderbolt0
+```
+
+**3. Network Configuration (Worker - Node 2):**
+```bash
+sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.2/24 mtu 9000
+sudo nmcli connection up thunderbolt0
+```
+
+**4. Firewall Rules:**
+To ensure Ray and NCCL can communicate freely over this link:
+```bash
+# Assign the interface to the trusted zone permanently
+sudo firewall-cmd --permanent --zone=trusted --add-interface=thunderbolt0
+sudo firewall-cmd --reload
+```
+
+### 9.2 Running vLLM over Thunderbolt
+
+Our cluster scripts dynamically detect the network interface based on the provided IPs. There is no need to manually export environment variables!
+
+1. Open the Toolbox: `toolbox enter vllm`
+2. Launch the cluster manager: `start-vllm-cluster`
+3. Select **Option 1 (Configure IPs)**.
+4. Set the **Head IP** explicitly to `192.168.2.1` and the **Worker IP** to `192.168.2.2`.
+5. Start the cluster normally (Option 2). The script will automatically discover and utilize `thunderbolt0` as the backend network for Ray orchestration and GPU synchronization.
+
+### 9.3 Validating the Link
+I have added Thunderbolt support to the `compare_eth_vs_rdma.sh` script. Run it from inside the toolbox to see the latency and bandwidth of your Thunderbolt link compared to your other network interfaces.
+
+You can use the `-t` flag to ONLY benchmark the Thunderbolt connection (or `-e`, `-r`, `-i` for the others):
+```bash
+/opt/compare_eth_vs_rdma.sh -t
+```
@@ -2,13 +2,17 @@ import subprocess
 import time
 import os

-def get_net_iface(ip_prefix="192.168.100"):
+def get_net_iface(ip_prefix=None):
    """
    Auto-detects the interface that serves the cluster network.
-    Assumes standard 192.168.100.x setup from start_vllm_cluster.py
+    Assumes standard 192.168.100.x setup from start_vllm_cluster.py, but parameterizable.
    """
+    if ip_prefix is None:
+        head_ip = os.getenv("VLLM_HEAD_IP", "192.168.100.1")
+        ip_prefix = ".".join(head_ip.split('.')[:3])
+        
    try:
-        # ip -o addr show | grep 192.168.100
+        # ip -o addr show | grep <ip_prefix>
        cmd = f"ip -o addr show | grep {ip_prefix}"
        res = subprocess.check_output(cmd, shell=True, text=True).strip()
        # Output format: 2: eth0    inet 192.168.100.1/24 ...
@@ -31,35 +35,77 @@ def get_subnet_from_ip(ip):
    parts = ip.split('.')
    return f"{parts[0]}.{parts[1]}.{parts[2]}.0/24"

-def stop_cluster(nodes=None):
+def stop_cluster(worker_ip=None):
    """
-    Stops Ray on the given nodes (list of IPs). 
-    If nodes is None, does nothing (caller should identify nodes first if needed, 
-    but typically for a clean start we might just rely on 'ray stop' on each setup).
-    Actually, to be safe, we can try to stop local ray.
+    Stops Ray locally and on the worker node if provided.
    """
+    print("Stopping Ray cluster locally...")
    subprocess.run(["ray", "stop", "--force"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    
+    if worker_ip:
+        print(f"Stopping Ray cluster on worker ({worker_ip})...")
+        ssh_cmd = [
+            "ssh", "-o", "StrictHostKeyChecking=no", worker_ip, 
+            "toolbox", "run", "-c", "vllm", "--", "ray", "stop", "--force"
+        ]
+        try:
+            subprocess.run(ssh_cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        except subprocess.CalledProcessError as e:
+            print(f"Warning: Failed to stop worker node completely: {e}")

 def setup_worker_node(worker_ip, head_ip): 
    subnet = get_subnet_from_ip(worker_ip)
    
-    # Script to run on worker
+    # Read overrides from current env
+    nccl_disable_val = os.getenv("NCCL_IB_DISABLE", "0")
+    nccl_debug_val = os.getenv("NCCL_DEBUG", "")
+    
    script = f"""
    source /etc/profile
-    # Silece the kill command
+    # Silence the kill command
    ray stop --force > /dev/null 2>&1 || true
+    
+    # Calculate Interface dynamically
+    RDMA_IFACE=$(ip -o addr show to {subnet} | awk '{{print $2}}' | head -n1)
+    
+    echo "\\n--- Ray Worker Environment ({worker_ip}) ---"
+    echo "export RAY_DISABLE_METRICS=1"
+    echo "export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1"
+    echo "export RAY_memory_monitor_refresh_ms=0"
+    echo "export VLLM_HOST_IP={worker_ip}"
+    echo "export RDMA_IFACE=$RDMA_IFACE"
+    echo "export NCCL_SOCKET_IFNAME=$RDMA_IFACE"
+    echo "export GLOO_SOCKET_IFNAME=$RDMA_IFACE"
+    echo "export NCCL_IB_TIMEOUT=23"
+    echo "export NCCL_IB_RETRY_CNT=7"
+    echo "export NCCL_IB_DISABLE={nccl_disable_val}"
+    
    export RAY_DISABLE_METRICS=1
    export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
    export RAY_memory_monitor_refresh_ms=0
    export VLLM_HOST_IP={worker_ip}
-    export RDMA_IFACE=$(ip -o addr show to {subnet} | awk '{{print $2}}' | head -n1)
+    export RDMA_IFACE=$RDMA_IFACE
    export NCCL_SOCKET_IFNAME=$RDMA_IFACE
    export GLOO_SOCKET_IFNAME=$RDMA_IFACE
    # Stability for RDMA
    export NCCL_IB_TIMEOUT=23
    export NCCL_IB_RETRY_CNT=7
-    echo "Starting Ray Worker on {worker_ip} connecting to {head_ip}..."
-    ray start --address='{head_ip}:6379' --num-gpus=1 --num-cpus=8 --disable-usage-stats --include-dashboard=false
+    export NCCL_IB_DISABLE={nccl_disable_val}
+    """
+    if nccl_debug_val:
+        script += f"""
+    echo "export NCCL_DEBUG={nccl_debug_val}"
+    echo "export NCCL_DEBUG_SUBSYS=INIT,NET"
+    export NCCL_DEBUG={nccl_debug_val}
+    export NCCL_DEBUG_SUBSYS=INIT,NET
+    """
+    
+    script += f"""
+    echo "\\nStarting Ray Worker on {worker_ip} connecting to {head_ip}..."
+    if [ "{nccl_disable_val}" = "1" ]; then
+        echo "Note: Worker is configured with NCCL_IB_DISABLE=1 (Ethernet Forced)"
+    fi
+    ray start --address='{head_ip}:6379' --num-gpus=1 --num-cpus=8 --disable-usage-stats
    """
    
    print(f"Setting up Worker Node ({worker_ip})...")
@@ -83,20 +129,55 @@ def setup_head_node(head_ip):
    
    print(f"Setting up Head Node ({head_ip})...")
    
+    # Read overrides from current env
+    nccl_disable_val = os.getenv("NCCL_IB_DISABLE", "0")
+    nccl_debug_val = os.getenv("NCCL_DEBUG", "")
+    
    script = f"""
    # Silence the kill command
    ray stop --force > /dev/null 2>&1 || true
+    
+    # Calculate Interface dynamically
+    RDMA_IFACE=$(ip -o addr show to {subnet} | awk '{{print $2}}' | head -n1)
+    
+    echo "\\n--- Ray Head Environment ({head_ip}) ---"
+    echo "export RAY_DISABLE_METRICS=1"
+    echo "export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1"
+    echo "export RAY_memory_monitor_refresh_ms=0"
+    echo "export VLLM_HOST_IP={head_ip}"
+    echo "export RDMA_IFACE=$RDMA_IFACE"
+    echo "export NCCL_SOCKET_IFNAME=$RDMA_IFACE"
+    echo "export GLOO_SOCKET_IFNAME=$RDMA_IFACE"
+    echo "export NCCL_IB_TIMEOUT=23"
+    echo "export NCCL_IB_RETRY_CNT=7"
+    echo "export NCCL_IB_DISABLE={nccl_disable_val}"
+
    export RAY_DISABLE_METRICS=1
    export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
    export RAY_memory_monitor_refresh_ms=0
    export VLLM_HOST_IP={head_ip}
-    export RDMA_IFACE=$(ip -o addr show to {subnet} | awk '{{print $2}}' | head -n1)
+    export RDMA_IFACE=$RDMA_IFACE
    export NCCL_SOCKET_IFNAME=$RDMA_IFACE
    export GLOO_SOCKET_IFNAME=$RDMA_IFACE
    # Stability for RDMA
    export NCCL_IB_TIMEOUT=23
    export NCCL_IB_RETRY_CNT=7
-    echo "Starting Ray Head on {head_ip}..."
+    export NCCL_IB_DISABLE={nccl_disable_val}
+    """
+    
+    if nccl_debug_val:
+        script += f"""
+    echo "export NCCL_DEBUG={nccl_debug_val}"
+    echo "export NCCL_DEBUG_SUBSYS=INIT,NET"
+    export NCCL_DEBUG={nccl_debug_val}
+    export NCCL_DEBUG_SUBSYS=INIT,NET
+    """
+    
+    script += f"""
+    echo "\\nStarting Ray Head on {head_ip}..."
+    if [ "{nccl_disable_val}" = "1" ]; then
+        echo "Note: Head is configured with NCCL_IB_DISABLE=1 (Ethernet Forced)"
+    fi
    ray start --head --port=6379 --node-ip-address={head_ip} --num-gpus=1 --num-cpus=8 --disable-usage-stats --include-dashboard=false
    """
    
@@ -4,9 +4,9 @@ set -e
 # 1. System Base & Build Tools
 # Added 'gperftools-libs' for tcmalloc (fixes double-free)
 dnf -y install --setopt=install_weak_deps=False --nodocs \
-  python3.13 python3.13-devel git rsync libatomic bash ca-certificates curl \
+  python3.12 python3.12-devel git rsync libatomic bash ca-certificates curl \
  gcc gcc-c++ binutils make ffmpeg-free \
  cmake ninja-build aria2c tar xz vim nano dialog \
  libdrm-devel zlib-devel openssl-devel pgrep \
-  numactl-devel gperftools-libs iproute libibverbs-utils patch perftest ping iperf3 \
+  numactl-devel gperftools-libs iproute libibverbs-utils patch perftest ping iperf3 perfquery \
  && dnf clean all && rm -rf /var/cache/dnf/*
@@ -51,7 +51,7 @@ printf '%s\n' \
  "export VLLM_TARGET_DEVICE=rocm" \
  "export HIP_FORCE_DEV_KERNARG=1" \
  "export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1" \
-  "export LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so.4" \
+  "export LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so.4:/opt/rocm/lib/librocm_smi64.so.1.0" \
  > /etc/profile.d/rocm-sdk.sh

 chmod 0644 /etc/profile.d/rocm-sdk.sh
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+
+while true; do
+  A_IN=$(rdma statistic | awk '/ip4InOctets/ {print $2}')
+  A_OUT=$(rdma statistic | awk '/ip4OutOctets/ {print $2}')
+  sleep 1
+  B_IN=$(rdma statistic | awk '/ip4InOctets/ {print $2}')
+  B_OUT=$(rdma statistic | awk '/ip4OutOctets/ {print $2}')
+
+  RX=$(( (B_IN - A_IN) * 8 ))
+  TX=$(( (B_OUT - A_OUT) * 8 ))
+
+  printf "%s RDMA RX: %7sbit/s  TX: %7sbit/s  SUM: %7sbit/s\n" \
+    "$(date +%T)" \
+    "$(numfmt --to=iec $RX)" \
+    "$(numfmt --to=iec $TX)" \
+    "$(numfmt --to=iec $((RX+TX)))"
+done
@@ -10,6 +10,7 @@ MODEL_TABLE = {
    
    "google/gemma-3-12b-it": {
        "trust_remote": False,
+        "enforce_eager": True,
        "valid_tp": [1, 2],
        "max_num_seqs": "64",
        "max_tokens": "32768" 
@@ -68,7 +69,7 @@ MODEL_TABLE = {
    # 5. Qwen 80B AWQ
    # Size: ~48GB. Fits on 2x32GB (64GB). Leftover for Cache: ~16GB.
    # Config: 20k ctx fits in that cache. Eager mode required for stability.
-     "dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16": {
+    "dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16": {
        "trust_remote": True,
        "valid_tp": [1], # Too big for single GPU
        "max_num_seqs": "64", # Large Model / Bandwidth Constrained
@@ -77,6 +78,15 @@ MODEL_TABLE = {
        "env": {"VLLM_USE_TRITON_AWQ": "1"} # Fixes "Unsupported Hardware" error
    },

+    "mratsim/MiniMax-M2.5-BF16-INT4-AWQ": {
+        "trust_remote": True,
+        "valid_tp": [2],
+        "max_num_seqs": "64",
+        "max_tokens": "16384",
+        "enforce_eager": False,
+        "env": {"VLLM_USE_TRITON_AWQ": "1"} # Fixes "Unsupported Hardware" error
+    },
+
 }

 MODELS_TO_RUN = [
@@ -89,6 +99,7 @@ MODELS_TO_RUN = [
    "btbtyler09/Qwen3-Coder-30B-A3B-Instruct-gptq-4bit",
    "btbtyler09/Qwen3-Coder-30B-A3B-Instruct-gptq-8bit",
    "dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16",
+    "mratsim/MiniMax-M2.5-BF16-INT4-AWQ",
 ]

 # Hardware / Global Defaults
@@ -0,0 +1,60 @@
+import sys
+import re
+import glob
+from pathlib import Path
+from unittest.mock import MagicMock
+
+def patch_vllm():
+    print("Applying Strix Halo patches to vLLM...")
+
+    # Patch 1: vllm/platforms/__init__.py
+    p_init = Path('vllm/platforms/__init__.py')
+    if p_init.exists():
+        txt = p_init.read_text()
+        txt = txt.replace('import amdsmi', '# import amdsmi')
+        txt = re.sub(r'is_rocm = .*', 'is_rocm = True', txt)
+        txt = re.sub(r'if len\(amdsmi\.amdsmi_get_processor_handles\(\)\) > 0:', 'if True:', txt)
+        txt = txt.replace('amdsmi.amdsmi_init()', 'pass')
+        txt = txt.replace('amdsmi.amdsmi_shut_down()', 'pass')
+        p_init.write_text(txt)
+        print(" -> Patched vllm/platforms/__init__.py")
+
+    # Patch 2: vllm/platforms/rocm.py
+    p_rocm = Path('vllm/platforms/rocm.py')
+    if p_rocm.exists():
+        txt = p_rocm.read_text()
+        header = 'import sys\nfrom unittest.mock import MagicMock\nsys.modules["amdsmi"] = MagicMock()\n'
+        txt = header + txt
+        txt = txt.replace('def _get_gcn_arch() -> str:', 'def _get_gcn_arch() -> str:\n    return "gfx1151"\n\ndef _old_get_gcn_arch() -> str:')
+        txt = re.sub(r'device_type = .*', 'device_type = "rocm"', txt)
+        txt = re.sub(r'device_name = .*', 'device_name = "gfx1151"', txt)
+        txt += '\n    def get_device_name(self, device_id: int = 0) -> str:\n        return "AMD-gfx1151"\n'
+        p_rocm.write_text(txt)
+        print(" -> Patched vllm/platforms/rocm.py")
+
+    # Patch 3: CUDA/HIP Macro injections for PyTorch Nightly Compatibility
+    macro_def = """
+#ifndef C10_HIP_CHECK
+#define C10_HIP_CHECK(error) do { if (error != hipSuccess) { abort(); } } while(0)
+#endif
+#ifndef C10_CUDA_CHECK
+#define C10_CUDA_CHECK(error) do { if (error != cudaSuccess) { abort(); } } while(0)
+#endif
+"""
+    # Apply to all .cu and .hip files in csrc
+    csrc_files = glob.glob('csrc/**/*.cu', recursive=True) + glob.glob('csrc/**/*.hip', recursive=True)
+    patched_csrc_count = 0
+    for f in csrc_files:
+        p_f = Path(f)
+        if p_f.exists():
+            txt = p_f.read_text()
+            # Only prepend if not already patched to avoid duplicate macros
+            if "C10_CUDA_CHECK" not in txt:
+                p_f.write_text(macro_def + '\n' + txt)
+                patched_csrc_count += 1
+    
+    print(f" -> Patched {patched_csrc_count} C/C++ source files with missing macros.")
+    print("Successfully patched vLLM for Strix Halo.")
+
+if __name__ == "__main__":
+    patch_vllm()
@@ -36,47 +36,6 @@ else:
 HOST = os.getenv("HOST", "0.0.0.0")
 PORT = os.getenv("PORT", "8000")

-def get_discovered_models():
-    """
-    Overrides the hardcoded MODELS_TO_RUN by looking at what we actually have results for.
-    This allows the UI to show all verified models, not just what's enabled for benchmarking.
-    """
-    if not RESULTS_FILE.exists():
-        return MODELS_TO_RUN
-        
-    try:
-        with open(RESULTS_FILE, "r") as f:
-            data = json.load(f)
-            
-        # 1. Find all models with at least one success
-        verified_models = set()
-        for r in data:
-            if r.get("status") == "success":
-                verified_models.add(r["model"])
-        
-        # 2. Filter: Must be in MODEL_TABLE (so we have config/valid_tp)
-        #    and must be in our verified list
-        final_list = []
-        for m in sorted(list(verified_models)):
-            if m in MODEL_TABLE:
-                final_list.append(m)
-                
-        if final_list:
-            return final_list
-            
-    except Exception as e:
-        print(f"Warning: Model discovery failed ({e}). Using default list.")
-        
-    return MODELS_TO_RUN
-
-# Refresh the list of models to run based on what we found
-MODELS_TO_RUN = get_discovered_models()
-
-def check_dependencies():
-    if not shutil.which("dialog"):
-        print("Error: 'dialog' is required. Please install it (apt-get install dialog).")
-        sys.exit(1)
-
 def detect_gpus():
    """Detects AMD GPUs via rocm-smi or /dev/dri."""
    try:
@@ -93,6 +52,63 @@ def detect_gpus():
    except:
        return 1

+def get_discovered_models():
+    """
+    Overrides the hardcoded MODELS_TO_RUN by looking at what we actually have results for.
+    This allows the UI to show all verified models, not just what's enabled for benchmarking.
+    """
+    try:
+        if RESULTS_FILE.exists():
+            with open(RESULTS_FILE, "r") as f:
+                data = json.load(f)
+                
+            # 1. Find all models with at least one success
+            verified_models = set()
+            for r in data:
+                if r.get("status") == "success":
+                    verified_models.add(r["model"])
+            
+            # 2. Filter: Must be in MODEL_TABLE (so we have config/valid_tp)
+            #    and must be in our verified list (if results exist)
+            final_list = []
+            gpu_count = detect_gpus()
+            
+            for m in sorted(list(verified_models)):
+                if m in MODEL_TABLE:
+                    # Check valid_tp
+                    valid_tps = MODEL_TABLE[m].get("valid_tp", [1])
+                    min_required = min(valid_tps)
+                    
+                    if min_required <= gpu_count:
+                        final_list.append(m)
+                    
+            if final_list:
+                return final_list
+            
+    except Exception as e:
+        print(f"Warning: Model discovery failed ({e}). Using default list.")
+        
+    # Fallback if no results file or error: return all models compatible with current hardware
+    gpu_count = detect_gpus()
+    compatible_models = []
+    
+    for m in MODELS_TO_RUN:
+        if m in MODEL_TABLE:
+            valid_tps = MODEL_TABLE[m].get("valid_tp", [1])
+            min_required = min(valid_tps)
+            if min_required <= gpu_count:
+                compatible_models.append(m)
+                
+    return compatible_models
+
+# Refresh the list of models to run based on what we found
+MODELS_TO_RUN = get_discovered_models()
+
+def check_dependencies():
+    if not shutil.which("dialog"):
+        print("Error: 'dialog' is required. Please install it (apt-get install dialog).")
+        sys.exit(1)
+
 def get_verified_config(model_id, tp_size, max_seqs):
    """
    Reads max_context_results.json to find the best verified configuration.
@@ -306,6 +322,7 @@ def configure_and_launch(model_idx, gpu_count):
    
    # Env Vars
    env = os.environ.copy()
+    env["VLLM_DISABLE_COMPILE_CACHE"] = "1"
    env.update(config.get("env", {}))
    
    if use_rocm_attn:
@@ -318,7 +335,15 @@ def configure_and_launch(model_idx, gpu_count):
    print(f" Backend:   {'ROCm' if use_rocm_attn else 'Triton'}")
    if clear_cache:
        print(f" Action:    Clearing vLLM Cache (~/.cache/vllm)")
-    print(f" Command:   {' '.join(cmd)}")
+        
+    # Variables that represent the custom environment overrides for models
+    custom_env = config.get("env", {})
+    if custom_env:
+        print("\n --- Environment Variables ---")
+        for k, v in custom_env.items():
+            print(f" export {k}={v}")
+            
+    print(f"\n Command:   {' '.join(cmd)}")
    print("="*60 + "\n")
    
    os.execvpe("vllm", cmd, env)
@@ -41,29 +41,8 @@ def get_discovered_models():
    """
    Overrides the hardcoded MODELS_TO_RUN by looking at what we actually have results for.
    """
-    if not RESULTS_FILE.exists():
-        return MODELS_TO_RUN
-        
-    try:
-        with open(RESULTS_FILE, "r") as f:
-            data = json.load(f)
-            
-        verified_models = set()
-        for r in data:
-            if r.get("status") == "success":
-                verified_models.add(r["model"])
-        
-        final_list = []
-        for m in sorted(list(verified_models)):
-            if m in MODEL_TABLE:
-                final_list.append(m)
-                
-        if final_list:
-            return final_list
-            
-    except Exception as e:
-        print(f"Warning: Model discovery failed ({e}). Using default list.")
-        
+    # Bypass verification check for Cluster Launcher
+    # We want to see ALL models, including those that require TP > 1 (which find_max_context might have skipped)
    return MODELS_TO_RUN

 # Refresh the list of models to run based on what we found
@@ -117,12 +96,13 @@ def check_ray_status():
 def wait_for_cluster():
    return cluster_manager.wait_for_cluster()

-def nuke_vllm_cache():
+def nuke_vllm_cache(head_ip):
    # Only nukes local cache on the head node for now, or use cluster nuke?
    # The original script just did local nuke.
    # cluster_manager has nuke_vllm_cache_on_node and nuke_vllm_cache_cluster
    # Let's use the local ip one effectively
-    rdma = cluster_manager.get_net_iface()
+    prefix = ".".join(head_ip.split('.')[:3])
+    rdma = cluster_manager.get_net_iface(prefix)
    local = cluster_manager.get_local_ip(rdma)
    cluster_manager.nuke_vllm_cache_on_node(local, is_local=True)

@@ -265,7 +245,7 @@ def configure_and_launch_vllm(model_idx, head_ip):
    subprocess.run(["clear"])
    
    if clear_cache:
-        nuke_vllm_cache()
+        nuke_vllm_cache(head_ip)
    
    # Environment Setup
    # We need to set these variables in the current process before exec or pass them in env
@@ -283,11 +263,11 @@ def configure_and_launch_vllm(model_idx, head_ip):
    print(f"Detected RDMA Interface: {rdma_iface}")
    
    env = os.environ.copy()
+    env["VLLM_DISABLE_COMPILE_CACHE"] = "1"
    env["RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"] = "1"
    env["VLLM_HOST_IP"] = head_ip
    env["NCCL_SOCKET_IFNAME"] = rdma_iface
    env["NCCL_IB_GID_INDEX"] = "1"
-    env["NCCL_IB_DISABLE"] = "0"
    env["NCCL_NET_GDR_LEVEL"] = "0"
    
    # Also need this for Ray backend?
@@ -318,38 +298,75 @@ def configure_and_launch_vllm(model_idx, head_ip):
    print(f" Config:    TP={current_tp} | Seqs={current_seqs} | Ctx={current_ctx}")
    if use_eager:
        print(" Note:      Eager Mode Enabled (Recommended for Cluster Stability)")
-    print(f" Command:   {' '.join(cmd)}")
+        
+    print("\n --- Environment Variables ---")
+    vars_to_print = [
+        "RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES",
+        "VLLM_HOST_IP",
+        "NCCL_SOCKET_IFNAME",
+        "NCCL_IB_GID_INDEX",
+        "NCCL_NET_GDR_LEVEL"
+    ]
+    for k in vars_to_print:
+        if k in env:
+            print(f" export {k}={env[k]}")
+            
+    print(f"\n Command:   {' '.join(cmd)}")
    print("="*60 + "\n")
    
    # Exec
    os.execvpe("vllm", cmd, env)

+def setup_ips_dialog(current_head, current_worker):
+    # Using a form to edit both IPs
+    # label y x item y x flen ilen
+    form_args = [
+        "--title", "Cluster Configuration",
+        "--form", "Enter IP addresses for Head and Worker nodes:",
+        "10", "60", "2",
+        "Head Node IP:", "1", "1",  current_head, "1", "20", "20", "0",
+        "Worker Node IP:", "2", "1", current_worker, "2", "20", "20", "0"
+    ]
+    
+    result = run_dialog(form_args)
+    if not result:
+        return None
+        
+    lines = result.splitlines()
+    if len(lines) >= 2:
+        return lines[0].strip(), lines[1].strip()
+    return None
+
+
 def main():
    check_dependencies()
    
    # Default IPs
-    head_ip = "192.168.100.1"
-    worker_ip = "192.168.100.2"
+    head_ip = os.getenv("VLLM_HEAD_IP", "192.168.100.1")
+    worker_ip = os.getenv("VLLM_WORKER_IP", "192.168.100.2")
    
    while True:
        # Main Menu
        # 1. Configure IPs
        # 2. Start Cluster (Ray)
-        # 3. Start VLLM
-        # 4. Exit
+        # 3. Stop Ray Cluster
+        # 4. Ray Cluster Status
+        # 5. Launch VLLM Serve
+        # 6. Exit
        
        choice = run_dialog([
            "--clear", "--backtitle", "AMD VLLM RCCL Cluster Manager",
            "--title", "Main Menu",
-            "--menu", "Select Action:", "15", "60", "5",
+            "--menu", "Select Action:", "16", "60", "6",
            "1", f"Configure IPs (Head: {head_ip}, Worker: {worker_ip})",
            "2", "Start Ray Cluster",
-            "3", "Ray Cluster Status",
-            "4", "Launch VLLM Serve",
-            "5", "Exit"
+            "3", "Stop Ray Cluster",
+            "4", "Ray Cluster Status",
+            "5", "Launch VLLM Serve",
+            "6", "Exit"
        ])
        
-        if not choice or choice == "5":
+        if not choice or choice == "6":
            subprocess.run(["clear"])
            sys.exit(0)
            
@@ -359,25 +376,71 @@ def main():
                head_ip, worker_ip = res
            
        elif choice == "2":
-            subprocess.run(["clear"])
-            print("= Starting Ray Cluster Setup =")
-            # 1. Start Head
-            if setup_head_node(head_ip):
-                print("Head node started successfully. Waiting 5s before worker connection...")
-                time.sleep(5)
-                # 2. Start Worker
-                if setup_worker_node(worker_ip, head_ip):
-                    # 3. Wait for full cluster
-                    wait_for_cluster()
-            input("Press Enter to continue...")
+            force_ethernet = False
+            enable_nccl_debug = False
+            
+            while True:
+                eth_status = "YES" if force_ethernet else "NO"
+                debug_status = "YES" if enable_nccl_debug else "NO"
+                
+                c_choice = run_dialog([
+                    "--clear", "--backtitle", "AMD VLLM RCCL Cluster Manager",
+                    "--title", "Cluster Network Configuration",
+                    "--menu", "Set Network Parameters before starting Ray:", "15", "65", "3",
+                    "1", f"Force Ethernet (Disable RDMA/RoCE):  {eth_status}",
+                    "2", f"Enable NCCL Debug Logging:           {debug_status}",
+                    "3", "START CLUSTER"
+                ])
+                if not c_choice: break
+                
+                if c_choice == "1":
+                    force_ethernet = not force_ethernet
+                elif c_choice == "2":
+                    enable_nccl_debug = not enable_nccl_debug
+                elif c_choice == "3":
+                    os.environ["NCCL_IB_DISABLE"] = "1" if force_ethernet else "0"
+                    if enable_nccl_debug:
+                        os.environ["NCCL_DEBUG"] = "INFO"
+                        os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
+                    else:
+                        os.environ.pop("NCCL_DEBUG", None)
+                        os.environ.pop("NCCL_DEBUG_SUBSYS", None)
+                    
+                    subprocess.run(["clear"])
+                    print("= Starting Ray Cluster Setup =")
+                    # 1. Start Head
+                    if setup_head_node(head_ip):
+                        print("Head node started successfully. Waiting 5s before worker connection...")
+                        time.sleep(5)
+                        # 2. Start Worker
+                        if setup_worker_node(worker_ip, head_ip):
+                            # 3. Wait for full cluster
+                            wait_for_cluster()
+                    input("Press Enter to continue...")
+                    break

-        elif choice == "3":
-            subprocess.run(["clear"])
            print("= Ray Cluster Status =")
            subprocess.run(["ray", "status"])
            input("\nPress Enter to continue...")
            
+        elif choice == "3":
+            subprocess.run(["clear"])
+            print("= Stopping Ray Cluster =")
+            cluster_manager.stop_cluster(worker_ip)
+            input("\nPress Enter to continue...")
+            
        elif choice == "4":
+            subprocess.run(["clear"])
+            print("= Ray Cluster Status =")
+            res = subprocess.run(["ray", "status"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+            if res.returncode != 0:
+                print("\n[!] Cluster is Offline or Unreachable.")
+                print("Please start the cluster first via Option 2 (Start Ray Cluster).")
+            else:
+                print(res.stdout)
+            input("\nPress Enter to continue...")
+            
+        elif choice == "5":
            # Select Model
            menu_items = []
            for i, m_id in enumerate(MODELS_TO_RUN):
Autor	SHA1	Zpráva	Datum
Donato Capitella	039363b819	feat: set LD_LIBRARY_PATH to include ROCm library directories.	2026-03-14 13:41:09 +00:00
Donato Capitella	cf2fd6ec11	chore: remove `fix_block_size.py` script and its execution from the Dockerfile.	2026-03-14 13:18:56 +00:00
Donato Capitella	b78e8a9d82	fix: Remove vLLM block size validation checks by adding and running a new patching script in the Dockerfile.	2026-03-13 16:29:01 +00:00
Donato Capitella	16405e8943	config: Add VLLM_DISABLE_COMPILE_CACHE=1 to environment variables across VLLM scripts.	2026-03-09 14:07:43 +00:00
Donato Capitella	8de950d9ca	feat: Override `_get_gcn_arch` function to return "gfx1151" and rename the original implementation to `_old_get_gcn_arch`.	2026-03-09 12:13:27 +00:00
Donato Capitella	fb0aef0864	Downgrade Python to 3.12 and remove the `--no-deps` flag from a pip install command in the Dockerfile.	2026-03-09 11:08:11 +00:00
Donato Capitella	9997faaa1e	build: Add `--no-deps` flag to local wheel installation.	2026-03-08 16:31:16 +00:00
Donato Capitella	8a20ec27b2	fixing https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/issues/21	2026-02-26 12:36:03 +00:00
Donato Capitella	c27835d99f	feat: Introduce v1 API structure, enhance quantization support, and expand model compatibility with various updates and new tests.	2026-02-25 11:50:23 +00:00
Donato Capitella	b035bcb482	updated benchmarks including thunderbolt and configuratuion guides	2026-02-25 10:48:42 +00:00
Donato Capitella	6875f62ccf	improve benchmarks	2026-02-25 09:29:46 +00:00
Donato Capitella	a5a7b8fe04	fix: Ignore `settings.json` and default 'TP2 (Eth)' checkbox to unchecked in documentation.	2026-02-24 08:50:18 +00:00
Donato Capitella	1af159af81	removing llvm flags as they have no impact on performance	2026-02-24 08:27:57 +00:00
Donato Capitella	e726d406fa	updated benchmarks, fix start-vllm	2026-02-23 19:39:19 +00:00
Donato Capitella	e0fadf426b	force egaer mode to make gemma stable	2026-02-23 18:19:15 +00:00
Donato Capitella	f968cb1f30	most of the time spent by devs is to ensure there is no standard way of passing flags - I have no idea why	2026-02-23 12:08:57 +00:00
Donato Capitella	fedfa3c682	Trying fix for ROCm/llvm loop unrolling bug, to see if performance improves on custom complied kernels	2026-02-23 11:43:44 +00:00
Donato Capitella	13c5a929a3	feat: refactor vLLM Strix Halo patching into a dedicated script	2026-02-23 10:33:20 +00:00
Donato Capitella	5a7f0cc676	feat: Implement temporary patch for C10_CHECK macro import missing	2026-02-23 09:49:42 +00:00
Donato Capitella	b3fcb0091f	feat: Enhance `find_max_context.py` with Ray cluster support and fix `C10_HIP_CHECK` build error in Dockerfile.	2026-02-23 09:11:30 +00:00
Donato Capitella	91b6dbc270	feat: Display environment variables and allow to choose between RoCE/Ethernet and show RCCL debug information	2026-02-22 20:07:34 +00:00
Donato Capitella	4a5d6c7855	fix broken stuff	2026-02-19 20:29:28 +00:00
Donato Capitella	726cd5ae53	remove clang patch	2026-02-18 15:23:02 +00:00
Donato Capitella	49b85fc1fb	add MiniMax	2026-02-18 15:22:12 +00:00
Donato Capitella	290beffb05	feat: Enhance quantization support for MoE layers with new FP8/INT8 configs and model-specific optimizations across various devices.	2026-02-12 11:10:28 +00:00
Donato Capitella	6754095398	feat: Introduce `measure_bandwidth.sh` script, install `perfquery`, and add the script to the Docker image for RDMA bandwidth monitoring.	2026-02-07 10:40:53 +00:00
Donato Capitella	9cf7eaeab2	fix: Correct 'buy me a coffee' URL in README.	2026-02-06 06:56:26 +00:00
Donato Capitella	c3ecb9bbd5	feat: add project context and support sections to README.	2026-02-05 17:55:30 +00:00
Donato Capitella	afe985afca	added images to RDMA guide	2026-02-03 19:47:42 +00:00