perf: Raise max_num_seqs to scale batch size against memory bandwidth, and raise OFF_NUM_PROMPTS for steady-state throughput measurement on Strix Halo

2026-02-02 22:36:15 +00:00
@@ -556,13 +556,15 @@
                usecase: "Demonstrates the raw horsepower and architectural efficiency.",
                details: `
 **Test Configuration:**
-• <b>Dataset:</b> ShareGPT (Random Sample, 100 Prompts)
+• <b>Dataset:</b> ShareGPT (Random Sample, 200 Prompts)
 • <b>Output Length:</b> 512 Tokens (Fixed)
-• <b>Batch Budget:</b> 8192 - 32768 Tokens (Dynamic per model)
+• <b>Concurrency:</b> 64 Sequences (Saturates Memory Bandwidth)
 • <b>GPU Alloc:</b> 90% VRAM per GPU
 • <b>Pipeline:</b> <code>vllm bench throughput</code> (Offline)
 • <b>Cluster Config:</b> Ray Distributed (RoCE v2 RDMA, TP=2)

+<b>Rationale:</b> A larger batch (64 concurrent sequences) makes fuller use of Strix Halo's memory bandwidth and so raises throughput, while a larger sample (200 prompts) keeps the run long enough to measure sustained, steady-state performance.
+
 <b>Metric:</b> Tokens per Second (higher is better).`,
                unit: " tok/s"
            },