perf: Increase max_num_seqs for bus batch scaling and OFF_NUM_PROMPTS for steady-state throughput measurement on Strix Halo.
+4
-2
@@ -556,13 +556,15 @@
usecase: "Demonstrates the raw horsepower and architectural efficiency.",
details: `
**Test Configuration:**
- • <b>Dataset:</b> ShareGPT (Random Sample, 100 Prompts)
+ • <b>Dataset:</b> ShareGPT (Random Sample, 200 Prompts)
• <b>Output Length:</b> 512 Tokens (Fixed)
• <b>Batch Budget:</b> 8192 - 32768 Tokens (Dynamic per model)
• <b>Concurrency:</b> 64 Sequences (Saturates Memory Bandwidth)
• <b>GPU Alloc:</b> 90% VRAM per GPU
• <b>Pipeline:</b> <code>vllm bench throughput</code> (Offline)
• <b>Cluster Config:</b> Ray Distributed (RoCE v2 RDMA, TP=2)

<b>Rationale:</b> Throughput is maximized by increasing batch size (64) to utilize the massive memory bandwidth of Strix Halo, and running more prompts (200) to measure sustained steady-state performance.

<b>Metric:</b> Tokens per Second (higher is better).`,
unit: " tok/s"
},
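For reference, the effect of the prompt-count change on the measured workload can be worked out from the configuration in the diff. The sketch below is illustrative only: the function and constant names are hypothetical and not part of this repository, and it assumes the reported tok/s counts generated (output) tokens. The diff does not say whether prompt tokens are also included in the metric.

```javascript
// Illustrative sketch; names are hypothetical, values are taken from the diff.
const OUTPUT_LEN = 512;          // "Output Length: 512 Tokens (Fixed)"
const OLD_NUM_PROMPTS = 100;     // before this commit
const NEW_NUM_PROMPTS = 200;     // after this commit

// Generated tokens in one benchmark run, given a fixed output length.
function outputTokens(numPrompts, outputLen) {
  return numPrompts * outputLen;
}

// Throughput in tokens per second for a run that took elapsedSeconds.
// Assumption: only generated tokens are counted.
function tokensPerSecond(totalTokens, elapsedSeconds) {
  if (!(elapsedSeconds > 0)) {
    throw new RangeError("elapsedSeconds must be positive");
  }
  return totalTokens / elapsedSeconds;
}

const before = outputTokens(OLD_NUM_PROMPTS, OUTPUT_LEN); // 51200
const after = outputTokens(NEW_NUM_PROMPTS, OUTPUT_LEN);  // 102400
console.log(`generated tokens per run: ${before} -> ${after}`);
```

Doubling the prompt count doubles the generated tokens per run, so a larger share of the measured time is spent at full concurrency rather than in ramp-up and drain, which is the steady-state argument made in the rationale.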