AMD Strix Halo (gfx1151) — vLLM Toolbox/Container
A Fedora 43 Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on TheRock nightly builds for ROCm.
Table of Contents
- Tested Models (Benchmarks)
- 1) Toolbx vs Docker/Podman
- 2) Quickstart — Fedora Toolbx
- 3) Quickstart — Ubuntu (Distrobox)
- 4) Testing the API
- 5) Use a Web UI for Chatting
Tested Models (Benchmarks)
View full benchmarks at: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/
Table Key: Cell values represent Max Context Length (GPU Memory Utilization).
| Model | TP | 1 Req | 4 Reqs | 8 Reqs | 16 Reqs |
|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| google/gemma-3-12b-it | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| openai/gpt-oss-20b | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| Qwen/Qwen3-14B-AWQ | 1 | 40k (0.90) | 40k (0.90) | 40k (0.90) | 40k (0.90) |
| cpatonn/Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit | 1 | 256k (0.95) | 204k (0.90) | - | - |
| dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 | 1 | 256k (0.90) | - | - | - |
| openai/gpt-oss-120b | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
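Each cell pairs a context length with a GPU memory utilization value, which presumably correspond to vLLM's `--max-model-len` and `--gpu-memory-utilization` options. As a rough sketch of how one cell might translate into a manual launch, the hypothetical helper below only prints a `vllm serve` command line. It assumes that "k" means 1024 tokens and that the "N Reqs" columns map to `--max-num-seqs`; the table states neither.

```shell
# Hypothetical helper: print (do not run) a vllm serve command for one table cell.
# Assumptions: "k" means 1024 tokens; the "N Reqs" column maps to --max-num-seqs.
vllm_cmd_for_cell() {
  model="$1"; ctx_k="$2"; gpu_util="$3"; reqs="$4"
  printf 'vllm serve %s --tensor-parallel-size 1 --max-model-len %s --gpu-memory-utilization %s --max-num-seqs %s\n' \
    "$model" "$((ctx_k * 1024))" "$gpu_util" "$reqs"
}

# Example: Qwen/Qwen3-14B-AWQ, 4 concurrent requests, 40k context at 0.90
vllm_cmd_for_cell Qwen/Qwen3-14B-AWQ 40 0.90 4
```

Printing the command instead of running it makes it easy to review or adjust the flags before committing GPU memory to a launch.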
1) Toolbx vs Docker/Podman
The kyuz0/vllm-therock-gfx1151:latest image can be used in two ways:
- Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
- Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.
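For the Docker/Podman route, a launch command could look roughly like the sketch below. The device, group, and seccomp flags are the ones used in the Toolbx quickstart; `--network host` and `--ipc host` are standard Podman options standing in for the "host networking, IPC tuning" mentioned above. The in-container cache path `/root/.cache/huggingface` and the trailing `vllm serve` invocation are assumptions about this image, not something this README confirms. The function only prints the command so it can be reviewed before running.

```shell
# Sketch: print a podman command that serves a model with weights kept on the host.
# Assumptions (not confirmed by this README): the container user is root, so the
# Hugging Face cache lives at /root/.cache/huggingface, and `vllm` is on PATH.
podman_serve_cmd() {
  model="$1"
  weights_dir="${2:-$HOME/.cache/huggingface}"   # host directory for model weights
  printf '%s \\\n' \
    "podman run --rm -it" \
    "  --device /dev/kfd --device /dev/dri" \
    "  --group-add video --group-add render" \
    "  --security-opt seccomp=unconfined" \
    "  --network host --ipc host" \
    "  -v $weights_dir:/root/.cache/huggingface"
  printf '  %s vllm serve %s\n' "docker.io/kyuz0/vllm-therock-gfx1151:latest" "$model"
}

podman_serve_cmd openai/gpt-oss-20b
```

Passing a second argument, for example `podman_serve_cmd openai/gpt-oss-20b /srv/models`, swaps in a different host directory for the weights.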
2) Quickstart — Fedora Toolbx
Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:
toolbox create vllm \
--image docker.io/kyuz0/vllm-therock-gfx1151:latest \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --security-opt seccomp=unconfined
Enter it:
toolbox enter vllm
Model storage: Models are downloaded to ~/.cache/huggingface by default. Because Toolbx shares your home directory with the host, this directory lives on the host and downloads persist even if the toolbox is removed and recreated.
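To confirm where downloads will land, something like the following can be run inside the toolbox. It relies on the `HF_HOME` environment variable, which Hugging Face libraries use to override the default `~/.cache/huggingface`; the function name is only illustrative.

```shell
# Print the directory Hugging Face libraries will use for cached downloads:
# $HF_HOME if set, otherwise the default under the home directory.
hf_cache_dir() {
  printf '%s\n' "${HF_HOME:-$HOME/.cache/huggingface}"
}

cache_dir=$(hf_cache_dir)
echo "Model cache: $cache_dir"
# Show how much space downloaded models take, if the directory exists yet.
if [ -d "$cache_dir" ]; then
  du -sh "$cache_dir"
fi
```

Running the same snippet on the host should print the same path and size if the home directory is shared as expected.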
Serving a Model (Easiest Way)
The toolbox ships a TUI wizard, start-vllm, that offers pre-configured models and handles the launch flags for you:
start-vllm
Cache note: vLLM writes compiled kernels to ~/.cache/vllm/.
3) Quickstart — Ubuntu (Distrobox)
Ubuntu’s toolbox package still breaks GPU access, so use Distrobox instead:
distrobox create -n vllm \
--image docker.io/kyuz0/vllm-therock-gfx1151:latest \
--additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"
distrobox enter vllm
Verification: Run rocm-smi to check GPU status.
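If `rocm-smi` reports nothing, a quick first step is to check that the device nodes passed via `--device` are actually visible inside the container. The sketch below is a hypothetical helper, not part of the image; it checks `/dev/kfd` and `/dev/dri`, the two paths from the create command above.

```shell
# Report whether each given device node is visible; return non-zero if any is missing.
check_gpu_nodes() {
  missing=0
  for node in "$@"; do
    if [ -e "$node" ]; then
      echo "ok:      $node"
    else
      echo "missing: $node"
      missing=1
    fi
  done
  return "$missing"
}

# Check the two nodes exposed by the create command, then query the GPU
# if rocm-smi is available.
if check_gpu_nodes /dev/kfd /dev/dri && command -v rocm-smi >/dev/null 2>&1; then
  rocm-smi
fi
```

A "missing" line usually points at the container creation flags or host permissions, not at vLLM itself.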
Serving a Model (Easiest Way)
As in the Toolbx quickstart, the container ships a TUI wizard, start-vllm, that offers pre-configured models and handles the launch flags for you:
start-vllm
4) Testing the API
Once the server is up, hit the OpenAI‑compatible endpoint:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
You should receive a JSON response with the reply text in choices[0].message.content. The model value in the request must match the model the server is actually running.
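Model loading can take a while, so a first request may fail simply because the server is not listening yet. Below is a minimal polling sketch, assuming the `http://localhost:8000` address used above and treating a successful `GET /v1/models` as "ready". The function name and default timeout are illustrative.

```shell
# Poll the OpenAI-compatible /v1/models endpoint until it answers or we time out.
# Usage: wait_for_vllm [base_url] [timeout_seconds]
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  timeout="${2:-300}"
  waited=0
  # -f makes curl return non-zero on HTTP errors; -s silences progress output.
  until curl -sf "$base_url/v1/models" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "vLLM did not become ready within ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "vLLM is ready at $base_url"
}
```

A typical use is to chain it in front of a request, for example `wait_for_vllm && curl -s http://localhost:8000/v1/models`.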
To avoid hard-coding the model name, first ask the server which model it is serving, then use that ID in the request:
# Look up the ID of the currently served model (requires jq)
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
# Send a chat request to that model
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL\",
\"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
}"
5) Use a Web UI for Chatting
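If vLLM runs on a remote server, forward its port 8000 to your local machine over SSH. Binding the local end to 0.0.0.0 lets the ChatUI container reach the forwarded port, but it also makes the port reachable from other machines on your network: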
ssh -L 0.0.0.0:8000:localhost:8000 <vllm-host>
Then start Hugging Face ChatUI on your local machine:
docker run -p 3000:3000 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=dummy \
-v chat-ui-data:/data \
ghcr.io/huggingface/chat-ui-db