AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton)
An Arch-based Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on the PyTorch + AOTriton base to make ROCm on Strix Halo practical for day‑to‑day use.
Built on: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton
Credits: lhl (build tools/scripts), ssweens (Arch‑based Dockerfiles), and the AMD Strix Halo Home Lab Discord for testing/support.
⚠️ Status & Expectations (Experimental)
This setup is highly experimental on ROCm/Strix Halo. Some models work; many fail due to missing custom kernels, unsupported quant types, or TorchInductor/AOTriton limitations on gfx1151. The matrix below lists combinations tested so far. Please contribute fixes or additional working recipes (see Contributing).
Tested Models (Experimental Matrix)
Legend: ✅ Works (with flags) · ❌ Fails · ⚠️ Notes include the exact error/symptom seen.
| Model (Hugging Face) | Params / Quant | Status | Required flags (if any) | Notes / Errors |
|---|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 7B FP16 | ✅ Works | (recommended) --dtype float16 | Good baseline; simple serve works. |
| meta-llama/Llama-2-7b-chat-hf | 7B FP16 | ✅ Works | (recommended) --dtype float16 | Stable. |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B (A3B) FP16 | ✅ Works | (recommended) --dtype float16 | Heavy; ensure unified memory tweaks. |
| Qwen/Qwen3-14B-AWQ | 14B AWQ | ✅ Works (with flags) | --quantization awq --dtype float16 --enforce-eager | On ROCm, eager avoids missing awq_dequantize during compile; vLLM auto‑sets VLLM_USE_TRITON_AWQ. |
| openai/gpt-oss-20b | 20B MXFP4 | ❌ Fails | n/a | ModuleNotFoundError: triton_kernels.matmul_ogs (MXFP4 path not available in this image). |
| zai-org/GLM-4.5-Air-FP8 | FP8 | ❌ Fails | n/a | ValueError: type fp8e4nv not supported (only 'fp8e5'). |
| cpatonn/GLM-4.5-Air-AWQ-4bit | AWQ-4bit (MoE) | ❌ Fails | n/a | Missing custom op: torch.ops._C.gptq_marlin_repack (Marlin kernels). |
If you get a model to work, please PR a new row with: model name, exact flags, vLLM version, torch & triton versions, and a note on gfx1151 driver/kernel stack.
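When preparing a PR, a small formatter can keep new rows consistent with the column order of the matrix. This is only a sketch: `matrix_row` is a hypothetical helper name, not something shipped in the image.

```shell
# Hypothetical helper: print one pipe-delimited row for the tested-models matrix.
# Arguments: model, params/quant, status, required flags, notes.
matrix_row() {
  printf '| %s | %s | %s | %s | %s |\n' "$1" "$2" "$3" "$4" "$5"
}

# Example using values that are already listed in the matrix.
matrix_row "Qwen/Qwen3-14B-AWQ" "14B AWQ" "✅ Works (with flags)" \
  "--quantization awq --dtype float16 --enforce-eager" \
  "eager avoids missing awq_dequantize during compile"
```

Paste the printed line into the table and add the version details to the PR description.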
Table of Contents
- 1) Toolbx vs Docker/Podman
- 2) Quickstart — Fedora Toolbx (development)
- 3) Testing the API
- 4) Quickstart — Podman/Docker
- 5) Models, dtypes & storage
- 6) Performance notes (short)
- 7) Requirements (host)
- 8) Contributing
- 9) Acknowledgements & Links
- Tested Models
1) Toolbx vs Docker/Podman
The kyuz0/vllm-therock-gfx1151-aotriton:latest image can be used in two ways:
- Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
- Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.
2) Quickstart — Fedora Toolbx (development)
Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:
toolbox create vllm \
--image docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --security-opt seccomp=unconfined
Enter it:
toolbox enter vllm
Model storage (Toolbx): keep weights outside the toolbox under your HOME so they persist. Recommended path:
mkdir -p ~/vllm-models
Serve a model using the helper script start-vllm (it prints the exact vllm serve command and then runs it). Models download to ~/vllm-models by default; if a model isn't present, it will be fetched from Hugging Face automatically:
start-vllm
# pick a model from the menu; the script prints the serve command and launches it
Defaults: 0.0.0.0:8000 and ~/vllm-models for weights. You can still run vllm serve manually if you prefer.
Toolbx shares HOME by design, so ~/vllm-models stays on the host and survives toolbox updates.
Cache note (Toolbx): vLLM will also write compiled kernels to ~/.cache/vllm/torch_compile_cache/ in your HOME. For example:
du -sh ~/.cache/vllm/torch_compile_cache/
# e.g., 138M /home/you/.cache/vllm/torch_compile_cache/
3) Testing the API
Once the server is up, hit the OpenAI‑compatible endpoint:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
You should receive a JSON response with a choices[0].message.content reply.
To avoid hard-coding the model name, first ask the server which model is currently deployed, then use that ID in the request:
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL\",
\"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
}"
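Large models can take a while to load, so a request sent right after launch may be refused. The sketch below polls the same /v1/models endpoint until it answers. The function name, retry count, and interval are illustrative assumptions, not part of the image.

```shell
# Hypothetical helper: wait until the OpenAI-compatible endpoint responds.
# $1 = base URL, $2 = maximum number of attempts, $3 = seconds between attempts.
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  max_tries="${2:-300}"
  interval="${3:-2}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # -f turns HTTP errors into a non-zero exit; --max-time bounds each attempt.
    if curl -sf --max-time 5 "$base_url/v1/models" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, `wait_for_vllm && curl -s http://localhost:8000/v1/models` runs the query only once the server is reachable.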
4) Quickstart — Podman/Docker
Prefer this for persistent services. Always mount a host directory for weights so they live outside the container. If the model isn't present, vLLM will fetch it from Hugging Face into the mapped directory.
Qwen2.5 7B Instruct
podman run -d --name vllm-qwen2p5-7b \
--ipc=host \
--network host \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--group-add render \
-v ~/vllm-models:/models \
-v ~/.cache/vllm:/root/.cache/vllm \
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
bash -lc 'source /torch-therock/.venv/bin/activate; \
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
vllm serve Qwen/Qwen2.5-7B-Instruct --dtype float16 \
--host 0.0.0.0 --port 8000 --download-dir /models'
Not using --network host? Map a port instead: -p 8000:8000.
For other models, you can try:
Qwen3 30B A3B Instruct (2507)
podman run -d --name vllm-qwen3-30b-a3b \
--ipc=host \
--network host \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--group-add render \
-v ~/vllm-models:/models \
-v ~/.cache/vllm:/root/.cache/vllm \
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
bash -lc 'source /torch-therock/.venv/bin/activate; \
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --dtype float16 \
--host 0.0.0.0 --port 8000 --download-dir /models'
Qwen3 14B AWQ (requires extra flags on ROCm)
podman run -d --name vllm-qwen3-14b-awq \
--ipc=host \
--network host \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--group-add render \
-v ~/vllm-models:/models \
-v ~/.cache/vllm:/root/.cache/vllm \
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
bash -lc 'source /torch-therock/.venv/bin/activate; \
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
vllm serve Qwen/Qwen3-14B-AWQ --quantization awq --dtype float16 --enforce-eager \
--host 0.0.0.0 --port 8000 --download-dir /models'
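The three commands above differ only in the container name, the model, and the extra serve flags. A wrapper along these lines could remove the repetition. It is a sketch: `run_vllm`, `CONTAINER_RUNTIME`, and `DRY_RUN` are hypothetical names introduced here, while the image, mounts, and flags are copied from the examples.

```shell
# Hypothetical wrapper around the podman examples.
# Usage: run_vllm <container-name> <hf-model> [extra vllm serve flags...]
run_vllm() {
  name="$1"; model="$2"; shift 2
  extra_flags="$*"
  # Command executed inside the container; same shape as the examples.
  serve_cmd="source /torch-therock/.venv/bin/activate; TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 vllm serve $model${extra_flags:+ $extra_flags} --host 0.0.0.0 --port 8000 --download-dir /models"
  set -- "${CONTAINER_RUNTIME:-podman}" run -d --name "$name" \
    --ipc=host --network host \
    --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    -v "$HOME/vllm-models:/models" \
    -v "$HOME/.cache/vllm:/root/.cache/vllm" \
    docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
    bash -lc "$serve_cmd"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print for inspection only; shell quoting is not preserved in this output.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

For instance, `run_vllm vllm-qwen3-14b-awq Qwen/Qwen3-14B-AWQ --quantization awq --dtype float16 --enforce-eager` corresponds to the AWQ example, and prefixing the call with `DRY_RUN=1` shows the assembled command without starting a container.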
5) Models, dtypes & storage
- Start with Qwen/Qwen2.5-7B-Instruct; larger models may work but are less forgiving on unified memory.
- Use --dtype float16 unless you have a reason to change.
- Storage discipline:
  - Toolbx: --download-dir ~/vllm-models (lives in your HOME on the host).
  - Podman/Docker: -v ~/vllm-models:/models and --download-dir /models.
6) Performance notes (short)
- The image is built on the PyTorch + AOTriton base; enabling TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 can improve startup/throughput on some models.
- vLLM flags you might tune later: --gpu-memory-utilization, --max-num-seqs, --max-model-len. Start simple; add knobs only if needed.
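One way to follow the "add knobs only if needed" advice is to build the tuning flags from optional environment variables, so an unset variable adds nothing to the command line. The variable names `GPU_MEM_UTIL`, `MAX_NUM_SEQS`, and `MAX_MODEL_LEN` are hypothetical, and the values in the usage note are placeholders rather than tested recommendations.

```shell
# Hypothetical helper: emit only the tuning flags whose variables are set.
tuning_flags() {
  flags=""
  if [ -n "${GPU_MEM_UTIL:-}" ]; then
    flags="$flags --gpu-memory-utilization $GPU_MEM_UTIL"
  fi
  if [ -n "${MAX_NUM_SEQS:-}" ]; then
    flags="$flags --max-num-seqs $MAX_NUM_SEQS"
  fi
  if [ -n "${MAX_MODEL_LEN:-}" ]; then
    flags="$flags --max-model-len $MAX_MODEL_LEN"
  fi
  # Strip the leading space left by the first append.
  printf '%s\n' "${flags# }"
}
```

A call such as `MAX_MODEL_LEN=8192 tuning_flags` prints a single flag that can be appended to a `vllm serve` command; with nothing set, it prints an empty line.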
7) Requirements (host)
Hardware & drivers
- AMD Strix Halo APU (gfx1151).
- Working amdgpu stack with /dev/kfd (ROCm compute) and /dev/dri (graphics).
- Your user in the video and render groups.
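The checklist above can be verified on the host before creating a toolbox or container. This is a minimal sketch with hypothetical function names; it only checks that the device paths exist and that the current user belongs to both groups, not that the driver stack actually works.

```shell
# Returns 0 if group $1 appears in the space-separated list $2.
in_group_list() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical preflight: report missing device nodes and group memberships.
preflight() {
  status=0
  user_groups="$(id -nG)"
  for path in /dev/kfd /dev/dri; do
    if [ -e "$path" ]; then
      echo "ok: $path"
    else
      echo "missing: $path"
      status=1
    fi
  done
  for grp in video render; do
    if in_group_list "$grp" "$user_groups"; then
      echo "ok: group $grp"
    else
      echo "missing: group $grp"
      status=1
    fi
  done
  return "$status"
}
```

Run `preflight` on the host; a non-zero exit status means at least one requirement is not met.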
Unified memory setup (HIGHLY recommended)
Enable large GTT/unified memory so the iGPU can borrow system RAM for bigger models:
- Kernel parameters (append to your GRUB cmdline):
  amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432

  | Parameter | Purpose |
  |---|---|
  | amd_iommu=off | Reduces latency |
  | amdgpu.gttsize=131072 | 128 GiB GTT (unified memory) |
  | ttm.pages_limit=33554432 | Large pinned allocations |
- BIOS: allocate minimal VRAM to the iGPU (e.g., 512 MB) and rely on unified memory.
- Fedora example (GRUB): edit /etc/default/grub → GRUB_CMDLINE_LINUX=... then:
  sudo grub2-mkconfig -o /boot/grub2/grub.cfg
  sudo reboot
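The two size parameters follow from the target amount of unified memory. The sketch below reproduces the values above for 128 GiB, assuming amdgpu.gttsize is given in MiB and ttm.pages_limit counts 4 KiB pages. `gtt_params` is a hypothetical helper; pick a size that fits your installed RAM.

```shell
# Hypothetical helper: derive the kernel parameters for a GTT size in GiB.
# Assumes gttsize is in MiB and pages are 4 KiB.
gtt_params() {
  gib="$1"
  gtt_mib=$((gib * 1024))               # GiB -> MiB
  pages=$((gib * 1024 * 1024 / 4))      # GiB -> KiB, then / 4 KiB per page
  printf 'amdgpu.gttsize=%s ttm.pages_limit=%s\n' "$gtt_mib" "$pages"
}

gtt_params 128
```

After rebooting, `cat /proc/cmdline` shows whether the parameters were picked up.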
Container runtime
- Podman or Docker installed (examples use Podman; replace with Docker if preferred).
8) Contributing
Spotted a fix, a working flag combo, or a model that should be on the list? PRs welcome! Please include:
- Model repo + exact version tag (if any)
- Full vllm serve command/flags that work
- vLLM version, torch & triton versions (python -c "import torch, triton; print(torch.__version__, triton.__version__)")
- Short log snippet of success/failure (especially the first error)
- Any relevant kernel/AOTriton env vars (e.g., TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1)
9) Acknowledgements & Links
- Base images & docs: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton
- Upstreams: vLLM, ROCm/TheRock, AOTriton
- Community: AMD Strix Halo Home Lab Discord — https://discord.gg/pnPRyucNrG
- Big thanks to lhl and ssweens for doing the actual heavy lifting for this.