AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton)
An Arch-based Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on the PyTorch + AOTriton base to make ROCm on Strix Halo practical for day‑to‑day use.
Built on: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton
Credits: lhl (build tools/scripts), ssweens (Arch‑based Dockerfiles), and the AMD Strix Halo Home Lab Discord for testing/support.
⚠️ Status & Expectations (Experimental)
This setup is highly experimental on ROCm/Strix Halo. Some models work; many fail due to missing custom kernels, unsupported quant types, or TorchInductor/AOTriton limitations on gfx1151. The matrix below lists combinations tested so far. Please contribute fixes or additional working recipes (see Contributing).
Tested Models (Experimental Matrix)
Legend: ✅ Works (with flags) · ❌ Fails · ⚠️ Notes include the exact error/symptom seen.
| Model (Hugging Face) | Params / Quant | Status | Required flags (if any) | Notes / Errors |
|---|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 7B FP16 | ✅ Works | (recommended) --dtype float16 | Good baseline; simple serve works. |
| meta-llama/Llama-2-7b-chat-hf | 7B FP16 | ✅ Works | (recommended) --dtype float16 | Stable. |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B (A3B) FP16 | ✅ Works | (recommended) --dtype float16 | Heavy; ensure unified memory tweaks. |
| Qwen/Qwen3-14B-AWQ | 14B AWQ | ✅ Works (with flags) | --quantization awq --dtype float16 --enforce-eager | On ROCm, eager avoids missing awq_dequantize during compile; vLLM auto‑sets VLLM_USE_TRITON_AWQ. |
| openai/gpt-oss-20b | 20B MXFP4 | ❌ Fails | — | ModuleNotFoundError: triton_kernels.matmul_ogs (MXFP4 path not available in this image). |
| zai-org/GLM-4.5-Air-FP8 | FP8 | ❌ Fails | — | ValueError: type fp8e4nv not supported (only 'fp8e5'). |
| cpatonn/GLM-4.5-Air-AWQ-4bit | AWQ-4bit (MoE) | ❌ Fails | — | Missing custom op: torch.ops._C.gptq_marlin_repack (Marlin kernels). |
If you get a model to work, please open a PR adding a new row with: model name, exact flags, vLLM version, torch & triton versions, and a note on the gfx1151 driver/kernel stack.
1) Toolbx vs Docker/Podman
The kyuz0/vllm-therock-gfx1151-aotriton image can be used as either:
- Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
- Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.
2) Quickstart — Fedora Toolbx (development)
Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:
toolbox create vllm \
--image docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --security-opt seccomp=unconfined
Enter it:
toolbox enter vllm
Model storage (Toolbx): keep weights outside the toolbox under your HOME so they persist. Recommended path:
mkdir -p ~/vllm-models
Serve a model with vLLM (downloads to ~/vllm-models; if the model isn't present, it will be fetched from Hugging Face automatically):
vllm serve Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 --port 8000 \
--download-dir ~/vllm-models
Toolbx shares HOME by design, so ~/vllm-models stays on the host and survives toolbox updates.
Cache note (Toolbx): vLLM will also write compiled kernels to ~/.cache/vllm/torch_compile_cache/ in your HOME. For example:
du -sh ~/.cache/vllm/torch_compile_cache/
# e.g., 138M /home/you/.cache/vllm/torch_compile_cache/
3) Testing the API
Once the server is up, hit the OpenAI‑compatible endpoint:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
You should receive a JSON response with a choices[0].message.content reply.
To avoid hard-coding the model name, first ask the server which model is currently deployed (requires jq) and reuse its ID:
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL\",
\"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
}"
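Model loading and kernel compilation can take a while, so requests sent too early are simply refused. A minimal sketch of a readiness check, assuming the server listens on the default address used above; the function name wait_for_vllm and its attempt/delay defaults are illustrative and not part of vLLM:

```shell
# wait_for_vllm [BASE_URL] [ATTEMPTS] [DELAY_SECONDS]
# Polls the /v1/models endpoint until it answers successfully.
# Returns 0 once the server responds, 1 if all attempts fail.
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  attempts="${2:-60}"   # illustrative default
  delay="${3:-5}"       # illustrative default, in seconds
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: treat HTTP errors as failures
    if curl -sf "$base_url/v1/models" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Call it before the curl requests above, for example `wait_for_vllm && echo "server ready"`.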
4) Quickstart — Podman/Docker
Prefer this for persistent services. Always mount a host directory for weights so they live outside the container. If the model isn't present, vLLM will fetch it from Hugging Face into the mapped directory.
podman run \
-d \
--name vllm \
--network host \
--device /dev/kfd \
--device /dev/dri \
--group-add video \
--group-add render \
-v ~/vllm-models:/models \
-v ~/.cache/vllm:/root/.cache/vllm \
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
bash -lc 'source /torch-therock/.venv/bin/activate; \
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
vllm serve Qwen/Qwen2.5-7B-Instruct --dtype float16 \
--host 0.0.0.0 --port 8000 --download-dir /models'
Not using --network host? Map a port instead: -p 8000:8000.
5) Models, dtypes & storage
- Start with Qwen/Qwen2.5-7B-Instruct; larger models may work but are less forgiving on unified memory.
- Use --dtype float16 unless you have a reason to change.
- Storage discipline:
  - Toolbx: --download-dir ~/vllm-models (lives in your HOME on the host).
  - Podman/Docker: -v ~/vllm-models:/models and --download-dir /models.
6) Performance notes (short)
- The image is built on the PyTorch + AOTriton base; enabling TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 can improve startup/throughput on some models.
- vLLM flags you might tune later: --gpu-memory-utilization, --max-num-seqs, --max-model-len. Start simple; add knobs only if needed.
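As a rough sketch of how those three flags fit into a serve command: the wrapper below is hypothetical (serve_tuned and its environment variables are not part of vLLM), and the default values are placeholders rather than tested recommendations for gfx1151. Set DRY_RUN=1 to print the command instead of running it.

```shell
# serve_tuned: build a vllm serve command with the tuning flags named above.
# All defaults are illustrative; override them through the environment.
serve_tuned() {
  set -- vllm serve "${MODEL:-Qwen/Qwen2.5-7B-Instruct}" \
    --dtype float16 \
    --gpu-memory-utilization "${GPU_MEM_UTIL:-0.90}" \
    --max-num-seqs "${MAX_NUM_SEQS:-8}" \
    --max-model-len "${MAX_MODEL_LEN:-8192}" \
    --host 0.0.0.0 --port 8000 \
    --download-dir "${DOWNLOAD_DIR:-$HOME/vllm-models}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # show the command without starting the server
  else
    "$@"        # run vllm with the assembled arguments
  fi
}
```

For example, `DRY_RUN=1 MAX_MODEL_LEN=4096 serve_tuned` prints the full command with a shorter context length, which is a cheap way to review the flags before a long model load.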
7) Requirements (host)
Hardware & drivers
- AMD Strix Halo APU (gfx1151).
- Working amdgpu stack with /dev/kfd (ROCm compute) and /dev/dri (graphics).
- Your user in the video and render groups.
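These prerequisites can be checked from a host shell before pulling the image. The following is a minimal sketch; the function names has_group and check_strix_host are made up for illustration, and the check only covers the device nodes and groups listed above, not driver versions.

```shell
# has_group NAME "LIST": succeed if NAME appears in the space-separated LIST.
has_group() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_strix_host: report missing device nodes or group memberships.
# Returns 0 only if both device nodes exist and the current user
# session is in both groups.
check_strix_host() {
  status=0
  groups_now="$(id -nG)"
  for dev in /dev/kfd /dev/dri; do
    if [ -e "$dev" ]; then
      echo "ok: $dev"
    else
      echo "missing: $dev"
      status=1
    fi
  done
  for grp in video render; do
    if has_group "$grp" "$groups_now"; then
      echo "ok: group $grp"
    else
      echo "missing: group $grp"
      status=1
    fi
  done
  return "$status"
}
```

Run `check_strix_host` on the host; note that a freshly added group membership usually only shows up after logging out and back in.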
Unified memory setup (HIGHLY recommended)
Enable large GTT/unified memory so the iGPU can borrow system RAM for bigger models:
- Kernel parameters (append to your GRUB cmdline):
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432

| Parameter | Purpose |
|---|---|
| amd_iommu=off | Reduces latency |
| amdgpu.gttsize=131072 | 128 GiB GTT (unified memory) |
| ttm.pages_limit=33554432 | Large pinned allocations |

- BIOS: allocate minimal VRAM to the iGPU (e.g., 512 MB) and rely on unified memory.
- Fedora example (GRUB): edit /etc/default/grub → GRUB_CMDLINE_LINUX=... then:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
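The two size parameters describe the same 128 GiB budget in different units: amdgpu.gttsize is given in MiB, while ttm.pages_limit is a page count. The helpers below are a small sketch for deriving both numbers from a target size in GiB; the function names are hypothetical and the 4096-byte page size is an assumption (confirm with `getconf PAGESIZE`).

```shell
# GiB -> MiB, the unit used by amdgpu.gttsize.
gib_to_gtt_mib() {
  echo $(( $1 * 1024 ))
}

# GiB -> page count for ttm.pages_limit.
# Second argument is the page size in bytes (assumed 4096 if omitted).
gib_to_ttm_pages() {
  echo $(( $1 * 1024 * 1024 * 1024 / ${2:-4096} ))
}
```

`gib_to_gtt_mib 128` prints 131072 and `gib_to_ttm_pages 128` prints 33554432, matching the values above. For a hypothetical 96 GiB budget the same helpers give 98304 and 25165824.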
Container runtime
- Podman or Docker installed (examples use Podman; replace with Docker if preferred).
8) Contributing
Spotted a fix, a working flag combo, or a model that should be on the list? PRs welcome! Please include:
- Model repo + exact version tag (if any)
- Full vllm serve command/flags that work
- vLLM version, torch & triton versions (python -c "import torch, triton; print(torch.__version__, triton.__version__)")
- Short log snippet of success/failure (especially the first error)
- Any relevant kernel/AOTriton env vars (e.g., TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1)
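Most of the version details requested above can be gathered in one pass from inside the container. This is only a sketch: collect_report is a made-up helper name, it assumes `python` resolves to the interpreter where vLLM is installed, and it does not capture the serve command or log snippet.

```shell
# collect_report: print kernel, vLLM, torch and triton versions plus the
# AOTriton env var, falling back to a note when a package cannot be imported.
collect_report() {
  echo "kernel: $(uname -r)"
  python -c "import vllm; print('vllm:', vllm.__version__)" 2>/dev/null \
    || echo "vllm: not importable"
  python -c "import torch, triton; print('torch:', torch.__version__, 'triton:', triton.__version__)" 2>/dev/null \
    || echo "torch/triton: not importable"
  echo "TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=${TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL:-unset}"
}
```

Paste the output of `collect_report` into the PR description alongside the exact serve command and the first error from the log.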
9) Acknowledgements & Links
- Base images & docs: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton
- Upstreams: vLLM, ROCm/TheRock, AOTriton
- Community: AMD Strix Halo Home Lab Discord — https://discord.gg/pnPRyucNrG
- Big thanks to lhl and ssweens for doing the actual heavy lifting for this.