Donato Capitella 7e17fa8660 Added gemma models
2025-09-04 17:20:24 +01:00

AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton)

An Arch-based Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). It is built on the PyTorch + AOTriton base to make ROCm on Strix Halo practical for day-to-day use.

Built on: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton

Credits: lhl (build tools/scripts), ssweens (Arch-based Dockerfiles), and the AMD Strix Halo Home Lab Discord for testing/support.


⚠️ Status & Expectations (Experimental)

This setup is highly experimental on ROCm/Strix Halo. Some models work; many fail due to missing custom kernels, unsupported quant types, or TorchInductor/AOTriton limitations on gfx1151. The matrix below lists combinations tested so far. Please contribute fixes or additional working recipes (see Contributing).


Tested Models (Experimental Matrix)

Legend: Works (recommended) · Works (with flags) · Fails. Notes include the exact error or symptom seen.

| Model (Hugging Face) | Params / Quant | Status | Required flags (if any) | Notes / Errors |
| --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | 7B FP16 | Works (recommended) | --dtype float16 | Good baseline; simple serve works. |
| meta-llama/Llama-2-7b-chat-hf | 7B FP16 | Works (recommended) | --dtype float16 | Stable. |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B (A3B) FP16 | Works (recommended) | --dtype float16 | |
| Google/Gemma3-27B-Instruct | 27B FP16 | Works (recommended) | --dtype float16 | Slow |
| Google/Gemma3-12B-Instruct | 12B FP16 | Works (recommended) | --dtype float16 | Slow |
| Google/Gemma3-4B-Instruct | 4B FP16 | Works (recommended) | --dtype float16 | Slow |
| Qwen/Qwen3-14B-AWQ | 14B AWQ | Works (with flags) | --quantization awq --dtype float16 --enforce-eager | On ROCm, eager mode avoids the missing awq_dequantize op during compile; vLLM auto-sets VLLM_USE_TRITON_AWQ. |
| openai/gpt-oss-20b | 20B MXFP4 | Fails | | ModuleNotFoundError: triton_kernels.matmul_ogs (MXFP4 path not available in this image). |
| zai-org/GLM-4.5-Air-FP8 | FP8 | Fails | | ValueError: type fp8e4nv not supported (only 'fp8e5'). |
| cpatonn/GLM-4.5-Air-AWQ-4bit | AWQ-4bit (MoE) | Fails | | Missing custom op: torch.ops._C.gptq_marlin_repack (Marlin kernels). |

If you get a model to work, please PR a new row with: model name, exact flags, vLLM version, torch & triton versions, and a note on gfx1151 driver/kernel stack.



1) Toolbx vs Docker/Podman

The kyuz0/vllm-therock-gfx1151-aotriton:latest image can be used in two ways:

  • Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean. 
  • Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.

2) Quickstart — Fedora Toolbx (development)

Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:

toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

Enter it:

toolbox enter vllm

Model storage (Toolbx): keep weights outside the toolbox under your HOME so they persist. Recommended path:

mkdir -p ~/vllm-models

Serve a model using the helper script start-vllm (it prints the exact vllm serve command and then runs it). Models download to ~/vllm-models by default; if a model isn't present, it will be fetched from Hugging Face automatically:

start-vllm
# pick a model from the menu; the script prints the serve command and launches it

Defaults: 0.0.0.0:8000 and ~/vllm-models for weights. You can still run vllm serve manually if you prefer.

Toolbx shares HOME by design, so ~/vllm-models stays on the host and survives toolbox updates.

Cache note (Toolbx): vLLM will also write compiled kernels to ~/.cache/vllm/torch_compile_cache/ in your HOME. For example:

du -sh ~/.cache/vllm/torch_compile_cache/
# e.g., 138M  /home/you/.cache/vllm/torch_compile_cache/

3) Testing the API

Once the server is up, hit the OpenAI-compatible endpoint:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'

You should receive a JSON response with a choices[0].message.content reply.

To avoid typing the model name, look up the currently deployed model first and reuse its ID in the request (requires jq):

MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
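Large models can take a while to load, so requests sent too early fail with a connection error. A minimal sketch of a readiness wait, assuming the server answers on /v1/models once it is up; the helper names retry and server_ready and the BASE_URL variable are illustrative, not part of the image:

```shell
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns 0 on the first success,
# 1 if every attempt failed.
retry() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0
    # No need to sleep after the final failed attempt.
    if [ "$i" -lt "$tries" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Readiness probe: the model list endpoint used in the examples above.
# BASE_URL is an assumed variable name; adjust host/port to your setup.
server_ready() {
  curl -sf "${BASE_URL:-http://localhost:8000}/v1/models" >/dev/null
}

# Example: wait up to about 5 minutes (60 tries, 5 s apart).
# retry 60 5 server_ready && echo "vLLM is ready"
```

The probe is passed as a command, so the same retry helper can wrap any other check.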

4) Quickstart — Podman/Docker

Prefer this for persistent services. Always mount a host directory for weights so they live outside the container. If the model isn't present, vLLM will fetch it from Hugging Face into the mapped directory.

Qwen2.5 7B Instruct

podman run -d --name vllm-qwen2p5-7b \
  --ipc=host \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen2.5-7B-Instruct --dtype float16 \
      --host 0.0.0.0 --port 8000 --download-dir /models'

Not using --network host? Map a port instead: -p 8000:8000.

For other models, you can try:

Qwen3 30B A3B Instruct (2507)

podman run -d --name vllm-qwen3-30b-a3b \
  --ipc=host \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --dtype float16 \
      --host 0.0.0.0 --port 8000 --download-dir /models'

Qwen3 14B AWQ (requires extra flags on ROCm)

podman run -d --name vllm-qwen3-14b-awq \
  --ipc=host \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen3-14B-AWQ --quantization awq --dtype float16 --enforce-eager \
      --host 0.0.0.0 --port 8000 --download-dir /models'
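The three commands above differ only in the container name, the model, and any extra vllm serve flags. As a sketch, a small helper can print the command for another model so you can review it before running it. The function name vllm_podman_cmd is hypothetical; everything it prints is copied from the examples above:

```shell
# vllm_podman_cmd NAME MODEL [EXTRA_VLLM_FLAGS...]
# Prints (does not run) a podman command following the pattern above.
vllm_podman_cmd() {
  local name="$1" model="$2"
  shift 2
  # Remaining arguments are appended to `vllm serve` verbatim.
  local extra="$*"
  cat <<EOF
podman run -d --name $name \\
  --ipc=host --network host \\
  --device /dev/kfd --device /dev/dri \\
  --group-add video --group-add render \\
  -v ~/vllm-models:/models \\
  -v ~/.cache/vllm:/root/.cache/vllm \\
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \\
  bash -lc 'source /torch-therock/.venv/bin/activate; \\
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \\
    vllm serve $model --dtype float16 $extra \\
      --host 0.0.0.0 --port 8000 --download-dir /models'
EOF
}

# Example: regenerate the AWQ command from above.
vllm_podman_cmd vllm-qwen3-14b-awq Qwen/Qwen3-14B-AWQ --quantization awq --enforce-eager
```

Review the printed command, then paste it into your shell. Whether a given model needs extra flags is something only testing on gfx1151 can tell.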

5) Models, dtypes & storage

  • Start with Qwen/Qwen2.5-7B-Instruct; larger models may work but are less forgiving on unified memory.

  • Use --dtype float16 unless you have a reason to change.

  • Storage discipline:

    • Toolbx: --download-dir ~/vllm-models (lives in your HOME on the host).
    • Podman/Docker: -v ~/vllm-models:/models and --download-dir /models.
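To judge whether a model is likely to fit, a rough rule of thumb helps: FP16 stores 2 bytes per parameter, so the weights alone need about 2 GB per billion parameters. The sketch below covers only that arithmetic. The helper name fp16_weights_gb is illustrative, and the KV cache and runtime overhead come on top of this figure:

```shell
# fp16_weights_gb PARAMS_BILLIONS: approximate size of FP16 weights in GB.
# FP16 uses 2 bytes per parameter, so N billion parameters need about 2*N GB.
fp16_weights_gb() {
  echo $(( $1 * 2 ))
}

fp16_weights_gb 7    # prints 14 (e.g. a 7B model)
fp16_weights_gb 30   # prints 60 (e.g. a 30B model)
```

The same estimate applies to disk space in ~/vllm-models.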

6) Performance notes (short)

  • The image is built on the PyTorch + AOTriton base; enabling TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 can improve startup/throughput on some models.
  • vLLM flags you might tune later: --gpu-memory-utilization, --max-num-seqs, --max-model-len. Start simple; add knobs only if needed.

7) Requirements (host)

Hardware & drivers

  • AMD Strix Halo APU (gfx1151).
  • Working amdgpu stack with /dev/kfd (ROCm compute) and /dev/dri (graphics).
  • Your user in the video and render groups.
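A quick host-side preflight for the points above can be sketched as follows. The helper names check_nodes and in_groups are hypothetical; they only test that the device nodes exist and that the group names appear in the output of id -nG:

```shell
# check_nodes PATH...: succeed only if every given path exists.
check_nodes() {
  local p rc=0
  for p in "$@"; do
    if [ -e "$p" ]; then
      echo "ok: $p"
    else
      echo "missing: $p" >&2
      rc=1
    fi
  done
  return "$rc"
}

# in_groups "GROUP_LIST" NAME...: succeed only if every NAME appears in the
# space-separated GROUP_LIST (the format printed by `id -nG`).
in_groups() {
  local have=" $1 " g rc=0
  shift
  for g in "$@"; do
    case "$have" in
      *" $g "*) echo "ok: group $g" ;;
      *) echo "not in group: $g" >&2; rc=1 ;;
    esac
  done
  return "$rc"
}

# Typical host check:
# check_nodes /dev/kfd /dev/dri && in_groups "$(id -nG)" video render
```

Passing this check does not prove ROCm works; it only rules out the most common permission and device-node problems.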

Unified memory setup (HIGHLY recommended)

Enable large GTT/unified memory so the iGPU can borrow system RAM for bigger models:

  1. Kernel parameters (append to your GRUB cmdline):

    amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
    
    | Parameter | Purpose |
    | --- | --- |
    | amd_iommu=off | Reduces latency |
    | amdgpu.gttsize=131072 | 128 GiB GTT (unified memory) |
    | ttm.pages_limit=33554432 | Large pinned allocations |

  2. BIOS: allocate minimal VRAM to the iGPU (e.g., 512 MB) and rely on unified memory.

  3. Fedora example (GRUB): edit /etc/default/grub, add the parameters to GRUB_CMDLINE_LINUX=..., then:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    sudo reboot
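
The two memory parameters above describe the same 128 GiB budget in different units: amdgpu.gttsize is given in MiB and ttm.pages_limit is a page count. A sketch for deriving both values for a different budget, assuming 4 KiB pages (the helper name gtt_params is illustrative):

```shell
# gtt_params GIB: print kernel parameters for a GTT budget of GIB gibibytes.
# amdgpu.gttsize is expressed in MiB; ttm.pages_limit is a page count
# (4 KiB pages assumed).
gtt_params() {
  local gib="$1"
  local mib=$(( gib * 1024 ))
  local pages=$(( gib * 1024 * 1024 / 4 ))   # GiB -> KiB, then 4 KiB per page
  echo "amdgpu.gttsize=$mib ttm.pages_limit=$pages"
}

gtt_params 128
# prints: amdgpu.gttsize=131072 ttm.pages_limit=33554432
```

After rebooting, cat /proc/cmdline shows whether the parameters were actually applied.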
    

Container runtime

  • Podman or Docker installed (examples use Podman; replace with Docker if preferred).

8) Contributing

Spotted a fix, a working flag combo, or a model that should be on the list? PRs welcome! Please include:

  • Model repo + exact version tag (if any)
  • Full vllm serve command/flags that work
  • vLLM version, torch & triton versions (python -c "import torch, triton; print(torch.__version__, triton.__version__)")
  • Short log snippet of success/failure (especially the first error)
  • Any relevant kernel/AOTriton env vars (e.g., TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1)
