270 rivejä
12 KiB
Markdown
270 rivejä
12 KiB
Markdown
# AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton)
|
||
|
||
An **Arch-based** Docker/Podman container that is **Toolbx-compatible** (usable as a Fedora toolbox) for serving LLMs with **vLLM** on **AMD Ryzen AI Max “Strix Halo” (gfx1151)**. Built on the PyTorch + AOTriton base to make ROCm on Strix Halo practical for day‑to‑day use.
|
||
|
||
> **Built on:** [https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton](https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton)
|
||
> **Credits:** **lhl** (build tools/scripts), **ssweens** (Arch‑based Dockerfiles), and the **AMD Strix Halo Home Lab Discord** for testing/support.
|
||
|
||
---
|
||
|
||
## ⚠️ Status & Expectations (Experimental)
|
||
|
||
This setup is **highly experimental** on ROCm/Strix Halo. Some models work; **many fail** due to missing custom kernels, unsupported quant types, or TorchInductor/AOTriton limitations on gfx1151. The matrix below lists combinations tested so far. **Please contribute fixes** or additional working recipes (see *Contributing*).
|
||
|
||
---
|
||
|
||
## Tested Models (Experimental Matrix)
|
||
|
||
> **Legend:** ✅ Works (with flags) · ❌ Fails · ⚠️ Notes include the *exact* error/symptom seen.
|
||
|
||
| Model (Hugging Face) | Params / Quant | Status | Required flags (if any) | Notes / Errors |
|
||
| ---------------------------------- | -------------- | -------------------: | ---------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
|
||
| `Qwen/Qwen2.5-7B-Instruct` | 7B FP16 | ✅ Works | (recommended) `--dtype float16` | Good baseline; simple serve works. |
|
||
| `meta-llama/Llama-2-7b-chat-hf` | 7B FP16 | ✅ Works | (recommended) `--dtype float16` | Stable. |
|
||
| `Qwen/Qwen3-30B-A3B-Instruct-2507` | 30B (A3B) FP16 | ✅ Works | (recommended) `--dtype float16` | |
|
||
| `Google/Gemma3-27B-Instruct` | 27B FP16 | ✅ Works | (recommended) `--dtype float16` | Slow |
|
||
| `Google/Gemma3-12B-Instruct` | 12B FP16 | ✅ Works | (recommended) `--dtype float16` | |
|
||
| `Google/Gemma3-4B-Instruct` |4B FP16 | ✅ Works | (recommended) `--dtype float16` | |
|
||
| `Qwen/Qwen3-14B-AWQ` | 14B AWQ | ✅ Works (with flags) | `--quantization awq --dtype float16 --enforce-eager` | On ROCm, eager avoids missing `awq_dequantize` during compile; vLLM auto‑sets `VLLM_USE_TRITON_AWQ`. |
|
||
| `openai/gpt-oss-20b` | 20B MXFP4 | ❌ Fails | — | `ModuleNotFoundError: triton_kernels.matmul_ogs` (MXFP4 path not available in this image). |
|
||
| `zai-org/GLM-4.5-Air-FP8` | FP8 | ❌ Fails | — | `ValueError: type fp8e4nv not supported (only 'fp8e5')`. |
|
||
| `cpatonn/GLM-4.5-Air-AWQ-4bit` | AWQ-4bit (MoE) | ❌ Fails | — | Missing custom op: `torch.ops._C.gptq_marlin_repack` (Marlin kernels). |
|
||
|
||
> If you get a model to work, please PR a new row with: **model name**, **exact flags**, vLLM version, `torch` & `triton` versions, and a note on **gfx1151** driver/kernel stack.
|
||
|
||
---
|
||
|
||
## Table of Contents
|
||
|
||
* [1) Toolbx vs Docker/Podman](#1-toolbx-vs-dockerpodman)
|
||
* [2) Quickstart — Fedora Toolbx (development)](#2-quickstart--fedora-toolbx-development)
|
||
* [3) Testing the API](#3-testing-the-api)
|
||
* [4) Quickstart — Podman/Docker](#4-quickstart--podmandocker)
|
||
* [5) Models, dtypes & storage](#5-models-dtypes--storage)
|
||
* [6) Performance notes (short)](#6-performance-notes-short)
|
||
* [7) Requirements (host)](#7-requirements-host)
|
||
* [8) Acknowledgements & Links](#8-acknowledgements--links)
|
||
* [Tested Models](#tested-models)
|
||
* [Contributing](#contributing)
|
||
|
||
|
||
## 1) Toolbx vs Docker/Podman
|
||
|
||
The `kyuz0/vllm-therock-gfx1151-aotriton:latest` image can be used both as:
|
||
|
||
* **Fedora Toolbx (recommended for development):** Toolbx shares your **HOME** and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
|
||
* **Docker/Podman (recommended for deployment/perf):** Use for running vLLM as a service (host networking, IPC tuning, etc.). Always **mount a host directory** for model weights so they stay outside the container.
|
||
|
||
---
|
||
|
||
## 2) Quickstart — Fedora Toolbx (development)
|
||
|
||
Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:
|
||
|
||
```bash
|
||
toolbox create vllm \
|
||
--image docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
|
||
-- --device /dev/dri --device /dev/kfd \
|
||
--group-add video --group-add render --security-opt seccomp=unconfined
|
||
```
|
||
|
||
Enter it:
|
||
|
||
```bash
|
||
toolbox enter vllm
|
||
```
|
||
|
||
**Model storage (Toolbx):** keep weights **outside** the toolbox under your HOME so they persist. Recommended path:
|
||
|
||
```bash
|
||
mkdir -p ~/vllm-models
|
||
```
|
||
|
||
Serve a model using the helper script **`start-vllm`** (it prints the exact `vllm serve` command and then runs it). Models download to `~/vllm-models` by default; if a model isn't present, it will be fetched from Hugging Face automatically:
|
||
|
||
```bash
|
||
start-vllm
|
||
# pick a model from the menu; the script prints the serve command and launches it
|
||
```
|
||
|
||
> Defaults: `0.0.0.0:8000` and `~/vllm-models` for weights. You can still run `vllm serve` manually if you prefer.
|
||
|
||
> Toolbx shares HOME by design, so `~/vllm-models` stays on the host and survives toolbox updates.
|
||
>
|
||
> **Cache note (Toolbx):** vLLM will also write compiled kernels to `~/.cache/vllm/torch_compile_cache/` in your HOME. For example:
|
||
>
|
||
> ```bash
|
||
> du -sh ~/.cache/vllm/torch_compile_cache/
|
||
> # e.g., 138M /home/you/.cache/vllm/torch_compile_cache/
|
||
> ```
|
||
|
||
---
|
||
|
||
## 3) Testing the API
|
||
|
||
Once the server is up, hit the OpenAI‑compatible endpoint:
|
||
|
||
```bash
|
||
curl -X POST http://localhost:8000/v1/chat/completions \
|
||
-H "Content-Type: application/json" \
|
||
-d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
|
||
```
|
||
|
||
You should receive a JSON response with a `choices[0].message.content` reply.
|
||
|
||
If you don't want to bother specifying the model name, you can run this which will query the currently deployed model:
|
||
|
||
```bash
|
||
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id') curl -X POST http://localhost:8000/v1/chat/completions \
|
||
-H "Content-Type: application/json" \
|
||
-d "{
|
||
\"model\": \"$MODEL\",
|
||
\"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
|
||
}"
|
||
```
|
||
|
||
---
|
||
|
||
## 4) Quickstart — Podman/Docker
|
||
|
||
Prefer this for persistent services. **Always mount a host directory for weights** so they live outside the container. If the model isn't present, vLLM will fetch it from **Hugging Face** into the mapped directory.
|
||
|
||
**Qwen2.5 7B Instruct**
|
||
|
||
```bash
|
||
podman run -d --name vllm-qwen2p5-7b \
|
||
--ipc=host \
|
||
--network host \
|
||
--device /dev/kfd \
|
||
--device /dev/dri \
|
||
--group-add video \
|
||
--group-add render \
|
||
-v ~/vllm-models:/models \
|
||
-v ~/.cache/vllm:/root/.cache/vllm \
|
||
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
|
||
bash -lc 'source /torch-therock/.venv/bin/activate; \
|
||
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
|
||
vllm serve Qwen/Qwen2.5-7B-Instruct --dtype float16 \
|
||
--host 0.0.0.0 --port 8000 --download-dir /models'
|
||
```
|
||
|
||
> Not using `--network host`? Map a port instead: `-p 8000:8000`.
|
||
|
||
For other models, you can try:
|
||
|
||
|
||
**Qwen3 30B A3B Instruct (2507)**
|
||
|
||
```bash
|
||
podman run -d --name vllm-qwen3-30b-a3b \
|
||
--ipc=host \
|
||
--network host \
|
||
--device /dev/kfd \
|
||
--device /dev/dri \
|
||
--group-add video \
|
||
--group-add render \
|
||
-v ~/vllm-models:/models \
|
||
-v ~/.cache/vllm:/root/.cache/vllm \
|
||
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
|
||
bash -lc 'source /torch-therock/.venv/bin/activate; \
|
||
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
|
||
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --dtype float16 \
|
||
--host 0.0.0.0 --port 8000 --download-dir /models'
|
||
```
|
||
|
||
**Qwen3 14B AWQ** *(requires extra flags on ROCm)*
|
||
|
||
```bash
|
||
podman run -d --name vllm-qwen3-14b-awq \
|
||
--ipc=host \
|
||
--network host \
|
||
--device /dev/kfd \
|
||
--device /dev/dri \
|
||
--group-add video \
|
||
--group-add render \
|
||
-v ~/vllm-models:/models \
|
||
-v ~/.cache/vllm:/root/.cache/vllm \
|
||
docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
|
||
bash -lc 'source /torch-therock/.venv/bin/activate; \
|
||
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
|
||
vllm serve Qwen/Qwen3-14B-AWQ --quantization awq --dtype float16 --enforce-eager \
|
||
--host 0.0.0.0 --port 8000 --download-dir /models'
|
||
```
|
||
|
||
---
|
||
|
||
## 5) Models, dtypes & storage
|
||
|
||
* Start with **Qwen/Qwen2.5-7B-Instruct**; larger models may work but are less forgiving on unified memory.
|
||
* Use `--dtype float16` unless you have a reason to change.
|
||
* **Storage discipline:**
|
||
|
||
* **Toolbx:** `--download-dir ~/vllm-models` (lives in your HOME on the host).
|
||
* **Podman/Docker:** `-v ~/vllm-models:/models` and `--download-dir /models`.
|
||
|
||
---
|
||
|
||
## 6) Performance notes (short)
|
||
|
||
* The image is built on the PyTorch + **AOTriton** base; enabling `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` can improve startup/throughput on some models.
|
||
* vLLM flags you might tune later: `--gpu-memory-utilization`, `--max-num-seqs`, `--max-model-len`. Start simple; add knobs only if needed.
|
||
|
||
---
|
||
|
||
## 7) Requirements (host)
|
||
|
||
**Hardware & drivers**
|
||
|
||
* AMD Strix Halo APU (gfx1151).
|
||
* Working amdgpu stack with `/dev/kfd` (ROCm compute) and `/dev/dri` (graphics).
|
||
* Your user in the **video** and **render** groups.
|
||
|
||
**Unified memory setup (HIGHLY recommended)**
|
||
Enable large GTT/unified memory so the iGPU can borrow system RAM for bigger models:
|
||
|
||
1. **Kernel parameters** (append to your GRUB cmdline):
|
||
|
||
```
|
||
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
|
||
```
|
||
|
||
| Parameter | Purpose |
|
||
| -------------------------- | ---------------------------- |
|
||
| `amd_iommu=off` | Reduces latency |
|
||
| `amdgpu.gttsize=131072` | 128 GiB GTT (unified memory) |
|
||
| `ttm.pages_limit=33554432` | Large pinned allocations |
|
||
|
||
2. **BIOS**: allocate **minimal VRAM** to the iGPU (e.g., **512 MB**) and rely on unified memory.
|
||
|
||
3. **Fedora example** (GRUB): edit `/etc/default/grub` → `GRUB_CMDLINE_LINUX=...` then:
|
||
|
||
```bash
|
||
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
|
||
sudo reboot
|
||
```
|
||
|
||
**Container runtime**
|
||
|
||
* Podman or Docker installed (examples use Podman; replace with Docker if preferred).
|
||
|
||
---
|
||
|
||
## 8) Contributing
|
||
|
||
Spotted a fix, a working flag combo, or a model that should be on the list? **PRs welcome!** Please include:
|
||
|
||
* Model repo + exact version tag (if any)
|
||
* Full `vllm serve` command/flags that work
|
||
* vLLM version, `torch` & `triton` versions (`python -c "import torch, triton; print(torch.__version__, triton.__version__)"`)
|
||
* Short log snippet of success/failure (especially the **first** error)
|
||
* Any relevant kernel/AOTriton env vars (e.g., `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`)
|
||
|
||
---
|
||
|
||
## 9) Acknowledgements & Links
|
||
|
||
* Base images & docs: [https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton](https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton)
|
||
* Upstreams: [vLLM](https://github.com/vllm-project/vllm), [ROCm/TheRock](https://github.com/ROCm/TheRock), [AOTriton](https://github.com/ROCm/aotriton)
|
||
* Community: **AMD Strix Halo Home Lab Discord** — [https://discord.gg/pnPRyucNrG](https://discord.gg/pnPRyucNrG)
|
||
* Big thanks to **lhl** and **ssweens** for doing the actual heavy lifting for this.
|