Donato Capitella 8ee405f07e Fixed Docker/Podman commands

2025-09-04 15:02:00 +01:00

AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton)

An Arch-based Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on the PyTorch + AOTriton base to make ROCm on Strix Halo practical for day‑to‑day use.

Built on: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton

Credits: lhl (build tools/scripts), ssweens (Arch‑based Dockerfiles), and the AMD Strix Halo Home Lab Discord for testing/support.

⚠️ Status & Expectations (Experimental)

This setup is highly experimental on ROCm/Strix Halo. Some models work; many fail due to missing custom kernels, unsupported quant types, or TorchInductor/AOTriton limitations on gfx1151. The matrix below lists combinations tested so far. Please contribute fixes or additional working recipes (see Contributing).

Tested Models (Experimental Matrix)

Legend: ✅ Works (some models need extra flags) · ❌ Fails. For failures, the Notes column gives the exact error/symptom seen.

Model (Hugging Face)	Params / Quant	Status	Required flags (if any)	Notes / Errors
`Qwen/Qwen2.5-7B-Instruct`	7B FP16	✅ Works	(recommended) `--dtype float16`	Good baseline; simple serve works.
`meta-llama/Llama-2-7b-chat-hf`	7B FP16	✅ Works	(recommended) `--dtype float16`	Stable.
`Qwen/Qwen3-30B-A3B-Instruct-2507`	30B (A3B) FP16	✅ Works	(recommended) `--dtype float16`	Heavy; apply the unified memory setup from Requirements (host) first.
`Qwen/Qwen3-14B-AWQ`	14B AWQ	✅ Works (with flags)	`--quantization awq --dtype float16 --enforce-eager`	On ROCm, eager avoids missing `awq_dequantize` during compile; vLLM auto‑sets `VLLM_USE_TRITON_AWQ`.
`openai/gpt-oss-20b`	20B MXFP4	❌ Fails	—	`ModuleNotFoundError: triton_kernels.matmul_ogs` (MXFP4 path not available in this image).
`zai-org/GLM-4.5-Air-FP8`	FP8	❌ Fails	—	`ValueError: type fp8e4nv not supported (only 'fp8e5')`.
`cpatonn/GLM-4.5-Air-AWQ-4bit`	AWQ-4bit (MoE)	❌ Fails	—	Missing custom op: `torch.ops._C.gptq_marlin_repack` (Marlin kernels).

If you get a model to work, please PR a new row with: model name, exact flags, vLLM version, torch & triton versions, and a note on gfx1151 driver/kernel stack.

1) Toolbx vs Docker/Podman
2) Quickstart — Fedora Toolbx (development)
3) Testing the API
4) Quickstart — Podman/Docker
5) Models, dtypes & storage
6) Performance notes (short)
7) Requirements (host)
8) Contributing
9) Acknowledgements & Links
Tested Models

1) Toolbx vs Docker/Podman

The kyuz0/vllm-therock-gfx1151-aotriton:latest image can be used in two ways:

Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.

2) Quickstart — Fedora Toolbx (development)

Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:

toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

Enter it:

toolbox enter vllm
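
Before serving anything, it can help to confirm that PyTorch actually sees the GPU from inside the toolbox. The sketch below is not part of the image: gpu_visible is a hypothetical helper, and it assumes that python on your PATH is the interpreter that has torch installed (the Podman examples activate /torch-therock/.venv for this). ROCm builds of PyTorch expose the GPU through the torch.cuda API, so torch.cuda.is_available() is the call to check.

```shell
# Hypothetical helper (not shipped with the image).
# Exits 0 if PyTorch reports a usable GPU, non-zero otherwise
# (including when python or torch cannot be found).
gpu_visible() {
  python -c 'import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)'
}
```

Usage: run `gpu_visible && echo "GPU visible" || echo "GPU not visible"` inside the toolbox. A "not visible" result usually points back at the --device flags or group membership used when the toolbox was created.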

Model storage (Toolbx): keep weights outside the toolbox under your HOME so they persist. Recommended path:

mkdir -p ~/vllm-models

Serve a model using the helper script start-vllm (it prints the exact vllm serve command and then runs it). Models download to ~/vllm-models by default; if a model isn't present, it will be fetched from Hugging Face automatically:

start-vllm
# pick a model from the menu; the script prints the serve command and launches it

Defaults: 0.0.0.0:8000 and ~/vllm-models for weights. You can still run vllm serve manually if you prefer.

Toolbx shares HOME by design, so ~/vllm-models stays on the host and survives toolbox updates.

Cache note (Toolbx): vLLM will also write compiled kernels to ~/.cache/vllm/torch_compile_cache/ in your HOME. For example:
du -sh ~/.cache/vllm/torch_compile_cache/
# e.g., 138M  /home/you/.cache/vllm/torch_compile_cache/

3) Testing the API

Once the server is up, hit the OpenAI‑compatible endpoint:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'

You should receive a JSON response with a choices[0].message.content reply.

To avoid typing the model name, first query the server for the currently deployed model, then reuse its id in the request:

MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
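
Large models can take a while to load, so requests sent right after launch may fail simply because the server is not listening yet. The sketch below polls the /v1/models endpoint used above until it answers. wait_for_vllm is a hypothetical helper, and the retry count and delay are arbitrary defaults, not values recommended by vLLM.

```shell
# Hypothetical helper: poll the models endpoint until the server answers
# or the retry budget runs out.
wait_for_vllm() {
  url="${1:-http://localhost:8000/v1/models}"
  tries="${2:-60}"   # number of attempts (assumed default)
  delay="${3:-5}"    # seconds between attempts (assumed default)
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors; -s keeps it quiet.
    if curl -sf -o /dev/null "$url"; then
      echo "server is up after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "server did not respond after $tries attempts" >&2
  return 1
}
```

Usage: `wait_for_vllm && curl -s http://localhost:8000/v1/models`. Pass a different URL as the first argument if you changed the port.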

4) Quickstart — Podman/Docker

Prefer this for persistent services. Always mount a host directory for weights so they live outside the container. If the model isn't present, vLLM will fetch it from Hugging Face into the mapped directory.

Qwen2.5 7B Instruct

podman run -d --name vllm-qwen2p5-7b \
  --ipc=host \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen2.5-7B-Instruct --dtype float16 \
      --host 0.0.0.0 --port 8000 --download-dir /models'

Not using --network host? Map a port instead: -p 8000:8000.

For other models, you can try:

Qwen3 30B A3B Instruct (2507)

podman run -d --name vllm-qwen3-30b-a3b \
  --ipc=host \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --dtype float16 \
      --host 0.0.0.0 --port 8000 --download-dir /models'

Qwen3 14B AWQ (requires extra flags on ROCm)

podman run -d --name vllm-qwen3-14b-awq \
  --ipc=host \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen3-14B-AWQ --quantization awq --dtype float16 --enforce-eager \
      --host 0.0.0.0 --port 8000 --download-dir /models'
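
The three commands above differ only in the container name, the model, and any extra vLLM flags. A small wrapper can make that explicit. The sketch below only prints the command so you can inspect it before running it. vllm_podman_cmd is a hypothetical helper, not something the image provides, and it hard-codes the same mounts, devices and --dtype float16 used in the examples above.

```shell
# Hypothetical helper: print (do not run) the podman command for one model.
# Usage: vllm_podman_cmd <container-name> <hf-model> [extra vllm flags...]
vllm_podman_cmd() {
  name="$1"; model="$2"; shift 2
  extra="$*"   # e.g. "--quantization awq --enforce-eager"
  # Unquoted heredoc: $name/$model/$extra expand, "\\" prints one backslash,
  # and "~" stays literal until the printed command is executed.
  cat <<EOF
podman run -d --name $name \\
  --ipc=host --network host \\
  --device /dev/kfd --device /dev/dri \\
  --group-add video --group-add render \\
  -v ~/vllm-models:/models \\
  -v ~/.cache/vllm:/root/.cache/vllm \\
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \\
  bash -lc 'source /torch-therock/.venv/bin/activate; TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 vllm serve $model --dtype float16 $extra --host 0.0.0.0 --port 8000 --download-dir /models'
EOF
}

# Example: reproduce the AWQ command from above.
vllm_podman_cmd vllm-qwen3-14b-awq Qwen/Qwen3-14B-AWQ --quantization awq --enforce-eager
```

Once the printed command looks right, it can be piped to sh. Printing first keeps the flags visible, which matters on a setup where small flag differences decide whether a model loads.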

5) Models, dtypes & storage

Start with Qwen/Qwen2.5-7B-Instruct; larger models may work but are less forgiving on unified memory.
Use --dtype float16 unless you have a reason to change.
Storage discipline:
- Toolbx: --download-dir ~/vllm-models (lives in your HOME on the host).
- Podman/Docker: -v ~/vllm-models:/models and --download-dir /models.
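
To get a feel for why larger models are less forgiving, a rough estimate of the weight footprint is parameters × bytes per parameter. The sketch below does that arithmetic for float16 (2 bytes per parameter). weights_gib is a hypothetical helper, the parameter counts are the nominal "7B" and "30B" figures, and the result covers weights only: KV cache, activations and runtime overhead come on top.

```shell
# Hypothetical helper: approximate weight size in GiB (integer, rounded down).
# $1: parameter count in billions, $2: bytes per parameter (2 for float16)
weights_gib() {
  params_b="$1"
  bytes_per_param="$2"
  echo $(( params_b * 1000000000 * bytes_per_param / 1073741824 ))
}

echo "7B  float16: ~$(weights_gib 7 2) GiB"
echo "30B float16: ~$(weights_gib 30 2) GiB"
```

This gives roughly 13 GiB for a 7B model and 55 GiB for a 30B model at float16, which is why the 30B row in the matrix depends on the unified memory setup.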

6) Performance notes (short)

The image is built on the PyTorch + AOTriton base; enabling TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 can improve startup/throughput on some models.
vLLM flags you might tune later: --gpu-memory-utilization, --max-num-seqs, --max-model-len. Start simple; add knobs only if needed.
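
As an illustration of where those flags go, the sketch below assembles a vllm serve command with all three set. vllm_tuned_cmd is a hypothetical helper, and the default values (0.90, 16, 8192) are placeholders chosen for the example, not tuned recommendations for Strix Halo.

```shell
# Hypothetical helper: print a vllm serve command with the three tuning flags.
# Override the placeholder defaults through environment variables.
vllm_tuned_cmd() {
  model="${1:-Qwen/Qwen2.5-7B-Instruct}"
  echo "vllm serve $model --dtype float16" \
    "--gpu-memory-utilization ${GPU_MEM_UTIL:-0.90}" \
    "--max-num-seqs ${MAX_NUM_SEQS:-16}" \
    "--max-model-len ${MAX_MODEL_LEN:-8192}" \
    "--host 0.0.0.0 --port 8000 --download-dir ~/vllm-models"
}

vllm_tuned_cmd
```

Lowering --max-model-len or --max-num-seqs is usually the first thing to try when a model loads but runs out of memory under load. Change one value at a time so you can tell which one helped.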

7) Requirements (host)

Hardware & drivers

AMD Strix Halo APU (gfx1151).
Working amdgpu stack with /dev/kfd (ROCm compute) and /dev/dri (graphics).
Your user in the video and render groups.
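
The three requirements above can be checked quickly from a host shell. The following is a minimal preflight sketch, with check_node and in_group as hypothetical helper names. It only reports what it finds and changes nothing.

```shell
# Hypothetical helpers for a host-side preflight check.
check_node() {
  # Succeeds if the path exists as a character device or a directory.
  [ -c "$1" ] || [ -d "$1" ]
}

in_group() {
  # Succeeds if the current user is a member of the named group.
  id -nG | tr ' ' '\n' | grep -qxF "$1"
}

for node in /dev/kfd /dev/dri; do
  if check_node "$node"; then echo "ok: $node"; else echo "missing: $node"; fi
done

for grp in video render; do
  if in_group "$grp"; then echo "ok: group $grp"; else echo "not in group: $grp"; fi
done
```

If a group is reported as missing, `sudo usermod -aG video,render "$USER"` followed by logging out and back in is the usual fix.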

Unified memory setup (HIGHLY recommended)

Enable large GTT/unified memory so the iGPU can borrow system RAM for bigger models:

Kernel parameters (append to your GRUB cmdline):
```
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
```

Parameter	Purpose
`amd_iommu=off`	Reduces latency
`amdgpu.gttsize=131072`	128 GiB GTT (unified memory)
`ttm.pages_limit=33554432`	Large pinned allocations

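
The two size values are the same 128 GiB expressed in different units. The sketch below derives them for an arbitrary target, assuming amdgpu.gttsize is given in MiB and ttm.pages_limit counts 4 KiB pages. Both assumptions match the numbers above. gtt_params is a hypothetical helper name.

```shell
# Hypothetical helper: derive both kernel parameters from a target size in GiB.
# Assumes gttsize is in MiB and pages are 4 KiB.
gtt_params() {
  gib="${1:-128}"
  gtt_mib=$(( gib * 1024 ))            # GiB -> MiB
  pages=$(( gib * 1024 * 1024 / 4 ))   # GiB -> KiB, divided by 4 KiB per page
  echo "amdgpu.gttsize=$gtt_mib ttm.pages_limit=$pages"
}

gtt_params 128
```

For 128 GiB this reproduces the values above. On a machine with less RAM, pick a smaller target that leaves headroom for the host, for example `gtt_params 96`.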
BIOS: allocate minimal VRAM to the iGPU (e.g., 512 MB) and rely on unified memory.
Fedora example (GRUB): edit /etc/default/grub → GRUB_CMDLINE_LINUX=... then:
```
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
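
After the reboot, it is worth confirming that the parameters actually reached the kernel. The sketch below checks /proc/cmdline for each one. has_param and check_cmdline are hypothetical helper names, and the parameter list is the one from this section.

```shell
# Hypothetical helpers: verify the kernel command line after reboot.
has_param() {
  # $1: full command line, $2: exact parameter (key=value), matched as a whole word.
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

check_cmdline() {
  # Uses the running kernel's command line unless one is passed in.
  cmdline="${1:-$(cat /proc/cmdline)}"
  missing=0
  for p in amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432; do
    if has_param "$cmdline" "$p"; then
      echo "ok: $p"
    else
      echo "missing: $p"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: run `check_cmdline` on the host. If a parameter is reported missing, re-check /etc/default/grub and regenerate the GRUB config.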

Container runtime

Podman or Docker installed (examples use Podman; replace with Docker if preferred).

8) Contributing

Spotted a fix, a working flag combo, or a model that should be on the list? PRs welcome! Please include:

Model repo + exact version tag (if any)
Full vllm serve command/flags that work
vLLM, torch & triton versions (python -c "import vllm, torch, triton; print(vllm.__version__, torch.__version__, triton.__version__)")
Short log snippet of success/failure (especially the first error)
Any relevant kernel/AOTriton env vars (e.g., TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1)

9) Acknowledgements & Links

Base images & docs: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton
Upstreams: vLLM, ROCm/TheRock, AOTriton
Community: AMD Strix Halo Home Lab Discord — https://discord.gg/pnPRyucNrG
Big thanks to lhl and ssweens for doing the actual heavy lifting for this.
