Donato Capitella e9460b20ad updated with set of working models

2025-09-04 13:33:53 +01:00

AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton)

An Arch-based Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on the PyTorch + AOTriton base to make ROCm on Strix Halo practical for day‑to‑day use.

Built on: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton

Credits: lhl (build tools/scripts), ssweens (Arch‑based Dockerfiles), and the AMD Strix Halo Home Lab Discord for testing/support.

⚠️ Status & Expectations (Experimental)

This setup is highly experimental on ROCm/Strix Halo. Some models work; many fail due to missing custom kernels, unsupported quant types, or TorchInductor/AOTriton limitations on gfx1151. The matrix below lists combinations tested so far. Please contribute fixes or additional working recipes (see Contributing).

Tested Models (Experimental Matrix)

Legend: ✅ Works (with flags) · ❌ Fails · ⚠️ Notes include the exact error/symptom seen.

| Model (Hugging Face) | Params / Quant | Status | Required flags (if any) | Notes / Errors |
|---|---|---|---|---|
| `Qwen/Qwen2.5-7B-Instruct` | 7B FP16 | ✅ Works | (recommended) `--dtype float16` | Good baseline; simple serve works. |
| `meta-llama/Llama-2-7b-chat-hf` | 7B FP16 | ✅ Works | (recommended) `--dtype float16` | Stable. |
| `Qwen/Qwen3-30B-A3B-Instruct-2507` | 30B (A3B) FP16 | ✅ Works | (recommended) `--dtype float16` | Heavy; ensure unified memory tweaks. |
| `Qwen/Qwen3-14B-AWQ` | 14B AWQ | ✅ Works (with flags) | `--quantization awq --dtype float16 --enforce-eager` | On ROCm, eager avoids missing `awq_dequantize` during compile; vLLM auto‑sets `VLLM_USE_TRITON_AWQ`. |
| `openai/gpt-oss-20b` | 20B MXFP4 | ❌ Fails | none | `ModuleNotFoundError: triton_kernels.matmul_ogs` (MXFP4 path not available in this image). |
| `zai-org/GLM-4.5-Air-FP8` | FP8 | ❌ Fails | none | `ValueError: type fp8e4nv not supported (only 'fp8e5')`. |
| `cpatonn/GLM-4.5-Air-AWQ-4bit` | AWQ-4bit (MoE) | ❌ Fails | none | Missing custom op: `torch.ops._C.gptq_marlin_repack` (Marlin kernels). |

If you get a model to work, please PR a new row with: model name, exact flags, vLLM version, torch & triton versions, and a note on gfx1151 driver/kernel stack.
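
As a convenience, the working rows of the matrix can be encoded in a small lookup. The sketch below is only an illustration: `flags_for_model` is a hypothetical helper name, not part of this repository or of vLLM, and it prints the resulting command instead of running it.

```shell
# Hypothetical helper: map a model ID to the extra `vllm serve` flags
# recorded in the matrix above. Only rows marked as working are included.
flags_for_model() {
  case "$1" in
    Qwen/Qwen3-14B-AWQ)
      echo "--quantization awq --dtype float16 --enforce-eager" ;;
    Qwen/Qwen2.5-7B-Instruct|meta-llama/Llama-2-7b-chat-hf|Qwen/Qwen3-30B-A3B-Instruct-2507)
      echo "--dtype float16" ;;
    *)
      echo "no known-good flags for $1" >&2
      return 1 ;;
  esac
}

# Print the command for review instead of running it.
model="Qwen/Qwen3-14B-AWQ"
echo "vllm serve $model $(flags_for_model "$model")"
```

Models listed as failing return a non-zero status, so a wrapper script can stop early instead of starting a server that is known to crash.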

1) Toolbx vs Docker/Podman

The `docker.io/kyuz0/vllm-therock-gfx1151-aotriton` image can be used in two ways:

- Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
- Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.

2) Quickstart — Fedora Toolbx (development)

Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:

```
toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined
```

Enter it:

```
toolbox enter vllm
```

Model storage (Toolbx): keep weights outside the toolbox under your HOME so they persist. Recommended path:

```
mkdir -p ~/vllm-models
```

Serve a model with vLLM. Weights are stored in `~/vllm-models`; if the model isn't there yet, vLLM fetches it from Hugging Face automatically:

```
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --download-dir ~/vllm-models
```

Toolbx shares HOME by design, so ~/vllm-models stays on the host and survives toolbox updates.

Cache note (Toolbx): vLLM also writes compiled kernels to `~/.cache/vllm/torch_compile_cache/` in your HOME. To check how much space the cache uses:

```
du -sh ~/.cache/vllm/torch_compile_cache/
# e.g., 138M  /home/you/.cache/vllm/torch_compile_cache/
```

3) Testing the API

Once the server is up, hit the OpenAI‑compatible endpoint:

```
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
```

You should receive a JSON response with the reply in `choices[0].message.content`.

To avoid hard-coding the model name, query the server for the currently deployed model first and reuse its ID in the request:

```
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
```
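
Both requests assume the server has finished loading the model, which can take a while on first start. A minimal sketch of a readiness check follows; it polls the same `/v1/models` endpoint used above. The function name `wait_for_vllm` and its default attempt count and delay are assumptions chosen for illustration.

```shell
# Poll the model list endpoint until it answers or the attempts run out.
# Usage: wait_for_vllm [base_url] [attempts] [delay_seconds]
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each try.
    if curl -sf --max-time 5 "$base_url/v1/models" >/dev/null 2>&1; then
      echo "vLLM is ready at $base_url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "vLLM did not answer at $base_url after $attempts attempts" >&2
  return 1
}
```

A typical use would be `wait_for_vllm && curl ...`, so the test request is only sent once the endpoint responds.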

4) Quickstart — Podman/Docker

Prefer this for persistent services. Always mount a host directory for weights so they live outside the container. If the model isn't present, vLLM will fetch it from Hugging Face into the mapped directory.

```
podman run \
  -d \
  --name vllm \
  --network host \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add video \
  --group-add render \
  -v ~/vllm-models:/models \
  -v ~/.cache/vllm:/root/.cache/vllm \
  docker.io/kyuz0/vllm-therock-gfx1151-aotriton:latest \
  bash -lc 'source /torch-therock/.venv/bin/activate; \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    vllm serve Qwen/Qwen2.5-7B-Instruct --dtype float16 \
      --host 0.0.0.0 --port 8000 --download-dir /models'
```

Not using --network host? Map a port instead: -p 8000:8000.

5) Models, dtypes & storage

- Start with `Qwen/Qwen2.5-7B-Instruct`; larger models may work but are less forgiving on unified memory.
- Use `--dtype float16` unless you have a reason to change.
- Storage discipline:
  - Toolbx: `--download-dir ~/vllm-models` (lives in your HOME on the host).
  - Podman/Docker: `-v ~/vllm-models:/models` and `--download-dir /models`.

6) Performance notes (short)

The image is built on the PyTorch + AOTriton base; enabling TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 can improve startup/throughput on some models.
vLLM flags you might tune later: --gpu-memory-utilization, --max-num-seqs, --max-model-len. Start simple; add knobs only if needed.
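
To make the tuning flags concrete, here is a sketch that assembles a serve command with all three set. The numeric values are illustrative assumptions, not measured recommendations for Strix Halo, and the command is only printed so it can be reviewed before running.

```shell
# Illustrative starting values (assumptions, not measured recommendations).
GPU_MEM_UTIL="0.90"   # fraction of GPU memory vLLM may claim
MAX_NUM_SEQS="16"     # upper bound on concurrently scheduled sequences
MAX_MODEL_LEN="8192"  # context length cap, in tokens

serve_cmd="TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 vllm serve Qwen/Qwen2.5-7B-Instruct \
--dtype float16 --host 0.0.0.0 --port 8000 \
--gpu-memory-utilization $GPU_MEM_UTIL \
--max-num-seqs $MAX_NUM_SEQS \
--max-model-len $MAX_MODEL_LEN"

# Print the command for review; run it by hand once the values look right.
echo "$serve_cmd"
```

Changing one value at a time makes it easier to tell which knob caused an out-of-memory error or a throughput change.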

7) Requirements (host)

Hardware & drivers

- AMD Strix Halo APU (gfx1151).
- Working amdgpu stack with `/dev/kfd` (ROCm compute) and `/dev/dri` (graphics).
- Your user in the `video` and `render` groups.
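
The device and group requirements above can be checked from a shell on the host. This is a minimal sketch; `missing_groups` is a hypothetical helper name introduced here for illustration.

```shell
# Print any required group missing from a space-separated group list.
# Usage: missing_groups "$(id -nG)"
missing_groups() {
  local have=" $1 "
  local g
  for g in video render; do
    case "$have" in
      *" $g "*) ;;        # group present
      *) echo "$g" ;;     # group missing
    esac
  done
}

# Report on the current user and on the device nodes the container needs.
missing="$(missing_groups "$(id -nG)")"
[ -z "$missing" ] || echo "user is not in: $missing"
for dev in /dev/kfd /dev/dri; do
  [ -e "$dev" ] || echo "missing device node: $dev"
done
```

No output means both groups and both device nodes were found. On most distributions a missing group can be added with `sudo usermod -aG video,render "$USER"`, followed by logging out and back in.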

Unified memory setup (HIGHLY recommended)

Enable large GTT/unified memory so the iGPU can borrow system RAM for bigger models:

Kernel parameters (append to your GRUB cmdline):
```
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
```
BIOS: allocate minimal VRAM to the iGPU (e.g., 512 MB) and rely on unified memory.

Fedora example (GRUB): edit `/etc/default/grub`, add the parameters to `GRUB_CMDLINE_LINUX=...`, then regenerate the config and reboot:
```
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
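
After the reboot it is worth confirming that the parameters actually reached the kernel. The sketch below reads `/proc/cmdline` and reports each of the three parameters; `cmdline_value` is a hypothetical helper name used only for this example.

```shell
# Print the value of a kernel parameter from a command-line string,
# or return non-zero if the parameter is absent.
# Usage: cmdline_value amdgpu.gttsize "$(cat /proc/cmdline)"
cmdline_value() {
  local key="$1"
  local word
  for word in $2; do            # intentional word splitting on whitespace
    case "$word" in
      "$key"=*) echo "${word#*=}"; return 0 ;;
    esac
  done
  return 1
}

cmdline="$(cat /proc/cmdline 2>/dev/null || true)"
for key in amd_iommu amdgpu.gttsize ttm.pages_limit; do
  if value="$(cmdline_value "$key" "$cmdline")"; then
    echo "$key=$value"
  else
    echo "$key is not set"
  fi
done
```

If a parameter shows as not set, the GRUB config was probably not regenerated or was written to a different path than the one the bootloader reads.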

| Parameter | Purpose |
|---|---|
| `amd_iommu=off` | Reduces latency |
| `amdgpu.gttsize=131072` | 128 GiB GTT (unified memory) |
| `ttm.pages_limit=33554432` | Large pinned allocations |
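
The two numeric values follow from the target size. As a worked sketch, assuming `amdgpu.gttsize` is given in MiB, `ttm.pages_limit` in pages, and a 4 KiB page size (the usual x86-64 default), both can be derived from a size in GiB:

```shell
# Derive both values from a target unified-memory size in GiB.
# Assumes 4 KiB pages, which is the usual x86-64 page size.
gtt_gib=128
gttsize_mib=$((gtt_gib * 1024))              # GiB -> MiB
pages_limit=$((gtt_gib * 1024 * 1024 / 4))   # GiB -> KiB, then 4 KiB per page

echo "amdgpu.gttsize=$gttsize_mib ttm.pages_limit=$pages_limit"
```

For 128 GiB this reproduces the values in the table. On a machine with less RAM, lower `gtt_gib` so the host keeps enough memory for itself.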

Container runtime

Podman or Docker installed (examples use Podman; replace with Docker if preferred).
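
Scripts that should work with either runtime can detect which one is installed. A minimal sketch, with `pick_runtime` as a hypothetical helper name:

```shell
# Print the first command from the argument list that is installed.
pick_runtime() {
  local candidate
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

if RUNTIME="$(pick_runtime podman docker)"; then
  echo "using $RUNTIME"
else
  echo "neither podman nor docker found" >&2
fi
```

The Podman quickstart command can then be started as `"$RUNTIME" run ...`. The flags are expected to carry over to Docker, but that combination has not been verified here.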

8) Contributing

Spotted a fix, a working flag combo, or a model that should be on the list? PRs welcome! Please include:

- Model repo + exact version tag (if any)
- Full `vllm serve` command/flags that work
- vLLM version, torch & triton versions (`python -c "import torch, triton; print(torch.__version__, triton.__version__)"`)
- Short log snippet of success/failure (especially the first error)
- Any relevant kernel/AOTriton env vars (e.g., `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`)
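
Most of these details can be gathered in one go. The sketch below is one possible way to do it, run inside the toolbox or container; `collect_report` is a hypothetical function name, and tools that are not installed are reported as unavailable.

```shell
# Collect version details for a PR or issue; missing tools are reported
# as "unavailable" instead of aborting the script.
collect_report() {
  echo "kernel: $(uname -r)"
  echo "vllm: $(python -c 'import vllm; print(vllm.__version__)' 2>/dev/null || echo unavailable)"
  echo "torch: $(python -c 'import torch; print(torch.__version__)' 2>/dev/null || echo unavailable)"
  echo "triton: $(python -c 'import triton; print(triton.__version__)' 2>/dev/null || echo unavailable)"
  echo "aotriton_experimental: ${TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL:-unset}"
}

collect_report
```

Pasting the output into the PR description alongside the exact `vllm serve` command covers the version-related items in the list above.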

9) Acknowledgements & Links

Base images & docs: https://github.com/kyuz0/amd-strix-halo-pytorch-gfx1151-aotriton
Upstreams: vLLM, ROCm/TheRock, AOTriton
Community: AMD Strix Halo Home Lab Discord — https://discord.gg/pnPRyucNrG
Big thanks to lhl and ssweens for doing the actual heavy lifting for this.
