AMD Strix Halo (gfx1151) — vLLM Toolbox/Container

A Fedora 43 Docker/Podman container, also usable as a Fedora Toolbx toolbox, for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). It is built on TheRock nightly builds of ROCm.


Tested Models (Benchmarks)

View full benchmarks at: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/

Table Key: Cell values represent Max Context Length (GPU Memory Utilization).

| Model | TP | 1 Req | 4 Reqs | 8 Reqs | 16 Reqs |
|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| google/gemma-3-12b-it | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| openai/gpt-oss-20b | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| Qwen/Qwen3-14B-AWQ | 1 | 40k (0.90) | 40k (0.90) | 40k (0.90) | 40k (0.90) |
| cpatonn/Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit | 1 | 256k (0.95) | 204k (0.90) | - | - |
| dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 | 1 | 256k (0.90) | - | - | - |
| openai/gpt-oss-120b | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |

1) Toolbx vs Docker/Podman

The kyuz0/vllm-therock-gfx1151:latest image can be used in two ways:

  • Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
  • Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.
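
As a rough sketch of the Docker/Podman route described above, the function below assembles a `podman run` command with host networking, host IPC, the same GPU flags used in the Toolbx quickstart, and a bind mount for model weights. The host directory `$HOME/models`, the container path `/root/.cache/huggingface`, the helper name `launch_vllm`, and the use of `vllm serve` as the container command are assumptions for illustration, not settings documented for this image. With `DRY_RUN` set, the command is only printed.

```shell
# Sketch only: MODEL_DIR, the container cache path, and "vllm serve" are assumptions.
IMAGE="docker.io/kyuz0/vllm-therock-gfx1151:latest"
MODEL_DIR="${MODEL_DIR:-$HOME/models}"   # hypothetical host directory for weights

launch_vllm() {
  # With DRY_RUN set, the command is printed instead of executed.
  # Flags: host networking and IPC for serving, GPU devices and groups as in
  # the Toolbx quickstart, and a bind mount so weights stay on the host.
  ${DRY_RUN:+echo} podman run --rm -it \
    --network host \
    --ipc host \
    --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    --security-opt seccomp=unconfined \
    -v "$MODEL_DIR:/root/.cache/huggingface" \
    "$IMAGE" vllm serve "$1"
}

# Print the command for one of the models from the table above.
DRY_RUN=1 launch_vllm "Qwen/Qwen3-14B-AWQ"
```

Calling `launch_vllm` without `DRY_RUN` would start the container; check the printed command against your setup before doing so.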

2) Quickstart — Fedora Toolbx

Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:

toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

Enter it:

toolbox enter vllm

Model storage: Models are downloaded to ~/.cache/huggingface by default. Because Toolbx shares your HOME with the host, this directory lives on the host and downloads persist.

Serving a Model (Easiest Way)

The toolbox includes a TUI wizard, start-vllm, that ships with pre-configured models and handles the launch flags for you. It is the easiest way to get started.

start-vllm

Cache note: vLLM writes compiled kernels to ~/.cache/vllm/.
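
Both cache locations mentioned in this section sit under your home directory, so they can be inspected from the host or from inside the toolbox. A small sketch of how to check their size; `cache_report` is a hypothetical helper name, not something the image provides.

```shell
# Print the size of each cache directory that exists; report missing ones.
cache_report() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      du -sh "$dir"
    else
      echo "missing: $dir"
    fi
  done
}

cache_report "$HOME/.cache/huggingface" "$HOME/.cache/vllm"
```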


3) Quickstart — Ubuntu (Distrobox)

Ubuntu's toolbox package still breaks GPU access, so use Distrobox instead:

distrobox create -n vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  --additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"

distrobox enter vllm

Verification: Run rocm-smi to check GPU status.
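
If `rocm-smi` reports no devices, one thing worth checking first is whether the device nodes passed with `--device` are visible inside the container. A minimal sketch; `check_gpu_nodes` is a hypothetical helper, not part of the image.

```shell
# Report whether each device node passed as an argument exists.
# Returns non-zero if any of them is missing.
check_gpu_nodes() {
  status=0
  for node in "$@"; do
    if [ -e "$node" ]; then
      echo "ok: $node"
    else
      echo "missing: $node"
      status=1
    fi
  done
  return "$status"
}

check_gpu_nodes /dev/kfd /dev/dri \
  || echo "GPU devices not visible; recheck the --device flags"
```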

Serving a Model (Easiest Way)

The toolbox includes a TUI wizard, start-vllm, that ships with pre-configured models and handles the launch flags for you. It is the easiest way to get started.

start-vllm

4) Testing the API

Once the server is up, send a request to the OpenAI-compatible endpoint, replacing the model name with the one you are serving:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'

You should receive a JSON response with a choices[0].message.content reply.
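
To print only the reply text rather than the full JSON, the response can be piped through `jq`. The sketch below uses a hand-written response of the expected shape, not real server output; `extract_reply` is a hypothetical helper name.

```shell
# Read a chat completion response on stdin and print only the reply text.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Example input with the expected shape (hand-written, not real output).
printf '%s' '{"choices":[{"message":{"role":"assistant","content":"Hello!"}}]}' \
  | extract_reply
# prints: Hello!
```

With a running server, add `-s` to the `curl` command above and pipe its output into `extract_reply`.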

To avoid hard-coding the model name, first query the server for the currently deployed model, then use it in the request:

MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
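
Both requests above assume the server has finished loading the model, which can take a while for the larger entries in the benchmark table. A minimal sketch of a readiness check that polls the models endpoint; `wait_for_vllm` and its default timeout of 300 seconds are illustrative choices, not part of the toolbox.

```shell
# Poll the models endpoint until it answers or the timeout expires.
# The timeout is approximate: it counts one-second sleeps between attempts
# and does not include the time each curl attempt takes.
wait_for_vllm() {
  url="${1:-http://localhost:8000/v1/models}"
  timeout="${2:-300}"
  waited=0
  until curl -sf -o /dev/null --max-time 2 "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "vLLM did not answer at $url within ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "vLLM is ready at $url"
}
```

Run `wait_for_vllm` before sending the first request, for example `wait_for_vllm && curl -s http://localhost:8000/v1/models`.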

5) Use a Web UI for Chatting

If vLLM is on a remote server, expose port 8000 via SSH port forwarding:

ssh -L 0.0.0.0:8000:localhost:8000 <vllm-host>

Then start Hugging Face Chat UI on the machine where port 8000 is reachable:

docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db

6) Host Configuration

This should work on any Strix Halo. For a complete list of available hardware, see: Strix Halo Hardware Database

6.1 Test Configuration

| Component | Specification |
|---|---|
| Test Machine | Framework Desktop |
| CPU | Ryzen AI MAX+ 395 "Strix Halo" |
| System Memory | 128 GB RAM |
| GPU Memory | 512 MB allocated in BIOS |
| Host OS | Fedora 42, Linux 6.18.0-0.rc6.vanilla.fc42.x86_64 |

6.2 Kernel Parameters (tested on Fedora 42)

Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):

amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856

| Parameter | Purpose |
|---|---|
| amd_iommu=off | Disables IOMMU for lower latency |
| amdgpu.gttsize=126976 | Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB |
| ttm.pages_limit=32505856 | Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB |
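
The two numeric values follow from the amount of memory given to the iGPU, so they can be recomputed for a different split. A short sketch of the arithmetic, assuming the 4 KiB page size used in the table; the variable names are illustrative.

```shell
# Derive both kernel parameters from the memory (in GiB) given to the iGPU.
GPU_GIB=124
GTT_MIB=$((GPU_GIB * 1024))                 # GiB -> MiB
TTM_PAGES=$((GPU_GIB * 1024 * 1024 / 4))    # GiB -> KiB, then 4 KiB pages

echo "amdgpu.gttsize=$GTT_MIB ttm.pages_limit=$TTM_PAGES"
# prints: amdgpu.gttsize=126976 ttm.pages_limit=32505856
```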

Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Apply the changes:

# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
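
After the reboot, it may be worth confirming that the parameters actually reached the kernel. A minimal sketch that checks `/proc/cmdline` for each one; `cmdline_has` is a hypothetical helper name.

```shell
# Succeed only if the given kernel command line contains every listed parameter.
cmdline_has() {
  cmdline="$1"
  shift
  for param in "$@"; do
    case " $cmdline " in
      *" $param "*) ;;                              # exact, space-delimited match
      *) echo "missing: $param"; return 1 ;;
    esac
  done
}

if [ -r /proc/cmdline ]; then
  if cmdline_has "$(cat /proc/cmdline)" \
      amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856; then
    echo "all parameters active"
  fi
fi
```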