A Fedora 43 Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on the TheRock nightly builds for ROCm.

Tested Models (Benchmarks)
1) Toolbx vs Docker/Podman
2) Quickstart — Fedora Toolbx
3) Quickstart — Ubuntu (Distrobox)
4) Testing the API
5) Use a Web UI for Chatting
6) Host Configuration

Tested Models (Benchmarks)

View full benchmarks at: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/

Table Key: Cell values represent Max Context Length (GPU Memory Utilization).

Model	TP	1 Req	4 Reqs	8 Reqs	16 Reqs
`meta-llama/Meta-Llama-3.1-8B-Instruct`	1	128k (0.95)	128k (0.95)	128k (0.95)	128k (0.95)
`google/gemma-3-12b-it`	1	128k (0.95)	128k (0.95)	128k (0.95)	128k (0.95)
`openai/gpt-oss-20b`	1	128k (0.95)	128k (0.95)	128k (0.95)	128k (0.95)
`Qwen/Qwen3-14B-AWQ`	1	40k (0.90)	40k (0.90)	40k (0.90)	40k (0.90)
`cpatonn/Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit`	1	256k (0.95)	204k (0.90)	-	-
`dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16`	1	256k (0.90)	-	-	-
`openai/gpt-oss-120b`	1	128k (0.95)	128k (0.95)	128k (0.95)	128k (0.95)

1) Toolbx vs Docker/Podman

The kyuz0/vllm-therock-gfx1151:latest image can be used in two ways:

Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.
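
For the Docker/Podman route, a launch command might look roughly like the sketch below. It reuses the device, group, and seccomp flags from the quickstart commands in this README and adds host networking, host IPC, and a bind mount for the Hugging Face cache. The cache path inside the container (/root/.cache/huggingface), the default model, and the idea that `vllm serve` can be passed directly as the container command are assumptions, not something this README states, so adjust them to your setup. The helper name is hypothetical, and it only prints the command so you can inspect it before running anything.

```shell
# Sketch: print a podman command for running vLLM as a service.
# Assumptions (not from this README): the in-container cache path,
# `vllm serve` being runnable as the container command, and the default model.
vllm_podman_cmd() {
  local model="${1:-meta-llama/Meta-Llama-3.1-8B-Instruct}"
  local cache_dir="${2:-$HOME/.cache/huggingface}"
  # Unquoted heredoc: variables expand, and "\\" prints a single backslash.
  cat <<EOF
podman run --rm -it \\
  --device /dev/kfd --device /dev/dri \\
  --group-add video --group-add render \\
  --security-opt seccomp=unconfined \\
  --network host --ipc host \\
  -v $cache_dir:/root/.cache/huggingface \\
  docker.io/kyuz0/vllm-therock-gfx1151:latest \\
  vllm serve $model --host 0.0.0.0 --port 8000
EOF
}

# Print the command with the default model and cache directory.
vllm_podman_cmd
```

Passing a model name and a host directory as arguments (for example `vllm_podman_cmd openai/gpt-oss-20b /srv/models`) changes the printed command accordingly; the host directory is what keeps the weights outside the container.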

2) Quickstart — Fedora Toolbx

Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:

toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

Enter it:

toolbox enter vllm

Model storage: Models are downloaded to ~/.cache/huggingface by default. Because Toolbx shares your home directory with the host, this directory lives on the host and downloads persist.

Serving a Model (Easiest Way)

The toolbox ships a TUI wizard called start-vllm that comes with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

start-vllm

Cache note: vLLM writes compiled kernels to ~/.cache/vllm/.

3) Quickstart — Ubuntu (Distrobox)

Ubuntu’s toolbox package still breaks GPU access, so use Distrobox instead:

distrobox create -n vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  --additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"

distrobox enter vllm

Verification: Run rocm-smi to check GPU status.
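
If rocm-smi does not show the GPU, one possible first check is whether the device nodes and groups passed at creation time are actually visible inside the container. The sketch below assumes only the /dev/kfd and /dev/dri paths and the video and render groups from the create command; the helper name is hypothetical.

```shell
# Report which of the given paths exist; return non-zero if any is missing.
check_gpu_nodes() {
  local missing=0
  local node
  for node in "$@"; do
    if [ -e "$node" ]; then
      echo "ok:      $node"
    else
      echo "missing: $node"
      missing=1
    fi
  done
  return "$missing"
}

# Device nodes passed through by the create command.
check_gpu_nodes /dev/kfd /dev/dri || echo "GPU device nodes are not visible here"

# Groups of the current user; video and render are expected in the list.
id -nG
```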

Serving a Model (Easiest Way)

The toolbox ships a TUI wizard called start-vllm that comes with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

start-vllm

4) Testing the API

Once the server is up, hit the OpenAI‑compatible endpoint:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'

You should receive a JSON response with a choices[0].message.content reply.

The model field must match the model the server is running. To avoid hard-coding the name, query the server for the currently deployed model first:

MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
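
Large models can take a while to load, so a script that calls the API right after launching the server may fail on its first request. A minimal sketch of a wait loop is shown below; it polls the same /v1/models endpoint used above. The function name, attempt count, and delay are illustrative choices, not part of the toolbox.

```shell
# Poll /v1/models until the server answers or the attempts run out.
# Arguments: base URL, number of attempts, delay in seconds between attempts.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each probe.
    if curl -sf --max-time 5 -o /dev/null "$base_url/v1/models"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Used as `wait_for_vllm && curl -s http://localhost:8000/v1/models`, the second command only runs once the endpoint responds.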

5) Use a Web UI for Chatting

If vLLM is on a remote server, expose port 8000 via SSH port forwarding:

ssh -L 0.0.0.0:8000:localhost:8000 <vllm-host>

Then start Hugging Face Chat UI on the machine where the port is forwarded:

docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db

6) Host Configuration

This setup should work on any Strix Halo system. For a complete list of available hardware, see: Strix Halo Hardware Database

6.1 Test Configuration

Component	Specification
Test Machine	Framework Desktop
CPU	Ryzen AI MAX+ 395 "Strix Halo"
System Memory	128 GB RAM
GPU Memory	512 MB allocated in BIOS
Host OS	Fedora 42, Linux 6.18.0-0.rc6.vanilla.fc42.x86_64

6.2 Kernel Parameters (tested on Fedora 42)

Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):

amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856

Parameter	Purpose
`amd_iommu=off`	Disables IOMMU for lower latency
`amdgpu.gttsize=126976`	Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB
`ttm.pages_limit=32505856`	Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB
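
The arithmetic in the table can be repeated for machines with a different amount of RAM or a different OS reserve. The sketch below follows the same conversions (GiB to MiB for gttsize, 4 KiB pages for pages_limit); the function name is hypothetical, and the 4 KiB page size is the assumption already used in the table.

```shell
# Print the boot parameters for a given total RAM and OS reserve (both in GiB).
strix_kernel_params() {
  local total_gib="$1"
  local reserve_gib="$2"
  local gtt_mib=$(( (total_gib - reserve_gib) * 1024 ))   # GiB -> MiB
  local pages=$(( gtt_mib * 1024 / 4 ))                   # MiB -> KiB -> 4 KiB pages
  echo "amd_iommu=off amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages}"
}

# Test machine from section 6.1: 128 GiB RAM, 4 GiB kept for the OS.
strix_kernel_params 128 4
# prints: amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
```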

Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Apply the changes:

# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
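
After the reboot, it can be worth confirming that the parameters actually reached the running kernel. The sketch below compares /proc/cmdline against the three parameters from this section; the helper name is hypothetical.

```shell
# Return 0 if every expected parameter appears as a whole word in the
# given command line; otherwise print the first missing one and return 1.
cmdline_has_params() {
  local cmdline="$1"
  local param
  shift
  for param in "$@"; do
    case " $cmdline " in
      *" $param "*) ;;
      *) echo "missing: $param"; return 1 ;;
    esac
  done
  return 0
}

# Check the running kernel's command line.
cmdline_has_params "$(cat /proc/cmdline 2>/dev/null)" \
  amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856 \
  || echo "kernel parameters not applied yet"
```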

