# AMD Strix Halo (gfx1151) — vLLM Toolbox/Container
A **Fedora 43** Docker/Podman container that is **Toolbx-compatible** (usable as a Fedora toolbox) for serving LLMs with **vLLM** on **AMD Ryzen AI Max “Strix Halo” (gfx1151)**. Built on the **TheRock nightly builds** of ROCm.

---
## Table of Contents
* [Tested Models (Benchmarks)](#tested-models-benchmarks)
* [1) Toolbx vs Docker/Podman](#1-toolbx-vs-dockerpodman)
* [2) Quickstart — Fedora Toolbx](#2-quickstart--fedora-toolbx)
* [3) Quickstart — Ubuntu (Distrobox)](#3-quickstart--ubuntu-distrobox)
* [4) Testing the API](#4-testing-the-api)
* [5) Use a Web UI for Chatting](#5-use-a-web-ui-for-chatting)
* [6) Host Configuration](#6-host-configuration)
## Tested Models (Benchmarks)
View full benchmarks at: [https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/](https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/)

**Table Key:** Cell values represent `Max Context Length (GPU Memory Utilization)`.

| Model | TP | 1 Req | 4 Reqs | 8 Reqs | 16 Reqs |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **`meta-llama/Meta-Llama-3.1-8B-Instruct`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`google/gemma-3-12b-it`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`openai/gpt-oss-20b`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`Qwen/Qwen3-14B-AWQ`** | 1 | 40k (0.90) | 40k (0.90) | 40k (0.90) | 40k (0.90) |
| **`cpatonn/Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit`** | 1 | 256k (0.95) | 204k (0.90) | - | - |
| **`dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16`** | 1 | 256k (0.90) | - | - | - |
| **`openai/gpt-oss-120b`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |

---
## 1) Toolbx vs Docker/Podman
The `kyuz0/vllm-therock-gfx1151:latest` image can be used in two ways:

* **Fedora Toolbx (recommended for development):** Toolbx shares your **HOME** and user, so models and configs live on the host. Great for iterating quickly while keeping the host clean.
* **Docker/Podman (recommended for deployment/perf):** Use for running vLLM as a service (host networking, IPC tuning, etc.). Always **mount a host directory** for model weights so they stay outside the container.
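For the Docker/Podman route, the following is a minimal sketch of how such a launch could be assembled. It reuses the device, group, and seccomp flags from the Toolbx quickstart and adds host networking, host IPC, and a bind mount for weights. Several pieces are assumptions rather than anything this README documents: the function name `print_serve_cmd`, the host path `/srv/models`, the container-side cache path `/root/.cache/huggingface` (which assumes the container user is root), and the use of `vllm serve` as the command inside the image. The function only prints the command so it can be reviewed before running.

```bash
# Sketch only: prints a candidate `podman run` command instead of executing it.
print_serve_cmd() {
  local model="${1:-Qwen/Qwen3-14B-AWQ}"            # any model you intend to serve
  local host_cache="${2:-$HOME/.cache/huggingface}" # host directory holding weights
  local image="docker.io/kyuz0/vllm-therock-gfx1151:latest"

  local cmd=(
    podman run --rm -it
    # Same GPU access flags as the toolbox create command.
    --device /dev/kfd --device /dev/dri
    --group-add video --group-add render
    --security-opt seccomp=unconfined
    # Host networking and IPC, as suggested for deployments.
    --network host --ipc host
    # ASSUMPTION: container user is root, so the cache lives under /root.
    -v "$host_cache:/root/.cache/huggingface"
    "$image"
    # ASSUMPTION: the image exposes the standard `vllm serve` CLI.
    vllm serve "$model" --host 0.0.0.0 --port 8000
  )

  # Shell-quote each argument so the output can be pasted back into a terminal.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

# /srv/models is a hypothetical host path.
print_serve_cmd "Qwen/Qwen3-14B-AWQ" "/srv/models"
```

Swapping `podman` for `docker` in the array should work the same way, since all of the flags used here exist in both CLIs.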
---
## 2) Quickstart — Fedora Toolbx
Create a toolbox that exposes the GPU and relaxes seccomp to avoid ROCm syscall issues:
```bash
toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined
```
Enter it:

```bash
toolbox enter vllm
```

**Model storage:** Models are downloaded to `~/.cache/huggingface` by default. This directory is shared with the host if you created the toolbox correctly, so downloads persist.
### Serving a Model (Easiest Way)
The toolbox includes a TUI wizard, **`start-vllm`**, which comes with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

> **Cache note:** vLLM writes compiled kernels to `~/.cache/vllm/`.

---
## 3) Quickstart — Ubuntu (Distrobox)
Ubuntu's toolbox package still breaks GPU access, so use Distrobox instead:

```bash
distrobox create -n vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  --additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"
distrobox enter vllm
```

> **Verification:** Run `rocm-smi` to check GPU status.
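If `rocm-smi` does not report the GPU, checking the host for the device nodes and groups that the create command passes through can help narrow things down. The helper below is a sketch under that assumption; `check_gpu_access` is a hypothetical name, and a group warning is only a hint, since group names are not always visible the same way inside a container.

```bash
# Sketch: check the device nodes and groups referenced by the create command.
check_gpu_access() {
  local devices=("$@")
  # Default to the nodes passed via --device above.
  if [ "${#devices[@]}" -eq 0 ]; then
    devices=(/dev/kfd /dev/dri)
  fi

  local status=0 dev grp
  for dev in "${devices[@]}"; do
    if [ -e "$dev" ]; then
      echo "ok: $dev exists"
    else
      echo "missing: $dev"
      status=1
    fi
  done

  # Membership in video/render is what --group-add relies on.
  for grp in video render; do
    if id -nG | tr ' ' '\n' | grep -qx "$grp"; then
      echo "ok: user is in group $grp"
    else
      echo "warn: user is not in group $grp"
    fi
  done

  # Non-zero only when a device node is missing.
  return "$status"
}
```

Running `check_gpu_access` with no arguments checks `/dev/kfd` and `/dev/dri`; other paths can be passed as arguments.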
### Serving a Model (Easiest Way)
The container includes a TUI wizard, **`start-vllm`**, which comes with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

---
## 4) Testing the API
Once the server is up, hit the OpenAI-compatible endpoint. Set `model` to the name of the model you actually launched:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
```

You should receive a JSON response with the reply in `choices[0].message.content`.
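Loading a model can take a while, so a request sent too early may simply fail to connect. The following sketch polls `/v1/models` until the server answers. The function name `wait_for_vllm` and its defaults (60 attempts, 5 seconds apart) are illustrative choices, not something the image provides.

```bash
# Sketch: block until the vLLM server answers on /v1/models, or give up.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-60}"   # how many times to try
  local delay="${3:-5}"       # seconds to sleep between tries
  local i

  for ((i = 1; i <= attempts; i++)); do
    # -f makes curl fail on HTTP errors; --max-time bounds each try.
    if curl -sf --max-time 5 "$base_url/v1/models" >/dev/null 2>&1; then
      echo "vLLM is ready at $base_url"
      return 0
    fi
    sleep "$delay"
  done

  echo "vLLM did not respond at $base_url after $attempts attempts" >&2
  return 1
}
```

A typical use would be `wait_for_vllm && curl ...`, so the test request only runs once the server is reachable.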
If you don't want to specify the model name by hand, query the currently deployed model first and reuse its ID (requires `jq`):

```bash
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
```

---
## 5) Use a Web UI for Chatting
If vLLM is on a remote server, expose port 8000 via SSH port forwarding:

```bash
ssh -L 0.0.0.0:8000:localhost:8000 <vllm-host>
```
Then start Hugging Face Chat UI on your host:

```bash
docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db
```
## 6) Host Configuration
This should work on any Strix Halo machine. For a complete list of available hardware, see the [Strix Halo Hardware Database](https://strixhalo-homelab.d7.wtf/Hardware).
### 6.1 Test Configuration
| Component         | Specification                                               |
| :---------------- | :---------------------------------------------------------- |
| **Test Machine**  | Framework Desktop                                           |
| **CPU**           | Ryzen AI MAX+ 395 "Strix Halo"                              |
| **System Memory** | 128 GB RAM                                                  |
| **GPU Memory**    | 512 MB allocated in BIOS                                    |
| **Host OS**       | Fedora 42, Linux 6.18.0-0.rc6.vanilla.fc42.x86_64           |
### 6.2 Kernel Parameters (tested on Fedora 42)
Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):

`amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856`
| Parameter | Purpose |
|-----------------------------|--------------------------------------------------------------------------------------------|
| `amd_iommu=off` | Disables IOMMU for lower latency |
| `amdgpu.gttsize=126976` | Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB |
| `ttm.pages_limit=32505856` | Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB |
Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
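The arithmetic in the table can be reproduced for a different memory cap. This is a small sketch that assumes the same units the table uses (MiB for `amdgpu.gttsize`, 4 KiB pages for `ttm.pages_limit`); `gtt_params` is a hypothetical helper name, and only the 124 GiB values were tested in this README.

```bash
# Sketch: derive both parameters from a cap given in GiB.
gtt_params() {
  local gib="$1"
  local mib=$(( gib * 1024 ))        # GiB -> MiB, the unit of amdgpu.gttsize
  local pages=$(( mib * 1024 / 4 ))  # MiB -> KiB -> number of 4 KiB pages
  echo "amdgpu.gttsize=${mib} ttm.pages_limit=${pages}"
}

gtt_params 124   # prints: amdgpu.gttsize=126976 ttm.pages_limit=32505856
```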
**Apply the changes:**
```
# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
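After rebooting, it can be worth confirming that the kernel actually booted with the new parameters. The sketch below checks a command line string for the three values from section 6.2 and reads `/proc/cmdline` when no argument is given. `check_cmdline` is a hypothetical helper, not part of the image.

```bash
# Sketch: verify the boot parameters from section 6.2 are present.
check_cmdline() {
  # Use the argument if given, otherwise the running kernel's command line.
  local cmdline="${1:-$(cat /proc/cmdline)}"
  local missing=0 param

  for param in amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856; do
    # Pad with spaces so only whole parameters match.
    case " $cmdline " in
      *" $param "*) echo "ok: $param" ;;
      *) echo "missing: $param"; missing=1 ;;
    esac
  done

  # Non-zero if any parameter was not found.
  return "$missing"
}
```

Calling `check_cmdline` with no arguments on the rebooted host reports each parameter as `ok` or `missing`.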