
# AMD Strix Halo (gfx1151) — vLLM Toolbox/Container

A **Fedora 43** Docker/Podman container that is **Toolbx-compatible** (usable as a Fedora toolbox) for serving LLMs with **vLLM** on **AMD Ryzen AI Max “Strix Halo” (gfx1151)**. Built on **TheRock nightly builds** for ROCm.


---

## Table of Contents

* [Tested Models (Benchmarks)](#tested-models-benchmarks)
* [1) Toolbx vs Docker/Podman](#1-toolbx-vs-dockerpodman)
* [2) Quickstart — Fedora Toolbx](#2-quickstart--fedora-toolbx)
* [3) Quickstart — Ubuntu (Distrobox)](#3-quickstart--ubuntu-distrobox)
* [4) Testing the API](#4-testing-the-api)
* [5) Use a Web UI for Chatting](#5-use-a-web-ui-for-chatting)
* [6) Host Configuration](#6-host-configuration)


## Tested Models (Benchmarks)

View full benchmarks at: [https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/](https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/)

**Table Key:** Cell values represent `Max Context Length (GPU Memory Utilization)`. `TP` is the tensor-parallel size and `N Reqs` is the number of concurrent requests.

| Model | TP | 1 Req | 4 Reqs | 8 Reqs | 16 Reqs |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **`meta-llama/Meta-Llama-3.1-8B-Instruct`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`google/gemma-3-12b-it`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`openai/gpt-oss-20b`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`Qwen/Qwen3-14B-AWQ`** | 1 | 40k (0.90) | 40k (0.90) | 40k (0.90) | 40k (0.90) |
| **`cpatonn/Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit`** | 1 | 256k (0.95) | 204k (0.90) | - | - |
| **`dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16`** | 1 | 256k (0.90) | - | - | - |
| **`openai/gpt-oss-120b`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |


---

## 1) Toolbx vs Docker/Podman

The `kyuz0/vllm-therock-gfx1151:latest` image can be used in two ways:

* **Fedora Toolbx (recommended for development):** Toolbx shares your **HOME** and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean. 
* **Docker/Podman (recommended for deployment/perf):** Use for running vLLM as a service (host networking, IPC tuning, etc.). Always **mount a host directory** for model weights so they stay outside the container.
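
For the Docker/Podman route, the options mentioned above (host networking, host IPC, a mounted host directory for weights) might be combined roughly as follows. This is a sketch under stated assumptions, not a command shipped with this repository: `run_vllm`, `RUNNER` and `MODELS_DIR` are hypothetical names, the `/root/.cache/huggingface` mount target assumes the container runs as root, and `vllm serve` is assumed to be on the image's `PATH`.

```bash
# Sketch only: MODELS_DIR and the /root/.cache/huggingface target are assumptions.
MODELS_DIR="${MODELS_DIR:-$HOME/models}"

run_vllm() {
  # RUNNER defaults to podman; use RUNNER=docker, or RUNNER=echo to preview the command.
  "${RUNNER:-podman}" run --rm \
    --network host --ipc host \
    --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    --security-opt seccomp=unconfined \
    -v "$MODELS_DIR:/root/.cache/huggingface" \
    docker.io/kyuz0/vllm-therock-gfx1151:latest \
    vllm serve "$1"
}

# Preview the full command without starting a container.
RUNNER=echo run_vllm openai/gpt-oss-20b
```

Drop `RUNNER=echo` to start the container for real. The device, group and seccomp flags are the same ones used for the toolbox in the quickstart sections.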

---

## 2) Quickstart — Fedora Toolbx

**Recommended:** Use the included `refresh_toolbox.sh` script. It pulls the latest image and creates the toolbox with the correct parameters:

```bash
./refresh_toolbox.sh
```

> **InfiniBand / RDMA Support:** The script automatically detects if a fast InfiniBand link is active (checks `/dev/infiniband`). If found, it correctly sets up the container to expose these devices, enabling high-performance clustering.
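
The script's own logic is not reproduced here. As a rough illustration of this kind of detection, the sketch below adds an extra `--device` flag only when the directory exists and contains entries. `ib_device_flags` is a hypothetical helper, and the real script may check more than this (for example, link state).

```bash
# Hypothetical sketch: print an extra --device flag only if the directory
# exists and is non-empty; print nothing otherwise.
ib_device_flags() {
  dir=${1:-/dev/infiniband}
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "--device $dir"
  fi
}

echo "extra flags: $(ib_device_flags)"
```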

**Manual Creation:**

To manually create a toolbox that exposes the GPU and relaxes seccomp:

```bash
toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined
```

Enter it:

```bash
toolbox enter vllm
```
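
Once inside, it can be worth confirming that the device nodes passed with `--device` are actually visible. A minimal sketch; `missing_paths` is a hypothetical helper, not something the image provides:

```bash
# Print every path from the argument list that does not exist.
missing_paths() {
  for p in "$@"; do
    [ -e "$p" ] || echo "$p"
  done
}

missing=$(missing_paths /dev/kfd /dev/dri)
if [ -z "$missing" ]; then
  echo "GPU device nodes are visible"
else
  echo "not visible in this container: $missing"
fi
```

If a path is reported as missing, recreating the toolbox with the `--device` flags shown above is the first thing to try.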

**Model storage:** Models are downloaded to `~/.cache/huggingface` by default. Because Toolbx shares your home directory with the host, this directory lives on the host and downloads persist when the toolbox is recreated.

### Serving a Model (Easiest Way)

The toolbox ships a TUI wizard, **`start-vllm`**, that offers pre-configured models and sets the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

> **Cache note:** vLLM writes compiled kernels to `~/.cache/vllm/`.

---

## 3) Quickstart — Ubuntu (Distrobox)

Ubuntu’s toolbox package still breaks GPU access, so use Distrobox instead:

```bash
distrobox create -n vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  --additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"

distrobox enter vllm
```

> **Verification:** Run `rocm-smi` to check GPU status.

### Serving a Model (Easiest Way)

The toolbox ships a TUI wizard, **`start-vllm`**, that offers pre-configured models and sets the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

---

## 4) Testing the API

Once the server is up, send a request to the OpenAI-compatible endpoint, setting `model` to the model you are serving:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
```

You should receive a JSON response with a `choices[0].message.content` reply.
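
Larger models can take a while to load, so a request sent too early may simply be refused. One way to handle this is to poll `/v1/models` until it answers. A minimal sketch; `wait_until` and `server_ready` are hypothetical helpers and the one-second interval is an arbitrary choice:

```bash
# Retry a probe command once per second until it succeeds
# or the given number of attempts is used up.
wait_until() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    "$@" && return 0
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Probe: succeeds once /v1/models returns an HTTP success status.
server_ready() {
  curl -sf "${1:-http://localhost:8000}/v1/models" >/dev/null
}
```

For example, `wait_until 300 server_ready && echo "vLLM is ready"` waits up to roughly five minutes.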

If you don't want to type the model name by hand, query the server for the currently deployed model first:

```bash
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
```

---

## 5) Use a Web UI for Chatting

If vLLM is on a remote server, expose port 8000 via SSH port forwarding:

```bash
ssh -L 0.0.0.0:8000:localhost:8000 <vllm-host>
```

Then start Hugging Face Chat UI on your local machine:

```bash
docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db
```

---

## 6) Host Configuration

This should work on any Strix Halo system. For a list of available hardware, see: [Strix Halo Hardware Database](https://strixhalo-homelab.d7.wtf/Hardware)

### 6.1 Test Configuration

| Component         | Specification                                               |
| :---------------- | :---------------------------------------------------------- |
| **Test Machine**  | Framework Desktop                                           |
| **CPU**           | Ryzen AI MAX+ 395 "Strix Halo"                              |
| **System Memory** | 128 GB RAM                                                  |
| **GPU Memory**    | 512 MB allocated in BIOS                                    |
| **Host OS**       | Fedora 42, Linux 6.18.0-0.rc6.vanilla.fc42.x86_64           |

### 6.2 Kernel Parameters (tested on Fedora 42)

Add these boot parameters to enable unified memory while reserving at least 4 GiB for the OS (at most 124 GiB for the iGPU on a 128 GB machine):

`amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856`

| Parameter                   | Purpose                                                                                    |
|-----------------------------|--------------------------------------------------------------------------------------------|
| `amd_iommu=off`             | Disables IOMMU for lower latency                                                           |
| `amdgpu.gttsize=126976`     | Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB                            |
| `ttm.pages_limit=32505856`  | Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB                     |

Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
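
Both memory values follow from the GPU budget in GiB, so they can be recomputed for a different split (for example, on a machine with less RAM). A minimal sketch of the arithmetic in the table, assuming 4 KiB pages; `gpu_mem_params` is a hypothetical helper, not part of this repository:

```bash
# Derive both memory parameters from a GPU budget given in GiB.
gpu_mem_params() {
  gib=$1
  mib=$((gib * 1024))    # amdgpu.gttsize is expressed in MiB
  pages=$((mib * 256))   # 1 MiB = 256 pages of 4 KiB
  echo "amdgpu.gttsize=$mib ttm.pages_limit=$pages"
}

gpu_mem_params 124
# prints: amdgpu.gttsize=126976 ttm.pages_limit=32505856
```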


**Apply the changes:**

```bash
# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```