# AMD Strix Halo (gfx1151) — vLLM Toolbox/Container

A **Fedora 43** Docker/Podman container that is **Toolbx-compatible** (usable as a Fedora toolbox) for serving LLMs with **vLLM** on **AMD Ryzen AI Max “Strix Halo” (gfx1151)**. Built on the **TheRock nightly builds** for ROCm.

---

## Table of Contents

* [Tested Models (Benchmarks)](#tested-models-benchmarks)
* [1) Toolbx vs Docker/Podman](#1-toolbx-vs-dockerpodman)
* [2) Quickstart — Fedora Toolbx](#2-quickstart--fedora-toolbx)
* [3) Quickstart — Ubuntu (Distrobox)](#3-quickstart--ubuntu-distrobox)
* [4) Testing the API](#4-testing-the-api)
* [5) Use a Web UI for Chatting](#5-use-a-web-ui-for-chatting)
* [6) Host Configuration](#6-host-configuration)

## Tested Models (Benchmarks)

View full benchmarks at: [https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/](https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/)

**Table Key:** Each cell shows `Max Context Length (GPU Memory Utilization)` for the given number of concurrent requests. **TP** is the tensor-parallel size. A `-` means no value is reported for that configuration.

| Model | TP | 1 Req | 4 Reqs | 8 Reqs | 16 Reqs |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **`meta-llama/Meta-Llama-3.1-8B-Instruct`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`google/gemma-3-12b-it`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`openai/gpt-oss-20b`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| **`Qwen/Qwen3-14B-AWQ`** | 1 | 40k (0.90) | 40k (0.90) | 40k (0.90) | 40k (0.90) |
| **`cpatonn/Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit`** | 1 | 256k (0.95) | 204k (0.90) | - | - |
| **`dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16`** | 1 | 256k (0.90) | - | - | - |
| **`openai/gpt-oss-120b`** | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |

---

## 1) Toolbx vs Docker/Podman

The `kyuz0/vllm-therock-gfx1151:latest` image can be used in either of two ways:

* **Fedora Toolbx (recommended for development):** Toolbx shares your **HOME** and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
* **Docker/Podman (recommended for deployment/perf):** Use for running vLLM as a service (host networking, IPC tuning, etc.). Always **mount a host directory** for model weights so they stay outside the container.

---

## 2) Quickstart — Fedora Toolbx

**Recommended:** Use the included `refresh_toolbox.sh` script. It pulls the latest image and creates the toolbox with the correct parameters:

```bash
./refresh_toolbox.sh
```

> **InfiniBand / RDMA Support:** The script checks `/dev/infiniband` to detect an active InfiniBand link. If one is found, it configures the container to expose these devices, enabling high-performance clustering.

**Manual Creation:** To manually create a toolbox that exposes the GPU and relaxes seccomp:

```bash
toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined
```

Enter it:

```bash
toolbox enter vllm
```

**Model storage:** Models are downloaded to `~/.cache/huggingface` by default. Because Toolbx shares your HOME with the host, downloads persist after the toolbox is removed or recreated.

### Serving a Model (Easiest Way)

The toolbox includes a TUI wizard called **`start-vllm`** that ships with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

> **Cache note:** vLLM writes compiled kernels to `~/.cache/vllm/`.

---

## 3) Quickstart — Ubuntu (Distrobox)

Ubuntu’s toolbox package still breaks GPU access, so use Distrobox instead:

```bash
distrobox create -n vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  --additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"

distrobox enter vllm
```

> **Verification:** Run `rocm-smi` to check GPU status.
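
If `rocm-smi` reports nothing useful, it can help to first confirm that the device nodes passed with `--device` are visible inside the container. The following is a minimal sketch; the function name `check_gpu_nodes` is hypothetical and is not shipped with the image.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the image): report whether each
# given device node is visible from inside the container.
check_gpu_nodes() {
  local node missing=0
  for node in "$@"; do
    if [ -e "$node" ]; then
      echo "ok: $node"
    else
      echo "missing: $node"
      missing=1
    fi
  done
  # Non-zero exit status if any node was not found.
  return "$missing"
}

# The two nodes exposed by the create commands in this README.
check_gpu_nodes /dev/kfd /dev/dri \
  || echo "GPU devices not visible; recreate the container with the --device flags"
```

A missing node usually points at the container creation flags rather than at vLLM itself, so this is worth checking before debugging anything further up the stack.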

### Serving a Model (Easiest Way)

The toolbox includes a TUI wizard called **`start-vllm`** that ships with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

---

## 4) Testing the API

Once the server is up, hit the OpenAI‑compatible endpoint:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
```

You should receive a JSON response with the reply in `choices[0].message.content`. The `model` field must match the model the server is currently serving.

To avoid typing the model name, query the server for the currently deployed model first (requires `jq`):

```bash
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
```

---

## 5) Use a Web UI for Chatting

If vLLM is on a remote server, expose port 8000 via SSH port forwarding. Replace `<user>@<remote-host>` with your own login for that server:

```bash
ssh -L 0.0.0.0:8000:localhost:8000 <user>@<remote-host>
```

Then start HuggingFace ChatUI on your host:

```bash
docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db
```

## 6) Host Configuration

This should work on any Strix Halo.
For a complete list of available hardware, see: [Strix Halo Hardware Database](https://strixhalo-homelab.d7.wtf/Hardware)

### 6.1 Test Configuration

| Component         | Specification                                               |
| :---------------- | :---------------------------------------------------------- |
| **Test Machine**  | Framework Desktop                                           |
| **CPU**           | Ryzen AI MAX+ 395 "Strix Halo"                              |
| **System Memory** | 128 GB RAM                                                  |
| **GPU Memory**    | 512 MB allocated in BIOS                                    |
| **Host OS**       | Fedora 42, Linux 6.18.0-0.rc6.vanilla.fc42.x86_64           |

### 6.2 Kernel Parameters (tested on Fedora 42)

Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):

`amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856`

| Parameter                   | Purpose                                                                                    |
|-----------------------------|--------------------------------------------------------------------------------------------|
| `amd_iommu=off`             | Disables IOMMU for lower latency                                                           |
| `amdgpu.gttsize=126976`     | Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB                            |
| `ttm.pages_limit=32505856`  | Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB                     |

Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

**Apply the changes:**

```bash
# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
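
Both size values in the table derive from a single number: the GiB you want the iGPU to be able to address. The sketch below reproduces that arithmetic, assuming 4 KiB pages as in the table. The function name `strix_kernel_params` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): derive the boot parameters
# from the GiB to expose to the iGPU, assuming 4 KiB memory pages.
strix_kernel_params() {
  local gib="$1"
  local mib=$(( gib * 1024 ))        # amdgpu.gttsize is given in MiB
  local pages=$(( mib * 1024 / 4 ))  # ttm.pages_limit is given in 4 KiB pages
  echo "amd_iommu=off amdgpu.gttsize=${mib} ttm.pages_limit=${pages}"
}

# 128 GiB system RAM minus 4 GiB reserved for the OS = 124 GiB
strix_kernel_params 124
```

For 124 GiB this prints the same three parameters listed above. After rebooting, `cat /proc/cmdline` shows the kernel command line that was actually used, which is a quick way to confirm the parameters took effect.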