A self-hosted voice assistant for my Home Assistant Voice (Preview Edition)
hardware. Speech-to-text and text-to-speech run as CPU-only Wyoming services
in a Docker Compose stack; the conversation LLM runs on the GPU via llama.cpp.
The whole pipeline — wake word → transcription → intent → speech — executes on
my own unRAID box, so nothing a microphone hears ever reaches a cloud.
This page is part of my public homelab write-up. It shares the why, the
shape, and the actual (sanitized) config; the only thing redacted is the
node’s LAN address, shown throughout as <unraid-ip>.
I wanted “Okay Nabu, turn off the office lights” to work without OpenAI,
Google, or Amazon in the loop. Home Assistant’s Assist pipeline makes that
possible if you can supply three things locally: speech-to-text (STT),
text-to-speech (TTS), and a conversation agent (an LLM that turns the
transcribed sentence into Home Assistant service calls).
This home-ai stack supplies the two speech services. The LLM is a separate
llama.cpp container already running on the same box. Together they form a
100%-local voice loop.
The whole design is shaped by one constraint: the GPU is an RTX 2070 SUPER
with only 8GB of VRAM, and the LLM already wants ~4–5GB of it.
openwakeword service to run server-side.Home Assistant orchestrates the pipeline; the three local services each do one
job. The home-ai Compose stack is the two CPU boxes (Whisper + Piper); the GPU
box (llama.cpp) is managed separately.
The conversation agent is served by the official llama.cpp server image with
NVIDIA support: ghcr.io/ggml-org/llama.cpp:server-cuda.
This is a hard prerequisite — the helper scripts below assume llama.cpp is
already running and reachable on port 8000.
It runs as its own container, roughly like this:
services:
llama-cpp:
image: ghcr.io/ggml-org/llama.cpp:server-cuda # CUDA build for the NVIDIA GPU
container_name: llama_cpp
command: >-
-m /models/model.gguf
--host 0.0.0.0 --port 8000
-ngl 99 # offload all layers to the GPU
-c 4096 # context size
ports:
- "8000:8000"
volumes:
- /mnt/user/appdata/llama_cpp/models:/models
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
On unRAID, the equivalent of the deploy.devices block in the container
template is the NVIDIA runtime: set Extra Parameters --runtime=nvidia and
add a variable NVIDIA_VISIBLE_DEVICES=all (the unRAID Nvidia Driver plugin
provides the GPU). The -ngl 99 flag is what puts the whole model on the GPU;
since the speech services are on CPU, all 8GB is available for it.
The model file is loaded as /models/model.gguf — a stable name the helper
script keeps pointed at whichever GGUF is current (see below).
The home-ai stack itself is small — two Wyoming services, both CPU-only:
name: home-ai
# Local voice pipeline services for Home Assistant Voice devices.
#
# This stack provides ONLY the Wyoming speech services (STT + TTS).
# The conversation LLM is handled separately by llama.cpp, which runs at
# http://<unraid-ip>:8000 and is added to Home Assistant as the conversation
# agent (see "Wiring it into Home Assistant").
#
# Both services run on CPU so they do not compete with llama.cpp for the
# RTX 2070 SUPER's 8GB of VRAM.
services:
# Speech-to-text (faster-whisper) — Wyoming protocol on tcp://<host>:10300
whisper:
image: rhasspy/wyoming-whisper:latest
container_name: home-ai-whisper
command: >-
--model ${WHISPER_MODEL:-base-int8}
--language ${WHISPER_LANGUAGE:-en}
--beam-size ${WHISPER_BEAM_SIZE:-1}
ports:
- "10300:10300"
volumes:
- ./data/whisper:/data
restart: unless-stopped
# Text-to-speech (Piper) — Wyoming protocol on tcp://<host>:10200
piper:
image: rhasspy/wyoming-piper:latest
container_name: home-ai-piper
command: >-
--voice ${PIPER_VOICE:-en_US-lessac-medium}
ports:
- "10200:10200"
volumes:
- ./data/piper:/data
restart: unless-stopped
All the tunable knobs live in an .env file (copied from env.example):
# Copy to .env and adjust as needed: cp env.example .env
# --- Speech-to-text (Whisper, CPU) ---------------------------------------
# faster-whisper model. Trade-off on CPU (latency vs. accuracy):
# tiny-int8 - fastest, lowest accuracy
# base-int8 - good balance for short voice commands (default)
# small-int8 - more accurate, noticeably slower on CPU
WHISPER_MODEL=base-int8
WHISPER_LANGUAGE=en
# beam-size 1 = fastest/greedy. Raise to 3-5 for slightly better accuracy.
WHISPER_BEAM_SIZE=1
# --- Text-to-speech (Piper, CPU) -----------------------------------------
# Browse voices: https://rhasspy.github.io/piper-samples/
# Common English options: en_US-lessac-medium, en_US-amy-medium, en_GB-alba-medium
PIPER_VOICE=en_US-lessac-medium
The template is named
env.example(no leading dot) on purpose — a hidden
.env.exampleis easy to lose in a file browser or skip when copying a
directory around on unRAID. The runtime file still has to be.envfor
Compose to read it.
Models persist under ./data/, so only the first start pays the download cost.
Three small scripts make day-to-day operation painless. They’re written for
unRAID’s User Scripts plugin, which can’t source a shared env file — so each
script keeps its settings in an editable block at the top instead.
pull-model.sh — fetch and activate a modelDownloads a GGUF from Hugging Face and saves it as model.gguf (the stable name
llama.cpp loads), so swapping models never means editing llama.cpp’s config. It
downloads under the model’s real filename (so an interrupted download resumes
correctly), verifies it’s a real GGUF rather than an HTML error page, then
renames it — and drops a model.gguf.source file recording exactly what’s
loaded.
#!/usr/bin/env bash
#
# pull-model.sh — download a llama.cpp GGUF model from HuggingFace and save it
# as model.gguf (the name llama.cpp loads).
#
# Usage:
# ./pull-model.sh # fetch the default (Qwen2.5-7B-Instruct Q4_K_M)
# ./pull-model.sh <hf_repo> <hf_file> # fetch a specific model
#
# Runs on the unRAID node. Downloads resume if interrupted, then the file is
# renamed to model.gguf.
set -euo pipefail
# ---- config (edit for your environment) -----------------------------------
# llama.cpp model storage on the unRAID node.
MODEL_DIR="/mnt/user/appdata/llama_cpp/models"
MODEL_FILE="model.gguf"
# Default model to fetch (HF repo + file). Override via the two CLI args.
# Primary pick for HA voice on 8GB: Qwen2.5-7B-Instruct Q4_K_M (~4.7GB).
# Low-latency alternative: bartowski/Qwen2.5-3B-Instruct-GGUF
# Qwen2.5-3B-Instruct-Q4_K_M.gguf
HF_REPO="bartowski/Qwen2.5-7B-Instruct-GGUF"
HF_FILE="Qwen2.5-7B-Instruct-Q4_K_M.gguf"
# Optional: container name of your llama.cpp instance. If set, it's restarted
# after a successful download so the new model loads. Leave "" to skip.
LLAMA_CONTAINER=""
# ---------------------------------------------------------------------------
HF_REPO="${1:-$HF_REPO}"
HF_FILE="${2:-$HF_FILE}"
URL="https://huggingface.co/${HF_REPO}/resolve/main/${HF_FILE}?download=true"
# Download under the descriptive name (so resume is keyed to the model), then
# rename to model.gguf once verified.
DEST="${MODEL_DIR}/${HF_FILE}"
FINAL="${MODEL_DIR}/${MODEL_FILE}"
echo ">> Model: ${HF_REPO}/${HF_FILE}"
echo ">> Dest: ${FINAL}"
mkdir -p "$MODEL_DIR"
# Resume-capable download. Prefer curl, fall back to wget.
if command -v curl >/dev/null 2>&1; then
curl -L --fail --retry 3 --continue-at - -o "$DEST" "$URL"
elif command -v wget >/dev/null 2>&1; then
wget --continue -O "$DEST" "$URL"
else
echo "!! Need curl or wget to download." >&2
exit 1
fi
# Sanity check: a real GGUF is hundreds of MB+, not an HTML error page.
size_bytes="$(stat -c %s "$DEST" 2>/dev/null || stat -f %z "$DEST")"
if [ "${size_bytes:-0}" -lt 104857600 ]; then
echo "!! Downloaded file is only ${size_bytes} bytes — likely an error page, not a model." >&2
echo " Check that ${HF_REPO}/${HF_FILE} exists on HuggingFace." >&2
exit 1
fi
# Rename to model.gguf (the name llama.cpp loads) so its config never changes.
mv -f "$DEST" "$FINAL"
echo ">> Saved as ${FINAL} (from ${HF_FILE})"
# Record provenance next to the model so the active model is identifiable
# at a glance (the descriptive filename is gone after the rename).
cat > "${FINAL}.source" <<EOF
repo: ${HF_REPO}
file: ${HF_FILE}
url: https://huggingface.co/${HF_REPO}/resolve/main/${HF_FILE}
downloaded: $(date -u '+%Y-%m-%dT%H:%M:%SZ')
EOF
echo ">> Wrote ${FINAL}.source"
if [ -n "${LLAMA_CONTAINER}" ]; then
echo ">> Restarting llama.cpp container '${LLAMA_CONTAINER}' to load the new model..."
docker restart "${LLAMA_CONTAINER}"
else
echo ">> Done. Restart your llama.cpp instance to load the new model"
echo " (set LLAMA_CONTAINER at the top of this script to automate this)."
fi
reload.sh — (re)deploy the Compose stackPulls the latest images and recreates the containers, or --restart for a quick
bounce. Seeds .env from env.example on first run so the defaults resolve.
#!/usr/bin/env bash
#
# reload.sh — (re)deploy the home-ai compose stack.
#
# Usage:
# ./reload.sh # pull latest images + recreate containers
# ./reload.sh --restart # quick restart only, no image pull
#
# Runs on the unRAID node.
set -euo pipefail
# ---- config (edit for your environment) -----------------------------------
# Directory holding the home-ai docker-compose.yml. Defaults to the copy in
# this repo; point it at wherever the stack actually lives on the node (e.g.
# unRAID's Compose Manager project directory).
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
COMPOSE_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)/docker-compose/home-ai"
# ---------------------------------------------------------------------------
if [ ! -f "$COMPOSE_DIR/docker-compose.yml" ]; then
echo "!! No docker-compose.yml found in $COMPOSE_DIR" >&2
exit 1
fi
cd "$COMPOSE_DIR"
# Ensure a .env exists so model/voice defaults resolve.
if [ ! -f .env ] && [ -f env.example ]; then
echo ">> No .env found; seeding from env.example"
cp env.example .env
fi
if [ "${1:-}" = "--restart" ]; then
echo ">> Restarting home-ai containers..."
docker compose restart
else
echo ">> Pulling latest images..."
docker compose pull
echo ">> Recreating containers..."
docker compose up -d --remove-orphans
fi
echo ">> Current status:"
docker compose ps
health-check.sh — confirm the whole pipeline is upRaw-TCP checks the two Wyoming ports and queries llama.cpp’s OpenAI-compatible
API, printing the model it currently has loaded. Exits non-zero on any failure,
so it’s cron- / User-Scripts-friendly for monitoring. Set HOST to your unRAID
node’s address.
#!/usr/bin/env bash
#
# health-check.sh — verify the home-ai voice pipeline is reachable.
#
# Checks the Wyoming STT/TTS ports (raw TCP) and the llama.cpp OpenAI API,
# reporting the model llama.cpp currently has loaded. Exits non-zero if any
# check fails (handy for cron / unRAID User Scripts monitoring).
set -uo pipefail
# ---- config (edit for your environment) -----------------------------------
HOST="<unraid-ip>"
WHISPER_PORT=10300
PIPER_PORT=10200
LLAMA_PORT=8000
# ---------------------------------------------------------------------------
rc=0
check_tcp() {
local name="$1" host="$2" port="$3"
if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
printf 'OK %-22s %s:%s\n' "$name" "$host" "$port"
else
printf 'FAIL %-22s %s:%s\n' "$name" "$host" "$port"
rc=1
fi
}
check_tcp "whisper (STT)" "$HOST" "$WHISPER_PORT"
check_tcp "piper (TTS)" "$HOST" "$PIPER_PORT"
# llama.cpp: query the OpenAI-compatible model list and surface the model id.
llama_url="http://${HOST}:${LLAMA_PORT}/v1/models"
if resp="$(curl -fsS --max-time 5 "$llama_url" 2>/dev/null)"; then
model="$(printf '%s' "$resp" | grep -o '"id":"[^"]*"' | head -1 | cut -d'"' -f4)"
printf 'OK %-22s %s:%s [model: %s]\n' "llama.cpp (LLM)" "$HOST" "$LLAMA_PORT" "${model:-unknown}"
else
printf 'FAIL %-22s %s:%s\n' "llama.cpp (LLM)" "$HOST" "$LLAMA_PORT"
rc=1
fi
exit "$rc"
Do these in order; each has a checkpoint so you know it’s good before moving on.
Run health-check.sh on the box. You want three OK lines (Whisper 10300,
Piper 10200, llama.cpp 8000 with a model id). If a Wyoming service fails, it’s
usually still downloading its model — check docker compose logs whisper.
Settings → Devices & Services → + Add Integration → “Wyoming Protocol”. Add
it twice:
<unraid-ip>, Port 10300 → Whisper (STT)<unraid-ip>, Port 10200 → Piper (TTS)Checkpoint: a faster-whisper and a piper entry appear. “Failed to
connect” means HA can’t reach the port — check the network/firewall.
This is the part that isn’t in HA by default. The conversation agent is provided
by the home-llm custom integration
(“Local LLM Conversation”) by acon96, installed through HACS:
https://github.com/acon96/home-llm (category: Integration).<unraid-ip>, port 8000. AnyCheckpoint: the integration adds without an auth error and a “Local LLM”
conversation agent is available.
Why home-llm and not the OpenAI Conversation integration? home-llm is
purpose-built for local models and smart-home control — it handles the
tool/function-calling and prompt scaffolding that make a local model reliably
emit Home Assistant service calls. The generic OpenAI-compatible backend lets
it talk to the standalone llama.cppserver-cudacontainer directly.
Settings → Voice assistants → Add assistant:
| Field | Set to |
|---|---|
| Conversation agent | the Local LLM agent from step 2 |
| Speech-to-text | faster-whisper |
| Text-to-speech | piper (pick your voice) |
| Wake word | leave blank — Voice PE does it |
Settings → Devices & Services → ESPHome → your Voice PE device → set its
Assistant to the pipeline from step 3.
base-int8 on CPU — drop to tiny-int8 in .envreload.sh --restart.docker compose ps shows it running and HA is on the samecat /mnt/user/appdata/llama_cpp/models/model.gguf.source. Use anHome Assistant Assist leans hard on tool/function calling to control
entities, so an instruct model tuned for that beats a reasoning/math fine-tune.
Switching is one command — pull-model.sh downloads it, saves it as
model.gguf, and records the provenance:
./pull-model.sh # Qwen2.5-7B-Instruct Q4_K_M (default)
# or, for the 3B:
./pull-model.sh bartowski/Qwen2.5-3B-Instruct-GGUF Qwen2.5-3B-Instruct-Q4_K_M.gguf
A voice assistant that turns lights on, answers questions, and runs my home
without a single word leaving the house — STT, the LLM, and TTS all on hardware
I own. The CPU/GPU split keeps an 8GB card comfortably serving a 7B model and two
speech services at once, and the helper scripts make swapping models or
recovering the stack a one-liner.