How to run llama.cpp locally for Home Assistant on a Raspberry Pi 5
Large Language Models for Home Assistant.
To run llama.cpp locally for Home Assistant, you host a llama.cpp server that exposes an OpenAI-compatible API, and Home Assistant communicates with that API.
Home Assistant does not have a "llama.cpp" brand integration by default, so you connect Home Assistant to the server using a compatible integration, such as https://github.com/skye-harris/hass_local_openai_llm.
Setup
Device: Raspberry Pi 5 (8GB)
OS: Debian 12
Runtime: Docker
Inference Engine: llama-server
Model: Gemma 4 E2B (GGUF, quantized)
Home Assistant Integration: Local OpenAI LLM
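Before pulling any images, create the directory that will hold the model file. The path below matches the volume mount used later in this guide; the download line is only a placeholder, since the exact source of the GGUF is not covered here.

```bash
# Directory that is bind-mounted into the container as /models later on
mkdir -p /datadocker/llama-cpp/models
cd /datadocker/llama-cpp/models

# Placeholder only -- fetch the quantized GGUF from wherever you host it,
# e.g. a Hugging Face repo (replace <org>/<repo> with the real location):
# wget https://huggingface.co/<org>/<repo>/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf
```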
In the ghcr.io/ggml-org/llama.cpp repository, the images are split by purpose:
| Tag | Primary Contents | Best Use Case |
|---|---|---|
| :light | llama-cli, llama-completion | Testing/CLI: best for running models in the terminal or one-off completions without overhead. |
| :server | llama-server | Production/API: ideal for your Home Assistant setup. It provides the OpenAI-compatible endpoint. |
| :full | CLI, server, and Python conversion/quantization tools | Development: use this if you need to convert .safetensors to .gguf or quantize a model yourself. |
:light: Contains only llama-cli and llama-completion. It does not include the API server.
:server: Contains only llama-server, a lightweight, OpenAI-API-compatible HTTP server for serving LLMs. It does not include llama-cli.
:full: Contains everything.
You should use the :server tag (or better yet, the :server-arm64 tag since you are on a Raspberry Pi 5).
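For reference, you can pull the image ahead of time (swap the tag if you use an architecture-specific variant):

```bash
docker pull ghcr.io/ggml-org/llama.cpp:server
```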
Run llama-server
The llama-server executable exposes an OpenAI-compatible API that Home Assistant can use.
```bash
docker run -it --rm \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja
```
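The -it --rm flags above are convenient for a first interactive run. For an always-on setup, one option is to run the same container detached with a restart policy; the sketch below reuses exactly the same paths and flags.

```bash
docker run -d --restart unless-stopped \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja
```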
Parameter Description
--entrypoint /app/llama-cli: runs llama-cli, which processes your input (or waits for one) and then exits. It does not listen for network requests on a port.
--entrypoint /app/llama-server: runs llama-server, which is what handles API calls (e.g. from curl or Home Assistant). llama-cli is only for one-off prompts in the terminal.
--host 0.0.0.0: inside a Docker container, the server must listen on 0.0.0.0 to accept connections from outside the container (your Raspberry Pi's IP or localhost).
--port 8080: tells the software inside the container to listen on port 8080 (which is mapped to 8091 on the host).
--jinja: enables the model's Jinja chat template, adding support for OpenAI-style function/tool calling. Tool calling must be enabled in the inference engine (see the example after this list).
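Because --jinja is enabled, the server will also accept OpenAI-style tool definitions once it is running. This is only a sketch: the light_turn_on function and its schema are made up for illustration, and whether the model actually emits a tool call depends on the model and its chat template.

```bash
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Turn on the kitchen light"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "light_turn_on",
        "description": "Turn on a light by entity id",
        "parameters": {
          "type": "object",
          "properties": {"entity_id": {"type": "string"}},
          "required": ["entity_id"]
        }
      }
    }]
  }'
```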
Output

```text
...
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
```
Now llama.cpp is serving the local LLM as an HTTP API server.
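As a quick liveness check before the full test, llama-server exposes a /health endpoint (reachable on the host via port 8091, as mapped above); it should report OK once the model has finished loading.

```bash
curl http://localhost:8091/health
```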
Test
Once the server logs show that it is listening, run the curl command below. Make sure to include a JSON body, otherwise the server might reject the request:
```bash
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Gemma!"}]
  }'
```

Connect to Home Assistant
Integration - Add Integration
Custom Integration - Local OpenAI LLM Integration
https://github.com/skye-harris/hass_local_openai_llm
Wyoming-LLM (The Bridge): A Home Assistant Integration that sits between HA and llama.cpp.
Custom Integration - Configure Integration
Add the server URL in the initial server configuration:
http://192.168.2.125:8091
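If the integration cannot connect, it can help to verify from the Home Assistant host that the server is reachable at that address. The IP and port below are the ones used earlier in this guide; /v1/models is part of the OpenAI-compatible API that llama-server exposes.

```bash
curl http://192.168.2.125:8091/v1/models
```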
Voice assistant - Create conversation agent
Add assistant