How to run llama.cpp locally for Home Assistant on a Raspberry Pi 5
Large Language Models for Home Assistant.
To run llama.cpp locally for Home Assistant, you host a llama.cpp server that exposes an OpenAI-compatible API, and Home Assistant communicates with that API.
Home Assistant does not have a "llama.cpp" brand integration by default, so you connect Home Assistant to the server using a compatible integration, such as https://github.com/skye-harris/hass_local_openai_llm.
Setup
Device: Raspberry Pi 5 (8GB)
OS: Debian 12
Runtime: Docker
Inference Engine: llama-server
Model: Gemma 4 E2B (GGUF, quantized)
Home Assistant Integration: Local OpenAI LLM
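Before pulling any images, create the directory that will hold the model file. The path below matches the volume mount used later in this guide; the download line is only a placeholder, since the exact source of the GGUF is not covered here.

```bash
# Directory that is bind-mounted into the container as /models later on
mkdir -p /datadocker/llama-cpp/models
cd /datadocker/llama-cpp/models

# Placeholder only -- fetch the quantized GGUF from wherever you host it,
# e.g. a Hugging Face repo (replace <org>/<repo> with the real location):
# wget https://huggingface.co/<org>/<repo>/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf
```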
In the ghcr.io/ggml-org/llama.cpp repository, the images are split by purpose:
| Tag | Primary Contents | Best Use Case |
|---|---|---|
| :light | llama-cli, llama-completion | Testing/CLI: best for running models in the terminal or one-off completions without overhead. |
| :server | llama-server | Production/API: ideal for your Home Assistant setup. It provides the OpenAI-compatible endpoint. |
| :full | CLI, server, and Python conversion/quantization tools | Development: use this if you need to convert .safetensors to .gguf or quantize a model yourself. |
:light: Contains only llama-cli and llama-completion. It does not include the API server.
:server: Contains only llama-server, a lightweight, OpenAI-API-compatible HTTP server for serving LLMs. It does not include llama-cli.
:full: Contains everything.
You should use the :server tag (or better yet, the :server-arm64 tag since you are on a Raspberry Pi 5).
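For reference, you can pull the image ahead of time (swap the tag if you use an architecture-specific variant):

```bash
docker pull ghcr.io/ggml-org/llama.cpp:server
```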
Run llama-server
The llama-server executable exposes an OpenAI-compatible API that Home Assistant can use.
```bash
docker run -it --rm \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja
```
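The -it --rm flags above are convenient for a first interactive run. For an always-on setup, one option is to run the same container detached with a restart policy; the sketch below reuses exactly the same paths and flags.

```bash
docker run -d --restart unless-stopped \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja
```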
Parameter Description
--entrypoint /app/llama-cli: runs llama-cli, which processes your input (or waits for one) and then exits. It does not listen for network requests on a port.
--entrypoint /app/llama-server: runs llama-server, which is what handles API calls (e.g. from curl or Home Assistant). llama-cli is only for one-off prompts in the terminal.
--host 0.0.0.0: inside a Docker container, the server must listen on 0.0.0.0 to accept connections from outside the container (your Raspberry Pi's IP or localhost).
--port 8080: tells the software inside the container to listen on port 8080 (which is mapped to 8091 on the host).
--jinja: enables the model's Jinja chat template, adding support for OpenAI-style function/tool calling. Tool calling must be enabled in the inference engine (see the example after this list).
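Because --jinja is enabled, the server will also accept OpenAI-style tool definitions once it is running. This is only a sketch: the light_turn_on function and its schema are made up for illustration, and whether the model actually emits a tool call depends on the model and its chat template.

```bash
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Turn on the kitchen light"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "light_turn_on",
        "description": "Turn on a light by entity id",
        "parameters": {
          "type": "object",
          "properties": {"entity_id": {"type": "string"}},
          "required": ["entity_id"]
        }
      }
    }]
  }'
```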
Output

```text
...
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
```
Now llama.cpp is serving the local LLM as an HTTP API server.
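As a quick liveness check before the full test, llama-server exposes a /health endpoint (reachable on the host via port 8091, as mapped above); it should report OK once the model has finished loading.

```bash
curl http://localhost:8091/health
```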
Test
Once the server logs show that it is listening, run the curl command below. Make sure to include a JSON body, otherwise the server might reject the request:
```bash
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Gemma!"}]
  }'
```

Connect to Home Assistant
Integration - Add Integration
Custom Integration - Local OpenAI LLM Integration
https://github.com/skye-harris/hass_local_openai_llm
Wyoming-LLM (The Bridge): A Home Assistant Integration that sits between HA and llama.cpp.
Custom Integration - Configure Integration
Add the server URL in the initial server configuration:
http://192.168.2.125:8091
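If the integration cannot connect, it can help to verify from the Home Assistant host that the server is reachable at that address. The IP and port below are the ones used earlier in this guide; /v1/models is part of the OpenAI-compatible API that llama-server exposes.

```bash
curl http://192.168.2.125:8091/v1/models
```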
Voice assistant - Create conversation agent
Add assistant