Large Language Models for Home Assistant.

To run llama.cpp locally for Home Assistant, you must host a llama.cpp server that exposes an API Home Assistant can talk to.

Home Assistant does not ship with a built-in "llama.cpp" integration.

Instead, you connect Home Assistant to the server through a compatible custom integration, such as https://github.com/skye-harris/hass_local_openai_llm.

 

Device: Raspberry Pi 5 (8GB)

OS: Debian 12

Runtime: Docker

Inference Engine: llama-server

Model: Gemma 4 E2B (GGUF, quantized)

Home Assistant Integration: Local OpenAI LLM

 

 

 

In the ghcr.io/ggml-org/llama.cpp repository, the images are split by purpose:

Tag      | Primary Contents                                       | Best Use Case
:light   | llama-cli, llama-completion                            | Testing/CLI: Best for running models in the terminal or one-off completions without overhead.
:server  | llama-server                                           | Production/API: Ideal for your Home Assistant setup. It provides the OpenAI-compatible endpoint.
:full    | CLI, server, and Python conversion/quantization tools  | Development: Use this if you need to convert .safetensors to .gguf or quantize a model yourself.

:light: Contains only llama-cli and llama-completion. It does not include the API server.

:server: Contains only llama-server, a lightweight, OpenAI-API-compatible HTTP server for serving LLMs. It does not include llama-cli.

:full: Contains everything.

You should use the :server tag (or better yet, the :server-arm64 tag since you are on a Raspberry Pi 5).
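
You can pull the image ahead of time to confirm the tag resolves on your architecture:

docker pull ghcr.io/ggml-org/llama.cpp:server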

Run llama-server

The llama-server executable exposes an OpenAI-compatible API that Home Assistant can use.

docker run -it --rm \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja

Parameter Description

--entrypoint /app/llama-cli: runs llama-cli, which processes your input (or waits for one) and then exits. It does not listen for network requests on a port.

--entrypoint /app/llama-server: runs llama-server, which is what handles API calls such as the curl test below. llama-cli is only suited to one-off prompts in the terminal; the :server image already uses llama-server as its entrypoint, so no override is needed here.

 

--host 0.0.0.0: Inside a Docker container, the server must listen on 0.0.0.0 to accept connections from your Raspberry Pi's IP or localhost.

 

--port 8080: This tells the software inside the container to listen on port 8080 (which you mapped to 8091 on your host).

--jinja: enables Jinja chat templates, which are required for OpenAI-style function calling. Tool calling must be enabled in the inference engine for Home Assistant's tool use to work.

 

Output

 

...
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv  update_slots: all slots are idle
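
Once the log shows the server is listening, a quick sanity check from the host is to hit llama-server's health endpoint through the mapped port (8091 on the host; the exact response body may vary by version):

curl http://localhost:8091/health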

llama.cpp is now running the local LLM behind an HTTP API server.
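
Once the interactive run looks good, you will probably want the container to survive logouts and reboots. One way is the same command, just detached with a restart policy (drop -it --rm so Docker can restart it):

docker run -d \
  --restart unless-stopped \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja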

 

Test

Once the server log shows it is listening (main: server is listening on http://0.0.0.0:8080), run a curl command against the chat completions endpoint. Make sure to include a JSON body, otherwise the server may reject the request:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Gemma!"}]
  }'
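
Because the server was started with --jinja, you can also sanity-check OpenAI-style tool calling. The get_current_time function below is just a made-up example, not something the Home Assistant integration defines; if the model and its chat template support tools, the response should contain a tool_calls entry instead of plain text:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What time is it?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_current_time",
        "description": "Return the current time",
        "parameters": {"type": "object", "properties": {}}
      }
    }]
  }'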

 

Connect to Home Assistant

Integration - Add Integration

Custom Integration - Local OpenAI LLM Integration

https://github.com/skye-harris/hass_local_openai_llm

Wyoming-LLM (the bridge): a Home Assistant integration that sits between HA and llama.cpp.
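
The Local OpenAI LLM integration is a custom integration, so it will not appear in the Add Integration list until it has been installed, typically via HACS or by copying it into your config directory. A rough manual-install sketch follows; the folder layout is an assumption on my part, so check the repository's README for the actual steps:

git clone https://github.com/skye-harris/hass_local_openai_llm
# Assumption: the repo ships its integration under custom_components/
cp -r hass_local_openai_llm/custom_components/* /config/custom_components/

Restart Home Assistant afterwards so the new integration is picked up.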

 

Custom Integration - Configure Integration

Add the server URL to the initial server configuration:

http://192.168.2.125:8091
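
Before saving, it can help to confirm this URL is reachable from the machine that runs Home Assistant (not just from the Pi itself). The /v1/models route is part of llama-server's OpenAI-compatible API and should list the loaded model:

curl http://192.168.2.125:8091/v1/models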

 

Voice assistants - Create a conversation agent

Add assistant

 

 
