How to run llama.cpp locally for Home Assistant on a Raspberry Pi 5
To run llama.cpp locally for Home Assistant, you host a llama.cpp server that exposes an HTTP API.
Home Assistant does not ship with a dedicated "llama.cpp" integration by default, so you
connect Home Assistant to the server through a compatible integration, such as https://github.com/skye-harris/hass_local_openai_llm.
In the ghcr.io/ggml-org/llama.cpp repository, the images are split by purpose:
| Tag | Primary Contents | Best Use Case |
|---|---|---|
| `:light` | `llama-cli`, `llama-completion` | Testing/CLI: best for running models in the terminal or one-off completions without overhead. |
| `:server` | `llama-server` | Production/API: ideal for the Home Assistant setup; it provides the OpenAI-compatible endpoint. |
| `:full` | CLI, server, and Python conversion/quantization tools | Development: use this if you need to convert `.safetensors` to `.gguf` or quantize a model yourself. |
- `:light`: contains only `llama-cli` and `llama-completion`. It does not include the API server.
- `:server`: contains only `llama-server`. It includes the API server but not `llama-cli`.
- `:full`: contains everything.
You should use the `:server` tag (or better, the `:server-arm64` tag, since you are on a Raspberry Pi 5).
```shell
docker run -it --rm \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja
```
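If you prefer Docker Compose so the server restarts with the Pi, the same container can be declared as below. This is a sketch mirroring the `docker run` command above; adjust the volume path, host port, and model filename to your own setup.

```yaml
# docker-compose.yml — equivalent of the docker run command above (assumed paths/ports)
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server
    container_name: llama
    ports:
      - "8091:8080"          # host 8091 -> container 8080
    volumes:
      - /datadocker/llama-cpp/models:/models
    command: >
      -m /models/google_gemma-4-E2B-it-Q4_0.gguf
      --host 0.0.0.0
      --port 8080
      --threads 4
      --jinja
    restart: unless-stopped
```

Start it with `docker compose up -d` from the directory containing the file.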
Note
- `--entrypoint /app/llama-cli`: runs the CLI binary, which processes a single prompt (or waits for one) and then exits. It does not listen for network requests on a port.
- `--entrypoint /app/llama-server`: runs the API server. If you were using `llama-cli`, which is for one-off prompts in the terminal, switch to `llama-server`; it is required to handle API calls such as `curl` requests. (The `:server` image already uses `llama-server` as its default entrypoint, so no override is needed.)
- `--host 0.0.0.0`: inside a Docker container, the server must listen on `0.0.0.0` to accept connections from the Raspberry Pi's IP or localhost.
- `--port 8080`: tells the server inside the container to listen on port 8080 (which is mapped to 8091 on the host).
Test
Once the server logs show "HTTP server listening", run your curl command again. Make sure to include a JSON body, otherwise the server might reject the request:
```shell
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Gemma!"}]
  }'
```
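Once the `curl` test works, you can drive the same OpenAI-compatible endpoint from a script. Below is a minimal Python sketch using only the standard library; the base URL matches the port mapping above, and the `chat` helper is a hypothetical name, not part of llama.cpp or Home Assistant.

```python
import json
import urllib.request

# Host port mapped in the docker run command above (assumption: same machine)
BASE_URL = "http://localhost:8091"

def build_chat_request(messages):
    """Build the JSON body for the /v1/chat/completions endpoint."""
    return {"messages": messages}

def chat(messages, base_url=BASE_URL):
    """POST a chat completion request to llama-server and return the reply text."""
    body = json.dumps(build_chat_request(messages)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # OpenAI-compatible responses put the text under choices[0].message.content
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Hello Gemma!"}]))
```

This is the same request the Home Assistant integration will make on your behalf, so it is a quick way to confirm the server side works before touching Home Assistant.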
Integration
Local OpenAI LLM
https://github.com/skye-harris/hass_local_openai_llm
Configure the integration
Set the base URL to the llama.cpp server, e.g. http://192.168.2.125:8091
Add conversation agent
In Home Assistant, open Settings > Voice assistants, choose Add assistant, and select the Local OpenAI LLM integration as the conversation agent.