To run llama.cpp locally for Home Assistant, you must host a llama.cpp server that exposes an API.

Home Assistant does not ship with a "llama.cpp" integration by default.

You connect Home Assistant to the server through a compatible custom integration, such as https://github.com/skye-harris/hass_local_openai_llm.


In the ghcr.io/ggml-org/llama.cpp repository, the images are split by purpose:

Tag       Primary contents                                   Best use case
:light    llama-cli, llama-completion (no API server)        Testing/CLI: running models in the terminal or one-off completions without overhead.
:server   llama-server only (no llama-cli)                   Production/API: ideal for the Home Assistant setup; provides the OpenAI-compatible endpoint.
:full     CLI, server, and Python conversion/quantization    Development: use this if you need to convert .safetensors to .gguf or quantize a model yourself.

You should use the :server tag (or better yet, the :server-arm64 tag, since you are running on a Raspberry Pi 5).
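To confirm which binaries a given image actually ships, you can override its entrypoint and list the /app directory (a quick sanity check; /app is the path used by the image entrypoints discussed below):

```shell
# List the executables bundled in the :server image (assumes Docker is installed
# and the image's base provides /bin/ls).
docker run --rm --entrypoint /bin/ls ghcr.io/ggml-org/llama.cpp:server /app
```

You should see llama-server in the output; repeating the command with the :light or :full tag shows the other tools.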

 

docker run -it --rm \
  --name llama \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 4 \
  --jinja

Note

--entrypoint /app/llama-cli: processes your input (or waits for one) and then exits. It does not listen for network requests on a port — llama-cli is for one-off prompts in the terminal.

--entrypoint /app/llama-server: the entrypoint you need here. llama-server is required to handle API calls such as the curl request below.
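For comparison, a one-off terminal completion would look like this — a sketch using the :light image, whose default entrypoint is llama-cli, with the same model path as the docker run command above:

```shell
# One-off completion: prints the response to the terminal, then exits.
# No port mapping needed because nothing listens for network requests.
docker run -it --rm \
  -v /datadocker/llama-cpp/models:/models \
  ghcr.io/ggml-org/llama.cpp:light \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  -p "Hello Gemma!" \
  -n 64
```

Here -p supplies the prompt and -n caps the number of tokens generated.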

 

--host 0.0.0.0: Inside a Docker container, the server must listen on 0.0.0.0 to accept connections from your Raspberry Pi's IP or localhost.

 

--port 8080: This tells the software inside the container to listen on port 8080 (which you mapped to 8091 on your host).

 

Test

Once the server logs show "HTTP server listening", run your curl command again. Make sure to include a JSON body; otherwise, the server may reject the request:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Gemma!"}]
  }'
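The same request can be issued from Python using only the standard library — a minimal sketch, assuming the server is reachable at http://localhost:8091 as in the curl example:

```python
import json
import urllib.request

# OpenAI-compatible chat payload, identical to the curl body above.
payload = {
    "messages": [{"role": "user", "content": "Hello Gemma!"}]
}

def chat(url="http://localhost:8091/v1/chat/completions"):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style responses place the reply at choices[0].message.content.
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat())
```

This is the same request shape the Home Assistant integration sends once configured.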

 

Integration

Local OpenAI LLM

https://github.com/skye-harris/hass_local_openai_llm

 

Configure the integration with the llama.cpp server URL:

http://192.168.2.125:8091

Add a conversation agent

 

Voice assistant

Add an assistant

 

 
