How to Run AI Models Locally with llama.cpp on a Raspberry Pi 5
Running Gemma 4 within a Dockerized llama.cpp environment, with an eye toward Home Assistant.
Use an arm64 build of the image (the light or server variants), which is built for the Pi's CPU.
Prerequisites
Hardware: Raspberry Pi 5 (8GB RAM highly recommended).
OS: Raspberry Pi OS (64-bit) or Ubuntu (64-bit).
Storage: At least 5GB free space (preferably on an SSD/NVMe for speed).
Setup used in this guide:
Device: Raspberry Pi 5 (8GB)
OS: Debian 12
Runtime: Docker
Engine: llama.cpp
Model: Gemma 4 E2B (GGUF, quantized)
Install Docker (Debian 12)
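A minimal sketch using Docker's official convenience script; the apt-repository route from the Docker docs works on Debian 12 as well:

```shell
# Install Docker via the official convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Let the current user run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER

# Sanity check
docker --version
```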
Pull llama.cpp (light)
# This is the correct lightweight image for Pi (ARM)
docker pull ghcr.io/ggml-org/llama.cpp:light
Pick a model - Download Gemma (GGUF, quantized)
# llama.cpp only works with GGUF
# Create model directory
mkdir -p /datadocker/llama-cpp/models
cd /datadocker/llama-cpp/models
https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/tree/main
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main
google_gemma-4-E2B-it-Q4_0.gguf
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf?download=true
Note
There is no official “Gemma 4 E2B GGUF direct URL” from Google.
GGUF files are community-converted and hosted on Hugging Face.
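The download can be scripted; this sketch assumes the Q4_0 file from the bartowski link above and the model directory created earlier:

```shell
cd /datadocker/llama-cpp/models

# Large download; -c lets wget resume if the connection drops
wget -c "https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf?download=true" \
  -O google_gemma-4-E2B-it-Q4_0.gguf
```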
Run llama.cpp server (Docker)
# Generic invocation from the llama.cpp Docker docs (verified to run on the Pi):
docker run -v /path/to/models:/models --entrypoint /app/llama-cli ghcr.io/ggml-org/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf
# Interactive chat with the downloaded Gemma model
# (llama-cli runs in the terminal, so no port mapping is needed)
docker run -it --rm \
-v /datadocker/llama-cpp/models:/models \
--entrypoint /app/llama-cli \
ghcr.io/ggml-org/llama.cpp:light \
-m /models/google_gemma-4-E2B-it-Q4_0.gguf
Output:
load_backend: loaded CPU backend from /app/libggml-cpu-armv8.2_2.so
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8833-45cac7ca7
model : google_gemma-4-E2B-it-Q4_0.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read <file> add a text file
/glob <pattern> add text files using globbing pattern
>
Test
> how
[Start thinking]
Thinking Process:
1. **Analyze the Request:** The user provided a single word: "how". This is an extremely open-ended prompt. It implies a question, a request for a process, or a search for a method.
2. **Determine the Context:** Since there is no context, I must provide a response that addresses the ambiguity while offering ways to get a useful answer.
3. **Identify the Goal:** The goal is to prompt the user to specify what they want to know "how" to do.
4. **Brainstorm Potential Interpretations of "how":**
* *How does X work?* (Mechanism, process)
* *How do I do Y?* (Instructions, tutorial)
* *How to achieve Z?* (Strategy, steps)
* *How are we connected?* (Relationship, system)
5. **Formulate the Response Strategy:**
* Acknowledge the brevity.
* State that more information is needed.
* Provide examples of the types of questions I can answer.
* Invite the user to ask the specific question.
6. **Draft the Response (Self-Correction/Refinement):** Start with a friendly, open invitation. Ensure the tone is helpful and encouraging. (The resulting response should be a clear call to action.)
[End thinking]
Please tell me what you would like to know **how** to do! 😊
I can help you with instructions, explanations, processes, recipes, coding, concepts, and much more.
**For example, you could ask:**
* "How do I bake a cake?"
* "How does photosynthesis work?"
* "How do I change the font in Microsoft Word?"
* "How do I start learning Spanish?"
**Just tell me your question!**
[ Prompt: 8.6 t/s | Generation: 5.6 t/s ]
>
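llama-cli is fine for testing at the terminal, but a client like Home Assistant needs an HTTP API. A sketch of that setup, assuming the server image variant of the same project (its entrypoint is llama-server, which listens on port 8080 inside the container and exposes an OpenAI-compatible API); `<pi-address>` is a placeholder for your Pi's IP:

```shell
# Run llama-server in the background, published on port 8091
docker run -d --rm \
-v /datadocker/llama-cpp/models:/models \
-p 8091:8080 \
ghcr.io/ggml-org/llama.cpp:server \
-m /models/google_gemma-4-E2B-it-Q4_0.gguf \
--host 0.0.0.0

# Query the OpenAI-compatible endpoint from any machine on the LAN
curl http://<pi-address>:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"how"}]}'
```

Home Assistant's OpenAI-compatible integrations can then be pointed at the same base URL.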
Useful links
llama.cpp Docker documentation on GitHub
https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md
Models
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/blob/main/google_gemma-4-E2B-it-Q4_0.gguf
https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF