How to Run AI Models Locally with llama.cpp on a Raspberry Pi 5
Most people access generative AI tools like ChatGPT or Gemini through a web interface or API — but what if you could run them locally?
In this article, you’ll learn how to set up your own local generative AI using the llama.cpp inference engine and an existing open model.
The final result will look like the GIF shown below (note: it is served from localhost).
Prerequisites
Hardware: Raspberry Pi 5 (8GB RAM highly recommended).
OS: Raspberry Pi OS (64-bit) or Ubuntu (64-bit).
Storage: At least 5GB free space (preferably on an SSD/NVMe for speed).
The setup used in this article:
Device: Raspberry Pi 5 (8GB)
OS: Debian 12
Runtime: Docker
Engine: llama.cpp
Model: Gemma 4 E2B (GGUF, quantized)
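Before installing anything, it’s worth confirming the OS is 64-bit and that you have the RAM and disk headroom listed above. A quick check using standard Linux commands (nothing llama.cpp-specific):
uname -m   # should print aarch64 on a 64-bit OS
free -h    # total RAM; 8GB is highly recommended
df -h .    # free disk space; the quantized model needs several GB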
Install Docker (Debian 12)
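A minimal way to install Docker on Debian 12 is the official convenience script (one of several supported install methods):
# Install Docker via the convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Optional: allow running docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER
# Verify the installation
docker --version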
Here are the commands I used to get Gemma 4 E2B running on a Raspberry Pi 5 (8GB):
Step 1. Pull llama.cpp (light)
First of all, we need an LLM Serving Engine, such as llama.cpp.
# This is the correct lightweight image for Pi (ARM)
docker pull ghcr.io/ggml-org/llama.cpp:light
Step 2. Pick a model - download Gemma (GGUF, quantized)
# llama.cpp only works with GGUF
# Create model directory
mkdir -p /datadocker/llama-cpp/models
cd /datadocker/llama-cpp/models
Community GGUF builds are available here:
https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/tree/main
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main
I used google_gemma-4-E2B-it-Q4_0.gguf, which has this direct download URL:
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf?download=true
Note
There is no official “Gemma 4 E2B GGUF direct URL” from Google.
GGUF files are community-converted and hosted on Hugging Face.
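For example, the Q4_0 file can be fetched with wget using the direct URL above (swap in a different quantization if you prefer):
cd /datadocker/llama-cpp/models
# Download the Q4_0 quantization (a few GB)
wget -O google_gemma-4-E2B-it-Q4_0.gguf \
  "https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf?download=true"
# Confirm the file arrived
ls -lh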
Step 3. Docker run and load the model
Run llama.cpp in Docker. The generic form from the llama.cpp Docker docs looks like this:
docker run -v /path/to/models:/models --entrypoint /app/llama-cli ghcr.io/ggml-org/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf
The exact command I ran:
docker run -it --rm \
-v /datadocker/llama-cpp/models:/models \
--entrypoint /app/llama-cli \
-p 8091:8080 \
ghcr.io/ggml-org/llama.cpp:light \
-m /models/google_gemma-4-E2B-it-Q4_0.gguf
Note
1. Pick the model file you downloaded earlier; the -m path must match its filename exactly.
2. The -p 8091:8080 mapping is not used by llama-cli; see the server sketch below.
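llama-cli is an interactive terminal chat, so it never listens on the mapped port. If you want the localhost web UI from the GIF at the top instead, the llama.cpp Docker docs (linked at the end) also publish a server image; a sketch, assuming the :server tag and the same model path:
docker run -it --rm \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 --port 8080
The built-in web UI is then available at http://localhost:8091.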
Output:
load_backend: loaded CPU backend from /app/libggml-cpu-armv8.2_2.so
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8833-45cac7ca7
model : google_gemma-4-E2B-it-Q4_0.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read <file> add a text file
/glob <pattern> add text files using globbing pattern
>
Step 4. Test
> how
[Start thinking]
Thinking Process:
1. **Analyze the Request:** The user provided a single word: "how". This is an extremely open-ended prompt. It implies a question, a request for a process, or a search for a method.
2. **Determine the Context:** Since there is no context, I must provide a response that addresses the ambiguity while offering ways to get a useful answer.
3. **Identify the Goal:** The goal is to prompt the user to specify what they want to know "how" to do.
4. **Brainstorm Potential Interpretations of "how":**
* *How does X work?* (Mechanism, process)
* *How do I do Y?* (Instructions, tutorial)
* *How to achieve Z?* (Strategy, steps)
* *How are we connected?* (Relationship, system)
5. **Formulate the Response Strategy:**
* Acknowledge the brevity.
* State that more information is needed.
* Provide examples of the types of questions I can answer.
* Invite the user to ask the specific question.
6. **Draft the Response (Self-Correction/Refinement):** Start with a friendly, open invitation. Ensure the tone is helpful and encouraging. (The resulting response should be a clear call to action.)
[End thinking]
Please tell me what you would like to know **how** to do! 😊
I can help you with instructions, explanations, processes, recipes, coding, concepts, and much more.
**For example, you could ask:**
* "How do I bake a cake?"
* "How does photosynthesis work?"
* "How do I change the font in Microsoft Word?"
* "How do I start learning Spanish?"
**Just tell me your question!**
[ Prompt: 8.6 t/s | Generation: 5.6 t/s ]
>
And that’s it: you now have a generative AI tool, served by llama.cpp, that you can access through a terminal, a web interface, or an API.
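If you ran the server variant sketched in step 3, you can also call the OpenAI-compatible API that llama-server exposes (a sketch, assuming the 8091 port mapping above):
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How do I bake a cake?"}]}'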
Useful links
llama.cpp Docker documentation on GitHub
https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md
Model download URLs
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/blob/main/google_gemma-4-E2B-it-Q4_0.gguf
https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF