Most people access generative AI tools like ChatGPT or Gemini through a web interface or API — but what if you could run them locally?

In this article, you’ll learn how to set up your own local generative AI using an open inference engine such as llama.cpp together with an existing open-weight model.

The final result looks like the GIF below (note that everything is served from localhost).

Prerequisites

Hardware: Raspberry Pi 5 (8GB RAM highly recommended).

OS: Raspberry Pi OS (64-bit) or Ubuntu (64-bit).

Storage: At least 5GB free space (preferably on an SSD/NVMe for speed).

 

 

My setup

Device: Raspberry Pi 5 (8GB)

OS: Debian 12

Runtime: Docker

Engine: llama.cpp

Model: Gemma 4 E2B (GGUF, quantized)

 

Install Docker (Debian 12)
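The steps below assume Docker is already installed. If it is not, one sketch (using Docker's official convenience script, which supports Debian 12) looks like this:

```shell
# Install Docker via the official convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Let the current user run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker "$USER"

# Verify the installation
docker --version
```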

 

Here are the commands I used to get Gemma 4 E2B running on a Raspberry Pi 5 (8GB):

Step 1. Pull llama.cpp (light)

First of all, we need an LLM serving engine such as llama.cpp.

# This is the correct lightweight image for Pi (ARM)
docker pull ghcr.io/ggml-org/llama.cpp:light
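As a quick sanity check (my addition, not strictly required), you can confirm that the pulled image matches the Pi's 64-bit ARM architecture:

```shell
# Should print arm64 on a Raspberry Pi 5 running a 64-bit OS
docker image inspect --format '{{.Architecture}}' ghcr.io/ggml-org/llama.cpp:light

# Also confirm the host itself reports 64-bit ARM
uname -m    # expect aarch64
```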

Step 2. Pick a model: download Gemma (GGUF, quantized)

# llama.cpp only works with GGUF model files

# Create the model directory
mkdir -p /datadocker/llama-cpp/models
cd /datadocker/llama-cpp/models


Quantized GGUF conversions are available from these community repositories on Hugging Face:

https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/tree/main

https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main

Direct link to the file used here (google_gemma-4-E2B-it-Q4_0.gguf):
https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf?download=true

Note

There is no official “Gemma 4 E2B GGUF direct URL” from Google.

GGUF files are community-converted and hosted on Hugging Face.
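The links above give the URLs but no download command; a sketch using the bartowski direct link (swap in a different file if you chose another quantization):

```shell
cd /datadocker/llama-cpp/models

wget -O google_gemma-4-E2B-it-Q4_0.gguf \
  "https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_0.gguf?download=true"

# Sanity check: every GGUF file starts with the magic bytes "GGUF"
head -c 4 google_gemma-4-E2B-it-Q4_0.gguf
```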

 

Step 3. Docker run and load the model

Run the llama.cpp CLI in Docker. The upstream docs show this generic example:

docker run -v /path/to/models:/models --entrypoint /app/llama-cli ghcr.io/ggml-org/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf

Adapted to the model and paths used in this article:

docker run -it --rm \
  -v /datadocker/llama-cpp/models:/models \
  --entrypoint /app/llama-cli \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:light \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf
Note

Use the model file you downloaded earlier.
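The command above launches the interactive llama-cli, so the -p 8091:8080 port mapping goes unused. To get the web interface and API mentioned in the introduction, a sketch using the :server image instead (same model path; host port 8091 carried over from above):

```shell
# llama-server ships in the :server image and exposes a web UI
# plus an OpenAI-compatible HTTP API on port 8080 inside the container
docker run -it --rm \
  -v /datadocker/llama-cpp/models:/models \
  -p 8091:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/google_gemma-4-E2B-it-Q4_0.gguf \
  --host 0.0.0.0 --port 8080

# From another terminal, exercise the API:
curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
```

Opening http://localhost:8091 in a browser brings up the built-in chat UI.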

 

 

Output:

load_backend: loaded CPU backend from /app/libggml-cpu-armv8.2_2.so

Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8833-45cac7ca7
model      : google_gemma-4-E2B-it-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 

 

Step 4. Test

> how

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user provided a single word: "how". This is an extremely open-ended prompt. It implies a question, a request for a process, or a search for a method.
2.  **Determine the Context:** Since there is no context, I must provide a response that addresses the ambiguity while offering ways to get a useful answer.
3.  **Identify the Goal:** The goal is to prompt the user to specify what they want to know "how" to do.
4.  **Brainstorm Potential Interpretations of "how":**
    *   *How does X work?* (Mechanism, process)
    *   *How do I do Y?* (Instructions, tutorial)
    *   *How to achieve Z?* (Strategy, steps)
    *   *How are we connected?* (Relationship, system)
5.  **Formulate the Response Strategy:**
    *   Acknowledge the brevity.
    *   State that more information is needed.
    *   Provide examples of the types of questions I can answer.
    *   Invite the user to ask the specific question.
6.  **Draft the Response (Self-Correction/Refinement):** Start with a friendly, open invitation. Ensure the tone is helpful and encouraging. (The resulting response should be a clear call to action.)
[End thinking]

Please tell me what you would like to know **how** to do! 😊

I can help you with instructions, explanations, processes, recipes, coding, concepts, and much more.

**For example, you could ask:**

* "How do I bake a cake?"
* "How does photosynthesis work?"
* "How do I change the font in Microsoft Word?"
* "How do I start learning Spanish?"

**Just tell me your question!**

[ Prompt: 8.6 t/s | Generation: 5.6 t/s ]

> 

 


 

 

Useful links

llama.cpp Docker guide on GitHub:

https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md

 

Model downloads:

https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/tree/main

https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/blob/main/google_gemma-4-E2B-it-Q4_0.gguf

https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF
