gemma3.c is a from‑scratch CPU inference engine for the Gemma 3 4B IT model.
It proves that modern LLMs can run without Python, PyTorch, or GPUs.
- ⚙️ 100% Pure C (C11) – zero external dependencies
- 🧠 Full Gemma 3 architecture – GQA, hybrid attention, SwiGLU
- 🗺️ Memory‑mapped weights – BF16 SafeTensors via mmap (sketched below)
- 🔤 Native SentencePiece tokenizer – 262K vocab
- 🌊 Streaming output – token‑by‑token callbacks
- 💬 Interactive chat mode
- 📦 CLI + Library API
- 🐧 Linux/macOS native, 🪟 Windows via WSL (recommended) or MinGW
⚠️ POSIX‑first: native on Linux/macOS. On Windows, use WSL or MinGW (no‑mmap fallback).
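How the memory mapping works, as a hedged sketch: map the weight file read‑only and widen BF16 values to FP32 on the fly (a BF16 value is just the top half of an IEEE‑754 float). The names and layout below are illustrative assumptions, not gemma3.c's actual internals:

```c
/* Illustrative sketch only -- not gemma3.c's actual code. Maps a raw BF16
 * weight blob read-only and widens one value to FP32 on demand. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* BF16 is the top 16 bits of an IEEE-754 float, so widening is one shift. */
static float bf16_to_f32(uint16_t b) {
    uint32_t u = (uint32_t)b << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

static const uint16_t *map_weights(const char *path, size_t *n) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;
    *n = (size_t)st.st_size / sizeof(uint16_t);
    return (const uint16_t *)p;
}
```

This is why the ~8 GB of weights never touch the process heap: the OS pages them in on demand and can evict them under pressure.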
```
export HF_TOKEN=your_token_here
python download_model.py
```

```
make
```

```
# Single prompt
./gemma3 -m ./gemma-3-4b-it -p "Explain quantum computing simply."

# Interactive chat
./gemma3 -m ./gemma-3-4b-it -i
```

The included Python script:
- Handles HuggingFace auth
- Downloads all shards
- Resumes broken downloads
- Verifies integrity
```
python download_model.py --token YOUR_HF_TOKEN
```

Manual alternatives: `huggingface-cli` or `git lfs`.
```
make        # Optimized
make debug  # Debug symbols
make fast   # -march=native -ffast-math
make clean
```

```
-m <path>     Model directory
-p <text>     Prompt
-i            Interactive mode
-s <text>     System prompt
-n <n>        Max tokens
-t <f>        Temperature
-k <n>        Top-k
--top-p <f>   Top-p
-c <n>        Context size
--seed <n>    RNG seed
-v            Verbose
```
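If you are curious what the sampling flags actually control, here is a sketch of the standard temperature / top‑k / top‑p pipeline. Everything in it (including the `sample` helper itself) is illustrative of the common algorithm, not necessarily gemma3.c's exact implementation or cutoff order:

```c
/* Hypothetical sketch of standard temperature + top-k + top-p sampling;
 * not necessarily gemma3.c's exact implementation. */
#include <math.h>
#include <stdlib.h>

typedef struct { int id; float p; } cand;

static int cmp_desc(const void *a, const void *b) {
    float d = ((const cand *)b)->p - ((const cand *)a)->p;
    return (d > 0) - (d < 0);
}

/* logits[vocab] -> sampled token id */
int sample(const float *logits, int vocab, float temp, int top_k, float top_p) {
    if (temp <= 0.0f) temp = 1e-6f;             /* near-greedy as temp -> 0 */
    cand *c = malloc((size_t)vocab * sizeof *c);
    float max = logits[0], sum = 0.0f;
    for (int i = 1; i < vocab; i++) if (logits[i] > max) max = logits[i];
    for (int i = 0; i < vocab; i++) {           /* tempered softmax */
        c[i].id = i;
        c[i].p = expf((logits[i] - max) / temp);
        sum += c[i].p;
    }
    for (int i = 0; i < vocab; i++) c[i].p /= sum;
    qsort(c, (size_t)vocab, sizeof *c, cmp_desc);
    int n = (top_k > 0 && top_k < vocab) ? top_k : vocab;   /* top-k cut */
    int m = 0;
    float cum = 0.0f, renorm = 0.0f;
    while (m < n && cum < top_p) cum += c[m++].p;           /* top-p cut */
    for (int i = 0; i < m; i++) renorm += c[i].p;
    float r = (float)rand() / (float)RAND_MAX * renorm;     /* weighted draw */
    int id = c[m - 1].id;
    for (int i = 0; i < m; i++) { r -= c[i].p; if (r <= 0) { id = c[i].id; break; } }
    free(c);
    return id;
}
```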
```c
#include <stdio.h>
#include <stdlib.h>
#include "gemma3.h"

int main(void) {
    gemma3_ctx *ctx = gemma3_load_dir("./gemma-3-4b-it");
    gemma3_gen_params params = gemma3_default_params();
    char *out = gemma3_generate(ctx, "Hello!", &params, NULL, NULL);
    printf("%s\n", out);
    free(out);          /* the caller owns the returned string */
    gemma3_free(ctx);
}
```
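Streaming output works through the callback arguments of `gemma3_generate`. The sketch below ASSUMES the two trailing parameters (passed as `NULL` above) are a per‑token callback and a user‑data pointer; check `gemma3.h` for the real signature before copying it:

```c
#include <stdio.h>
#include <stdlib.h>
#include "gemma3.h"

/* ASSUMED callback shape -- verify against gemma3.h. */
static void on_token(const char *piece, void *user) {
    (void)user;
    fputs(piece, stdout);   /* print each token as soon as it is sampled */
    fflush(stdout);
}

int main(void) {
    gemma3_ctx *ctx = gemma3_load_dir("./gemma-3-4b-it");
    gemma3_gen_params params = gemma3_default_params();
    char *out = gemma3_generate(ctx, "Hello!", &params, on_token, NULL);
    free(out);
    gemma3_free(ctx);
}
```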
| Param | Value |
|---|---|
| Vocab | 262,208 |
| Layers | 34 |
| Hidden | 2,560 |
| Heads | 8 (4 KV, GQA) |
| Context | 128K |
| Pattern | 5 local : 1 global |
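Two of those rows interact inside the attention stack: with 8 query heads over 4 KV heads, each pair of query heads shares one KV head (that is the GQA), and layers alternate five local (sliding‑window) to one global. A hedged indexing sketch; the position of the global layer within each group of six is an assumption:

```c
/* Hedged indexing sketch for the table above. Which slot in each group of
 * six layers is global is an assumption; gemma3.c may order it differently. */
enum { N_HEADS = 8, N_KV_HEADS = 4, PATTERN = 6 /* 5 local + 1 global */ };

static int kv_head_for(int q_head) {
    return q_head / (N_HEADS / N_KV_HEADS);  /* 2 query heads per KV head */
}

static int is_global_layer(int layer) {
    return layer % PATTERN == PATTERN - 1;   /* assumed: last of every 6 */
}
```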
- Weights: ~8 GB on disk (BF16)
- Runtime RAM: ~3 GB total
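Context length is the main RAM knob, because the KV cache grows linearly with it. A back‑of‑envelope sketch, assuming an FP32 cache and head_dim = 256 (taken from the published Gemma 3 config, not from this codebase), shows why the `-c` flag below helps:

```c
/* Back-of-envelope KV-cache size. ASSUMPTIONS: FP32 cache, head_dim = 256
 * (published Gemma 3 config); gemma3.c's layout may differ, e.g. local
 * layers may cap their cache at the sliding-window length. */
#include <stddef.h>

static size_t kv_cache_bytes(size_t ctx) {
    const size_t layers = 34, kv_heads = 4, head_dim = 256;
    return layers * 2 /* K and V */ * kv_heads * head_dim * ctx * sizeof(float);
}
/* kv_cache_bytes(512) is ~136 MB; every doubling of -c doubles it. */
```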
Reduce usage:

```
./gemma3 -m ./gemma-3-4b-it -c 512 -p "Hello"
```

- Prefill: ~2–5 tok/s
- Generation: ~1–3 tok/s
For the best throughput, build with:

```
make fast
```

- CPU only
- Text only
- No quantization (yet)
MIT License. Model weights under Google’s Gemma license.
If you ever wanted to see Gemma 3 breathe in pure C, this is it.