gemma3.c is a from‑scratch CPU inference engine for the Gemma 3 4B IT model.
- ⚙️ 100% Pure C (C11) – zero external dependencies
- 🧠 Full Gemma 3 architecture – GQA, hybrid attention, SwiGLU
- 🗺️ Memory‑mapped weights – BF16 SafeTensors via mmap
- 🔤 Native SentencePiece tokenizer – 262K vocab
- 🌊 Streaming output – token‑by‑token callbacks
- 💬 Interactive chat mode
- 📦 CLI + Library API
- 🐧 Linux/macOS native, 🪟 Windows via WSL (recommended) or MinGW
- 🔗 OpenBLAS support (optional) – BLAS-accelerated matrix operations
- 🧵 Multi-threaded inference – Thread pool for parallel computation
⚠️ POSIX‑first: native on Linux/macOS. On Windows, use WSL (recommended) or MinGW (no mmap support).
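For the curious, "memory‑mapped BF16 weights" boils down to two tricks: let the kernel page the weight file in on demand, and widen each BF16 value to float32 at use time. Below is an illustrative sketch, not the project's actual loader: the shard filename and header offset are placeholders, and a real reader parses the SafeTensors JSON header first.

```c
/* Sketch only: memory-mapping a weight shard and reading BF16 values.
 * Filename and offset are placeholders, not the project's real layout. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* BF16 is simply the top 16 bits of an IEEE-754 float32. */
static inline float bf16_to_f32(uint16_t h) {
    union { uint32_t u; float f; } v;
    v.u = (uint32_t)h << 16;
    return v.f;
}

int main(void) {
    int fd = open("model-00001-of-00002.safetensors", O_RDONLY); /* placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* The kernel pages weights in on demand, so an ~8 GB file does not
     * require ~8 GB of resident RAM up front. */
    const uint16_t *w = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (w == MAP_FAILED) { perror("mmap"); return 1; }

    /* A real loader parses the SafeTensors JSON header to find tensor
     * offsets; here we just peek past a made-up offset. */
    size_t header_skip = 4096; /* placeholder, not a real offset */
    printf("first value past header: %f\n",
           bf16_to_f32(w[header_skip / sizeof(uint16_t)]));

    munmap((void *)w, st.st_size);
    close(fd);
    return 0;
}
```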
```bash
export HF_TOKEN=your_token_here
pip install huggingface_hub
python download_model.py
make

# Single prompt
./gemma3 -m ./gemma-3-4b-it -p "Explain quantum computing simply."
```
OpenBLAS builds: `make blas` and `make blas-threads` require OpenBLAS:
- Linux: `sudo apt install libopenblas-dev`
- macOS: `brew install openblas`
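For context, most CPU inference time goes into matrix–vector products, and that is the work the BLAS builds hand off to OpenBLAS. A minimal standalone illustration (not the project's code) of the kind of call involved:

```c
/* Illustration: y = W * x via OpenBLAS instead of a hand-rolled loop.
 * Compile: cc demo.c -lopenblas */
#include <cblas.h>
#include <stdio.h>

int main(void) {
    /* 2x3 weight matrix, row-major */
    float W[] = { 1, 2, 3,
                  4, 5, 6 };
    float x[] = { 1, 1, 1 };
    float y[2];

    /* y = 1.0 * W * x + 0.0 * y */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0f, W, 3, x, 1, 0.0f, y, 1);

    printf("%f %f\n", y[0], y[1]); /* prints 6.000000 15.000000 */
    return 0;
}
```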
The included Python script:
- Handles HuggingFace auth
- Downloads all shards
- Resumes broken downloads
- Verifies integrity
```bash
python download_model.py --token YOUR_HF_TOKEN
```

Manual alternatives: `huggingface-cli` or `git lfs`.
```bash
make              # Release build (default)
make debug        # Debug symbols
make fast         # Native optimizations (-march=native -ffast-math)
make threads      # Thread pool parallelization
make blas         # OpenBLAS acceleration (requires libopenblas)
make blas-threads # OpenBLAS + threads (best performance)
make clean        # Remove build artifacts
make help         # Show all targets
```

CLI options:

```
-m <path> Model directory
-p <text> Prompt
-i Interactive mode
-s <text> System prompt
-n <n> Max tokens
-t <f> Temperature
-k <n> Top‑k
--top-p <f> Top‑p
-c <n> Context size
--seed <n> RNG seed
-v Verbose
```
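For example, an interactive session combining several of these options:

```bash
./gemma3 -m ./gemma-3-4b-it -i \
  -s "You are a concise assistant." \
  -n 256 -t 0.8 -k 40 --top-p 0.95 -c 4096 --seed 42
```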
```c
#include <stdio.h>
#include <stdlib.h>
#include "gemma3.h" /* project header name assumed */

gemma3_ctx *ctx = gemma3_load_dir("./gemma-3-4b-it");
gemma3_gen_params params = gemma3_default_params();
char *out = gemma3_generate(ctx, "Hello!", &params, NULL, NULL);
printf("%s\n", out);
free(out);
gemma3_free(ctx);
```
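The two trailing NULL arguments to gemma3_generate look like the slots for the streaming callback and its user‑data pointer (matching the token‑by‑token feature above); the real prototype lives in the project header. A hypothetical hookup, with the callback signature assumed:

```c
/* Hypothetical: assumes the callback receives each decoded token as a
 * C string plus a user pointer. Check the real prototype in the header. */
#include <stdio.h>

static void on_token(const char *token, void *user) {
    (void)user;
    fputs(token, stdout);
    fflush(stdout); /* show tokens as they are produced */
}

/* ... */
char *out = gemma3_generate(ctx, "Hello!", &params, on_token, NULL);
```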
| Param | Value |
|---|---|
| Vocab | 262,208 |
| Layers | 34 |
| Hidden | 2,560 |
| Heads | 8 (4 KV, GQA) |
| Context | 128K |
| Pattern | 5 local : 1 global |
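The "5 local : 1 global" pattern means five sliding‑window attention layers for every full‑attention layer. Assuming the convention that every sixth layer is the global one (verify against the actual config), classifying layers is a one‑liner:

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed indexing for the 5-local : 1-global pattern: every sixth
 * layer attends globally. Verify against the model config. */
static bool is_global_layer(int layer) {
    return (layer + 1) % 6 == 0;
}

int main(void) {
    for (int l = 0; l < 34; l++)
        printf("layer %2d: %s\n", l, is_global_layer(l) ? "global" : "local");
    return 0;
}
```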
- Weights: ~8 GB on disk (BF16)
- Runtime RAM: ~3 GB total
Reduce usage:
```bash
./gemma3 -m ./gemma-3-4b-it -c 512 -p "Hello"
```

Typical speeds (CPU‑dependent):
- Prefill: ~2–5 tok/s
- Generation: ~1–3 tok/s
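Why a smaller -c helps: the KV cache grows linearly with context length. A back‑of‑envelope sketch, where layers and KV heads come from the table above while head_dim and the FP32 cache dtype are assumptions:

```c
#include <stdio.h>

int main(void) {
    /* Layers and KV heads from the table above; head_dim and FP32
     * cache dtype are assumptions for illustration. */
    long layers = 34, kv_heads = 4, head_dim = 256, ctx = 512;
    long bytes_per = 4; /* FP32 cache assumed */
    long kv = 2 /* K and V */ * layers * kv_heads * head_dim * ctx * bytes_per;
    printf("KV cache @ ctx=%ld: %.1f MiB\n", ctx, kv / (1024.0 * 1024.0));
    return 0;
}
```

Under those assumptions, ctx=512 costs about 136 MiB of KV cache; the cost scales linearly, so the full 128K window would be roughly 256 times that.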
For better performance (see the threading sketch below):

```bash
make fast         # Single-threaded with native optimizations
make threads      # Multi-core parallelization
make blas-threads # Best performance (requires OpenBLAS)
```

Limitations:
- CPU only
- Text only
- No quantization (yet)
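As promised above, here is what the threaded builds parallelize: rows of a matrix–vector product are independent, so they can be split across workers. A simplified sketch using per‑call pthreads (the actual build keeps a persistent pool to avoid per‑call thread cost):

```c
/* Simplified row-parallel matvec with pthreads. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define ROWS 8
#define COLS 4
#define NTHREADS 4

static float W[ROWS][COLS], x[COLS], y[ROWS];

typedef struct { int r0, r1; } span_t;

/* Each worker computes its own disjoint range of output rows. */
static void *worker(void *arg) {
    span_t *s = arg;
    for (int r = s->r0; r < s->r1; r++) {
        float acc = 0.0f;
        for (int c = 0; c < COLS; c++) acc += W[r][c] * x[c];
        y[r] = acc;
    }
    return NULL;
}

int main(void) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) W[r][c] = (float)(r + c);
    for (int c = 0; c < COLS; c++) x[c] = 1.0f;

    pthread_t tid[NTHREADS];
    span_t spans[NTHREADS];
    int per = ROWS / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        spans[t] = (span_t){ t * per, (t + 1) * per };
        pthread_create(&tid[t], NULL, worker, &spans[t]);
    }
    for (int t = 0; t < NTHREADS; t++) pthread_join(tid[t], NULL);

    for (int r = 0; r < ROWS; r++) printf("y[%d] = %g\n", r, y[r]);
    return 0;
}
```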
MIT License. Model weights under Google’s Gemma license.
If you ever wanted to see Gemma 3 breathe in pure C, this is it.