
gemma3.c

gemma3.c is a from‑scratch CPU inference engine for the Gemma 3 4B IT model. It demonstrates that a modern LLM can run without Python, PyTorch, or a GPU.

✨ Highlights

  • ⚙️ 100% Pure C (C11) – zero external dependencies
  • 🧠 Full Gemma 3 architecture – GQA, hybrid attention, SwiGLU
  • 🗺️ Memory‑mapped weights – BF16 SafeTensors via mmap (see the BF16 sketch after this list)
  • 🔤 Native SentencePiece tokenizer – 262K vocab
  • 🌊 Streaming output – token‑by‑token callbacks
  • 💬 Interactive chat mode
  • 📦 CLI + Library API
  • 🐧 Linux/macOS native, 🪟 Windows via WSL (recommended) or MinGW
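
Because the weights stay in BF16 on disk, loading them without a framework hinges on one small trick: bfloat16 is simply the high 16 bits of an IEEE‑754 float32. The helper below is a minimal illustrative sketch of that conversion, not the repo's actual code.

#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not this repo's code): widen bf16 to f32
   by shifting into the high half and bit-copying into a float. */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;   /* low mantissa bits become zero */
    float f;
    memcpy(&f, &bits, sizeof f);         /* bit-exact copy, no aliasing UB */
    return f;
}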

🚀 Quick Start

⚠️ POSIX‑first: runs natively on Linux/macOS. On Windows, use WSL (recommended) or MinGW (no mmap support).

1️⃣ Download model (recommended)

export HF_TOKEN=your_token_here
python download_model.py

2️⃣ Build

make

3️⃣ Run

# Single prompt
./gemma3 -m ./gemma-3-4b-it -p "Explain quantum computing simply."

# Interactive chat
./gemma3 -m ./gemma-3-4b-it -i

📥 Model Download

The included Python script:

  • Handles HuggingFace auth
  • Downloads all shards
  • Resumes broken downloads
  • Verifies integrity

python download_model.py --token YOUR_HF_TOKEN

Manual alternatives: huggingface-cli or git lfs.
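
For example, with huggingface-cli (the model ID is assumed to be the standard google/gemma-3-4b-it, and your account needs Gemma access):

huggingface-cli login
huggingface-cli download google/gemma-3-4b-it --local-dir ./gemma-3-4b-it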


🛠️ Build Targets

make        # Optimized
make debug  # Debug symbols
make fast   # -march=native -ffast-math
make clean

🧪 CLI Options

-m <path>    Model directory
-p <text>    Prompt
-i           Interactive mode
-s <text>    System prompt
-n <n>       Max tokens
-t <f>       Temperature
-k <n>       Top‑k
--top-p <f>  Top‑p
-c <n>       Context size
--seed <n>   RNG seed
-v           Verbose
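
For example, combining several of the options above (the values here are arbitrary):

./gemma3 -m ./gemma-3-4b-it \
  -p "Summarize the plot of Hamlet." \
  -n 256 -t 0.7 -k 40 --top-p 0.9 --seed 42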

📚 Library Example

#include <stdio.h>
#include <stdlib.h>
#include "gemma3.h"   /* library header (name assumed from the project layout) */

int main(void) {
    gemma3_ctx *ctx = gemma3_load_dir("./gemma-3-4b-it");
    if (!ctx) return 1;   /* model directory missing or unreadable */

    gemma3_gen_params params = gemma3_default_params();
    char *out = gemma3_generate(ctx, "Hello!", &params, NULL, NULL);
    printf("%s\n", out);
    free(out);            /* caller owns the returned string */

    gemma3_free(ctx);
    return 0;
}
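
The two trailing NULL arguments to gemma3_generate line up with the token‑by‑token streaming callbacks mentioned in the highlights. The sketch below assumes a plausible callback shape (token text plus a user‑data pointer); the real signature lives in the library header and may differ.

/* HYPOTHETICAL callback signature; check the library header for the real one. */
static void on_token(const char *piece, void *userdata) {
    (void)userdata;
    fputs(piece, stdout);   /* print each decoded token as it arrives */
    fflush(stdout);
}

/* ...then, assuming that shape, stream instead of collecting: */
char *out = gemma3_generate(ctx, "Hello!", &params, on_token, NULL);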

🧠 Model Specs

Parameter   Value
Vocab       262,208
Layers      34
Hidden      2,560
Heads       8 (4 KV, GQA)
Context     128K
Pattern     5 local : 1 global
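
To make the 5 local : 1 global pattern concrete, here is a hypothetical helper; the exact position of the global layer within each block of six is an assumption, not taken from the source.

/* Hypothetical sketch of the hybrid attention schedule. */
static int layer_is_global(int layer_idx) {
    return (layer_idx + 1) % 6 == 0;   /* every sixth layer: global attention */
}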

💾 Memory

  • Weights: ~8 GB on disk (BF16)
  • Runtime RAM: ~3 GB (mmap pages weights in on demand, so resident memory stays well below the file size)

To reduce usage further, shrink the context window:

./gemma3 -m ./gemma-3-4b-it -c 512 -p "Hello"

⚡ Performance (CPU)

  • Prefill: ~2–5 tok/s
  • Generation: ~1–3 tok/s

For best throughput, build with:

make fast

⚠️ Limitations

  • CPU only
  • Text only
  • No quantization (yet)

🪪 License

MIT License. Model weights under Google’s Gemma license.


If you ever wanted to see Gemma 3 breathe in pure C, this is it.
