Gemma 4 on ExecuTorch

Multimodal inference for Gemma 4 on ExecuTorch. Supports audio transcription, translation, image understanding, and text generation on mobile devices.

Variants: E2B (2B params) and E4B (4B params).

Architecture

A single .pte file with up to 4 methods:

speech_transform — Waveform to log-mel spectrogram (no learned weights)
audio_encoder — USM Conformer via HF's Gemma4AudioModel
vision_encoder — ViT with 2D RoPE via HF's Gemma4VisionModel (8-bit, int8 position embeddings)
text_decoder — Autoregressive decoder with YOCO, PLE, partial RoPE

Use --no-audio or --no-vision at export time to exclude unused encoders.
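
To check which of these methods a given export actually contains, a minimal sketch with the ExecuTorch Python runtime might look like the following. It assumes the `executorch` package is installed. `required_methods`, `missing_methods`, and `list_methods` are hypothetical helpers, and the mapping from modality to method is inferred from the list above rather than taken from the runner code.

```python
from typing import Iterable, List, Set


def required_methods(audio: bool, vision: bool) -> List[str]:
    """Methods a request needs, assuming the method roles listed above."""
    needed = ["text_decoder"]  # every request ends in the text decoder
    if audio:
        needed += ["speech_transform", "audio_encoder"]
    if vision:
        needed.append("vision_encoder")
    return needed


def missing_methods(available: Iterable[str], audio: bool, vision: bool) -> List[str]:
    """Return the required methods that are absent from `available`."""
    have: Set[str] = set(available)
    return [name for name in required_methods(audio, vision) if name not in have]


def list_methods(pte_path: str) -> Set[str]:
    """Read method names from a .pte file (needs the executorch package)."""
    from executorch.runtime import Runtime  # imported lazily so the helpers above work without it

    program = Runtime.get().load_program(pte_path)
    return set(program.method_names)


if __name__ == "__main__":
    # Method set assumed for an export made with --no-vision.
    audio_only = {"speech_transform", "audio_encoder", "text_decoder"}
    print(missing_methods(audio_only, audio=False, vision=True))
```

With a real file, `missing_methods(list_methods("gemma4.pte"), audio=True, vision=False)` would report whether the export can serve an audio prompt before the runner is started.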

Parameter	E2B	E4B
Hidden size	1536	2560
Layers	35	42
KV heads	1 (MQA)	2

Export

# E2B default (4-bit text, 8-bit vision, all modalities):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
    --checkpoint_path /tmp/gemma4-e2b-it

# E2B 4-bit with tied embedding (smaller, for on-device deployment):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
    --checkpoint_path /tmp/gemma4-e2b-it --tied_embedding

# E4B (4-bit):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
    --checkpoint_path /tmp/gemma4-e4b-it --variant e4b

# Audio-only (no vision encoder, saves ~129 MB):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
    --checkpoint_path /tmp/gemma4-e2b-it --no-vision

# Vision-only (no audio encoder, saves ~100 MB):
buck2 run fbcode//executorch/examples/models/gemma4:export_gemma4 -- \
    --checkpoint_path /tmp/gemma4-e2b-it --no-audio

Model Variants

The default export includes all modalities (audio + vision + text). The default context length is 1024 tokens; change it with --max_seq_len.

Pre-exported Models

E2B:

File	Size	Config	Description
`gemma4.pte`	4.1 GB	4-bit, audio-only	Default — fastest
`gemma4_vision.pte`	4.3 GB	4-bit, all modalities	Audio + vision + text
`gemma4_tied_emb4.pte`	2.5 GB	4-bit tied + emb4, audio-only	Smallest

E4B:

File	Size	Config	Description
`gemma4.pte`	6.1 GB	4-bit, audio-only	Default — fastest
`gemma4_vision.pte`	6.2 GB	4-bit, all modalities	Audio + vision + text
`gemma4_tied_emb4.pte`	4.0 GB	4-bit tied + emb4, audio-only	Smallest

Export Flags

Variant	Size	Flag
E2B 4-bit (default)	4.3 GB	(none)
E2B 4-bit audio-only	4.1 GB	`--no-vision`
E2B 4-bit emb4 tied	2.5 GB	`--quantize 8da4w+emb4 --tied_embedding --no-vision`
E4B 4-bit	6.2 GB	`--variant e4b`
E4B 4-bit audio-only	6.1 GB	`--variant e4b --no-vision`
E4B 4-bit emb4 tied	4.0 GB	`--variant e4b --quantize 8da4w+emb4 --tied_embedding --no-vision`

Vision encoder adds ~129 MB (8-bit linears + int8 position embedding table).

Untied models (gemma4.pte, gemma4_vision.pte) work with both the Python and C++ runners.
The emb4 tied variant (gemma4_tied_emb4.pte) uses packed INT4 embeddings and shares weights between embed_tokens and lm_head. It requires the C++ runner built with the TorchAO shared embedding kernels.
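
The flag combinations in the table can also be assembled programmatically. The sketch below only builds the argument list that follows `--` in the export commands. `export_args` is a hypothetical helper, it uses only flags that appear in this README, and it assumes --max_seq_len takes an integer value.

```python
from typing import List, Optional


def export_args(
    checkpoint_path: str,
    variant: str = "e2b",
    emb4_tied: bool = False,
    audio: bool = True,
    vision: bool = True,
    max_seq_len: Optional[int] = None,
) -> List[str]:
    """Build export_gemma4 arguments from the flags documented above."""
    if variant not in ("e2b", "e4b"):
        raise ValueError(f"unknown variant: {variant}")
    args = ["--checkpoint_path", checkpoint_path]
    if variant == "e4b":  # E2B is the default, so it needs no flag
        args += ["--variant", "e4b"]
    if emb4_tied:  # packed INT4 embeddings with shared embed_tokens/lm_head
        args += ["--quantize", "8da4w+emb4", "--tied_embedding"]
    if not audio:
        args.append("--no-audio")
    if not vision:
        args.append("--no-vision")
    if max_seq_len is not None:  # assumed to take an integer token count
        args += ["--max_seq_len", str(max_seq_len)]
    return args


if __name__ == "__main__":
    # Smallest E4B configuration from the table above.
    print(" ".join(export_args("/tmp/gemma4-e4b-it", "e4b", emb4_tied=True, vision=False)))
```

The resulting list can be appended to the `buck2 run ... --` prefix shown in the Export section.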

Build (CMake, host)

cmake --preset gemma4-cpu -S examples/models/gemma4
cmake --build --preset gemma4-cpu -j$(nproc)

Run

# Audio transcription (C++ runner):
./cmake-out/examples/models/gemma4/gemma4_e2e_runner \
    --model_path gemma4.pte \
    --tokenizer_path tokenizer.model \
    --audio_path test_audio.wav

# Image understanding (C++ runner; needs a PTE that includes the vision encoder):
./cmake-out/examples/models/gemma4/gemma4_e2e_runner \
    --model_path gemma4_vision.pte \
    --tokenizer_path tokenizer.model \
    --image_path photo.jpg \
    --prompt "Describe this image:"

# Text-only:
./cmake-out/examples/models/gemma4/gemma4_e2e_runner \
    --model_path gemma4.pte \
    --tokenizer_path tokenizer.model \
    --prompt "What is 2+2?"

# Python runner (audio):
buck2 run fbcode//executorch/examples/models/gemma4:run_gemma4 -- \
    --model_path /tmp/gemma4.pte \
    --tokenizer_path /tmp/tokenizer.model \
    --audio_path /tmp/test_audio.wav

# Python runner (image; needs a PTE that includes the vision encoder):
buck2 run fbcode//executorch/examples/models/gemma4:run_gemma4 -- \
    --model_path /tmp/gemma4_vision.pte \
    --tokenizer_path /tmp/tokenizer.model \
    --image_path /tmp/photo.jpg \
    --prompt "Describe this image:"

Recommended Prompts

The runners default --prompt to a short generic string. For best output quality on ASR / translation tasks, pass the canonical Google Gemma 4 prompt explicitly.

Speech transcription (ASR)

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.

Replace {LANGUAGE} with the source language (e.g., English, Chinese, Spanish).

Example:

./gemma4_e2e_runner \
    --model_path gemma4.pte --tokenizer_path tokenizer.model \
    --audio_path test_audio.wav \
    --prompt "$(cat <<'EOF'
Transcribe the following speech segment in English into English text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
EOF
)"

Speech translation

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
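
A small sketch of how the translation template might be filled in and how the two-part answer it asks for could be split afterwards. `translation_prompt` and `split_translation` are hypothetical helpers, not part of the runners, and the parsing assumes the model follows the requested format exactly.

```python
from typing import Tuple

TRANSLATION_TEMPLATE = (
    "Transcribe the following speech segment in {SOURCE_LANGUAGE}, "
    "then translate it into {TARGET_LANGUAGE}.\n"
    "When formatting the answer, first output the transcription in "
    "{SOURCE_LANGUAGE}, then one newline, then output the string "
    "'{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}."
)


def translation_prompt(source_language: str, target_language: str) -> str:
    """Fill the translation template with concrete language names."""
    return TRANSLATION_TEMPLATE.format(
        SOURCE_LANGUAGE=source_language, TARGET_LANGUAGE=target_language
    )


def split_translation(output: str, target_language: str) -> Tuple[str, str]:
    """Split 'transcription\\n<target>: translation' into its two parts."""
    transcription, _, rest = output.partition("\n")
    prefix = f"{target_language}: "
    # Fall back to the raw remainder if the model omitted the prefix.
    translation = rest[len(prefix):] if rest.startswith(prefix) else rest
    return transcription.strip(), translation.strip()


if __name__ == "__main__":
    print(translation_prompt("Spanish", "English"))
    print(split_translation("Hola mundo\nEnglish: Hello world", "English"))
```

The string returned by `translation_prompt` is what would be passed to --prompt.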

Input Requirements

Audio: WAV, 16 kHz, 16-bit PCM, mono, at most 30 seconds.
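
Before handing a file to the runner, the audio requirements can be checked with Python's standard `wave` module. This is a minimal sketch: `check_wav` and `write_silence` are hypothetical helpers, only the WAV header is inspected, and nothing is resampled or converted.

```python
import os
import tempfile
import wave
from typing import List


def check_wav(path: str, max_seconds: float = 30.0) -> List[str]:
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:  # raises wave.Error for non-PCM data
        rate = wav.getframerate()
        if rate != 16000:
            problems.append(f"sample rate is {rate} Hz, expected 16000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected 16")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        duration = wav.getnframes() / rate if rate else 0.0
        if duration > max_seconds:
            problems.append(f"duration is {duration:.1f} s, limit is {max_seconds:.0f} s")
    return problems


def write_silence(path: str, rate: int = 16000, channels: int = 1, seconds: float = 1.0) -> None:
    """Write a silent 16-bit PCM WAV file, used here only for the demo."""
    with wave.open(path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * channels * int(rate * seconds))


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "silence.wav")
    write_silence(demo)
    print(check_wav(demo))
```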

Image: JPEG or PNG. Images are resized to fit within --max_vision_tokens soft tokens (default 140). The aspect ratio is preserved and dimensions are rounded to multiples of 48 pixels. A lower token budget is faster but keeps less detail. Approximate sizes for a square image: 25 ~= 240x240, 70 ~= 384x384, 140 ~= 528x528, 280 ~= 768x768.
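
The relationship between the token budget and the resized dimensions can be sketched as follows. This is a reconstruction that reproduces the four sizes listed above, not the code in image_utils.py. It assumes one soft token per 48x48 pixel block, always resizes to the budget, and uses a hypothetical helper name, `fit_to_token_budget`.

```python
from typing import Tuple

PATCH = 48  # pixels per soft-token side; an assumption inferred from the listed sizes


def fit_to_token_budget(width: int, height: int, max_vision_tokens: int = 140) -> Tuple[int, int]:
    """Return (new_width, new_height), both multiples of 48, within the budget."""
    if width <= 0 or height <= 0 or max_vision_tokens <= 0:
        raise ValueError("width, height and max_vision_tokens must be positive")
    # Try the widest grid first and shrink until cols * rows fits the budget.
    for cols in range(max_vision_tokens, 0, -1):
        rows = max(1, round(cols * height / width))  # keep the aspect ratio approximately
        if cols * rows <= max_vision_tokens:
            return cols * PATCH, rows * PATCH
    # Extremely tall images: one column, as many rows as the budget allows.
    return PATCH, max_vision_tokens * PATCH


if __name__ == "__main__":
    for budget in (25, 70, 140, 280):
        print(budget, fit_to_token_budget(1000, 1000, budget))
```

For non-square inputs the grid is only an approximation of the original aspect ratio, because both sides have to land on a multiple of 48.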

Samsung S25 Performance

In the tables below, Load is model load time, Prefill and Gen are throughput in tokens per second, TTFT is time to first token, and RTF is the real-time factor (processing time divided by audio duration; below 1.0 is faster than real time). Mem load and Mem peak are memory use after model load and at peak.

Audio (23s)

Model	Size	Load	Prefill	Gen	TTFT	RTF	Mem load	Mem peak
E2B gemma4.pte	4.1 GB	705ms	166 tok/s	6 tok/s	4.50s	0.71	1885 MB	2251 MB
E2B gemma4_vision.pte	4.3 GB	648ms	163 tok/s	6 tok/s	4.56s	0.72	1890 MB	2257 MB
E2B gemma4_tied_emb4.pte	2.5 GB	645ms	164 tok/s	6 tok/s	4.52s	0.71	1683 MB	2241 MB
E4B gemma4.pte	6.1 GB	1.30s	91 tok/s	4 tok/s	7.50s	1.07	3231 MB	3601 MB
E4B gemma4_vision.pte	6.2 GB	1.28s	92 tok/s	4 tok/s	7.47s	1.00	3231 MB	3602 MB
E4B gemma4_tied_emb4.pte	4.0 GB	1.17s	85 tok/s	4 tok/s	8.00s	1.07	2899 MB	3590 MB

Vision (dog.jpg, "Describe this image in two sentences.", 140 tokens ~528x528)

Model	Size	Load	Encode	Prefill	Gen	TTFT	Total	Mem load	Mem peak
E2B gemma4_vision.pte	4.3 GB	798ms	2.73s	134 tok/s	6 tok/s	3.83s	10.14s	1884 MB	2600 MB
E4B gemma4_vision.pte	6.2 GB	1.36s	2.44s	85 tok/s	4 tok/s	4.17s	14.62s	3232 MB	3950 MB

Text ("Write a short paragraph about the history of artificial intelligence")

Model	Size	Load	Prefill	Gen	TTFT	Total	Mem load	Mem peak
E2B gemma4.pte	4.1 GB	625ms	57 tok/s	6 tok/s	332ms	26.94s	1890 MB	1950 MB
E4B gemma4.pte	6.1 GB	1.51s	38 tok/s	3 tok/s	506ms	44.66s	3231 MB	3287 MB
