[Pipelines] Right-size Gemma 4 activation-memory reservation by model dimensions #6698

Draft
msaelices wants to merge 5 commits into modular:main from msaelices:gemma4-activation-memory-headroom

msaelices (Contributor) commented on Jun 19, 2026

Summary

Gemma4MemoryPlanner.estimate_activation_memory reserved a flat per-device activation budget (15 GB with a bf16 KV cache, 30 GB with fp8), independent of model size. On a low-concurrency single-GPU serve this dwarfs the KV cache and prevents an NVFP4 model that physically fits on a 24 GB card from loading.

This PR scales the reservation by the widest per-token buffer (hidden vs MLP intermediate) and a per-step token budget, capped at the old flat value so it never reserves more than before. It falls back to the flat value when model dims are unavailable. Effect (31B, bf16 KV): 15 GiB → ~2.6 GiB.
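The scaling described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name `scaled_activation_bytes`, the 2-byte element size, the safety factor of 8, and the example dimensions are all assumptions chosen for illustration (they happen to land in the same range as the ~2.6 GiB figure quoted above). Only the constant names and the 8192 per-step token floor come from the PR text. Device-count scaling and graph-capture headroom are left out.

```python
from typing import Optional

GIB = 1024**3

# Names follow the PR text; the safety-factor value is an illustrative guess.
_ACTIVATION_SAFETY_FACTOR = 8
_PREFILL_TOKENS_PER_STEP = 8192  # value quoted in the PR description
_BYTES_PER_ELEMENT = 2  # assumption: bf16 activations


def scaled_activation_bytes(
    hidden_size: Optional[int],
    intermediate_size: Optional[int],
    max_batch: int,
    flat_bytes: int,
) -> int:
    """Per-device activation reservation, capped at the old flat value."""
    if not hidden_size or not intermediate_size:
        # Model dims unavailable: keep the old flat reservation.
        return flat_bytes
    widest = max(hidden_size, intermediate_size)
    tokens = max(max_batch, _PREFILL_TOKENS_PER_STEP)
    scaled = widest * tokens * _BYTES_PER_ELEMENT * _ACTIVATION_SAFETY_FACTOR
    # The cap only guarantees we never reserve more than before.
    return min(scaled, flat_bytes)


flat = 15 * GIB
# Hypothetical dims (hidden=5376, intermediate=21504), not read from the PR.
example = scaled_activation_bytes(5376, 21504, max_batch=1, flat_bytes=flat)
print(example / GIB)  # 2.625
```

The `min` against `flat_bytes` is what makes the change safe against over-reservation only; nothing in this shape bounds the true activation peak from above.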

Resolves #6696. Part of #6667; builds on #6668.

Honest scope

  • The cap only rules out extra over-reservation. The scaled value is an uncalibrated heuristic, not a proven activation-peak bound, so in the scaled < true_peak < flat regime it can under-reserve. _ACTIVATION_SAFETY_FACTOR / _PREFILL_TOKENS_PER_STEP need calibration against measured peak GPU memory (MODELS-1544) before the estimate can be relied on; this PR stays a draft until then.
  • The estimate is model-size driven, not batch driven: the per-step token count is max(max_batch, _PREFILL_TOKENS_PER_STEP=8192), so max_batch below 8192 does not change it (documented by a test).
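Both caveats in the list above can be made concrete with a small sketch. The helper names `tokens_per_step` and `can_under_reserve` are hypothetical and exist only for this illustration; the 8192 floor is the value quoted in the PR text.

```python
_PREFILL_TOKENS_PER_STEP = 8192  # value quoted in the PR text


def tokens_per_step(max_batch: int) -> int:
    """Per-step token count: the batch term only matters above the floor."""
    return max(max_batch, _PREFILL_TOKENS_PER_STEP)


def can_under_reserve(scaled: int, true_peak: int, flat: int) -> bool:
    """True in the regime where the old flat value covered the real peak
    but the new scaled value does not (scaled < true_peak < flat)."""
    return scaled < true_peak < flat


# Any max_batch at or below the floor yields the same token count.
print(tokens_per_step(1), tokens_per_step(4096), tokens_per_step(16384))
# 8192 8192 16384
```

Because the floor dominates for typical serving batch sizes, the estimate ends up driven by model width rather than by concurrency.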

Tests

Unit tests added (test_memory_planner.py): below-flat for small batch, capped at flat (and the cap tracks KV dtype), fallback-to-flat on missing dims, device-count scaling, graph-capture headroom, and the inert-batch-term behaviour.

Assisted-by: AI


BEGIN_PUBLIC
[Pipelines] Scale Gemma 4 activation-memory reservation by batch and dims

`Gemma4MemoryPlanner.estimate_activation_memory` reserved a flat per-device
activation budget (15 GB with a bf16 KV cache, 30 GB with fp8), independent
of batch size or model dimensions. On a low-concurrency single-GPU serve
this dwarfs the KV cache and, with NVFP4 weights now runnable pre-Blackwell,
keeps a model that physically fits on a 24 GB card from loading.

Scale the reservation with the widest per-token buffer (hidden vs MLP
intermediate), the tokens processed per forward step, and a safety multiple,
mirroring the principled per-step estimate in `Qwen3_5MemoryPlanner`. The
result is capped at the previous flat value, so this can only lower the
reservation, never raise the OOM risk relative to before; it falls back to
the flat value when model dimensions are unavailable.

The safety multiple and per-step token count still need calibration against
measured peak GPU memory (tracked in MODELS-1544) before the cap is relaxed.

Relates to modular#6667.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
The docstring claimed the change "can only lower the estimate, never raise
the OOM risk relative to before" and that it "mirrors Qwen3_5MemoryPlanner".
The cap only rules out extra over-reservation; the scaled value is an
uncalibrated heuristic, not a proven activation-peak bound, so it can still
under-reserve vs the true peak. Reword to state that honestly and drop the
inaccurate Qwen comparison.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
Cover the behaviours issue modular#6696 asked for: the scaled estimate drops below
the old flat value for small batch, is capped at the flat value (and the
flat cap still scales with the KV cache dtype), falls back to flat when
model dims are missing, scales by device count, and adds the graph-capture
headroom. Also documents that the batch term is inert below the prefill
floor (the estimate is model-size driven, not batch driven).

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
msaelices changed the title from "[Pipelines] Scale Gemma 4 activation-memory reservation by batch and dims" to "[Pipelines] Right-size Gemma 4 activation-memory reservation by model dimensions" on Jun 19, 2026
Condense the estimate_activation_memory docstring; keep the one-line
summary and the uncalibrated-heuristic caveat.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
Construct the planner via __new__ to skip __init__'s model-config
validation (estimate_activation_memory reads only its arguments, never
self), and drop the unused transformers dep that pydeps rejected.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
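
The `__new__` trick from the commit above is a general Python technique and can be sketched with a stand-in class. `Planner`, its `estimate` method, and the validation in `__init__` are hypothetical and only mimic the shape described in the commit message; they are not the real `Gemma4MemoryPlanner`.

```python
class Planner:
    """Stand-in for a planner whose __init__ validates a model config."""

    def __init__(self, model_config):
        if model_config is None:
            raise ValueError("model_config is required")
        self.model_config = model_config

    def estimate(self, width: int, tokens: int) -> int:
        # Reads only its arguments, never self.
        return width * tokens


# Allocate an instance without running __init__ (and its validation).
planner = Planner.__new__(Planner)
result = planner.estimate(4, 8)
print(result)  # 32
```

The instance is uninitialized, so this is only safe while the method under test never touches `self`; if that changes, the test would fail with an `AttributeError` rather than silently passing.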
Development

Successfully merging this pull request may close these issues.

[Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation for 24 GB pre-Blackwell GPUs
