msaelices · 2026-06-19T11:21:50Z

Summary

Gemma4MemoryPlanner.estimate_activation_memory reserved a flat per-device budget (15 GB bf16 KV / 30 GB fp8), independent of model size. On a low-concurrency single-GPU serve this dwarfs the KV cache and keeps an NVFP4 model that physically fits on a 24 GB card from loading.

This scales the reservation by the widest per-token buffer (hidden vs MLP intermediate) and a per-step token budget, capped at the old flat value so it never reserves more than before. Falls back to the flat value when dims are unavailable. Effect (31B, bf16 KV): 15 GiB → ~2.6 GiB.
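The scaling rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name, the `bytes_per_elem` parameter, and the safety-factor value of 4 are assumptions, and the sketch leaves out the device-count scaling and graph-capture headroom mentioned under Tests. Only the 8192 per-step token count and the 15/30 flat figures come from the description.

```python
from typing import Optional

GIB = 1024**3

# Value quoted in the PR description.
PREFILL_TOKENS_PER_STEP = 8192
# Assumed placeholder: the PR names _ACTIVATION_SAFETY_FACTOR but does not
# state its value, and says it is still uncalibrated.
ACTIVATION_SAFETY_FACTOR = 4


def flat_reservation_bytes(kv_is_fp8: bool) -> int:
    """Old flat per-device budget: 15 with a bf16 KV cache, 30 with fp8."""
    return (30 if kv_is_fp8 else 15) * GIB


def scaled_activation_bytes(
    hidden_size: Optional[int],
    intermediate_size: Optional[int],
    max_batch: int,
    flat_bytes: int,
    bytes_per_elem: int = 2,  # assumed bf16 activations
) -> int:
    """Hypothetical per-device estimate: scaled, then capped at the flat value."""
    # Fall back to the flat value when model dimensions are unavailable.
    if not hidden_size or not intermediate_size:
        return flat_bytes
    # Widest per-token buffer: hidden state vs MLP intermediate.
    widest = max(hidden_size, intermediate_size)
    # Per-step token budget, floored at the prefill token count.
    tokens = max(max_batch, PREFILL_TOKENS_PER_STEP)
    scaled = widest * tokens * bytes_per_elem * ACTIVATION_SAFETY_FACTOR
    # The cap guarantees the result never exceeds the old reservation.
    return min(scaled, flat_bytes)
```

With illustrative dimensions (not Gemma 4's actual config) such as `hidden_size=4096` and `intermediate_size=16384`, the scaled value sits well below the flat cap, while very wide hypothetical dimensions hit the cap and missing dimensions return the flat value unchanged.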

Resolves #6696. Part of #6667; builds on #6668.

Honest scope

The cap only rules out extra over-reservation. The scaled value is an uncalibrated heuristic, not a proven activation-peak bound, so in the scaled < true_peak < flat regime it can under-reserve. _ACTIVATION_SAFETY_FACTOR / _PREFILL_TOKENS_PER_STEP need calibration against measured peak GPU memory (MODELS-1544) before relying on it — draft until then.
The estimate is model-size driven, not batch driven: the per-step token count is max(max_batch, _PREFILL_TOKENS_PER_STEP=8192), so max_batch below 8192 does not change it (documented by a test).
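To make the inert batch term concrete, here is a small sketch of the per-step token count as the description states it. The helper name `tokens_per_step` is hypothetical; the `max(...)` expression and the 8192 floor are taken from the text above.

```python
# Value quoted in the PR description for _PREFILL_TOKENS_PER_STEP.
PREFILL_TOKENS_PER_STEP = 8192


def tokens_per_step(max_batch: int) -> int:
    """Per-step token count: the batch size, floored at the prefill budget."""
    return max(max_batch, PREFILL_TOKENS_PER_STEP)


# Any max_batch below the floor yields the same token count, so the
# estimate does not move with batch size in that range.
assert tokens_per_step(1) == tokens_per_step(256) == 8192
# Only a batch above the floor changes the result.
assert tokens_per_step(16384) == 16384
```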

Tests

Unit tests added (test_memory_planner.py): below-flat for small batch, capped at flat (and the cap tracks KV dtype), fallback-to-flat on missing dims, device-count scaling, graph-capture headroom, and the inert-batch-term behaviour.

Assisted-by: AI

…dims

BEGIN_PUBLIC
[Pipelines] Scale Gemma 4 activation-memory reservation by batch and dims

`Gemma4MemoryPlanner.estimate_activation_memory` reserved a flat per-device activation budget (15 GB with a bf16 KV cache, 30 GB with fp8), independent of batch size or model dimensions. On a low-concurrency single-GPU serve this dwarfs the KV cache and, with NVFP4 weights now runnable pre-Blackwell, keeps a model that physically fits on a 24 GB card from loading.

Scale the reservation with the widest per-token buffer (hidden vs MLP intermediate), the tokens processed per forward step, and a safety multiple, mirroring the principled per-step estimate in `Qwen3_5MemoryPlanner`. The result is capped at the previous flat value, so this can only lower the reservation, never raise the OOM risk relative to before; it falls back to the flat value when model dimensions are unavailable.

The safety multiple and per-step token count still need calibration against measured peak GPU memory (tracked in MODELS-1544) before the cap is relaxed.

Relates to modular#6667.
END_PUBLIC

Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>

The docstring claimed the change "can only lower the estimate, never raise the OOM risk relative to before" and that it "mirrors Qwen3_5MemoryPlanner". The cap only rules out extra over-reservation; the scaled value is an uncalibrated heuristic, not a proven activation-peak bound, so it can still under-reserve vs the true peak. Reword to state that honestly and drop the inaccurate Qwen comparison. Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>

Cover the behaviours issue modular#6696 asked for: the scaled estimate drops below the old flat value for small batch, is capped at the flat value (and the flat cap still scales with the KV cache dtype), falls back to flat when model dims are missing, scales by device count, and adds the graph-capture headroom. Also documents that the batch term is inert below the prefill floor (the estimate is model-size driven, not batch driven). Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>

Condense the estimate_activation_memory docstring; keep the one-line summary and the uncalibrated-heuristic caveat. Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>

Construct the planner via __new__ to skip __init__'s model-config validation (estimate_activation_memory reads only its arguments, never self), and drop the unused transformers dep that pydeps rejected. Assisted-by: AI Signed-off-by: Manuel Saelices <[email protected]>
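The `__new__` technique mentioned in this commit can be illustrated with a stand-in class. `Planner`, its validation, and its `estimate` method are hypothetical and only mimic the shape described above (a constructor that validates its config, and a method that never reads `self`); they are not the real `Gemma4MemoryPlanner`.

```python
class Planner:
    """Hypothetical stand-in for a planner whose __init__ validates its config."""

    def __init__(self, model_config):
        if model_config is None:
            raise ValueError("model_config is required")
        self.model_config = model_config

    def estimate(self, hidden_size: int, tokens: int) -> int:
        # Reads only its arguments, never self, so no instance state is needed.
        return hidden_size * tokens


# object.__new__ allocates the instance without running __init__,
# so the validation above is skipped entirely.
planner = Planner.__new__(Planner)

assert planner.estimate(8, 4) == 32
# No attributes were set, because __init__ never ran.
assert not hasattr(planner, "model_config")
```

This only stays safe while the method under test really is independent of instance state; if it later starts reading `self`, a test built this way would fail with an `AttributeError`.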

github-actions Bot added the waiting-on-review label Jun 19, 2026

msaelices added 2 commits June 19, 2026 16:44

msaelices changed the title ~~[Pipelines] Scale Gemma 4 activation-memory reservation by batch and dims~~ [Pipelines] Right-size Gemma 4 activation-memory reservation by model dimensions Jun 19, 2026

msaelices added 2 commits June 19, 2026 17:16

[Pipelines] Trim Gemma 4 activation-memory docstring

3bc6a5a



[Pipelines] Right-size Gemma 4 activation-memory reservation by model dimensions #6698

msaelices wants to merge 5 commits into modular:main from msaelices:gemma4-activation-memory-headroom
