[Pipelines] Right-size Gemma 4 activation-memory reservation by model dimensions #6698
Draft
msaelices wants to merge 5 commits into
…dims

BEGIN_PUBLIC
[Pipelines] Scale Gemma 4 activation-memory reservation by batch and dims

`Gemma4MemoryPlanner.estimate_activation_memory` reserved a flat per-device activation budget (15 GB with a bf16 KV cache, 30 GB with fp8), independent of batch size or model dimensions. On a low-concurrency single-GPU serve this dwarfs the KV cache and, with NVFP4 weights now runnable pre-Blackwell, keeps a model that physically fits on a 24 GB card from loading.

Scale the reservation with the widest per-token buffer (hidden vs MLP intermediate), the tokens processed per forward step, and a safety multiple, mirroring the principled per-step estimate in `Qwen3_5MemoryPlanner`. The result is capped at the previous flat value, so this can only lower the reservation, never raise the OOM risk relative to before; it falls back to the flat value when model dimensions are unavailable.

The safety multiple and per-step token count still need calibration against measured peak GPU memory (tracked in MODELS-1544) before the cap is relaxed.

Relates to modular#6667.
END_PUBLIC

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
The docstring claimed the change "can only lower the estimate, never raise the OOM risk relative to before" and that it "mirrors Qwen3_5MemoryPlanner". The cap only rules out extra over-reservation; the scaled value is an uncalibrated heuristic, not a proven activation-peak bound, so it can still under-reserve vs the true peak. Reword to state that honestly and drop the inaccurate Qwen comparison.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

Cover the behaviours issue modular#6696 asked for: the scaled estimate drops below the old flat value for small batch, is capped at the flat value (and the flat cap still scales with the KV cache dtype), falls back to flat when model dims are missing, scales by device count, and adds the graph-capture headroom. Also documents that the batch term is inert below the prefill floor (the estimate is model-size driven, not batch driven).

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

Condense the `estimate_activation_memory` docstring; keep the one-line summary and the uncalibrated-heuristic caveat.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>

Construct the planner via `__new__` to skip `__init__`'s model-config validation (`estimate_activation_memory` reads only its arguments, never `self`), and drop the unused transformers dep that pydeps rejected.

Assisted-by: AI
Signed-off-by: Manuel Saelices <[email protected]>
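
For readers unfamiliar with the `__new__` trick mentioned in the last commit, here is a minimal sketch of the idea. `StrictPlanner` and its `estimate` method are hypothetical stand-ins, not the real `Gemma4MemoryPlanner`; the only point illustrated is that `cls.__new__(cls)` allocates an instance without running `__init__`, which is safe only when the method under test never touches `self`.

```python
class StrictPlanner:
    """Hypothetical stand-in for a planner whose __init__ validates a config."""

    def __init__(self, config: dict) -> None:
        # Validation that a unit test may not want to satisfy.
        if "hidden_size" not in config:
            raise ValueError("config is missing hidden_size")
        self.config = config

    def estimate(self, tokens: int, width: int) -> int:
        # Reads only its arguments, never self, so an uninitialised
        # instance is enough to call it.
        return tokens * width


# Allocate the object without running __init__ (and its validation).
planner = StrictPlanner.__new__(StrictPlanner)
result = planner.estimate(8, 4)
```

The trade-off is that the instance has no attributes set, so this pattern breaks as soon as the method starts reading state from `self`.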
Summary
`Gemma4MemoryPlanner.estimate_activation_memory` reserved a flat per-device budget (15 GB bf16 KV / 30 GB fp8), independent of model size. On a low-concurrency single-GPU serve this dwarfs the KV cache and keeps an NVFP4 model that physically fits on a 24 GB card from loading.

This scales the reservation by the widest per-token buffer (hidden vs MLP intermediate) and a per-step token budget, capped at the old flat value so it never reserves more than before. Falls back to the flat value when dims are unavailable. Effect (31B, bf16 KV): 15 GiB → ~2.6 GiB.
Resolves #6696. Part of #6667; builds on #6668.
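
The scaling described in the summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the function name, the safety factor of 4, the 2-byte activation width, and the example dimensions are all hypothetical, and the flat budgets are treated as GiB. Device-count scaling and graph-capture headroom are left out. The flat caps and the 8192-token prefill floor come from the PR description.

```python
from typing import Optional

GIB = 1024**3

# From the PR description: flat per-device budgets and the prefill floor.
FLAT_BF16_KV_BYTES = 15 * GIB
FLAT_FP8_KV_BYTES = 30 * GIB
PREFILL_TOKENS_PER_STEP = 8192

# Hypothetical values, chosen only for illustration (the PR says the real
# safety factor is still uncalibrated).
SAFETY_FACTOR = 4
BYTES_PER_ACTIVATION = 2  # assumes 2-byte (bf16) activations


def sketch_activation_bytes(
    hidden_size: Optional[int],
    intermediate_size: Optional[int],
    max_batch: int,
    kv_is_fp8: bool = False,
    safety_factor: int = SAFETY_FACTOR,
) -> int:
    """Per-device activation reservation, capped at the old flat value."""
    flat = FLAT_FP8_KV_BYTES if kv_is_fp8 else FLAT_BF16_KV_BYTES
    # Fall back to the flat budget when model dimensions are unavailable.
    if not hidden_size or not intermediate_size:
        return flat
    # Widest per-token buffer: hidden state vs MLP intermediate.
    widest = max(hidden_size, intermediate_size)
    # Tokens per forward step never drop below the prefill floor.
    tokens = max(max_batch, PREFILL_TOKENS_PER_STEP)
    scaled = widest * tokens * BYTES_PER_ACTIVATION * safety_factor
    # The cap means this never reserves more than the flat value did.
    return min(scaled, flat)


# Illustrative dimensions only, not Gemma 4's real configuration.
small = sketch_activation_bytes(4096, 16384, max_batch=1)
medium = sketch_activation_bytes(4096, 16384, max_batch=512)
```

With these made-up numbers, `small` and `medium` are equal because both batch sizes sit below the 8192-token floor, which is the inert-batch behaviour the tests document.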
Honest scope
- In the `scaled < true_peak < flat` regime it can under-reserve.
- `_ACTIVATION_SAFETY_FACTOR` / `_PREFILL_TOKENS_PER_STEP` need calibration against measured peak GPU memory (MODELS-1544) before relying on it; draft until then.
- The per-step token count is `max(max_batch, _PREFILL_TOKENS_PER_STEP=8192)`, so `max_batch` below 8192 does not change it (documented by a test).

Tests
Unit tests added (`test_memory_planner.py`): below-flat for small batch, capped at flat (and the cap tracks KV dtype), fallback-to-flat on missing dims, device-count scaling, graph-capture headroom, and the inert-batch-term behaviour.

Assisted-by: AI