[Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation so it fits on 24 GB pre-Blackwell GPUs
Sub-task of #6667 (NVFP4 dequant fallback for pre-Blackwell NVIDIA GPUs). Builds on #6668 (the fused dequant-GEMV kernel).
Problem
With #6668 the NVFP4 matmul fallback runs on Ampere/Ada (sm_8x/sm_89), so Gemma 4 NVFP4 weights (~16-19 GB) physically fit on a 24 GB card. But Gemma4MemoryPlanner.estimate_activation_memory (max/python/max/pipelines/architectures/gemma4/memory_planner.py) reserves a flat activation budget, independent of batch size or model dims:
    base = (30 // kv_cache.cache_dtype.size_in_bytes) * 1024**3  # 15 GB (bf16 KV) / 30 GB (fp8 KV)
    if device_graph_capture:
        base += 2 * 1024**3  # 2 GB of extra headroom for device graph capture
    return base * num_devices
On a single 24 GB GPU serving NVFP4 at low concurrency, ~15 GB of activation reservation on top of ~16-19 GB of weights leaves no room for KV cache and trips the static_memory_size > free_memory guard, so a model that the kernel can now run still fails to load. The estimate is already flagged as too high (TODO(MODELS-1544)), and the value is the same whether max_batch_size is 1 or 512.
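To make the arithmetic concrete, here is a rough sketch that reproduces the flat formula quoted above and checks it against a 24 GB card. The function names, the 16 GB weight figure (the low end of the range above), and the simplified fit check are illustrative assumptions, not the planner's actual code.

```python
GIB = 1024**3


def flat_activation_reservation(
    kv_dtype_bytes: int,
    device_graph_capture: bool = False,
    num_devices: int = 1,
) -> int:
    """Mirror of the flat formula quoted in this issue (hypothetical helper)."""
    base = (30 // kv_dtype_bytes) * GIB  # 15 GiB for bf16 KV, 30 GiB for fp8 KV
    if device_graph_capture:
        base += 2 * GIB
    return base * num_devices


def fits(weights_bytes: int, activation_bytes: int, free_bytes: int) -> bool:
    """Simplified stand-in for the static-memory guard: ignores KV cache entirely."""
    return weights_bytes + activation_bytes <= free_bytes


weights = 16 * GIB  # assumed: low end of the ~16-19 GB weight range
activations = flat_activation_reservation(kv_dtype_bytes=2)
print(activations // GIB)                 # 15
print(fits(weights, activations, 24 * GIB))  # False: 31 GiB needed before any KV cache
```

Even with the smallest weight footprint and no KV cache at all, the static total already exceeds the card.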
Goal
Make the Gemma 4 activation reservation scale with the actual workload (batch size and model dims) instead of a flat constant, following the principled per-request × max_batch_size pattern already used by Qwen3_5MemoryPlanner.estimate_activation_memory.
Constraints / safety
- Must never under-reserve and cause a runtime OOM during prefill or vision processing. The flat value exists as conservative headroom for those peaks.
- First step can be strictly non-regressing: cap the new estimate at the current flat value (min(scaled_estimate, flat)), so it can only lower the reservation for small-batch / single-GPU serving and never reserve more than today.
- Keep the device_graph_capture headroom and the per-device scaling.
- The vision-cache reservation path (estimate_vision_cache_entry_bytes) is separate and unchanged.
Suggested shape
- Derive a per-token activation estimate from text_config.hidden_size / intermediate_size and the model (activation) dtype, times the tokens in flight in one forward step (max_batch_size for decode; chunked prefill for the prefill peak), with a safety multiple for simultaneously-live temporaries.
- Return min(principled, flat_current) initially; calibrate the safety multiple against measured peak GPU memory (an A10G / L40S NVFP4 run) before removing the cap.
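One possible shape for the capped estimate, as a minimal sketch. Everything here is an assumption for illustration: the function names, the per-token formula (one hidden-state row plus gate and up MLP intermediates), the prefill_chunk_tokens parameter, and the default safety multiple of 4 are placeholders to be calibrated, not measured values or existing planner code.

```python
GIB = 1024**3


def flat_reservation(kv_dtype_bytes: int, device_graph_capture: bool) -> int:
    """Current flat per-device value, as quoted in this issue."""
    base = (30 // kv_dtype_bytes) * GIB
    if device_graph_capture:
        base += 2 * GIB
    return base


def scaled_activation_estimate(
    *,
    hidden_size: int,
    intermediate_size: int,
    activation_dtype_bytes: int,
    max_batch_size: int,
    prefill_chunk_tokens: int,
    safety_multiple: int = 4,  # placeholder; calibrate against measured peaks
) -> int:
    # Assumed per-token footprint: hidden state plus gate and up projections.
    per_token = (hidden_size + 2 * intermediate_size) * activation_dtype_bytes
    # Decode keeps one token per request in flight; prefill keeps one chunk.
    tokens_in_flight = max(max_batch_size, prefill_chunk_tokens)
    return per_token * tokens_in_flight * safety_multiple


def capped_activation_reservation(
    *,
    kv_dtype_bytes: int,
    device_graph_capture: bool = False,
    num_devices: int = 1,
    **dims: int,
) -> int:
    scaled = scaled_activation_estimate(**dims)
    if device_graph_capture:
        scaled += 2 * GIB  # keep the existing graph-capture headroom
    # Non-regressing first step: never reserve more than the flat value.
    capped = min(scaled, flat_reservation(kv_dtype_bytes, device_graph_capture))
    return capped * num_devices  # keep the per-device scaling
```

Applying the cap before the per-device multiply keeps the result bounded by today's value for any num_devices.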
Verification
- Unit: assert the reservation drops for small max_batch_size and stays <= the current flat value for large batch.
- E2E: a Gemma 4 NVFP4 checkpoint (e.g. berkerdooo/gemma-4-12B-it-NVFP4, RedHatAI/gemma-4-26B-A4B-it-NVFP4) loads and serves on a 24 GB card without OOM, output unchanged.
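The unit check could look roughly like the sketch below. The capped_reservation function is a hypothetical stand-in for the planner, and the 64 MiB per-request cost is an arbitrary placeholder; a real test would call Gemma4MemoryPlanner.estimate_activation_memory with a suitable config instead.

```python
GIB = 1024**3
FLAT_BYTES = 15 * GIB              # current flat value for a bf16 KV cache
PER_REQUEST_BYTES = 64 * 1024**2   # hypothetical per-request activation cost
BATCH_SIZES = (1, 8, 64, 512, 4096)


def capped_reservation(max_batch_size: int) -> int:
    """Hypothetical stand-in: per-request cost times batch, capped at flat."""
    return min(PER_REQUEST_BYTES * max_batch_size, FLAT_BYTES)


def test_reservation_drops_for_small_batch() -> None:
    assert capped_reservation(1) < FLAT_BYTES


def test_reservation_never_exceeds_flat() -> None:
    for batch in BATCH_SIZES:
        assert capped_reservation(batch) <= FLAT_BYTES


def test_reservation_is_monotonic_in_batch() -> None:
    sizes = [capped_reservation(batch) for batch in BATCH_SIZES]
    assert sizes == sorted(sizes)
```

The monotonicity check is an extra guard not listed above; it catches a scaled estimate that accidentally shrinks as the batch grows.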
Context
Found while serving Gemma 4 NVFP4 on g5/g6 instances. Related: #6667, #6668, #6670, #6666.