[Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation for 24 GB pre-Blackwell GPUs #6696

@msaelices

[Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation so it fits on 24 GB pre-Blackwell GPUs

Sub-task of #6667 (NVFP4 dequant fallback for pre-Blackwell NVIDIA GPUs). Builds on #6668 (the fused dequant-GEMV kernel).

Problem

With #6668 the NVFP4 matmul fallback runs on Ampere/Ada (sm_8x/sm_89), so Gemma 4 NVFP4 weights (~16-19 GB) physically fit on a 24 GB card. But Gemma4MemoryPlanner.estimate_activation_memory (max/python/max/pipelines/architectures/gemma4/memory_planner.py) reserves a flat activation budget, independent of batch size or model dims:

base = (30 // kv_cache.cache_dtype.size_in_bytes) * 1024**3   # 15 GB (bf16 KV) / 30 GB (fp8 KV)
if device_graph_capture:
    base += 2 * 1024**3   # extra 2 GB of headroom for device graph capture
return base * num_devices

On a single 24 GB GPU serving NVFP4 at low concurrency, ~15 GB of activation reservation on top of ~16-19 GB of weights already exceeds the card's capacity before any KV cache is allocated. That trips the static_memory_size > free_memory guard, so a model the kernel can now run still fails to load. The estimate is already flagged as too high (TODO(MODELS-1544)), and the value is the same whether max_batch_size is 1 or 512.
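The arithmetic behind that failure can be reproduced in isolation. The sketch below is a standalone mirror of the snippet quoted above, not the planner itself: flat_activation_budget and static_memory_gib are hypothetical helper names, the KV dtype is passed as a plain byte count, and "GB" is treated as GiB because the quoted code multiplies by 1024**3.

```python
GIB = 1024**3


def flat_activation_budget(
    kv_dtype_bytes: int,
    device_graph_capture: bool = False,
    num_devices: int = 1,
) -> int:
    """Standalone mirror of the flat reservation quoted above, in bytes."""
    base = (30 // kv_dtype_bytes) * GIB  # 15 GiB for 2-byte KV, 30 GiB for 1-byte KV
    if device_graph_capture:
        base += 2 * GIB
    return base * num_devices


def static_memory_gib(weights_gib: float, kv_dtype_bytes: int) -> float:
    """Weights plus the flat activation reservation, before any KV cache."""
    return weights_gib + flat_activation_budget(kv_dtype_bytes) / GIB


# The weight sizes are the ~16-19 GB range reported in this issue.
for weights_gib in (16, 19):
    total = static_memory_gib(weights_gib, kv_dtype_bytes=2)
    print(f"weights {weights_gib} GiB + flat activations = {total:.0f} GiB vs 24 on the card")
```

With bf16 KV this gives 31 to 34 GiB of static memory against a 24 GB card, regardless of max_batch_size.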

Goal

Make the Gemma 4 activation reservation scale with the actual workload (batch size and model dims) instead of a flat constant, following the principled per-request × max_batch_size pattern already used by Qwen3_5MemoryPlanner.estimate_activation_memory.

Constraints / safety

  • Must never under-reserve and cause a runtime OOM during prefill or vision processing. The flat value exists as conservative headroom for those peaks.
  • First step can be strictly non-regressing: cap the new estimate at the current flat value (min(scaled_estimate, flat)), so it can only lower the reservation for small-batch / single-GPU serving and never reserve more than today.
  • Keep the device_graph_capture headroom and the per-device scaling.
  • The vision-cache reservation path (estimate_vision_cache_entry_bytes) is separate and unchanged.

Suggested shape

  • Derive a per-token activation estimate from text_config.hidden_size / intermediate_size and the model (activation) dtype, times the tokens in flight in one forward step (max_batch_size for decode; chunked prefill for the prefill peak), with a safety multiple for simultaneously-live temporaries.
  • Return min(principled, flat_current) initially; calibrate the safety multiple against measured peak GPU memory (an A10G / L40S NVFP4 run) before removing the cap.
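One possible reading of the two bullets above, as a minimal sketch rather than a proposed patch. Everything named here is an assumption for illustration: TextDims stands in for the text_config fields, prefill_chunk_tokens is a hypothetical parameter for the chunked-prefill size, the per-token footprint (one hidden row plus one MLP intermediate row) is a guess, and safety_multiple=4 is an uncalibrated placeholder that would need the measured-peak calibration described above.

```python
from dataclasses import dataclass

GIB = 1024**3


@dataclass(frozen=True)
class TextDims:
    """Hypothetical stand-in for the text_config fields named in the issue."""
    hidden_size: int
    intermediate_size: int


def capped_activation_estimate(
    dims: TextDims,
    activation_dtype_bytes: int,
    max_batch_size: int,
    prefill_chunk_tokens: int,
    kv_dtype_bytes: int,
    safety_multiple: int = 4,  # placeholder, NOT calibrated against real peaks
    device_graph_capture: bool = False,
    num_devices: int = 1,
) -> int:
    """Scaled activation reservation in bytes, capped at today's flat value."""
    # Decode handles one token per request; prefill handles one chunk.
    tokens_in_flight = max(max_batch_size, prefill_chunk_tokens)
    # Assumed per-token footprint: hidden-state row + MLP intermediate row.
    per_token = (dims.hidden_size + dims.intermediate_size) * activation_dtype_bytes
    principled = per_token * tokens_in_flight * safety_multiple
    # Current flat value per device, as quoted in the Problem section.
    flat = (30 // kv_dtype_bytes) * GIB
    per_device = min(principled, flat)  # can only lower the reservation
    if device_graph_capture:
        per_device += 2 * GIB  # keep the existing graph-capture headroom
    return per_device * num_devices  # keep the per-device scaling


# Illustrative dims only; these are not Gemma 4's real config values.
dims = TextDims(hidden_size=4096, intermediate_size=16384)
small = capped_activation_estimate(dims, 2, max_batch_size=1,
                                   prefill_chunk_tokens=8192, kv_dtype_bytes=2)
print(f"small-batch reservation: {small / GIB:.2f} GiB")
```

Because the cap is applied before the graph-capture headroom and device multiplier, the result can never exceed what the flat formula returns for the same arguments, which is the non-regressing property the constraints ask for. With the made-up dims above, the small-batch case reserves 1.25 GiB instead of 15 GiB.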

Verification

  • Unit: assert the reservation drops for small max_batch_size and stays <= the current flat value for large batch.
  • E2E: a Gemma 4 NVFP4 checkpoint (e.g. berkerdooo/gemma-4-12B-it-NVFP4, RedHatAI/gemma-4-26B-A4B-it-NVFP4) loads and serves on a 24 GB card without OOM, output unchanged.

Context

Found while serving Gemma 4 NVFP4 on g5/g6 instances. Related: #6667, #6668, #6670, #6666.
