[Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation for 24 GB pre-Blackwell GPUs

# [Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation so it fits on 24 GB pre-Blackwell GPUs

_Sub-task of #6667 (NVFP4 dequant fallback for pre-Blackwell NVIDIA GPUs). Builds on #6668 (the fused dequant-GEMV kernel)._

## Problem

With #6668 the NVFP4 matmul fallback runs on Ampere/Ada (sm_8x/sm_89), so Gemma 4 NVFP4 weights (~16-19 GB) physically fit on a 24 GB card. But `Gemma4MemoryPlanner.estimate_activation_memory` (`max/python/max/pipelines/architectures/gemma4/memory_planner.py`) reserves a **flat** activation budget, independent of batch size or model dims:

```python
base = (30 // kv_cache.cache_dtype.size_in_bytes) * 1024**3   # 15 GB (bf16 KV) / 30 GB (fp8 KV)
if device_graph_capture:
    base += 2 * 1024**3                                        # +2 GB graph-capture headroom
return base * num_devices
```

On a single 24 GB GPU serving NVFP4 at low concurrency, the ~15 GB activation reservation plus ~16-19 GB of weights totals ~31-34 GB. That exceeds the card before any KV cache is allocated and trips the `static_memory_size > free_memory` guard, so a model the kernel can now run still fails to load. The estimate is already flagged as too high (`TODO(MODELS-1544)`), and its value is the same whether `max_batch_size` is 1 or 512.
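To make the arithmetic concrete, the sketch below mirrors the flat formula quoted above and compares the result with a 24 GB card. The function name `flat_activation_reservation` is hypothetical, and approximating the static footprint as weights plus activation reservation is an assumption for illustration, not the planner's exact accounting.

```python
GIB = 1024**3


def flat_activation_reservation(
    kv_dtype_bytes: int,
    device_graph_capture: bool = False,
    num_devices: int = 1,
) -> int:
    """Illustrative mirror of the flat formula (not the MAX source)."""
    base = (30 // kv_dtype_bytes) * GIB
    if device_graph_capture:
        base += 2 * GIB
    return base * num_devices


card = 24 * GIB
weights = 16 * GIB  # low end of the ~16-19 GB range quoted above
activations = flat_activation_reservation(kv_dtype_bytes=2)  # 2-byte (bf16) KV cache

# Assumed approximation of the static footprint: weights + activation reservation.
static_memory_size = weights + activations

print(f"activation reservation: {activations / GIB:.0f} GiB")
print(f"static footprint:       {static_memory_size / GIB:.0f} GiB")
print(f"exceeds 24 GiB card:    {static_memory_size > card}")
```

Even with the smallest weight footprint and no KV cache at all, the static total is already larger than the card.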

## Goal

Make the Gemma 4 activation reservation scale with the actual workload (batch size and model dims) instead of a flat constant, following the principled per-request × `max_batch_size` pattern already used by `Qwen3_5MemoryPlanner.estimate_activation_memory`.

## Constraints / safety

- Must never *under*-reserve and cause a runtime OOM during prefill or vision processing. The flat value exists as conservative headroom for those peaks.
- The first step can be strictly non-regressing: cap the new estimate at the current flat value (`min(scaled_estimate, flat)`), so it can only *lower* the reservation for small-batch / single-GPU serving and never reserves more than today.
- Keep the `device_graph_capture` headroom and the per-device scaling.
- The vision-cache reservation path (`estimate_vision_cache_entry_bytes`) is separate and unchanged.
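A minimal sketch of how the first three constraints could combine, assuming the scaled estimate is computed elsewhere and passed in as bytes. The function name and signature are hypothetical; only the flat formula and the 2 GB graph-capture headroom come from the snippet in the Problem section.

```python
GIB = 1024**3


def capped_activation_reservation(
    scaled_estimate_bytes: int,
    kv_dtype_bytes: int,
    device_graph_capture: bool,
    num_devices: int,
) -> int:
    """Hypothetical non-regressing wrapper around a workload-scaled estimate."""
    # Today's flat per-device value, used only as an upper bound.
    flat = (30 // kv_dtype_bytes) * GIB
    # The cap is applied before headroom and device scaling, so the result
    # can never exceed what the flat formula returns for the same inputs.
    base = min(scaled_estimate_bytes, flat)
    if device_graph_capture:
        base += 2 * GIB  # keep the existing graph-capture headroom
    return base * num_devices  # keep the existing per-device scaling


# Small workload: the scaled estimate wins and the reservation drops.
small = capped_activation_reservation(1 * GIB, 2, True, 1)
# Oversized estimate: the cap holds the result at today's value.
large = capped_activation_reservation(40 * GIB, 2, True, 1)
print(small // GIB, large // GIB)
```

Applying `min` to the per-device base, rather than to the final product, keeps the comparison with the current behavior exact for any `num_devices`.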

## Suggested shape

- Derive a per-token activation estimate from `text_config.hidden_size` / `intermediate_size` and the model (activation) dtype, times the tokens in flight in one forward step (`max_batch_size` for decode; chunked prefill for the prefill peak), with a safety multiple for simultaneously-live temporaries.
- Return `min(principled, flat_current)` initially; calibrate the safety multiple against measured peak GPU memory (an A10G / L40S NVFP4 run) before removing the cap.
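One possible reading of the per-token estimate, as a sketch only. `TextDims`, `principled_activation_bytes`, `prefill_chunk_tokens`, the per-token formula, and the default safety multiple of 4 are all assumptions for illustration; the dimensions in the example are placeholders, not actual Gemma 4 config values, and the safety multiple is exactly the quantity that would need calibrating against measured peaks.

```python
from dataclasses import dataclass

GIB = 1024**3


@dataclass(frozen=True)
class TextDims:
    """Hypothetical stand-in for the fields read from text_config."""

    hidden_size: int
    intermediate_size: int


def principled_activation_bytes(
    dims: TextDims,
    act_dtype_bytes: int,
    max_batch_size: int,
    prefill_chunk_tokens: int,
    safety_multiple: float = 4.0,  # uncalibrated placeholder
) -> int:
    # Assumed per-token working set: one hidden-state row plus one MLP
    # intermediate row, in the activation dtype.
    per_token = (dims.hidden_size + dims.intermediate_size) * act_dtype_bytes
    # Tokens in flight in one forward step: one per request for decode,
    # up to one chunk for prefill. Reserve for the larger of the two.
    tokens_in_flight = max(max_batch_size, prefill_chunk_tokens)
    return int(per_token * tokens_in_flight * safety_multiple)


# Placeholder dimensions, chosen only to exercise the formula.
dims = TextDims(hidden_size=4096, intermediate_size=16384)
estimate = principled_activation_bytes(
    dims, act_dtype_bytes=2, max_batch_size=1, prefill_chunk_tokens=8192
)
print(f"{estimate / GIB:.2f} GiB")
```

With these placeholder inputs the prefill chunk dominates, so the estimate is the same for any `max_batch_size` up to the chunk size and grows linearly beyond it. The result is intended to be passed through the `min(principled, flat_current)` cap until the safety multiple has been calibrated.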

## Verification

- Unit: assert the reservation drops for small `max_batch_size` and stays `<=` the current flat value for large batch.
- E2E: a Gemma 4 NVFP4 checkpoint (e.g. `berkerdooo/gemma-4-12B-it-NVFP4`, `RedHatAI/gemma-4-26B-A4B-it-NVFP4`) loads and serves on a 24 GB card without OOM, output unchanged.
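The unit check could look roughly like the following. `estimate` is a hypothetical stand-in for the planner (linear in batch size, capped at the flat value), and the per-request constant is a placeholder rather than a measured number; a real test would call `Gemma4MemoryPlanner.estimate_activation_memory` with a suitable config instead.

```python
GIB = 1024**3
FLAT = 15 * GIB  # current flat value for a 2-byte KV cache dtype


def estimate(max_batch_size: int) -> int:
    """Hypothetical stand-in: per-request cost times batch, capped at FLAT."""
    per_request = 64 * 1024**2  # placeholder, not a measured value
    return min(per_request * max_batch_size, FLAT)


def test_small_batch_reserves_less() -> None:
    assert estimate(1) < FLAT


def test_large_batch_never_exceeds_flat() -> None:
    assert estimate(512) <= FLAT


def test_reservation_is_monotonic_in_batch_size() -> None:
    values = [estimate(b) for b in (1, 8, 64, 512)]
    assert values == sorted(values)
```

The monotonicity check is an addition beyond the two assertions listed above; it guards against an estimate that shrinks as the batch grows.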

## Context

Found while serving Gemma 4 NVFP4 on g5/g6 instances. Related: #6667, #6668, #6670, #6666.


[Feature Request] Gemma 4: right-size NVFP4 activation-memory reservation for 24 GB pre-Blackwell GPUs #6696
