Add vocabulary tiling to reduce redundant memory #2242
Conversation
Force-pushed b9ad751 to c1277e3
Force-pushed f231602 to 6eb981c
🤖 Hi @gobbleturk, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
Force-pushed bf90996 to 9cfd8ef
src/MaxText/layers/models.py (Outdated)
@@ -297,6 +311,19 @@ def no_op(self, *args, **kwargs):
  def init_cache(self, cache_size: int, batch_size: int, dtype=jnp.float32):
    return True

  def logits_from_hidden_states(self, hidden_states, deterministic):
I'm confused - is this function defined twice, here and above?
I will only keep one of them. Sorry for the confusion.
src/MaxText/layers/models.py (Outdated)
@@ -410,6 +441,20 @@ class ZeroOneTransformer(nn.Module):
  def setup(self):
    self.model = transformer_as_linen(self.config, self.mesh, self.quant, self.model_mode)

  def logits_from_hidden_states(self, hidden_states, deterministic):
three times?
I will remove them. This is because there are three types of Transformers defined in models.py, but only one of them actually gets used.
gobbleturk left a comment
Thanks Nuojin!
Force-pushed 9cfd8ef to 82250c2
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This pull request introduces vocabulary tiling, a memory-saving optimization that computes the cross-entropy loss in chunks to reduce peak memory usage. The implementation is well-structured, with the core logic encapsulated in maxtext_utils.py and comprehensive tests to ensure correctness across various sharding configurations.
🔍 General Feedback
- The addition of thorough unit tests for different parallelism strategies is excellent and ensures the reliability of this new feature.
- The code is clean and the changes are well-integrated into the existing structure.
- The TODOs for future optimizations are noted and will be important for maximizing the benefits of this feature.
richjames0 left a comment
PTAL at Gemini's nits but otherwise LGTM!
Force-pushed 82250c2 to ea86e5a
Force-pushed ea86e5a to 7775ebf
Description
This PR introduces vocabulary tiling, a new feature for MaxText designed to significantly reduce peak memory consumption during training.
This optimization is particularly beneficial for two key scenarios:
The core idea of vocabulary tiling is to avoid explicitly materializing the full final logits tensor. Instead, the logits activation is chunked (or "tiled") along the batch-sequence dimension.
As illustrated in the diagram below, the forward and backward passes are repeated num_vocab_tiling times. In each iteration, a small slice of the logits is computed, used to calculate the loss, and the vector-Jacobian product is immediately backpropagated. This iterative process avoids holding the complete, memory-intensive logits tensor.

Figure 1: The vocabulary tiling process. The forward and backward passes are repeated for each tile, preventing the full logits tensor from being stored in memory.
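For intuition, here is a minimal, hypothetical JAX sketch of the same idea; it is not MaxText's actual implementation (which lives in maxtext_utils.py), and the function and argument names are illustrative. Each tile's logits and loss are computed under jax.checkpoint, so the backward pass recomputes that tile's logits rather than storing them, and a scan accumulates the total loss.

```python
import jax
import jax.numpy as jnp


def tiled_cross_entropy(hidden_states, embedding, targets, num_tiles):
  """Mean cross-entropy computed one tile at a time (illustrative sketch).

  hidden_states: [batch * seq, hidden]; embedding: [hidden, vocab];
  targets: [batch * seq] integer token ids.
  """
  h_tiles = hidden_states.reshape(num_tiles, -1, hidden_states.shape[-1])
  t_tiles = targets.reshape(num_tiles, -1)

  # jax.checkpoint (remat) makes the backward pass recompute each tile's
  # logits instead of saving them, so only one tile-sized logits slice is
  # live at any point during the forward or backward pass.
  @jax.checkpoint
  def tile_loss(h, t):
    logits = h @ embedding                               # [tile, vocab]
    logprobs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.sum(jnp.take_along_axis(logprobs, t[:, None], axis=-1))

  def body(total, tile):
    h, t = tile
    return total + tile_loss(h, t), None

  total, _ = jax.lax.scan(body, jnp.zeros((), jnp.float32), (h_tiles, t_tiles))
  return total / targets.size
```

The sketch trades a second computation of each tile's logits during the backward pass for a much smaller peak activation footprint, which is the same trade-off the diagram above describes.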
For a more in-depth technical explanation, please see the design document.
Doc: go/maxtext-vocab-tiling
FIXES: b/429255841
Tests
Correctness Tests
Test losses and embedding-table gradient differences in MaxText/tests/vocab_tiling_test.py with a 1% relative error tolerance across several configurations, including the case with a separate output projection (logits_via_embedding=False).

Performance Tests
See doc.
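As an illustration of what such a consistency check involves (a hypothetical sketch, not the actual vocab_tiling_test.py), the tiled loss and the embedding-table gradient can be compared against an untiled baseline within the 1% relative tolerance:

```python
import jax
import jax.numpy as jnp
import numpy as np


def xent(embedding, hidden, targets):
  """Baseline: mean cross-entropy over the full logits tensor."""
  logprobs = jax.nn.log_softmax(hidden @ embedding, axis=-1)
  return -jnp.mean(jnp.take_along_axis(logprobs, targets[:, None], axis=-1))


def tiled_xent(embedding, hidden, targets, num_tiles):
  """Same loss computed over num_tiles chunks of the batch-seq dimension."""
  h = hidden.reshape(num_tiles, -1, hidden.shape[-1])
  t = targets.reshape(num_tiles, -1)
  return jnp.mean(jax.vmap(lambda hc, tc: xent(embedding, hc, tc))(h, t))


key = jax.random.PRNGKey(0)
hidden = jax.random.normal(key, (32, 16))        # [batch * seq, hidden]
embedding = jax.random.normal(key, (16, 128))    # [hidden, vocab]
targets = jax.random.randint(key, (32,), 0, 128)

loss_ref, grad_ref = jax.value_and_grad(xent)(embedding, hidden, targets)
loss_tiled, grad_tiled = jax.value_and_grad(tiled_xent)(embedding, hidden, targets, 4)

np.testing.assert_allclose(loss_tiled, loss_ref, rtol=1e-2)   # losses agree within 1%
np.testing.assert_allclose(grad_tiled, grad_ref, rtol=1e-2)   # embedding grads agree within 1%
```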
Checklist
Before submitting this PR, please make sure (put X in square brackets):