Haian Jin1,2, Rundi Wu1, Tianyuan Zhang3, Ruiqi Gao1, Jonathan T. Barron1, Noah Snavely1,2, Aleksander Holynski1
1Google DeepMind 2Cornell University 3MIT
@inproceedings{jin2026zipmap,
title = {{ZipMap}: Linear-Time Stateful 3D Reconstruction via Test-Time Training},
author = {Jin, Haian and Wu, Rundi and Zhang, Tianyuan and Gao, Ruiqi and Barron, Jonathan T. and Snavely, Noah and Holynski, Aleksander},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026}
}
- 2026-04-15: Released ZipMap with state query support and quantitative evaluation code.
- 2026-04-14: Released the streaming version of ZipMap along with its training and demo scripts.
- 2026-04-12: Released code, checkpoints, and the interactive Gradio demo for ZipMap main model.
- 2026-02: ZipMap accepted to CVPR 2026.
This is a reimplementation of the code for the paper "ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training".
We have verified that the re-implemented version matches the performance of the original. For any questions or issues, please contact Haian Jin at [email protected].
conda create -n zipmap python=3.11
conda activate zipmap
pip install -e .
Download the ZipMap checkpoints hosted on Hugging Face:
| Model | Description |
|---|---|
| ZipMap | Main model; no reference view specification (stage 3 checkpoint) |
| ZipMap Variants | |
| ZipMap w/ reference view | With reference view specification (stage 2 checkpoint) |
| ZipMap Streaming | Supports online/streaming inference (fine-tuned from ZipMap) |
| ZipMap w/ state query | Supports state query (fine-tuned from ZipMap w/ reference view) |
Launch the demo locally for the main ZipMap model:
python demo_gradio_zipmap.py --ckpt_path /path/to/zipmap_main_checkpoint.pt
If you use the checkpoint with reference view specification, disable affine invariance by setting --affine_invariant false when launching the demo.
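The two offline checkpoints differ only in one flag at launch time, so the command can be assembled programmatically. The sketch below is an illustration, not part of the repository: `demo_command` and the variant labels `"main"` and `"reference_view"` are hypothetical names, while the script name and the `--ckpt_path` and `--affine_invariant false` arguments are taken from the text above.

```python
import shlex
import sys


def demo_command(ckpt_path: str, variant: str = "main") -> list[str]:
    """Build the argv list for the offline Gradio demo.

    `variant` is a hypothetical label: "main" for the stage 3 checkpoint,
    "reference_view" for the stage 2 checkpoint.
    """
    if variant not in ("main", "reference_view"):
        raise ValueError(f"unknown variant: {variant!r}")
    cmd = [sys.executable, "demo_gradio_zipmap.py", "--ckpt_path", ckpt_path]
    if variant == "reference_view":
        # The reference-view checkpoint needs affine invariance disabled.
        cmd += ["--affine_invariant", "false"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually launch the demo.
    print(shlex.join(demo_command("/path/to/checkpoint.pt", "reference_view")))
```

Building the command as a list rather than a single string avoids shell-quoting problems when the checkpoint path contains spaces.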
We also provide scripts to run the demo for the streaming version of ZipMap.
torch.compile takes significant time for the streaming version of ZipMap. For inference, disable it by setting TORCH_COMPILE_DISABLE=1 before launching the demo:
TORCH_COMPILE_DISABLE=1 python demo_gradio_zipmap_streaming.py --ckpt_path /path/to/your/online_checkpoint.pt
See the evaluation branch for code and instructions on how to run the quantitative evaluation and runtime measurements. Our evaluation code builds heavily on the repository provided by the authors of Pi3. We sincerely thank the authors for their open-source contributions.
We train our model with FSDP (Fully Sharded Data Parallel) using GPUs with 80GB memory. If you have access to such hardware, you can run the training with the provided configs. If not, you can modify the configs to fit your hardware (e.g., by reducing batch size and using more aggressive FSDP sharding strategies).
Setup WandB for Logging
Before training, fill in your WandB API key in training/config/wandb_key.yaml (get yours at wandb.ai/authorize):
wandb: YOUR_WANDB_API_KEY_HERE
Download the Pretrained VGGT Checkpoint
Download the VGGT checkpoint:
mkdir -p checkpoints
wget -O checkpoints/vggt_checkpoint.pt https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt
Before running the full training, we recommend a quick debug run to ensure everything is set up correctly. This run uses only two datasets.
First, download the two datasets for the debug run:
1. Prepare VKITTI dataset
- a. Use `training/data/datasets/preprocess/download_vkitti.sh` to download the VKITTI dataset. (This script is from VGGT's repository.)
- b. Use our script `training/data/datasets/preprocess/save_vkitti_metadata.py` to save the metadata to improve IO efficiency.
2. Prepare MVS-Synth dataset
- a. First download the MVS-Synth dataset. Choose the 720p version.
- b. Then, preprocess the MVS-Synth dataset by running the script `training/data/datasets/preprocess/preprocess_mvs_synth.py` to convert the data format. (This script is from CUT3R's repository.)
- c. Run our script `training/data/datasets/preprocess/save_mvs_synth_metadata.py` to save the metadata to improve IO efficiency.
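Before launching, it can help to confirm that both datasets actually landed under the directory you will pass as base_data_dir. The following is a rough sketch only: `missing_dataset_dirs` is a hypothetical helper, and the subdirectory names `vkitti` and `mvs_synth` are placeholders rather than the layout the repository expects, so substitute the folders produced by the preprocessing scripts.

```python
from pathlib import Path


def missing_dataset_dirs(base_data_dir: str, names: list[str]) -> list[str]:
    """Return the entries of `names` that are not non-empty directories
    under `base_data_dir`."""
    base = Path(base_data_dir)
    missing = []
    for name in names:
        path = base / name
        # A dataset counts as present only if the directory exists
        # and contains at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder names: replace with the actual dataset folders on your machine.
    print(missing_dataset_dirs("/path/to/your/data", ["vkitti", "mvs_synth"]))
```

An empty result means every listed folder exists and is non-empty; it does not verify that the contents are complete or correctly preprocessed.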
Once both datasets are prepared, launch the debug run with the default_debug config:
torchrun --nproc_per_node=8 training/launch.py --config default_debug base_data_dir=/path/to/your/data
The first few iterations may be slow because torch.compile compiles the TTT modules. After that, each iteration should take a few seconds.
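The debug launch follows the pattern that all training commands in this README share: torchrun, the launcher script, a config name, and key=value overrides. As an illustration, the helper below assembles that command as an argument list. `launch_command` is a hypothetical helper, not part of the repository; the flags and override keys are copied from the commands in this README.

```python
import shlex


def launch_command(config, base_data_dir, resume_from=None, nproc_per_node=8):
    """Assemble a torchrun argument list in the pattern this README uses."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "training/launch.py",
        "--config",
        config,
        f"base_data_dir={base_data_dir}",
    ]
    if resume_from is not None:
        # Later stages resume from the checkpoint written by the previous stage.
        cmd.append(f"checkpoint.resume_checkpoint_path={resume_from}")
    return cmd


if __name__ == "__main__":
    # Print the debug-run command; use subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(launch_command("default_debug", "/path/to/your/data")))
```

Passing `resume_from` appends the checkpoint.resume_checkpoint_path override, which is how each later training stage picks up the checkpoint of the stage before it.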
The full training uses 29 datasets, including 23 static datasets and 6 dynamic datasets. Please prepare the datasets following the instructions in training/data/README.md. After that, you can run the full training with the following commands:
# Stage 1: Train on 23 static datasets.
torchrun --nproc_per_node=8 training/launch.py --config default_stage1_hi_res_static base_data_dir=/path/to/your/data
# Stage 2: Train on all 29 datasets (23 static + 6 dynamic datasets).
torchrun --nproc_per_node=8 training/launch.py --config default_stage2_hi_res_dynamic base_data_dir=/path/to/your/data checkpoint.resume_checkpoint_path=/path/to/your/stage1_checkpoint.ckpt
# Stage 3: Remove the reference view specification and keep tuning on all 29 datasets.
torchrun --nproc_per_node=8 training/launch.py --config default_stage3_hi_res_dynamic_aff_inv base_data_dir=/path/to/your/data checkpoint.resume_checkpoint_path=/path/to/your/stage2_checkpoint.ckpt
We train the ZipMap-State-Query model by fine-tuning the ZipMap checkpoint with reference view specification (the stage 2 checkpoint).
torchrun --nproc_per_node=8 training/launch.py --config default_finetune_state_query base_data_dir=/path/to/your/data checkpoint.resume_checkpoint_path=/path/to/your/stage2_checkpoint.ckpt
We train the ZipMap-Streaming model by fine-tuning the offline ZipMap checkpoint (stage 3). We replace the transformer-based camera head with a lightweight MLP-based camera head, keep the rest of the model unchanged, and fine-tune on all 29 datasets.
# Stage 1: Train on 12-view context length for 60k iterations.
torchrun --nproc_per_node=8 training/launch.py --config default_finetune_online_run1 base_data_dir=/path/to/your/data checkpoint.resume_checkpoint_path=/path/to/your/offline_stage3_checkpoint.ckpt
# Stage 2: Continue training on 24-view context length for 30k iterations.
torchrun --nproc_per_node=8 training/launch.py --config default_finetune_online_run2 base_data_dir=/path/to/your/data checkpoint.resume_checkpoint_path=/path/to/your/previous_online_run_checkpoint.ckpt
# We saw clear improvements after extending the training context length from 12 to 24 views. Due to time constraints, we did not train beyond a 24-view context length. If you have more time and resources, we recommend continuing training on a longer context length (e.g., 48 or 64 views) for further improvement.
Note: Due to time constraints, we did not fully explore the streaming setting and only ran a last-shot experiment. With more careful tuning and longer training, the streaming model's performance can likely be further improved.
Potential directions to explore:
- Replace the current `TTT with Muon` with `TTT with momentum`, which is much more efficient for streaming-version training.
- Fine-tune from the stage 2 checkpoint (with reference view specification) instead of the stage 3 checkpoint, which may improve training stability and potentially improve final performance.
- Instead of directly replacing the camera head with a lightweight MLP-based head, re-use the transformer-based camera head and only replace its bidirectional self-attention with causal attention.
- Train on a longer context length (e.g., 48 or 64 views).
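To make the camera-head direction concrete: with bidirectional attention every frame attends to all frames, whereas a causal mask lets frame i attend only to frames 0..i, so a frame never depends on frames that have not arrived yet. The sketch below builds both masks at the frame level in plain Python. It is an illustration only: the function names are hypothetical, it ignores the tokens within each frame, and it is not taken from the ZipMap code.

```python
def bidirectional_mask(num_frames: int) -> list[list[bool]]:
    """Every query frame may attend to every key frame."""
    return [[True] * num_frames for _ in range(num_frames)]


def causal_mask(num_frames: int) -> list[list[bool]]:
    """Query frame i may attend only to key frames 0..i."""
    return [
        [key <= query for key in range(num_frames)]
        for query in range(num_frames)
    ]


if __name__ == "__main__":
    # Rows are query frames, columns are key frames; "x" marks allowed attention.
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
```

The causal mask is the lower triangle of the bidirectional one, so switching between the two leaves the rest of the attention computation unchanged.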
Our code is built on top of the following repositories:
We sincerely thank the authors of these repositories for their open-source contributions, which have greatly helped this project.