Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu
This is the official repository for the paper "LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias".
The code here is a re-implementation and differs from the original version developed at Adobe. However, the provided checkpoints are from the original Adobe implementation and were trained inside Adobe.
We have verified that the re-implemented version matches the performance of the original. For any questions or issues, please contact Haian Jin at [email protected].
```bash
conda create -n LVSM python=3.11
conda activate LVSM
pip install -r requirements.txt
```
Because we use xformers memory_efficient_attention, your GPU's compute capability needs to be greater than 8.0; otherwise, an error will be raised. You can check your GPU's compute capability on the CUDA GPUs page.
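If you are not sure what compute capability your GPU has, a quick check (assuming the PyTorch install from requirements.txt has CUDA support) is:

```bash
# Print the (major, minor) compute capability of GPU 0 via PyTorch.
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# On recent NVIDIA drivers, nvidia-smi can also report it directly.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```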
Download the RealEstate10K dataset from this link (provided by pixelSplat), unzip it, and put the data in YOUR_RAW_DATAPATH.
Run the following command to preprocess the data into our format.
```bash
python process_data.py --base_path YOUR_RAW_DATAPATH --output_dir YOUR_PROCESSED_DATAPATH --mode ['train' or 'test']
```

The scene-level evaluation is conducted on the RealEstate10K dataset preprocessed by pixelSplat. The model checkpoints are hosted on HuggingFace.
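As a sketch for fetching a checkpoint with the huggingface_hub CLI (the repository ID below is a placeholder; use the actual one linked from this repo's HuggingFace page):

```bash
pip install -U "huggingface_hub[cli]"
# <HF_REPO_ID> is a placeholder for the actual checkpoint repository ID on HuggingFace.
huggingface-cli download <HF_REPO_ID> --local-dir ./checkpoints
```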
| Model | PSNR | SSIM | LPIPS |
|---|---|---|---|
| LVSM Decoder-Only Scene-Level res256×256 (full) | 29.67 | 0.906 | 0.098 |
| LVSM Encoder-Decoder Scene-Level res256×256 (full) | 28.60 | 0.893 | 0.114 |
| LVSM Decoder-Only Scene-Level res512×512 | N/A | N/A | N/A |
| LVSM Encoder-Decoder Scene-Level res512×512 | N/A | N/A | N/A |
As we discussed in the limitations section of the paper:
Our model’s performance degrades when provided with images with aspect ratios and resolutions different from those seen during training.
Therefore, if you plan to use the model for inference at resolutions or aspect ratios different from those used to train our provided checkpoints (256×256 or 512×512), we recommend fine-tuning the model for the specific resolution and aspect ratio.
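A possible fine-tuning command, reusing the config-override style from the training commands below; note that the checkpoint and image-size keys here are hypothetical placeholders and should be replaced with the actual option names in the released config files:

```bash
# Sketch only: training.checkpoint_dir and model.image_tokenizer.image_size are
# hypothetical override names, not confirmed keys from the released configs.
torchrun --nproc_per_node 8 --nnodes 1 \
    --rdzv_id 18635 --rdzv_backend c10d --rdzv_endpoint localhost:29502 \
    train.py --config configs/LVSM_scene_decoder_only.yaml \
    training.checkpoint_dir = ./checkpoints \
    model.image_tokenizer.image_size = 384
```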
Before training, you need to follow the instructions here to generate the Wandb key file for logging and save it in the configs folder as api_keys.yaml. You can use the configs/api_keys_example.yaml as a template.
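Concretely, you can copy the template and then paste in your own key (available at https://wandb.ai/authorize):

```bash
# Create your key file from the template, then edit it and fill in your wandb API key.
cp configs/api_keys_example.yaml configs/api_keys.yaml
```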
The original training command:
```bash
torchrun --nproc_per_node 8 --nnodes 8 \
    --rdzv_id 18635 --rdzv_backend c10d --rdzv_endpoint localhost:29502 \
    train.py --config configs/LVSM_scene_decoder_only.yaml
```

The training is distributed across 8 nodes with 8 GPUs per node (64 GPUs in total), with a total batch size of 512.
LVSM_scene_decoder_only.yaml is the config file for the scene-level Decoder-Only LVSM model. You can also use LVSM_scene_encoder_decoder.yaml to train the scene-level Encoder-Decoder LVSM model.
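For example, the Encoder-Decoder model can be trained with the same command, swapping only the config file:

```bash
torchrun --nproc_per_node 8 --nnodes 8 \
    --rdzv_id 18635 --rdzv_backend c10d --rdzv_endpoint localhost:29502 \
    train.py --config configs/LVSM_scene_encoder_decoder.yaml
```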
If you have limited resources, you can use the following command to train a smaller model with a smaller batch size:
```bash
torchrun --nproc_per_node 8 --nnodes 1 \
    --rdzv_id 18635 --rdzv_backend c10d --rdzv_endpoint localhost:29502 \
    train.py --config configs/LVSM_scene_decoder_only.yaml \
    model.transformer.n_layer = 12 \
    training.batch_size_per_gpu = 16
```
Here, we decrease the total batch size from 512 to 128 and reduce the number of transformer layers from 24 to 12. You can also increase the patch size from 8 to 16 for faster training at the cost of some performance. We have also discussed these efficient settings (single/two GPU training) in the paper.
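As a rough single-GPU starting point (a sketch, not the exact efficient configuration reported in the paper; adjust the layer count and per-GPU batch size to fit your GPU memory):

```bash
# Single-GPU sketch with a smaller model and batch size; tune these values for your hardware.
torchrun --nproc_per_node 1 --nnodes 1 \
    --rdzv_id 18635 --rdzv_backend c10d --rdzv_endpoint localhost:29502 \
    train.py --config configs/LVSM_scene_decoder_only.yaml \
    model.transformer.n_layer = 12 \
    training.batch_size_per_gpu = 8
```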
Run inference and evaluation with the following command:

```bash
torchrun --nproc_per_node 8 --nnodes 1 \
    --rdzv_id 18635 --rdzv_backend c10d --rdzv_endpoint localhost:29506 \
    inference.py --config "configs/LVSM_scene_decoder_only.yaml" \
    training.dataset_path = "./preprocessed_data/test/full_list.txt" \
    training.batch_size_per_gpu = 4 \
    training.target_has_input = false \
    training.num_views = 5 \
    training.square_crop = true \
    training.num_input_views = 2 \
    training.num_target_views = 3 \
    inference.if_inference = true \
    inference.compute_metrics = true \
    inference.render_video = true \
    inference_out_dir = ./experiments/evaluation/test
```

We use ./data/evaluation_index_re10k.json to specify the input and target view indices. This JSON file is originally from pixelSplat.
After inference, the code will generate an HTML file in the inference_out_dir folder. You can open this HTML file to view the results.
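If you ran inference on a remote machine, one easy way to view the results is to serve the output directory over HTTP and open it in a local browser:

```bash
# Serve the inference output directory (adjust the path if you changed inference_out_dir).
cd ./experiments/evaluation/test
python -m http.server 8000
# Then open http://<server-ip>:8000 in a browser and click the generated HTML file.
```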
If you find this work useful in your research, please consider citing:
```bibtex
@inproceedings{
  jin2025lvsm,
  title={LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias},
  author={Haian Jin and Hanwen Jiang and Hao Tan and Kai Zhang and Sai Bi and Tianyuan Zhang and Fujun Luan and Noah Snavely and Zexiang Xu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=QQBPWtvtcn}
}
```

We thank Kalyan Sunkavalli for helpful discussions and support. This work was done when Haian Jin, Hanwen Jiang, and Tianyuan Zhang were research interns at Adobe Research. This work was also partly funded by the National Science Foundation (IIS-2211259, IIS-2212084).