CVPR 2025 (Oral Presentation)
Linyi Jin1,2, Richard Tucker1, Zhengqi Li1, David Fouhey3, Noah Snavely1*, Aleksander Hołyński1,4*
1Google DeepMind, 2University of Michigan, 3New York University, 4UC Berkeley
(*: equal contribution)
teaser_compressed.mp4
This repository contains the data processing pipeline for converting a stereoscopic video into a dynamic point cloud. The pipeline estimates stereo disparity and 2D tracks, fuses these quantities into a consistent 3D coordinate frame, and performs several filtering operations to ensure temporally consistent, high-quality reconstructions.
This is not an officially supported Google product.
- [Oct 2025] DynaDUSt3R has been reimplemented and released in PyTorch! Check out Kevin Mathew's unofficial implementation at dynadust3r. Thanks Kevin! 🙏
# Clone the Repository
git clone --recurse-submodules [email protected]:Stereo4d/stereo4d-code.git
cd stereo4d-code
git submodule update --init --recursive
cd SEA-RAFT
git apply ../sea-raft-changes.patch
cd ..
mamba env create --file=environment.yml

We have released the Stereo4D dataset annotations (3.6 TB) on a Google Cloud Storage bucket: https://console.cloud.google.com/storage/browser/stereo4d/. The annotations are released under a CC license.
For each video clip, we release:
{
'name': unique clip id, <video_id>_<first_frame_time_stamp>,
'video_id': link to the video, https://www.youtube.com/watch?v=<video_id>,
'timestamps': a list of frame timestamps from the original video,
'camera2world': a list of camera poses corresponding to the rectified frames,
'track_lengths', 'track_indices', 'track_coordinates': packed 3D tracks; load them with utils/load_dataset_npz(),
'rectified2rig': rotation matrices used to rectify the frames,
'fov_bounds': camera intrinsics of the VR180 frame, used to extract the perspective frames.
}
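To take a quick look at one downloaded annotation file, here is a minimal sketch. The field names follow the listing above; the example path is hypothetical, and the packed track fields are meant to be decoded with utils/load_dataset_npz() rather than read directly:

import numpy as np

# Hypothetical local path; any downloaded annotation file works the same way.
data = np.load("stereo4d_dataset/npz/CMwZrkhQ0ck_130030030.npz", allow_pickle=True)

print(sorted(data.keys()))       # the fields listed above
print(data["video_id"])          # https://www.youtube.com/watch?v=<video_id>
print(len(data["timestamps"]))   # number of rectified frames
# 'track_lengths', 'track_indices', 'track_coordinates' are packed 3D tracks;
# decode them with utils/load_dataset_npz() instead of unpacking by hand.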
Please follow the gcloud installation guide to download the .npz files, or:
# Install gcloud sdk
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
./google-cloud-sdk/bin/gcloud init

# To download one example
mkdir -p stereo4d_dataset/npz
gcloud storage cp gs://stereo4d/train/CMwZrkhQ0ck_130030030.npz stereo4d_dataset/npz

# To download full dataset
gsutil -m cp -R gs://stereo4d .

Demo
Download the demo data and run bash demo_run.bash, or:
TIMESTAMP=66957
VIDEOID=9876543210b
VID="${VIDEOID}_${TIMESTAMP}"
echo "=== Downloading Dataset ==="
gsutil -m cp -R gs://stereo4d/demo .
mv demo stereo4d_dataset
mkdir -p stereo4d_dataset/npz stereo4d_dataset/raw
mv stereo4d_dataset/${VIDEOID}.mp4 stereo4d_dataset/raw
mv stereo4d_dataset/${VID}.npz stereo4d_dataset/npz

Some VR180 videos may not be perfectly rectified. Therefore, we perform rig calibration during bundle adjustment to find two rotation matrices, one each for the left and right views, used for stereo rectification.
The script runs the following steps:
- Extract frames from the specified timestamps and save them as {videoid}-raw_equirect.mp4.
- Rectify the equirectangular video using the rig calibration result in rectified2rig and save it as rectified_equirect.mp4.
- Crop the equirectangular projection to a 60° FoV perspective projection (a rough sketch of this geometry follows the list), saving the results as:
  • {videoid}-left_rectified.mp4 (left eye)
  • {videoid}-right_rectified.mp4 (right eye)
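The geometry of the last two steps can be sketched as follows. This is illustration only, not the code in rectify.py: the rotation direction, camera axis conventions, and the angular coverage of the VR180 frame (handled via fov_bounds in the real script) are all assumptions here.

import cv2
import numpy as np

def perspective_from_equirect(equirect, rotation, fov_deg=60.0, out_size=512):
    # equirect: HxWx3 equirectangular frame for one eye.
    # rotation: 3x3 matrix taking perspective-camera rays into the equirect frame
    #           (e.g., derived from 'rectified2rig'; the direction is an assumption).
    h, w = equirect.shape[:2]
    f = 0.5 * out_size / np.tan(np.radians(fov_deg) / 2.0)  # pinhole focal length in pixels

    # Ray direction for every output pixel, assuming the camera looks down +z.
    xs, ys = np.meshgrid(np.arange(out_size) + 0.5, np.arange(out_size) + 0.5)
    dirs = np.stack([(xs - out_size / 2.0) / f,
                     (ys - out_size / 2.0) / f,
                     np.ones_like(xs)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    dirs = dirs @ rotation.T  # rotate rays into the equirectangular frame

    # Directions -> longitude/latitude -> pixel coordinates, assuming the frame
    # spans 360 x 180 degrees (a VR180 eye covers less; rectify.py uses fov_bounds).
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    map_x = ((lon / np.pi) * 0.5 + 0.5) * w
    map_y = ((lat / (np.pi / 2.0)) * 0.5 + 0.5) * h
    return cv2.remap(equirect, map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)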
JAX_PLATFORMS=cpu python rectify.py \
--vid=9876543210b_66957

Example output:
Rectified stereo video in equirectangular format.
9876543210b_66957-rectified_equirect-compressed.mp4
512x512 60° FoV perspective video.
9876543210b_66957-left_rectified.mp4
🎉 The released .npz files already contain 3D tracks, so you can skip the remaining steps and directly use the provided example to visualize them.
If you want to reproduce the 3D tracks, continue with the following steps.
The following script loads the rectified perspective videos, calculates the disparity, and saves the results to flows_stereo.pkl.
We used an internal version of RAFT during development; here we use SEA-RAFT for the demo.
You can also try other state-of-the-art stereo methods such as FoundationStereo.
We can integrate more advanced stereo methods as they become available.
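For reference, turning the estimated disparities into depth on the rectified perspective frames is a one-line relation. In this sketch the focal length is derived from the 60° FoV crop, and the baseline value is a placeholder; the actual baseline comes from the rig calibration:

import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    # Rectified stereo: depth = focal_length * baseline / disparity.
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Focal length (in pixels) of the 512x512, 60-degree-FoV perspective crop.
focal_px = 0.5 * 512 / np.tan(np.radians(60.0) / 2.0)   # ~443 px

# Example with a dummy disparity map; baseline_m=0.065 is a placeholder value.
disparity_map = np.full((512, 512), 20.0)
depth = disparity_to_depth(disparity_map, focal_px, baseline_m=0.065)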
python inference_raft.py \
--vid=9876543210b_66957

We extract long-range 2D point trajectories using BootsTAP.
The following script runs it on perspective videos and saves results to tapir_2d.pkl and visualizations to tapir_2d.mp4.
For every 10th frame, we uniformly initialize 128 x 128 query points on frames of resolution 512 x 512. We then prune redundant tracks that overlap on the same pixel.
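A minimal sketch of that query-point initialization; the (t, y, x) ordering of the queries is an assumption of this sketch, not necessarily what tracking.py passes to BootsTAP:

import numpy as np

def make_query_points(num_frames, height=512, width=512, grid=128, stride=10):
    # Uniform grid x grid of query points, re-initialized on every `stride`-th frame.
    ys = (np.arange(grid) + 0.5) * (height / grid)
    xs = (np.arange(grid) + 0.5) * (width / grid)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    queries = []
    for t in range(0, num_frames, stride):
        tt = np.full_like(yy, t)
        queries.append(np.stack([tt, yy, xx], axis=-1).reshape(-1, 3))
    return np.concatenate(queries, axis=0)  # (num_queries, 3) rows of (t, y, x)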
python tracking.py \
--vid=9876543210b_66957

Example output:
Dense 2D tracks.
tapir_2d-compressed.mp4
Since 2D tracks can drift in textureless regions, we discard moving 3D tracks that fall on certain semantic categories (e.g., wall, building, road, earth, sidewalk), as detected by DeepLabv3 trained on ADE20K classes.
We can integrate more advanced tracking methods as they become available.
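A rough sketch of this filtering step; the class list, array layouts, and the choice to check only the track's first observation are simplifications for illustration (the real segmentation.py uses DeepLabv3 predictions over ADE20K classes):

import numpy as np

STUFF_CLASSES = {"wall", "building", "road", "earth", "sidewalk"}  # illustrative subset

def filter_drifting_tracks(tracks_xy, is_moving, seg_labels, class_names):
    # tracks_xy:   (num_tracks, num_frames, 2) pixel coordinates (x, y).
    # is_moving:   (num_tracks,) bool, True for tracks classified as dynamic.
    # seg_labels:  (num_frames, H, W) per-pixel class indices from the segmentation model.
    # class_names: list mapping a class index to its ADE20K name.
    keep = np.ones(len(tracks_xy), dtype=bool)
    for i, track in enumerate(tracks_xy):
        if not is_moving[i]:
            continue  # static tracks are kept regardless of semantic category
        x0, y0 = np.round(track[0]).astype(int)  # check only the first observation here
        if class_names[seg_labels[0, y0, x0]] in STUFF_CLASSES:
            keep[i] = False  # moving track on a 'stuff' region: likely drift
    return keep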
python segmentation.py \
--vid=9876543210b_66957

Example output:
Dense 3D tracks projected onto video frames, without drifting tracks.
9876543210b_66957-tapir_3dtrack_filtered.mp4
We then fuse these quantities into 4D reconstructions, by lifting the 2D tracks into 3D with their depth.
9876543210b_66957-dyna_3dtrack_concated-original.mp4
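Conceptually, the lifting step back-projects each 2D track point with its depth and transforms it by the per-frame camera pose. A small sketch, assuming 4x4 camera2world matrices and a +z-forward pinhole camera (both conventions are assumptions of this sketch):

import numpy as np

def lift_track_to_world(track_xy, depths, intrinsics, camera2world):
    # track_xy:     (num_frames, 2) pixel coordinates (x, y) of one 2D track.
    # depths:       (num_frames,) depth sampled from the stereo depth maps.
    # intrinsics:   3x3 pinhole matrix of the 512x512 perspective crop.
    # camera2world: (num_frames, 4, 4) poses, e.g. from the released annotations.
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    points = []
    for (x, y), z, c2w in zip(track_xy, depths, camera2world):
        p_cam = np.array([(x - cx) / fx * z, (y - cy) / fy * z, z, 1.0])
        points.append((c2w @ p_cam)[:3])  # camera frame -> world frame
    return np.stack(points)  # (num_frames, 3) world-space positions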
Since stereo depth estimation is performed per-frame, the initial disparity estimates (and therefore, the 3D track positions) are likely to exhibit high-frequency temporal jitter.
To ensure that static points remain stationary and that moving tracks exhibit realistic, smooth motion without abrupt frame-to-frame depth changes, we design an optimization process (Eqn. 5 in the paper) to obtain high-quality 3D tracks.
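As a loose illustration of that idea (this is not the paper's Eqn. 5, and the weights are made up): optimize the per-frame 3D positions of each track so they stay close to the lifted measurements while penalizing frame-to-frame acceleration, and additionally pull static tracks toward a single constant point:

import numpy as np
from scipy.optimize import least_squares

def refine_track(measured_xyz, is_static, w_data=1.0, w_smooth=4.0, w_static=10.0):
    # measured_xyz: (num_frames, 3) lifted 3D positions of one track.
    n = len(measured_xyz)

    def residuals(flat):
        xyz = flat.reshape(n, 3)
        res = [w_data * (xyz - measured_xyz).ravel()]                          # stay near measurements
        res.append(w_smooth * (xyz[2:] - 2 * xyz[1:-1] + xyz[:-2]).ravel())    # low acceleration
        if is_static:
            res.append(w_static * (xyz - xyz.mean(axis=0)).ravel())            # pin static points
        return np.concatenate(res)

    sol = least_squares(residuals, measured_xyz.ravel())
    return sol.x.reshape(n, 3)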
python track_optimization.py \
--vid=9876543210b_66957

Example output:
Raw video depth from stereo matching.
9876543210b_66957-raw_depth.mp4
Depth maps obtained by projecting the optimized 3D tracks back into each frame.
9876543210b_66957-optimized_depth.mp4
Final 3D tracks (Color trails are only shown for moving points, but all points have been reconstructed in 3D).
9876543210b_66957-dyna_3dtrack_concated-optimized.mp4
🎉 That's it!
If you find this code useful, please consider citing:
@inproceedings{jin2025stereo4d,
title={{Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos}},
author={Jin, Linyi and Tucker, Richard and Li, Zhengqi and Fouhey, David and Snavely, Noah and Holynski, Aleksander},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025},
}
Thanks to Jon Barron, Ruiqi Gao, Kyle Genova, Philipp Henzler, Andrew Liu, Erika Lu, Ben Poole, Qianqian Wang, Rundi Wu, Richard Szeliski, and Stan Szymanowicz for their helpful proofreading, comments, and discussions. Thanks to Carl Doersch, Skanda Koppula, and Ignacio Rocco for their assistance with TAPVid-3D and BootsTAP. Thanks to Carlos Hernandez, Dominik Kaeser, Janne Kontkanen, Ricardo Martin-Brualla, and Changchang Wu for their help with VR180 cameras and videos.