CVPR 2025 (Oral Presentation)
Linyi Jin1,2, Richard Tucker1, Zhengqi Li1, David Fouhey3, Noah Snavely1*, Aleksander Hołyński1,4*
1Google DeepMind, 2University of Michigan, 3New York University, 4UC Berkeley
(*: equal contribution)
teaser_compressed.mp4
This repository contains the data processing pipeline for converting a stereoscopic video into a dynamic point cloud. The pipeline estimates stereo disparity and 2D tracks, fuses these quantities into a consistent 3D coordinate frame, and performs several filtering operations to ensure temporally consistent, high-quality reconstructions.
This is not an officially supported Google product.
- [Oct 2025] DynaDUSt3R has been reimplemented and released in PyTorch! Check out Kevin Mathew's unofficial implementation at dynadust3r. Thanks Kevin! 🙏
# Clone the Repository
git clone --recurse-submodules [email protected]:Stereo4d/stereo4d-code.git
cd stereo4d-code
git submodule update --init --recursive
cd SEA-RAFT
git apply ../sea-raft-changes.patch
cd ..
mamba env create --file=environment.yml

We have released the Stereo4D dataset annotations (3.6 TB) on a Google Cloud Storage bucket: https://console.cloud.google.com/storage/browser/stereo4d/. The annotations are released under a CC license.
For each video clip, we release:
{
'name': unique clip id, <video_id>_<first_frame_time_stamp>,
'video_id': link to the video, https://www.youtube.com/watch?v=<video_id>,
'timestamps': a list of frame timestamps from the original video,
'camera2world': a list of camera poses corresponding to the rectified frames,
'track_lengths', 'track_indices', 'track_coordinates': packed 3D tracks; load them with utils/load_dataset_npz(),
'rectified2rig': rotation matrices used to rectify the frames,
'fov_bounds': camera intrinsics of the VR180 frame, used to extract the perspective frames.
}
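To take a quick look at one downloaded annotation file, here is a minimal sketch. The field names follow the listing above; the example path is hypothetical, and the packed track fields are meant to be decoded with utils/load_dataset_npz() rather than read directly:

import numpy as np

# Hypothetical local path; any downloaded annotation file works the same way.
data = np.load("stereo4d_dataset/npz/CMwZrkhQ0ck_130030030.npz", allow_pickle=True)

print(sorted(data.keys()))       # the fields listed above
print(data["video_id"])          # https://www.youtube.com/watch?v=<video_id>
print(len(data["timestamps"]))   # number of rectified frames
# 'track_lengths', 'track_indices', 'track_coordinates' are packed 3D tracks;
# decode them with utils/load_dataset_npz() instead of unpacking by hand.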
Please follow the gcloud installation guide to download the .npz files, or:
# Install gcloud sdk
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
./google-cloud-sdk/bin/gcloud init

# To download one example
mkdir -p stereo4d_dataset/npz
gcloud storage cp gs://stereo4d/train/CMwZrkhQ0ck_130030030.npz stereo4d_dataset/npz

# To download full dataset
gsutil -m cp -R gs://stereo4d .

Demo
Download the demo data and run bash demo_run.bash, or:
TIMESTAMP=66957
VIDEOID=9876543210b
VID="${VIDEOID}_${TIMESTAMP}"
echo "=== Downloading Dataset ==="
gsutil -m cp -R gs://stereo4d/demo .
mv demo stereo4d_dataset
mkdir -p stereo4d_dataset/npz stereo4d_dataset/raw
mv stereo4d_dataset/${VIDEOID}.mp4 stereo4d_dataset/raw
mv stereo4d_dataset/${VID}.npz stereo4d_dataset/npz

Some VR180 videos may not be perfectly rectified. Therefore, we perform rig calibration during bundle adjustment to find two rotation matrices, one each for the left and right views, used for stereo rectification.
The script runs the following steps:
- Extract frames from the specified timestamps and save them as {videoid}-raw_equirect.mp4.
- Rectify the equirectangular video using the rig calibration result in rectified2rig and save it as rectified_equirect.mp4.
- Crop the equirectangular projection to a 60° FoV perspective projection (a rough sketch of this geometry follows the list), saving the results as:
  • {videoid}-left_rectified.mp4 (left eye)
  • {videoid}-right_rectified.mp4 (right eye)
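The geometry of the last two steps can be sketched as follows. This is illustration only, not the code in rectify.py: the rotation direction, camera axis conventions, and the angular coverage of the VR180 frame (handled via fov_bounds in the real script) are all assumptions here.

import cv2
import numpy as np

def perspective_from_equirect(equirect, rotation, fov_deg=60.0, out_size=512):
    # equirect: HxWx3 equirectangular frame for one eye.
    # rotation: 3x3 matrix taking perspective-camera rays into the equirect frame
    #           (e.g., derived from 'rectified2rig'; the direction is an assumption).
    h, w = equirect.shape[:2]
    f = 0.5 * out_size / np.tan(np.radians(fov_deg) / 2.0)  # pinhole focal length in pixels

    # Ray direction for every output pixel, assuming the camera looks down +z.
    xs, ys = np.meshgrid(np.arange(out_size) + 0.5, np.arange(out_size) + 0.5)
    dirs = np.stack([(xs - out_size / 2.0) / f,
                     (ys - out_size / 2.0) / f,
                     np.ones_like(xs)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    dirs = dirs @ rotation.T  # rotate rays into the equirectangular frame

    # Directions -> longitude/latitude -> pixel coordinates, assuming the frame
    # spans 360 x 180 degrees (a VR180 eye covers less; rectify.py uses fov_bounds).
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    map_x = ((lon / np.pi) * 0.5 + 0.5) * w
    map_y = ((lat / (np.pi / 2.0)) * 0.5 + 0.5) * h
    return cv2.remap(equirect, map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)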
JAX_PLATFORMS=cpu python rectify.py \
--vid=9876543210b_66957

Example output:
Rectified stereo video in equirectangular format.
9876543210b_66957-rectified_equirect-compressed.mp4
512x512 60° FoV perspective video.
9876543210b_66957-left_rectified.mp4
🎉 The released .npz files already contain 3D tracks, so you can skip the remaining steps and directly use the provided example to visualize them.
If you want to reproduce the 3D tracks, continue with the following steps.
The following script loads the rectified perspective videos, calculates the disparity, and saves the results to flows_stereo.pkl.
We used an internal version of RAFT during development; here we use SEA-RAFT for the demo.
You can also try other state-of-the-art stereo methods such as FoundationStereo.
We can integrate more advanced stereo methods as they become available.
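For reference, turning the estimated disparities into depth on the rectified perspective frames is a one-line relation. In this sketch the focal length is derived from the 60° FoV crop, and the baseline value is a placeholder; the actual baseline comes from the rig calibration:

import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    # Rectified stereo: depth = focal_length * baseline / disparity.
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Focal length (in pixels) of the 512x512, 60-degree-FoV perspective crop.
focal_px = 0.5 * 512 / np.tan(np.radians(60.0) / 2.0)   # ~443 px

# Example with a dummy disparity map; baseline_m=0.065 is a placeholder value.
disparity_map = np.full((512, 512), 20.0)
depth = disparity_to_depth(disparity_map, focal_px, baseline_m=0.065)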
python inference_raft.py \
--vid=9876543210b_66957

We extract long-range 2D point trajectories using BootsTAP.
The following script runs it on perspective videos and saves results to tapir_2d.pkl and visualizations to tapir_2d.mp4.
For every 10th frame, we uniformly initialize 128 x 128 query points on frames of resolution 512 x 512. We then prune redundant tracks that overlap on the same pixel.
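A minimal sketch of that query-point initialization; the (t, y, x) ordering of the queries is an assumption of this sketch, not necessarily what tracking.py passes to BootsTAP:

import numpy as np

def make_query_points(num_frames, height=512, width=512, grid=128, stride=10):
    # Uniform grid x grid of query points, re-initialized on every `stride`-th frame.
    ys = (np.arange(grid) + 0.5) * (height / grid)
    xs = (np.arange(grid) + 0.5) * (width / grid)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    queries = []
    for t in range(0, num_frames, stride):
        tt = np.full_like(yy, t)
        queries.append(np.stack([tt, yy, xx], axis=-1).reshape(-1, 3))
    return np.concatenate(queries, axis=0)  # (num_queries, 3) rows of (t, y, x)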
python tracking.py \
--vid=9876543210b_66957

Example output:
Dense 2D tracks.
tapir_2d-compressed.mp4
Since 2D tracks can drift in textureless regions, we discard moving 3D tracks that fall on certain semantic categories (e.g., wall, building, road, earth, sidewalk), as detected by DeepLabv3 trained on ADE20K classes.
We can integrate more advanced tracking methods as they become available.
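A rough sketch of this filtering step; the class list, array layouts, and the choice to check only the track's first observation are simplifications for illustration (the real segmentation.py uses DeepLabv3 predictions over ADE20K classes):

import numpy as np

STUFF_CLASSES = {"wall", "building", "road", "earth", "sidewalk"}  # illustrative subset

def filter_drifting_tracks(tracks_xy, is_moving, seg_labels, class_names):
    # tracks_xy:   (num_tracks, num_frames, 2) pixel coordinates (x, y).
    # is_moving:   (num_tracks,) bool, True for tracks classified as dynamic.
    # seg_labels:  (num_frames, H, W) per-pixel class indices from the segmentation model.
    # class_names: list mapping a class index to its ADE20K name.
    keep = np.ones(len(tracks_xy), dtype=bool)
    for i, track in enumerate(tracks_xy):
        if not is_moving[i]:
            continue  # static tracks are kept regardless of semantic category
        x0, y0 = np.round(track[0]).astype(int)  # check only the first observation here
        if class_names[seg_labels[0, y0, x0]] in STUFF_CLASSES:
            keep[i] = False  # moving track on a 'stuff' region: likely drift
    return keep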
python segmentation.py \
--vid=9876543210b_66957

Example output:
Dense 3D tracks projected onto video frames, without drifting tracks.
9876543210b_66957-tapir_3dtrack_filtered.mp4
We then fuse these quantities into 4D reconstructions, by lifting the 2D tracks into 3D with their depth.
9876543210b_66957-dyna_3dtrack_concated-original.mp4
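Conceptually, the lifting step back-projects each 2D track point with its depth and transforms it by the per-frame camera pose. A small sketch, assuming 4x4 camera2world matrices and a +z-forward pinhole camera (both conventions are assumptions of this sketch):

import numpy as np

def lift_track_to_world(track_xy, depths, intrinsics, camera2world):
    # track_xy:     (num_frames, 2) pixel coordinates (x, y) of one 2D track.
    # depths:       (num_frames,) depth sampled from the stereo depth maps.
    # intrinsics:   3x3 pinhole matrix of the 512x512 perspective crop.
    # camera2world: (num_frames, 4, 4) poses, e.g. from the released annotations.
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    points = []
    for (x, y), z, c2w in zip(track_xy, depths, camera2world):
        p_cam = np.array([(x - cx) / fx * z, (y - cy) / fy * z, z, 1.0])
        points.append((c2w @ p_cam)[:3])  # camera frame -> world frame
    return np.stack(points)  # (num_frames, 3) world-space positions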
Since stereo depth estimation is performed per-frame, the initial disparity estimates (and therefore, the 3D track positions) are likely to exhibit high-frequency temporal jitter.
To ensure that static points remain stationary and that moving tracks exhibit realistic, smooth motion without abrupt frame-to-frame depth changes, we design an optimization process (Eqn. 5 in the paper) to obtain high-quality 3D tracks.
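As a loose illustration of that idea (this is not the paper's Eqn. 5, and the weights are made up): optimize the per-frame 3D positions of each track so they stay close to the lifted measurements while penalizing frame-to-frame acceleration, and additionally pull static tracks toward a single constant point:

import numpy as np
from scipy.optimize import least_squares

def refine_track(measured_xyz, is_static, w_data=1.0, w_smooth=4.0, w_static=10.0):
    # measured_xyz: (num_frames, 3) lifted 3D positions of one track.
    n = len(measured_xyz)

    def residuals(flat):
        xyz = flat.reshape(n, 3)
        res = [w_data * (xyz - measured_xyz).ravel()]                          # stay near measurements
        res.append(w_smooth * (xyz[2:] - 2 * xyz[1:-1] + xyz[:-2]).ravel())    # low acceleration
        if is_static:
            res.append(w_static * (xyz - xyz.mean(axis=0)).ravel())            # pin static points
        return np.concatenate(res)

    sol = least_squares(residuals, measured_xyz.ravel())
    return sol.x.reshape(n, 3)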
python track_optimization.py \
--vid=9876543210b_66957

Example output:
Raw video depth from stereo matching.
9876543210b_66957-raw_depth.mp4
Depth maps obtained by projecting the optimized 3D tracks back into each frame.
9876543210b_66957-optimized_depth.mp4
Final 3D tracks (Color trails are only shown for moving points, but all points have been reconstructed in 3D).
9876543210b_66957-dyna_3dtrack_concated-optimized.mp4
🎉 That's it!
If you find this code useful, please consider citing:
@inproceedings{jin2025stereo4d,
title={{Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos}},
author={Jin, Linyi and Tucker, Richard and Li, Zhengqi and Fouhey, David and Snavely, Noah and Holynski, Aleksander},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025},
}
Thanks to Jon Barron, Ruiqi Gao, Kyle Genova, Philipp Henzler, Andrew Liu, Erika Lu, Ben Poole, Qianqian Wang, Rundi Wu, Richard Szeliski, and Stan Szymanowicz for their helpful proofreading, comments, and discussions. Thanks to Carl Doersch, Skanda Koppula, and Ignacio Rocco for their assistance with TAPVid-3D and BootsTAP. Thanks to Carlos Hernandez, Dominik Kaeser, Janne Kontkanen, Ricardo Martin-Brualla, and Changchang Wu for their help with VR180 cameras and videos.