Yifan Wang<sup>1*</sup>, Jianjun Zhou<sup>1,2,3*</sup>, Haoyi Zhu<sup>1</sup>, Wenzheng Chang<sup>1</sup>, Yang Zhou<sup>1</sup>, Zizun Li<sup>1</sup>, Junyi Chen<sup>1</sup>, Jiangmiao Pang<sup>1</sup>, Chunhua Shen<sup>2</sup>, Tong He<sup>1,3†</sup>

<sup>1</sup>Shanghai AI Lab  <sup>2</sup>ZJU  <sup>3</sup>SII

\* Equal Contribution  † Corresponding Author
π³ reconstructs visual geometry without a fixed reference view, achieving robust, state-of-the-art performance.
- [September 3, 2025] ⭐️ Training code is updated! See the `training` branch for details.
- [July 29, 2025] 📈 Evaluation code is released! See the `evaluation` branch for details.
- [July 16, 2025] 🚀 Hugging Face Demo and inference code are released!
We introduce π³, a feed-forward model for scalable, permutation-equivariant visual geometry learning. In contrast to conventional pipelines that anchor their reconstruction to a designated reference view, π³ treats all input views symmetrically and predicts camera poses and per-view point maps without designating any reference frame, making the model inherently robust to the ordering of its inputs.

A key emergent property of this simple, bias-free design is the learning of a dense and structured latent representation of the camera pose manifold. Without complex priors or specialized training schemes, this structure arises directly from training.
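To make the permutation-equivariance property concrete, here is a small illustrative check (our own sketch, not part of the official code; it assumes the installation steps and the input/output conventions described later in this README, plus a toy resolution, and tolerates small floating-point differences): shuffling the input frames should simply shuffle the per-frame predictions.

```python
import torch
from pi3.models.pi3 import Pi3

# Illustrative sanity check of permutation equivariance (a sketch; the 224x224
# resolution and tolerance are assumptions): shuffling the N input frames should
# shuffle the per-frame outputs the same way, since no frame is a fixed reference.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Pi3.from_pretrained("yyfz233/Pi3").to(device).eval()

imgs = torch.rand(1, 4, 3, 224, 224, device=device)  # (B, N, 3, H, W), values in [0, 1]
perm = torch.randperm(imgs.shape[1], device=device)

with torch.no_grad():
    out = model(imgs)
    out_perm = model(imgs[:, perm])

# Poses of the shuffled sequence should match the shuffled poses of the
# original sequence, up to numerical tolerance.
print(torch.allclose(out['camera_poses'][:, perm], out_perm['camera_poses'], atol=1e-3))
```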
First, clone the repository and install the required packages.

```bash
git clone https://github.com/yyfz/Pi3.git
cd Pi3
pip install -r requirements.txt
```

Try our example inference script. You can run it on a directory of images or a video file.
If the automatic download from Hugging Face is slow, you can download the model checkpoint manually from [here](https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors) and specify its local path using the `--ckpt` argument (one possible download snippet is shown after the argument list below).
```bash
# Run with default example video
python example.py

# Run on your own data (image folder or .mp4 file)
python example.py --data_path <path/to/your/images_dir_or_video.mp4>
```

**Optional Arguments:**

- `--data_path`: Path to the input image directory or a video file. (Default: `examples/skating.mp4`)
- `--save_path`: Path to save the output `.ply` point cloud. (Default: `examples/result.ply`)
- `--interval`: Frame sampling interval. (Default: `1` for images, `10` for video)
- `--ckpt`: Path to a custom model checkpoint file.
- `--device`: Device to run inference on. (Default: `cuda`)
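If you prefer the manual download mentioned above, one possible approach (a sketch, not part of the repository's tooling) is the `huggingface_hub` library; the repo id and filename are taken from the inference example later in this README.

```python
# Sketch: fetch the checkpoint manually with huggingface_hub (assumed workflow;
# the repo id and filename come from the Python example further below).
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="yyfz233/Pi3", filename="model.safetensors")
print(ckpt_path)  # pass this local path to example.py via --ckpt
```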
You can also launch a local Gradio demo for an interactive experience.
```bash
# Install demo-specific requirements
pip install -r requirements_demo.txt

# Launch the demo
python demo_gradio.py
```

The model takes a tensor of images and outputs a dictionary containing the reconstructed geometry.
- **Input:** A `torch.Tensor` of shape $B \times N \times 3 \times H \times W$ with pixel values in the range `[0, 1]`.
- **Output:** A `dict` with the following keys:
  - `points`: Global point cloud, obtained by unprojecting `local_points` with `camera_poses` (`torch.Tensor`, $B \times N \times H \times W \times 3$); a sketch of this unprojection follows this list.
  - `local_points`: Per-view local point maps (`torch.Tensor`, $B \times N \times H \times W \times 3$).
  - `conf`: Confidence scores for the local points; values lie in `[0, 1]` after applying `torch.sigmoid()`, higher is better (`torch.Tensor`, $B \times N \times H \times W \times 1$).
  - `camera_poses`: Camera-to-world transformation matrices, $4 \times 4$ in the OpenCV convention (`torch.Tensor`, $B \times N \times 4 \times 4$).
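To illustrate how `points` relates to `local_points` and `camera_poses`, here is a minimal sketch of the unprojection, assuming the shapes and the camera-to-world convention listed above; the helper name `unproject_local_points` is ours, and the model's internal computation may differ in detail.

```python
import torch

def unproject_local_points(local_points: torch.Tensor, camera_poses: torch.Tensor) -> torch.Tensor:
    """Map per-view local points into the global frame with camera-to-world poses.

    local_points: (B, N, H, W, 3); camera_poses: (B, N, 4, 4), OpenCV convention.
    Returns global points of shape (B, N, H, W, 3).
    """
    ones = torch.ones_like(local_points[..., :1])
    homo = torch.cat([local_points, ones], dim=-1)                 # (B, N, H, W, 4)
    # Apply each view's 4x4 pose to its own H*W homogeneous points.
    world = torch.einsum('bnij,bnhwj->bnhwi', camera_poses, homo)
    return world[..., :3]
```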
Here is a minimal example of how to run the model on a batch of images.
```python
import torch
from pi3.models.pi3 import Pi3
from pi3.utils.basic import load_images_as_tensor # Assuming you have a helper function

# --- Setup ---
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Pi3.from_pretrained("yyfz233/Pi3").to(device).eval()
# or download checkpoints from `https://huggingface.co/yyfz233/Pi3/resolve/main/model.safetensors`

# --- Load Data ---
# Load a sequence of N images into a tensor
# imgs shape: (N, 3, H, W).
# imgs value: [0, 1]
imgs = load_images_as_tensor('path/to/your/data', interval=10).to(device)

# --- Inference ---
print("Running model inference...")
# Use mixed precision for better performance on compatible GPUs
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
with torch.no_grad():
    with torch.amp.autocast('cuda', dtype=dtype):
        # Add a batch dimension -> (1, N, 3, H, W)
        results = model(imgs[None])

print("Reconstruction complete!")
# Access outputs: results['points'], results['camera_poses'] and results['local_points'].
```

Our work builds upon several fantastic open-source projects, and we would like to express our gratitude to their authors.
If you find our work useful, please consider citing:
```bibtex
@misc{wang2025pi3,
    title={$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning},
    author={Yifan Wang and Jianjun Zhou and Haoyi Zhu and Wenzheng Chang and Yang Zhou and Zizun Li and Junyi Chen and Jiangmiao Pang and Chunhua Shen and Tong He},
    year={2025},
    eprint={2507.13347},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.13347},
}
```

For academic use, this project is licensed under the 2-clause BSD License. See the LICENSE file for details. For commercial use, please contact the authors.