Official implementation of "DepthLM: Metric Depth from Vision Language Models".
We show for the first time that VLMs can achieve accuracy comparable to pure vision models on metric depth estimation, using standard text-based SFT and no architecture change, i.e., no dense prediction head or regression/regularization loss is needed. This simplicity allows DepthLM to train a single unified VLM that handles various complex 3D understanding tasks, such as speed or time estimation and metric-scale camera pose estimation, which would require different architectures or hand-crafted pipelines with pure vision models.
If you find our code useful for your research, please consider citing:
@article{cai2025depthlm,
title={DepthLM: Metric Depth from Vision Language Models},
author={Cai, Zhipeng and Yeh, Ching-Feng and Hu, Xu and Liu, Zhuang and Meyer, Gregory and Lei, Xinjie and Zhao, Changsheng and Li, Shang-Wen and Chandra, Vikas and Shi, Yangyang},
journal={arXiv preprint arXiv:2509.25413},
year={2025},
}
Zhipeng Cai, Meta Inc, homepage: https://zhipengcai.github.io/, email: czptc2h at gmail dot com.
- run
conda create -n DepthLM python=3.12
conda activate DepthLM
- run
pip install -r requirements.txt
(the code is tested with transformers version 4.51.1)
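To sanity check the environment, the installed transformers version can be printed (a minimal check, not part of the released scripts):

```python
# Minimal environment sanity check
import transformers
print(transformers.__version__)  # expected: 4.51.1
```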
| Model | Link |
|---|---|
| DepthLM (Pixtral 12B) | Download 🤗 |
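The checkpoint can also be fetched programmatically. Below is a minimal sketch using huggingface_hub; the repository id is a placeholder for the repository behind the download link, and the local directory is arbitrary:

```python
# Hedged sketch: fetch the released checkpoint with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<depthlm_hf_repo_id>",               # placeholder: use the repo behind the download link above
    local_dir="checkpoints/DepthLM-Pixtral-12B",  # arbitrary local directory
)
print("checkpoint saved to", local_dir)
```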
- We curate each training/eval dataset into
- A folder containing the images
- A jsonl file containing the corresponding camera intrinsics and 3D labels (see the format sketch after this list)
- We provide example data from the iBims1 dataset at examples/ibims1 so the code can be run quickly without any data preparation. Other images/datasets can use the same code once the data preparation steps are finished.
- Due to legal reasons, we cannot directly release the curated data. However, we provide the data curation code to enable reproduction.
- Check out each block in prepare_data.sh for the detailed data preparation steps for each dataset.
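For reference, below is a minimal sketch of how one curated sample could be read. The jsonl file name and the field names (image, intrinsics, depth) are illustrative assumptions; the actual schema is produced by the scripts referenced in prepare_data.sh:

```python
# Hedged sketch: iterate over a curated jsonl file (hypothetical file and field names).
import json

with open("examples/ibims1/annotations.jsonl") as f:  # hypothetical file name
    for line in f:
        sample = json.loads(line)
        # assumed fields: image file name, camera intrinsics, and a metric 3D/depth label
        print(sample.get("image"), sample.get("intrinsics"), sample.get("depth"))
```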
- run
bash eval.sh <path_to_your_model>
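Beyond the batch evaluation script, a single image can also be queried interactively. The sketch below uses the generic transformers image-text-to-text interface and assumes the checkpoint loads through it; the prompt wording and the example image path are illustrative, not the exact strings used by DepthLM:

```python
# Hedged sketch: ask a DepthLM checkpoint for the metric depth at a marked pixel.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "<path_to_your_model>"  # e.g., the downloaded DepthLM (Pixtral 12B) checkpoint
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("examples/ibims1/<some_image>.png")  # hypothetical example image
question = "What is the metric depth in meters at the marked pixel?"  # assumed prompt style

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```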
- Download the base model you want to train from here. Our code currently supports Qwen2.5-VL and Pixtral; please see our paper for the corresponding hyper-parameters.
- run
bash train.sh <path_to_your_model> <output_path>
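Because DepthLM uses standard text-based SFT, each training example boils down to an image, a question, and a plain-text numeric answer supervised with the usual language-modeling loss. The sketch below is illustrative only; the exact prompt wording, pixel-marking scheme, and label format are defined in the paper and the training code:

```python
# Hedged sketch of a text-based SFT sample (all paths, field names, and strings are illustrative).
sft_sample = {
    "image": "examples/ibims1/<some_image>.png",  # hypothetical image path
    "question": "What is the metric depth in meters at the marked pixel?",  # assumed prompt style
    "answer": "2.7",  # ground-truth metric depth rendered as plain text
}
```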
DepthLM is FAIR CC-BY-NC licensed, as found in the LICENSE file.