Stars
[MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."
Official repo of "Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens"
Official code for the paper "N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models"
EO: Open-source Unified Embodied Foundation Model Series
NeurIPS 2025 Spotlight; ICLR 2024 Spotlight; CVPR 2024; EMNLP 2024
Official implementation of the paper "Transfer between Modalities with MetaQueries"
Code for the paper "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors"
Official implementation of DepthLM
Official implementation of "C3G: Learning Compact 3D Representations with 2K Gaussians"
NEO Series: Native Vision-Language Models from First Principles
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
Cambrian-S: Towards Spatial Supersensing in Video
HunyuanVideo-1.5: A leading lightweight video generation model
[NeurIPS 2025 DB Track] 3EED: Ground Everything Everywhere in 3D
PyTorch implementation of JiT (https://arxiv.org/abs/2511.13720)
Wan: Open and Advanced Large-Scale Video Generative Models
[Awesome-Spatial-VLMs] The official, community-maintained resource for the survey paper "Spatial Intelligence in Vision-Language Models: A Comprehensive Survey."
Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
A high-throughput and memory-efficient inference and serving engine for LLMs
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[ICCV 2025] SuperDec: 3D Scene Decomposition with Superquadric Primitives