Stars
starVLA: A Lego-like Codebase for Vision-Language-Action Model Development
Cosmos-Reason1 models understand physical common sense and generate appropriate embodied decisions in natural language through long chain-of-thought reasoning.
[NeurIPS 2025 spotlight] Official implementation for "FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving"
Fully Open Framework for Democratized Multimodal Training
[NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models"
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Accessible large language models via k-bit quantization for PyTorch.
Wan: Open and Advanced Large-Scale Video Generative Models
Code of π^3: Permutation-Equivariant Visual Geometry Learning
[ICCV 2025] A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World
Paper list in the survey: A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
A curated list of awesome papers for reconstructing 4D spatial intelligence from video. (arXiv 2507.21045)
Official implementation of Continuous 3D Perception Model with Persistent State
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
[ICCV 2025] Official code of "ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation"
Code for Streaming 4D Visual Geometry Transformer
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
Re-implementation of the pi0 vision-language-action (VLA) model from Physical Intelligence
The code for the paper "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors"
[CVPR 2025] The code for the paper "Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding"
🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
[CVPR 2025, Spotlight] SimLingo (CarLLava): Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment