Starred repositories
MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
📖 [IEEE Sensors Journal (JSEN)] SuperVINS: A Real-Time Visual-Inertial SLAM Framework for Challenging Imaging Conditions (integrated deep learning features)
Matrix is an advanced simulation platform that integrates MuJoCo, Unreal Engine 5, and CARLA to provide high-fidelity, interactive environments for robotics research.
[ICCV'25] Unified Open-World Segmentation with Multi-Modal Prompts
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
3DGS-to-PC: Convert a 3D Gaussian splatting scene into a dense point cloud or basic mesh with advanced customisation options and high-accuracy rendered point colours
InteriorGS: 3D Gaussian Splatting Dataset of Semantically Labeled Indoor Scenes
[NeurIPS'25] Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Fully Open Framework for Democratized Multimodal Training
VGGT-X: When VGGT Meets Dense Novel View Synthesis
[NeurIPS 2025] PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
A simple state update rule to enhance length generalization for CUT3R
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
The SAIL-VL2 series model developed by the BytedanceDouyinContent Group
[CoRL 2025] Repository relating to "TrackVLA: Embodied Visual Tracking in the Wild"
A pipeline by IPNL for constructing HD semantic maps and HD vector maps.
Official code for "No time to train! Training-Free Reference-Based Instance Segmentation"
A Large-Scale Indoor-Outdoor Robot Dataset for Multi-Sensor Fusion Navigation and Mapping
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
(Preprint) ORV: 4D Occupancy-centric Robot Video Generation.
OpenFace 3.0 – open-source toolkit for facial landmark detection, action unit detection, eye-gaze estimation, and emotion recognition.