Stars
A project on visual spatial reasoning tasks: SIBench
Fully Open Framework for Democratized Multimodal Training
Official Code for "Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search"
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
verl: Volcano Engine Reinforcement Learning for LLMs
Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"
The implementation of Extreme Viewpoint 4D Video Generation
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Qwen3 is the large language model series developed by the Qwen team, Alibaba Cloud.
Open source repo for Locate 3D Model, 3D-JEPA and Locate 3D Dataset
[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
The Next Step Forward in Multimodal LLM Alignment
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"
Official repo and evaluation implementation of VSI-Bench
A generative world for general-purpose robotics & embodied AI learning.
[ICCV 2025] Official Implementation for "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
Robust Speech Recognition via Large-Scale Weak Supervision
[ICLR 2025] The First Multimodal Search Engine Pipeline and Benchmark for LMMs
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
[TPAMI 2025] ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Controllable video and image generation: SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.