Stars
Fully Open Framework for Democratized Multimodal Training
Cook up amazing multimodal AI applications effortlessly with MiniCPM-o
Official repository for VisionZip (CVPR 2025)
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Survey: https://arxiv.org/pdf/2507.20198
Long-RL: Scaling RL to Long Sequences (NeurIPS 2025)
Open-source offline translation library written in Python
Universal memory layer for AI agents; announcing OpenMemory MCP for local, secure memory management.
This repository contains the official implementation of the research paper, "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization" ICCV 2023
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
A Simple Framework of Small-scale LMMs for Video Understanding
✨First Open-Source R1-like Video-LLM [2025/02/18]
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
The development and future prospects of large multimodal reasoning models.
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs).
Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
Qwen2.5-Omni is an end-to-end multimodal model from the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, and of generating speech in real time.
[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
An Autonomous LLM Agent for Complex Task Solving
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
✨✨Latest Advances on Multimodal Large Language Models
[CVPR 2025 (Oral)] Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.