Stars
Statistical Learning course at USTC. Review materials for the USTC Statistical Learning course (taught by Liu Dong).
Collection of papers about video-audio understanding
Official implementation of RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
A Comprehensive Dataset for Advanced Image Generation and Editing
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grained visual understanding".
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
A new generation of CLIP with fine-grained discrimination capability, ICML 2025
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[ICCV 2025] Code Release of Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
A repo tracking the latest autoregressive visual generation papers.
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
GPT-ImgEval: Evaluating GPT-4o’s state-of-the-art image generation capabilities
Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"
A collection of awesome text-to-image generation studies.
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
✨✨ [ICLR 2026] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Align Anything: Training All-modality Model with Feedback
📖 A repository organizing papers, code, and other resources related to unified multimodal models.
[ICCV 2025] VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
A collection of multimodal reasoning papers, codes, datasets, benchmarks and resources.
Paper List of Inference/Test Time Scaling/Computing