East China Normal University
Shanghai
Starred repositories
LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation (ICLR 2025)
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Train transformer language models with reinforcement learning.
Official repo of "Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens"
Official code for "Monet: Reasoning in Latent Visual Space Beyond Image and Language"
Official codebase for the paper "Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space"
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)
An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"
[ICLR'25] Official code for the paper 'MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs'
Pixel-Level Reasoning Model trained with RL [NeurIPS 2025]
🔥 Stable, simple, state-of-the-art VQVAE toolkit & cookbook
[NeurIPS 2024 Best Paper Award] [GPT beats diffusion 🔥] [scaling laws in visual generation 📈] Official implementation of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
We present **FOCI**, a benchmark for Fine-grained Object ClassIfication for large vision language models (LVLMs).
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Official implementation of "VIRAL: Visual Representation Alignment for MLLMs".
[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Fast and flexible image augmentation library. Paper about the library: https://www.mdpi.com/2078-2489/11/2/125
An open-source AI agent that lives in your terminal.
Up-to-date LLM adaptive-thinking papers. 🔥🔥🔥
(ICCV 2025) Official implementation of "AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models"