CUHK MMLab
Hong Kong
- https://zrrskywalker.github.io/
The official implementation of the paper "Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation"
Official repository for the paper "DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation"
The first interleaved framework for textual reasoning within the visual generation process
The official repository of BEAR: Benchmarking and Enhancing Multimodal Language Models with Atomic Embodied Capabilities
AHN: Artificial Hippocampus Networks for Efficient Long-Context Modeling
ULMEvalKit: One-Stop Eval ToolKit for Image Generation
This is the official repository for the paper "FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark"
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
[NeurIPS 2025] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[ICCV 2025] Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
[NeurIPS 2025] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Official repository for "TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving"
Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
The official repository of the paper "RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation"
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Wan: Open and Advanced Large-Scale Video Generative Models
MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
[CVPR 2025] The First Investigation of CoT Reasoning (RL, TTS, Reflection) in Image Generation
[ICCV 2025] Official code for "Improving Generalist Model with Domain-Specific Experts"
[CVPR 2025] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation
Official WACV 2025 code for Point-GN: A non-parametric, training-free method for 3D point cloud classification using Gaussian Positional Encoding (GPE). No training, no parameters, state-of-the-art…
Training-free Regional Prompting for Diffusion Transformers 🔥
[ICLR 2025] A versatile image-to-image visual assistant, designed for image generation, manipulation, and translation based on free-form user instructions.
[ICLR 2025] The first multimodal search engine pipeline and benchmark for LMMs