- The Chinese University of Hong Kong
- Hong Kong
- https://zongzhuofan.github.io/
Stars
Official Repo for Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
ULMEvalKit: One-Stop Eval ToolKit for Image Generation
TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Official codebase for "Self Forcing: Bridging Training and Inference in Autoregressive Video Diffusion" (NeurIPS 2025 Spotlight)
Text-audio foundation model from Boson AI
Awesome lists about framework figures in papers
The official SpeakerVid-5M data curation code.
(CVPR 2025) From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
[NeurIPS 2025] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Scalable and memory-optimized training of diffusion models
Solve Visual Understanding with Reinforced VLMs
Fully open reproduction of DeepSeek-R1
[CVPR 2025] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation
[ICML 2025] EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Efficient Triton Kernels for LLM Training
[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
[ICLR 2025] The First Multimodal Search Engine Pipeline and Benchmark for LMMs
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
[NeurIPS 2024] 💫CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
[ICCV 2023] GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
The code repository for "RCNet: Reverse Feature Pyramid and Cross-scale Shift Network for Object Detection" (ACM MM'21)
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
[ICCV 2023] Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction