Stars
MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Automatic Video Generation from Scientific Papers
Home page for Microsoft Phi-Ground tech-report
[ICCV 2025] MMReason, MLLMs, step by step, reasoning benchmark, AGI
EmoCapCLIP: Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
Building a comprehensive and handy list of papers for GUI agents
💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.
Agentic ADK is an Agent application development framework launched by Alibaba International AI Business, based on Google-ADK and Ali-LangEngine.
A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
Let's make video diffusion practical!
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
The RealEstate10K dataset from https://google.github.io/realestate10k
[CVPR 2025 (Oral)] Open implementation of "RandAR"
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
📖 This is a repository for organizing papers, code, and other resources related to unified multimodal models.
Official implementation of "Single Image Iterative Subject-driven Generation and Editing".
[ICCV 2025] FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
This repository contains the official implementation of the research paper "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV 2023.
The ultimate training toolkit for finetuning diffusion models
HunyuanVideo: A Systematic Framework For Large Video Generation Model
[ICLR 2025 Oral] TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio-Motion Embedding and Diffusion Interpolation
[ECCV 2024 Oral] PetFace: A Large-Scale Dataset and Benchmark for Animal Identification https://arxiv.org/abs/2407.13555
Inference and training library for high-quality TTS models.
This is a simple ComfyUI custom TTS node based on Parler_tts.