- HKUST
- Hong Kong | Vietnam
- https://tkpham3105.github.io/
Stars
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
(ICCV 2025) "Principal Components" Enable A New Language of Images
Official code for the CVPR 2025 paper "Navigation World Models".
Download AudioSet data very quickly with youtube-dl, ffmpeg, and Python multiprocessing
This is a repository to collect training-free algorithms for visual generation and manipulation
Visualization of DiT self attention features
Official PyTorch Implementation for “DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video” (ECCV 2024)
[ICCV 2025] Official implementation of the paper: REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
Wan: Open and Advanced Large-Scale Video Generative Models
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
[CVPR 2024] Code release for "InstanceDiffusion: Instance-level Control for Image Generation"
Fine-tune Stable Audio Open with DiT ControlNet.
Improved Implementation for Training GLIGEN: Open-Set Grounded Text-to-Image Generation
Source code for "Synchformer: Efficient Synchronization from Sparse Cues" (ICASSP 2024)
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Get up and running with GLM-4.7, DeepSeek, gpt-oss, Qwen, Gemma and other models.
[ICLR 2024] LLM-grounded Video Diffusion Models (LVD): official implementation for the LVD paper
[ECCV 2024 Oral] Audio-Synchronized Visual Animation
Lumina-T2X is a unified framework for Text to Any Modality Generation
[ACM MM 2024] Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization
[NeurIPS 2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
[CSUR] A Survey on Video Diffusion Models
[TMLR 2025] Latte: Latent Diffusion Transformer for Video Generation.
[CVPR 2024] Official repository for "MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model"
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation