LLIA - Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models
Haojie Yu* · Zhaonian Wang* · Yihan Pan* · Meng Cheng · Hao Yang · Chao Wang · Tao Xie · Xiaoming Xu✉ · Xiaoming Wei · Xunliang Cai
*Equal Contribution ✉Corresponding Authors
TL;DR: LLIA is a real-time, audio-driven portrait video generation framework built on diffusion models, enabling low-latency interactive avatars.
[Demo videos: 001.mp4 · 002.mp4 · 003.mp4]
We propose LLIA, a novel audio-driven portrait video generation framework based on diffusion models. Our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model reaches up to 78 FPS at a resolution of 384 × 384 and 45 FPS at 512 × 512, with initial video generation latencies of 140 ms and 215 ms, respectively.
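As a quick sanity check on these figures, the reported throughput can be converted into a per-frame time budget. This is a minimal illustrative sketch: the FPS and latency numbers come from the paragraph above, while the helper function itself is hypothetical and not part of the LLIA codebase.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per generated frame (in milliseconds) at a given FPS."""
    return 1000.0 / fps

# Reported peak throughput on an NVIDIA RTX 4090D (from the text above).
for res, fps, first_frame_latency_ms in [("384x384", 78, 140), ("512x512", 45, 215)]:
    print(f"{res}: {frame_budget_ms(fps):.1f} ms/frame budget, "
          f"{first_frame_latency_ms} ms initial latency")
```

So each 384 × 384 frame must be produced in roughly 12.8 ms, and each 512 × 512 frame in roughly 22.2 ms, to sustain the reported rates.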
- June 9, 2025: 👋 We release the technical report of LLIA
- June 9, 2025: 👋 We release the project page of LLIA
- Release the technical report
- Inference
- Checkpoints