- Zhejiang University
- https://liuhuadai.github.io
Stars
Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions.
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Open SoundStream-ish VAE codecs for downstream neural audio synthesis
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Official Implementation of "MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation"
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Educational implementation of the Discrete Flow Matching paper
Krea Realtime 14B. An open-source realtime AI video model.
Official implementation of "Continuous Autoregressive Language Models"
[NeurIPS 2025] Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Official implementation of "UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing"
A comprehensive list of papers on the definition of World Models and the use of World Models for General Video Generation, Embodied AI, and Autonomous Driving, including papers, code, and related websites.
PyTorch code and models for VJEPA2 self-supervised learning from video.
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
Official implementation of "DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training".
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
The official repo for SpaceVista: All-Scale Visual Spatial Reasoning from mm to km.
Unified automatic quality assessment for speech, music, and sound.
Scalable and memory-optimized training of diffusion models
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
VGGSounder, a multi-label audio-visual classification dataset with modality annotations.
Reference PyTorch implementation and models for DINOv3
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
The official repo of Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model proposed by Alibaba Cloud.