Stars
SteerMoE: Efficient Audio-Language Models with Preserved Reasoning Capabilities
[NAACL 2025 Findings] Continuous Speech Tokenizer in Text To Speech
[AAAI 2026] DIFFA: Large Language Diffusion Models Can Listen and Understand
A powerful 3B-parameter, LLM-based audio editing model trained with reinforcement learning that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech
The official GitHub page for the survey paper "Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey". The paper is under review.
OmniGen2: Exploration to Advanced Multimodal Generation.
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Unofficial PyTorch implementation of the paper "Mean Flows for One-step Generative Modeling" by Geng et al.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
A TTS model capable of generating ultra-realistic dialogue in one pass.
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
Open-source framework for conversational voice AI agents
Dippy Synthetic Speech Subnet
Generative models for conditional audio generation
Best-practice TTS based on BERT and VITS, with some features from Microsoft's NaturalSpeech; supports ONNX streaming output.
A generative speech model for daily dialogue.
SALMONN family: A suite of advanced multi-modal LLMs
A Framework for Speech, Language, Audio, and Music Processing with Large Language Models
✨✨Latest Advances on Multimodal Large Language Models
Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Zero-Shot Speech Editing and Text-to-Speech in the Wild