Stars
🎯 Read research papers faster with AI. Resophy is an HTML-based AI paper reader with: 🤖 AI Translation & Analysis — instantly understand structure, contributions, and results 🚀 Daily arXiv Recommen…
Scalable toolkit for efficient model reinforcement
Official implementation of URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding (AAAI 2026 Oral).
Tongyi Deep Research, the Leading Open-source Deep Research Agent
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & V…
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Ongoing research training transformer models at scale
verl: Volcano Engine Reinforcement Learning for LLMs
MiMo-Audio: Audio Language Models are Few-Shot Learners
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
OSUM & OSUM-EChat, open speech understanding model and empathetic spoken chatbot based on it, open-sourced by ASLP@NPU.
Official PyTorch code for Deep Audio-Signal Holistic Embeddings
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Open-source framework for conversational voice AI agents
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Andr…
HelloWorldU / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
Python library for audio and music analysis
This repository includes the code to reproduce our paper "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation".
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
High-quality and streaming Speech-to-Speech interactive agent in a single file. A streaming, full-duplex voice interaction prototype agent implemented in just one file!
Efficient audio understanding with general audio captions
A Python library for audio data augmentation. Useful for making audio ML models work well in the real world, not just in the lab.
Production First and Production Ready End-to-End Speech Recognition Toolkit