Stars
DDN: A novel generative model with simple principles and unique properties. (ICLR 2025)
Speech Human Evaluation Estimation Toolkit (SHEET)
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
Xwdit / espeak-ng
Forked from espeak-ng/espeak-ng. eSpeak NG is an open source speech synthesizer that supports more than a hundred languages and accents.
✨ Split text by languages (e.g. 你喜欢看アニメ吗 -> 你喜欢看 | アニメ | 吗) for NLP tasks (e.g. parse, TTS). Powered by fasttext and budoux
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.
[SIGGRAPH Asia 2024, Journal Track] ToonCrafter: Generative Cartoon Interpolation
FlashMLA: Efficient Multi-head Latent Attention Kernels
XiaoGua RTC Firmware is firmware for the ESP32 chip, designed by zideai.com.
A generative world for general-purpose robotics & embodied AI learning.
[ICCV 2025] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Acoustic Echo Canceller for Mobile (AECM) module, ported from WebRTC
Zero-shot voice conversion and singing voice conversion, with real-time support
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
STM32 extension for working with STM32 and CubeMX in VSCode
Styled banners for your README, made with HTML/CSS in SVG!
AI powered speech denoising and enhancement
pkuseg: a toolkit for multi-domain Chinese word segmentation
A high-throughput and memory-efficient inference and serving engine for LLMs