- Fudan University & Sun Yat-Sen University
- Shanghai, China
- [email protected]
- @cheng_qinyuan
- https://xiami2019.github.io/
Stars
Post-training with Tinker
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with good general video understanding capability.
Sparser Block-Sparse Attention via Token Permutation
Official code for "DiaMoE-TTS: A Unified IPA-based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation"
kyutai-labs / nanoGPTaudio
Forked from karpathy/nanoGPT
Code for the blog "Neural audio codecs: how to get audio into LLMs"
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
Finetune Sesame AI's conversational speech model on new languages and voices. Blog post: https://blog.speechmatics.com/sesame-finetune
Code for the paper "BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping" by Zhiheng Xi et al.
Training, inference, and testing of the SAC speech codec model.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
The TTSDS benchmark evaluates synthetic speech quality by considering prosody, speaker identity, and intelligibility, comparing these factors with real speech and noise datasets.
[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
NEO Series: Native Vision-Language Models from First Principles
Declarative Intent Driven Platform Orchestrator for Internal Developer Platform (IDP).
This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Lan…
Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give it a star 🌟 if you find it useful.
LongCat Audio Tokenizer and Detokenizer
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025)