- The Hong Kong Polytechnic University
- Hong Kong SAR, China
- [email protected]
Stars
[ACM MM Award] AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
Official implementation of the paper "ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification"
A lightweight, flexible, and easy-to-use Python tool for generating and exporting JianYing (剪映) draft files, enabling fully automated video editing/remix pipelines. The CapCut version of this project is under development at https://github.com/GuanYixuan/pyCapCut
Di♪♪Rhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
🚀 AI voice mimicry: Clone a voice in 5 seconds to generate arbitrary speech content in real time
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining
HighRateMOS is the first non-intrusive MOS prediction model that explicitly models sampling rates, achieving first place in five out of eight metrics in the AudioMOS Challenge 2025 Track 3.
This repo contains my attempt to create a Speaker Recognition and Verification system using SideKit-1.3.1
An implementation of local windowed attention for language modeling
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT
A PyTorch implementation of the Transformer model in "Attention is All You Need".
[NeurIPS 2025] Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
[ACM CCS'24] SafeEar: Content Privacy-Preserving Audio Deepfake Detection
AudioTrust: Benchmarking the Multi-faceted Trustworthiness of Audio Large Language Models
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
A novel human-interaction method for real-time speech extraction on headphones.
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
Simple project webpage template. Originally used in Colorful Image Colorization. ECCV, 2016.
The most cited deep learning papers
SoftVC VITS Singing Voice Conversion
Audio-JEPA is an adaptation of the Joint-Embedding Predictive Architecture (JEPA) for self-supervised audio representation learning. Built upon the I-JEPA paradigm, it uses a Vision Transformer (Vi…
zero-shot voice conversion & singing voice conversion, with real-time support