Tarsier: a family of large-scale video-language models designed to generate high-quality video descriptions, with strong general video understanding capability.

Python 497 28 Updated Aug 14, 2025

xinghaow99 / pbs-attn

Sparser Block-Sparse Attention via Token Permutation

Python 22 Updated Oct 27, 2025

GiantAILab / DiaMoE-TTS

Official code for "DiaMoE-TTS: A Unified IPA-based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation"

Python 157 12 Updated Oct 20, 2025

meituan-longcat / LongCat-Video

Python 708 40 Updated Oct 29, 2025

kyutai-labs / nanoGPTaudio

Forked from karpathy/nanoGPT

Code for the blog "Neural audio codecs: how to get audio into LLMs"

Python 106 2 Updated Oct 20, 2025

NVlabs / OmniVinci

OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.

Python 342 35 Updated Oct 27, 2025

knottwill / sesame-finetune

Finetune Sesame AI's conversational speech model on new languages and voices. Blog post: https://blog.speechmatics.com/sesame-finetune

Python 89 9 Updated Sep 27, 2025

WooooDyy / BAPO

Code for the paper "BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping" by Zhiheng Xi et al.

Python 71 2 Updated Oct 25, 2025

Soul-AILab / SAC

Training, inference, and testing of the SAC speech codec model.

Python 76 4 Updated Oct 24, 2025

brandoncarone / MUSE_music_benchmark

Music Benchmark

Python 2 Updated Oct 23, 2025

facebookresearch / chameleon

Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.

Python 2,061 117 Updated Jul 29, 2024

HumanMLLM / HumanOmniV2

Python 136 7 Updated Jul 31, 2025

antgroup / HumanSense

Python 19 3 Updated Sep 26, 2025

ttsds / ttsds

The TTSDS benchmark evaluates synthetic speech quality by considering prosody, speaker identity, and intelligibility, comparing these factors with real speech and noise datasets.

Python 66 5 Updated Sep 29, 2025

TXH-mercury / VALOR

[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Python 304 18 Updated Dec 25, 2024

EvolvingLMMs-Lab / NEO

NEO Series: Native Vision-Language Models from First Principles

Python 205 11 Updated Oct 21, 2025

meta-pytorch / torchforge

PyTorch-native post-training at scale

Python 451 45 Updated Oct 29, 2025

mshumer / interactive-sora

Python 162 15 Updated Oct 28, 2025

pytorch / tutorials

PyTorch tutorials.

Python 8,860 4,284 Updated Oct 26, 2025

anthropics / skills

Public repository for Skills

Python 14,324 1,130 Updated Oct 18, 2025

KusionStack / kusion

Declarative Intent Driven Platform Orchestrator for Internal Developer Platform (IDP).

Go 1,181 94 Updated Aug 28, 2025

AmphionTeam / TaDiCodec

This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Lan…

Python 49 1 Updated Sep 21, 2025

thuml / MiniVeo3-Reasoner

Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give it a star 🌟 if you find it useful.

Python 165 3 Updated Oct 12, 2025

opendatalab / laion5b-downloader

Python 116 9 Updated May 16, 2023

meituan-longcat / LongCat-Audio-Codec

LongCat Audio Tokenizer and Detokenizer

Python 184 11 Updated Oct 20, 2025

showlab / livecc

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025)

Python 295 37 Updated Oct 14, 2025

Qinyuan Cheng xiami2019
