TASU (Text-only Alignment for Speech Understanding) is a newly proposed training paradigm for Speech Large Language Models (Speech LLMs), with a primary focus on semantic speech understanding.
This repository contains the core implementation of the TASU algorithm.
TASU is mainly developed and tested on Huawei Ascend (910B) NPU clusters (the npu branch), but it can also be adapted to NVIDIA GPU environments. You can build the environment from the provided Dockerfile:
```bash
# Build the image from the Dockerfile in the current directory
docker build -t tasu:latest .
```

Once your environment is ready (either Ascend NPU or NVIDIA GPU):
- Clone this repository:

  ```bash
  git clone https://github.com/PigeonDan1/ps-slm.git
  cd ps-slm
  cd Multitask
  ```
- Prepare the dataset in JSON Lines format. Each sample is one valid JSON line (a minimal reading sketch is shown after this step). Field names and constraints:

  | Field | Type | Required | Description |
  | --- | --- | --- | --- |
  | key | string | ✔ | Globally unique ID; no `/` or spaces |
  | task | string | ✔ | Task code (ASR, EN2ZH, etc.) |
  | target | string | ✔ | Text that the model must produce (label / decoding target) |
  | path | string | ✔ | Audio location; two protocols supported, see below |
  | GT | string | ✔(✘) | Audio ground truth for the text-simulated CTC posterior (not used when training with audio) |

  Audio format support:

  | Protocol | Example Path | Reading Hint |
  | --- | --- | --- |
  | plain wav | /xxx/common_voice_en_19641841.wav | direct soundfile.read |
  | ark offset | /xxx/data_wav.1.ark:246511401 | binary seek(offset) |

  Data examples:

  ```json
  {"key": "common_voice_en_19315788", "task": "ASR", "target": "Raita also had feelings for her.", "GT": "raita also had feelings for her.", "path": "/aistor/sjtu/hpc_stor01/home/yangyi/data/common_voice/audio/dev/common_voice_en_19315788.wav"}
  {"key": "common_voice_en_19685643", "task": "EN2ZH", "target": "第四个是 Benson and Hedges Championship。", "GT": "the fourth was the benson and hedges championship.", "path": "/aistor/sjtu/hpc_stor01/home/yangyi/data/covost2_en2zh_mls/audio/dev/common_voice_en_19685643.wav"}
  ```

  Tasks supported: ASR, EN2ZH, EN2DE, QA, SLU_scenario (SLURP). For more tasks, add corresponding prompts in conf/multiprompt.jsonl.
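  For reference, here is a minimal sketch of reading one manifest line and its audio. It is an illustration, not the repo's actual dataloader: the field names follow the table above, `kaldiio` is assumed as one common way to resolve `file.ark:offset` specifiers, and `train.jsonl` is a placeholder file name.

  ```python
  # Hypothetical sketch: read one JSONL manifest line and load its audio.
  import json

  import kaldiio        # assumption: one common reader for "file.ark:offset"
  import soundfile as sf

  REQUIRED = {"key", "task", "target", "path"}  # "GT" only for text simulation

  def load_sample(line: str) -> dict:
      """Parse one JSON line and check that the required fields exist."""
      sample = json.loads(line)
      missing = REQUIRED - sample.keys()
      if missing:
          raise ValueError(f"manifest line missing fields: {sorted(missing)}")
      return sample

  def read_audio(path: str):
      """Return (waveform, sample_rate) for either supported path protocol."""
      if ".ark:" in path:
          # ark offset, e.g. "/xxx/data_wav.1.ark:246511401": kaldiio seeks to
          # the byte offset and decodes the binary record (assumption: the wav
          # data is stored in a kaldiio-readable ark).
          rate, wav = kaldiio.load_mat(path)
          return wav, rate
      # plain wav: direct soundfile.read
      return sf.read(path)

  if __name__ == "__main__":
      with open("train.jsonl", encoding="utf-8") as f:  # placeholder name
          for line in f:
              sample = load_sample(line)
              wav, rate = read_audio(sample["path"])
  ```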
- Download the pre-trained models and our checkpoints (Hugging Face):

  - SenseVoiceSmall: https://huggingface.co/FunAudioLLM/SenseVoiceSmall
  - Qwen2.5-1.5B: https://huggingface.co/Qwen/Qwen2.5-1.5B
  - ckpts: https://huggingface.co/yyy1421129/ps-slm or https://www.modelscope.cn/models/yyy1421129/ps-slm
    - text_only: checkpoint trained with text only
    - half_audio_finetuned: SFT on 900 h of audio, starting from text_only/pytorch_model.bin

  To use these checkpoints, download them and set the ckpt_path variable in scripts/decode_sensevoice.sh to the path of the downloaded checkpoint.
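  The checkpoints can also be fetched programmatically; the sketch below assumes the huggingface_hub package and the Hugging Face repo id listed above. Downloading manually from the web UI or from ModelScope works just as well.

  ```python
  # Hypothetical sketch: fetch the released checkpoints with huggingface_hub.
  from huggingface_hub import snapshot_download

  # Downloads the whole model repo and returns the local directory path.
  ckpt_dir = snapshot_download(repo_id="yyy1421129/ps-slm")
  print(ckpt_dir)  # point ckpt_path in scripts/decode_sensevoice.sh here
  ```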
- One-Click Script: after updating the essential local paths and downloading the models (more detailed instructions in Multitask/readme), you can run:

  Core training script:

  ```bash
  bash scripts/finetune_deepspeed_sensevoice.sh
  ```

  Inference script:

  ```bash
  bash scripts/decode_sensevoice.sh
  ```
If you find TASU or this codebase useful in your research, please consider citing:
```bibtex
@article{peng2025tasu,
  title   = {TASU: Text-Only Alignment for Speech Understanding},
  author  = {Peng, Jing and Yang, Yi and Li, Xu and Xi, Yu and Tang, Quanwei and Fang, Yangui and Li, Junjie and Yu, Kai},
  journal = {arXiv preprint arXiv:2511.03310},
  year    = {2025},
}
```