TASU (Text-only Alignment for Speech Understanding) is a newly proposed training paradigm for Speech Large Language Models (Speech LLMs), with a primary focus on semantic speech understanding.
This repository contains the core implementation of the TASU algorithm.
TASU is mainly developed and tested on Huawei Ascend (910B) NPU clusters (the npu branch), but it can also be adapted to NVIDIA GPU environments. You can build the environment from the provided Dockerfile:
```bash
# Build the image from the Dockerfile in the current directory
docker build -t tasu:latest .
```

Once your environment is ready (either Ascend NPU or NVIDIA GPU):
- Clone this repository:

  ```bash
  git clone https://github.com/PigeonDan1/ps-slm.git
  cd ps-slm
  cd Multitask
  ```
- Prepare the dataset in JSON Lines format. Each sample is one valid JSON line (a minimal reading sketch is shown after this step). Field names and constraints:

  | Field | Type | Required | Description |
  | --- | --- | --- | --- |
  | key | string | ✔ | Globally unique ID; no `/` or spaces |
  | task | string | ✔ | Task code (ASR, EN2ZH, etc.) |
  | target | string | ✔ | Text that the model must produce (label / decoding target) |
  | path | string | ✔ | Audio location; two protocols supported, see below |
  | GT | string | ✔(✘) | Audio ground truth for the text-simulated CTC posterior (not used when training with audio) |

  Audio format support:

  | Protocol | Example Path | Reading Hint |
  | --- | --- | --- |
  | plain wav | /xxx/common_voice_en_19641841.wav | direct soundfile.read |
  | ark offset | /xxx/data_wav.1.ark:246511401 | binary seek(offset) |

  Data examples:

  ```json
  {"key": "common_voice_en_19315788", "task": "ASR", "target": "Raita also had feelings for her.", "GT": "raita also had feelings for her.", "path": "/aistor/sjtu/hpc_stor01/home/yangyi/data/common_voice/audio/dev/common_voice_en_19315788.wav"}
  {"key": "common_voice_en_19685643", "task": "EN2ZH", "target": "第四个是 Benson and Hedges Championship。", "GT": "the fourth was the benson and hedges championship.", "path": "/aistor/sjtu/hpc_stor01/home/yangyi/data/covost2_en2zh_mls/audio/dev/common_voice_en_19685643.wav"}
  ```

  Tasks supported: ASR, EN2ZH, EN2DE, QA, SLU_scenario (SLURP). For more tasks, add corresponding prompts in conf/multiprompt.jsonl.
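  For reference, here is a minimal sketch of reading one manifest line and its audio. It is an illustration, not the repo's actual dataloader: the field names follow the table above, `kaldiio` is assumed as one common way to resolve `file.ark:offset` specifiers, and `train.jsonl` is a placeholder file name.

  ```python
  # Hypothetical sketch: read one JSONL manifest line and load its audio.
  import json

  import kaldiio        # assumption: one common reader for "file.ark:offset"
  import soundfile as sf

  REQUIRED = {"key", "task", "target", "path"}  # "GT" only for text simulation

  def load_sample(line: str) -> dict:
      """Parse one JSON line and check that the required fields exist."""
      sample = json.loads(line)
      missing = REQUIRED - sample.keys()
      if missing:
          raise ValueError(f"manifest line missing fields: {sorted(missing)}")
      return sample

  def read_audio(path: str):
      """Return (waveform, sample_rate) for either supported path protocol."""
      if ".ark:" in path:
          # ark offset, e.g. "/xxx/data_wav.1.ark:246511401": kaldiio seeks to
          # the byte offset and decodes the binary record (assumption: the wav
          # data is stored in a kaldiio-readable ark).
          rate, wav = kaldiio.load_mat(path)
          return wav, rate
      # plain wav: direct soundfile.read
      return sf.read(path)

  if __name__ == "__main__":
      with open("train.jsonl", encoding="utf-8") as f:  # placeholder name
          for line in f:
              sample = load_sample(line)
              wav, rate = read_audio(sample["path"])
  ```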
- Download the pre-trained models and our checkpoints (Hugging Face):

  - SenseVoiceSmall: https://huggingface.co/FunAudioLLM/SenseVoiceSmall
  - Qwen2.5-1.5B: https://huggingface.co/Qwen/Qwen2.5-1.5B
  - ckpts: https://huggingface.co/yyy1421129/ps-slm or https://www.modelscope.cn/models/yyy1421129/ps-slm
    - text_only: checkpoint trained with text only
    - half_audio_finetuned: SFT on 900 h of audio, starting from text_only/pytorch_model.bin

  To use these checkpoints, download them and set the ckpt_path variable in scripts/decode_sensevoice.sh to the path of the downloaded checkpoint.
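  The checkpoints can also be fetched programmatically; the sketch below assumes the huggingface_hub package and the Hugging Face repo id listed above. Downloading manually from the web UI or from ModelScope works just as well.

  ```python
  # Hypothetical sketch: fetch the released checkpoints with huggingface_hub.
  from huggingface_hub import snapshot_download

  # Downloads the whole model repo and returns the local directory path.
  ckpt_dir = snapshot_download(repo_id="yyy1421129/ps-slm")
  print(ckpt_dir)  # point ckpt_path in scripts/decode_sensevoice.sh here
  ```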
- One-Click Script: after updating the essential local paths and downloading the models (more detailed instructions in Multitask/readme), you can run:

  Core training script:

  ```bash
  bash scripts/finetune_deepspeed_sensevoice.sh
  ```

  Inference script:

  ```bash
  bash scripts/decode_sensevoice.sh
  ```
If you find TASU or this codebase useful in your research, please consider citing:
```bibtex
@article{peng2025tasu,
  title   = {TASU: Text-Only Alignment for Speech Understanding},
  author  = {Peng, Jing and Yang, Yi and Li, Xu and Xi, Yu and Tang, Quanwei and Fang, Yangui and Li, Junjie and Yu, Kai},
  journal = {arXiv preprint arXiv:2511.03310},
  year    = {2025},
}
```