TASU: Text-only Alignment for Speech Understanding

TASU (Text-only Alignment for Speech Understanding) is a newly proposed training paradigm for Speech Large Language Models (Speech LLMs), with a primary focus on semantic speech understanding.
This repository contains the core implementation of the TASU algorithm.


(Figure: TASU overview)

⚙️ Environment Setup

TASU is mainly developed and tested on Huawei Ascend (910B) NPU clusters (the npu branch), but it can also be adapted to NVIDIA GPU environments. You can build the environment from the provided Dockerfile:

# Build image from Dockerfile in the current directory
docker build -t tasu:latest .

🚀 Getting Started

Once your environment is ready (either Ascend NPU or NVIDIA GPU):

  1. Clone this repository:

    git clone https://github.com/PigeonDan1/ps-slm.git
    cd ps-slm
    cd Multitask
  2. Prepare the dataset in the following format.

    Each sample is one valid JSON line (JSON Lines). Field names and constraints:

    Field   Type    Required  Description
    key     string  ✔         Globally unique ID; must not contain "/" or spaces
    task    string  ✔         Task code (ASR, EN2ZH, etc.)
    target  string  ✔         Text the model must produce (the label / decoding target)
    path    string  ✔         Audio location; two protocols are supported, see below
    GT      string  ✔ (✘)     Ground-truth transcript of the audio, used to simulate the CTC posterior from text (not used when training with audio)

    Audio format support:

    Protocol     Example Path                         Reading Hint
    plain wav    /xxx/common_voice_en_19641841.wav    direct soundfile.read
    ark offset   /xxx/data_wav.1.ark:246511401        binary seek(offset)

    Data examples:

    {"key": "common_voice_en_19315788", "task": "ASR", "target": "Raita also had feelings for her.", "GT": "raita also had feelings for her.", "path": "/aistor/sjtu/hpc_stor01/home/yangyi/data/common_voice/audio/dev/common_voice_en_19315788.wav"}
    {"key": "common_voice_en_19685643", "task": "EN2ZH", "target": "第四个是 Benson and Hedges Championship。", "GT": "the fourth was the benson and hedges championship.", "path": "/aistor/sjtu/hpc_stor01/home/yangyi/data/covost2_en2zh_mls/audio/dev/common_voice_en_19685643.wav"}

    Tasks supported: ASR, EN2ZH, EN2DE, QA, SLU_scenario (SLURP).
    (For more tasks, add corresponding prompts in /conf/multiprompt.jsonl.) A minimal sketch for reading this manifest format and both audio path protocols is given after this list.

  3. Download the pre-trained models and our checkpoints (Hugging Face / ModelScope):

    SenseVoiceSmall: https://huggingface.co/FunAudioLLM/SenseVoiceSmall

    Qwen2.5-1.5B: https://huggingface.co/Qwen/Qwen2.5-1.5B

    Checkpoints (ckpts): https://huggingface.co/yyy1421129/ps-slm or https://www.modelscope.cn/models/yyy1421129/ps-slm

    • text_only: checkpoint trained with text only
    • half_audio_finetuned: SFT with 900 h of audio, initialized from text_only/pytorch_model.bin
    • To use these checkpoints, download one and set the ckpt_path variable in scripts/decode_sensevoice.sh to the path of the downloaded checkpoint (a quick load check is sketched after this list).
  4. Run the one-click scripts:

    After adjusting the required local paths and downloading the models (see Multitask/readme for more detailed instructions), you can run:

    Core training script: /scripts/finetune_deepspeed_sensevoice.sh

    Inference script: /scripts/decode_sensevoice.sh
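
As referenced in step 2 above, the sketch below shows one minimal way to iterate over the JSONL manifest and load audio under both path protocols. It is an illustration, not part of the released code: the manifest path data/dev.jsonl is a placeholder, and the ark branch assumes the offset points at an embedded RIFF/WAV blob (as Kaldi-style wav ark writers produce).

import io
import json
import struct

import soundfile as sf  # pip install soundfile

def load_audio(path):
    """Load audio for either manifest protocol (plain wav or ark:offset)."""
    ark_path, _, offset = path.rpartition(":")
    if offset.isdigit():
        # "ark offset" protocol, e.g. /xxx/data_wav.1.ark:246511401
        with open(ark_path, "rb") as f:
            f.seek(int(offset))
            header = f.read(8)
            if header[:4] != b"RIFF":
                raise ValueError(f"no RIFF chunk at offset {offset} in {ark_path}")
            size = struct.unpack("<I", header[4:8])[0]
            blob = header + f.read(size)      # the whole embedded WAV file
            return sf.read(io.BytesIO(blob))  # (samples, sample_rate)
    # "plain wav" protocol: direct soundfile.read
    return sf.read(path)

with open("data/dev.jsonl", encoding="utf-8") as fh:  # placeholder manifest path
    for line in fh:
        sample = json.loads(line)
        missing = {"key", "task", "target", "path"} - sample.keys()
        if missing:
            raise ValueError(f"{sample.get('key')}: missing fields {missing}")
        audio, sr = load_audio(sample["path"])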
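
For step 3, a quick way to confirm that a downloaded checkpoint loads is to open it as a plain PyTorch state dict (the local directory name below is a placeholder; adjust it to wherever you saved the files):

import torch

# Loads the tensors only; the actual model is built by the training/decoding scripts.
state = torch.load("ckpts/text_only/pytorch_model.bin", map_location="cpu")
print(f"loaded {len(state)} tensors, e.g. {next(iter(state))}")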


📖 Citation

If you find TASU or this codebase useful in your research, please consider citing:

@article{peng2025tasu,
  title   = {TASU: Text-Only Alignment for Speech Understanding},
  author  = {Peng, Jing and Yang, Yi and Li, Xu and Xi, Yu and Tang, Quanwei and Fang, Yangui and Li, Junjie and Yu, Kai},
  journal = {arXiv preprint arXiv:2511.03310},
  year    = {2025},
}
