
# PAL: Probing Audio Encoders via LLMs

This repository contains the code and other resources for the paper *PAL: Probing Audio Encoders via LLMs - A Study of Information Transfer from Audio Encoders to LLMs*.

[Project Page](https://ta012.github.io/PAL/) | [arXiv Paper](https://arxiv.org/abs/2506.10423)


## Abstract

Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine-listening applications, yet efficient transfer of rich audio semantics from audio encoders to LLMs remains underexplored. The most widely used integration paradigm projects the audio encoder output tokens into the LLM input space (e.g., via an MLP or a Q-Former), then *prepends or inserts* them into the text token sequence. We refer to this generic scheme as *Prepend to the LLM's input token space* (PLITS) integration. We propose an efficient alternative, **L**ightweight **A**udio **L**LM Integration (**LAL**). LAL introduces audio representations solely via the attention mechanism within different layers of the LLM, bypassing its feedforward module. LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs. Our design significantly reduces computational overhead compared to existing integration approaches. Observing with Whisper that the speech encoder benefits from PLITS integration, we propose an audio-encoder-aware approach for efficiently **P**robing **A**udio encoders via **L**LM (**PAL**), which employs PLITS integration for Whisper and LAL for general audio encoders. Under an identical training curriculum, **LAL** consistently maintains performance or outperforms existing integration approaches across multiple base LLMs and tasks. For general audio tasks, LAL's improvement is up to 30% over a strong PLITS baseline while reducing memory usage by up to 64.1% and increasing throughput by up to 247.5%. Furthermore, our general audio-music-speech LLM, **PAL**, performs on par with a fully PLITS-integration-based system, but with substantially improved computational and memory efficiency. Project page: https://ta012.github.io/PAL/
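The released code is not yet available (see Status below), but the core LAL idea described in the abstract, audio entering the LLM only as extra keys and values in attention while bypassing the feedforward module, can be sketched roughly as follows. This is a minimal illustrative sketch under our own assumptions, not the paper's implementation: the class name `LALAttentionBlock`, the projections `audio_k_proj`/`audio_v_proj`, and all dimensions are hypothetical, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LALAttentionBlock(nn.Module):
    """Illustrative LAL-style block (hypothetical): audio features enter
    only as extra keys/values in attention and bypass the feedforward."""

    def __init__(self, d_model: int, n_heads: int, d_audio: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Assumed lightweight projections from the audio encoder's feature
        # space into the LLM's key/value space (names are hypothetical).
        self.audio_k_proj = nn.Linear(d_audio, d_model)
        self.audio_v_proj = nn.Linear(d_audio, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text:  (batch, n_text, d_model)  -- LLM hidden states
        # audio: (batch, n_audio, d_audio) -- audio encoder outputs
        h = self.norm1(text)
        q = self._heads(self.q_proj(h))  # queries come from text only
        # Audio contributes only keys/values, so it influences the block
        # purely through the attention mechanism.
        k = torch.cat([self._heads(self.k_proj(h)),
                       self._heads(self.audio_k_proj(audio))], dim=2)
        v = torch.cat([self._heads(self.v_proj(h)),
                       self._heads(self.audio_v_proj(audio))], dim=2)
        attn = F.scaled_dot_product_attention(q, k, v)
        text = text + self.out_proj(attn.transpose(1, 2).reshape(text.shape))
        # The feedforward runs on text positions only; audio tokens never
        # pass through it, which is where the compute savings come from.
        return text + self.ffn(self.norm2(text))


# Quick shape check with arbitrary dimensions.
block = LALAttentionBlock(d_model=512, n_heads=8, d_audio=768)
out = block(torch.randn(2, 16, 512), torch.randn(2, 50, 768))
print(out.shape)  # torch.Size([2, 16, 512])
```

Because audio tokens contribute only keys and values, they never occupy positions in the text sequence, so the feedforward module and per-token output computation scale with the text length alone; this is consistent with the memory and throughput gains the abstract reports, though the paper's actual architecture may differ.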

## Status

The code and models will be made available here soon.

## Citation

If you find our work useful, please consider citing our paper:

```bibtex
@misc{alex2025palprobingaudioencoders,
      title={PAL: Probing Audio Encoders via LLMs -- A Study of Information Transfer from Audio Encoders to LLMs},
      author={Tony Alex and Wish Suharitdamrong and Sara Atito and Armin Mustafa and Philip J. B. Jackson and Imran Razzak and Muhammad Awais},
      year={2025},
      eprint={2506.10423},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.10423},
}
```
