
Conversation

@Adel-Moumen (Collaborator) commented Apr 11, 2025

What does this PR do?

This PR adds support for SpeechLLM-based ASR on LibriSpeech. Feature extraction, training, greedy search, and inference scripts are provided.

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@TParcollet added this to the v1.1.0 milestone on Oct 9, 2025
Comment on lines +78 to +83
# Capture config-only overrides to avoid passing them to from_pretrained
self._config_overrides = {}
if "output_hidden_states" in kwargs:
self._config_overrides["output_hidden_states"] = kwargs.pop(
"output_hidden_states"
)
@Adel-Moumen (Collaborator, Author):

TBH, I don't remember in what scenario you want to set output_hidden_states=True
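For context, a minimal sketch of the pattern under discussion, assuming a Hugging Face AutoConfig/AutoModel backend; the function name and wrapper shape here are illustrative, not the PR's actual code:

# Hypothetical sketch: apply config-only overrides (like output_hidden_states)
# to the HF config rather than passing them to from_pretrained directly.
from transformers import AutoConfig, AutoModel

def load_pretrained(source, **kwargs):
    config_overrides = {}
    if "output_hidden_states" in kwargs:
        config_overrides["output_hidden_states"] = kwargs.pop("output_hidden_states")

    config = AutoConfig.from_pretrained(source)
    for key, value in config_overrides.items():
        setattr(config, key, value)

    # Remaining kwargs go to from_pretrained as usual.
    return AutoModel.from_pretrained(source, config=config, **kwargs)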

Comment on lines +422 to +427
tokens_bos = torch.LongTensor(
[start_of_audio_index]
+ [end_of_audio_index]
+ prompt_ids
+ [bos_index]
+ tokens_list
@Adel-Moumen (Collaborator, Author):

One thing: it's really hard to know whether an LM requires bos/eos (e.g. https://huggingface.co/blog/qgallouedec/gotchas-in-tokenizer-behavior). So, ideally, I think we need a proper prompt function that builds the prompt depending on the available tokens (e.g. if eos is None, don't append it).
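A minimal sketch of that idea, with illustrative names rather than the PR's code: only append the special tokens the tokenizer actually defines.

import torch

def build_prompt_ids(
    prompt_ids,
    tokens_list,
    start_of_audio_index,
    end_of_audio_index,
    bos_index=None,
    eos_index=None,
):
    """Assemble prompt token ids, skipping bos/eos when the LM does not define them."""
    ids = [start_of_audio_index, end_of_audio_index] + list(prompt_ids)
    if bos_index is not None:
        ids.append(bos_index)
    ids += list(tokens_list)
    if eos_index is not None:
        ids.append(eos_index)
    return torch.LongTensor(ids)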

@pplantinga (Collaborator) left a comment:

Overall, looks like a good addition -- it will be nice to have a starting framework for doing SpeechLLMs in SpeechBrain!

From my first read, I guess I'm wondering if we want to try to support more than one task, even a simple second task such as keyword spotting, just to show how it can be done, as it seems like the main benefit of SpeechLLMs over traditional ASR is the fact that they can support multiple tasks.

Collaborator:

Do we need a second yaml file for this? I'm legitimately curious -- it's fine if the answer is yes!

Comment on lines +69 to +70
additional_special_tokens: List[str] = None,
pad_to_multiple_of: int = 8,
Collaborator:

These new parameters are not documented.
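For example, numpy-style docstring entries along these lines could be added; the descriptions below are guesses at intent and should be adjusted to the real behavior:

# Hypothetical docstring entries for the two new arguments.
"""
Arguments
---------
additional_special_tokens : List[str], optional
    Extra special tokens (e.g. audio boundary markers) to register with the
    tokenizer; the embedding table is resized accordingly.
pad_to_multiple_of : int, optional
    Pad the resized embedding matrix to a multiple of this value (default: 8),
    which can improve throughput on tensor cores.
"""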

Collaborator:

What's the benefit of adding this over torch.nn.GELU? It doesn't seem to behave any differently.
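For reference, torch.nn.GELU already exposes both the exact and tanh-approximated variants, so a quick check like this (assuming there is no other behavioral difference) would show whether the custom module is redundant:

import torch

x = torch.randn(4)
print(torch.nn.GELU()(x))                    # exact GELU
print(torch.nn.GELU(approximate="tanh")(x))  # tanh approximation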

bos_index: !PLACEHOLDER # 0
eos_index: !PLACEHOLDER # 0
pad_token: !PLACEHOLDER # 128256
prompt: "Transcribe speech to text."
Collaborator:

This recipe doesn't support tasks other than transcription? Couldn't we at least do keyword spotting? "Is the word {word} present in the audio?" For SpeechLLMs my understanding is that we ultimately want multi-task machines, so it would be nice if we at least had a basic concept of how multi-task would be handled.
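A rough sketch of what per-task prompting could look like; the task names and prompt strings below are illustrative, not part of the recipe:

# Hypothetical per-task prompt table and selection helper.
TASK_PROMPTS = {
    "asr": "Transcribe speech to text.",
    "keyword_spotting": "Is the word {keyword} present in the audio? Answer yes or no.",
}

def make_prompt(task, **fields):
    """Return the text prompt for a task, filling in any task-specific fields."""
    return TASK_PROMPTS[task].format(**fields)

# e.g. make_prompt("keyword_spotting", keyword="seven")
# -> "Is the word seven present in the audio? Answer yes or no."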


| Release | Model | hyperparams file | Dev Clean WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
|:-------------:|:-------------:|:-------------:|:-------------:|:-----:|:-----:|:--------:|:--------:|:-----:|
| 29-11-25 | WavLM Large + SmolLM2 1.7B + LoRA | speechllm_ssl_feats.yaml | N/A | 3.17 | 6.83 | [HuggingFace](https://huggingface.co/speechbrain/asr-speechllm-librispeech) | - | 1xH100 40GB |
Collaborator:

These are not really good numbers given the architecture imho, but ok.

@Adel-Moumen (Collaborator, Author):

I had much better results (2.X%), but I need to train the models a bit longer.

#!/usr/bin/env python3
"""Script to extract SSL features from the audio waveforms.

The script uses the `speechbrain.integrations.hdf5.cached_item` module to cache the features.
Collaborator:

Tutorial
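As a generic illustration of the caching approach described in the script's docstring (this is plain h5py, not the cached_item API, and the SSL forward call is hypothetical):

import h5py
import torch

def get_cached_features(h5_path, utt_id, wav, ssl_model):
    """Return cached SSL features for utt_id, computing and storing them if missing."""
    with h5py.File(h5_path, "a") as f:
        if utt_id in f:
            return torch.from_numpy(f[utt_id][()])
        with torch.no_grad():
            # Hypothetical SSL forward pass, e.g. WavLM hidden states [1, T, D].
            feats = ssl_model(wav).squeeze(0).cpu()
        f.create_dataset(utt_id, data=feats.numpy())
        return feats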
