SpeechLLM LibriSpeech recipe #2885
base: develop
Conversation
…n/speechbrain into speechllm_librispeech
# Capture config-only overrides to avoid passing them to from_pretrained
self._config_overrides = {}
if "output_hidden_states" in kwargs:
    self._config_overrides["output_hidden_states"] = kwargs.pop(
        "output_hidden_states"
    )
TBH, I don't remember in what scenario you would want to set `output_hidden_states=True`.
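For context, a minimal sketch of how config-only kwargs such as `output_hidden_states` could be routed to the config instead of `from_pretrained` (the `load_with_config_overrides` name and the use of `AutoConfig`/`AutoModel` are illustrative, not the PR's actual wrapper):

```python
from transformers import AutoConfig, AutoModel

def load_with_config_overrides(source, **kwargs):
    # Pull out kwargs that belong on the config, not on from_pretrained().
    config_overrides = {}
    if "output_hidden_states" in kwargs:
        config_overrides["output_hidden_states"] = kwargs.pop("output_hidden_states")
    config = AutoConfig.from_pretrained(source)
    for key, value in config_overrides.items():
        setattr(config, key, value)
    # output_hidden_states=True makes forward() also return the intermediate
    # layer outputs, e.g. for a weighted sum of SSL layers.
    return AutoModel.from_pretrained(source, config=config, **kwargs)
```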
tokens_bos = torch.LongTensor(
    [start_of_audio_index]
    + [end_of_audio_index]
    + prompt_ids
    + [bos_index]
    + tokens_list
One thing: it's really hard to know whether an LM requires bos/eos (e.g. https://huggingface.co/blog/qgallouedec/gotchas-in-tokenizer-behavior), so I think we ideally need a proper prompt function that builds the prompt depending on the input tokens (e.g. if eos is None, then don't append it, etc.).
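A rough sketch of what such a prompt helper could look like (function and argument names are illustrative, not from the PR):

```python
import torch

def build_prompt_ids(
    prompt_ids,
    tokens_list,
    start_of_audio_index,
    end_of_audio_index,
    bos_index=None,
    eos_index=None,
):
    """Assemble the token sequence, appending bos/eos only if the LM defines them."""
    ids = [start_of_audio_index, end_of_audio_index] + list(prompt_ids)
    if bos_index is not None:
        ids.append(bos_index)
    ids += list(tokens_list)
    if eos_index is not None:
        ids.append(eos_index)
    return torch.LongTensor(ids)
```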
pplantinga left a comment
Overall, looks like a good addition -- it will be nice to have a starting framework for doing SpeechLLMs in SpeechBrain!
From my first read, I'm wondering whether we want to support more than one task -- even a simple second task such as keyword spotting, just to show how it can be done -- since the main benefit of SpeechLLMs over traditional ASR seems to be that they can support multiple tasks.
Do we need a second yaml file for this? I'm legitimately curious -- it's fine if the answer is yes!
additional_special_tokens: List[str] = None,
pad_to_multiple_of: int = 8,
These new parameters are not documented.
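Something along these lines would do (the semantics below are my guess from the parameter names and should be checked against the code):

```python
def __init__(
    self,
    source,
    additional_special_tokens=None,
    pad_to_multiple_of=8,
):
    """...

    Arguments
    ---------
    additional_special_tokens : List[str], optional
        Extra special tokens (e.g. audio boundary markers) to add to the
        tokenizer vocabulary before resizing the embeddings.
    pad_to_multiple_of : int, optional
        Round the resized embedding matrix up to a multiple of this value
        (default 8) to keep tensor-core-friendly shapes.
    """
```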
What's the benefit of adding this over torch.nn.GELU? It doesn't seem to behave any differently.
bos_index: !PLACEHOLDER # 0
eos_index: !PLACEHOLDER # 0
pad_token: !PLACEHOLDER # 128256
prompt: "Transcribe speech to text."
This recipe doesn't support tasks other than transcription? Couldn't we at least do keyword spotting? "Is the word {word} present in the audio?" For SpeechLLMs my understanding is that we ultimately want multi-task machines, so it would be nice if we at least had a basic concept of how multi-task would be handled.
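As a strawman, multi-task could be as simple as sampling a prompt/target pair per utterance (the task names and prompts below are purely illustrative):

```python
import random

TASK_PROMPTS = {
    "asr": "Transcribe speech to text.",
    "kws": "Is the word {word} present in the audio? Answer yes or no.",
}

def sample_task(transcript, keyword_vocab):
    """Pick a task for this utterance and return (prompt, target)."""
    task = random.choice(list(TASK_PROMPTS))
    if task == "asr":
        return TASK_PROMPTS["asr"], transcript
    keyword = random.choice(keyword_vocab)
    answer = "yes" if keyword in transcript.lower().split() else "no"
    return TASK_PROMPTS["kws"].format(word=keyword), answer
```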
| Release | Model | hyperparams file | Dev Clean WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
|:-------:|:-----:|:----------------:|:-------------:|:--------------:|:--------------:|:----------------:|:----------:|:----:|
| 29-11-25 | WavLM Large + SmolLM2 1.7B + LoRA | speechllm_ssl_feats.yaml | N/A | 3.17 | 6.83 | [HuggingFace](https://huggingface.co/speechbrain/asr-speechllm-librispeech) | - | 1xH100 40GB |
These are not really good numbers given the architecture imho, but ok.
I had much better results (2.X%), but I need to train the models a bit longer.
#!/usr/bin/env python3
"""Script to extract SSL features from the audio waveforms.

The script uses the `speechbrain.integrations.hdf5.cached_item` module to cache the features.
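I haven't looked at `cached_item` itself, but the caching idea is roughly the following (plain h5py sketch of the extract-once / read-many pattern; the recipe's actual helper may differ):

```python
import h5py
import torch

def cached_ssl_feats(h5_path, utt_id, wav, ssl_model):
    """Return SSL features for utt_id, computing and caching them on first use."""
    with h5py.File(h5_path, "a") as f:
        if utt_id in f:
            # Cache hit: read the precomputed features from disk.
            return torch.from_numpy(f[utt_id][()])
        with torch.no_grad():
            feats = ssl_model(wav.unsqueeze(0)).squeeze(0)
        f.create_dataset(utt_id, data=feats.cpu().numpy())
        return feats
```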
Tutorial
What does this PR do?
This PR adds SpeechLLM support for ASR on LibriSpeech. Feature extraction, training, greedy search, and inference scripts are provided.
Before submitting
PR review
Reviewer checklist