# PoC Offline Extraction with SB #2942
## What does this PR do?
This PR aims to provide an alternative proof of concept for saving/loading features in SpeechBrain.
## Background
SpeechBrain has primarily focused on extracting features on the fly (FBanks, SSL representations, etc.) as part of its philosophy of doing everything in a single `train.py` file driven by a YAML configuration. This enables rapid prototyping with tight feedback loops.

However, in recent months and years we've seen a trend toward ever-larger datasets (e.g., GigaSpeech's 10k hours, LibriHeavy's 50k hours) becoming the de facto benchmarks for training models (farewell to our old LibriSpeech on a V100). The cost of on-the-fly feature extraction grows with multiple epochs over such large corpora.

Moreover, a new form of representation has emerged: speech tokens. These discrete representations, often extracted from SSL encoders or VQ-VAE models, are fixed and never change, and are used by medium- to large-scale autoregressive models. Because these encoders are heavy, extracting tokens on the fly is prohibitively expensive. Instead, tokens are typically extracted offline (much like SentencePiece tokens) and then loaded at training time, so that only the decoder (and the tokens) reside in VRAM.
This growing use of frozen features and discrete representations renders SpeechBrain’s current workflow impractical at scale. The community’s embrace of SpeechLMs and SpeechLLMs marks a paradigm shift in which on-the-fly feature extraction is no longer feasible. This PR addresses that challenge by providing a proof of concept for saving and loading pre-extracted features in SpeechBrain.
## Description of the Prototype
I extended the `Brain` class with two new methods: `compute_features` and `cache_features`.

### `compute_features(batch, stage)`
Similar to `fit_batch`, this method takes a `batch` and a `stage`, extracts the required features, and returns a list of dictionaries. Each dictionary must include the utterance `id` plus any feature key/value pairs you want to save. For example, to save `id`, `ssl_feats`, and `tokens`, return one dictionary per utterance holding exactly those keys.
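A minimal sketch of what such an implementation could look like (the `ssl_model` and `quantizer` module names are hypothetical stand-ins for whatever extracts your features):

```python
def compute_features(self, batch, stage):
    """Extracts features for one batch and returns them as per-utterance dicts."""
    batch = batch.to(self.device)
    wavs, wav_lens = batch.sig

    # Hypothetical frozen modules; substitute your own feature extractors.
    ssl_feats = self.modules.ssl_model(wavs, wav_lens)
    tokens = self.modules.quantizer(ssl_feats)

    # One dict per utterance: "id" is mandatory, the rest are the features
    # that will be handed to the configured writers.
    return [
        {"id": utt_id, "ssl_feats": f.cpu().numpy(), "tokens": t.cpu().numpy()}
        for utt_id, f, t in zip(batch.id, ssl_feats, tokens)
    ]
```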
### `cache_features(...)`

Analogous to `fit()` or `evaluate()`, this method iterates over a dataset (or dataloader), calls `compute_features` on each batch, and writes the returned feature dictionaries to disk.

### I/O Backends & Configuration
Inspired by [lhotse’s I/O module](https://github.com/lhotse-speech/lhotse/blob/fda1a986e5e1e72a14c82049b4ee709fc09a81e6/lhotse/features/io.py#L494), I added a `feature_io.py` file defining reader and writer classes, plus a simple factory. Key points:

- Readers offer `np.memmap`-style access to avoid loading everything into RAM.
- All configuration lives in YAML via a `FeatureStorageConfig` section that specifies, for each feature:
  - `name`: the key under which to store it (e.g., `ssl_feats`)
  - `dtype`: e.g., `float32`
  - `writer_class`: e.g., `NumpyHdf5Writer`
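To make the writer side concrete, here is a toy sketch of the kind of interface this implies (the actual classes in `feature_io.py` may differ; the `write`/`close` method names are assumptions):

```python
import h5py
import numpy as np

class NumpyHdf5Writer:
    """Toy writer: one HDF5 dataset per utterance id, cast to a fixed dtype."""

    def __init__(self, path: str, dtype: str = "float32"):
        self.file = h5py.File(path, "w")
        self.dtype = np.dtype(dtype)

    def write(self, utt_id: str, array: np.ndarray) -> None:
        self.file.create_dataset(utt_id, data=array.astype(self.dtype))

    def close(self) -> None:
        self.file.close()
```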
### YAML Example
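For instance, a `FeatureStorageConfig` section could look like this (illustrative; the exact schema in the PR may differ):

```yaml
FeatureStorageConfig:
  - name: ssl_feats
    dtype: float32
    writer_class: NumpyHdf5Writer
  - name: tokens
    dtype: int32
    writer_class: NumpyHdf5Writer
```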
### Usage Example
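Caching then amounts to something like the following (a sketch: `ASR`, the hparams keys, and the exact `cache_features` signature are assumptions):

```python
import speechbrain as sb

asr_brain = ASR(
    modules=hparams["modules"],
    hparams=hparams,
    run_opts=run_opts,
)

# Iterate once over the training split and dump features to disk.
asr_brain.cache_features(
    train_data,
    stage=sb.Stage.TRAIN,
    feature_writers=hparams["feature_writers"],
)
```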
### Reading Cached Features in `train.py`
Define your readers in YAML:
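For example (the `feature_io` module path and reader class name are assumptions):

```yaml
# One reader per cached feature; paths are illustrative.
ssl_feats_reader: !new:feature_io.NumpyHdf5Reader
  path: !ref <cache_folder>/train_ssl_feats.h5
tokens_reader: !new:feature_io.NumpyHdf5Reader
  path: !ref <cache_folder>/train_tokens.h5
```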
And use them in your data pipeline:
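A sketch using SpeechBrain's dynamic item pipeline (the readers' `read` method is an assumption):

```python
import speechbrain as sb

@sb.utils.data_pipeline.takes("id")
@sb.utils.data_pipeline.provides("ssl_feats", "tokens")
def cached_features_pipeline(utt_id):
    # Look up pre-extracted features by utterance id instead of recomputing.
    yield hparams["ssl_feats_reader"].read(utt_id)
    yield hparams["tokens_reader"].read(utt_id)

sb.dataio.dataset.add_dynamic_item([train_data], cached_features_pipeline)
sb.dataio.dataset.set_output_keys([train_data], ["id", "ssl_feats", "tokens"])
```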
That’s all! Implement `compute_features`, configure your writers and readers in YAML, and call `cache_features`.

## Room for Improvements
When handling multiple dataset splits, the `dataio_prepare` stage can become verbose.
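For example, the per-split wiring could repeat along these lines (a sketch with hypothetical hparams keys):

```python
# Each split needs its own readers and its own pipeline registration.
for split, dataset in [("train", train_data), ("valid", valid_data), ("test", test_data)]:

    @sb.utils.data_pipeline.takes("id")
    @sb.utils.data_pipeline.provides("ssl_feats", "tokens")
    def cached_features_pipeline(utt_id, split=split):
        yield hparams[f"{split}_ssl_feats_reader"].read(utt_id)
        yield hparams[f"{split}_tokens_reader"].read(utt_id)

    sb.dataio.dataset.add_dynamic_item([dataset], cached_features_pipeline)
    sb.dataio.dataset.set_output_keys([dataset], ["id", "ssl_feats", "tokens"])
```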
One way to simplify this would be to move the writer (and reader) instantiations into the `Brain` class itself, rather than defining them in YAML. That way, you wouldn’t need to clutter your config with per-split blocks.
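Illustratively (hypothetical key names and paths):

```yaml
# One writer per feature per split quickly adds up.
train_ssl_feats_writer: !new:feature_io.NumpyHdf5Writer
  path: !ref <cache_folder>/train_ssl_feats.h5
valid_ssl_feats_writer: !new:feature_io.NumpyHdf5Writer
  path: !ref <cache_folder>/valid_ssl_feats.h5
test_ssl_feats_writer: !new:feature_io.NumpyHdf5Writer
  path: !ref <cache_folder>/test_ssl_feats.h5
```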
Instead, the `Brain` subclass could automatically create and expose `feature_writers` and `feature_readers` for each split, based on a single `feature_configs` entry.

NOTE: Please don't ask me about fixing tests etc. The intended goal of this PR so far is to provide a PoC; I will make things cleaner once we converge on a general design.