This repository presents an improved speech tokenization framework for Spoken Term Detection (STD), extending our earlier works BEST-STD and LAST. The method pairs a bidirectional Mamba (BiMamba) encoder with an Optimal-Transport-based Vector Quantizer, trained jointly with contrastive, commitment, and noise-robust losses. The resulting tokens are consistent across clean and noisy utterances of the same word, enabling fast query-by-example retrieval over large reference databases. The repository includes the implementation, demo scripts, and pre-trained models.
BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/11462743
git clone https://github.com/anupsingh15/BEST-STD2.0.git
cd BEST-STD2.0
conda create -n STD anaconda
conda activate STD
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install mamba-ssm
pip install "causal-conv1d>=1.4.0"
python -m pip install tslearn
pip install -U tensorboard
pip install POT
pip install librosa
pip install npy-append-array
pip install faiss-cpu
pip install Levenshtein
conda install conda-forge::sox # to read .flac files from LibriSpeech
To train the model, run:
python src/main.py
See the demo/ folder for:
- Running inference for retrieval — build the retrieval database, build the FAISS index, and run query-by-example Spoken Term Detection (clean and noise-corrupted queries).
- Extracting token sequences for same-word utterance pairs — tokenize pairs of same-word utterances and compute Jaccard similarity.
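The Jaccard comparison in the second demo can be illustrated with a minimal sketch. It assumes the similarity is computed over the sets of unique token IDs in each utterance; the demo scripts may define it differently (for example over token n-grams). The function name `jaccard_similarity` and the token values are hypothetical and are not taken from the repository.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the sets of unique tokens in two sequences.

    Assumption: set-based Jaccard, |A ∩ B| / |A ∪ B|. This is an illustrative
    definition, not necessarily the one used in demo/.
    """
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty sequences are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical token sequences for a clean and a noisy utterance of one word.
clean_tokens = [12, 12, 47, 47, 47, 3, 88]
noisy_tokens = [12, 47, 47, 3, 3, 91]

score = jaccard_similarity(clean_tokens, noisy_tokens)
print(f"Jaccard similarity: {score:.2f}")
```

A score close to 1 means the two utterances share almost the same token inventory, which is the consistency property the tokenizer is trained for.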
The assets/token_subword_mapping/ folder contains short audio clips that let you hear what each learned token represents. For every 25 ms audio segment we extract its continuous encoder embedding and quantize it to a token; for a given token we then concatenate multiple 25 ms segments that were all mapped to that token, giving a rough acoustic signature of the token in isolation.
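The grouping step described above can be sketched as follows. This is an illustration under stated assumptions, not the script used to produce the assets: it assumes 16 kHz audio (the LibriSpeech sampling rate), non-overlapping 25 ms segments, and one token per segment already available from the tokenizer. The names `split_into_segments`, `build_token_clips`, and `max_segments` are hypothetical.

```python
from collections import defaultdict

SAMPLE_RATE = 16000                              # assumed: LibriSpeech audio is 16 kHz
SEGMENT_MS = 25                                  # segment duration from the description above
SEGMENT_LEN = SAMPLE_RATE * SEGMENT_MS // 1000   # 400 samples per segment


def split_into_segments(waveform):
    """Cut a waveform (sequence of samples) into non-overlapping 25 ms segments.

    Assumption: no overlap between segments; trailing samples that do not
    fill a whole segment are dropped.
    """
    n_segments = len(waveform) // SEGMENT_LEN
    return [waveform[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN] for i in range(n_segments)]


def build_token_clips(segments, tokens, max_segments=20):
    """Concatenate segments that share a token into one clip per token.

    `tokens[i]` is the token assigned to `segments[i]` (hypothetical input;
    in practice it would come from the trained tokenizer).
    """
    grouped = defaultdict(list)
    for segment, token in zip(segments, tokens):
        if len(grouped[token]) < max_segments:
            grouped[token].append(segment)
    # Flatten each token's list of segments into a single sample sequence.
    return {
        token: [sample for segment in segs for sample in segment]
        for token, segs in grouped.items()
    }


# Toy example: 75 ms of dummy samples and made-up token assignments.
waveform = list(range(3 * SEGMENT_LEN))
segments = split_into_segments(waveform)
clips = build_token_clips(segments, tokens=[5, 9, 5])
print({token: len(clip) for token, clip in clips.items()})
```

Each resulting clip can then be written to an audio file at the same sampling rate for listening.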
- Dataset: LibriSpeech Word Alignments
- Pre-trained Models: coming soon
If you find our work useful, please cite:
@INPROCEEDINGS{11462743,
author={Singh, Anup and Arora, Vipul and Demuynck, Kris},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection},
year={2026},
volume={},
number={},
pages={17852-17856},
doi={10.1109/ICASSP55912.2026.11462743}}
👉 You may also check out our earlier works on speech tokenization for STD:
BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/10889633

Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://www.isca-archive.org/interspeech_2025/singh25d_interspeech.html
