BEST-STD 2.0

This repository presents an improved speech tokenization framework for Spoken Term Detection (STD), extending our earlier work BEST-STD and LAST. The method combines a bidirectional Mamba (BiMamba) encoder with an Optimal-Transport-based Vector Quantizer, trained with a combined contrastive, commitment, and noise-robust loss. The resulting tokens are consistent across clean and noisy utterances of the same word, enabling fast query-by-example retrieval over large reference databases. The repository includes the implementation, demo scripts, and pre-trained models.

BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/11462743

Setup

Clone the Repository

git clone https://github.com/anupsingh15/BEST-STD2.0.git
cd BEST-STD2.0

Create a Virtual Environment

conda create -n STD anaconda
conda activate STD

Install Dependencies

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install mamba-ssm
pip install "causal-conv1d>=1.4.0"   # quoted so the shell does not treat >= as a redirect
python -m pip install tslearn
pip install -U tensorboard
pip install POT
pip install librosa
pip install npy-append-array
pip install faiss-cpu
pip install Levenshtein
conda install conda-forge::sox   # to read .flac files from LibriSpeech
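After installation, a quick import check can catch a missing or mis-built package before training starts. The sketch below is not part of the repository. `missing_modules` is a hypothetical helper, and the module names are the usual import names assumed for the packages above (for example, POT imports as `ot` and faiss-cpu as `faiss`).

```python
from importlib.util import find_spec

# Import names assumed for the pip packages listed above.
REQUIRED_MODULES = [
    "torch", "torchaudio", "mamba_ssm", "causal_conv1d", "tslearn",
    "tensorboard", "ot", "librosa", "npy_append_array", "faiss",
    "Levenshtein",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check only looks the modules up without importing them, so it does not verify that CUDA extensions such as `mamba_ssm` load correctly on your GPU.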

Usage

To train the model, run:

python src/main.py

See the demo/ folder for:

Running inference for retrieval: build the retrieval database, build the FAISS index, and run query-by-example Spoken Term Detection with clean and noise-corrupted queries.
Extracting token sequences for same-word utterance pairs: tokenize pairs of same-word utterances and compute their Jaccard similarity.
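For readers unfamiliar with the metric, here is a minimal sketch of Jaccard similarity between two token sequences. It assumes the similarity is computed over the sets of unique token IDs in each utterance; the demo scripts may define it differently (for example over token n-grams). The token IDs below are made up for illustration.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the sets of tokens in two sequences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sequences match
    return len(set_a & set_b) / len(union)


# Hypothetical token IDs for a clean and a noisy utterance of the same word.
clean = [12, 12, 87, 87, 87, 301, 45]
noisy = [12, 87, 87, 301, 301, 9]
print(jaccard_similarity(clean, noisy))  # 3 shared tokens out of 5 -> 0.6
```

A value close to 1 means the two utterances were mapped to nearly the same set of tokens, which is the consistency property the tokenizer is trained for.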

Token Audio Samples

The assets/token_subword_mapping/ folder contains short audio clips that let you hear what each learned token represents. For every 25 ms audio segment we extract its continuous encoder embedding and quantize it to a token; for a given token we then concatenate multiple 25 ms segments that were all mapped to that token, giving a rough acoustic signature of the token in isolation.
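The grouping step described above can be sketched as follows. This is an illustration, not the script used to produce the clips: `build_token_clips` is a hypothetical helper, and it assumes consecutive non-overlapping 25 ms segments, 16 kHz audio, and a cap of `max_segments` segments per token, none of which are stated in this README. Plain Python lists are used to keep the example dependency-free.

```python
from collections import defaultdict


def build_token_clips(samples, tokens, sample_rate=16000,
                      segment_ms=25, max_segments=20):
    """Concatenate the audio segments assigned to each token.

    samples: 1-D sequence of audio samples.
    tokens:  one token ID per consecutive, non-overlapping segment
             (an assumption made for this sketch).
    """
    seg_len = sample_rate * segment_ms // 1000  # 400 samples at 16 kHz
    counts = defaultdict(int)
    clips = defaultdict(list)
    for i, token in enumerate(tokens):
        if counts[token] >= max_segments:
            continue  # enough material collected for this token
        segment = samples[i * seg_len:(i + 1) * seg_len]
        if len(segment) < seg_len:
            break  # ignore a trailing partial segment
        clips[token].extend(segment)
        counts[token] += 1
    return dict(clips)


# Three segments of silence; tokens 5, 7, 5 are hypothetical IDs.
clips = build_token_clips([0.0] * 1200, [5, 7, 5])
print({token: len(clip) for token, clip in clips.items()})
```

Each resulting clip can then be written to a file with an audio library of your choice and played back to hear the token in isolation.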

Datasets & Pre-trained Models

Dataset: LibriSpeech Word Alignments
Pre-trained Models: coming soon

Citation

If you find our work useful, please cite:

@INPROCEEDINGS{11462743,
  author={Singh, Anup and Arora, Vipul and Demuynck, Kris},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection}, 
  year={2026},
  volume={},
  number={},
  pages={17852-17856},
  doi={10.1109/ICASSP55912.2026.11462743}}

👉 You may also check out our earlier works on speech tokenization for STD:

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/10889633

Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://www.isca-archive.org/interspeech_2025/singh25d_interspeech.html
