anupsingh15/BEST-STD2.0

BEST-STD 2.0

This repository presents an improved speech tokenization framework for Spoken Term Detection (STD), extending our earlier works, BEST-STD and LAST. The method combines a bidirectional Mamba (BiMamba) encoder with an Optimal-Transport-based Vector Quantizer, trained with a combined contrastive, commitment, and noise-robust loss. The resulting tokens are consistent across clean and noisy utterances of the same word, which enables fast query-by-example retrieval over large reference databases. The repository includes the implementation, demo scripts, and pre-trained models.

BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/11462743

Setup

Clone the Repository

git clone https://github.com/anupsingh15/BEST-STD2.0.git
cd BEST-STD2.0

Create a Virtual Environment

conda create -n STD anaconda
conda activate STD

Install Dependencies

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install mamba-ssm
pip install "causal-conv1d>=1.4.0"   # quoted so the shell does not treat >= as an output redirect
python -m pip install tslearn
pip install -U tensorboard
pip install POT
pip install librosa
pip install npy-append-array
pip install faiss-cpu
pip install Levenshtein
conda install conda-forge::sox   # to read .flac files from LibriSpeech
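
After installation, a quick check that every package is visible to the active environment can save a failed training run. The snippet below is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the mapping only lists the usual import names of the packages installed above.

```python
import importlib.util

# pip package name -> module name used in an import statement.
REQUIRED_MODULES = {
    "torch": "torch",
    "torchaudio": "torchaudio",
    "mamba-ssm": "mamba_ssm",
    "causal-conv1d": "causal_conv1d",
    "tslearn": "tslearn",
    "tensorboard": "tensorboard",
    "POT": "ot",
    "librosa": "librosa",
    "npy-append-array": "npy_append_array",
    "faiss-cpu": "faiss",
    "Levenshtein": "Levenshtein",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip names whose module cannot be located (nothing is imported)."""
    return [
        pip_name
        for pip_name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

This only confirms that the modules can be located; it does not verify CUDA availability or that the compiled mamba-ssm kernels load correctly.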

Usage

To train the model, run:

python src/main.py 

See the demo/ folder for:

  • Running inference for retrieval — build the retrieval database, build the FAISS index, and run query-by-example Spoken Term Detection (clean and noise-corrupted queries).
  • Extracting token sequences for same-word utterance pairs — tokenize pairs of same-word utterances and compute Jaccard similarity.
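
As an illustration of the second item, the sketch below computes the Jaccard similarity between two token sequences. It is not the demo code: the function name `jaccard_similarity` and the example token IDs are hypothetical, and it assumes the similarity is taken over the sets of unique tokens in each sequence (the demo may instead use n-grams or another unit).

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of two sequences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention: two empty sequences are treated as identical
    return len(set_a & set_b) / len(union)


# Hypothetical token sequences for a clean and a noisy utterance of the same word.
clean = [12, 12, 87, 87, 87, 301, 45]
noisy = [12, 87, 87, 301, 301, 9]

# Shared tokens: {12, 87, 301}; union: {9, 12, 45, 87, 301}.
print(jaccard_similarity(clean, noisy))  # 0.6
```

A value close to 1 means the two utterances are mapped to nearly the same set of tokens, which is the consistency property the tokenizer is trained for.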

Token Audio Samples

The assets/token_subword_mapping/ folder contains short audio clips that let you hear what each learned token represents. For every 25 ms audio segment we extract its continuous encoder embedding and quantize it to a token; for a given token we then concatenate multiple 25 ms segments that were all mapped to that token, giving a rough acoustic signature of the token in isolation.
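
The procedure described above can be sketched roughly as follows. This is an illustrative assumption, not the script used to produce the clips: the function name `token_signature` is hypothetical, segments are assumed to be non-overlapping with one token per 25 ms segment, the sample rate is assumed to be 16 kHz, and only a single waveform is processed (the actual clips may pool segments from many utterances).

```python
import numpy as np


def token_signature(waveform, tokens, token_id,
                    sample_rate=16000, segment_ms=25, max_segments=20):
    """Concatenate up to max_segments segments that were mapped to token_id."""
    seg_len = int(sample_rate * segment_ms / 1000)  # samples per 25 ms segment
    picked = []
    for i, tok in enumerate(tokens):
        if tok != token_id:
            continue
        segment = waveform[i * seg_len:(i + 1) * seg_len]
        if len(segment) == seg_len:  # skip a truncated final segment
            picked.append(segment)
        if len(picked) == max_segments:
            break
    if not picked:
        return np.zeros(0, dtype=waveform.dtype)
    return np.concatenate(picked)


# Dummy 100 ms signal (4 segments of 400 samples) with hypothetical token IDs.
sr = 16000
wave = np.arange(sr // 10, dtype=np.float32)
toks = [7, 3, 7, 7]

clip = token_signature(wave, toks, token_id=7, sample_rate=sr)
print(clip.shape)  # (1200,) -> three 25 ms segments of token 7
```

Writing `clip` to a .wav file with any audio library then gives a short clip comparable in spirit to those in assets/token_subword_mapping/.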

Datasets & Pre-trained Models

Citation

If you find our work useful, please cite:

@INPROCEEDINGS{11462743,
  author={Singh, Anup and Arora, Vipul and Demuynck, Kris},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection}, 
  year={2026},
  volume={},
  number={},
  pages={17852-17856},
  doi={10.1109/ICASSP55912.2026.11462743}}

👉 You may also check out our earlier works on speech tokenization for STD:

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/10889633

Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://www.isca-archive.org/interspeech_2025/singh25d_interspeech.html
