In this work I try to build a Voice Activity Detection model that classifies each 10ms frame of extremely noisy audio as speech or non-speech.
- 4.3 LeNet-32
- 4.4 CNN-BiLSTM
- 4.5 Ensembles
| Model | Architecture | Pros | Cons |
|---|---|---|---|
| Google WebRTC VAD | GMM | Fast, lightweight | Performs well only on clean speech without noise; does not return probabilities |
| Picovoice Cobra | Closed | Fast, lightweight | Minimum frame size is 30ms |
| Silero VAD | Closed | Pretrained, good quality declared | Minimum frame size is 30ms |
| Recurrent neural networks for voice activity detection | RNN | Outperforms GMM based models | Classifies frame using next frames |
| A Hybrid CNN-BiLSTM Voice Activity Detector | CNN-RNN | Good quality declared | Minimum frame size is 320ms |
| Voice Activity Detection in Noisy Environments | RNN/CNN/CNN-RNN | Good quality declared | Does not operate on 10ms frame size |
| Personal VAD: Speaker-Conditioned Voice Activity Detection | RNN | Good quality declared | Different task, larger frame size |
An ideal audio dataset for this task should:
- have each 10ms frame annotated as speech/non-speech
- consist of speech in different noisy environments
- be balanced
However, it is extremely difficult to obtain such a dataset.
Hence, this work uses the LibriSpeech dataset (clean subset) with various augmentation techniques
and online labeling by an external voice activity detector.
The UrbanSound8k and ESC-50 noise datasets are mixed with the clean speech from LibriSpeech.
- Load a speech audio file and a random noise audio file
- Randomly apply augmentations to each (reverberation, clicks, etc.)
- Mix them
- Take the desired 10ms fragment
- Apply an external VAD to this fragment of the clean speech audio to get the true label
- In half of the cases, return a noisy or almost-silenced fragment with label=non-speech; otherwise, return the fragment from the mixed audio
- Apply feature extraction (e.g. log mel-filterbank energy features); see the sketch after this list
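As a rough illustration of this pipeline, here is a minimal sketch assuming librosa for audio I/O and feature extraction; `mix_at_snr` is a hypothetical helper, `external_vad` stands in for the external detector, and the augmentation step is omitted for brevity:

```python
import random
import numpy as np
import librosa

SR = 16000              # sample rate assumed in this sketch
FRAME = SR // 100       # 160 samples = one 10 ms frame

def mix_at_snr(speech, noise, snr_db):
    """Scale equal-length noise to the target SNR and add it to the speech."""
    p_s = np.mean(speech ** 2) + 1e-10
    p_n = np.mean(noise ** 2) + 1e-10
    return speech + noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))

def make_example(speech_path, noise_path, external_vad):
    speech, _ = librosa.load(speech_path, sr=SR)
    noise, _ = librosa.load(noise_path, sr=SR)
    noise = np.resize(noise, len(speech))             # tile/crop noise to speech length
    mixed = mix_at_snr(speech, noise, snr_db=random.uniform(-5, 20))

    start = random.randrange(0, len(speech) - FRAME)  # random 10 ms fragment
    label = external_vad(speech[start:start + FRAME]) # label comes from the CLEAN audio

    if random.random() < 0.5:
        frame, label = noise[start:start + FRAME], 0  # guaranteed non-speech fragment
    else:
        frame = mixed[start:start + FRAME]

    # log mel-filterbank energies as features
    mel = librosa.feature.melspectrogram(
        y=frame, sr=SR, n_fft=FRAME, hop_length=FRAME, n_mels=40, center=False
    )
    return np.log(mel + 1e-10), label
```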
For details on feature selection, see Section 4.
The task is treated as a binary classification problem, so standard metrics and loss are used: F1 score, precision, recall, and binary cross-entropy loss.
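For reference, a minimal sketch of this setup in PyTorch with scikit-learn metrics (the tensors below are placeholders, not the repository's variables):

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, precision_score, recall_score

criterion = nn.BCEWithLogitsLoss()  # numerically stable binary cross-entropy

# placeholders for per-frame model outputs (logits) and 0/1 targets
logits = torch.randn(8)
labels = torch.randint(0, 2, (8,)).float()

loss = criterion(logits, labels)
preds = (torch.sigmoid(logits) > 0.5).int().numpy()

print(loss.item(),
      f1_score(labels.numpy(), preds),
      precision_score(labels.numpy(), preds, zero_division=0),
      recall_score(labels.numpy(), preds, zero_division=0))
```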
Inspired by the original LeNet architecture.
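As an illustration of the general idea (layer and feature sizes here are assumptions, not the repository's exact LeNet32), a LeNet-style VAD classifier could look like:

```python
import torch
import torch.nn as nn

class LeNetStyleVAD(nn.Module):
    """LeNet-style CNN over a (batch, 1, n_mels, n_frames) log-mel patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(120), nn.ReLU(),  # infers the flattened size on first call
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, 1),               # a single logit: speech vs non-speech
        )

    def forward(self, x):
        return self.classifier(self.features(x)).squeeze(-1)

# e.g. a batch of 8 patches of 40 mel bands x 16 frames -> 8 logits
out = LeNetStyleVAD()(torch.randn(8, 1, 40, 16))
```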
Inspired by the paper "A Hybrid CNN-BiLSTM Voice Activity Detector".
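Again as an illustration only (layer sizes are assumptions, not the paper's or this repository's exact configuration), a CNN-BiLSTM classifier might look like:

```python
import torch
import torch.nn as nn

class CNNBiLSTMVAD(nn.Module):
    """CNN front-end extracts per-step features; a BiLSTM adds temporal context."""
    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, n_mels, n_frames)
        h = self.cnn(x).transpose(1, 2)          # -> (batch, n_frames, 64)
        h, _ = self.bilstm(h)                    # -> (batch, n_frames, 2 * hidden)
        return self.head(h[:, -1]).squeeze(-1)   # one logit for the last frame

out = CNNBiLSTMVAD()(torch.randn(8, 40, 16))     # -> 8 logits
```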
CNN-BiLSTM turned out to be aggressive and quite sensitive to noise. In contrast, LeNet32 makes conservative, cautious predictions.
Can quality be improved by ensembling these two models? At the very least, it does not get worse.
`ensemble_lenet32_cnn_bilstm`: an ensemble that averages the predictions of LeNet32 and CNN-BiLSTM.
In fact, the code allows ensembling as many models as you want: just specify all the checkpoints and make sure all models use the same features.
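A minimal sketch of such prediction averaging (a hypothetical wrapper, not the repository's actual ensemble class):

```python
import torch
import torch.nn as nn

class AverageEnsemble(nn.Module):
    """Averages the sigmoid probabilities of several models over the same features."""
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)

    def forward(self, features):
        probs = [torch.sigmoid(m(features)) for m in self.models]
        return torch.stack(probs).mean(dim=0)
```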
- Consider different dataset labeling technique (e.g. forced alignment)
- Consider different features (e.g. 64x64 log mel-filterbank energies)
- Try deeper neural network architectures and ensembles of them
- Try different augmentation techniques
- Python 3.8.10 was used in this work
- Check the PyTorch version in `requirements.txt` according to your CUDA version
- `pip3 install -r requirements.txt` (if you have problems with versioning, exact versions are specified in `requirements_freezed.txt`)
- Download the datasets: LibriSpeech, UrbanSound8k, ESC-50
- Review `config.yaml` and make changes to suit your needs
- `mv .env.template .env`
- Make sure `config.yaml` is configured
- `python train.py`
All training logs are available in the WandB project.
Running the `test.py` script performs benchmarking with the parameters specified in `config.yaml`.
- Make sure `config.yaml` is configured (specify models, checkpoints, etc.)
- `python test.py`
Results will be logged to the WandB project as a Benchmark run: a table with statistics for each model and a ROC curve.
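An operating threshold per model can be derived from such a ROC curve; one common recipe (an assumption here, not necessarily how the `Statistics` table is computed) is Youden's J statistic:

```python
import numpy as np
from sklearn.metrics import roc_curve

# placeholders: ground-truth 0/1 per frame and model probabilities per frame
labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])

fpr, tpr, thresholds = roc_curve(labels, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]  # maximizes TPR - FPR (Youden's J)
print(best_threshold)
```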
- Run tests to get threshold values (from the `Statistics` table)
- In `config.yaml`, specify the desired model, checkpoint, path to test data, and thresholds
- `python submission.py` will create a `submissions/` directory with predictions
This work was an attempt to solve the problem of voice activity detection in noisy environments. An audio data processing pipeline was built with extensive augmentation techniques, and two deep neural network architectures, based on CNN and CNN-RNN, were considered.
The results are not outstanding, but considering the frame-size limitations and the contamination of the speech audio, they beat the baseline of Google's WebRTC VAD.
The codebase was written with further research in mind: new models (as well as data processing pipelines) can be easily added, swapped, and combined.
LibriSpeech | UrbanSound8k | ESC-50
Modern Portable Voice Activity Detector Released
Recurrent neural networks for voice activity detection
A Hybrid CNN-BiLSTM Voice Activity Detector: Paper | Code
Voice Activity Detection in Noisy Environments (some code fragments, ideas, and visualizations were adapted from this work)