In this work I try to build a Voice Activity Detection model that classifies each 10ms frame of extremely noisy audio as speech or non-speech.
- 4.3 LeNet-32
- 4.4 CNN-BiLSTM
- 4.5 Ensembles
| Model | Architecture | Pros | Cons |
|---|---|---|---|
| Google WebRTC VAD | GMM | Fast, lightweight | Performs well only on clean speech without noise; does not return probabilities |
| Picovoice Cobra | Closed | Fast, lightweight | Minimum frame size is 30ms |
| Silero VAD | Closed | Pretrained, good quality declared | Minimum frame size is 30ms |
| Recurrent neural networks for voice activity detection | RNN | Outperforms GMM based models | Classifies frame using next frames |
| A Hybrid CNN-BiLSTM Voice Activity Detector | CNN-RNN | Good quality declared | Minimum frame size is 320ms |
| Voice Activity Detection in Noisy Environments | RNN/CNN/CNN-RNN | Good quality declared | Does not operate on 10ms frame size |
| Personal VAD: Speaker-Conditioned Voice Activity Detection | RNN | Good quality declared | Different task, larger frame size |
An ideal audio dataset for this task should:
- have each 10ms frame annotated as speech/non-speech
- consist of speech in different noisy environments
- be balanced
However, it is extremely difficult to obtain such a dataset.
Hence, this work uses the LibriSpeech dataset (clean subset) with various augmentation techniques
and online labeling by an external voice activity detector.
The UrbanSound8k and ESC-50 noise datasets are mixed with the clean speech from LibriSpeech.
- Load a speech audio file and a random noise audio file
- Randomly apply augmentations to each (reverberation, clicks, etc.)
- Mix them
- Take the desired 10ms fragment
- Apply an external VAD to this fragment of the clean speech audio to get the true label
- In half of the cases, return a noisy or almost-silenced fragment with label=non-speech; otherwise, return the fragment from the mixed audio
- Apply feature extraction (e.g. log mel-filterbank energy features); see the sketch after this list
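As a rough illustration of this pipeline, here is a minimal sketch assuming librosa for audio I/O and feature extraction; `mix_at_snr` is a hypothetical helper, `external_vad` stands in for the external detector, and the augmentation step is omitted for brevity:

```python
import random
import numpy as np
import librosa

SR = 16000              # sample rate assumed in this sketch
FRAME = SR // 100       # 160 samples = one 10 ms frame

def mix_at_snr(speech, noise, snr_db):
    """Scale equal-length noise to the target SNR and add it to the speech."""
    p_s = np.mean(speech ** 2) + 1e-10
    p_n = np.mean(noise ** 2) + 1e-10
    return speech + noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))

def make_example(speech_path, noise_path, external_vad):
    speech, _ = librosa.load(speech_path, sr=SR)
    noise, _ = librosa.load(noise_path, sr=SR)
    noise = np.resize(noise, len(speech))             # tile/crop noise to speech length
    mixed = mix_at_snr(speech, noise, snr_db=random.uniform(-5, 20))

    start = random.randrange(0, len(speech) - FRAME)  # random 10 ms fragment
    label = external_vad(speech[start:start + FRAME]) # label comes from the CLEAN audio

    if random.random() < 0.5:
        frame, label = noise[start:start + FRAME], 0  # guaranteed non-speech fragment
    else:
        frame = mixed[start:start + FRAME]

    # log mel-filterbank energies as features
    mel = librosa.feature.melspectrogram(
        y=frame, sr=SR, n_fft=FRAME, hop_length=FRAME, n_mels=40, center=False
    )
    return np.log(mel + 1e-10), label
```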
For details on feature selection, see Section 4.
The task is treated as a binary classification problem, so standard metrics and loss are used: F1 score, precision, recall, and binary cross-entropy loss.
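For reference, a minimal sketch of this setup in PyTorch with scikit-learn metrics (the tensors below are placeholders, not the repository's variables):

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, precision_score, recall_score

criterion = nn.BCEWithLogitsLoss()  # numerically stable binary cross-entropy

# placeholders for per-frame model outputs (logits) and 0/1 targets
logits = torch.randn(8)
labels = torch.randint(0, 2, (8,)).float()

loss = criterion(logits, labels)
preds = (torch.sigmoid(logits) > 0.5).int().numpy()

print(loss.item(),
      f1_score(labels.numpy(), preds),
      precision_score(labels.numpy(), preds, zero_division=0),
      recall_score(labels.numpy(), preds, zero_division=0))
```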
Inspired by the original LeNet architecture.
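As an illustration of the general idea (layer and feature sizes here are assumptions, not the repository's exact LeNet32), a LeNet-style VAD classifier could look like:

```python
import torch
import torch.nn as nn

class LeNetStyleVAD(nn.Module):
    """LeNet-style CNN over a (batch, 1, n_mels, n_frames) log-mel patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(120), nn.ReLU(),  # infers the flattened size on first call
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, 1),               # a single logit: speech vs non-speech
        )

    def forward(self, x):
        return self.classifier(self.features(x)).squeeze(-1)

# e.g. a batch of 8 patches of 40 mel bands x 16 frames -> 8 logits
out = LeNetStyleVAD()(torch.randn(8, 1, 40, 16))
```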
Inspired by the paper "A Hybrid CNN-BiLSTM Voice Activity Detector".
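Again as an illustration only (layer sizes are assumptions, not the paper's or this repository's exact configuration), a CNN-BiLSTM classifier might look like:

```python
import torch
import torch.nn as nn

class CNNBiLSTMVAD(nn.Module):
    """CNN front-end extracts per-step features; a BiLSTM adds temporal context."""
    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, n_mels, n_frames)
        h = self.cnn(x).transpose(1, 2)          # -> (batch, n_frames, 64)
        h, _ = self.bilstm(h)                    # -> (batch, n_frames, 2 * hidden)
        return self.head(h[:, -1]).squeeze(-1)   # one logit for the last frame

out = CNNBiLSTMVAD()(torch.randn(8, 40, 16))     # -> 8 logits
```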
CNN-BiLSTM turned out to be aggressive and quite sensitive to noise. In contrast, LeNet32 makes conservative, cautious predictions.
Can quality be improved by ensembling these two models? At the very least, it does not get worse.
`ensemble_lenet32_cnn_bilstm`: an ensemble that averages the predictions of LeNet32 and CNN-BiLSTM.
In fact, the code allows ensembling as many models as you want: just specify all the checkpoints and make sure all models use the same features.
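A minimal sketch of such prediction averaging (a hypothetical wrapper, not the repository's actual ensemble class):

```python
import torch
import torch.nn as nn

class AverageEnsemble(nn.Module):
    """Averages the sigmoid probabilities of several models over the same features."""
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)

    def forward(self, features):
        probs = [torch.sigmoid(m(features)) for m in self.models]
        return torch.stack(probs).mean(dim=0)
```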
- Consider different dataset labeling technique (e.g. forced alignment)
- Consider different features (e.g. 64x64 log mel-filterbank energies)
- Try deeper neural network architectures and ensembles of them
- Try different augmentation techniques
- Python 3.8.10 was used in this work
- Check the PyTorch version in `requirements.txt` according to your CUDA version
- `pip3 install -r requirements.txt` (if you have problems with versioning, exact versions are specified in `requirements_freezed.txt`)
- Download the datasets: LibriSpeech, UrbanSound8k, ESC-50
- Review `config.yaml` and make changes to suit your needs
- `mv .env.template .env`
- Make sure `config.yaml` is configured
- `python train.py`
All training logs are available in the WandB project.
Running the `test.py` script performs benchmarking with the parameters specified in `config.yaml`.
- Make sure `config.yaml` is configured (specify models, checkpoints, etc.)
- `python test.py`
Results will be logged to the WandB project as a Benchmark run: a table with statistics for each model and a ROC curve.
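An operating threshold per model can be derived from such a ROC curve; one common recipe (an assumption here, not necessarily how the `Statistics` table is computed) is Youden's J statistic:

```python
import numpy as np
from sklearn.metrics import roc_curve

# placeholders: ground-truth 0/1 per frame and model probabilities per frame
labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])

fpr, tpr, thresholds = roc_curve(labels, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]  # maximizes TPR - FPR (Youden's J)
print(best_threshold)
```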
- Run tests to get threshold values (from the `Statistics` table)
- In `config.yaml`, specify the desired model, checkpoint, path to test data, and thresholds
- `python submission.py` will create a `submissions/` directory with predictions
This work was an attempt to solve the problem of voice activity detection in noisy environments. An audio data processing pipeline was built with extensive augmentation techniques, and two deep neural network architectures, based on CNN and CNN-RNN, were considered.
The results are not outstanding, but considering the frame-size limitations and the contamination of the speech audio, they beat the baseline of Google's WebRTC VAD.
The codebase was written with further research in mind: new models (as well as data processing pipelines) can be easily added, swapped, and combined.
LibriSpeech | UrbanSound8k | ESC-50
Modern Portable Voice Activity Detector Released
Recurrent neural networks for voice activity detection
A Hybrid CNN-BiLSTM Voice Activity Detector: Paper | Code
Voice Activity Detection in Noisy Environments (some code fragments, ideas, and visualizations were adapted from this work)