NeemaJ/BirdCLEF

Bird Audio-Call Identification 🐦

Objective

In this project, we created a convolutional neural network (CNN) to identify bird species in the Western Ghats, a biodiversity hotspot. Our approach included:

  • Implementing preprocessing steps to isolate and clean the audio signal
  • Employing multiprocessing, generators, and disk writing to improve computational efficiency
  • Training a CNN to identify the species of origin for each call

Citation

Sprengel, E., Jaggi, M., Kilcher, Y., & Hofmann, T. (2016). Audio-Based Bird Species Identification using Deep Learning Techniques. Eidgenössische Technische Hochschule (ETH) Zürich, Zürich, Switzerland. Retrieved from ETH Zurich.

Our approach was inspired by the techniques described in this paper, "Audio-Based Bird Species Identification using Deep Learning Techniques" by Sprengel et al.

Dataset

Our data comes from the BirdCLEF Kaggle competition. The training data consisted of bird calls of varying lengths and quality.

Our primary issue was finding a method to convert audio into an image that could be processed by our CNN. To achieve this, we transformed the audio data into spectrograms while making sure to isolate the bird calls. We also standardized the spectrograms into the same size and amplitude scale. We accounted for background noise and enhanced our model by layering different types of background noise and applying pitch and time shifting to the audio.

More info can be found here:

Kaggle competition: https://www.kaggle.com/competitions/birdclef-2024/overview

Files for each step

Data Pipeline

Data_Pipeline

Preprocessing

To divide each sound file into signal and noise parts, we first computed a spectrogram of the entire file using a short-time Fourier transform (STFT). For the signal, we identified spectrogram pixels whose amplitude exceeded three times both their row median and their column median, applied binary filters (erosion and dilation) to remove spurious pixels, and built an indicator vector marking the signal intervals. A similar process is used for noise, with a slightly lower threshold of 2.5 times the median. The signal and noise segments are then extracted and placed into separate files, and their respective spectrograms are computed.
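The median-thresholding step above can be sketched roughly as follows. The function name, the 4×4 erosion/dilation kernel, and the array layout (frequency bins × time frames) are illustrative assumptions, not our exact implementation:

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def signal_noise_intervals(spec, signal_factor=3.0, noise_factor=2.5):
    """Mark time frames containing signal (pixels above signal_factor times
    both their row and column medians) and frames containing only noise."""
    row_med = np.median(spec, axis=1, keepdims=True)
    col_med = np.median(spec, axis=0, keepdims=True)
    signal_mask = (spec > signal_factor * row_med) & (spec > signal_factor * col_med)
    # Erosion removes isolated pixels; dilation restores and connects regions.
    kernel = np.ones((4, 4), dtype=bool)  # illustrative kernel size
    signal_mask = binary_dilation(binary_erosion(signal_mask, structure=kernel),
                                  structure=kernel)
    # Indicator vector: a time frame (column) is signal if any bin is set.
    signal_cols = signal_mask.any(axis=0)
    # Noise frames: no pixel exceeds the lower 2.5x-median threshold.
    above_noise = (spec > noise_factor * row_med) & (spec > noise_factor * col_med)
    noise_cols = ~above_noise.any(axis=0)
    return signal_cols, noise_cols
```

The two indicator vectors can then be used to slice the original audio into signal-only and noise-only files.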

  • Signal/Noise Separation

    To achieve this we:

    1. Created signal and noise masks

    2. Applied binary erosion and dilation to the masks

    3. Observed how dilation refines the signal regions and the signal presence indicator

  • Creating Spectrograms

This code converts the audio waveforms into spectrograms, which serve as input to the convolutional neural network (CNN). Its functions generate spectrograms using the short-time Fourier transform (STFT), normalize the spectrogram data, and apply log scaling for better frequency representation. The code also visualizes the spectrograms and isolates specific segments based on predefined criteria. The source code is inspired by TensorFlow's audio processing tutorial.

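A minimal sketch of this conversion, using `scipy.signal.stft`; the sample rate, FFT size, and hop length shown here are illustrative defaults, not our exact parameters:

```python
import numpy as np
from scipy.signal import stft

def audio_to_log_spectrogram(samples, sr=32000, n_fft=512, hop=128, eps=1e-10):
    """STFT magnitude spectrogram, log-scaled and normalized to [0, 1]
    so every clip shares the same amplitude scale."""
    _, _, Z = stft(samples, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(Z)                # frequency bins x time frames
    log_spec = np.log(mag + eps)   # log scaling compresses the dynamic range
    log_spec -= log_spec.min()     # shift so the minimum sits at 0
    peak = log_spec.max()
    return log_spec / peak if peak > 0 else log_spec
```

Feeding a pure 2 kHz tone through this function should light up the corresponding frequency bin, which is an easy sanity check for the pipeline.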

Data Augmentation

We performed data augmentation on the spectrograms using four key methods:

  1. Splitting the spectrogram into chunks of a fixed size, padding the end so that all chunks are equal in size.

  2. Shifting the spectrogram along the time axis by a specified amount, simulating a time delay.

  3. Shifting the spectrogram along the frequency axis to simulate a change in pitch.

  4. Lowering the amplitude of the spectrogram by a given factor to resemble quieter audio.

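The four augmentations above can be sketched as follows; the function names and the use of `np.roll` for the time and pitch shifts are illustrative choices, not our exact implementation:

```python
import numpy as np

def chunk_spectrogram(spec, chunk_width):
    """Split along the time axis into fixed-width chunks, zero-padding the last."""
    pad = (-spec.shape[1]) % chunk_width
    spec = np.pad(spec, ((0, 0), (0, pad)))
    return np.split(spec, spec.shape[1] // chunk_width, axis=1)

def time_shift(spec, frames):
    """Roll along the time axis to simulate a delayed call."""
    return np.roll(spec, frames, axis=1)

def pitch_shift(spec, bins):
    """Roll along the frequency axis to simulate a change in pitch."""
    return np.roll(spec, bins, axis=0)

def attenuate(spec, factor):
    """Scale the amplitude down to resemble quieter audio."""
    return spec * factor
```

Each transform returns a new array the same shape as its input (except chunking), so augmented examples can be mixed freely into the training set.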

CNN

To build our convolutional neural network (CNN), we used custom data generators with multiprocessing, which allowed for dynamic data loading and preprocessing. The network was designed with multiple convolutional layers and optimized through hyperparameter tuning to balance accuracy and performance.


Results

During the course of our project, we encountered significant limitations related to computational resources. Due to these constraints, we were unable to run our final model on the full dataset as initially planned. Instead, we opted to develop and test a prototype model using a smaller subset of the data, specifically focusing on just 5 birds. This approach allowed us to manage our computational load effectively while still providing valuable insights into the model’s performance.

Despite the reduced scale, our prototype model achieved an accuracy of 78.12% on both the training and validation datasets. This performance was consistent across different model configurations, which was initially surprising given the variation in parameter counts among them. That the test accuracy was also identical across setups indicates stable performance across the different architectures.

