This repository contains the source code, experiments, and resources for the Master's thesis: "Effective Training of Neural Networks for Automatic Speech Recognition".
- Degree: Master of Information Technology (NMAL)
- Institution: Faculty of Information Technology, Brno University of Technology (FIT BUT)
- Author: Matej Horník ([email protected])
- Supervisor: Ing. Alexander Polok ([email protected])
- Year: 2025
- Thesis Link: Official Thesis Page (VUT) (Link will be fully active after defense)
This project systematically investigates efficient training strategies for encoder-decoder Transformer models in Automatic Speech Recognition (ASR). It explores initialization techniques, the role of adapter layers, Parameter-Efficient Fine-tuning (PEFT) methods like LoRA and DoRA, and the impact of domain-specific pre-training, primarily using the LibriSpeech and VoxPopuli datasets.
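As background for the PEFT experiments, the core idea behind LoRA can be sketched in a few lines of NumPy: the pretrained weight stays frozen and only a low-rank update `B @ A` is trained. This is a minimal illustration under assumed shapes, not the thesis implementation (which uses standard PEFT tooling); all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                    # hidden size and low-rank bottleneck (r << d)
W = rng.normal(size=(d, d))    # frozen pretrained weight (never updated)

# LoRA trains only A and B; B starts at zero so the initial update is a no-op.
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))
alpha = 16                     # scaling hyperparameter

def lora_forward(x):
    # y = x W^T + (alpha / r) * x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), x @ W.T)
# Trainable parameters (A and B) are far fewer than in W itself.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

DoRA extends this idea by additionally decomposing the weight into magnitude and direction, but the frozen-weight-plus-low-rank-update structure above is the common core.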
The code includes scripts for model creation, fine-tuning (leveraging Hugging Face Transformers), evaluation, and implementations of the various experimental setups discussed in the thesis.
One result of this work is a Wav2Vec2-BART (base) model fine-tuned on English VoxPopuli, achieving a Word Error Rate (WER) of 8.85% on the test set.
You can find the model, along with usage instructions and a detailed model card, on the Hugging Face Hub: `BUT-FIT/wav2vec2-base_bart-base_voxpopuli-en`
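For reference, the WER reported above is the word-level Levenshtein (edit) distance divided by the number of reference words. A minimal pure-Python sketch (illustrative only; the thesis experiments rely on standard evaluation tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# 1 substitution (sat -> sit) + 1 deletion (the) over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # → 0.3333...
```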
This project uses Poetry for dependency management and packaging.
- Clone the repository:
  ```bash
  git clone https://github.com/hornikmatej/thesis_mit.git
  cd thesis_mit
  ```
- Install Poetry: If you don't have Poetry installed, follow the instructions on the official Poetry website.
- Install dependencies: This command will create a virtual environment and install all the necessary packages defined in `pyproject.toml` and `poetry.lock`.
  ```bash
  poetry install
  ```
- Activate the virtual environment:
  ```bash
  poetry shell
  ```

You are now in the project's virtual environment with all dependencies available.
The repository is structured to facilitate the reproduction of experiments:
- The main training script for sequence-to-sequence ASR models is `run_speech_recognition_seq2seq.py`.
- Specific experiment configurations and launch commands are organized within shell scripts, primarily in the `run_scripts/` directory (e.g., `run_scripts/voxpopuli_best.sh`).
- The `src/` directory contains custom modules for model creation, specialized trainers, data handling, etc.
- Ensure you have the necessary datasets downloaded or accessible (e.g., via Hugging Face Datasets caching). Preprocessing scripts or arguments might be needed, as detailed in the thesis or individual run scripts.
Please refer to the thesis document and the comments within the scripts for detailed instructions on running specific experiments.
If you use code or findings from this thesis in your research, please consider citing:
@mastersthesis{Hornik2025EffectiveTraining,
author = {Horník, Matej},
title = {Effective Training of Neural Networks for Automatic Speech Recognition},
school = {Brno University of Technology, Faculty of Information Technology},
year = {2025},
supervisor = {Polok, Alexander},
type = {Master's Thesis},
note = {Online. Available at: \url{https://www.vut.cz/en/students/final-thesis/detail/164401} and code at \url{https://github.com/hornikmatej/thesis_mit}}
}