This repository contains a FastAPI application that provides Swedish speech-to-text transcription, forced alignment, and speaker diarization using the whisperx library. The speech-to-text (Whisper) and forced-alignment models are fine-tuned by KBLab at Kungliga biblioteket (the National Library of Sweden). See https://www.kb.se/samverkan-och-utveckling/nytt-fran-kb/nyheter-samverkan-och-utveckling/2025-02-20-valtranad-ai-modell-forvandlar-tal-till-text.html for more details.
Everything has been tested on a laptop with an NVIDIA RTX 4090 GPU.
CI enforces linting with Ruff (`ruff check .`).
- Transcription: Transcribes audio files using the KBLab/kb-whisper-large Whisper model (or a smaller variant if desired).
- Alignment: Aligns the transcribed text with the audio to provide word-level timestamps.
- Speaker Diarization: Identifies different speakers in the audio and labels the transcribed segments accordingly.
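The three stages above can be sketched with the whisperx API roughly as follows. This is a sketch, not the application's actual code: the function name is hypothetical, and the exact whisperx calls (e.g. `DiarizationPipeline`, `compute_type`, `batch_size`) follow whisperx's documented interface and may differ between versions.

```python
def transcribe_audio(audio_file: str, hf_token: str,
                     min_speakers: int = 2, max_speakers: int = 2,
                     device: str = "cuda"):
    """Sketch of the transcribe -> align -> diarize pipeline."""
    import whisperx  # heavy import kept local so the sketch reads standalone

    # 1. Transcription with the KBLab fine-tuned Whisper model
    model = whisperx.load_model("KBLab/kb-whisper-large", device,
                                compute_type="float16")
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=16)

    # 2. Forced alignment to get word-level timestamps
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata,
                            audio, device)

    # 3. Speaker diarization; attach speaker labels to each segment
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token,
                                                 device=device)
    diarize_segments = diarize_model(audio, min_speakers=min_speakers,
                                     max_speakers=max_speakers)
    return whisperx.assign_word_speakers(diarize_segments, result)
```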
- NVIDIA GPU with CUDA drivers (for optimal performance)
- Docker
- Docker Compose
- Clone the repository:

  `git clone https://github.com/joenaess/swe-trans-dia.git`
- Create a `.env` file in the root directory and add your Hugging Face token:

  `HUGGINGFACE_TOKEN=your_huggingface_token`
- Download the models for the container (your token must have been granted access to https://huggingface.co/pyannote/speaker-diarization-3.1):

  `python model_dl.py`
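The contents of `model_dl.py` are not shown here; a downloader of this kind can be sketched with `huggingface_hub`'s `snapshot_download`. The repo list and the `./models` cache directory are assumptions for illustration:

```python
def download_models(hf_token: str, cache_dir: str = "./models") -> None:
    """Pre-fetch model weights so the container can start without network access."""
    from huggingface_hub import snapshot_download  # local import: needs huggingface_hub

    # Assumed repo list: the KBLab Whisper model and the gated pyannote pipeline.
    for repo_id in ("KBLab/kb-whisper-large", "pyannote/speaker-diarization-3.1"):
        snapshot_download(repo_id=repo_id, token=hf_token, cache_dir=cache_dir)
```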
- Build the Docker image:

  `docker-compose build --no-cache`
- Start the FastAPI server:

  `docker-compose up -d`
- Run the tests:

  `pytest tests/`
- Send a POST request to the `/transcribe/` endpoint with your audio file:

  `curl -X POST -F "[email protected]" "http://localhost:8000/transcribe/?min_speakers=2&max_speakers=2"`
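The same request can be issued from Python. Building the URL is pure standard library; the helper name below is hypothetical, and the audio must be sent as a multipart field named `file`, matching the curl command's `-F "file=@..."`:

```python
from urllib.parse import urlencode

def build_transcribe_url(base: str = "http://localhost:8000",
                         min_speakers: int = 2, max_speakers: int = 2) -> str:
    """Build the /transcribe/ URL with speaker bounds for diarization."""
    query = urlencode({"min_speakers": min_speakers, "max_speakers": max_speakers})
    return f"{base}/transcribe/?{query}"

# e.g. with the third-party requests library (not included here):
# requests.post(build_transcribe_url(), files={"file": open("audio.wav", "rb")})
```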
The API documentation is automatically generated by FastAPI and can be accessed at http://localhost:8000/docs (Swagger UI) or http://localhost:8000/redoc (ReDoc).