This notebook implements a Python pipeline that generates subtitles from audio files, with speaker diarization and a quality-check agent. It is intended for short clips with multiple speakers and produces SRT files in which each line of dialogue carries a speaker label.
- Speaker diarization with consistent labeling.
- Dialogue-level transcription.
- Confidence-based quality evaluation per segment.
- Audio Preparation: Standardize audio to mono PCM WAV.
- Transcription: Use Whisper to transcribe the audio into timestamped segments.
- Diarization: Assign speakers by clustering pyannote embeddings, which is more reliable than basic methods.
- Merging: Combine adjacent segments into dialogue-level lines for natural flow.
- Output: Write an SRT file with speaker labels.
- Quality Agent: Rule-based confidence scoring and feedback.
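The merging and output steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the `Segment` class, the helper names, the 0.5-second `max_gap`, and the "Speaker N" naming scheme (numbered by order of first appearance, one way to keep labels consistent) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: int   # raw cluster id from the diarization step (hypothetical field)
    text: str


def relabel_by_first_appearance(segments):
    """Map raw cluster ids to 'Speaker 1', 'Speaker 2', ... in order of first appearance."""
    names = {}
    for seg in segments:
        if seg.speaker not in names:
            names[seg.speaker] = f"Speaker {len(names) + 1}"
    return names


def merge_segments(segments, max_gap=0.5):
    """Join consecutive same-speaker segments separated by at most max_gap seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        same_turn = (
            prev is not None
            and prev.speaker == seg.speaker
            and seg.start - prev.end <= max_gap
        )
        if same_turn:
            prev.end = max(prev.end, seg.end)
            prev.text = f"{prev.text} {seg.text.strip()}"
        else:
            # Copy so the caller's segments are not mutated.
            merged.append(Segment(seg.start, seg.end, seg.speaker, seg.text.strip()))
    return merged


def srt_timestamp(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT text, prefixing each cue with its speaker label."""
    names = relabel_by_first_appearance(segments)
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{names[seg.speaker]}: {seg.text}\n"
        )
    return "\n".join(blocks)
```

Calling `to_srt(merge_segments(segments))` yields text that can be written directly to a `.srt` file. Numbering speakers by first appearance means the labels do not depend on the arbitrary cluster ids a clustering run happens to return.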
Input audio → Convert → Transcribe → Embed & Cluster → Merge → SRT & Quality Report.
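As an illustration of the rule-based quality check that ends this flow, the sketch below scores segments using three fields that Whisper attaches to each transcribed segment: `avg_logprob`, `no_speech_prob`, and `compression_ratio`. The function names, the thresholds, and the feedback wording are assumptions chosen for the example and may differ from the rules the notebook actually applies.

```python
import math


def score_segment(seg, min_conf=0.6, max_no_speech=0.5, max_compression=2.4):
    """Return a confidence value and rule-based feedback for one Whisper segment dict.

    All thresholds are illustrative assumptions, not tuned values.
    """
    # exp(mean token log-probability) is the geometric mean of token probabilities.
    confidence = math.exp(seg["avg_logprob"])
    issues = []
    if confidence < min_conf:
        issues.append("low transcription confidence; consider manual review")
    if seg.get("no_speech_prob", 0.0) > max_no_speech:
        issues.append("segment may contain no speech")
    if seg.get("compression_ratio", 0.0) > max_compression:
        issues.append("text looks repetitive; possible decoding failure")
    return {
        "start": seg["start"],
        "end": seg["end"],
        "confidence": round(confidence, 3),
        "issues": issues,
        "ok": not issues,
    }


def quality_report(segments, **thresholds):
    """Score every segment and summarize how many were flagged."""
    results = [score_segment(seg, **thresholds) for seg in segments]
    flagged = [r for r in results if not r["ok"]]
    return {"total": len(results), "flagged": len(flagged), "segments": results}
```

The dictionaries passed in are expected to look like the entries of `result["segments"]` returned by Whisper's `transcribe`. Keeping the thresholds as keyword arguments makes it easy to tighten or relax the rules per recording.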
- Limitations: The speaker count must be fixed in advance; results are best on clean audio.
- Improvements: Add overlapping-speech detection, LLM-generated feedback, and automatic speaker-count estimation.