# diarized-transcriber

A Python library for transcribing media content with speaker diarization support. This package provides high-quality transcription using WhisperX with automatic speaker detection and labeling.

## Features
- High-quality transcription using WhisperX
- Automatic speaker diarization
- Support for multiple media sources
- GPU acceleration support
- Flexible output formats
- Type-safe interfaces
## Installation

```bash
pip install diarized-transcriber
```

## Requirements

- Python 3.10 or later
- CUDA-capable GPU
- PyTorch with CUDA support
- HuggingFace account for pyannote.audio access
Before running the examples below, you must export an authentication token for
the diarization pipeline. This token is referenced in
`diarized_transcriber/transcription/diarization.py` (see lines 43–55) and is
required for downloading the pretrained pyannote.audio model.
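Once exported, you can verify that the token is actually visible to the Python process with a small preflight check (the `hf_token_present` helper below is illustrative, not part of the library):

```python
import os

def hf_token_present() -> bool:
    """Return True if the HuggingFace token needed by pyannote.audio is set."""
    return bool(os.environ.get("HF_TOKEN"))

if not hf_token_present():
    print("HF_TOKEN is not set; the pyannote.audio model download will fail.")
```

Checking up front gives a clear error message instead of a failure deep inside the model download.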
```bash
export HF_TOKEN="<your-huggingface-token>"
```

## Quick Start

```python
from diarized_transcriber import TranscriptionEngine, MediaContent, MediaSource

# Initialize the engine
engine = TranscriptionEngine(model_size="base")

# Create media content object
content = MediaContent(
    id="example-1",
    title="Example Media",
    media_url="https://example.com/media.mp3",
    source=MediaSource(type="podcast"),
)

# Perform transcription
result = engine.transcribe(content)

# Format the result
from diarized_transcriber.utils.formatting import format_transcript

transcript = format_transcript(
    result,
    output_format="text",
    group_by_speaker=True,
)
print(transcript)
```

## Configuration

### HuggingFace Token

Set up your HuggingFace token for pyannote.audio access:
```bash
export HF_TOKEN='your-huggingface-token'
```

### GPU Setup

Ensure CUDA is properly configured for your system. A good way to check that CUDA is working is to run `nvidia-smi` in your terminal.
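The same check can be scripted as a preflight step, since `nvidia-smi` ships with the NVIDIA driver. This helper is illustrative, not part of the library, and only confirms the driver side of the CUDA stack (not your PyTorch build):

```python
import shutil
import subprocess

def cuda_driver_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits cleanly."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    return result.returncode == 0

print("CUDA driver visible:", cuda_driver_visible())
```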
## Model Selection

The `TranscriptionEngine` can be configured with different model sizes:
- `tiny`: Fastest, lowest accuracy
- `base`: Good balance of speed and accuracy
- `small`: Better accuracy, slower than base
- `medium`: High accuracy, slower
- `large`: Highest accuracy, slowest
Example configuration:

```python
engine = TranscriptionEngine(
    model_size="medium",
    compute_type="float16",  # or "float32" for higher precision
)
```

## Output Formats

The library supports multiple output formats.
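The `timestamp_format="HH:MM:SS.mmm"` option used with text output encodes hours, minutes, seconds, and milliseconds. As a rough sketch of what that conversion involves (the `format_timestamp` helper below is illustrative, not part of the library's API):

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

print(format_timestamp(3725.042))  # → 01:02:05.042
```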
Text output with timestamps:

```python
transcript = format_transcript(
    result,
    output_format="text",
    include_timestamps=True,
    timestamp_format="HH:MM:SS.mmm",
    group_by_speaker=True,
)
```

Dictionary output:

```python
transcript_dict = format_transcript(
    result,
    output_format="dict",
    include_timestamps=True,
)
```

## API Server

Install the optional dependencies:
```bash
pip install diarized-transcriber[server]
```

Start the service:
```bash
python -m diarized_transcriber.api.server
```

## Error Handling

The library provides specific exceptions for different error cases:
- `GPUConfigError`: GPU-related issues
- `ModelLoadError`: Model loading failures
- `AudioProcessingError`: Audio processing problems
- `TranscriptionError`: General transcription failures
- `DiarizationError`: Speaker diarization issues
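A typical pattern is to catch these around a transcription call and map each failure to a status your application can act on. In the sketch below the exception classes are stubbed locally so the example runs standalone; in real code you would import them from `diarized_transcriber` (check your installed version for the exact import path, which this README does not specify):

```python
# Stand-in definitions so this sketch runs without the library installed.
class GPUConfigError(Exception):
    pass

class TranscriptionError(Exception):
    pass

def safe_transcribe(transcribe, content):
    """Wrap a transcription call and map failures to a status string."""
    try:
        return "ok", transcribe(content)
    except GPUConfigError as exc:
        return "gpu-error", str(exc)
    except TranscriptionError as exc:
        return "transcription-error", str(exc)

def broken_backend(content):
    # Simulate a misconfigured GPU for demonstration purposes.
    raise GPUConfigError("no CUDA device found")

status, detail = safe_transcribe(broken_backend, content=None)
print(status, detail)  # → gpu-error no CUDA device found
```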
## Contributing

Contributions are welcome! Please see our contributing guidelines for more details.
## License

MIT License - see LICENSE file for details.