Download → Transcribe → Diarize
A production-ready podcast transcription pipeline that automatically fetches new episodes, transcribes them with word-level timestamps, identifies speakers, and uploads structured JSON to cloud storage.
```
DTD Pipeline Overview

Taddy API ──▶ Download ──▶ Transcribe ──▶ Diarize ──▶ Identify ──▶ Upload
(episodes)    (yt-dlp)     (Whisper)      (Pyannote)  (ECAPA)    (Supabase)
```
Input: Podcast RSS feeds via Taddy API
Output: Speaker-attributed transcripts with word-level timestamps in JSON
- Automatic Episode Discovery — Polls Taddy API for new episodes within configurable lookback window
- High-Quality Transcription — Uses faster-whisper with large-v3 model and word timestamps
- Speaker Diarization — Pyannote 3.1 identifies who speaks when
- Speaker Recognition — Matches voices against known speaker embeddings (192D ECAPA-TDNN)
- Cloud Storage — Uploads structured JSON to Supabase with date-partitioned paths
- State Tracking — SQLite prevents reprocessing; supports manual reprocess override
- GPU Optimized — Sequential model loading with explicit VRAM management
- Docker Ready — Single container with baked models for 24/7 operation
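The state-tracking idea can be sketched with the standard `sqlite3` module. This is an illustrative schema, not necessarily the one the repo's `state.py` actually uses:

```python
import sqlite3


def init_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open the state DB and create the processed-episodes table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS processed_episodes (
               episode_id   TEXT PRIMARY KEY,
               podcast_id   TEXT NOT NULL,
               processed_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def already_done(conn: sqlite3.Connection, episode_id: str) -> bool:
    """True if this episode was processed before (a --reprocess flag would skip this check)."""
    row = conn.execute(
        "SELECT 1 FROM processed_episodes WHERE episode_id = ?", (episode_id,)
    ).fetchone()
    return row is not None
```

A primary key on `episode_id` makes "mark processed" idempotent: reinserting the same episode fails fast instead of duplicating state.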
```
PIPELINE STAGES

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  FETCH   │──▶│ DOWNLOAD │──▶│TRANSCRIBE│──▶│ DIARIZE  │──▶│ IDENTIFY │
│  Taddy   │   │  yt-dlp  │   │ Whisper  │   │ Pyannote │   │  ECAPA-  │
│ GraphQL  │   │          │   │ large-v3 │   │   3.1    │   │   TDNN   │
└────┬─────┘   └──────────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
     │                             │              │              │
     │                             ▼              ▼              ▼
     │                     ┌──────────────────────────────────────┐
     │                     │           VRAM MANAGEMENT            │
     │                     │   (unload model after each stage)    │
     │                     └──────────────────────────────────────┘
     ▼
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  CHECK   │   │  MERGE   │──▶│  UPLOAD  │──▶│   MARK   │
│  STATE   │   │          │   │          │   │PROCESSED │
│          │   │ Segments │   │ Supabase │   │          │
│SQLite DB │   │+Speakers │   │ Storage  │   │SQLite DB │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
```
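The MERGE stage has to reconcile Whisper segments with Pyannote speaker turns. A common approach, sketched below, is to label each segment with the speaker whose turn overlaps it most in time; the actual logic in `services/merger.py` may differ:

```python
from typing import Optional


def assign_speaker(segment: dict, turns: list) -> Optional[str]:
    """Label a transcript segment with the diarization turn overlapping it most.

    `segment` has "start"/"end" in seconds; each turn in `turns` has
    "speaker"/"start"/"end". Returns None when nothing overlaps.
    """
    def overlap(turn: dict) -> float:
        # length of the intersection of the two time intervals
        return max(0.0, min(segment["end"], turn["end"]) - max(segment["start"], turn["start"]))

    best = max(turns, key=overlap, default=None)
    return best["speaker"] if best is not None and overlap(best) > 0 else None
```

Maximal-overlap assignment tolerates the small boundary disagreements that are typical between ASR segmentation and diarization turns.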
The pipeline uses three GPU-intensive models that cannot fit in VRAM simultaneously:
```
VRAM USAGE TIMELINE

VRAM
 ▲
 │     ┌───────┐
6GB    │Whisper│
 │     │       │    ┌────────┐
4GB    │       │    │Pyannote│
 │     │       │    │        │    ┌───────┐
2GB    │       │    │        │    │ ECAPA │
 │     │       │    │        │    │       │
 └─────┴───────┴────┴────────┴────┴───────┴──────▶ Time
     Load   Unload Load   Unload Load  Unload
```
Each model is explicitly unloaded after use via torch.cuda.empty_cache().
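The unload pattern can be sketched as follows; `release_vram` is a hypothetical helper mirroring what the repo's `utils.py` VRAM unload helper presumably does, not its actual code:

```python
import gc


def release_vram() -> None:
    """Flush cached GPU memory after the caller has dropped its model.

    Assumes the caller has already deleted (or reassigned) the model
    variable, so its tensors are unreachable and collectable.
    """
    gc.collect()  # collect the now-unreachable model and its tensors
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
    except ImportError:
        pass  # torch not installed (e.g. CPU-only tooling); nothing to flush


# Usage between stages:
#   model = load_whisper_somehow()   # hypothetical loader
#   segments = run_transcription(model)  # hypothetical stage
#   del model                        # drop the only reference first
#   release_vram()                   # then flush the cache
```

The order matters: `empty_cache()` only releases blocks whose tensors are already freed, so the model reference must be dropped and collected first.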
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA 8GB VRAM | NVIDIA 16GB+ VRAM |
| CUDA | 11.8 | 12.1+ |
| RAM | 16GB | 32GB |
| Disk | 20GB | 50GB+ |
| Python | 3.11 | 3.11+ |
```bash
git clone <your-repo-url>
cd be-flow-dtd

# Using uv (recommended)
uv sync

# Or with pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

You need credentials from three services:
1. Taddy API
   - Go to https://taddy.org/developers
   - Create an account and get your API credentials
   - Note your `API_KEY` and `USER_ID`

2. HuggingFace
   - Go to https://huggingface.co/settings/tokens
   - Create a new token with read access
   - Important: Accept the pyannote model license:
     - Visit https://huggingface.co/pyannote/speaker-diarization-3.1
     - Click "Agree and access repository"

3. Supabase
   - Go to https://supabase.com and create a project
   - Go to Settings → API to get your URL and service key
   - Create a storage bucket named `podcast-transcripts`:

```sql
-- In Supabase SQL Editor
INSERT INTO storage.buckets (id, name, public)
VALUES ('podcast-transcripts', 'podcast-transcripts', false);
```
```bash
cp .env.example .env
# Edit .env with your credentials (see .env.example for all options)
```

Edit podcasts.json to add your podcasts:
```json
{
  "podcasts": [
    {
      "id": "lex-fridman",
      "name": "Lex Fridman Podcast",
      "taddy_id": "c5b7a123-4567-89ab-cdef-0123456789ab",
      "min_speakers": 2,
      "max_speakers": 4,
      "known_speakers": [
        {"slug": "lex-fridman", "name": "Lex Fridman", "role": "host"}
      ]
    },
    {
      "id": "huberman-lab",
      "name": "Huberman Lab",
      "taddy_id": "d6c8b234-5678-90bc-def0-1234567890bc",
      "min_speakers": 1,
      "max_speakers": 3,
      "known_speakers": [
        {"slug": "andrew-huberman", "name": "Andrew Huberman", "role": "host"}
      ]
    }
  ]
}
```

Each `known_speakers` entry requires `slug`, `name`, and `role` (typically `"host"`).
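As a sketch of how such a config could be sanity-checked before a run (this validator is illustrative and not part of the repo):

```python
import json

# Fields the README says every entry must carry
REQUIRED_SPEAKER_KEYS = {"slug", "name", "role"}
REQUIRED_PODCAST_KEYS = ("id", "name", "taddy_id")


def validate_podcasts(raw: str) -> list:
    """Return a list of human-readable problems found in a podcasts.json document."""
    problems = []
    for pod in json.loads(raw)["podcasts"]:
        for key in REQUIRED_PODCAST_KEYS:
            if key not in pod:
                problems.append(f"{pod.get('id', '?')}: missing {key}")
        for spk in pod.get("known_speakers", []):
            missing = REQUIRED_SPEAKER_KEYS - spk.keys()
            if missing:
                problems.append(f"{pod.get('id', '?')}: speaker missing {sorted(missing)}")
    return problems
```

Running this at startup turns a silent misconfiguration (e.g. a speaker without a `role`) into an explicit error message.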
Finding Taddy IDs:
```bash
uv run python find_podcast.py "Podcast Name"
```

To enable speaker recognition, you need 192D ECAPA-TDNN embeddings for known speakers.
Option A: Seed from pipeline output (recommended)
Run the pipeline first, then promote unknown embeddings to known:
```bash
# Run pipeline to generate unknown embeddings
uv run python main.py --podcast lex-fridman --dry-run

# Interactive CLI to assign unknowns to known speakers
uv run python scripts/seed_speakers.py lex-fridman
```

Option B: Manual embedding generation
```python
import numpy as np
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="data/models/ecapa-tdnn"
)

# Load a clean sample of the speaker's voice; ECAPA expects 16 kHz audio
signal, sr = torchaudio.load("speaker_sample.wav")
if sr != 16000:
    signal = torchaudio.transforms.Resample(sr, 16000)(signal)

# Encode to a 192D embedding and save it where the pipeline looks for it
embedding = classifier.encode_batch(signal).squeeze().cpu().numpy()
np.save("data/speakers/known/podcast-id/speaker-slug.npy", embedding)
```

```bash
# Dry run - process but don't upload
uv run python main.py --dry-run --verbose

# Test single podcast
uv run python main.py --podcast lex-fridman --dry-run

# Test single episode by URL
uv run python main.py --episode-url "https://example.com/episode.mp3" --dry-run
```

```bash
# Process all configured podcasts
uv run python main.py

# With verbose logging
uv run python main.py --verbose

# Force reprocess already-done episodes
uv run python main.py --reprocess
```

```bash
docker-compose build   # Includes all models (10-20 min)
docker-compose up -d   # Start service
docker-compose logs -f # View logs
```

```bash
crontab -e
# Run every 6 hours
0 */6 * * * cd /path/to/be-flow-dtd && docker-compose run --rm dtd-pipeline
```

Create /etc/systemd/system/dtd-pipeline.service:
```ini
[Unit]
Description=DTD Pipeline
After=docker.service

[Service]
Type=oneshot
WorkingDirectory=/path/to/be-flow-dtd
ExecStart=/usr/bin/docker-compose run --rm dtd-pipeline

[Install]
WantedBy=multi-user.target
```

Create /etc/systemd/system/dtd-pipeline.timer:
```ini
[Unit]
Description=Run DTD Pipeline every 6 hours

[Timer]
OnBootSec=15min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target
```

```bash
sudo systemctl enable --now dtd-pipeline.timer
```

To run on bare metal (without Docker):

```bash
sudo apt install ffmpeg
nvidia-smi  # Verify GPU
# Add to crontab: 0 */6 * * * cd /path/to/be-flow-dtd && uv run python main.py
```

Performance notes:
- An RTX 3090/4090 processes ~100 hours of audio per day with large-v3
- 8GB VRAM cards: use `WHISPER_MODEL=medium` for reliable operation
- Monitor with `watch -n 1 nvidia-smi` during initial runs
| Variable | Required | Default | Description |
|---|---|---|---|
| **Taddy API** | | | |
| `TADDY_API_KEY` | Yes | — | API key from taddy.org |
| `TADDY_USER_ID` | Yes | — | User ID from taddy.org |
| `LOOKBACK_DAYS` | No | `7` | Days to look back for new episodes |
| **HuggingFace** | | | |
| `HF_TOKEN` | Yes | — | Token for pyannote model access |
| **Whisper** | | | |
| `WHISPER_MODEL` | No | `large-v3` | Model: tiny/base/small/medium/large-v2/large-v3 |
| `WHISPER_DEVICE` | No | `cuda` | Device: cuda/cpu |
| `WHISPER_COMPUTE_TYPE` | No | `float16` | Precision: float16/int8/float32 |
| **Pyannote** | | | |
| `PYANNOTE_MODEL` | No | `pyannote/speaker-diarization-3.1` | Diarization model |
| **Speaker ID** | | | |
| `SPEAKER_MATCH_THRESHOLD` | No | `0.70` | Cosine similarity threshold (0.0-1.0) |
| **Supabase** | | | |
| `SUPABASE_URL` | Yes | — | Project URL |
| `SUPABASE_KEY` | Yes | — | Service role key |
| `SUPABASE_BUCKET` | No | `podcast-transcripts` | Storage bucket name |
| **Paths** | | | |
| `DATA_DIR` | No | `data` | Base data directory |
| `AUDIO_DIR` | No | `data/audio` | Temporary audio files (auto-cleaned) |
| `SPEAKERS_DIR` | No | `data/speakers` | Speaker embeddings directory |
| `LOGS_DIR` | No | `data/logs` | Log files directory |
| `STATE_DB_PATH` | No | `data/state.db` | SQLite database path |
| `PODCASTS_CONFIG_PATH` | No | `podcasts.json` | Podcast config file |
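`SPEAKER_MATCH_THRESHOLD` is compared against the cosine similarity between a diarized speaker's embedding and each known embedding. A minimal sketch of that comparison in plain Python (illustrative; the pipeline itself works on 192D NumPy arrays):

```python
from math import sqrt
from typing import Optional


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def match(embedding: list, known: dict, threshold: float = 0.70) -> Optional[tuple]:
    """Return (name, score) for the best-scoring known speaker above threshold,
    else None (the segment is then filed under unknown/)."""
    best = max(
        ((name, cosine(embedding, emb)) for name, emb in known.items()),
        key=lambda t: t[1],
        default=None,
    )
    if best is not None and best[1] >= threshold:
        return best
    return None
```

A below-threshold best match is treated as a new, unknown speaker rather than a weak match, which is what makes the later promote-to-known workflow safe.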
```json
{
  "podcasts": [
    {
      "id": "unique-podcast-id",
      "name": "Human Readable Name",
      "taddy_id": "uuid-from-taddy",
      "min_speakers": 1,
      "max_speakers": 10,
      "known_speakers": [
        {"slug": "speaker-slug", "name": "Speaker Name", "role": "host"}
      ]
    }
  ]
}
```

Uploaded to: `{bucket}/{podcast-id}/{YYYY}/{MM}/{YYYY-MM-DD}-{slug}.json`
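The date-partitioned key can be derived from the episode's `published_at` timestamp. A sketch, where `storage_path` is a hypothetical helper (the bucket prefix is supplied separately when calling the Supabase client):

```python
from datetime import datetime


def storage_path(podcast_id: str, published_at: str, slug: str) -> str:
    """Build the object key {podcast-id}/{YYYY}/{MM}/{YYYY-MM-DD}-{slug}.json."""
    # Accept the trailing-Z ISO form used in the output JSON
    dt = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    return f"{podcast_id}/{dt:%Y}/{dt:%m}/{dt:%Y-%m-%d}-{slug}.json"
```

Partitioning by year/month keeps bucket listings small and makes date-ranged backfills a simple prefix query.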
```json
{
  "episode": {
    "id": "taddy-episode-uuid",
    "podcast_id": "taddy-podcast-uuid",
    "title": "Episode Title",
    "audio_url": "https://...",
    "published_at": "2024-01-15T10:00:00Z",
    "duration_seconds": 3600,
    "description": "Episode description..."
  },
  "transcript": [
    {
      "text": "Hello and welcome to the show.",
      "start": 0.0,
      "end": 2.5,
      "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
        {"text": "and", "start": 0.5, "end": 0.6, "confidence": 0.99},
        {"text": "welcome", "start": 0.7, "end": 1.1, "confidence": 0.97}
      ],
      "speaker_id": "host-uuid",
      "speaker_name": "John Host",
      "confidence": 0.92
    }
  ],
  "speakers": [
    {
      "id": "host-uuid",
      "podcast_id": "podcast-id",
      "name": "John Host",
      "embedding_path": "data/speakers/known/podcast-id/john-host.npy"
    }
  ],
  "processing_metadata": {
    "processed_at": "2024-01-15T12:30:00Z",
    "whisper_model": "large-v3",
    "pyannote_model": "pyannote/speaker-diarization-3.1",
    "speaker_threshold": 0.70
  }
}
```

```bash
uv run python main.py [OPTIONS]
```

| Option | Description |
|---|---|
| `--podcast ID` | Process only this podcast (matches `id` in podcasts.json) |
| `--episode-url URL` | Debug mode: process a single episode by direct audio URL |
| `--dry-run` | Process but don't upload; print JSON to stdout |
| `--reprocess` | Ignore state.db; reprocess already-completed episodes |
| `--verbose, -v` | Enable debug logging |
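A parser equivalent to this option table could look like the following `argparse` sketch (illustrative, not the repo's actual `main.py`):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser matching the option table above."""
    p = argparse.ArgumentParser(prog="main.py")
    p.add_argument("--podcast", metavar="ID",
                   help="process only this podcast (id from podcasts.json)")
    p.add_argument("--episode-url", metavar="URL",
                   help="debug mode: process a single episode by audio URL")
    p.add_argument("--dry-run", action="store_true",
                   help="process but don't upload; print JSON to stdout")
    p.add_argument("--reprocess", action="store_true",
                   help="ignore state.db; reprocess completed episodes")
    p.add_argument("--verbose", "-v", action="store_true",
                   help="enable debug logging")
    return p


# argparse turns --dry-run into args.dry_run, --episode-url into args.episode_url
args = build_parser().parse_args(["--podcast", "lex-fridman", "--dry-run"])
```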
```bash
# Find podcast UUID from Taddy API
uv run python find_podcast.py "Podcast Name"

# Seed known speakers (interactive: promote unknown → known)
uv run python scripts/seed_speakers.py <podcast_id>

# CocoIndex monitoring (requires DATABASE_URL)
uv run cocoindex server -ci cocoindex_flow.py
```

```bash
# Normal operation - process all podcasts
uv run python main.py

# Process only one podcast
uv run python main.py --podcast the-bitcoin-matrix

# Debug a specific episode without uploading
uv run python main.py --episode-url "https://example.com/ep.mp3" --dry-run

# Reprocess everything with verbose output
uv run python main.py --reprocess --verbose
```

Speaker embeddings are stored on disk as .npy files:

```
data/speakers/
├── known/
│   ├── the-bitcoin-matrix/
│   │   └── cedric-youngelman.npy             # 192D ECAPA-TDNN embedding
│   └── what-bitcoin-did/
│       └── danny-knowles.npy
└── unknown/
    └── the-bitcoin-matrix_unknown_abc123.npy # Auto-generated for unmatched speakers
```
After processing episodes, unknown speakers appear in data/speakers/unknown/. Use the interactive seeding script to promote them:
```bash
uv run python scripts/seed_speakers.py the-bitcoin-matrix
```

This displays unknown embeddings and known speaker configs, then lets you assign each unknown to a known speaker by number.
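Conceptually, promoting an unknown speaker amounts to moving its .npy file into the known/ tree under the chosen slug. A sketch, where `promote` is a hypothetical helper (the real scripts/seed_speakers.py is interactive and may do more):

```python
import shutil
from pathlib import Path


def promote(unknown_file: Path, podcast_id: str, slug: str,
            speakers_dir: Path = Path("data/speakers")) -> Path:
    """Move an unknown embedding into known/<podcast_id>/<slug>.npy."""
    dest = speakers_dir / "known" / podcast_id / f"{slug}.npy"
    dest.parent.mkdir(parents=True, exist_ok=True)  # create podcast dir on first promote
    shutil.move(str(unknown_file), str(dest))       # file leaves the unknown/ tree
    return dest
```

Because matching keys off the file path, the next run will recognize that voice under the promoted slug with no other state changes.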
Problem: CUDA out of memory
```bash
export WHISPER_MODEL=medium       # or small
export WHISPER_COMPUTE_TYPE=int8
```

Problem: `RuntimeError: CUDA error`
```bash
uv run python -c "import torch; print(torch.cuda.is_available())"
nvidia-smi
```

| Error | Solution |
|---|---|
| `HF_TOKEN required` | Set `HF_TOKEN` in .env and accept the pyannote license |
| `Unauthorized` (Taddy) | Verify `TADDY_API_KEY` and `TADDY_USER_ID` |
| `Invalid API key` (Supabase) | Use the service role key, not the anon key |
Problem: download failures

```bash
uv run pip install -U yt-dlp
yt-dlp --extract-audio "URL"
```

Inspecting the state database:

```bash
# View processed episodes
sqlite3 data/state.db "SELECT * FROM processed_episodes ORDER BY processed_at DESC LIMIT 10;"

# Clear state to force reprocess (or use --reprocess flag)
sqlite3 data/state.db "DELETE FROM processed_episodes WHERE podcast_id='my-podcast';"
```

Recommended settings by hardware:

| Scenario | Settings |
|---|---|
| 16GB+ VRAM | `WHISPER_MODEL=large-v3`, `WHISPER_COMPUTE_TYPE=float16` |
| 8GB VRAM | `WHISPER_MODEL=medium`, `WHISPER_COMPUTE_TYPE=float16` |
| 4GB VRAM / CPU | `WHISPER_MODEL=small`, `WHISPER_DEVICE=cpu`, `WHISPER_COMPUTE_TYPE=int8` |
```
be-flow-dtd/
├── main.py              # Entry point and pipeline orchestration
├── config.py            # Pydantic Config from environment variables
├── models.py            # Data models (Episode, TranscriptSegment, KnownSpeaker, etc.)
├── state.py             # SQLite state tracking (processed_episodes, speakers)
├── utils.py             # Retry decorator, slugify, VRAM unload helper
├── logging_config.py    # Structured logging with podcast context
├── cocoindex_flow.py    # CocoIndex flows for monitoring/visualization
├── find_podcast.py      # Taddy API search → podcast UUID + recent episodes
├── podcasts.json        # Podcast configuration (id, taddy_id, known_speakers)
├── scripts/
│   └── seed_speakers.py # Interactive CLI: promote unknown → known embeddings
├── services/
│   ├── taddy.py         # Taddy GraphQL client
│   ├── downloader.py    # yt-dlp audio downloader
│   ├── transcriber.py   # faster-whisper transcription
│   ├── diarizer.py      # Pyannote 3.1 speaker diarization
│   ├── speaker_id.py    # ECAPA-TDNN 192D voiceprint matching
│   ├── merger.py        # Transcript + speaker turn alignment
│   └── storage.py       # Supabase upload (date-partitioned)
├── data/
│   ├── audio/           # Temporary audio (auto-cleaned)
│   ├── speakers/
│   │   ├── known/       # Known speaker embeddings (.npy)
│   │   └── unknown/     # Unmatched speakers (.npy)
│   ├── logs/
│   │   ├── pipeline.log # Human-readable logs
│   │   └── errors.jsonl # Structured errors
│   └── state.db         # SQLite tracking
├── pyproject.toml       # Python 3.11+, uv-managed dependencies
├── requirements.txt     # Pip-compatible dependencies
├── Dockerfile           # Container with baked models
├── docker-compose.yml   # GPU deployment
└── .env.example         # Environment template
```
- be-podcast-etl — Downstream belief extraction pipeline (speakers → ads → extract → abstract → embed → weights → headlines → matrix)