
DTD Pipeline

Download → Transcribe → Diarize

A production-ready podcast transcription pipeline that automatically fetches new episodes, transcribes them with word-level timestamps, identifies speakers, and uploads structured JSON to cloud storage.


What It Does

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DTD Pipeline Overview                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Taddy API ──▶ Download ──▶ Transcribe ──▶ Diarize ──▶ Identify ──▶ Upload │
│   (episodes)    (yt-dlp)    (Whisper)     (Pyannote)   (ECAPA)    (Supabase)│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Input: Podcast RSS feeds via Taddy API
Output: Speaker-attributed transcripts with word-level timestamps in JSON

Key Features

  • Automatic Episode Discovery — Polls Taddy API for new episodes within configurable lookback window
  • High-Quality Transcription — Uses faster-whisper with large-v3 model and word timestamps
  • Speaker Diarization — Pyannote 3.1 identifies who speaks when
  • Speaker Recognition — Matches voices against known speaker embeddings (192D ECAPA-TDNN)
  • Cloud Storage — Uploads structured JSON to Supabase with date-partitioned paths
  • State Tracking — SQLite prevents reprocessing; supports manual reprocess override
  • GPU Optimized — Sequential model loading with explicit VRAM management
  • Docker Ready — Single container with baked models for 24/7 operation
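The lookback window works as a simple publish-date filter. A minimal sketch of the idea (`within_lookback` is a hypothetical helper for illustration, not the repo's actual function):

```python
from datetime import datetime, timedelta, timezone

def within_lookback(published_at: str, lookback_days: int = 7) -> bool:
    """True if an episode's publish time falls inside the lookback window.

    published_at: ISO-8601 timestamp, e.g. '2024-01-15T10:00:00Z'.
    """
    published = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - published <= timedelta(days=lookback_days)
```

Episodes older than LOOKBACK_DAYS are skipped during discovery; state tracking then filters out anything already processed.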

Architecture

Pipeline Flow

┌────────────────────────────────────────────────────────────────────────────────┐
│                              PIPELINE STAGES                                    │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │  FETCH  │───▶│DOWNLOAD │───▶│TRANS-   │───▶│ DIARIZE │───▶│IDENTIFY │       │
│  │         │    │         │    │ CRIBE   │    │         │    │         │       │
│  │ Taddy   │    │ yt-dlp  │    │ Whisper │    │Pyannote │    │ECAPA-   │       │
│  │ GraphQL │    │         │    │ large-v3│    │  3.1    │    │TDNN     │       │
│  └─────────┘    └─────────┘    └────┬────┘    └────┬────┘    └────┬────┘       │
│       │                             │              │              │             │
│       │                             ▼              ▼              ▼             │
│       │                        ┌────────────────────────────────────┐           │
│       │                        │        VRAM MANAGEMENT             │           │
│       │                        │  (unload model after each stage)   │           │
│       │                        └────────────────────────────────────┘           │
│       │                                                                         │
│       ▼                                                                         │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐                      │
│  │ CHECK   │    │  MERGE  │───▶│ UPLOAD  │───▶│  MARK   │                      │
│  │ STATE   │    │         │    │         │    │PROCESSED│                      │
│  │         │    │Segments │    │Supabase │    │         │                      │
│  │SQLite DB│    │+Speakers│    │ Storage │    │SQLite DB│                      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘                      │
│                                                                                 │
└────────────────────────────────────────────────────────────────────────────────┘
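The MERGE stage aligns Whisper segments with Pyannote speaker turns. A minimal overlap-based sketch of the idea (hypothetical names; the repo's services/merger.py may differ):

```python
def assign_speakers(segments, turns):
    """Give each transcript segment the speaker whose diarization turn
    overlaps it the most.

    segments: dicts with "start"/"end" (seconds) from transcription.
    turns: dicts with "speaker"/"start"/"end" from diarization.
    """
    def overlap(a, b):
        # Length of the intersection of two time intervals, 0 if disjoint
        return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

    merged = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        speaker = best["speaker"] if best and overlap(seg, best) > 0 else None
        merged.append({**seg, "speaker": speaker})
    return merged
```

Maximum-overlap assignment handles the common case where a segment straddles a turn boundary: the segment goes to whichever speaker held the floor longest.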

VRAM Management

The pipeline uses three GPU-intensive models that cannot all fit in VRAM at once:

┌─────────────────────────────────────────────────────────────────┐
│                    VRAM USAGE TIMELINE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  VRAM                                                            │
│   ▲                                                              │
│   │    ┌──────┐                                                  │
│  6GB   │Whisper                                                  │
│   │    │      │    ┌──────┐                                      │
│  4GB   │      │    │Pyannote                                     │
│   │    │      │    │      │    ┌──────┐                          │
│  2GB   │      │    │      │    │ECAPA │                          │
│   │    │      │    │      │    │      │                          │
│   └────┴──────┴────┴──────┴────┴──────┴────────────▶ Time       │
│        Load   Unload Load  Unload Load  Unload                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Each model is explicitly unloaded after use via torch.cuda.empty_cache().
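A sketch of the unload step between stages (utils.py provides a VRAM helper; this illustrative version may differ from it):

```python
import gc

def free_vram() -> None:
    """Reclaim GPU memory between pipeline stages.

    Call only after dropping every reference to the previous model
    (e.g. `del whisper_model`): empty_cache() cannot release memory
    that live Python objects still hold.
    """
    gc.collect()  # drop lingering references first
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # CPU-only environment: nothing to free
```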


Getting Started

Prerequisites

Requirement   Minimum            Recommended
GPU           NVIDIA 8GB VRAM    NVIDIA 16GB+ VRAM
CUDA          11.8               12.1+
RAM           16GB               32GB
Disk          20GB               50GB+
Python        3.11               3.11+

Step 1: Clone and Install

git clone <your-repo-url>
cd be-flow-dtd

# Using uv (recommended)
uv sync

# Or with pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Step 2: Obtain API Keys

You need credentials from three services:

Taddy API (Podcast Metadata)

  1. Go to https://taddy.org/developers
  2. Create an account and get your API credentials
  3. Note your API_KEY and USER_ID

HuggingFace (Pyannote Model Access)

  1. Go to https://huggingface.co/settings/tokens
  2. Create a new token with read access
  3. Important: Accept the model conditions on the pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 pages on Hugging Face, or the pipeline cannot download the diarization model

Supabase (Cloud Storage)

  1. Go to https://supabase.com and create a project
  2. Go to Settings → API to get your URL and service key
  3. Create a storage bucket named podcast-transcripts:
    -- In Supabase SQL Editor
    INSERT INTO storage.buckets (id, name, public)
    VALUES ('podcast-transcripts', 'podcast-transcripts', false);

Step 3: Configure Environment

cp .env.example .env
# Edit .env with your credentials (see .env.example for all options)

Step 4: Configure Podcasts

Edit podcasts.json to add your podcasts:

{
  "podcasts": [
    {
      "id": "lex-fridman",
      "name": "Lex Fridman Podcast",
      "taddy_id": "c5b7a123-4567-89ab-cdef-0123456789ab",
      "min_speakers": 2,
      "max_speakers": 4,
      "known_speakers": [
        {"slug": "lex-fridman", "name": "Lex Fridman", "role": "host"}
      ]
    },
    {
      "id": "huberman-lab",
      "name": "Huberman Lab",
      "taddy_id": "d6c8b234-5678-90bc-def0-1234567890bc",
      "min_speakers": 1,
      "max_speakers": 3,
      "known_speakers": [
        {"slug": "andrew-huberman", "name": "Andrew Huberman", "role": "host"}
      ]
    }
  ]
}

Each known_speakers entry requires slug, name, and role (typically "host").
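Loading and validating this config can be sketched with plain dataclasses (the repo's config.py uses Pydantic; `load_podcasts` here is a hypothetical stand-in that fails fast on missing fields):

```python
import json
from dataclasses import dataclass

@dataclass
class KnownSpeaker:
    slug: str
    name: str
    role: str

@dataclass
class PodcastConfig:
    id: str
    name: str
    taddy_id: str
    min_speakers: int
    max_speakers: int
    known_speakers: list

def load_podcasts(path: str = "podcasts.json") -> list:
    """Parse podcasts.json into typed config objects."""
    with open(path) as f:
        raw = json.load(f)
    podcasts = []
    for entry in raw["podcasts"]:
        entry = dict(entry)  # don't mutate the parsed JSON
        speakers = [KnownSpeaker(**s) for s in entry.pop("known_speakers", [])]
        podcasts.append(PodcastConfig(known_speakers=speakers, **entry))
    return podcasts
```

A typo like "know_speakers" or a missing taddy_id surfaces immediately as a TypeError instead of failing mid-pipeline.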

Finding Taddy IDs:

uv run python find_podcast.py "Podcast Name"

Step 5: Set Up Known Speakers (Optional)

To enable speaker recognition, you need 192D ECAPA-TDNN embeddings for known speakers.

Option A: Seed from pipeline output (recommended)

Run the pipeline first, then promote unknown embeddings to known:

# Run pipeline to generate unknown embeddings
uv run python main.py --podcast lex-fridman --dry-run

# Interactive CLI to assign unknowns to known speakers
uv run python scripts/seed_speakers.py lex-fridman

Option B: Manual embedding generation

from pathlib import Path

import numpy as np
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

# Load the same ECAPA-TDNN model the pipeline uses for speaker ID
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="data/models/ecapa-tdnn"
)

# ECAPA expects 16 kHz audio; resample if needed
signal, sr = torchaudio.load("speaker_sample.wav")
if sr != 16000:
    signal = torchaudio.transforms.Resample(sr, 16000)(signal)

# Produce a 192D embedding and save it where the pipeline looks for it
embedding = classifier.encode_batch(signal).squeeze().cpu().numpy()
out_path = Path("data/speakers/known/podcast-id/speaker-slug.npy")
out_path.parent.mkdir(parents=True, exist_ok=True)
np.save(out_path, embedding)

Step 6: Test the Pipeline

# Dry run - process but don't upload
uv run python main.py --dry-run --verbose

# Test single podcast
uv run python main.py --podcast lex-fridman --dry-run

# Test single episode by URL
uv run python main.py --episode-url "https://example.com/episode.mp3" --dry-run

Step 7: Run for Real

# Process all configured podcasts
uv run python main.py

# With verbose logging
uv run python main.py --verbose

# Force reprocess already-done episodes
uv run python main.py --reprocess

Docker Deployment (24/7 Service)

Quick Start

docker-compose build       # Includes all models (10-20 min)
docker-compose up -d       # Start service
docker-compose logs -f     # View logs

Running as a Scheduled Service

Option 1: Cron (Host Machine)

crontab -e
# Run every 6 hours
0 */6 * * * cd /path/to/be-flow-dtd && docker-compose run --rm dtd-pipeline

Option 2: Systemd Timer

Create /etc/systemd/system/dtd-pipeline.service:

[Unit]
Description=DTD Pipeline
After=docker.service

[Service]
Type=oneshot
WorkingDirectory=/path/to/be-flow-dtd
ExecStart=/usr/bin/docker-compose run --rm dtd-pipeline

[Install]
WantedBy=multi-user.target

Create /etc/systemd/system/dtd-pipeline.timer:

[Unit]
Description=Run DTD Pipeline every 6 hours

[Timer]
OnBootSec=15min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target

Enable the timer:

sudo systemctl enable --now dtd-pipeline.timer

Bare-Metal GPU Deployment

sudo apt install ffmpeg
nvidia-smi                 # Verify GPU
# Add to crontab: 0 */6 * * * cd /path/to/be-flow-dtd && uv run python main.py

Performance notes:

  • RTX 3090/4090 processes ~100 hours of audio per day with large-v3
  • 8GB VRAM cards: use WHISPER_MODEL=medium for reliable operation
  • Monitor with watch -n 1 nvidia-smi during initial runs

Configuration Reference

Environment Variables

Variable                  Required  Default                            Description

Taddy API
TADDY_API_KEY             Yes       -                                  API key from taddy.org
TADDY_USER_ID             Yes       -                                  User ID from taddy.org
LOOKBACK_DAYS             No        7                                  Days to look back for new episodes

HuggingFace
HF_TOKEN                  Yes       -                                  Token for pyannote model access

Whisper
WHISPER_MODEL             No        large-v3                           Model: tiny/base/small/medium/large-v2/large-v3
WHISPER_DEVICE            No        cuda                               Device: cuda/cpu
WHISPER_COMPUTE_TYPE      No        float16                            Precision: float16/int8/float32

Pyannote
PYANNOTE_MODEL            No        pyannote/speaker-diarization-3.1   Diarization model

Speaker ID
SPEAKER_MATCH_THRESHOLD   No        0.70                               Cosine similarity threshold (0.0-1.0)

Supabase
SUPABASE_URL              Yes       -                                  Project URL
SUPABASE_KEY              Yes       -                                  Service role key
SUPABASE_BUCKET           No        podcast-transcripts                Storage bucket name

Paths
DATA_DIR                  No        data                               Base data directory
AUDIO_DIR                 No        data/audio                         Temporary audio files (auto-cleaned)
SPEAKERS_DIR              No        data/speakers                      Speaker embeddings directory
LOGS_DIR                  No        data/logs                          Log files directory
STATE_DB_PATH             No        data/state.db                      SQLite database path
PODCASTS_CONFIG_PATH      No        podcasts.json                      Podcast config file

podcasts.json Schema

{
  "podcasts": [
    {
      "id": "unique-podcast-id",
      "name": "Human Readable Name",
      "taddy_id": "uuid-from-taddy",
      "min_speakers": 1,
      "max_speakers": 10,
      "known_speakers": [
        {"slug": "speaker-slug", "name": "Speaker Name", "role": "host"}
      ]
    }
  ]
}

Output JSON Schema

Uploaded to: {bucket}/{podcast-id}/{YYYY}/{MM}/{YYYY-MM-DD}-{slug}.json

{
  "episode": {
    "id": "taddy-episode-uuid",
    "podcast_id": "taddy-podcast-uuid",
    "title": "Episode Title",
    "audio_url": "https://...",
    "published_at": "2024-01-15T10:00:00Z",
    "duration_seconds": 3600,
    "description": "Episode description..."
  },
  "transcript": [
    {
      "text": "Hello and welcome to the show.",
      "start": 0.0,
      "end": 2.5,
      "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
        {"text": "and", "start": 0.5, "end": 0.6, "confidence": 0.99},
        {"text": "welcome", "start": 0.7, "end": 1.1, "confidence": 0.97}
      ],
      "speaker_id": "host-uuid",
      "speaker_name": "John Host",
      "confidence": 0.92
    }
  ],
  "speakers": [
    {
      "id": "host-uuid",
      "podcast_id": "podcast-id",
      "name": "John Host",
      "embedding_path": "data/speakers/known/podcast-id/john-host.npy"
    }
  ],
  "processing_metadata": {
    "processed_at": "2024-01-15T12:30:00Z",
    "whisper_model": "large-v3",
    "pyannote_model": "pyannote/speaker-diarization-3.1",
    "speaker_threshold": 0.70
  }
}
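Uploads land at the date-partitioned path shown above. A sketch of how the slug and path might be built (hypothetical helper names, assuming the episode's published_at supplies the date):

```python
import re
from datetime import datetime

def slugify(title: str) -> str:
    """Lowercase alphanumeric-and-hyphen slug for filenames."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def storage_path(bucket: str, podcast_id: str, published_at: str, slug: str) -> str:
    """Build the {bucket}/{podcast-id}/{YYYY}/{MM}/{YYYY-MM-DD}-{slug}.json path."""
    dt = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    return f"{bucket}/{podcast_id}/{dt:%Y}/{dt:%m}/{dt:%Y-%m-%d}-{slug}.json"
```

Date partitioning keeps bucket listings fast and makes it easy to fetch one month of transcripts downstream.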

CLI Reference

Main Pipeline

uv run python main.py [OPTIONS]
Option               Description
--podcast ID         Process only this podcast (matches id in podcasts.json)
--episode-url URL    Debug mode: process a single episode by direct audio URL
--dry-run            Process but don't upload; print JSON to stdout
--reprocess          Ignore state.db; reprocess already-completed episodes
--verbose, -v        Enable debug logging

Utility Scripts

# Find podcast UUID from Taddy API
uv run python find_podcast.py "Podcast Name"

# Seed known speakers (interactive: promote unknown → known)
uv run python scripts/seed_speakers.py <podcast_id>

# CocoIndex monitoring (requires DATABASE_URL)
uv run cocoindex server -ci cocoindex_flow.py

Examples

# Normal operation - process all podcasts
uv run python main.py

# Process only one podcast
uv run python main.py --podcast the-bitcoin-matrix

# Debug a specific episode without uploading
uv run python main.py --episode-url "https://example.com/ep.mp3" --dry-run

# Reprocess everything with verbose output
uv run python main.py --reprocess --verbose

Speaker Embeddings

Directory Structure

data/speakers/
├── known/
│   ├── the-bitcoin-matrix/
│   │   └── cedric-youngelman.npy    # 192D ECAPA-TDNN embedding
│   └── what-bitcoin-did/
│       └── danny-knowles.npy
└── unknown/
    └── the-bitcoin-matrix_unknown_abc123.npy   # Auto-generated for unmatched speakers
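Loading the known embeddings for one podcast is essentially a glob over this directory (a sketch; the helper name is hypothetical):

```python
from pathlib import Path

import numpy as np

def load_known_embeddings(podcast_id: str, base: str = "data/speakers/known") -> dict:
    """Map speaker slug -> 192D embedding for one podcast.

    The slug is the .npy filename stem, matching podcasts.json entries.
    """
    return {p.stem: np.load(p) for p in Path(base, podcast_id).glob("*.npy")}
```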

Seeding Known Speakers

After processing episodes, unknown speakers appear in data/speakers/unknown/. Use the interactive seeding script to promote them:

uv run python scripts/seed_speakers.py the-bitcoin-matrix

This displays unknown embeddings and known speaker configs, then lets you assign each unknown to a known speaker by number.
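Matching an unknown voice against known speakers comes down to cosine similarity checked against SPEAKER_MATCH_THRESHOLD (default 0.70). A sketch of the idea (hypothetical helpers; the repo's services/speaker_id.py may differ):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(embedding: np.ndarray, known: dict, threshold: float = 0.70):
    """Return (slug, score) of the best match at or above threshold,
    else (None, best_score) so the caller can save it as unknown."""
    best_slug, best_score = None, -1.0
    for slug, ref in known.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_slug, best_score = slug, score
    if best_score >= threshold:
        return best_slug, best_score
    return None, best_score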


Troubleshooting

CUDA/GPU Errors

Problem: CUDA out of memory

export WHISPER_MODEL=medium     # or small
export WHISPER_COMPUTE_TYPE=int8

Problem: RuntimeError: CUDA error

uv run python -c "import torch; print(torch.cuda.is_available())"
nvidia-smi

Authentication Errors

Error                        Solution
HF_TOKEN required            Set HF_TOKEN in .env and accept the pyannote license
Unauthorized (Taddy)         Verify TADDY_API_KEY and TADDY_USER_ID
Invalid API key (Supabase)   Use the service role key, not the anon key

Audio Download Failures

# Update yt-dlp (download sites change frequently)
uv pip install -U yt-dlp

# Test the URL directly
yt-dlp --extract-audio "URL"

State Database

# View processed episodes
sqlite3 data/state.db "SELECT * FROM processed_episodes ORDER BY processed_at DESC LIMIT 10;"

# Clear state to force reprocess (or use --reprocess flag)
sqlite3 data/state.db "DELETE FROM processed_episodes WHERE podcast_id='my-podcast';"
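The queries above imply a simple processed_episodes table. A sketch of the check-and-mark logic (column names beyond those visible in the queries are assumptions):

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS processed_episodes (
    episode_id   TEXT PRIMARY KEY,
    podcast_id   TEXT NOT NULL,
    processed_at TEXT DEFAULT CURRENT_TIMESTAMP
)"""

def is_processed(db_path: str, episode_id: str) -> bool:
    """Check state before downloading, so reruns skip finished episodes."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(SCHEMA)
        row = conn.execute(
            "SELECT 1 FROM processed_episodes WHERE episode_id = ?",
            (episode_id,),
        ).fetchone()
    return row is not None

def mark_processed(db_path: str, episode_id: str, podcast_id: str) -> None:
    """Record success only after upload, so failed runs are retried."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT OR IGNORE INTO processed_episodes (episode_id, podcast_id)"
            " VALUES (?, ?)",
            (episode_id, podcast_id),
        )
```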

Performance Tuning

Scenario         Settings
16GB+ VRAM       WHISPER_MODEL=large-v3, WHISPER_COMPUTE_TYPE=float16
8GB VRAM         WHISPER_MODEL=medium, WHISPER_COMPUTE_TYPE=float16
4GB VRAM / CPU   WHISPER_MODEL=small, WHISPER_DEVICE=cpu, WHISPER_COMPUTE_TYPE=int8

Project Structure

be-flow-dtd/
├── main.py                 # Entry point and pipeline orchestration
├── config.py               # Pydantic Config from environment variables
├── models.py               # Data models (Episode, TranscriptSegment, KnownSpeaker, etc.)
├── state.py                # SQLite state tracking (processed_episodes, speakers)
├── utils.py                # Retry decorator, slugify, VRAM unload helper
├── logging_config.py       # Structured logging with podcast context
├── cocoindex_flow.py       # CocoIndex flows for monitoring/visualization
├── find_podcast.py         # Taddy API search → podcast UUID + recent episodes
├── podcasts.json           # Podcast configuration (id, taddy_id, known_speakers)
├── scripts/
│   └── seed_speakers.py    # Interactive CLI: promote unknown → known embeddings
├── services/
│   ├── taddy.py            # Taddy GraphQL client
│   ├── downloader.py       # yt-dlp audio downloader
│   ├── transcriber.py      # faster-whisper transcription
│   ├── diarizer.py         # Pyannote 3.1 speaker diarization
│   ├── speaker_id.py       # ECAPA-TDNN 192D voiceprint matching
│   ├── merger.py           # Transcript + speaker turn alignment
│   └── storage.py          # Supabase upload (date-partitioned)
├── data/
│   ├── audio/              # Temporary audio (auto-cleaned)
│   ├── speakers/
│   │   ├── known/          # Known speaker embeddings (.npy)
│   │   └── unknown/        # Unmatched speakers (.npy)
│   ├── logs/
│   │   ├── pipeline.log    # Human-readable logs
│   │   └── errors.jsonl    # Structured errors
│   └── state.db            # SQLite tracking
├── pyproject.toml          # Python 3.11+, uv-managed dependencies
├── requirements.txt        # Pip-compatible dependencies
├── Dockerfile              # Container with baked models
├── docker-compose.yml      # GPU deployment
└── .env.example            # Environment template
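utils.py provides a retry decorator for flaky network calls (Taddy fetches, downloads, uploads). A generic sketch of such a helper (the actual signature may differ):

```python
import functools
import time

def retry(times: int = 3, delay: float = 1.0, backoff: float = 2.0):
    """Retry a callable with exponential backoff, re-raising on the last failure."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise  # out of attempts: surface the real error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return deco
```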

Related Projects

  • be-podcast-etl — Downstream belief extraction pipeline (speakers → ads → extract → abstract → embed → weights → headlines → matrix)

About

Pipeline using CocoIndex to download, transcribe, diarize, and identify speakers.
