
DTD Pipeline

Download → Transcribe → Diarize

A production-ready podcast transcription pipeline that automatically fetches new episodes, transcribes them with word-level timestamps, identifies speakers, and uploads structured JSON to cloud storage.


What It Does

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DTD Pipeline Overview                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Taddy API ──▶ Download ──▶ Transcribe ──▶ Diarize ──▶ Identify ──▶ Upload │
│   (episodes)    (yt-dlp)    (Whisper)     (Pyannote)   (ECAPA)    (Supabase)│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Input: Podcast RSS feeds via Taddy API
Output: Speaker-attributed transcripts with word-level timestamps in JSON

Key Features

  • Automatic Episode Discovery — Polls Taddy API for new episodes within configurable lookback window
  • High-Quality Transcription — Uses faster-whisper with large-v3 model and word timestamps
  • Speaker Diarization — Pyannote 3.1 identifies who speaks when
  • Speaker Recognition — Matches voices against known speaker embeddings (192D ECAPA-TDNN)
  • Cloud Storage — Uploads structured JSON to Supabase with date-partitioned paths
  • State Tracking — SQLite prevents reprocessing; supports manual reprocess override
  • GPU Optimized — Sequential model loading with explicit VRAM management
  • Docker Ready — Single container with baked models for 24/7 operation
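The lookback window works as a simple publish-date filter. A minimal sketch of the idea (`within_lookback` is a hypothetical helper for illustration, not the repo's actual function):

```python
from datetime import datetime, timedelta, timezone

def within_lookback(published_at: str, lookback_days: int = 7) -> bool:
    """True if an episode's publish time falls inside the lookback window.

    published_at: ISO-8601 timestamp, e.g. '2024-01-15T10:00:00Z'.
    """
    published = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - published <= timedelta(days=lookback_days)
```

Episodes older than LOOKBACK_DAYS are skipped during discovery; state tracking then filters out anything already processed.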

Architecture

Pipeline Flow

┌────────────────────────────────────────────────────────────────────────────────┐
│                              PIPELINE STAGES                                    │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │  FETCH  │───▶│DOWNLOAD │───▶│TRANS-   │───▶│ DIARIZE │───▶│IDENTIFY │       │
│  │         │    │         │    │ CRIBE   │    │         │    │         │       │
│  │ Taddy   │    │ yt-dlp  │    │ Whisper │    │Pyannote │    │ECAPA-   │       │
│  │ GraphQL │    │         │    │ large-v3│    │  3.1    │    │TDNN     │       │
│  └─────────┘    └─────────┘    └────┬────┘    └────┬────┘    └────┬────┘       │
│       │                             │              │              │             │
│       │                             ▼              ▼              ▼             │
│       │                        ┌────────────────────────────────────┐           │
│       │                        │        VRAM MANAGEMENT             │           │
│       │                        │  (unload model after each stage)   │           │
│       │                        └────────────────────────────────────┘           │
│       │                                                                         │
│       ▼                                                                         │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐                      │
│  │ CHECK   │    │  MERGE  │───▶│ UPLOAD  │───▶│  MARK   │                      │
│  │ STATE   │    │         │    │         │    │PROCESSED│                      │
│  │         │    │Segments │    │Supabase │    │         │                      │
│  │SQLite DB│    │+Speakers│    │ Storage │    │SQLite DB│                      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘                      │
│                                                                                 │
└────────────────────────────────────────────────────────────────────────────────┘
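The MERGE stage aligns Whisper segments with Pyannote speaker turns. A minimal overlap-based sketch of the idea (hypothetical names; the repo's services/merger.py may differ):

```python
def assign_speakers(segments, turns):
    """Give each transcript segment the speaker whose diarization turn
    overlaps it the most.

    segments: dicts with "start"/"end" (seconds) from transcription.
    turns: dicts with "speaker"/"start"/"end" from diarization.
    """
    def overlap(a, b):
        # Length of the intersection of two time intervals, 0 if disjoint
        return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

    merged = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        speaker = best["speaker"] if best and overlap(seg, best) > 0 else None
        merged.append({**seg, "speaker": speaker})
    return merged
```

Maximum-overlap assignment handles the common case where a segment straddles a turn boundary: the segment goes to whichever speaker held the floor longest.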

VRAM Management

The pipeline uses three GPU-intensive models that cannot all fit in VRAM at once:

┌─────────────────────────────────────────────────────────────────┐
│                    VRAM USAGE TIMELINE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  VRAM                                                            │
│   ▲                                                              │
│   │    ┌──────┐                                                  │
│  6GB   │Whisper                                                  │
│   │    │      │    ┌──────┐                                      │
│  4GB   │      │    │Pyannote                                     │
│   │    │      │    │      │    ┌──────┐                          │
│  2GB   │      │    │      │    │ECAPA │                          │
│   │    │      │    │      │    │      │                          │
│   └────┴──────┴────┴──────┴────┴──────┴────────────▶ Time       │
│        Load   Unload Load  Unload Load  Unload                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Each model is explicitly unloaded after use via torch.cuda.empty_cache().
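A sketch of the unload step between stages (utils.py provides a VRAM helper; this illustrative version may differ from it):

```python
import gc

def free_vram() -> None:
    """Reclaim GPU memory between pipeline stages.

    Call only after dropping every reference to the previous model
    (e.g. `del whisper_model`): empty_cache() cannot release memory
    that live Python objects still hold.
    """
    gc.collect()  # drop lingering references first
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # CPU-only environment: nothing to free
```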


Getting Started

Prerequisites

Requirement   Minimum            Recommended
GPU           NVIDIA 8GB VRAM    NVIDIA 16GB+ VRAM
CUDA          11.8               12.1+
RAM           16GB               32GB
Disk          20GB               50GB+
Python        3.11               3.11+

Step 1: Clone and Install

git clone <your-repo-url>
cd be-flow-dtd

# Using uv (recommended)
uv sync

# Or with pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Step 2: Obtain API Keys

You need credentials from three services:

Taddy API (Podcast Metadata)

  1. Go to https://taddy.org/developers
  2. Create an account and get your API credentials
  3. Note your API_KEY and USER_ID

HuggingFace (Pyannote Model Access)

  1. Go to https://huggingface.co/settings/tokens
  2. Create a new token with read access
  3. Important: Accept the model conditions on the pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 pages on Hugging Face, or the pipeline cannot download the diarization model

Supabase (Cloud Storage)

  1. Go to https://supabase.com and create a project
  2. Go to Settings → API to get your URL and service key
  3. Create a storage bucket named podcast-transcripts:
    -- In Supabase SQL Editor
    INSERT INTO storage.buckets (id, name, public)
    VALUES ('podcast-transcripts', 'podcast-transcripts', false);

Step 3: Configure Environment

cp .env.example .env
# Edit .env with your credentials (see .env.example for all options)

Step 4: Configure Podcasts

Edit podcasts.json to add your podcasts:

{
  "podcasts": [
    {
      "id": "lex-fridman",
      "name": "Lex Fridman Podcast",
      "taddy_id": "c5b7a123-4567-89ab-cdef-0123456789ab",
      "min_speakers": 2,
      "max_speakers": 4,
      "known_speakers": [
        {"slug": "lex-fridman", "name": "Lex Fridman", "role": "host"}
      ]
    },
    {
      "id": "huberman-lab",
      "name": "Huberman Lab",
      "taddy_id": "d6c8b234-5678-90bc-def0-1234567890bc",
      "min_speakers": 1,
      "max_speakers": 3,
      "known_speakers": [
        {"slug": "andrew-huberman", "name": "Andrew Huberman", "role": "host"}
      ]
    }
  ]
}

Each known_speakers entry requires slug, name, and role (typically "host").
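Loading and validating this config can be sketched with plain dataclasses (the repo's config.py uses Pydantic; `load_podcasts` here is a hypothetical stand-in that fails fast on missing fields):

```python
import json
from dataclasses import dataclass

@dataclass
class KnownSpeaker:
    slug: str
    name: str
    role: str

@dataclass
class PodcastConfig:
    id: str
    name: str
    taddy_id: str
    min_speakers: int
    max_speakers: int
    known_speakers: list

def load_podcasts(path: str = "podcasts.json") -> list:
    """Parse podcasts.json into typed config objects."""
    with open(path) as f:
        raw = json.load(f)
    podcasts = []
    for entry in raw["podcasts"]:
        entry = dict(entry)  # don't mutate the parsed JSON
        speakers = [KnownSpeaker(**s) for s in entry.pop("known_speakers", [])]
        podcasts.append(PodcastConfig(known_speakers=speakers, **entry))
    return podcasts
```

A typo like "know_speakers" or a missing taddy_id surfaces immediately as a TypeError instead of failing mid-pipeline.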

Finding Taddy IDs:

uv run python find_podcast.py "Podcast Name"

Step 5: Set Up Known Speakers (Optional)

To enable speaker recognition, you need 192D ECAPA-TDNN embeddings for known speakers.

Option A: Seed from pipeline output (recommended)

Run the pipeline first, then promote unknown embeddings to known:

# Run pipeline to generate unknown embeddings
uv run python main.py --podcast lex-fridman --dry-run

# Interactive CLI to assign unknowns to known speakers
uv run python scripts/seed_speakers.py lex-fridman

Option B: Manual embedding generation

from pathlib import Path

import numpy as np
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

# Load the same ECAPA-TDNN model the pipeline uses for speaker ID
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="data/models/ecapa-tdnn"
)

# ECAPA expects 16 kHz audio; resample if needed
signal, sr = torchaudio.load("speaker_sample.wav")
if sr != 16000:
    signal = torchaudio.transforms.Resample(sr, 16000)(signal)

# Produce a 192D embedding and save it where the pipeline looks for it
embedding = classifier.encode_batch(signal).squeeze().cpu().numpy()
out_path = Path("data/speakers/known/podcast-id/speaker-slug.npy")
out_path.parent.mkdir(parents=True, exist_ok=True)
np.save(out_path, embedding)

Step 6: Test the Pipeline

# Dry run - process but don't upload
uv run python main.py --dry-run --verbose

# Test single podcast
uv run python main.py --podcast lex-fridman --dry-run

# Test single episode by URL
uv run python main.py --episode-url "https://example.com/episode.mp3" --dry-run

Step 7: Run for Real

# Process all configured podcasts
uv run python main.py

# With verbose logging
uv run python main.py --verbose

# Force reprocess already-done episodes
uv run python main.py --reprocess

Docker Deployment (24/7 Service)

Quick Start

docker-compose build       # Includes all models (10-20 min)
docker-compose up -d       # Start service
docker-compose logs -f     # View logs

Running as a Scheduled Service

Option 1: Cron (Host Machine)

crontab -e
# Run every 6 hours
0 */6 * * * cd /path/to/be-flow-dtd && docker-compose run --rm dtd-pipeline

Option 2: Systemd Timer

Create /etc/systemd/system/dtd-pipeline.service:

[Unit]
Description=DTD Pipeline
After=docker.service

[Service]
Type=oneshot
WorkingDirectory=/path/to/be-flow-dtd
ExecStart=/usr/bin/docker-compose run --rm dtd-pipeline

[Install]
WantedBy=multi-user.target

Create /etc/systemd/system/dtd-pipeline.timer:

[Unit]
Description=Run DTD Pipeline every 6 hours

[Timer]
OnBootSec=15min
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target

Enable the timer:

sudo systemctl enable --now dtd-pipeline.timer

Bare-Metal GPU Deployment

sudo apt install ffmpeg
nvidia-smi                 # Verify GPU
# Add to crontab: 0 */6 * * * cd /path/to/be-flow-dtd && uv run python main.py

Performance notes:

  • RTX 3090/4090 processes ~100 hours of audio per day with large-v3
  • 8GB VRAM cards: use WHISPER_MODEL=medium for reliable operation
  • Monitor with watch -n 1 nvidia-smi during initial runs

Configuration Reference

Environment Variables

Variable                  Required  Default                            Description

Taddy API
TADDY_API_KEY             Yes       -                                  API key from taddy.org
TADDY_USER_ID             Yes       -                                  User ID from taddy.org
LOOKBACK_DAYS             No        7                                  Days to look back for new episodes

HuggingFace
HF_TOKEN                  Yes       -                                  Token for pyannote model access

Whisper
WHISPER_MODEL             No        large-v3                           Model: tiny/base/small/medium/large-v2/large-v3
WHISPER_DEVICE            No        cuda                               Device: cuda/cpu
WHISPER_COMPUTE_TYPE      No        float16                            Precision: float16/int8/float32

Pyannote
PYANNOTE_MODEL            No        pyannote/speaker-diarization-3.1   Diarization model

Speaker ID
SPEAKER_MATCH_THRESHOLD   No        0.70                               Cosine similarity threshold (0.0-1.0)

Supabase
SUPABASE_URL              Yes       -                                  Project URL
SUPABASE_KEY              Yes       -                                  Service role key
SUPABASE_BUCKET           No        podcast-transcripts                Storage bucket name

Paths
DATA_DIR                  No        data                               Base data directory
AUDIO_DIR                 No        data/audio                         Temporary audio files (auto-cleaned)
SPEAKERS_DIR              No        data/speakers                      Speaker embeddings directory
LOGS_DIR                  No        data/logs                          Log files directory
STATE_DB_PATH             No        data/state.db                      SQLite database path
PODCASTS_CONFIG_PATH      No        podcasts.json                      Podcast config file

podcasts.json Schema

{
  "podcasts": [
    {
      "id": "unique-podcast-id",
      "name": "Human Readable Name",
      "taddy_id": "uuid-from-taddy",
      "min_speakers": 1,
      "max_speakers": 10,
      "known_speakers": [
        {"slug": "speaker-slug", "name": "Speaker Name", "role": "host"}
      ]
    }
  ]
}

Output JSON Schema

Uploaded to: {bucket}/{podcast-id}/{YYYY}/{MM}/{YYYY-MM-DD}-{slug}.json

{
  "episode": {
    "id": "taddy-episode-uuid",
    "podcast_id": "taddy-podcast-uuid",
    "title": "Episode Title",
    "audio_url": "https://...",
    "published_at": "2024-01-15T10:00:00Z",
    "duration_seconds": 3600,
    "description": "Episode description..."
  },
  "transcript": [
    {
      "text": "Hello and welcome to the show.",
      "start": 0.0,
      "end": 2.5,
      "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
        {"text": "and", "start": 0.5, "end": 0.6, "confidence": 0.99},
        {"text": "welcome", "start": 0.7, "end": 1.1, "confidence": 0.97}
      ],
      "speaker_id": "host-uuid",
      "speaker_name": "John Host",
      "confidence": 0.92
    }
  ],
  "speakers": [
    {
      "id": "host-uuid",
      "podcast_id": "podcast-id",
      "name": "John Host",
      "embedding_path": "data/speakers/known/podcast-id/john-host.npy"
    }
  ],
  "processing_metadata": {
    "processed_at": "2024-01-15T12:30:00Z",
    "whisper_model": "large-v3",
    "pyannote_model": "pyannote/speaker-diarization-3.1",
    "speaker_threshold": 0.70
  }
}
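Uploads land at the date-partitioned path shown above. A sketch of how the slug and path might be built (hypothetical helper names, assuming the episode's published_at supplies the date):

```python
import re
from datetime import datetime

def slugify(title: str) -> str:
    """Lowercase alphanumeric-and-hyphen slug for filenames."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def storage_path(bucket: str, podcast_id: str, published_at: str, slug: str) -> str:
    """Build the {bucket}/{podcast-id}/{YYYY}/{MM}/{YYYY-MM-DD}-{slug}.json path."""
    dt = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    return f"{bucket}/{podcast_id}/{dt:%Y}/{dt:%m}/{dt:%Y-%m-%d}-{slug}.json"
```

Date partitioning keeps bucket listings fast and makes it easy to fetch one month of transcripts downstream.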

CLI Reference

Main Pipeline

uv run python main.py [OPTIONS]
Option               Description
--podcast ID         Process only this podcast (matches id in podcasts.json)
--episode-url URL    Debug mode: process a single episode by direct audio URL
--dry-run            Process but don't upload; print JSON to stdout
--reprocess          Ignore state.db; reprocess already-completed episodes
--verbose, -v        Enable debug logging

Utility Scripts

# Find podcast UUID from Taddy API
uv run python find_podcast.py "Podcast Name"

# Seed known speakers (interactive: promote unknown → known)
uv run python scripts/seed_speakers.py <podcast_id>

# CocoIndex monitoring (requires DATABASE_URL)
uv run cocoindex server -ci cocoindex_flow.py

Examples

# Normal operation - process all podcasts
uv run python main.py

# Process only one podcast
uv run python main.py --podcast the-bitcoin-matrix

# Debug a specific episode without uploading
uv run python main.py --episode-url "https://example.com/ep.mp3" --dry-run

# Reprocess everything with verbose output
uv run python main.py --reprocess --verbose

Speaker Embeddings

Directory Structure

data/speakers/
├── known/
│   ├── the-bitcoin-matrix/
│   │   └── cedric-youngelman.npy    # 192D ECAPA-TDNN embedding
│   └── what-bitcoin-did/
│       └── danny-knowles.npy
└── unknown/
    └── the-bitcoin-matrix_unknown_abc123.npy   # Auto-generated for unmatched speakers
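Loading the known embeddings for one podcast is essentially a glob over this directory (a sketch; the helper name is hypothetical):

```python
from pathlib import Path

import numpy as np

def load_known_embeddings(podcast_id: str, base: str = "data/speakers/known") -> dict:
    """Map speaker slug -> 192D embedding for one podcast.

    The slug is the .npy filename stem, matching podcasts.json entries.
    """
    return {p.stem: np.load(p) for p in Path(base, podcast_id).glob("*.npy")}
```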

Seeding Known Speakers

After processing episodes, unknown speakers appear in data/speakers/unknown/. Use the interactive seeding script to promote them:

uv run python scripts/seed_speakers.py the-bitcoin-matrix

This displays unknown embeddings and known speaker configs, then lets you assign each unknown to a known speaker by number.
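Matching an unknown voice against known speakers comes down to cosine similarity checked against SPEAKER_MATCH_THRESHOLD (default 0.70). A sketch of the idea (hypothetical helpers; the repo's services/speaker_id.py may differ):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(embedding: np.ndarray, known: dict, threshold: float = 0.70):
    """Return (slug, score) of the best match at or above threshold,
    else (None, best_score) so the caller can save it as unknown."""
    best_slug, best_score = None, -1.0
    for slug, ref in known.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_slug, best_score = slug, score
    if best_score >= threshold:
        return best_slug, best_score
    return None, best_score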


Troubleshooting

CUDA/GPU Errors

Problem: CUDA out of memory

export WHISPER_MODEL=medium     # or small
export WHISPER_COMPUTE_TYPE=int8

Problem: RuntimeError: CUDA error

uv run python -c "import torch; print(torch.cuda.is_available())"
nvidia-smi

Authentication Errors

Error                        Solution
HF_TOKEN required            Set HF_TOKEN in .env and accept the pyannote license
Unauthorized (Taddy)         Verify TADDY_API_KEY and TADDY_USER_ID
Invalid API key (Supabase)   Use the service role key, not the anon key

Audio Download Failures

# Update yt-dlp (download sites change frequently)
uv pip install -U yt-dlp

# Test the URL directly
yt-dlp --extract-audio "URL"

State Database

# View processed episodes
sqlite3 data/state.db "SELECT * FROM processed_episodes ORDER BY processed_at DESC LIMIT 10;"

# Clear state to force reprocess (or use --reprocess flag)
sqlite3 data/state.db "DELETE FROM processed_episodes WHERE podcast_id='my-podcast';"
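The queries above imply a simple processed_episodes table. A sketch of the check-and-mark logic (column names beyond those visible in the queries are assumptions):

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS processed_episodes (
    episode_id   TEXT PRIMARY KEY,
    podcast_id   TEXT NOT NULL,
    processed_at TEXT DEFAULT CURRENT_TIMESTAMP
)"""

def is_processed(db_path: str, episode_id: str) -> bool:
    """Check state before downloading, so reruns skip finished episodes."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(SCHEMA)
        row = conn.execute(
            "SELECT 1 FROM processed_episodes WHERE episode_id = ?",
            (episode_id,),
        ).fetchone()
    return row is not None

def mark_processed(db_path: str, episode_id: str, podcast_id: str) -> None:
    """Record success only after upload, so failed runs are retried."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT OR IGNORE INTO processed_episodes (episode_id, podcast_id)"
            " VALUES (?, ?)",
            (episode_id, podcast_id),
        )
```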

Performance Tuning

Scenario         Settings
16GB+ VRAM       WHISPER_MODEL=large-v3, WHISPER_COMPUTE_TYPE=float16
8GB VRAM         WHISPER_MODEL=medium, WHISPER_COMPUTE_TYPE=float16
4GB VRAM / CPU   WHISPER_MODEL=small, WHISPER_DEVICE=cpu, WHISPER_COMPUTE_TYPE=int8

Project Structure

be-flow-dtd/
├── main.py                 # Entry point and pipeline orchestration
├── config.py               # Pydantic Config from environment variables
├── models.py               # Data models (Episode, TranscriptSegment, KnownSpeaker, etc.)
├── state.py                # SQLite state tracking (processed_episodes, speakers)
├── utils.py                # Retry decorator, slugify, VRAM unload helper
├── logging_config.py       # Structured logging with podcast context
├── cocoindex_flow.py       # CocoIndex flows for monitoring/visualization
├── find_podcast.py         # Taddy API search → podcast UUID + recent episodes
├── podcasts.json           # Podcast configuration (id, taddy_id, known_speakers)
├── scripts/
│   └── seed_speakers.py    # Interactive CLI: promote unknown → known embeddings
├── services/
│   ├── taddy.py            # Taddy GraphQL client
│   ├── downloader.py       # yt-dlp audio downloader
│   ├── transcriber.py      # faster-whisper transcription
│   ├── diarizer.py         # Pyannote 3.1 speaker diarization
│   ├── speaker_id.py       # ECAPA-TDNN 192D voiceprint matching
│   ├── merger.py           # Transcript + speaker turn alignment
│   └── storage.py          # Supabase upload (date-partitioned)
├── data/
│   ├── audio/              # Temporary audio (auto-cleaned)
│   ├── speakers/
│   │   ├── known/          # Known speaker embeddings (.npy)
│   │   └── unknown/        # Unmatched speakers (.npy)
│   ├── logs/
│   │   ├── pipeline.log    # Human-readable logs
│   │   └── errors.jsonl    # Structured errors
│   └── state.db            # SQLite tracking
├── pyproject.toml          # Python 3.11+, uv-managed dependencies
├── requirements.txt        # Pip-compatible dependencies
├── Dockerfile              # Container with baked models
├── docker-compose.yml      # GPU deployment
└── .env.example            # Environment template
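utils.py provides a retry decorator for flaky network calls (Taddy fetches, downloads, uploads). A generic sketch of such a helper (the actual signature may differ):

```python
import functools
import time

def retry(times: int = 3, delay: float = 1.0, backoff: float = 2.0):
    """Retry a callable with exponential backoff, re-raising on the last failure."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise  # out of attempts: surface the real error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return deco
```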

Related Projects

  • be-podcast-etl — Downstream belief extraction pipeline (speakers → ads → extract → abstract → embed → weights → headlines → matrix)

About

Pipeline using CocoIndex to download, transcribe, diarize, and identify speakers.
