Track and analyze content patterns across YouTube channels over time.
This tool helps you monitor what different groups of YouTube channels are talking about. You define "clusters" of channels (e.g., left-leaning news, right-leaning news, tech channels), and the tool:
- Fetches recent videos from those channels
- Analyzes the transcripts to extract themes, sentiment, and framing
- Tracks how these narratives evolve over time
- Compares what different clusters are focusing on
The tracker continuously monitors YouTube channels. Here's what's currently in the database:
Current Database Status (as of Jan 9, 2026):
- 2,703 total videos across all clusters
- Libs: 872 videos (from channels like Pod Save America, The Young Turks, David Pakman Show)
- Right: 908 videos (from channels like Fox News, Ben Shapiro Show, Daily Wire)
- Mainstream: 923 videos (from channels like CNN, Reuters, MSNBC, ABC News)
Latest Daily Analysis:
The most recent report includes a combined topic word cloud across all clusters, per-cluster breakdowns for the Libs, Right, and Mainstream perspectives, and a view-distribution chart for the Liberal channels.
More daily reports are available in the data/reports/ directory, with word clouds and view statistics updated regularly.
- Automated data collection from YouTube channels (quota-optimized)
- Daily reporting system - automated daily word clouds and view statistics
- Retroactive transcript search - download complete historical datasets for any channel
- AI-powered analysis using Google's Gemini to extract themes, sentiment, and framing
- Temporal tracking - see how topics evolve over weeks, months, or years
- Cross-cluster comparison - identify consensus topics vs echo chambers
- 25+ visualizations including word clouds, heatmaps, and trend charts
- Multi-year analysis - can collect years of historical data efficiently
- Incremental processing - only processes new videos on subsequent runs (100x faster)
- Parallel processing - analyze multiple videos concurrently (3-5x faster)
- Caching - never re-fetch transcripts or re-analyze the same video
# Clone the repo
git clone https://github.com/YOUR_USERNAME/vibes-tracker.git
cd vibes-tracker
# Set up virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Create a .env file in the project root:
YOUTUBE_API_KEY=your_youtube_api_key
GEMINI_API_KEY=your_gemini_api_key
Get your API keys:
- YouTube: Google Cloud Console → Enable YouTube Data API v3
- Gemini: Google AI Studio
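Inside the pipeline these keys are read from the environment; a minimal sketch of that step, assuming the project loads .env via python-dotenv (a common convention, not confirmed here):

```python
# Minimal sketch (assumption): load API keys from .env with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the process environment

YOUTUBE_API_KEY = os.environ["YOUTUBE_API_KEY"]
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
```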
Edit config/clusters.json to define your channel groups:
{
"tech": ["@mkbhd", "@linustechtips", "@verge"],
"news": ["@cnn", "@foxnews", "@bbcnews"],
"finance": ["@bloomberg", "@cnbc", "@wallstreetjournal"]
}

# Full pipeline: fetch data → analyze → visualize
python src/main.py pipeline
# Or run stages individually
python src/main.py ingest # Fetch video metadata
python src/main.py analyze # Run AI analysis
python src/main.py visualize   # Generate plots

Check figures/ for your visualizations and data/analyzed_data.csv for the results.
# Fetch video metadata from YouTube
python src/main.py ingest
# Analyze transcripts with Gemini
python src/main.py analyze
# Generate all visualizations
python src/main.py visualize
# Run temporal trend analysis
python src/main.py temporal --days-back 30
# Compare clusters
python src/main.py compare
# Run complete pipeline
python src/main.py pipeline

Generate a daily report with word clouds and view statistics for videos published on a specific date:
# Generate report for today's videos
python src/daily_report.py
# Generate report for a specific date
python src/daily_report.py --date 2025-01-10

The report includes:
- Word clouds for top content across all clusters (filtered by top 67% of views)
- Cluster-specific word clouds with custom color schemes
- View distribution charts showing individual videos as stacked segments
- All outputs saved to data/reports/YYYY-MM-DD/
Download complete historical datasets for any YouTube channel:
# Download all available transcripts from 2020-2023
python scripts/retroactive_search.py --channel @joerogan --start-year 2020 --end-year 2023
# Limit downloads per run (useful for rate limiting)
python scripts/retroactive_search.py --channel @joerogan --start-year 2020 --end-year 2023 --max-per-run 50
# Use specific date ranges
python scripts/retroactive_search.py --channel @joerogan --start-date 2020-06-01 --end-date 2021-12-31

Features:
- Smart Resume: Checks for existing transcripts and only downloads missing ones
- Resilient: Saves each transcript immediately - no data loss if interrupted
- Progress Tracking: Maintains progress in JSON file, fully resumable
- Organized Storage: Saves transcripts to data/{channel}/ with clear naming
See docs/RETROACTIVE_SEARCH.md for detailed documentation.
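The smart-resume behavior can be pictured roughly like this; a simplified sketch with a hypothetical one-file-per-video layout, not the script's actual code:

```python
# Simplified sketch of "smart resume": only fetch transcripts that aren't on disk yet.
# The per-channel directory and one-JSON-per-video layout are assumptions for illustration.
from pathlib import Path

def missing_video_ids(channel: str, candidate_ids: list[str], data_dir: str = "data") -> list[str]:
    """Return video IDs that don't yet have a saved transcript file."""
    channel_dir = Path(data_dir) / channel.lstrip("@")
    existing = {p.stem for p in channel_dir.glob("*.json")} if channel_dir.exists() else set()
    return [vid for vid in candidate_ids if vid not in existing]
```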
After the first run, use incremental mode to only process new videos:
# Only fetch and analyze new videos since last run
python src/main.py pipeline --incremental
# This is ~100x faster than re-processing everything

# Use more parallel workers for faster analysis (default: 10)
python src/main.py analyze --workers 20
# Force full refresh (re-process everything)
python src/main.py ingest --full-refresh
python src/main.py analyze --full-refresh

Collect years of data efficiently:
# Collect 3 years of monthly data (takes ~1 day, uses 4,320 API units)
python src/main.py collect-historical \
--start-year 2022 \
--end-year 2024 \
--frequency monthly
# Then run temporal analysis
python src/main.py temporal --days-back 1095   # 3 years

See docs/MULTI_YEAR_ANALYSIS_GUIDE.md for details.
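As a back-of-the-envelope check, the 4,320-unit estimate is consistent with the per-channel costs listed in the API quota notes below, assuming roughly 60 configured channels:

```python
# Back-of-the-envelope quota check (assumes ~60 channels at ~2 units per cached channel fetch).
channels = 60
units_per_channel = 2                              # cached channel IDs; first-time lookups cost ~100
units_per_snapshot = channels * units_per_channel  # ~120 units per collection period
total_units = 36 * units_per_snapshot              # 36 monthly snapshots for 2022-2024 -> ~4,320 units
periods_per_day = 10_000 // units_per_snapshot     # ~83 periods within YouTube's daily quota
print(total_units, periods_per_day)
```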
- data/cluster_data.csv - Raw video metadata
- data/analyzed_data.csv - Enriched with AI analysis
- data/historical/YYYY-MM-DD/ - Dated snapshots for temporal analysis
- data/cache/ - Cached transcripts and analysis (avoids redundant API calls)
- logs/ - Detailed logs of all runs
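The enriched CSV is a plain table, so it can be inspected directly with pandas; the column names here are illustrative assumptions, not a guaranteed schema:

```python
# Peek at the enriched dataset with pandas ('cluster' and 'sentiment' column names are assumptions).
import pandas as pd

df = pd.read_csv("data/analyzed_data.csv")
print(df.columns.tolist())
print(df.groupby("cluster")["sentiment"].value_counts())
```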
Word Clouds:
- Title word clouds (per cluster + combined)
- Theme word clouds (per cluster + combined)
Sentiment Analysis:
- Sentiment distribution by cluster
- Framing distribution (favorable/critical/neutral/alarmist)
Temporal Trends:
- Theme prevalence over time
- Sentiment evolution
- Emerging vs declining topics
Cross-Cluster Comparison:
- Similarity heatmap
- Consensus vs echo chamber topics
- Theme distribution comparison
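The word clouds are built with the standard wordcloud package; a minimal sketch of how a title cloud might be generated with the configured size and custom stopwords (illustrative, not the project's exact plotting code):

```python
# Illustrative title word cloud (not src/visualizations/word_clouds.py; 'title' column is assumed).
import pandas as pd
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("data/analyzed_data.csv")
text = " ".join(df["title"].dropna())
stopwords = STOPWORDS | {"video", "podcast", "episode"}  # custom stopwords from pipeline_config.yaml

wc = WordCloud(width=1200, height=800, stopwords=stopwords, background_color="white").generate(text)
wc.to_file("figures/title_wordcloud_example.png")
```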
For each video, the AI extracts:
- Core Themes - 3-5 main topics discussed
- Theme Categories - Political, Social, Economic, Cultural, International, Tech, Other
- Sentiment - Positive, Neutral, Negative, Mixed
- Framing - favorable, critical, neutral, alarmist
- Named Entities - Key people, organizations, events mentioned
- Summary - One-sentence takeaway
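Put together, each analyzed video ends up as a structured record along these lines (an invented example for illustration, not real pipeline output):

```python
# Invented example of one analyzed record; field names mirror the list above, values are made up.
example_analysis = {
    "core_themes": ["debt ceiling negotiations", "federal budget", "party leadership"],
    "theme_categories": ["Political", "Economic"],
    "sentiment": "Negative",
    "framing": "critical",
    "named_entities": ["Congress", "White House"],
    "summary": "The hosts argue the latest budget standoff is really a fight over party leadership.",
}
```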
Edit config/pipeline_config.yaml to customize:
ingest:
  videos_per_channel: 30        # How many recent videos to fetch

analysis:
  model: "gemini-1.5-flash"
  enable_caching: true

visualization:
  wordcloud_width: 1200
  wordcloud_height: 800
  custom_stopwords: ["video", "podcast", "episode"]
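Because the settings are plain YAML, loading them is a one-liner with PyYAML; a sketch of what a config-loader utility would roughly do (the actual helper in src/utils/config_loader.py may differ):

```python
# Rough equivalent of a config-loader utility (function name is an assumption).
import yaml

def load_pipeline_config(path: str = "config/pipeline_config.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_pipeline_config()
print(config["ingest"]["videos_per_channel"])  # 30 in the example above
```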
The pipeline has 3 main stages:
- Ingest (src/ingest.py) - Fetches video metadata from YouTube
- Analyze (src/analyze.py) - Extracts themes/sentiment using Gemini
- Visualize (src/visualize.py) - Generates plots and charts
Additional modules:
- Temporal Analysis (src/temporal_analysis.py) - Track trends over time
- Cross-Cluster Analysis (src/cross_cluster_analysis.py) - Compare clusters
- CLI (src/main.py) - Unified command-line interface
All configuration is in config/, utilities in src/utils/, and visualizations in src/visualizations/.
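The unified CLI suggests an argparse-style subcommand layout; a minimal sketch of that pattern (the real src/main.py may be organized differently):

```python
# Sketch of a subcommand-style CLI like the one src/main.py exposes (structure is an assumption).
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="vibes-tracker")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("ingest").add_argument("--full-refresh", action="store_true")
    analyze = sub.add_parser("analyze")
    analyze.add_argument("--workers", type=int, default=10)
    analyze.add_argument("--full-refresh", action="store_true")
    sub.add_parser("visualize")
    sub.add_parser("temporal").add_argument("--days-back", type=int, default=30)
    sub.add_parser("compare")
    sub.add_parser("pipeline").add_argument("--incremental", action="store_true")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # a real entry point would dispatch to the ingest/analyze/visualize modules here
```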
First run (1,000 videos):
- Data collection: ~2 minutes (YouTube API)
- Analysis: ~10 minutes with 10 parallel workers (Gemini API)
- Visualization: ~1 minute
Incremental runs (50 new videos):
- Data collection: ~10 seconds
- Analysis: ~30 seconds with caching
- Visualization: ~10 seconds
Caching benefits:
- Transcripts: Never re-fetch the same video (YouTube API savings)
- Analysis: Never re-analyze the same video (Gemini API savings)
- Expected speedup: 10-50x on repeated runs
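Conceptually, the cache is keyed by video ID, so a transcript fetch or analysis happens at most once per video; a simplified sketch of that idea with a hypothetical file layout (not the actual cache_manager implementation):

```python
# Simplified video-keyed cache: check disk before calling any API (file layout is illustrative).
import json
from pathlib import Path

CACHE_DIR = Path("data/cache")

def cached_or_compute(kind: str, video_id: str, compute):
    """Return a cached result for (kind, video_id), computing and storing it on a miss."""
    path = CACHE_DIR / kind / f"{video_id}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute(video_id)  # e.g. fetch a transcript or run the Gemini analysis
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Deleting data/cache/ therefore simply forces everything to be re-fetched and re-analyzed on the next run.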
Typical YouTube Data API usage:
- First time fetching a channel: ~100 units (need to resolve channel ID)
- Subsequent fetches: ~2 units per channel (cached channel ID)
- 60 channels daily: ~120 units
- You can fetch ~83 time periods per day
Gemini API usage:
- 1 API call per video analyzed
- Caching prevents re-analyzing the same video
- Incremental mode only analyzes new videos
- Free tier: 1,500 requests/day (check current limits)
vibes-tracker/
├── config/
│ ├── clusters.json # Your channel definitions
│ ├── pipeline_config.yaml # Pipeline settings
│ └── prompts.yaml # AI prompt templates
├── src/
│ ├── main.py # CLI entry point
│ ├── daily_report.py # Daily report generation
│ ├── ingest.py # Data collection
│ ├── analyze.py # AI analysis
│ ├── visualize.py # Visualization orchestration
│ ├── temporal_analysis.py # Temporal tracking
│ ├── cross_cluster_analysis.py # Cluster comparison
│ ├── utils/ # Utilities
│ │ ├── config_loader.py
│ │ ├── logger.py
│ │ ├── cache_manager.py
│ │ └── metadata_manager.py
│ └── visualizations/ # Plotting modules
│ ├── word_clouds.py
│ ├── temporal_plots.py
│ ├── cluster_comparison.py
│ └── sentiment_plots.py
├── scripts/
│ ├── collect_historical_data.py # Multi-year collection
│ └── retroactive_search.py # Historical transcript downloads
├── data/ # Generated data (gitignored)
├── figures/ # Generated plots (gitignored)
├── logs/ # Log files (gitignored)
└── docs/ # Documentation
Set up a cron job for automated daily updates:
# Run at 2am daily - collect new videos and generate daily report
0 2 * * * cd /path/to/vibes-tracker && source .venv/bin/activate && python src/main.py pipeline --incremental && python src/daily_report.py

This workflow:
- Fetches new videos from all configured channels
- Analyzes only new content (incremental mode)
- Generates daily word clouds and view statistics
- Saves reports to data/reports/YYYY-MM-DD/
Collect and analyze historical data:
# 1. Collect 3 years of data
python src/main.py collect-historical \
--start-year 2022 --end-year 2024 --frequency monthly
# 2. Run temporal analysis
python src/main.py temporal --days-back 1095
# 3. Generate all visualizations
python src/main.py visualize
# 4. Compare clusters
python src/main.py compare

Study a specific time period:
# Collect data around a specific event (e.g., election)
python scripts/collect_historical_data.py \
--start-date 2024-10-01 \
--end-date 2024-12-01
# Analyze the data
python src/main.py analyze
# Generate visualizations
python src/main.py visualize

Build a complete dataset for a specific channel:
# Download all available transcripts for a channel
python scripts/retroactive_search.py \
--channel @YourTargetChannel \
--start-year 2020 \
--end-year 2024 \
--max-per-run 100
# The script is fully resumable - run it multiple times if needed
# It will skip already-downloaded transcripts and continue where it left off

No transcripts available:
- Many channels disable transcripts or auto-captions
- The tool skips videos without transcripts
- This is expected behavior
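Skipping such videos usually comes down to catching the transcript library's exceptions; a sketch using youtube-transcript-api's classic interface (newer versions of the library expose a slightly different API, and the project's actual handling may differ):

```python
# Sketch: return None instead of failing when a video has no transcript.
from youtube_transcript_api import (
    YouTubeTranscriptApi,
    TranscriptsDisabled,
    NoTranscriptFound,
)

def fetch_transcript_or_none(video_id: str):
    try:
        return YouTubeTranscriptApi.get_transcript(video_id)  # classic (pre-1.0) interface
    except (TranscriptsDisabled, NoTranscriptFound):
        return None  # expected for channels without captions; the video is simply skipped
```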
API quota exceeded:
- YouTube: 10,000 units/day limit (resets at midnight PT)
- Use incremental mode to minimize API usage
- Spread historical collection across multiple days if needed
Slow analysis:
- Increase parallel workers: --workers 20
- Use caching to avoid re-analyzing
- Consider analyzing a sample instead of all videos
Cache getting large:
- Cache files are stored in data/cache/
- Safe to delete if you want to re-analyze everything
- Each transcript is ~1-10 KB, each analysis is ~1-2 KB
- Multi-Year Analysis Guide - How to collect and analyze years of data
- Retroactive Search Guide - Download complete historical datasets for any channel
- Getting Started Guide - Step-by-step setup and usage instructions
- Technical Guide - Architecture, performance tuning, and advanced usage
- Implementation Summary - Technical details of all features
- Phase 2 Test Report - Temporal analysis capabilities
- Phase 3 Test Report - Performance improvements
This is a research tool built for personal use. Feel free to fork and adapt for your own needs.
MIT License - see LICENSE file for details
Built with:
- YouTube Data API v3
- Google Gemini API
- youtube-transcript-api
- Standard Python data science stack (pandas, matplotlib, seaborn, wordcloud)