RETL (Reddit ETL)

RETL is a fast, memory‑aware, streaming ETL toolkit for working with the Reddit monthly RC (comments) and RS (submissions) corpora. It’s designed to scan large .zst JSONL drops efficiently, filter with an intuitive query builder, and export results for analysis — all with pragmatic attention to parallelism, backpressure, and Windows‑friendly file operations.

TL;DR: Point RETL at a base directory containing comments/ and submissions/ folders of RC_YYYY-MM.zst / RS_YYYY-MM.zst files, build a query, then extract or analyze in one pass.


Features

  • 🚀 Streaming, single‑pass processing of .zst JSONL monthly dumps (RC/RS)
  • 🧠 Intuitive query builder (DSL): subreddits, authors (allow/deny), regex, keywords, URL presence, domains, score thresholds
  • 🧰 Exports:
    • JSONL / JSON array (stitched)
    • Partitioned per source (RC/RS) as JSONL or ZST
  • 📈 Analytics helpers: count_by_month(), per‑author counts, “first seen” index
  • 👪 Parent pipeline: collect parent IDs → resolve content → attach parent payloads back to records
  • 🧪 Integrity checks for corrupted monthly files (quick or full)
  • 🧵 Parallel, backpressure‑aware I/O with file concurrency caps and cooperative throttling
  • 🪟 Windows‑friendly I/O with robust retry/backoff on transient errors

Table of Contents

  • Data Layout
  • Install
  • Quick Start
  • Usage Examples
  • Performance & Tuning
  • Environment Aids
  • License
  • Project Goals


Data Layout

RETL expects a base directory with two subfolders:

<base_dir>/
  comments/
    RC_YYYY-MM.zst
    RC_YYYY-MM.zst
    ...
  submissions/
    RS_YYYY-MM.zst
    RS_YYYY-MM.zst
    ...

File name patterns are enforced at discovery time:

  • Comments: ^RC_\d{4}-\d{2}\.zst$
  • Submissions: ^RS_\d{4}-\d{2}\.zst$
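
If you want to sanity-check a download before pointing RETL at it, here is a minimal sketch of the same check (using the regex crate; the helper name is ours, not part of the RETL API):

use regex::Regex;

/// True if `name` matches the RC/RS monthly dump patterns above.
fn is_monthly_dump(name: &str) -> bool {
    let pat = Regex::new(r"^R[CS]_\d{4}-\d{2}\.zst$").unwrap();
    pat.is_match(name)
}

assert!(is_monthly_dump("RC_2016-01.zst"));
assert!(!is_monthly_dump("RC_2016-1.zst")); // month must be two digits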

You can browse and download monthly dumps via Academic Torrents: https://academictorrents.com/browse.php?search=reddit

Note on scale & performance: The monthly files become very large in later years. It’s normal for broader queries to run longer and consume significant I/O. RETL’s streaming design and throttling aim to keep resource use predictable; tune .file_concurrency(n), .parallelism(n), and .io_buffers(...) as appropriate for your hardware and dataset size.

Install

As a library (recommended)

Add RETL to your Cargo.toml via Git (replace the URL with your repo if needed):

[dependencies]
retl = { git = "https://github.com/sjlynch/retl", branch = "main" }

This repository currently sets publish = false in Cargo.toml, so installing from crates.io is not expected.

Build the demo binary

This repo also includes a small example binary in src/main.rs:

cargo build --release
./target/release/retl

The binary demonstrates author collection across a small date window, and is intentionally minimal — the library API is where RETL shines.


Quick Start

Place your corpus under ./data/:

./data/comments/RC_2006-01.zst
./data/submissions/RS_2006-01.zst
...

Create a small driver:

use anyhow::Result;
use retl::{RedditETL, Sources, YearMonth};

fn main() -> Result<()> {
    let counts = RedditETL::new()
        .base_dir("./data")
        .sources(Sources::Both)
        .date_range(Some(YearMonth::new(2006, 1)), Some(YearMonth::new(2006, 1)))
        .progress(true)
        .scan()
        .subreddit("programming")
        .keywords_any(["rust"])
        .contains_url(true)
        .count_by_month()?; // {"2006-01": N}

    for (ym, n) in counts {
        println!("{ym}\t{n}");
    }
    Ok(())
}

Run:

cargo run --release

Usage Examples

All examples below operate on the same API you saw in the Quick Start.

Extract to JSONL

use retl::{RedditETL, Sources, YearMonth};
use std::path::Path;

RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2016, 1)), Some(YearMonth::new(2016, 3)))
    .progress(true)
    .scan()
    .subreddit("askscience")
    .whitelist_fields([
        "author","body","created_utc","subreddit",
        "parent_id","link_id","id","score",
    ])
    .timestamps_human_readable(true)
    .extract_to_jsonl(Path::new("askscience_comments_q1_2016_minimal.jsonl"))?;

Partitioned Export (JSONL/ZST)

use retl::{ExportFormat, RedditETL, Sources, YearMonth};

RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2016, 1)), Some(YearMonth::new(2016, 1)))
    .progress(true)
    .scan()
    .subreddit("programming")
    .allow_pseudo_users() // include "[deleted]"
    .export_partitioned(std::path::Path::new("out_corpus_zst"), ExportFormat::Zst)?;

Outputs:

out_corpus_zst/comments/RC_2016-01.zst
out_corpus_zst/submissions/RS_2016-01.zst

Count by Month

use retl::{RedditETL, Sources, YearMonth};

let counts = RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2016, 1)), Some(YearMonth::new(2016, 12)))
    .progress(false)
    .scan()
    .subreddit("worldnews")
    .keywords_any(["election","vote","ballot"])
    .contains_url(true)
    .min_score(10)
    .count_by_month()?;

Usernames with Filters

use retl::{RedditETL, Sources, YearMonth};

let mut it = RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2006, 1)), Some(YearMonth::new(2006, 1)))
    .progress(false)
    .scan()
    .subreddit("programming")
    .keywords_any(["rust"])
    .contains_url(true)
    .min_score(2)
    .usernames()?;

while let Some(u) = it.next() {
    println!("{}", u);
}

Author Analytics (TSV)

Produce a TSV of total records per author:

use retl::{RedditETL, Sources, YearMonth};

RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2006, 1)), Some(YearMonth::new(2006, 1)))
    .progress(false)
    .scan()
    .subreddit("programming")
    .author_counts_to_tsv(std::path::Path::new("author_counts.tsv"))?;

And the earliest “first seen” timestamp per author:

RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2006, 1)), Some(YearMonth::new(2006, 1)))
    .progress(false)
    .scan()
    .subreddit("programming")
    .build_first_seen_index_to_tsv(std::path::Path::new("first_seen.tsv"))?;

Parents Pipeline (Attach Parent Content)

Collect parent IDs from your spooled JSONL, resolve parent contents by scanning the corpus, then attach parents back onto your records:

use retl::{ParentIds, ParentMaps, RedditETL, Sources, YearMonth};
use std::path::Path;

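// Step 1: Spool matching records into monthly JSONL parts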
let (spool_parts, _n) = RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2006, 1)), Some(YearMonth::new(2006, 1)))
    .progress(true)
    .scan()
    .subreddit("programming")
    .allow_pseudo_users()
    .extract_spool_monthly(Path::new("spool"), /*resume=*/false)?;

// Step 2: Collect parent IDs
let ids: ParentIds = RedditETL::new()
    .base_dir("./data")
    .progress(true)
    .collect_parent_ids_from_jsonls(spool_parts.clone())?;

// Step 3: Resolve to cache over ±1 month window
let parents: ParentMaps = RedditETL::new()
    .base_dir("./data")
    .date_range(Some(YearMonth::new(2005, 12)), Some(YearMonth::new(2006, 2)))
    .progress(true)
    .resolve_parent_maps(&ids, Path::new("parents_cache"), /*resume=*/true)?;

// Step 4: Attach parent payloads
let _out_paths = RedditETL::new()
    .base_dir("./data")
    .progress(true)
    .attach_parents_jsonls_parallel(spool_parts, Path::new("spool_with_parents"), &parents, /*resume=*/false)?;

Each comment will receive a "parent" object containing either the parent comment’s body (t1_...) or the submission’s title/selftext (t3_...).
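
To spot-check the output, a minimal sketch that reads the parts back (this assumes the attach step returns the paths of the JSONL files it wrote, as the _out_paths binding suggests, and that serde_json is a dependency):

use std::fs::File;
use std::io::{BufRead, BufReader};

for path in &_out_paths {
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        let rec: serde_json::Value = serde_json::from_str(&line?)?;
        if let Some(parent) = rec.get("parent") {
            // t1_* parents carry "body"; t3_* parents carry "title"/"selftext".
            println!("{} -> {}", rec["id"], parent);
        }
    }
}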

Integrity Checks

Quick sampling (fast) and full decode (slow but thorough):

use retl::{IntegrityMode, RedditETL, Sources, YearMonth};

let bad_quick = RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Comments)
    .date_range(Some(YearMonth::new(2006, 2)), Some(YearMonth::new(2006, 2)))
    .progress(false)
    .check_corpus_integrity(IntegrityMode::Quick { sample_bytes: 64 * 1024 })?;

let bad_full = RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Comments)
    .date_range(Some(YearMonth::new(2006, 2)), Some(YearMonth::new(2006, 2)))
    .progress(false)
    .check_corpus_integrity(IntegrityMode::Full)?;
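
A sketch of acting on the results, assuming each check returns the collection of offending file paths (as the variable names suggest):

// Report anything flagged by either pass so it can be re-fetched.
for path in bad_quick.iter().chain(bad_full.iter()) {
    eprintln!("possibly corrupted: {}", path.display());
}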

Performance & Tuning

  • Parallelism: set Rayon threads based on your hardware:
    .parallelism(24)
  • File concurrency: limit how many monthly files are decoded at once (helps with RAM/IO pressure):
    .file_concurrency(4)
  • Buffers: tune read/write buffers if you’re IO‑bound:
    .io_buffers(256 * 1024, 256 * 1024)
  • Work directory: point scratch space to fast storage:
    .work_dir("/mnt/nvme/etl_tmp")
  • Progress: toggle progress bars and custom labels:
    .progress(true).progress_label("Counting")

Under memory pressure, RETL adaptively throttles certain stages to keep resource usage bounded.
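
Putting the knobs together in one pipeline (values are illustrative starting points, not recommendations):

use retl::{RedditETL, Sources, YearMonth};

let counts = RedditETL::new()
    .base_dir("./data")
    .parallelism(24)                    // Rayon worker threads
    .file_concurrency(4)                // monthly files decoded at once
    .io_buffers(256 * 1024, 256 * 1024) // read/write buffer sizes
    .work_dir("/mnt/nvme/etl_tmp")      // scratch space on fast storage
    .sources(Sources::Both)
    .date_range(Some(YearMonth::new(2016, 1)), Some(YearMonth::new(2016, 12)))
    .progress(true)
    .progress_label("Counting")
    .scan()
    .subreddit("programming")
    .count_by_month()?;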


Environment Aids

  • Logging: RETL respects RUST_LOG, e.g.:
    RUST_LOG=info cargo run --release
  • Exclude bots: RETL ships a conservative default exclusion list, which you can extend via an environment variable or a file:
    • ETL_EXCLUDE_AUTHORS="bot_a, bot_b, service_c"
    • ETL_EXCLUDE_AUTHORS_FILE=/path/to/extra_exclusions.txt
    Then enable the filter on your query:
    .exclude_common_bots()
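
A short end-to-end sketch (the bot names and file path are placeholders; export the variables in your shell before running):

// ETL_EXCLUDE_AUTHORS="bot_a,bot_b" ETL_EXCLUDE_AUTHORS_FILE=./more_bots.txt \
//   RUST_LOG=info cargo run --release

use retl::{RedditETL, Sources, YearMonth};

let counts = RedditETL::new()
    .base_dir("./data")
    .sources(Sources::Comments)
    .date_range(Some(YearMonth::new(2016, 1)), Some(YearMonth::new(2016, 1)))
    .progress(false)
    .scan()
    .subreddit("programming")
    .exclude_common_bots() // default list plus env/file extensions
    .count_by_month()?;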

License

MIT. See LICENSE for details.


Project Goals

  • Provide an intuitive, composable query system that feels like a fluent DSL
  • Keep memory profile predictable with adaptive buffering and backpressure
  • Make common workflows boringly easy (extract, export, count, attach parents)
  • Stay robust on Windows and networked filesystems via retry/backoff I/O
