Reddit Streaming Lakehouse

I. Overview

The Reddit Streaming Lakehouse project collects and analyzes Reddit data with a streaming pipeline.
The system ingests Reddit posts and comments, processes them through multiple data layers (Bronze -> Silver -> Gold), and stores them in a Lakehouse.
From there, the data powers trend analysis, sentiment detection, and dashboard visualization.

Context:
Reddit is one of the largest online discussion platforms, where people share opinions on events, products, and social issues.
By processing Reddit data as a stream, we can detect trending topics, measure community sentiment, and surface insights for research or business.


II. Objectives

1. Business Goals

  • Monitor social trends from popular subreddits.
  • Analyze community sentiment (Positive, Negative, Neutral).
  • Summarize and classify content from topic-specific subreddits.

2. Technical Goals

  • Extract Reddit data stored in MongoDB.
  • Send MongoDB data to Kafka topics using a custom Python Producer.
  • Process streaming data with Spark Structured Streaming.
  • Apply a fine-tuned LLM for sentiment and topic analysis.
  • Store data in Apache Iceberg on MinIO, with Hive Metastore (MySQL) for metadata.
  • Query data with Trino.
  • Build dashboards with Apache Superset.
  • Orchestrate the pipeline end-to-end with Apache Airflow.

III. Architecture

Streaming Lakehouse Pipeline:

  1. Ingestion: MongoDB -> Kafka
  2. Processing: Spark Structured Streaming (Bronze -> Silver -> Gold)
  3. Storage: Apache Iceberg on MinIO with Hive Metastore (MySQL)
  4. Query: Trino
  5. Visualization: Superset
  6. Orchestration: Apache Airflow

(Architecture diagram)


IV. Data

  • Source: Reddit API (posts & comments) from June 2026.
  • Data Layers (Medallion Architecture):
    • Bronze: raw data from Kafka.
    • Silver: cleaned and enriched data with sentiment/topic labels.
    • Gold: fact and dimension tables for BI and dashboards.

Data Warehouse Schema

(Data warehouse schema diagram)


V. Setup

1. Build and Start Environment

All services (Spark, YARN, Kafka, MinIO, Hive Metastore, Trino, Superset, Airflow) are started by a single script.
Run the following command from the project root:

source ./scripts/startServices.sh

This will:

  • Start Docker containers (Spark cluster, Kafka, MinIO, Hive Metastore, Trino, Superset, Airflow).
  • Configure networks.
  • Prepare volumes and paths.
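
Once the script finishes, you can sanity-check that the core services are listening. A minimal sketch is below; the Kafka, Superset, and Airflow ports come from this README, while the MinIO and Trino ports are assumed defaults that may differ in your docker-compose configuration:

# healthcheck.py -- hypothetical helper, not part of the repository.
import socket

services = {
    "Kafka": ("localhost", 9092),
    "MinIO": ("localhost", 9000),      # assumed default port
    "Trino": ("localhost", 8080),      # assumed default port
    "Superset": ("localhost", 8088),
    "Airflow": ("localhost", 8089),
}

for name, address in services.items():
    with socket.socket() as sock:
        sock.settimeout(2)
        status = "up" if sock.connect_ex(address) == 0 else "down"
    print(f"{name:<10} {status}")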

2. Prepare MongoDB

  1. Download the cleaned Reddit dataset (JSONL files): Data cleaned Reddit.
    At the root of the project, create a folder called data and put both JSONL files inside.
  2. Create a free MongoDB Cloud account, then create a database and a collection.
  3. Go to the folder src/dataMongoDB and update the connection string in .env to point to your MongoDB.
    Then run the upload script:
    cd ./src/dataMongoDB/
    pip install -r requirement.txt
    python upload.py
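
For reference, a minimal pymongo-based loader could look like the sketch below (the file, database, and collection names are illustrative assumptions; the real upload.py may differ):

import json
from pymongo import MongoClient

MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>/"  # your connection string
db = MongoClient(MONGO_URI)["reddit"]  # hypothetical database name

def load_jsonl(path, collection):
    # Read one JSON document per line and bulk-insert them.
    with open(path, encoding="utf-8") as f:
        docs = [json.loads(line) for line in f if line.strip()]
    if docs:
        db[collection].insert_many(docs)
    print(f"{path}: inserted {len(docs)} documents")

load_jsonl("data/submissions.jsonl", "redditSubmission")  # hypothetical file names
load_jsonl("data/comments.jsonl", "redditComment")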

VI. Usage

1. Access Airflow UI

Airflow is already included in the setup.
Open your browser: http://localhost:8089/

  • Default user: admin
  • Default password: admin

2. Run the Pipeline

In Airflow UI:

  1. Go to DAGs.
  2. Find the DAG named reddit_streaming_pipeline.
  3. Click the play (trigger) button to start a run.

(Screenshot: triggering the DAG in the Airflow UI)

This will automatically:

  • Start DFS and YARN.
  • Create Kafka topics (redditSubmission, redditComment).
  • Clear old checkpoints in MongoDB.
  • Create databases in Iceberg (Bronze, Silver, Gold).
  • Run Producer (MongoDB -> Kafka).
  • Submit Spark jobs (Bronze -> Silver -> Gold).
  • Load results for dashboards in Superset.
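
Conceptually, the DAG chains these steps much like the sketch below (task names and commands are illustrative placeholders; see the Airflow DAGs folder for the real reddit_streaming_pipeline definition):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="reddit_streaming_pipeline_sketch",  # illustrative name
    start_date=datetime(2025, 1, 1),
    schedule=None,   # triggered manually from the UI
    catchup=False,
) as dag:
    create_topics = BashOperator(task_id="create_kafka_topics", bash_command="echo create topics")
    clear_checkpoints = BashOperator(task_id="clear_checkpoints", bash_command="echo clear checkpoints")
    producer = BashOperator(task_id="run_producer", bash_command="echo run producer")
    bronze = BashOperator(task_id="spark_bronze", bash_command="echo bronze job")
    silver = BashOperator(task_id="spark_silver", bash_command="echo silver jobs")
    gold = BashOperator(task_id="spark_gold", bash_command="echo gold jobs")

    create_topics >> clear_checkpoints >> producer >> bronze >> silver >> gold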

3. Dashboard

After the DAG finishes, open Superset at: http://localhost:8088/

  • Default user: admin
  • Default password: AdminPassword123!

There you can see dashboards with:

  • Post and Comment statistics.
  • Subreddit activity by hour.
  • Community sentiment trends.
  • Top domains shared.
  • Subreddit insights (heatmap, treemap, bubble chart…).

Alternatively, you can view our dashboard here: reddit-streaming-lakehouse


VII. Step-by-Step Walkthrough

1. Create Kafka Topics

Go inside the Kafka container and run:

kafka-topics.sh --create --topic redditSubmission --bootstrap-server kafka1:9092 --replication-factor 2
kafka-topics.sh --create --topic redditComment --bootstrap-server kafka1:9092 --replication-factor 2
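
To verify the topics exist, you can also list them from Python (a sketch assuming the confluent-kafka package is available; adjust the bootstrap server to your environment):

from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka1:9092"})
topics = admin.list_topics(timeout=10).topics
for name in ("redditSubmission", "redditComment"):
    print(name, "OK" if name in topics else "MISSING")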

2. Producer – Collect Reddit Data

Inside the Confluent Kafka container, activate the Python environment, delete old checkpoints, then start the producer:

source /opt/venv/bin/activate
python scripts/delAllDocument.py
python scripts/producer.py
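
Under the hood, the producer step amounts to reading documents from MongoDB and publishing them to the matching Kafka topic. A minimal sketch assuming kafka-python and pymongo (the real scripts/producer.py may use different libraries and options):

import json
from kafka import KafkaProducer
from pymongo import MongoClient

producer = KafkaProducer(
    bootstrap_servers="kafka1:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)
db = MongoClient("<your-mongodb-uri>")["reddit"]  # hypothetical names

# Stream each MongoDB collection into its matching Kafka topic.
for collection, topic in [("redditSubmission", "redditSubmission"),
                          ("redditComment", "redditComment")]:
    for doc in db[collection].find():
        doc.pop("_id", None)  # ObjectId is not JSON-serializable
        producer.send(topic, doc)

producer.flush()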

3. Consumer – Bronze Layer

Open a Spark shell and create the databases:

spark.sql("create database spark_catalog.bronze")
spark.sql("create database spark_catalog.silver")
spark.sql("create database spark_catalog.gold")

Run Spark job:

spark-submit --py-files utils.zip mainBronze.py
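
In essence, the Bronze job reads raw messages from Kafka and appends them unchanged to an Iceberg table. A simplified sketch (the column choices and checkpoint path are illustrative; see mainBronze.py for the real logic):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bronzeRedditSubmission").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka1:9092")
       .option("subscribe", "redditSubmission")
       .option("startingOffsets", "earliest")
       .load())

# Bronze keeps the payload as-is: the raw message value plus metadata.
bronze = raw.select(col("value").cast("string").alias("json_payload"),
                    col("topic"), col("timestamp"))

query = (bronze.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/bronze_submission")
         .toTable("spark_catalog.bronze.reddit_submission"))
query.awaitTermination()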

4. Transform – Silver Layer

spark-submit --py-files transformer.zip,utils.zip mainRsSilver.py
spark-submit --py-files transformer.zip,utils.zip mainRcSilver.py
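
The Silver jobs parse the Bronze JSON, clean it, and enrich it with labels. A simplified sketch follows; the schema is illustrative, and the UDF below is only a stand-in for the fine-tuned LLM the project actually uses:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, udf
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("silverRedditSubmission").getOrCreate()

# Hypothetical subset of the submission fields.
schema = StructType([
    StructField("id", StringType()),
    StructField("subreddit", StringType()),
    StructField("title", StringType()),
    StructField("created_utc", LongType()),
])

@udf(StringType())
def sentiment(text):
    # Placeholder: the real pipeline calls a fine-tuned LLM here.
    return "Neutral" if text else None

bronze = spark.read.table("spark_catalog.bronze.reddit_submission")
silver = (bronze
          .select(from_json(col("json_payload"), schema).alias("r"))
          .select("r.*")
          .dropDuplicates(["id"])
          .withColumn("sentiment", sentiment(col("title"))))

silver.writeTo("spark_catalog.silver.reddit_submission").createOrReplace()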

5. Transform – Gold Layer

spark-submit --py-files utils.zip createDim.py
spark-submit --py-files transformer.zip,utils.zip mainDimGold.py
spark-submit --py-files transformer.zip,utils.zip mainFactPost.py
spark-submit --py-files transformer.zip,utils.zip mainFactCmt.py
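
Each Gold job derives a fact or dimension table from Silver. A sketch of a fact-table build in plain SQL (the join keys and column names are illustrative assumptions; see mainFactPost.py for the real logic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("goldFactPost").getOrCreate()

# Hypothetical fact build: join Silver posts to the time dimension.
spark.sql("""
    CREATE OR REPLACE TABLE spark_catalog.gold.factpost AS
    SELECT s.id        AS post_id,
           s.subreddit,
           s.sentiment,
           t.time_id
    FROM spark_catalog.silver.reddit_submission s
    JOIN spark_catalog.gold.dimtime t
      ON date_trunc('HOUR', timestamp_seconds(s.created_utc)) = t.ts_hour
""")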

6. Check Data

Use spark-shell:

spark-shell \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Or check the data from Superset (SQL Lab). Example queries:

SELECT * FROM bronze.reddit_submission;
SELECT * FROM silver.reddit_comment;
SELECT * FROM gold.dimtime;
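
Since Trino fronts the Lakehouse, you can also run the same queries programmatically. A sketch using the trino Python client (the host, port, user, and catalog name are assumed defaults and may differ in your setup):

import trino

conn = trino.dbapi.connect(
    host="localhost", port=8080, user="admin", catalog="iceberg",
)
cur = conn.cursor()
cur.execute(
    "SELECT sentiment, count(*) FROM silver.reddit_comment GROUP BY sentiment"
)
for row in cur.fetchall():
    print(row)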

VIII. Contact

Authors:

Email:
