The Reddit Streaming Lakehouse project collects and analyzes Reddit data with a streaming pipeline.
The system ingests Reddit posts and comments, processes them through multiple data layers (Bronze -> Silver -> Gold), and stores them in a Lakehouse.
From there, the data is used for trend analysis, sentiment detection, and dashboard visualization.
Context:
Reddit is one of the largest online discussion platforms. People share opinions on events, products, and social issues.
By processing Reddit data as a stream, we can detect trending topics, measure community sentiment, and provide insights for research or business.
Goals:
- Monitor social trends from popular subreddits.
- Analyze community sentiment (Positive, Negative, Neutral).
- Summarize and classify content from topic-specific subreddits.
Main features:
- Extract Reddit data stored in MongoDB.
- Send MongoDB data to Kafka topics using a custom Python producer (a minimal sketch follows this list).
- Process streaming data with Spark Structured Streaming.
- Apply a fine-tuned LLM for sentiment and topic analysis.
- Store data in Apache Iceberg on MinIO, with Hive Metastore (MySQL) for metadata.
- Query data with Trino.
- Build dashboards with Apache Superset.
- Orchestrate the pipeline end-to-end with Apache Airflow.
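The ingestion step boils down to reading documents from MongoDB and publishing them to a Kafka topic. The sketch below illustrates the idea, assuming pymongo and kafka-python; the connection string, database/collection names, and broker address are placeholders, and the project's actual producer script may be organized differently.

```python
# Minimal MongoDB -> Kafka producer sketch (illustrative; not the project's
# scripts/producer.py). Connection string, db/collection names, and broker
# address are placeholders.
import json

from kafka import KafkaProducer
from pymongo import MongoClient

MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>/"      # placeholder URI
TOPIC = "redditSubmission"                                    # topic used by the pipeline

collection = MongoClient(MONGO_URI)["reddit"]["submissions"]  # assumed db/collection names

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # adjust to your broker
    value_serializer=lambda doc: json.dumps(doc, default=str).encode("utf-8"),
)

# Publish every stored document to the Kafka topic.
for doc in collection.find({}):
    producer.send(TOPIC, value=doc)

producer.flush()
producer.close()
```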
Streaming Lakehouse Pipeline:
- Ingestion: MongoDB -> Kafka
- Processing: Spark Structured Streaming (Bronze -> Silver -> Gold)
- Storage: Apache Iceberg on MinIO with Hive Metastore (MySQL)
- Query: Trino
- Visualization: Superset
- Orchestration: Apache Airflow
- Source: Reddit API (posts & comments) from June 2026.
- Data Layers (Medallion Architecture):
  - Bronze: raw data from Kafka.
  - Silver: cleaned and enriched data with sentiment/topic labels (see the sketch below).
  - Gold: fact and dimension tables for BI and dashboards.
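The Silver layer's sentiment labels come from a text-classification model. The project uses its own fine-tuned LLM; the sketch below uses a public Hugging Face model as a stand-in, so treat the model name as an assumption rather than the model actually shipped with the pipeline.

```python
# Sketch of the Silver-layer sentiment enrichment. The model below is a public
# stand-in, not the project's fine-tuned LLM.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def label_sentiment(text: str) -> str:
    """Return Positive, Negative, or Neutral for a Reddit post or comment."""
    result = sentiment(text, truncation=True)[0]   # truncate long texts to the model limit
    return result["label"].capitalize()            # "positive" -> "Positive", etc.

print(label_sentiment("This new release is fantastic, the team did a great job!"))
```

In the Spark jobs, a function like this would be wrapped in a UDF and applied to the streaming DataFrame before writing to the Silver tables.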
Data Warehouse Schema
All services (Spark, YARN, Kafka, MinIO, Hive Metastore, Trino, Superset, Airflow) are started with a single script. Run the following command in the root folder:
```
source ./scripts/startServices.sh
```
This will:
- Start Docker containers (Spark cluster, Kafka, MinIO, Hive Metastore, Trino, Superset, Airflow).
- Configure networks.
- Prepare volumes and paths.
- Download the Reddit dataset (`.jsonl` files).
Upload the cleaned Reddit data to MongoDB:
- At the root of the project, create a folder called `data` and put both `.jsonl` files inside.
- Create a free MongoDB Cloud account.
- Create a database and collection.
- Go to the folder `src/dataMongoDB` and update the connection string in `.venv` to point to your MongoDB.
- Then run the upload script:
```
cd ./src/dataMongoDB/
pip install -r requirement.txt
python upload.py
```
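For reference, the upload step amounts to reading each `.jsonl` file and inserting its lines as MongoDB documents. This is only a sketch, assuming pymongo; the connection string, relative path, and collection naming are placeholders and the real `upload.py` may differ.

```python
# Sketch of the upload step: push the .jsonl files from data/ into MongoDB.
# Connection string, path, and collection naming are placeholders.
import json
from pathlib import Path

from pymongo import MongoClient

MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>/"    # from your MongoDB Cloud account
db = MongoClient(MONGO_URI)["reddit"]                       # assumed database name

for path in Path("../../data").glob("*.jsonl"):             # the data/ folder created above
    docs = [json.loads(line) for line in path.open(encoding="utf-8") if line.strip()]
    db[path.stem].insert_many(docs)                         # e.g. submissions.jsonl -> "submissions"
    print(f"Inserted {len(docs)} documents into {path.stem}")
```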
Airflow is already included in the setup.
Open your browser: http://localhost:8089/
- Default user: `admin`
- Default password: `admin`
In Airflow UI:
- Go to DAGs.
- Find the DAG named `reddit_streaming_pipeline`.
- Click the play button to trigger it.
This will automatically do the following (a simplified DAG sketch follows the list):
- Start DFS and YARN.
- Create Kafka topics (`redditSubmission`, `redditComment`).
- Clear old checkpoints in MongoDB.
- Create databases in Iceberg (Bronze, Silver, Gold).
- Run Producer (MongoDB → Kafka).
- Submit Spark jobs (Bronze → Silver → Gold).
- Load results for dashboards in Superset.
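To show how Airflow chains these steps, here is a simplified DAG sketch. It is not the project's actual DAG definition: it assumes Airflow 2.4+ and shows only a few of the tasks, with commands taken from the manual steps later in this README.

```python
# Simplified orchestration sketch in the spirit of reddit_streaming_pipeline.
# Task list and commands are illustrative, not the project's real DAG file.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="reddit_streaming_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # triggered manually from the Airflow UI (Airflow 2.4+ syntax)
    catchup=False,
) as dag:
    create_topics = BashOperator(
        task_id="create_kafka_topics",
        bash_command=(
            "kafka-topics.sh --create --topic redditSubmission "
            "--bootstrap-server kafka1:9092 --replication-factor 2"
        ),
    )
    run_producer = BashOperator(
        task_id="run_producer",
        bash_command="python scripts/producer.py",
    )
    bronze_job = BashOperator(
        task_id="spark_bronze",
        bash_command="spark-submit --py-files utils.zip mainBronze.py",
    )

    # Run in order: create topics -> producer -> Bronze job (Silver/Gold omitted).
    create_topics >> run_producer >> bronze_job
```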
After the DAG finishes, open Superset at: http://localhost:8088/
- Default user: `admin`
- Default password: `AdminPassword123!`
There you can see dashboards with:
- Post and Comment statistics.
- Subreddit activity by hour.
- Community sentiment trends.
- Top domains shared.
- Subreddit insights (heatmap, treemap, bubble chart…).
Or you can see our dashboard here: reddit-streaming-lakehouse
The steps below run the pipeline manually, without the Airflow DAG. Go inside the Kafka container and create the topics:
```
kafka-topics.sh --create --topic redditSubmission --bootstrap-server kafka1:9092 --replication-factor 2
kafka-topics.sh --create --topic redditComment --bootstrap-server kafka1:9092 --replication-factor 2
```
Then go to the Confluent Kafka environment and run the following steps.

Delete old checkpoints:
```
source /opt/venv/bin/activate
python scripts/delAllDocument.py
```
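Roughly, the checkpoint cleanup just empties the checkpoint collection in MongoDB. A sketch of the idea, assuming pymongo; the URI and collection name are placeholders and the real `delAllDocument.py` may differ.

```python
# Sketch of clearing old checkpoints stored in MongoDB so the streaming jobs
# start fresh. URI and collection name are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI

checkpoints = client["reddit"]["checkpoints"]     # assumed checkpoint collection
result = checkpoints.delete_many({})              # remove every stored checkpoint document
print(f"Deleted {result.deleted_count} checkpoint documents")
```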
Run the producer:
```
source /opt/venv/bin/activate
python scripts/producer.py
```
Open the Spark shell and create the databases:
```
spark.sql("create database spark_catalog.bronze")
spark.sql("create database spark_catalog.silver")
spark.sql("create database spark_catalog.gold")
```
Run the Spark jobs (a simplified sketch of the Bronze job follows the commands):
```
spark-submit --py-files utils.zip mainBronze.py
spark-submit --py-files transformer.zip,utils.zip mainRsSilver.py
spark-submit --py-files transformer.zip,utils.zip mainRcSilver.py
spark-submit --py-files utils.zip createDim.py
spark-submit --py-files transformer.zip,utils.zip mainDimGold.py
spark-submit --py-files transformer.zip,utils.zip mainFactPost.py
spark-submit --py-files transformer.zip,utils.zip mainFactCmt.py
```
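Here is the simplified sketch of a Bronze job mentioned above: read the raw topic with Spark Structured Streaming and append it to an Iceberg table. The catalog settings mirror the spark-shell example below; the broker, topic, and table names come from this README, while the checkpoint path is a placeholder and the real `mainBronze.py` will differ in schema handling and options.

```python
# Simplified Bronze-layer streaming sketch: Kafka -> Iceberg (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("bronzeRedditSubmissionSketch")
    # Iceberg + Hive Metastore catalog, as in the spark-shell example below.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)

# Raw Kafka records: keep the JSON payload as a string plus the Kafka timestamp.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:9092")   # broker used in the topic-creation step
    .option("subscribe", "redditSubmission")
    .option("startingOffsets", "earliest")
    .load()
    .select(
        col("value").cast("string").alias("raw_json"),
        col("timestamp").alias("ingested_at"),
    )
)

# Append into the Bronze Iceberg table; the checkpoint path is a placeholder.
query = (
    raw.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/bronze_reddit_submission")
    .toTable("spark_catalog.bronze.reddit_submission")
)
query.awaitTermination()
```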
To query the data, use spark-shell:
```
spark-shell \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
Or use Superset. Example queries:
```
SELECT * FROM bronze.reddit_submission;
SELECT * FROM silver.reddit_comment;
SELECT * FROM gold.dimtime;
```
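These queries can also be run from Python through the Trino client (`pip install trino`). The host, port, user, and catalog name below are assumptions about this deployment; adjust them to your Trino configuration.

```python
# Query the lakehouse through Trino's Python client. Connection settings are
# assumptions; adjust host/port/catalog to this deployment.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,           # Trino's default HTTP port; may differ in this setup
    user="admin",
    catalog="iceberg",   # catalog that exposes the Iceberg tables (assumed name)
    schema="gold",
)

cur = conn.cursor()
cur.execute("SELECT * FROM gold.dimtime LIMIT 10")
for row in cur.fetchall():
    print(row)
```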
Authors:
Email:


