kdb04/TweeeeeDBT

TweeeeeDBT

Streaming Pipeline

Zookeeper and Kafka setup

docker compose up
docker ps # should show the zookeeper and kafka containers up and running
  • To test Kafka, open a shell in the Kafka container:
docker exec -it tweeeeedbt-kafka-1 bash
  • Inside the container, list the topics:
kafka-topics.sh --list --bootstrap-server localhost:9092
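
As a quick check from the host rather than from inside the container, the sketch below tests whether the broker port accepts TCP connections. It assumes the compose file publishes port 9092 on localhost, which this README does not confirm; `broker_reachable` is a hypothetical helper name. An open port only shows that something is listening, not that Kafka is healthy.

```python
import socket


def broker_reachable(host: str = "localhost", port: int = 9092,
                     timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: the Kafka container exposes 9092 to the host.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Kafka port reachable:", broker_reachable())
```

If this prints `False` while `docker ps` shows the container running, the port mapping in the compose file is the first thing to look at.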

Spark setup

  • Requirements: Java 17 and Python 3.8.10
    Note: with a newer version of either, PySpark may fail to install correctly, and even if it installs it may not work correctly.
wget https://downloads.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
tar -xvzf spark-3.5.5-bin-hadoop3.tgz
mv spark-3.5.5-bin-hadoop3 ~/spark
nano ~/.bashrc # or ~/.zshrc; add the two export lines below, then run: source ~/.bashrc
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
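
Since the setup depends on specific Java and Python versions, a small pre-flight check can catch a mismatch before installing PySpark. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it only compares major Java and major.minor Python versions against the requirements listed above.

```python
import re
import subprocess
import sys
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_291" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def python_matches(required=(3, 8), actual=None) -> bool:
    """Compare (major, minor) of the interpreter with the required pair."""
    actual = sys.version_info[:2] if actual is None else tuple(actual)[:2]
    return actual == tuple(required)


def installed_java_major() -> Optional[int]:
    """Run `java -version` (it prints to stderr) and parse the result."""
    try:
        result = subprocess.run(["java", "-version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None  # java is not on PATH
    return parse_java_major(result.stderr + result.stdout)
```

Calling `installed_java_major() == 17 and python_matches()` gives a single yes/no answer for the environment.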

DB setup (commands may differ on your system)

  • installation
sudo apt install postgresql postgresql-contrib
  • enabling the PostgreSQL service
sudo systemctl status postgresql
sudo systemctl start postgresql
sudo systemctl enable postgresql
  • access by switching to default postgres user
sudo -i -u postgres
psql
  • creating user and granting permission inside the psql shell
-- Create a database
CREATE DATABASE tweedbt;

-- Create a user with a password
CREATE USER <username> WITH PASSWORD '<password>'; -- these values should go into your .env

-- Grant privileges
GRANT ALL PRIVILEGES ON DATABASE tweedbt TO <username>;
  • applying the schema
cd DB/
psql -U <username> -d tweedbt -f schema.sql
  • accessing psql
psql -U <username> -d tweedbt
  • \l to list all databases
  • \d to list all relations
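
The username and password created above are meant to live in your .env. As an illustration of how a script might turn them into a PostgreSQL connection URI, here is a minimal sketch. The variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, and `DB_PORT` are assumptions and may not match what the repository's scripts actually read.

```python
import os
from urllib.parse import quote


def build_postgres_uri(env=None, database: str = "tweedbt") -> str:
    """Build a postgresql:// URI from environment-style settings.

    Assumed (hypothetical) keys: DB_USER, DB_PASSWORD, DB_HOST, DB_PORT.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' or '/' stay valid.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Passing a plain dict instead of `os.environ` makes the function easy to try without touching the real environment.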

How to run Pyspark, Kafka integration

  • Flush the old data before each run
bash reset-kafka.sh
  • Run each of the following commands in separate terminals, in the order shown
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 stream_processing.py
python3 consumer_<whatever>.py
python3 producer.py
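
The `--packages` value in the spark-submit command follows the pattern `org.apache.spark:spark-sql-kafka-0-10_<scala version>:<connector version>`. The sketch below builds that coordinate from its parts, which may help if you want to keep the connector version in step with the Spark release you installed (the command above uses 3.5.0 while the download above is 3.5.5). The helper names are hypothetical, and Scala 2.12 is assumed because that is what the command above uses.

```python
import shlex


def kafka_connector_coordinate(spark_version: str,
                               scala_version: str = "2.12") -> str:
    """Maven coordinate of the Spark Structured Streaming Kafka connector."""
    return (f"org.apache.spark:spark-sql-kafka-0-10_"
            f"{scala_version}:{spark_version}")


def spark_submit_command(script: str, spark_version: str,
                         scala_version: str = "2.12") -> list:
    """Argument list for spark-submit with the Kafka connector package."""
    coordinate = kafka_connector_coordinate(spark_version, scala_version)
    return ["spark-submit", "--packages", coordinate, script]


if __name__ == "__main__":
    # Reproduces the command shown above for connector version 3.5.0.
    print(shlex.join(spark_submit_command("stream_processing.py", "3.5.0")))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.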

ToDo

  • Data Visualization Layer
  • Folder Structuring
  • System Architecture Diagram
  • Dockerize Streaming

About

Twitter Streaming
