TweeeeDBT

Streaming Pipeline

Zookeeper and Kafka setup

docker compose up

docker ps # should show the zookeeper and kafka containers up and running

to test kafka

docker exec -it tweeeeedbt-kafka-1 bash # the container name may differ; use the one shown by docker ps

inside the container run

kafka-topics.sh --list --bootstrap-server localhost:9092
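The port check can also be scripted from the host. The following is a minimal sketch, assuming the compose file publishes the broker on localhost:9092 (the address used in the command above). The helper name `broker_reachable` is hypothetical, and it only confirms that the TCP port accepts connections, not that Kafka itself is healthy.

```python
import socket


def broker_reachable(host: str = "localhost", port: int = 9092, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    status = "reachable" if broker_reachable() else "not reachable"
    print(f"Kafka broker on localhost:9092 is {status}")
```

If this reports "not reachable" while docker ps shows the kafka container running, the port mapping in docker-compose.yaml is the first thing to check.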

Spark setup

Requirements: Java 17, Python 3.8.10
PS: with a newer version of either, PySpark may not install correctly, and even if it does it may not work correctly :)
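Before installing Spark, it may help to confirm which versions are actually on the PATH. This is a rough sketch, not part of the project: `parse_java_major` and `python_matches` are hypothetical helpers, and the expected Python version (3.8) is taken from the requirement above.

```python
import re
import subprocess
import sys
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 reports itself as "1.8.0_...".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def python_matches(expected=(3, 8)) -> bool:
    """Compare the running interpreter's major.minor with the expected pair."""
    return sys.version_info[:2] == tuple(expected)


if __name__ == "__main__":
    note = "OK" if python_matches() else "differs from 3.8"
    print("Python", sys.version.split()[0], note)
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        print("Java major version:", parse_java_major(result.stderr))
    except FileNotFoundError:
        print("java not found on PATH")
```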

wget https://downloads.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz

tar -xvzf spark-3.5.5-bin-hadoop3.tgz
mv spark-3.5.5-bin-hadoop3 ~/spark

nano ~/.bashrc # or ~/.zshrc; add the two export lines below

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
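After reloading the shell (for example with source ~/.bashrc), a quick check can confirm that SPARK_HOME points at a directory containing bin/spark-submit. This is a small sketch under that assumption; `spark_submit_path` is a hypothetical helper name.

```python
import os
from pathlib import Path


def spark_submit_path(env=None):
    """Return the spark-submit path under SPARK_HOME, or None if it is missing."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return None
    # expanduser handles a literal "~" in case the shell did not expand it.
    candidate = Path(spark_home).expanduser() / "bin" / "spark-submit"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = spark_submit_path()
    if found:
        print(f"spark-submit found at {found}")
    else:
        print("SPARK_HOME is unset or has no bin/spark-submit")
```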

DB setup (might not be exact on your system)

installation

sudo apt install postgresql postgresql-contrib

enabling the PostgreSQL service

sudo systemctl status postgresql
sudo systemctl start postgresql
sudo systemctl enable postgresql

access by switching to default postgres user

sudo -i -u postgres

psql

creating user and granting permission inside the psql shell

-- Create a database
CREATE DATABASE tweedbt;

-- Create a user with a password
CREATE USER <username> WITH PASSWORD '<password>'; -- these values should go into your .env

-- Grant privileges
GRANT ALL PRIVILEGES ON DATABASE tweedbt TO <username>;
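The comment above says the username and password belong in .env. As an illustration of how a script might read them, here is a minimal sketch that parses simple KEY=VALUE lines and builds a PostgreSQL connection URL. The key names (DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME) are assumptions, since the project's actual .env keys are not shown here, and the parser handles only the simplest .env syntax.

```python
from urllib.parse import quote


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def postgres_url(env: dict) -> str:
    """Build a postgresql:// URL; the key names here are hypothetical."""
    # Percent-encode credentials so characters like "/" or "@" stay unambiguous.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "tweedbt")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    sample = "DB_USER=tweet_app\nDB_PASSWORD='s3cret/pw'\n# DB_NAME defaults to tweedbt\n"
    print(postgres_url(parse_env(sample)))
```

Keeping the credentials in .env (which should stay out of version control) means the same scripts can run against different local databases without code changes.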

applying the schema

cd DB/
psql -U <username> -d tweedbt -f schema.sql

accessing psql

psql -U <username> -d tweedbt

\l lists all databases

\d lists all relations in the current database

How to run Pyspark, Kafka integration

Make sure the old data is flushed before each run

bash reset-kafka.sh

Run each of the following commands in a separate terminal, in the order shown

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 stream_processing.py

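python3 consumer_<name>.py # one of consumer_geolocation.py, consumer_teammentions.py, consumer_userverify.py, consumer_verifieduserwindowed.py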

python3 producer.py
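For convenience, the three commands can be started from one script instead of three terminals. This is a sketch only, not something the repository provides: `build_commands` and `launch_all` are hypothetical names, and the fixed pause between launches is a guess at start-up time, not a real readiness check.

```python
import subprocess
import time

# Same connector coordinates as the spark-submit command above.
SPARK_KAFKA_PACKAGE = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0"


def build_commands(consumer_script: str) -> list:
    """Return the three commands in the order the README lists them."""
    return [
        ["spark-submit", "--packages", SPARK_KAFKA_PACKAGE, "stream_processing.py"],
        ["python3", consumer_script],
        ["python3", "producer.py"],
    ]


def launch_all(consumer_script: str, delay_seconds: float = 10.0) -> list:
    """Start each command in the background, pausing between launches."""
    processes = []
    for command in build_commands(consumer_script):
        processes.append(subprocess.Popen(command))
        # Crude spacing so each stage has time to start before the next.
        time.sleep(delay_seconds)
    return processes
```

Calling `launch_all("consumer_geolocation.py")` from the repository root would start the Spark job, that consumer, and the producer in sequence. Output from all three processes is interleaved in one terminal, so separate terminals remain easier for debugging.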

ToDo

Data Visualization Layer
Folder Structuring
System Architecture Diagram
Dockerize Streaming
