
Pinterest Data Pipeline

Project Description

Pinterest crunches billions of data points every day to decide how to provide more value to its users, relying on state-of-the-art machine learning engineering systems to handle daily user interactions such as image uploads and clicks. These interactions must be processed continuously to inform decision-making. The aim of this project is to develop a system that mirrors Pinterest's data analysis infrastructure, capable of analysing both the historical and the real-time data generated by user posts.

Technologies Used

The tools used for this project are listed below:

  1. Apache Kafka:

    • Description: Event streaming platform for real-time data capture and processing from various sources.
    • Documentation: Apache Kafka Documentation
  2. Amazon MSK (Managed Streaming for Apache Kafka):

    • Description: Fully managed service on AWS for building applications using Apache Kafka for streaming data processing.
    • Documentation: Amazon MSK Documentation
  3. AWS MSK Connect:

    • Description: Feature of Amazon MSK facilitating easy streaming of data to and from Apache Kafka clusters with managed connectors.
    • Documentation: MSK Connect Documentation
  4. Kafka REST Proxy:

    • Description: Provides a RESTful interface for interacting with an Apache Kafka cluster, simplifying message production, consumption, and administrative tasks.
    • Documentation: Confluent REST Proxy Documentation
  5. AWS API Gateway:

    • Description: Fully managed service for creating, publishing, maintaining, monitoring, and securing APIs at scale.
    • Documentation: AWS API Gateway Documentation
  6. Apache Spark:

    • Description: Multi-language engine for executing data engineering, data science, and machine learning tasks on single-node machines or clusters.
    • Documentation: Apache Spark Documentation
  7. PySpark:

    • Description: Python API for Apache Spark, enabling real-time, large-scale data processing in a distributed environment using Python.
    • Documentation: PySpark Documentation
  8. Databricks:

    • Description: Unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.
    • Documentation: Databricks Documentation
  9. Managed Workflows for Apache Airflow (MWAA):

    • Description: AWS service allowing the use of Apache Airflow and Python to create workflows without managing underlying infrastructure.
    • Documentation: MWAA Documentation
  10. AWS Kinesis:

    • Description: AWS service for collecting, processing, and analysing real-time streaming data at scale.
    • Documentation: AWS Kinesis Documentation

Data Pipeline Building Process

  1. Downloaded the Pinterest infrastructure, which produces data similar to what the Pinterest API receives when a user uploads data through a POST request. The infrastructure comprises three tables: pinterest_data, geolocation_data, and user_data.

  2. Configured an EC2 instance as an Apache Kafka client machine for creating topics (a topic-creation sketch appears after this list). Then configured MSK Connect so that the MSK cluster delivers data to an S3 bucket, ensuring that any data sent to a topic is automatically saved in the designated bucket.

  3. Built an API in AWS API Gateway to transmit data to the MSK cluster through the MSK Connect connector. Implemented a Kafka REST Proxy integration for the API, installed and started the Kafka REST Proxy on the EC2 client machine, and adapted the user_posting_emulation.py script to send data to the API (a sketch of the adapted request follows the list). The API in turn routes the data to the MSK cluster via the previously established plugin-connector pair, which stores it in the designated S3 bucket.

  4. Retrieved data from AWS into Databricks for batch processing and used Spark on Databricks to clean it and perform computations. To clean and query the batch data, it was first read from the S3 bucket into Databricks by mounting the bucket to the Databricks workspace (a mounting-and-read sketch follows the list). Because the Databricks account already had full S3 access with pre-configured credentials, there was no need to generate a new Access Key and Secret Access Key for Databricks.

  5. Orchestrated Databricks workloads on AWS MWAA by uploading a Directed Acyclic Graph (DAG) to an MWAA environment and triggering its execution at a specified time (a sample DAG sketch follows the list).

  6. Imported data into Databricks for stream processing. Sent data to Kinesis streams, processed and transformed it in Databricks, and wrote the streaming data into Delta tables. Configured the pre-existing REST API to invoke Kinesis actions, so that it can now: list streams in Kinesis; create, describe, and delete streams in Kinesis; and append records to streams in Kinesis. Used the user_posting_emulation_streaming script to send requests to the API, adding one record at a time to the created streams (a record-submission sketch follows the list).

  7. Processed and transformed the streaming data within Databricks, then saved each stream into a Delta table (a streaming-write sketch follows the list).
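
The sketches below flesh out individual steps; every name, URL, and identifier in them is a placeholder rather than one of the project's real values.

For step 2, the topics were created with the Kafka CLI on the EC2 client machine; a roughly equivalent Python sketch using the kafka-python admin client, assuming a placeholder bootstrap address and topic names modelled on the three source tables (the real MSK cluster uses IAM authentication, which would need additional SASL configuration):

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Placeholder bootstrap address; a real MSK cluster endpoint (and its
# IAM/SASL settings) would differ.
admin = KafkaAdminClient(bootstrap_servers="b-1.example-msk.amazonaws.com:9092")

# One topic per source table (names are illustrative).
topics = [
    NewTopic(name=name, num_partitions=3, replication_factor=2)
    for name in ("pinterest_data", "geolocation_data", "user_data")
]
admin.create_topics(new_topics=topics)
admin.close()
```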
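
For step 3, a minimal sketch of how an adapted user_posting_emulation.py might send a single record to the API Gateway endpoint, which forwards it to the Kafka REST Proxy; the invoke URL, topic name, and record fields are placeholders:

```python
import json
import requests

# Placeholder invoke URL; the real API Gateway stage URL and topic differ.
INVOKE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/topics/pinterest_data"

def send_to_kafka(record: dict) -> None:
    # The Confluent REST Proxy expects {"records": [{"value": <payload>}]}
    # together with this vnd.kafka content type.
    payload = json.dumps({"records": [{"value": record}]}, default=str)
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    response = requests.post(INVOKE_URL, data=payload, headers=headers)
    response.raise_for_status()

send_to_kafka({"index": 1, "title": "example pin", "category": "diy"})
```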
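
For step 4, a sketch of mounting the S3 bucket and reading one topic's JSON objects into a DataFrame; this runs inside a Databricks notebook (where dbutils and spark are predefined), and the bucket name, mount point, and key layout are assumptions:

```python
# Databricks notebook cell: mount the bucket once, then read the topic data.
# All names below are illustrative.
AWS_S3_BUCKET = "user-pinterest-bucket"
MOUNT_NAME = "/mnt/pinterest_data"

dbutils.fs.mount(source=f"s3a://{AWS_S3_BUCKET}", mount_point=MOUNT_NAME)

# The MSK Connect S3 sink writes one folder per topic under topics/.
df_pin = spark.read.json(f"{MOUNT_NAME}/topics/pinterest_data/partition=*/*.json")
df_pin.printSchema()
```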
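
For step 5, a sketch of the kind of DAG uploaded to the MWAA environment, using the Airflow Databricks provider to run an existing notebook on a daily schedule; the connection ID, notebook path, cluster ID, and schedule are assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="pinterest_databricks_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run at a specified time each day
    catchup=False,
    default_args=default_args,
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_databricks_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="1234-567890-abcde123",  # placeholder cluster ID
        notebook_task={"notebook_path": "/Workspace/Users/someone/pinterest_batch_cleaning"},
    )
```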
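
For step 6, a sketch of how a user_posting_emulation_streaming script might append one record at a time to a Kinesis stream through the REST API; it assumes the API exposes the PutRecord action at a /streams/{stream-name}/record resource whose mapping template base64-encodes the Data field, and all names are placeholders:

```python
import json
import requests

# Placeholder invoke URL and stream name.
INVOKE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"
STREAM_NAME = "streaming-pinterest-data"

def put_record(record: dict) -> None:
    # API Gateway forwards this body to the Kinesis PutRecord action;
    # the integration is assumed to base64-encode Data before the call.
    payload = json.dumps(
        {"StreamName": STREAM_NAME, "Data": record, "PartitionKey": "partition-1"},
        default=str,
    )
    headers = {"Content-Type": "application/json"}
    response = requests.put(
        f"{INVOKE_URL}/streams/{STREAM_NAME}/record", data=payload, headers=headers
    )
    response.raise_for_status()

put_record({"index": 1, "title": "example pin", "category": "diy"})
```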
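
For step 7, a sketch of reading one Kinesis stream with Structured Streaming in a Databricks notebook, parsing the JSON payload, and appending the result to a Delta table; the stream name, region, schema, checkpoint location, and table name are illustrative, and the built-in kinesis source is Databricks-specific:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Illustrative schema for the pin records carried in the stream.
pin_schema = StructType([
    StructField("index", IntegerType()),
    StructField("title", StringType()),
    StructField("category", StringType()),
])

# Read the raw stream (Databricks' Kinesis source; names are placeholders).
raw = (spark.readStream
       .format("kinesis")
       .option("streamName", "streaming-pinterest-data")
       .option("region", "us-east-1")
       .option("initialPosition", "earliest")
       .load())

# Kinesis delivers the payload as bytes in the `data` column.
pins = (raw
        .selectExpr("CAST(data AS STRING) AS json")
        .select(from_json(col("json"), pin_schema).alias("record"))
        .select("record.*"))

# Persist the cleaned stream into a Delta table.
(pins.writeStream
     .format("delta")
     .outputMode("append")
     .option("checkpointLocation", "/tmp/checkpoints/pin_stream")
     .table("pin_streaming_data"))
```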

About

AWS-hosted end-to-end data pipeline for Pinterest
