Azure End-To-End Data Engineering Project
General overview of the project:
Beginning with the 2021 Olympics data stored as CSV files on GitHub (accessible
here: kaoutharElbakouri/2021-Olympics-in-Tokyo-Data (github.com)), this project employs
Azure Data Factory (ADF) to ingest this information into the raw layer of Azure Data
Lake Storage (ADLS). Azure Databricks then takes the lead, refining the dataset and
storing the processed data in ADLS's transformed layer. Azure Synapse Analytics steps in next,
primarily for data warehousing and detailed analysis, allowing deeper exploration and
insights. Finally, Power BI visualizes these insights, completing the step-by-step
process and providing a rich, comprehensive view of the 2021 Olympics dataset.
Here's a diagram showing the workflow of the project we're going to build:
Diagram Project Overview
We are going to split this project into the following sections:
1. Create a resource group
The initial step under the Azure for Students subscription is to create a resource group,
specifying its name and region.
create an Azure resource group
This is our created resource group, which will act as a container to hold related Azure resources
for efficient management and organization:
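For readers who prefer scripting over the portal, the same step could be done with the Azure SDK for Python. The snippet below is only a sketch: the subscription ID is a placeholder and the resource group name is a hypothetical one, not necessarily the name used in this project.

# Sketch: creating the resource group programmatically with the Azure SDK for Python.
# The resource group name and subscription ID below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"
resource_client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

rg = resource_client.resource_groups.create_or_update(
    "tokyo-olympic-rg",            # hypothetical resource group name
    {"location": "westeurope"},
)
print(rg.name, rg.location)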
2. Create a storage account
Building upon the resource group, the next step is to create an Azure Data Lake Storage (ADLS)
account (a storage account with the hierarchical namespace enabled).
• Select the created resource group.
• Click on the “Create a resource” button within the resource group.
• In the search bar, type “Storage account” and select the option for creating a new
storage account.
• Enter the necessary details for the storage account, such as:
✓ Subscription: Choose your Azure for Students subscription.
✓ Resource group: Select the created resource group.
✓ Storage account name: Enter “tokyoolympicdatastorage.”
✓ Region: Choose the appropriate region (e.g., West Europe).
✓ Performance, Replication, and Access tier: Select as per your requirements.
✓ Hierarchical namespace: Enable this option to create Azure Data Lake Storage (ADLS).
Azure Data Lake Storage
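If you would rather script the storage account creation, here is a rough equivalent with the Azure SDK for Python; treat it as a sketch (the resource group name and subscription ID are placeholders), with is_hns_enabled=True being the flag that corresponds to enabling the hierarchical namespace.

# Sketch: creating the ADLS Gen2 storage account with the Azure SDK for Python.
# Resource group name and subscription ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

storage_client = StorageManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

poller = storage_client.storage_accounts.begin_create(
    "tokyo-olympic-rg",                      # hypothetical resource group name
    "tokyoolympicdatastorage",               # storage account name used in this project
    StorageAccountCreateParameters(
        location="westeurope",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,                 # hierarchical namespace -> ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.primary_endpoints.dfs)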
The next step is to create a container. Inside the ADLS account, locate and select the
"Containers" section, click on "New Container", enter the desired name for the container,
for example "tokyoolympicdata", then click Create.
ADLS container
The next step is to create two directories within this container. Navigate to the container,
click on "Add Directory", enter the directory name "raw-data", and confirm the creation. Click
on "Add Directory" again, enter the directory name "transformed-data", and confirm the creation.
Separating raw and transformed layers like this is a common way to organize a data lake.
ADLS container Directories
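The container and the two directories can also be created programmatically with the azure-storage-file-datalake package; the following is just a sketch (the account key is a placeholder, and any supported credential would work).

# Sketch: creating the container (file system) and the two layer directories with the
# Data Lake SDK. The account key is a placeholder.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://tokyoolympicdatastorage.dfs.core.windows.net",
    credential="<storage-account-key>",
)

fs = service.create_file_system(file_system="tokyoolympicdata")
fs.create_directory("raw-data")
fs.create_directory("transformed-data")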
3. Initiating data ingestion: Transferring CSV files from GitHub to ADLS ‘raw-data’ directory
using ADF:
Now that we've established the "raw-data" directory, we're set to initiate the ingestion process
using Azure Data Factory (ADF). This involves transferring data from CSV files stored on GitHub to
Azure Data Lake Storage (ADLS).
To do so, we need to create an ADF resource:
From the Azure portal, in the search bar, type “Data Factory” and select “Data Factory” from
the search results.
Click “Create” to begin the setup of a new Azure Data Factory resource.
Enter the necessary information:
• Name: Provide a unique name for your ADF resource.
• Subscription: Choose your subscription (Azure for Students).
• Resource Group: Select the already created resource group.
• Version: Choose the ADF version you want to use.
Once everything is validated, click “Create” to initiate the provisioning of your new Azure Data
Factory resource.
Here is our created ADF resource:
ADF
Once your Azure Data Factory (ADF) resource is created, you can begin working on it by
launching ADF Studio.
Our data source in GitHub comprises five CSV files.
Data source
Let’s concentrate on ingesting the “Athletes” CSV file from our GitHub data source into Azure
Data Lake Storage (ADLS) using Azure Data Factory (ADF). The same steps will be replicated for
the ingestion of the other files as well.
Create the necessary Linked services:
To effectively handle each CSV file, it's essential to create two connections: a source
linked service and a destination linked service. We begin by creating the linked service
that establishes the connection between Azure Data Factory (ADF) and the source data on GitHub.
Under the ADF (Azure Data Factory) Manage hub, you can find the option for “Linked Services.”
By clicking "New" and choosing "HTTP" as the connection type (since our data is on GitHub), you'll
initiate the process of creating a new linked service:
Athletes HTTP Linked Service
Once you’ve selected the “HTTP” connection type, proceed by naming the linked service,
providing the base URL, selecting “Anonymous” for authentication, and opting for
“AutoResolveIntegrationRuntime” for the integration runtime setting.
For the Base URL, open the Athletes CSV file on GitHub, click "Raw", and copy the link.
Athletes Base URL
Let's create the Azure Data Lake Storage linked service by following the same steps as for the
previous linked service, but choose "Azure Data Lake Storage Gen2" instead of "HTTP" as the
connection type and provide the necessary details:
ADLS Linked Service
Here are our created linked services:
Linked services
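Although this walkthrough creates the linked services through ADF Studio, the same definitions could be created with the azure-mgmt-datafactory package. The snippet below is a sketch only: the factory and linked service names are hypothetical, the GitHub raw URL is truncated on purpose (paste the link copied above), and the storage account key is a placeholder.

# Sketch: defining the HTTP (source) and ADLS Gen2 (sink) linked services in code.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, HttpLinkedService, AzureBlobFSLinkedService, SecureString,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
rg, factory = "tokyo-olympic-rg", "tokyo-olympic-adf"   # hypothetical names

# Source: anonymous HTTP connection pointing at the raw GitHub link of the Athletes CSV.
http_ls = LinkedServiceResource(properties=HttpLinkedService(
    url="https://raw.githubusercontent.com/.../Athletes.csv",   # paste the copied raw link
    authentication_type="Anonymous",
))
adf.linked_services.create_or_update(rg, factory, "ls_http_athletes", http_ls)

# Sink: ADLS Gen2 connection, here authenticated with the storage account key.
adls_ls = LinkedServiceResource(properties=AzureBlobFSLinkedService(
    url="https://tokyoolympicdatastorage.dfs.core.windows.net",
    account_key=SecureString(value="<storage-account-key>"),
))
adf.linked_services.create_or_update(rg, factory, "ls_adls_tokyoolympic", adls_ls)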
Create the Datasets:
Now that our source and destination linked services have been created, we can create the
Datasets that will be referenced by the ADF activities. For that, under the ADF Author hub,
locate the Datasets section and create a new Dataset.
For the source Dataset, we need to choose HTTP as the data store type.
Source Dataset
Then choose CSV for the format type:
Then name your Dataset and specify the already created Linked service (do not forget to publish
the Dataset):
Now let's create the destination Dataset with the same steps, but with Azure Data Lake Storage
Gen2 as the data store type instead of HTTP, and set the required properties for our Dataset.
Now that our two Datasets have been created, we can proceed with creating the ADF activity to
do the ingestion from source to destination.
ADF Ingestion Activity:
Under the Author hub, locate the Pipelines section and create a new pipeline, then select the
Copy data activity from the Activities section.
Athletes Ingestion Pipeline
The next step is to configure the source and sink for our activity:
Activity source and Sink
Now we can test our pipeline by clicking Debug.
Our pipeline succeeded, and we can see the Athletes data loaded into the ADLS container (raw-data directory).
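Debug runs the pipeline interactively; once published, the same pipeline could also be triggered and monitored from code, for example with the azure-mgmt-datafactory package. This is a sketch reusing the hypothetical names from the linked service snippet above, with a hypothetical pipeline name.

# Sketch: triggering a published pipeline run and polling its status.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
rg, factory = "tokyo-olympic-rg", "tokyo-olympic-adf"   # hypothetical names

run = adf.pipelines.create_run(rg, factory, "pl_ingest_athletes")   # hypothetical pipeline name
while True:
    status = adf.pipeline_runs.get(rg, factory, run.run_id).status
    print("Pipeline run status:", status)
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(15)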
After following the same steps for the remaining CSV files, we should have all the CSV files
located in the raw-data directory within our ADLS container:
ADLS raw-data
4. Data processing and transformation with Databricks: Enriching ADLS transformed layer
Once data has been ingested into the raw layer of Azure Data Lake Storage (ADLS), the next
stage involves processing this data using Databricks and then storing it in the transformed layer
within ADLS.
For that, we first need to create a Databricks resource. To do so, from the Azure portal, search for
"Databricks" and select "Azure Databricks" from the available services. Click "Create" and fill
in the required information, such as resource group, workspace name, region, and pricing tier.
Azure Databricks
Once created, click Launch Workspace and you will be directed to the Databricks
workspace. Once the Databricks workspace is launched, locate the Compute section (this is
where you manage clusters) and click Create to set up a cluster with your desired configuration.
In my case, a cluster is already created and ready to execute Spark code:
Cluster
Now we can write simple code to get data from the ADLS raw-data layer, transform it, and write it
into the ADLS transformed-data layer. To do so, we need to create a notebook by clicking 'New
Notebook' and make sure the created Spark cluster is attached:
Databricks Notebook
We have to create a connection between Databricks and ADLS so we can easily access the
data. For that, from the Azure portal, search for "App registrations" and click on 'New
registration' to register a new application, providing a name for the app:
app registrations
This is our registered app. From it we need the Client ID and Tenant ID:
Within this application, locate the 'Certificates & secrets' section and create a secret that we
will need for the connection between Databricks and ADLS:
Secret Key
Keep the value of the created secret for later use:
Now we can use the three credentials (client ID, tenant ID, and the secret value) to connect
Databricks to ADLS by mounting the container:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type":
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "34f9f5d3-593c-4389-a3af-78adfecc736b", #Client ID
"fs.azure.account.oauth2.client.secret": 'yaf8Q~FEMX87TQw5a03cSNm6CrOL2vGxQ-qh8dfM',
#Value of secret key
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/eb12f8ec-35f2-
415d-97bf-0e34301876a7/oauth2/token"} #Tenant ID
dbutils.fs.mount(
source = "abfss://[email protected]", #
contrainer@storageacc
mount_point = "/mnt/tokyoolymic",
extra_configs = configs)
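As a side note, hardcoding the client secret in the notebook is best avoided; the three values could instead be read from a Databricks secret scope, assuming such a scope has already been created (the scope and key names below are hypothetical):

# Hypothetical alternative: read the credentials from a Databricks secret scope
# instead of hardcoding them. Scope and key names are placeholders.
client_id     = dbutils.secrets.get(scope="tokyo-olympic-scope", key="app-client-id")
client_secret = dbutils.secrets.get(scope="tokyo-olympic-scope", key="app-client-secret")
tenant_id     = dbutils.secrets.get(scope="tokyo-olympic-scope", key="tenant-id")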
And to have permission to access the files stored in ADLS, we need to explicitly grant this
access using the credentials of the registered application. So, we need to give this app
permission to access the Data Lake. Under our ADLS container, click on Access Control (IAM), then
click Add role assignment:
app access
Select the 'Storage Blob Data Contributor' role, click Next, then click 'Select members' and
select the registered app (app1):
To verify that the connection to ADLS was created successfully, we can run this code within our
notebook:
%fs
ls "/mnt/tokyoolymic"
Now, we can read our CSV files:
athletes = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/mnt/tokyoolymic/raw-data/athletes.csv")
coaches = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/mnt/tokyoolymic/raw-data/Coaches.csv")
entriesgender = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/mnt/tokyoolymic/raw-data/EntriesGender.csv")
medals = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/mnt/tokyoolymic/raw-data/Medals.csv")
teams = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/mnt/tokyoolymic/raw-data/Teams.csv")
We can print the athletes data:
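For instance, we can display a few rows and the inferred schema:

# Display a few rows and the inferred schema of the athletes DataFrame.
athletes.show(5)
athletes.printSchema()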
We can use Apache Spark functions and apply whatever transformations we need to our
data (the goal here is not the transformations themselves, but to show how we can use
Databricks and build the connection). A small illustrative example follows.
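For illustration, here is the kind of small transformation one might apply; this is a sketch only, assuming the EntriesGender file has Female, Male, and Total columns as in the source CSVs:

# Illustrative transformation: add the female/male share of entries per discipline.
# Column names (Female, Male, Total) are assumed to match the source EntriesGender CSV.
from pyspark.sql import functions as F

entriesgender = (entriesgender
                 .withColumn("Female_Ratio", F.col("Female") / F.col("Total"))
                 .withColumn("Male_Ratio", F.col("Male") / F.col("Total")))
entriesgender.show(5)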
The next step is to write the (transformed) data to the transformed-data layer of ADLS:
athletes.repartition(1).write.mode("overwrite").option("header","true").csv("/mnt/tokyoolymic/transformed-data/athletes")
coaches.repartition(1).write.mode("overwrite").option("header","true").csv("/mnt/tokyoolymic/transformed-data/coaches")
entriesgender.repartition(1).write.mode("overwrite").option("header","true").csv("/mnt/tokyoolymic/transformed-data/entriesgender")
medals.repartition(1).write.mode("overwrite").option("header","true").csv("/mnt/tokyoolymic/transformed-data/medals")
teams.repartition(1).write.mode("overwrite").option("header","true").csv("/mnt/tokyoolymic/transformed-data/teams")
Here are the folders that contain the data written to storage, along with the job metadata.
For example, within the athletes folder we have four files: the first three (such as _SUCCESS)
are Spark job metadata, and the last one (the part-* CSV) holds the actual data; because we used
repartition(1), everything is written to a single part file:
Now that our data is transformed and stored in the transformed-data layer of ADLS, the
next step is to load it into Azure Synapse Analytics, where we can use SQL to analyze the data,
derive insights, and build a dashboard on top of it.
5. Loading data from Azure Data Lake Storage (ADLS) to Azure Synapse Analytics for
advanced data analytics
From the Azure portal, search for Azure Synapse Analytics, click on 'Create', and provide the
required details:
Synapse Analytics
Once the deployment is complete, we can move forward and load our data into Synapse Analytics.
To do so, open Synapse Studio, click on the Data hub, then click the + button and choose Lake
database:
Create the Database and name it:
Once the database is created, we can create the table and load the data (choose create table
from data lake):
And provide the table details: the name, the linked service (choose the default one), and for the
input file just navigate to the file that contains the athletes data inside the transformed-data
layer of ADLS (make sure to choose the part file that contains the data and not the metadata files).
After following the identical procedure for the additional data files, our Synapse database now
contains the following tables, as expected:
These tables provide the foundation for running SQL queries to perform analysis and derive
insights:
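As an illustration of the kind of query these tables enable, here is a hedged sketch that runs a query from Python over the Synapse serverless SQL endpoint with pyodbc; the server, database, table, and column names are assumptions, and the same SQL can simply be run in Synapse Studio instead.

# Illustrative only: querying the lake database through the serverless SQL endpoint.
# Server, database, table, and column names below are placeholders/assumptions.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-workspace>-ondemand.sql.azuresynapse.net;"
    "Database=tokyoolympicdb;"
    "Authentication=ActiveDirectoryInteractive;"
)
for row in conn.execute(
        "SELECT TOP 10 Team_Country, Gold FROM dbo.medals ORDER BY Gold DESC"):
    print(row.Team_Country, row.Gold)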
6. Connecting Power BI to Azure Synapse for data visualization and analysis
The final step involves establishing a connection between Power BI and Synapse, enabling the
use of Power BI for comprehensive data analysis and visualization based on the Synapse
database tables.
Open Microsoft Power BI Desktop, click on 'Get Data', search for Azure Synapse Analytics SQL as a
data source, then connect:
Go to Synapse Analytics Workspace overview and copy the ‘Serverless SQL endpoint’:
In Power BI, insert the copied string for the server name and Connect:
Finally, we can select the tables we want, load them into Power BI, and then start our analysis
and dashboarding work.