Components of a data platform
Oliver Willekens
Data Engineer at Data Minded
Course contents
ingest data using Singer
apply common data cleaning operations
gain insights by combining data with PySpark
test your code automatically
deploy Spark transformation pipelines
=> intro to data engineering pipelines
Data is valuable
Democratizing data increases insights
Genesis of the data
Operational data is stored in the landing zone
Cleaned data prevents rework
The business layer provides most insights
Pipelines move data from one zone to another
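These zones suggest a simple mental model for a pipeline: read from one zone, transform, write into the next. As a purely illustrative sketch (not course code), assuming hypothetical zone paths and using PySpark, which this course introduces later:

from pyspark.sql import SparkSession

# Illustrative sketch only: the paths and the cleaning steps are assumptions.
spark = SparkSession.builder.appName("landing_to_clean").getOrCreate()

# Read raw operational extracts from the landing zone ...
raw = spark.read.csv("/data/landing/ratings/", header=True)

# ... apply basic cleaning so downstream users don't have to redo it ...
clean = raw.dropDuplicates().na.drop()

# ... and store the result in the clean zone in a columnar format.
clean.write.mode("overwrite").parquet("/data/clean/ratings/")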
Let’s reason!
Introduction to data ingestion with Singer
Singer’s core concepts
Aim: “The open-source standard for writing scripts that move data”
Singer is a specification
data exchange format: JSON
extract and load with taps and targets
=> language independent
communicate over streams:
schema (metadata)
state (process metadata)
record (data)
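Each of these stream message types corresponds to a function in the singer-python library. A minimal sketch (the stream name and schema are made up for illustration) showing the order in which a tap typically emits them:

import singer

# Schema (metadata): describe the stream before any data flows.
singer.write_schema(stream_name="example",
                    schema={"properties": {"id": {"type": "integer"}}},
                    key_properties=["id"])

# Record (data): the rows themselves.
singer.write_record(stream_name="example", record={"id": 1})

# State (process metadata): a bookmark for the next run.
singer.write_state(value={"max-id": 1})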
Describing the data through its schema
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

json_schema = {
    "properties": {"age": {"maximum": 130,
                           "minimum": 1,
                           "type": "integer"},
                   "has_children": {"type": "boolean"},
                   "id": {"type": "integer"},
                   "name": {"type": "string"}},
    "$id": "http://yourdomain.com/schemas/my_user_schema.json",
    "$schema": "http://json-schema.org/draft-07/schema#"}
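Because this is plain JSON Schema (draft-07 here), you can validate records against it before ingesting them, for example with the third-party jsonschema package; the package is an assumption for this sketch, not part of Singer:

from jsonschema import validate

# Raises jsonschema.exceptions.ValidationError if the record violates the schema.
validate(instance={"id": 1, "name": "Adrian", "age": 32, "has_children": False},
         schema=json_schema)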
Describing the data through its schema
import singer
singer.write_schema(schema=json_schema,
                    stream_name='DC_employees',
                    key_properties=["id"])

{"type": "SCHEMA", "stream": "DC_employees", "schema": {"properties": {"age": {"maximum": 130, "minimum": 1, "type": "integer"}, "has_children": {"type": "boolean"}, "id": {"type": "integer"}, "name": {"type": "string"}}, "$id": "http://yourdomain.com/schemas/my_user_schema.json", "$schema": "http://json-schema.org/draft-07/schema#"}, "key_properties": ["id"]}
Serializing JSON
import json
json.dumps(json_schema["properties"]["age"])
'{"maximum": 130, "minimum": 1, "type": "integer"}'
with open("foo.json", mode="w") as fh:
json.dump(obj=json_schema, fp=fh) # writes the json-serialized object
# to the open file handle
BUILDING DATA ENGINEERING PIPELINES IN PYTHON
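Deserializing is symmetric: json.loads parses a JSON string and json.load reads from an open file handle, which is exactly what a Singer target must do with each incoming line. Continuing from the snippet above:

import json

# Parse a JSON document from a string ...
age_schema = json.loads('{"maximum": 130, "minimum": 1, "type": "integer"}')

# ... or read one back from an open file handle.
with open("foo.json") as fh:
    restored = json.load(fh)

assert restored == json_schema  # the round trip preserves the structure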
Let’s practice!
Running an ingestion pipeline with Singer
Streaming record messages
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

singer.write_record(stream_name="DC_employees",
                    record=dict(zip(columns, users.pop())))  # set.pop() removes
                                                             # an arbitrary element

{"type": "RECORD", "stream": "DC_employees", "record": {"id": 1, "name": "Adrian", "age": 32, "has_children": false}}

fixed_dict = {"type": "RECORD", "stream": "DC_employees"}
record_msg = {**fixed_dict, "record": dict(zip(columns, users.pop()))}
print(json.dumps(record_msg))
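Popping elements one at a time quickly gets tedious; singer.write_records (also used in the tap module on the next slide) accepts an iterable of records. A minimal sketch emitting all three users at once:

import singer

columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

# write_records() simply calls write_record() once per element.
singer.write_records(stream_name="DC_employees",
                     records=[dict(zip(columns, user)) for user in users])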
Chaining taps and targets
# Module: my_tap.py
import singer
singer.write_schema(stream_name="foo", schema=…)
singer.write_records(stream_name="foo", records=…)
Ingestion pipeline: pipe the tap’s output into a Singer target, using the | symbol (Linux & macOS):
python my_tap.py | target-csv
python my_tap.py | target-csv --config userconfig.cfg
my-packaged-tap | target-csv --config userconfig.cfg
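Conceptually, the target on the right-hand side of the pipe is just a program that reads Singer messages line by line from stdin. A minimal sketch of that loop, assuming nothing about how target-csv is actually implemented:

import json
import sys

# Sketch of the consuming side of the pipe; not the real target-csv code.
for line in sys.stdin:
    message = json.loads(line)
    if message["type"] == "SCHEMA":
        pass  # prepare an output destination for this stream
    elif message["type"] == "RECORD":
        pass  # write message["record"] to the destination
    elif message["type"] == "STATE":
        pass  # persist message["value"] so the tap can resume later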
Modular ingestion pipelines
my-packaged-tap | target-csv
my-packaged-tap | target-google-sheets
my-packaged-tap | target-postgresql --config conf.json
tap-custom-google-scraper | target-postgresql --config headlines.json
Keeping track with state messages
id  name     last_updated_on
1   Adrian   2019-06-14T14:00:04.000+02:00
2   Ruanne   2019-06-16T18:33:21.000+02:00
3   Hillary  2019-06-14T10:05:12.000+02:00
singer.write_state(value={"max-last-updated-on": some_variable})
Running this tap-mydelta on 2019-06-14 at 12:00:00.000+02:00 (when the 2nd row wasn’t yet present and the 1st row had not yet received its 14:00 update) would emit:
{"type": "STATE", "value": {"max-last-updated-on": "2019-06-14T10:05:12.000+02:00"}}
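On the next run, the tap can read that persisted state back and emit only rows updated after the bookmark. A minimal sketch, assuming a hypothetical in-memory table and a previously stored state value:

import singer

# Assumed: the state value persisted by the previous run.
previous_state = {"max-last-updated-on": "2019-06-14T10:05:12.000+02:00"}

# Assumed: a stand-in for the source table shown above.
table = [
    {"id": 1, "name": "Adrian", "last_updated_on": "2019-06-14T14:00:04.000+02:00"},
    {"id": 2, "name": "Ruanne", "last_updated_on": "2019-06-16T18:33:21.000+02:00"},
    {"id": 3, "name": "Hillary", "last_updated_on": "2019-06-14T10:05:12.000+02:00"},
]

# ISO-8601 timestamps with identical offsets compare correctly as strings.
new_rows = [row for row in table
            if row["last_updated_on"] > previous_state["max-last-updated-on"]]

for row in new_rows:
    singer.write_record(stream_name="DC_employees", record=row)

if new_rows:
    singer.write_state(value={"max-last-updated-on":
                              max(row["last_updated_on"] for row in new_rows)})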
Let’s practice!