On the importance of tests
BUILDING DATA ENGINEERING PIPELINES IN PYTHON
Oliver Willekens
Data Engineer at Data Minded
Software tends to change
Common reasons for change:
new functionality desired
bugs need to get squashed
performance needs to be improved
Core functionality rarely evolves
How to ensure stability in light of changes?
Rationale behind testing
improves the chance of code being correct in the future
prevents introducing breaking changes
raises confidence (not a guarantee) that code is correct now
asserts that actuals match expectations
serves as the most up-to-date documentation
a form of documentation that is always in sync with what’s running
The test pyramid: where to invest your efforts
Testing takes time
thinking what to test
writing tests
running tests
Testing has a high return on investment
when targeted at the correct layer
when testing the non-trivial parts, e.g. computing the distance between 2 coordinates, rather than uppercasing a first name
© Martin Fowler, “TestPyramid”
Let’s have this sink in!
Writing unit tests for PySpark
Our earlier Spark application is an ETL pipeline
Separate transform from extract and load
prices_with_ratings = spark.read.csv(…)  # extract
exchange_rates = spark.read.csv(…)       # extract

unit_prices_with_ratings = (prices_with_ratings
    .join(…)          # transform
    .withColumn(…))   # transform
Solution: construct DataFrames in-memory

# Extract the data
df = spark.read.csv(path_to_file)

- depends on input/output (network access, filesystem permissions, …)
- unclear how big the data is
- unclear what data goes in

from pyspark.sql import Row

purchase = Row("price", "quantity", "product")
record = purchase(12.99, 1, "cake")
df = spark.createDataFrame((record,))

+ inputs are clear
+ data is close to where it is being used (“code-proximity”)
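The same pattern of small, explicit in-memory records can be sketched without Spark; here `collections.namedtuple` stands in for `pyspark.sql.Row` (a Spark-free sketch, not the Spark API):

```python
from collections import namedtuple

# namedtuple plays the role of pyspark.sql.Row in this Spark-free sketch:
# field names and values are in plain sight, right next to the test.
Purchase = namedtuple("Purchase", ["price", "quantity", "product"])

record = Purchase(12.99, 1, "cake")
records = [record]  # spark.createDataFrame would accept such a sequence of Rows
```

The point is the same either way: the test data lives next to the code that uses it, so the reader never has to guess what goes in.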
Create small, reusable and well-named functions
unit_prices_with_ratings = (prices_with_ratings
    .join(exchange_rates, ["currency", "date"])
    .withColumn("unit_price_in_euro",
                col("price") / col("quantity")
                * col("exchange_rate_to_euro")))

def link_with_exchange_rates(prices, rates):
    return prices.join(rates, ["currency", "date"])

def calculate_unit_price_in_euro(df):
    return df.withColumn(
        "unit_price_in_euro",
        col("price") / col("quantity") * col("exchange_rate_to_euro"))
unit_prices_with_ratings = (
    calculate_unit_price_in_euro(
        link_with_exchange_rates(prices, exchange_rates)
    )
)
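To see why such small, well-named functions are easy to test, the same two transformations can be sketched in plain Python over lists of dicts (a Spark-free sketch; the function names mirror the PySpark ones above, but each record is a plain dict instead of a DataFrame row):

```python
def link_with_exchange_rates(prices, rates):
    # inner join on (currency, date), like the DataFrame join above
    rate_by_key = {(r["currency"], r["date"]): r["exchange_rate_to_euro"]
                   for r in rates}
    return [{**p, "exchange_rate_to_euro": rate_by_key[(p["currency"], p["date"])]}
            for p in prices
            if (p["currency"], p["date"]) in rate_by_key]

def calculate_unit_price_in_euro(records):
    return [{**r, "unit_price_in_euro":
             r["price"] / r["quantity"] * r["exchange_rate_to_euro"]}
            for r in records]

prices = [{"currency": "USD", "date": "2019-01-01", "price": 10, "quantity": 5}]
rates = [{"currency": "USD", "date": "2019-01-01", "exchange_rate_to_euro": 2.0}]
result = calculate_unit_price_in_euro(link_with_exchange_rates(prices, rates))
```

Because each function takes its inputs as arguments and returns a new value, composing and testing them needs no setup beyond constructing a couple of records.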
Testing a single unit

def test_calculate_unit_price_in_euro():
    record = dict(price=10,
                  quantity=5,
                  exchange_rate_to_euro=2.)
    df = spark.createDataFrame([Row(**record)])

    result = calculate_unit_price_in_euro(df)

    expected_record = Row(**record, unit_price_in_euro=4.)
    expected = spark.createDataFrame([expected_record])

    assertDataFrameEqual(result, expected)
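`assertDataFrameEqual` is assumed here to come from `pyspark.testing` (available from Spark 3.5). On older versions, a minimal stand-in can compare the schema and the collected rows; a sketch that relies only on a DataFrame's `.schema` attribute and `.collect()` method:

```python
def assert_dataframe_equal(result, expected):
    # Sketch of a fallback for pyspark.testing's assertDataFrameEqual:
    # compare schemas first, then rows while ignoring row order.
    assert result.schema == expected.schema, "schemas differ"
    assert sorted(result.collect()) == sorted(expected.collect()), "rows differ"
```

Sorting the collected rows makes the comparison order-insensitive, which matters because Spark gives no ordering guarantee unless you ask for one.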
Take home messages
1. Interacting with external data sources is costly
2. Creating in-memory DataFrames makes testing easier
the data is in plain sight,
focus is on just a small number of examples.
3. Creating small and well-named functions leads to more reusability and easier testing.
Let’s practice!
Continuous testing
Running a test suite

Execute tests in Python with one of:

in stdlib: unittest, doctest
3rd party: pytest, nose

Core task: assert or raise. Examples:

assert computed == expected

with pytest.raises(ValueError):  # pytest specific
    ...
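A minimal, Spark-free illustration of both core tasks (the function under test is hypothetical; the error check is written with the stdlib alone, where `pytest.raises` would normally be used):

```python
# Hypothetical function under test
def unit_price(price, quantity):
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return price / quantity

# assert: actuals match expectations
def test_unit_price():
    assert unit_price(12.99, 1) == 12.99

# raise: with pytest this is `with pytest.raises(ValueError): unit_price(10, 0)`;
# the stdlib equivalent spells out the same expectation:
def test_unit_price_rejects_nonpositive_quantity():
    try:
        unit_price(10, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")

test_unit_price()
test_unit_price_rejects_nonpositive_quantity()
```

A test runner like pytest discovers functions named `test_*` and calls them for you; the explicit calls at the bottom are only there so the sketch runs standalone.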
Manually triggering tests
In a Unix shell:
cd ~/workspace/my_good_python_project
pytest .
# Lots of output…
== 19 passed, 2 warnings in 36.80 seconds ==
cd ~/workspace/my_bad_python_project
pytest .
# Lots of output…
== 3 failed, 1 passed in 6.72 seconds ==
Note: Spark increases time to run unit tests.
Automating tests
Problem:
forget to run unit tests when making changes
Solution:
Automation
How:
Git -> configure hooks
Configure the CI/CD pipeline to run tests automatically
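One way to wire this into Git is a client-side hook (a sketch: the path below follows Git's standard hook layout, and the file must be executable, e.g. `chmod +x .git/hooks/pre-push`):

```shell
#!/bin/sh
# .git/hooks/pre-push — run the test suite before every push (sketch);
# a non-zero exit status from pytest aborts the push.
exec pytest .
```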
CI/CD
Continuous Integration:
get code changes integrated with the master branch regularly.
Continuous Delivery:
Create “artifacts” (deliverables like documentation, but also programs) that can be deployed into
production without breaking things.
Configuring a CI/CD tool

CircleCI looks for .circleci/config.yml. Example:

jobs:
  test:
    docker:
      - image: circleci/python:3.6.4
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: pytest .

Often:
1. checkout code
2. install test & build requirements
3. run tests
4. package/build the software artefacts
5. deploy the artefacts (update docs / install app / …)
Let’s practice!