On the importance of tests
BUILDING DATA ENGINEERING PIPELINES IN PYTHON
Oliver Willekens
Data Engineer at Data Minded
Software tends to change
Common reasons for change:
new functionality desired
bugs need to get squashed
performance needs to be improved
Core functionality rarely evolves
How to ensure stability in light of changes?
Rationale behind testing
improves the chance of code being correct in the future
prevents introducing breaking changes
raises confidence (not a guarantee) that code is correct now
asserts that actuals match expectations
serves as the most up-to-date documentation
a form of documentation that is always in sync with what’s running
The test pyramid: where to invest your efforts
Testing takes time
thinking what to test
writing tests
running tests
Testing has a high return on investment
when targeted at the correct layer
when testing the non-trivial parts, e.g. computing the distance between 2 coordinates, rather than uppercasing a first name
© Martin Fowler, “TestPyramid”
Let’s have this sink in!
Writing unit tests for PySpark
Our earlier Spark application is an ETL pipeline
Separate transform from extract and load
prices_with_ratings = spark.read.csv(…)  # extract
exchange_rates = spark.read.csv(…)       # extract

unit_prices_with_ratings = (prices_with_ratings
    .join(…)          # transform
    .withColumn(…))   # transform
Solution: construct DataFrames in-memory

# Extract the data
df = spark.read.csv(path_to_file)

- depends on input/output (network access, filesystem permissions, …)
- unclear how big the data is
- unclear what data goes in

from pyspark.sql import Row

purchase = Row("price", "quantity", "product")
record = purchase(12.99, 1, "cake")
df = spark.createDataFrame((record,))

+ inputs are clear
+ data is close to where it is being used (“code-proximity”)
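The same pattern of small, explicit in-memory records can be sketched without Spark; here `collections.namedtuple` stands in for `pyspark.sql.Row` (a Spark-free sketch, not the Spark API):

```python
from collections import namedtuple

# namedtuple plays the role of pyspark.sql.Row in this Spark-free sketch:
# field names and values are in plain sight, right next to the test.
Purchase = namedtuple("Purchase", ["price", "quantity", "product"])

record = Purchase(12.99, 1, "cake")
records = [record]  # spark.createDataFrame would accept such a sequence of Rows
```

The point is the same either way: the test data lives next to the code that uses it, so the reader never has to guess what goes in.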
Create small, reusable and well-named functions
unit_prices_with_ratings = (prices_with_ratings
    .join(exchange_rates, ["currency", "date"])
    .withColumn("unit_price_in_euro",
                col("price") / col("quantity")
                * col("exchange_rate_to_euro")))

def link_with_exchange_rates(prices, rates):
    return prices.join(rates, ["currency", "date"])

def calculate_unit_price_in_euro(df):
    return df.withColumn(
        "unit_price_in_euro",
        col("price") / col("quantity") * col("exchange_rate_to_euro"))
unit_prices_with_ratings = (
    calculate_unit_price_in_euro(
        link_with_exchange_rates(prices, exchange_rates)
    )
)
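To see why such small, well-named functions are easy to test, the same two transformations can be sketched in plain Python over lists of dicts (a Spark-free sketch; the function names mirror the PySpark ones above, but each record is a plain dict instead of a DataFrame row):

```python
def link_with_exchange_rates(prices, rates):
    # inner join on (currency, date), like the DataFrame join above
    rate_by_key = {(r["currency"], r["date"]): r["exchange_rate_to_euro"]
                   for r in rates}
    return [{**p, "exchange_rate_to_euro": rate_by_key[(p["currency"], p["date"])]}
            for p in prices
            if (p["currency"], p["date"]) in rate_by_key]

def calculate_unit_price_in_euro(records):
    return [{**r, "unit_price_in_euro":
             r["price"] / r["quantity"] * r["exchange_rate_to_euro"]}
            for r in records]

prices = [{"currency": "USD", "date": "2019-01-01", "price": 10, "quantity": 5}]
rates = [{"currency": "USD", "date": "2019-01-01", "exchange_rate_to_euro": 2.0}]
result = calculate_unit_price_in_euro(link_with_exchange_rates(prices, rates))
```

Because each function takes its inputs as arguments and returns a new value, composing and testing them needs no setup beyond constructing a couple of records.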
Testing a single unit

def test_calculate_unit_price_in_euro():
    record = dict(price=10,
                  quantity=5,
                  exchange_rate_to_euro=2.)
    df = spark.createDataFrame([Row(**record)])

    result = calculate_unit_price_in_euro(df)

    expected_record = Row(**record, unit_price_in_euro=4.)
    expected = spark.createDataFrame([expected_record])

    assertDataFrameEqual(result, expected)
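`assertDataFrameEqual` is assumed here to come from `pyspark.testing` (available from Spark 3.5). On older versions, a minimal stand-in can compare the schema and the collected rows; a sketch that relies only on a DataFrame's `.schema` attribute and `.collect()` method:

```python
def assert_dataframe_equal(result, expected):
    # Sketch of a fallback for pyspark.testing's assertDataFrameEqual:
    # compare schemas first, then rows while ignoring row order.
    assert result.schema == expected.schema, "schemas differ"
    assert sorted(result.collect()) == sorted(expected.collect()), "rows differ"
```

Sorting the collected rows makes the comparison order-insensitive, which matters because Spark gives no ordering guarantee unless you ask for one.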
Take home messages
1. Interacting with external data sources is costly
2. Creating in-memory DataFrames makes testing easier
the data is in plain sight,
focus is on just a small number of examples.
3. Creating small and well-named functions leads to more reusability and easier testing.
Let’s practice!
Continuous testing
Running a test suite

Execute tests in Python with one of:

in stdlib: unittest, doctest
3rd party: pytest, nose

Core task: assert or raise. Examples:

assert computed == expected

with pytest.raises(ValueError):  # pytest specific
    ...
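A minimal, Spark-free illustration of both core tasks (the function under test is hypothetical; the error check is written with the stdlib alone, where `pytest.raises` would normally be used):

```python
# Hypothetical function under test
def unit_price(price, quantity):
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return price / quantity

# assert: actuals match expectations
def test_unit_price():
    assert unit_price(12.99, 1) == 12.99

# raise: with pytest this is `with pytest.raises(ValueError): unit_price(10, 0)`;
# the stdlib equivalent spells out the same expectation:
def test_unit_price_rejects_nonpositive_quantity():
    try:
        unit_price(10, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")

test_unit_price()
test_unit_price_rejects_nonpositive_quantity()
```

A test runner like pytest discovers functions named `test_*` and calls them for you; the explicit calls at the bottom are only there so the sketch runs standalone.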
Manually triggering tests
In a Unix shell:
cd ~/workspace/my_good_python_project
pytest .
# Lots of output…
== 19 passed, 2 warnings in 36.80 seconds ==
cd ~/workspace/my_bad_python_project
pytest .
# Lots of output…
== 3 failed, 1 passed in 6.72 seconds ==
Note: Spark increases time to run unit tests.
Automating tests
Problem:
forget to run unit tests when making changes
Solution:
Automation
How:
Git -> configure hooks
Configure the CI/CD pipeline to run tests automatically
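One way to wire this into Git is a client-side hook (a sketch: the path below follows Git's standard hook layout, and the file must be executable, e.g. `chmod +x .git/hooks/pre-push`):

```shell
#!/bin/sh
# .git/hooks/pre-push — run the test suite before every push (sketch);
# a non-zero exit status from pytest aborts the push.
exec pytest .
```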
CI/CD
Continuous Integration:
get code changes integrated with the master branch regularly.
Continuous Delivery:
Create “artifacts” (deliverables like documentation, but also programs) that can be deployed into
production without breaking things.
Configuring a CI/CD tool

CircleCI looks for .circleci/config.yml. Example:

jobs:
  test:
    docker:
      - image: circleci/python:3.6.4
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: pytest .

Often:
1. checkout code
2. install test & build requirements
3. run tests
4. package/build the software artefacts
5. deploy the artefacts (update docs / install app / …)
Let’s practice!