Introduction to
Data Engineering
Agenda
Explore the role of a data engineer
Analyze data engineering challenges
Intro to BigQuery
Data Lakes and Data Warehouses
Transactional Databases vs Data Warehouses
Partner effectively with other data teams
● Manage data access and governance
● Build production-ready pipelines
Review GCP customer case study
Lab: Analyzing Data with BigQuery
A data engineer builds data pipelines to enable
data-driven decisions
● Get the data to where it can be useful
● Get the data into a usable condition
● Add new value to the data
● Manage the data
● Productionize data processes
So… how do we get the raw data from multiple systems, and where can we store it durably?
A data lake brings together data from across the
enterprise into a single location
[Diagram: Raw Data (RDBMS, spreadsheets, offline files, other systems and apps) → Replicate → Data Lake]
Key considerations when building a Data Lake
1. Can your data lake handle all the types of data you have?
2. Can it scale to meet the demand?
3. Does it support high-throughput ingestion?
4. Is there fine-grained access control to objects?
5. Can other tools connect easily?
We need an elastic data container that is flexible and durable to stage all our data …
Cloud Storage is designed for 99.999999999% annual durability
● Content storage and delivery
● Backup
● Analytics and ML
● Replace/decommission infrastructure
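To get a feel for what eleven nines of annual durability means, the short sketch below turns the quoted percentage into an expected number of lost objects per year. It assumes, purely for illustration, that the figure can be read as an independent per-object annual survival probability; the function names and object counts are hypothetical and are not part of any Google Cloud API.

```python
from fractions import Fraction

# Designed annual durability quoted on the slide: 99.999999999% (eleven nines).
# Fractions keep the arithmetic exact instead of relying on floating point.
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected objects lost per year, assuming independent per-object loss."""
    return object_count * (1 - ANNUAL_DURABILITY)


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    for count in (1_000_000, 1_000_000_000):
        losses = expected_annual_losses(count)
        years = years_per_expected_loss(count)
        print(f"{count} objects: {float(losses)} expected losses/year, "
              f"one loss every {float(years)} years on average")
```

Under this simplified reading, storing one billion objects corresponds to an expected loss of one object roughly every hundred years.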
Quickly create buckets with Cloud Shell
gsutil mb gs://your-project-name
What if your data is not usable in its original form?
SOME ASSEMBLY REQUIRED
ETL Data Processing: Extract, Transform, and Load
● Cloud Dataproc
● Cloud Dataflow
What if your data arrives continuously and endlessly?
THIS DATA DOES NOT WAIT
Streaming Data Processing
● Cloud Pub/Sub
● Cloud Dataflow
● BigQuery
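Because streaming data never "finishes", results have to be emitted while events are still arriving. The sketch below illustrates one common approach, fixed (tumbling) time windows, in plain Python. It is not the Pub/Sub or Dataflow API; the event shape, the function name, and the assumption that events arrive in time order with no late data are all simplifications made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (event time in seconds, key); hypothetical shape


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts per key) each time a window closes.

    Assumes events arrive in event-time order (no late or out-of-order data).
    Windows that contain no events are skipped.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = window_start
        counts[key] += 1
    # Only reached if the source ends; a truly endless stream never gets here.
    if current_start is not None:
        yield current_start, dict(counts)


if __name__ == "__main__":
    sample = [(0, "a"), (5, "b"), (9, "a"), (12, "a"), (25, "b")]
    for start, window in tumbling_window_counts(sample, window_seconds=10):
        print(start, window)
```

The generator emits each window as soon as the first event of the next window is seen, which is why it works on an unbounded input.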
Common challenges encountered by data engineers
● Access to data
● Data accuracy and quality
● Availability of computational resources
● Query performance
Challenge: Consolidating disparate datasets and data
formats, and managing access at scale
Getting insights across multiple datasets is difficult
without a data lake
● Data is scattered across Google Analytics 360, CRM, and Campaign Manager products, among other sources.
● Customer and sales data is stored in a CRM system.
● No common tool exists to analyze data and share results with the rest of the organization.
● Some data is not in a queryable format.
Data is often siloed in many upstream source systems
Example Query: Give me all the in-store promotions for recent orders and their inventory levels
Each piece of data is stored in a separate system with restricted access
Challenge: Cleaning, formatting, and getting the data
ready for useful business insights in a data warehouse
Assume that any raw data from source systems needs to be
cleaned and transformed and stored in a data warehouse
Query: Give me the best performing in-store promotions in France
● Missing data and all timestamps stored as UTC
● Promotion list stored as .csv files or manual spreadsheets
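As a rough illustration of the cleaning step behind that query, the sketch below reads a small promotion list in CSV form, drops rows with missing revenue, marks the timestamps as UTC, and ranks the remaining promotions for one country. The column names and sample rows are invented for this example; real source data would look different and would usually be cleaned in a pipeline tool rather than a script.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical raw export: one row has a missing revenue value.
RAW_CSV = """promotion,country,started_at,revenue
Spring Sale,FR,2021-03-01T09:00:00,1200.50
Summer Promo,FR,2021-06-15T08:30:00,
Winter Deal,DE,2021-12-01T10:00:00,900.00
Back to School,FR,2021-09-01T07:45:00,1875.25
"""


def clean_promotions(raw_csv: str) -> list:
    """Parse the CSV, skip rows with missing revenue, and type the fields."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        if not record["revenue"]:
            continue  # missing data: skip rather than guess a value
        rows.append({
            "promotion": record["promotion"].strip(),
            "country": record["country"].strip().upper(),
            # Source timestamps are stored as UTC, so make that explicit.
            "started_at": datetime.fromisoformat(record["started_at"])
                                  .replace(tzinfo=timezone.utc),
            "revenue": float(record["revenue"]),
        })
    return rows


def best_promotions(rows: list, country: str) -> list:
    """Return the promotions for one country, highest revenue first."""
    matching = [row for row in rows if row["country"] == country]
    return sorted(matching, key=lambda row: row["revenue"], reverse=True)


if __name__ == "__main__":
    for row in best_promotions(clean_promotions(RAW_CSV), "FR"):
        print(row["promotion"], row["revenue"], row["started_at"].isoformat())
```

Skipping incomplete rows is only one possible policy; depending on the business question, missing values might instead be flagged or backfilled from another source.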
Challenge: Ensuring you have the compute capacity
to meet peak demand for your team
Challenge: Data engineers need to manage server and cluster
capacity when running on-premises
[Chart: On-Premises Compute Capacity. Resources over Time, comparing fixed Capacity with Consumption. Under-provisioned: wasting time. Under-utilized: wasting $$$.]
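The trade-off in the capacity chart can be made concrete with a little arithmetic. The sketch below compares a fixed capacity against a hypothetical demand curve and totals the two kinds of waste: demand that could not be served (under-provisioned) and capacity that sat idle (under-utilized). The demand numbers and the unit ("compute units") are made up for illustration.

```python
from typing import Sequence, Tuple


def capacity_gaps(hourly_demand: Sequence[int], fixed_capacity: int) -> Tuple[int, int]:
    """Return (unmet_demand, idle_capacity) summed over all hours.

    unmet_demand:  demand above the fixed capacity (users wait: wasted time)
    idle_capacity: capacity above the demand (paid for but unused: wasted money)
    """
    unmet = sum(max(demand - fixed_capacity, 0) for demand in hourly_demand)
    idle = sum(max(fixed_capacity - demand, 0) for demand in hourly_demand)
    return unmet, idle


if __name__ == "__main__":
    # Hypothetical demand in compute units for six consecutive hours.
    demand = [20, 35, 80, 120, 60, 25]
    for capacity in (70, 120):
        unmet, idle = capacity_gaps(demand, capacity)
        print(f"capacity={capacity}: unmet={unmet}, idle={idle}")
```

Sizing for the peak (120 units) removes the unmet demand but leaves far more idle capacity, which is the dilemma a fixed on-premises cluster forces.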
Challenge: Queries need to be optimized for
performance (caching, parallel execution)
Challenge: Managing query performance
on-premises comes with added overhead
● Choosing a query engine
● Continually patching and updating query engine software
● Managing clusters and when to re-cluster
● Optimizing for concurrent queries and quota / demand between teams
Is there a better way to manage server overhead so we can focus on insights?
BigQuery is Google’s data warehouse solution
● Data warehouse: BigQuery replaces a typical data warehouse hardware setup
● Data mart: BigQuery organizes data tables into units called datasets
● Data lake: BigQuery defines schemas and issues queries directly on external data sources
● Tables and views: Function the same way as in a traditional data warehouse
● Grants: Cloud IAM grants permission to perform specific actions
The cloud lets data engineers spend less time managing
hardware and enabling scale. Let Google do that for you
[Chart: share of effort by activity]
Typical Big Data Processing: Insights, Monitoring, Performance tuning, Resource provisioning, Utilization improvements, Handling growing scale, Deployment & configuration, Reliability
With Google: Insights
You don't need to provision resources before using BigQuery
[Chart: On-demand storage and compute. Resources over Time, showing Consumption and Allocation.]
A data engineer gets data into a usable condition
A data warehouse stores transformed data in a
usable condition for business insights
[Diagram: Raw Data → Replicate → Data Lake → Extract, Transform, Load Pipeline → Data Warehouse]
What are the key considerations when deciding between data warehouse options?
Considerations when choosing a data warehouse
include:
● Can it serve as a sink for both batch and streaming data pipelines?
● Can the data warehouse scale to meet my needs?
● How is the data organized, cataloged, and access controlled?
● Is the warehouse designed for performance?
● What level of maintenance is required by our engineering team?
BigQuery is a modern data warehouse that changes
the conventional mode of data warehousing
Traditional DW → BigQuery
● Complex ETL → Automate data delivery
● Restricted to only a few users → Make insights accessible
● Optimized for legacy BI → Build the foundation for AI
● Optimized for batch data → Tee up real-time insights
● Needs continuous patching and updates → Fully managed
● Needs an army of DBAs for operational tasks → Simplify data operations
You can simplify Data Warehouse ETL pipelines with external
connections to Cloud Storage and Cloud SQL
Federated Query sources:
● Cloud SQL (Postgres, MySQL, SQL Server)
● Cloud Storage
Demo: Federated Queries with BigQuery
Cloud SQL is fully managed SQL Server, Postgres, or MySQL
for your Relational Database (transactional RDBMS)
● Automatic encryption
● 30TB storage capacity
● 60,000 IOPS (read/write per second)
● Auto-scale and auto backup
Why not simply use Cloud SQL for reporting workflows?
RDBMS are optimized for data from a single source and
high-throughput writes vs high-read data warehouses
You will likely need and encounter both a database and data warehouse in your final architecture
Cloud SQL
● Scales to GB and TB
● Ideal for back-end database applications
● Record based storage
BigQuery
● Scales to PB
● Easily connect to external data sources for ingestion
● Column based storage
Relational database management systems (RDBMS)
are critical for managing new transactions
RDBMS are optimized for
high throughput WRITES
to RECORDS
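The difference between record based and column based storage is easier to see with a toy example. The sketch below keeps the same three orders in both layouts: appending a new transaction touches a single record in the row layout, while an analytical total only needs to read one column in the column layout. This is an in-memory illustration with invented field names, not how Cloud SQL or BigQuery are actually implemented.

```python
from typing import Dict, List

# Record (row) based layout: each transaction is stored as one complete record.
ROWS: List[dict] = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "ben", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot records into a column based layout: one list of values per field.

    Assumes every record has the same fields.
    """
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def append_record(rows: List[dict], record: dict) -> None:
    """A transactional write: one new record is added in a single step."""
    rows.append(record)


def total_amount_row_store(rows: List[dict]) -> float:
    """An analytical read on rows: every full record has to be visited."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns: Dict[str, list]) -> float:
    """The same read on columns: only the 'amount' values are scanned."""
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_amount_row_store(ROWS), total_amount_column_store(columns))
```

Both layouts give the same answer; what changes is how much data each kind of operation has to touch, which is why the two systems tend to coexist in one architecture.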
The complete picture: Source data comes into the data lake, is
processed into the data warehouse and made available for insights
[Diagram: Data Lake → Data Warehouse, which feeds:]
● Feature pipeline → ML Model
● Eng Pipeline → Other Team Data Warehouse
● BI pipeline → Reporting Dashboards
Who leads these other teams that we will have to partner with?
A data engineer builds data pipelines to enable
data-driven decisions
What teams rely on these pipelines?
Many teams rely on partnerships with data
engineering to get value out of their data
● Machine Learning Engineer
● Data Analyst
● Data Engineer
How might each of these teams rely on data engineering?
Machine learning teams need data engineers to help
them capture new features in a stable pipeline
[Diagram: Raw Data → Data Lake → Data Warehouse → Feature pipeline → ML Model]
“Are all of these features available at production time?”
“Can you help us get more features (columns) of data for our machine learning model?”
Add value: Machine learning directly in BigQuery
1 Dataset
2 Create/train
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT …
3 Evaluate
… FROM
ML.EVALUATE(MODEL `bqml_tutorial.sample_model`,
TABLE eval_table)
4 Predict/classify
… FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`,
table game_to_predict) ) AS predict
Data analysis and business intelligence teams rely on
data engineering to showcase the latest insights
[Diagram: Raw Data → Data Lake → Data Warehouse → BI pipeline → Reporting Dashboards]
“What data is available in your warehouse for us to access?”
“Our dashboards are slow, can you help us re-engineer our BI tables for better performance?”
Add value: BI Engine for dashboard performance
[Diagram: Batch or Streaming → BigQuery → BigQuery BI Engine → Data Studio, Sheets, Partner BI tools]
● No need to manage OLAP cubes or separate BI servers for dashboard performance
● Natively integrates with BigQuery streaming for real-time data refresh
● Column oriented in-memory BI execution engine
Other data engineering teams may rely on your
pipelines being timely and error free
[Diagram: Raw Data → Data Lake → Data Warehouse → Data Pipeline → Other Team Data Warehouse]
“How can we both ensure dataset uptime and performance?”
“We’re noticing high demand for your datasets -- be sure your warehouse can scale for many users”
Add value: Cloud Monitoring for performance
● View in-flight and completed queries.
● Track spending on BigQuery resources.
● Use Cloud Audit Logs to view actual job information (who executed, what query was run).
● Create alerts and send notifications.
A data engineer manages data access and governance
Data engineering must set and communicate a
responsible data governance model
[Diagram: Raw Data → Data Lake → Data Warehouse, feeding the Feature pipeline (ML Model), the Eng Pipeline (Other Team Data Warehouse), and the BI pipeline (Reporting Dashboards)]
● Who should have access?
● How is PII handled?
● How can we educate end-users on our data catalog?
Cloud Data Catalog provides managed data discovery, and the
Data Loss Prevention API helps guard PII
Data Catalog
Simplify data discovery at any scale:
Fully managed metadata management service with
no infrastructure to set up or manage
Unified view of all datasets:
Central and secure data catalog across Google
Cloud with metadata capture and tagging
Data governance foundation:
Security compliance with access level controls along
with Cloud Data Loss Prevention integration for
handling sensitive data
Demo: Finding PII in your dataset with the
DLP API
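To show the general idea of scanning text for PII and redacting it, here is a deliberately simple stand-in based on regular expressions. It does not call the DLP API and is far less capable than a managed service; the two patterns, the labels "email" and "phone", and the function names are assumptions made for this sketch only.

```python
import re
from typing import List, Tuple

# Two illustrative patterns; real PII detection needs many more and better ones.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs for every pattern hit."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings


def redact(text: str) -> str:
    """Replace every match with a placeholder naming the kind of PII found."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567."
    print(find_pii(sample))
    print(redact(sample))
```

Hand-written patterns like these miss many formats and produce false positives, which is part of the motivation for using a managed inspection service instead.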
A data engineer builds production data pipelines to
enable data-driven decisions
Data engineering owns the health and future of their
production data pipelines
[Diagram: Raw Data → Data Lake → Data Warehouse, feeding the Feature pipeline (ML Model), the Eng Pipeline (Other Team Data Warehouse), and the BI pipeline (Reporting Dashboards)]
● How can we ensure pipeline health and data cleanliness?
● How do we productionize these pipelines to minimize maintenance and maximize uptime?
● How do we respond and adapt to changing schemas and business needs?
● Are we using the latest data engineering tools and best practices?
Cloud Composer (managed Apache Airflow) is used
to orchestrate production workflows
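At its core, orchestrating a workflow means running tasks in an order that respects their dependencies. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not Airflow or Cloud Composer code; the task names and the `run_pipeline` helper are hypothetical and only mirror the pipeline stages discussed in this deck.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE: Dict[str, Set[str]] = {
    "load_to_data_lake": set(),
    "transform": {"load_to_data_lake"},
    "load_to_warehouse": {"transform"},
    "refresh_dashboards": {"load_to_warehouse"},
    "build_features": {"load_to_warehouse"},
}


def run_pipeline(
    dependencies: Dict[str, Set[str]], tasks: Dict[str, Callable[[], None]]
) -> List[str]:
    """Run every task after all of its dependencies and return the order used.

    TopologicalSorter raises graphlib.CycleError if the dependencies form a cycle.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Each task just prints its name; real tasks would move or transform data.
    demo_tasks = {name: (lambda n=name: print("running", n)) for name in PIPELINE}
    run_pipeline(PIPELINE, demo_tasks)
```

A production orchestrator adds scheduling, retries, monitoring, and alerting on top of this ordering, which is what a managed service takes off the team's hands.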
Ocado’s customer service department is
bombarded with messages
Can we use ML to prioritize these messages?
“My order is missing 12 steaks. I need them for a party I’m hosting in 6 hours.”
“I love Ocado! You’re the best.”
“My delivery is five hours late. Can I get a full refund?”
“I scheduled my delivery for 3 PM today. Something came up and I won’t be home at that time. Can I reschedule?”
Ocado’s GCP solution helps them respond to urgent customer
emails 4x faster with ML
Increased contact center efficiency enables representatives to spend extra time on high-priority tasks
AI Platform
http://www.multichannel-blog.co.uk/2017/05/03/google-the-future-of-cloud-conference-in-london-3-4th-may/
Twitter democratized data analysis using BigQuery
“We believe that users with a wide range of technical skills should be able to discover
data and have access to SQL-based analysis and visualization tools that perform well”
-- Twitter
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/democratizing-data-analysis-with-google-bigquery.html
Recap
● Data sources
● Data lakes
● Data warehouses
● Google Cloud solutions for
Data Engineering
Concept Review:
Data sources feed into a Data Lake and are processed into your Data Warehouse for analysis
[Diagram labels: Data stores, AI Platform, AI Platform Notebooks]
Here’s a useful guide for “GCP products in 4 words or less”
https://github.com/gregsramblings/google-cloud-4-words
Updated continually by Greg Wilson - Google DevRel
Lab: Using BigQuery to do Analysis
Objectives
● Execute interactive queries in the BigQuery console
● Combine and run analytics on multiple datasets