Welcome
Advanced Data Engineering
with Databricks
©2022 Databricks Inc. — All rights reserved
Learning Objectives
1. Design databases and pipelines optimized for the Databricks Lakehouse
Platform.
2. Implement efficient incremental data processing to validate and enrich data
driving business decisions and applications.
3. Leverage Databricks-native features for managing access to sensitive data
and fulfilling right-to-be-forgotten requests.
4. Manage error troubleshooting, code promotion, task orchestration, and
production job monitoring using Databricks tools.
©2022 Databricks Inc. — All rights reserved
Course Tools
Lecture
TA Help + Discussion
Lab Notebooks
Breakout Rooms
Resources
Solutions
©2022 Databricks Inc. — All rights reserved
Modules
1. Architecting for the Lakehouse
2. Bronze Ingestion Patterns
3. Promoting to Silver
4. Gold Query Layer
5. Storing Data Securely
6. Propagating Updates and Deletes
7. Orchestration and Scheduling
©2022 Databricks Inc. — All rights reserved
Welcome!
Your Instructor - Your Name
• PHOTO (placeholder)
• /in/profile
©2022 Databricks Inc. — All rights reserved
Welcome!
Let’s get to know you
• Name
• Role and team
• Length of experience with Spark and Databricks
• Motivation for attending
• Fun fact or favorite mobile app
©2022 Databricks Inc. — All rights reserved
Course Objective
Design and implement a
multi-pipeline multi-hop architecture
to enable the Lakehouse paradigm.
©2022 Databricks Inc. — All rights reserved
Our Company
moovio
©2022 Databricks Inc. — All rights reserved
Architecting for the
Lakehouse
Adopting the Lakehouse Architecture
Lakehouse Medallion Architecture
Streaming Design Patterns
©2022 Databricks Inc. — All rights reserved
Adopting the
Lakehouse
Architecture
©2022 Databricks Inc. — All rights reserved
Lakehouse
One platform to unify all of your data, analytics, and AI workloads
(Diagram: the Lakehouse positioned between the Data Lake and the Data Warehouse.)
©2022 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
(Diagram: the platform stack. Workloads for Data Engineering, BI and SQL Analytics, Data Science and ML, and Real-Time Data Applications sit on top of Data Management and Governance, the Open Data Lake, and Platform Security & Administration, over unstructured, semi-structured, structured, and streaming data.)
✓ Simple
✓ Open
✓ Collaborative
©2022 Databricks Inc. — All rights reserved
An open approach to bringing data management and governance to data lakes
• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists
(Diagram: Delta Lake bridging the Data Lake and the Data Warehouse.)
©2022 Databricks Inc. — All rights reserved
Delta Lake brings ACID to object storage
▪ Atomicity
▪ Consistency
▪ Isolation
▪ Durability
©2022 Databricks Inc. — All rights reserved
Delta Lake provides ACID
guarantees scoped to tables
©2022 Databricks Inc. — All rights reserved
Problems solved by ACID transactions in Delta
1. Hard to append data
2. Modification of existing data is difficult
3. Jobs failing midway
4. Real-time operations are hard
5. Costly to keep historical data versions
©2022 Databricks Inc. — All rights reserved
The Lakehouse
Medallion Architecture
©2022 Databricks Inc. — All rights reserved
Multi-hop Pipeline
Source:
Files or integrated systems
Bronze:
Raw data and metadata
Silver:
Validated data with atomic
grain
Gold:
Refined, aggregated data
©2022 Databricks Inc. — All rights reserved
Bronze Layer
©2022 Databricks Inc. — All rights reserved
Why is the Bronze Layer Important?
• Bronze layer replaces the traditional data lake
• Represents the full, unprocessed history of the data
• Captures the provenance (what, when, and from where) of data loaded into
the lakehouse
• Data is stored efficiently using Delta Lake
• If downstream layers later discover they need additional data, they can return to the Bronze layer to obtain it.
©2022 Databricks Inc. — All rights reserved
Bronze Layer Guiding Principles
• The goal of this layer is data capture and provenance:
• Capture exactly what is ingested, without parsing or change.
• Typically a Delta Lake table with these fields in each row:
• Date received/ingested
• Data source (filename, external system, etc)
• Text field with raw unparsed JSON, CSV, or other data
• Other metadata
• Should be append only (batch or streaming)
• Plan ahead if data must be deleted for regulatory purposes
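A minimal sketch of such a bronze append, assuming illustrative source and checkpoint paths and table name (none of these come from the course materials):

from pyspark.sql import functions as F

bronze_query = (
    spark.readStream
        .format("cloudFiles")                               # Auto Loader
        .option("cloudFiles.format", "text")                # keep the payload unparsed
        .load("/mnt/raw/events")                            # illustrative source path
        .select(
            F.current_timestamp().alias("ingest_time"),     # date received/ingested
            F.input_file_name().alias("source_file"),       # data source
            F.col("value").alias("raw_payload"))            # raw unparsed JSON/CSV text
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
        .toTable("bronze_events")                           # append-only bronze table
)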
©2022 Databricks Inc. — All rights reserved
Processing Deletes
• Retain all records when possible
• Soft-deletes if necessary
• Hard-deletes may be required by regulatory processes
©2022 Databricks Inc. — All rights reserved
Silver Layer
©2022 Databricks Inc. — All rights reserved
Why is the Silver Layer important?
• Easier to query than the non-curated Bronze “data lake”
• Data is clean
• Transactions have ACID guarantees
• Represents the “Enterprise Data Model”
• Captures the full history of business action modeled
• Each record processed is preserved
• All records can be efficiently queried
• Reduces data storage complexity, latency, and redundancy
• Built for both ETL throughput AND analytic query performance
©2022 Databricks Inc. — All rights reserved
Silver Layer Guiding Principles
• Uses Delta Lake tables (with SQL table names)
• Preserves grain of original data (no aggregation)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks passed
• Corrupt data quarantined
• Data stored to support production workloads
• Optimized for long-term retention and ad-hoc queries
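As a sketch of the deduplicating, grain-preserving silver write, assuming illustrative table and column names:

deduped_silver = (
    spark.readStream.table("bronze_events")
        .withWatermark("ingest_time", "30 seconds")          # bound the dedup state
        .dropDuplicates(["device_id", "event_time"])         # eliminate duplicate records
)

silver_query = (
    deduped_silver.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events_silver")
        .toTable("events_silver")
)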
©2022 Databricks Inc. — All rights reserved
Gold Layer
©2022 Databricks Inc. — All rights reserved
Why is the Gold Layer important?
• Powers ML applications, reporting, dashboards, ad hoc analytics
• Reduces costs associated with ad hoc queries on silver tables
• Allows fine grained permissions
• Reduces strain on production systems
• Shifts the work of refreshing query results into scheduled production workloads
©2022 Databricks Inc. — All rights reserved
Notebook
Streaming Design
Patterns
©2022 Databricks Inc. — All rights reserved
Bronze Ingestion
Patterns
Bronze Ingestion Patterns
Auto Load to Multiplex Bronze
Streaming from Multiplex Bronze
©2022 Databricks Inc. — All rights reserved
Bronze Ingestion
Patterns
©2022 Databricks Inc. — All rights reserved
Singleplex Ingestion
©2022 Databricks Inc. — All rights reserved
Multiplex Ingestion
©2022 Databricks Inc. — All rights reserved
Don't Use Kafka as Bronze
• Data retention limited by Kafka;
expensive to keep full history
• All processing happens on ingest
• If stream gets too far behind,
data is lost
• Cannot recover data (no history
to replay)
©2022 Databricks Inc. — All rights reserved
Delta Lake Bronze
©2022 Databricks Inc. — All rights reserved
Notebook
Auto Load to
Multiplex Bronze
©2022 Databricks Inc. — All rights reserved
Notebook
Streaming from
Multiplex Bronze
©2022 Databricks Inc. — All rights reserved
Promoting to Silver
Streaming Deduplication
Quality Enforcement
Slowly Changing Dimensions
Streaming Joins and Statefulness
©2022 Databricks Inc. — All rights reserved
Promoting Bronze to
Silver
©2022 Databricks Inc. — All rights reserved
Silver Layer Objectives
• Validate data quality and schema
• Enrich and transform data
• Optimize data layout and storage for downstream queries
• Provide single source of truth for analytics
©2022 Databricks Inc. — All rights reserved
Schema Enforcement & Evolution
• Enforcement prevents bad records from entering table
• Mismatch in type or field name
• Evolution allows new fields to be added
• Useful when schema changes in production/new fields added to nested data
• Cannot use evolution to remove fields
• All previous records will show newly added field as Null
• For previously written records, the underlying file isn’t modified.
• The additional field is simply defined in the metadata and dynamically read as null
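A minimal sketch of opting into evolution on a batch append (mergeSchema is the documented option; the DataFrame and table names are illustrative):

(updates_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")      # allow new fields to be added to the table schema
    .saveAsTable("heart_rate_silver"))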
©2022 Databricks Inc. — All rights reserved
Delta Lake Constraints
• Check NOT NULL or arbitrary boolean condition
• Throws exception on failure
ALTER TABLE tableName ADD CONSTRAINT constraintName
CHECK (heartRate >= 0);
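Both constraint types can be added and inspected from Python as well; a sketch with illustrative table and column names:

spark.sql("ALTER TABLE heart_rate_silver ALTER COLUMN device_id SET NOT NULL")   # fails if existing rows contain nulls
spark.sql("ALTER TABLE heart_rate_silver ADD CONSTRAINT valid_heartrate CHECK (heartrate >= 0)")

# Constraints are recorded as table properties (delta.constraints.*)
spark.sql("SHOW TBLPROPERTIES heart_rate_silver").show(truncate=False)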
©2022 Databricks Inc. — All rights reserved
Alternative Quality Check Approaches
• Add a “validation” field that captures any validation errors; a null value means validation passed.
• Quarantine data by filtering non-compliant data to alternate location
• Warn without failing by writing additional fields with constraint check results
to Delta tables
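A sketch of the warn-and-quarantine pattern, assuming an illustrative bronze_df and target tables:

from pyspark.sql import functions as F

checked_df = bronze_df.withColumn(
    "valid_heartrate", F.col("heartrate") >= 0)              # record the check result instead of failing

(checked_df.filter("valid_heartrate")
    .write.format("delta").mode("append").saveAsTable("heart_rate_silver"))

(checked_df.filter("NOT valid_heartrate")                    # quarantine non-compliant records
    .write.format("delta").mode("append").saveAsTable("heart_rate_quarantine"))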
©2022 Databricks Inc. — All rights reserved
Notebook
Streaming
Deduplication
©2022 Databricks Inc. — All rights reserved
Notebook
Quality
Enforcement
©2022 Databricks Inc. — All rights reserved
Lab
Promoting to Silver
©2022 Databricks Inc. — All rights reserved
Slowly Changing
Dimensions in the
Lakehouse
©2022 Databricks Inc. — All rights reserved
Fact Tables as Incremental Data
• Often is a time series
• No intermediate aggregations
• No overwrite/update/delete operations
• Append-only operations
©2022 Databricks Inc. — All rights reserved
Using Dimension Tables in Incremental Updates
• Delta Lake enables stream-static
joins
• Each micro-batch captures the
most recent state of joined Delta
table
• Allows modification of dimension
while maintaining downstream
composability
©2022 Databricks Inc. — All rights reserved
Slowly Changing Dimensions (SCD)
• Type 0: No changes allowed (static/append only)
E.g. static lookup table
• Type 1: Overwrite (no history retained)
E.g. historical comparisons are not needed beyond the recent past (use Delta Time Travel)
• Type 2: Add a new row for each change and mark the old row as obsolete
E.g. record product price changes over time when that history is integral to the business logic
©2022 Databricks Inc. — All rights reserved
Type 0 and Type 1
user_id | street | name
1 | 123 Oak Ave | Sam
2 | 99 Jump St | Abhi
3 | 1000 Rodeo Dr | Kasey
©2022 Databricks Inc. — All rights reserved
Type 2
user_id | street | name | valid_from | current
1 | 123 Oak Ave | Sam | 2020-01-01 | true
2 | 99 Jump St | Abhi | 2020-01-01 | false
3 | 1000 Rodeo Dr | Kasey | 2020-01-01 | false
2 | 430 River Rd | Abhi | 2021-10-10 | true
3 | 1000 Rodeo Dr | Casey | 2021-10-10 | true
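A sketch of a Type 2 merge against the table above, assuming updates_df contains only new or changed user rows and the table is named users_silver (both illustrative). The merge-key trick stages each update twice: once to close the current row, once to insert the new version.

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "users_silver")

staged = (
    updates_df.selectExpr("user_id AS merge_key", "*")       # matches and closes the current row
    .unionByName(
        updates_df.selectExpr("NULL AS merge_key", "*"))     # never matches, so it is inserted
)

(target.alias("t")
    .merge(staged.alias("s"), "t.user_id = s.merge_key AND t.current = true")
    .whenMatchedUpdate(set={"current": "false"})
    .whenNotMatchedInsert(values={
        "user_id": "s.user_id",
        "street": "s.street",
        "name": "s.name",
        "valid_from": "current_date()",
        "current": "true"})
    .execute())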
©2022 Databricks Inc. — All rights reserved
Applying SCD Principles to Facts
• Fact table usually append-only (Type 0)
• Can leverage event and processing times for append-only history
order_id | user_id | occurred_at | action | processed_time
123 | 1 | 2021-10-01 10:05:00 | ORDER_CANCELLED | 2021-10-01 10:05:30
123 | 1 | 2021-10-01 10:00:00 | ORDER_PLACED | 2021-10-01 10:06:30
©2022 Databricks Inc. — All rights reserved
Notebook
Type 2 SCD
©2022 Databricks Inc. — All rights reserved
Streaming Joins and
Statefulness
©2022 Databricks Inc. — All rights reserved
The Components of a Stateful Stream
from pyspark.sql.functions import window

windowedDF = (
  eventsDF
    .groupBy(window("eventTime", "10 minutes", "5 minutes"))  # sliding window: a stateful aggregation
    .count()                                                  # stateful result per window
    .writeStream
    .trigger(processingTime="5 minutes")                      # micro-batch trigger interval
)
©2022 Databricks Inc. — All rights reserved
Output Modes
Mode: when stateful results materialize
• Append (default): results materialize only after the watermark plus allowed lateness has passed
• Complete: results materialize every trigger; outputs the complete table
• Update: results materialize every trigger; outputs only new values
©2022 Databricks Inc. — All rights reserved
Statefulness vs. Query Progress
• Many operations are inherently stateful (stream-stream joins, deduplication, aggregation)
• Some operations just need to store incremental query progress and are not stateful (appends with simple transformations, stream-static joins, merge)
• Progress and state are stored in checkpoints and managed by driver during
query processing
©2022 Databricks Inc. — All rights reserved
Managing Stream Parameters
GOAL: Balance parameters for sustainable, optimized throughput
• Input Parameters
• Control amount of data in each micro-batch
• State Parameters
• Control amount of data required to calculate query results
• Output Parameters
• Control number and size of files written
©2022 Databricks Inc. — All rights reserved
Reasoning about Stream Dimensions
(Diagram: each Structured Streaming micro-batch compares an input of n records, the records in the mini-batch, against a state of m records, the records to be compared against.)
©2022 Databricks Inc. — All rights reserved
Input Parameters
©2022 Databricks Inc. — All rights reserved
Limiting the input dimension
(Diagram: the same input-versus-state view; input parameters limit n in O(n⨯m), the records in each mini-batch.)
©2022 Databricks Inc. — All rights reserved
Why are input parameters important?
▪ Allows you to control the mini-batch size
▪ Defaults are large
▪ Delta Lake: 1000 files per micro-batch
▪ Pub/Sub & files: No limit to input batch size
▪ Optimal mini-batch size → Optimal cluster usage
▪ Suboptimal mini-batch size → performance cliff
▪ Shuffle Spill
©2022 Databricks Inc. — All rights reserved
Per Trigger Settings
• File Source
• maxFilesPerTrigger
• Delta Lake and Auto Loader
• maxFilesPerTrigger
• maxBytesPerTrigger
• Kafka
• maxOffsetsPerTrigger
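A sketch of these options in use (paths, topic, and limit values are illustrative):

delta_stream = (
    spark.readStream
        .format("delta")
        .option("maxFilesPerTrigger", 500)         # cap files per micro-batch
        .option("maxBytesPerTrigger", "1g")        # soft cap on bytes per micro-batch
        .table("bronze_events")
)

kafka_stream = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .option("maxOffsetsPerTrigger", 100000)    # cap records per micro-batch
        .load()
)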
©2022 Databricks Inc. — All rights reserved
Shuffle Partitions with Structured Streaming
• Should match the number of cores in the largest cluster size that might be
used in production
• Number of shuffle partitions == max parallelism
• Cannot be changed without new checkpoint
• Will lose query progress and state information
• Higher shuffle partitions == more files written
• Best practice: use Delta Live Tables for streaming jobs with variable volume
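For example, the setting can be pinned before the query's first start (a minimal sketch; the right value is the core count of your largest production cluster, and sc is the notebook's preconfigured SparkContext):

# Must be set before the first stream start; changing it later requires a new checkpoint.
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)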
©2022 Databricks Inc. — All rights reserved
Tuning maxFilesPerTrigger
Base it on shuffle partition size
• Rule of thumb 1: optimal shuffle partition size is ~100-200 MB
• Rule of thumb 2: set shuffle partitions equal to the number of cores
• Use the Spark UI to tune maxFilesPerTrigger until you get ~100-200 MB per partition
• Note: size on disk is not a good proxy for size in memory; file size on disk differs from the size the data occupies in cluster memory
©2022 Databricks Inc. — All rights reserved
State Parameters
©2022 Databricks Inc. — All rights reserved
Limiting the state dimension
(Diagram: the same input-versus-state view; state parameters limit m in O(n⨯m), the records to be compared against.)
©2022 Databricks Inc. — All rights reserved
Limiting the state dimension
▪ State Store backed operations
  ▪ Stateful (windowed) aggregations
  ▪ Drop duplicates
  ▪ Stream-stream joins
▪ Delta Lake table or external system
  ▪ Stream-static join / Merge
©2022 Databricks Inc. — All rights reserved
Why are state parameters important?
▪ Optimal parameters → Optimal cluster usage
▪ If not controlled, state explosion can occur
▪ Slower stream performance over time
▪ Heavy shuffle spill (Joins/Merge)
▪ Out of memory errors (State Store backed operations)
©2022 Databricks Inc. — All rights reserved
Example Query
▪ Static Delta Lake table used in stream-static join
▪ State Store-backed windowed stateful aggregation

1. Main input stream
salesSDF = (
  spark
    .readStream
    .format("delta")
    .table("sales")
)

2. Join item category lookup
itemSalesSDF = (
  salesSDF
    .join(spark.table("items"), "item_id")
)

3. Aggregate sales per item per hour
itemSalesPerHourSDF = (
  itemSalesSDF
    .groupBy(window(..., "1 hour"), "item_category")
    .sum("revenue")
)
©2022 Databricks Inc. — All rights reserved
State Store Parameters
• Watermarking
• How much history to compare against
• Granularity
• The more granular the aggregate key / window, the more
state
• State store backend
• RocksDB / Default
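A sketch tying these parameters together; the table, columns, window width, and the Databricks RocksDB provider class are illustrative assumptions:

from pyspark.sql import functions as F

# Assumed Databricks option: switch the state store backend to RocksDB
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")

hourly_sales = (
    spark.readStream.table("item_sales")
        .withWatermark("sale_time", "2 hours")                        # how much history to keep in state
        .groupBy(F.window("sale_time", "1 hour"), "item_category")    # granularity of the aggregate key/window
        .sum("revenue")
)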
©2022 Databricks Inc. — All rights reserved
Stream-Static Join & Merge
• Join driven by streaming data
• Join triggers shuffle
• Join itself is stateless
• Control state information with a predicate
• Goal is to broadcast the static table to the streaming data
• Broadcasting puts all data on each node

1. Main input stream
salesSDF = (
  spark
    .readStream
    .format("delta")
    .table("sales")
)

2. Join item category lookup
itemSalesSDF = (
  salesSDF
    .join(
      spark.table("items")
        .filter("category = 'Food'"),   # Predicate
      on=["item_id"]
    )
)
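The broadcast can also be requested explicitly; a minimal sketch reusing the illustrative names above:

from pyspark.sql import functions as F

itemSalesSDF = salesSDF.join(
    F.broadcast(spark.table("items").filter("category = 'Food'")),   # ship the small, filtered side to every node
    on=["item_id"]
)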
©2022 Databricks Inc. — All rights reserved
Output Parameters
©2022 Databricks Inc. — All rights reserved
Limiting the output dimension
(Diagram: the same input-versus-state view; output parameters control the number and size of files each micro-batch writes.)
©2022 Databricks Inc. — All rights reserved
Why are output parameters important?
• Streaming jobs tend to create many small files
• Reading a folder with many small files is slow
• Poor performance for downstream jobs, self-joins, and merge
• Output type can impact state information retained
• Merge statements with full table scans increase state
©2022 Databricks Inc. — All rights reserved
Delta Lake Output Optimizations
• Optimized Writes
• Auto Compaction
• delta.tuneFileSizesForRewrites
• Insert-only merge
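These can be enabled per table, for example (a sketch; the table name is illustrative):

spark.sql("""
    ALTER TABLE item_sales_gold SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',   -- Optimized Writes
        'delta.autoOptimize.autoCompact' = 'true',     -- Auto Compaction
        'delta.tuneFileSizesForRewrites' = 'true'      -- bias file sizes toward frequent MERGE rewrites
    )
""")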
©2022 Databricks Inc. — All rights reserved
Notebook
Stream Static Joins
©2022 Databricks Inc. — All rights reserved
Gold Query Layer
Making Data Available for Analytics
Stored Views
Materialized Gold Tables
©2022 Databricks Inc. — All rights reserved
Lakehouse and the
Query Layer
©2022 Databricks Inc. — All rights reserved
What is the Query Layer?
• Stores refined datasets for use by data scientists
• Serves results for pre-computed ML models
• Contains enriched, aggregated views for use by analysts
• Star-schemas and data marts for BI queries
• Powers data-driven applications, dashboards, and reports
Also called the serving layer; gold tables exist at this level.
©2022 Databricks Inc. — All rights reserved
Tables and Views in the Query Layer
• Gold tables
• Saved views
• Databricks SQL saved queries
• Tables in RDS/NoSQL database
©2022 Databricks Inc. — All rights reserved
Gold Tables
• Refined, typically aggregated views of data saved using Delta Lake
• Can be updated with batch or stream processing
• Configured and scheduled as part of ETL workloads
• Results computed on write
• Read is simple deserialization; additional filters can be applied with
pushdowns
©2022 Databricks Inc. — All rights reserved
Saved Views
• Views can be registered to databases and made available to users using
ACLs
• Views are logical queries against source tables
• Logic is executed each time a view is queried
• Views registered against Delta tables will always query the most current
valid version of the table
©2022 Databricks Inc. — All rights reserved
Databricks SQL Saved Queries
• Similar to saved views in terms of when logic is executed
• Auto-detect changes in upstream Delta tables
• Uses the Query Result Cache feature
• Caching allows reviewing dashboards and downloading CSVs without an
active SQL endpoint
• Easy to identify data sources (SQL present in saved query)
• Can be scheduled using Databricks SQL functionality
• Can automatically refresh Databricks SQL dashboards
©2022 Databricks Inc. — All rights reserved
Databricks SQL Endpoints
• Clusters optimized for SQL queries
• Serverless option for quick cluster startup and autoscaling
• Photon-enabled for vectorized execution
• Enhanced throughput for exchanging data with external SQL systems
• Optimized connectors for popular BI tools
©2022 Databricks Inc. — All rights reserved
Tables in External Systems
• Many downstream applications
may require refined data in a
different system
• NoSQL databases
• RDS
• Pub/sub messaging
• Must decide where single source of
truth lives
©2022 Databricks Inc. — All rights reserved
Recommendations
• Use saved views when filtering silver tables
CREATE VIEW sales_florida_2020 AS
SELECT *
FROM sales
WHERE state = 'FL' AND year = 2020;
©2022 Databricks Inc. — All rights reserved
Recommendations
• Use Delta tables for common partial aggregates
CREATE TABLE store_item_sales AS
SELECT store_id, item_id, department, date,
city, state, region, country,
SUM(quantity) AS items_sold,
SUM(price) AS revenue
FROM sales
INNER JOIN stores ON sales.store_id = stores.store_id
INNER JOIN items ON sales.item_id = items.item_id
GROUP BY store_id, item_id, department, date,
city, state, region, country
©2022 Databricks Inc. — All rights reserved
Recommendations
• Share Databricks SQL queries and dashboards within teams
SELECT date, hour, SUM(quantity * price) hourly_sales
FROM sales
WHERE store_id = 42
AND date > date_sub(current_date(), 14)
GROUP BY date, hour;
©2022 Databricks Inc. — All rights reserved
Recommendations
• Create views with column aliases
CREATE VIEW sales_marketing AS
SELECT
id cust_id,
date deal_date,
total sale_amount
FROM sales;
©2022 Databricks Inc. — All rights reserved
Recommendations
• Analyze query history to identify new candidate gold tables
• Admins can access Databricks SQL query history
• Running analytics against query history can help identify
• Queries that are long-running
• Queries that are run regularly
• Queries that use common datasets
• Transitioning these queries to gold tables and scheduling as engineering
jobs may reduce total operating costs
• Query history can also be useful for identifying predicates used most
frequently; useful for ZORDER indexing during optimization
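For instance, once frequent predicates are known, the gold table can be optimized on them (a sketch with illustrative table and column names):

spark.sql("OPTIMIZE store_item_sales ZORDER BY (date, store_id)")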
©2022 Databricks Inc. — All rights reserved
Notebook
Stored Views
©2022 Databricks Inc. — All rights reserved
Notebook
Materialized Gold
Tables
©2022 Databricks Inc. — All rights reserved
Storing Data Securely
PII & Regulatory Compliance
Storing PII Securely
Granting Privileged Access to PII
©2022 Databricks Inc. — All rights reserved
PII & Regulatory
Compliance
©2022 Databricks Inc. — All rights reserved
Regulatory Compliance
• EU = GDPR (General Data Protection Regulation)
• US = CCPA (California Consumer Privacy Act)
• Simplified Compliance Requirements
• Inform customers what personal information is collected
• Delete, update, or export personal information as requested
• Process requests in a timely fashion (30 days)
©2022 Databricks Inc. — All rights reserved
How Lakehouse Simplifies Compliance
• Reduce copies of your PII
• Find personal information quickly
• Reliably change, delete, or export data
• Use transaction logs for auditing
©2022 Databricks Inc. — All rights reserved
Manage Access to PII
• Control access to storage locations with cloud permissions
• Limit human access to raw data
• Pseudonymize records on ingestion
• Use table ACLs to manage user permissions
• Configure dynamic views for data redaction
• Remove identifying details from demographic views
©2022 Databricks Inc. — All rights reserved
Pseudonymization
©2022 Databricks Inc. — All rights reserved
Pseudonymization
• Switches original data point with pseudonym for later re-identification
• Only authorized users will have access to keys/hash/table for
re-identification
• Protects datasets at the record level for machine learning
• A pseudonym is still considered to be personal data according to the GDPR
©2022 Databricks Inc. — All rights reserved
Hashing
• Apply SHA or other hash to all PII
• Add random string "salt" to values before hashing
• Databricks secrets can be leveraged for obfuscating salt value
• Leads to some increase in data size
• Some operations will be less efficient
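A sketch of salted hashing with a secret-backed salt; the secret scope/key and the table/column names are illustrative:

from pyspark.sql import functions as F

salt = dbutils.secrets.get(scope="pii", key="salt")        # keep the salt out of code and logs

pseudonymized = (
    spark.table("registered_users")
        .withColumn("alt_id",
            F.sha2(F.concat(F.col("user_id").cast("string"), F.lit(salt)), 256))
        .drop("user_id")                                   # drop the original identifier
)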
©2022 Databricks Inc. — All rights reserved
Tokenization
• Converts all PII to keys
• Values are stored in a secure lookup table
• Slow to write, but fast to read
• De-identified data stored in fewer bytes
©2022 Databricks Inc. — All rights reserved
Tokenization
©2022 Databricks Inc. — All rights reserved
(Diagram: fulfilling GDPR requests with a token vault vs. without a vault.)
©2022 Databricks Inc. — All rights reserved
Anonymization
©2022 Databricks Inc. — All rights reserved
Anonymization
• Protects entire tables, databases, or data catalogues, mostly for business intelligence
• Personal data is irreversibly altered so that a data subject can no longer be identified directly or indirectly
• Real-world scenarios usually combine more than one technique
©2022 Databricks Inc. — All rights reserved
Data Suppression
• Exclude columns with PII from views
• Remove rows where demographic groups are too small
• Use dynamic access controls to provide conditional access to full data
©2022 Databricks Inc. — All rights reserved
Generalization
• Categorical generalization
• Binning
• Truncating IP addresses
• Rounding
©2022 Databricks Inc. — All rights reserved
Categorical Generalization
• Removes precision from data
• Move from specific categories to more general
• Retain level of specificity that still provides insight without revealing identity
©2022 Databricks Inc. — All rights reserved
Binning
• Identify meaningful divisions in data and group on boundaries
• Allows access to demographic groups without being able to identify
individual PII
• Can use domain expertise to identify groups of interest
©2022 Databricks Inc. — All rights reserved
Truncating IP addresses
• Rounding IP address to /24 CIDR
• Replace last byte with 0
• Generalizes IP geolocation to city or
neighbourhood level
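A sketch of the last-byte truncation; the DataFrame and column names are illustrative:

from pyspark.sql import functions as F

truncated = df.withColumn(
    "ip_truncated",
    F.regexp_replace("ip_address", r"\.\d+$", ".0"))       # 203.0.113.42 -> 203.0.113.0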
©2022 Databricks Inc. — All rights reserved
Rounding
• Apply generalized rounding rules to all number data, based on required
precision for analytics
• Provides gross estimates without specific values
• Example:
• Integers are rounded to multiples of 5
• Values less than 2.5 are rounded to 0 or omitted from reports
• Consider suppressing outliers
©2022 Databricks Inc. — All rights reserved
Notebook
Creating a
Pseudonymized PII
Lookup Table
©2022 Databricks Inc. — All rights reserved
Notebook
Storing PII Securely
©2022 Databricks Inc. — All rights reserved
Managing ACLs for the
Enterprise Lakehouse
©2022 Databricks Inc. — All rights reserved
Challenges with Analytics in the Data Lake
● Difficult to provide access on need-to-know basis
● Unclear governance & ownership of data assets
● Data proliferation resulting in possible compliance issues
● Difficult to ensure integrity of data and reporting
©2022 Databricks Inc. — All rights reserved
The Goal
Provide access to valuable data to users across the
company in a secure manner.
Ensure users are only able to access the data they’re
entitled to.
Detect whether any data has been altered or manipulated.
©2022 Databricks Inc. — All rights reserved
High Level Design
(Diagram: SCIM provisioning from SSO feeds the workspaces. The DEV Data Eng workspace has full access to DEV data and read-only access to PROD data; the PROD Data Eng workspace has full access to PROD data; the Analytics workspace has read-only access to PROD data and full access to derivative data.)
©2022 Databricks Inc. — All rights reserved
Grant Access to Production Datasets
Assumptions
● End-users need read-only access
● Datasets organized by database
GRANT USAGE, SELECT, READ_METADATA ON DATABASE hr TO `HR`;

Alternatively, grant access on specific tables:

GRANT USAGE ON DATABASE hr TO `HR`;
GRANT SELECT, READ_METADATA ON TABLE employees TO `HR`;
GRANT SELECT, READ_METADATA ON TABLE addresses TO `HR`;
©2022 Databricks Inc. — All rights reserved
Enable Secure Data Sharing
Assumptions
● Teams/depts need a private area to collaborate in
● Datasets must not be shared outside team/dept
● New tables are not automatically shared to other members
GRANT USAGE, CREATE ON DATABASE project_data TO `Data Analysts`;

Alternatively, members automatically see all new tables:

GRANT USAGE, SELECT, READ_METADATA, CREATE ON DATABASE project_data TO `Data Analysts`;
©2022 Databricks Inc. — All rights reserved
Enable Private User Sandboxes
Assumptions
● Users need a private area to store derivative datasets
● Datasets must not be shared with other users
GRANT USAGE, CREATE ON DATABASE user1 TO `[email protected]`;
©2022 Databricks Inc. — All rights reserved
Delegate Administration of Access Policies
Assumptions
● Need to delegate management of ACLs to data stewards/owners
● Monitor using SQL Analytics query logs and/or workspace audit logs
ALTER DATABASE hr OWNER TO `HR Admins`;
©2022 Databricks Inc. — All rights reserved
Delegate Database Creation to Trusted Users
Assumptions
● Delegate database creation to trusted users
● Database creators manage ACLs
● Monitor grants using SQL Analytics query logs and/or workspace audit logs
GRANT CREATE ON CATALOG TO `Data Admins`;
©2022 Databricks Inc. — All rights reserved
Give All Users Read-Only Access To All
Datasets
Assumptions
● Simple security model where all users have same level of access
● All new datasets are automatically accessible
GRANT USAGE, SELECT, READ_METADATA ON CATALOG TO `users`;
©2022 Databricks Inc. — All rights reserved
Dynamic Views on Databricks
● Need to redact fields based on user’s identity
● Do not give access to underlying table, only view
● Uses existing group membership to filter rows or columns
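A sketch of such a view, using the documented is_member() function; the table, view, group, and column names are illustrative:

spark.sql("""
    CREATE OR REPLACE VIEW employees_redacted AS
    SELECT
      employee_id,
      CASE WHEN is_member('hr_admins') THEN email ELSE 'REDACTED' END AS email,   -- column redaction
      region
    FROM employees
    WHERE is_member('hr_admins') OR region = 'US'                                  -- row filtering
""")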
©2022 Databricks Inc. — All rights reserved
Rollout Plan
● SCIM provisioning with groups
● Identify & classify initial set of datasets
● Register in new workspace and apply ACLs
● For Fine-Grained
○ Define access policies (e.g. “Sales only see their own accounts…”, “Support technicians
can only see accounts within their region”, etc.)
○ Identify missing attributes needed for dynamic views
● Lifecycle policy for derivative datasets
● Define process to “promote” new or derivative datasets
○ Need to classify data & define access policies
○ Legal & compliance review
©2022 Databricks Inc. — All rights reserved
Notebook
Deidentified PII
Access
©2022 Databricks Inc. — All rights reserved
Propagating Updates
and Deletes
Processing Records from Change Data Feed
Deleting Data in the Lakehouse
©2022 Databricks Inc. — All rights reserved
Propagating Changes
with Delta Change
Data Feed
©2022 Databricks Inc. — All rights reserved
Multi-Hop in the Lakehouse
(Diagram: CSV, JSON, TXT, and other files land in the data lake and flow through Bronze, Silver, and Gold tables, with data quality increasing at each hop, to power streaming analytics, AI, and reporting.)
©2022 Databricks Inc. — All rights reserved
What is Stream Composability?
• Structured Streaming expects append-only sources
• Delta tables are composable if new streams can be initiated from them
©2022 Databricks Inc. — All rights reserved
Operations that break stream composability
● Complete aggregations
● Delete
● UPDATE/MERGE
Data is changed in place, breaking append-only expectations
©2022 Databricks Inc. — All rights reserved
Workarounds for Deleting Data
ignoreChanges ignoreDeletes
Allows deletion of full partitions Allows stream to be executed against
Delta table with upstream changes
No new data files are written with full
partition removal Must implement logic to avoid
processing duplicate records
Subsumes ignoreDeletes
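A sketch of applying these options when reading the stream; the table name is illustrative:

stream_df = (
    spark.readStream
        .format("delta")
        .option("ignoreDeletes", "true")      # tolerate full-partition deletes upstream
        # .option("ignoreChanges", "true")    # also tolerates updates; downstream must drop duplicates
        .table("silver_events")
)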
©2022 Databricks Inc. — All rights reserved
What Delta Change Data Feed Does for You
● Improve ETL pipelines: process less data during ETL to increase the efficiency of your pipelines
● Unify batch and streaming: a common change format for batch and streaming updates, appends, and deletes
● BI on your data lake: incrementally update the data supporting your BI tool of choice
● Meet regulatory needs: full history of changes made to the data is available, including deleted information
©2022 Databricks Inc. — All rights reserved
Where Delta Change Data Feed Applies
(Diagram: external feeds, other CDC output, and extracts land in Bronze (raw ingestion and history); CDF propagates changes from Silver (filtered, cleaned, augmented) into Gold (business-level aggregates).)
©2022 Databricks Inc. — All rights reserved
How Does Delta Change Data Feed Work?

Original Table (v1):
PK | B
A1 | B1
A2 | B2
A3 | B3

Change data (merged as v2): A2 is updated to Z2, A3 is deleted, and A4 is inserted.

Change Data Feed Output:
PK | B | Change Type | Time | Version
A2 | B2 | Preimage | 12:00:00 | 2
A2 | Z2 | Postimage | 12:00:00 | 2
A3 | B3 | Delete | 12:00:00 | 2
A4 | B4 | Insert | 12:00:00 | 2

The A1 record did not receive an update or delete, so it will not be output by CDF.
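A sketch of enabling and reading the change feed; the table name and starting version are illustrative:

spark.sql("ALTER TABLE silver_users SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')")

changes = (
    spark.readStream
        .format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 2)
        .table("silver_users")
)
# Each record carries _change_type, _commit_version, and _commit_timestamp columns.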
©2022 Databricks Inc. — All rights reserved
Notebook
Processing Records
from Change Data
Feed
©2022 Databricks Inc. — All rights reserved
Notebook
Propagating Deletes
with Change Data
Feed
©2022 Databricks Inc. — All rights reserved
Notebook
Deleting at Partition
Boundaries
©2022 Databricks Inc. — All rights reserved
Orchestration and
Scheduling
Multi-Task Jobs
Promoting Code with Repos
CLI and REST API
Deploying Workloads
©2022 Databricks Inc. — All rights reserved
Orchestration and
Scheduling with
Multi-Task Jobs
©2022 Databricks Inc. — All rights reserved
What is a Job?
A Job combines:
• Task (What?)
• Schedule (When?)
• Cluster (How?)
Orchestration with Multi-Task Jobs
(Diagram: tasks can run serially (1 → 2 → 3 → 4) or in parallel (1 fans out to 2 and 3, which then converge on 4).)
©2022 Databricks Inc. — All rights reserved
Jobs revisited
(Diagram: a "Silver tables pipeline" Job with a Schedule and four Tasks, each Task paired with a Cluster.)
©2022 Databricks Inc. — All rights reserved
Common Jobs Patterns
● Sequence: data transformation / processing / cleaning, e.g. Bronze → Silver → Gold
● Funnel: multiple data sources; data collection
● Fan-out (star pattern): single data source; data ingestion and distribution
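As a sketch of how the fan-out pattern maps onto a multi-task job definition (Jobs API 2.1 style; the task keys, notebook paths, and cluster id are illustrative):

job_spec = {
    "name": "fan-out-example",
    "tasks": [
        {"task_key": "ingest",
         "notebook_task": {"notebook_path": "/Jobs/ingest"},
         "existing_cluster_id": "1234-567890-abcde123"},       # illustrative cluster id
        {"task_key": "load_silver_a",
         "depends_on": [{"task_key": "ingest"}],               # fan out from the single ingest task
         "notebook_task": {"notebook_path": "/Jobs/silver_a"},
         "existing_cluster_id": "1234-567890-abcde123"},
        {"task_key": "load_silver_b",
         "depends_on": [{"task_key": "ingest"}],
         "notebook_task": {"notebook_path": "/Jobs/silver_b"},
         "existing_cluster_id": "1234-567890-abcde123"},
    ],
}
# The same structure can be built in the Jobs UI or submitted via the Databricks CLI or REST API.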
©2022 Databricks Inc. — All rights reserved
Demo
Creating a
Multi-Task Job
©2022 Databricks Inc. — All rights reserved
Jobs UI Lab
Tasks:
• Task-1: Create Database
• Task-2*: From Task 2
• Task-3*: From Task 3
• Task-4*: Name-Param
• Task-5*: Create Task 5
• Task-6*: Errors
• Task-7*: Cleanup
* Make sure to use a different cluster for Tasks #2 & #3
* Add the parameter "name" with some value (e.g. your name)
* Make sure to use the same cluster for Tasks #5, #6 & #7
* After running once with the error, update Task-6 to pass
©2022 Databricks Inc. — All rights reserved
Promoting Code
with Databricks Repos
©2022 Databricks Inc. — All rights reserved
CI/CD Integration
(Diagram: Databricks Repos syncs code between development/experimentation workspaces and production jobs, integrating with Git and CI/CD systems (version, review, test) through supported Git providers.)
©2022 Databricks Inc. — All rights reserved
Enterprise Readiness
Admin Console | Workspace Settings
©2022 Databricks Inc. — All rights reserved
Lab
Import a Git Repo
https://github.com/databricks-academy/cli-demo
©2022 Databricks Inc. — All rights reserved
Lab
Orchestration with
the Databricks CLI
©2022 Databricks Inc. — All rights reserved
Lab
Using the Databricks
REST API
©2022 Databricks Inc. — All rights reserved
Lab
Deploying Batch and
Streaming Workloads
©2022 Databricks Inc. — All rights reserved