Databricks for the
SQL Developer
Gerhard Brueckl
About me
@gbrueckl
blog.gbrueckl.at www.paiqo.com
[email protected]
https://github.com/gbrueckl
Agenda
What is Databricks / Spark?
Why use Databricks for SQL workloads?
SQL with Databricks / Spark
Delta Lake
Advanced SQL techniques
What is Databricks?
Company that provides a Big Data processing
solution in the Cloud using Apache Spark
Founded in 2013
Creators of Apache® Spark™
Offers: Databricks on AWS, Azure Databricks
NO on-prem solution!
What is Apache Spark?
Open-source cluster computing framework
Runs on YARN, Mesos, …
Built for: Speed, Ease-of-Use, Extensibility
Support for multiple languages
Java, Scala, Python, R, SQL
Project of the Apache Foundation
Largest open-source data project
Apache Spark APIs
Spark unifies:
• Batch Processing
• Interactive SQL
• Real-time processing
• Machine Learning
• Deep Learning
• Graph Processing
How does it work?
The ‘Driver’ runs the ‘main’ function and coordinates the
various parallel operations on the worker nodes
The worker nodes read and write data from/to
Data Sources including HDFS
Worker nodes also cache transformed data in
memory as RDDs (Resilient Distributed Datasets)
The results of the operations are collected by the
driver and returned to the client.
Worker nodes and the Driver Node execute as
VMs in public clouds (AWS or Azure)
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Azure Databricks
[Architecture overview: a Collaborative Workspace for data engineers, data scientists and business analysts; deployment of production jobs & workflows (multi-stage pipelines, job scheduler, notifications & logs); and the Optimized Databricks Runtime Engine (Databricks I/O, Apache Spark, Serverless). Connects to IoT / streaming data, machine learning models, BI tools, cloud storage, data warehouses, Hadoop storage, data exports and REST APIs. Enhance productivity, build on a secure & trusted cloud, scale without limits.]
Why use Databricks for SQL workloads?
• Cloud-only solutions
• Works with structured and unstructured data
• Native Azure integration
• Open Standard (Apache Spark)
• Single Tool for all Workloads
• Extensible
• Scalability
• Native SQL integration
Batch processing only – no OLTP!!!
Spark SQL Fundamentals
• Tables are just references and metadata
  • Location
  • Column definitions
  • Partitions
  • …
  • (similar to external tables in PolyBase)
• Files must match the schema!
• No indexes
• (No Stored Procedures)
• (No Statistics)
Supported SQL Features
ANSI SQL
• SELECT and INSERT only!
• Joins
• Groupings/Aggregations
• Rollup/Cube/GroupingSets
• Subselects
• Window Functions
• Transformations/Functions
• Views
• Temporary tables
• …
Constantly evolving!
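
A minimal sketch of some of these features, assuming a hypothetical FactSales table with columns Customer, Product and Sales:

-- Rollup over two grouping columns
SELECT Customer, Product, SUM(Sales) AS TotalSales
FROM FactSales
GROUP BY ROLLUP (Customer, Product);

-- Window function: each sale's share of its customer's total
SELECT Customer, Product, Sales,
       Sales / SUM(Sales) OVER (PARTITION BY Customer) AS SalesShare
FROM FactSales;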
SQL SELECT and INSERT
SELECT
• Tables or Views
• on files directly!
INSERT
• Creates new files in the table's folder on storage
CREATE TABLE AS SELECT
• Persists the result of a SQL query
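
A few hedged examples of these statements; the table names and the mount path are made up for illustration:

-- SELECT directly on files, without registering a table first
SELECT * FROM parquet.`/mnt/adls/raw/DimProduct/`;

-- INSERT appends new files to the table's folder
INSERT INTO DimProduct
SELECT * FROM DimProductStaging;

-- CREATE TABLE AS SELECT persists the query result as a new table
CREATE TABLE DimProductRed
AS SELECT * FROM DimProduct WHERE Color = 'Red';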
Processing of a SQL Query
1. Client submits SQL query
2. Databricks queries the Meta Data Catalog
   • Checks syntax
   • Checks columns
   • Returns storage locations
3. Databricks queries the storage services for raw data
4. Data is loaded into the memory of the nodes
5. Data is processed on the nodes using Spark SQL

"Result"
• Data is written directly to storage services
OR
• Data is collected on the driver and returned to the client
Databricks Meta Data Store
= Apache Hive Metastore
Metadata of all SQL objects
Databases, Tables, Columns, …
Managed by Databricks
OR
Hosted externally
MSSQL / Azure SQL
MySQL / Azure MySQL
Can be Shared!
Types of Tables
Managed
• Stored inside Databricks (Azure Blob Storage)
• Filesystem not accessible from outside
• DROP TABLE also deletes the files!
Unmanaged
• Usually stored externally (Azure Blob Storage / Azure Data Lake Store / …)
• Can be shared with other services
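
A minimal sketch of both table types; the table names and the mount path are made up:

-- Managed table: Databricks controls the storage location; DROP TABLE deletes the files
CREATE TABLE DimProductManaged (ProductKey INT, Product STRING, Price DECIMAL(10,2));

-- Unmanaged (external) table: DROP TABLE removes only the metadata, the files stay in place
CREATE TABLE DimProductExternal (ProductKey INT, Product STRING, Price DECIMAL(10,2))
USING PARQUET
LOCATION '/mnt/adls/tables/DimProductExternal';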
DEMO
Delta Lake – delta.io
Delta Lake is an open-source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads.
• ACID compliant transactions
• Optimistic Concurrency Control
• Support for UPDATE / MERGE
• Time-Travel
• Schema enforcement and evolution
• Across multiple files/folders
• Batch & Streaming
• 100% compatible with Apache Spark
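
Time-Travel, for example, is plain SQL on Databricks; the version number and timestamp below are placeholders:

-- Query an older version of a Delta table
SELECT * FROM DimProductDelta VERSION AS OF 1;
SELECT * FROM DimProductDelta TIMESTAMP AS OF '2020-01-01';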
Delta Lake – CREATE TABLE
CREATE TABLE IF NOT EXISTS DimProductDelta
USING DELTA
PARTITIONED BY (ProductSubcategoryKey)
-- CLUSTERED BY (ProductKey) INTO 4 BUCKETS
LOCATION '/mnt/adls/tables/DimProductDelta'
TBLPROPERTIES ('myKey' = 'myValue')
Avoid defining Columns explicitly – handled by transaction log!
Clustering is not supported!
Delta Lake – UPDATE/DELETE/MERGE
Always results in new files! Even a DELETE!
Old files are invalidated via _delta_log
Operations are logged in _delta_log
Conflicts have to be handled by the User!
Can create A LOT of files!
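
A sketch of an upsert with MERGE, assuming a hypothetical staging table DimProductUpdates with the same schema as the target:

MERGE INTO DimProductDelta AS target
USING DimProductUpdates AS source
  ON target.ProductKey = source.ProductKey
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;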
Delta Lake – UPDATE

UPDATE DimProduct
SET Price = 1300
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):
  Notebook   900 €
  PC       1,500 €
  Tablet     500 €

After (part-02.parquet, 3 rows):
  Notebook   900 €
  PC       1,300 €
  Tablet     500 €

_delta_log
  000000000.json:  "add":    { "path": "part-01.parquet", ... }
  000000001.json:  "remove": { "path": "part-01.parquet", ... }
                   "add":    { "path": "part-02.parquet", ... }

Storage: part-01 (3 rows) and part-02 (3 rows) both remain on disk; only part-02 is referenced by the latest version.
Delta Lake – DELETE

DELETE FROM DimProduct
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):
  Notebook   900 €
  PC       1,500 €
  Tablet     500 €

After (part-02.parquet, 2 rows):
  Notebook   900 €
  Tablet     500 €

_delta_log
  000000001.json:  "add":    { "path": "part-01.parquet", ... }
  000000002.json:  "remove": { "path": "part-01.parquet", ... }
                   "add":    { "path": "part-02.parquet", ... }

Storage: part-01 (3 rows) and part-02 (2 rows) both remain on disk; only part-02 is referenced by the latest version.
Delta Lake – _delta_log
[Screenshots: the _delta_log folder and its JSON commit files, including the schema / statistics metadata and the entries written by an UPDATE.]
Delta Lake – Optimization
Manually with the OPTIMIZE command
• Collapses many small files into few big files
• Optimizes the current/latest version only!
• Bin-Packing / Ordering

OPTIMIZE events
WHERE date = 20200101
ZORDER BY (eventType)
Delta Lake – Clean-Up
Manually with VACUUM command
Automatically with every INSERT/UPDATE/MERGE
Default retention period is 7 days
Can be changed via TBLPROPERTIES
VACUUM events
[RETAIN num HOURS] [DRY RUN]
Delta Lake – Table Properties
Clean-Up Settings
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '240 HOURS');
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.logRetentionDuration' = '240 HOURS');
Blocks deletes and modifications of a table
'delta.appendOnly' = 'true'
Configures the number of columns for which statistics are collected
'delta.dataSkippingNumIndexedCols' = '5'
CREATE TABLE DimProductDelta
USING DELTA
TBLPROPERTIES ('key1' = 'val1', 'key2' = 'val2', ...)
Delta Lake – New Meta Tables / DMVs
Get information about schema, partitioning, table size, …
DESCRIBE DETAIL myDeltaTable;
Provides provenance information, including the operation, user, and so on, for
each write to a table
DESCRIBE HISTORY myDeltaTable;
DEMO
Advanced SQL – Extensions
User Defined Functions
Python or Scala
User Defined Aggregates
Scala only
Session-level only!
Can be used globally when packaged as a JAR
DEMO
Advanced SQL – Sampling
Sampling
SELECT * FROM myTable TABLESAMPLE (10 ROWS)
SELECT * FROM myTable TABLESAMPLE (5 PERCENT)
Returns a representative sample
Useful for large tables
Advanced SQL – Security
Cluster Setup
• Table Level Security requires Databricks Premium SKU
• Set on Cluster-Level
• Need to control access to the cluster
SQL Permissions
• Privileges: SELECT, CREATE, MODIFY, READ_METADATA, CREATE_NAMED_FUNCTION, ALL PRIVILEGES
• Objects: CATALOG, DATABASE, TABLE, VIEW, FUNCTION, ANONYMOUS FUNCTION, ANY FILE
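
A sketch of the corresponding GRANT statements on a Table-ACL-enabled cluster; the principal names are made up:

-- Grant read access on a single table
GRANT SELECT ON TABLE DimProductDelta TO `analysts`;

-- Grant read access on a whole database and inspect the grants
GRANT SELECT, READ_METADATA ON DATABASE sales TO `user@contoso.com`;
SHOW GRANT `user@contoso.com` ON DATABASE sales;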
Advanced SQL – External Tables
Connect to any JDBC source
Exposed as regular SQL table
CREATE TABLE myJdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
dbtable "<jdbcDatabase>.myTable",
user "<jdbcUsername>",
password "<jdbcPassword>"
)
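
Once created, the JDBC table can be queried and joined like any other table; the column names below are assumptions:

SELECT l.ProductKey, l.Product, j.Price
FROM DimProduct AS l
JOIN myJdbcTable AS j
  ON l.ProductKey = j.ProductKey;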
Distributed Processing – Round Robin
Rows are distributed to the workers round-robin; each worker pre-aggregates its rows and the head node combines the partial results.

Worker 1 (rows 1, 3, 5, 7):
  John    PC        100 €
  Karl    Printer    50 €
  Karl    Printer    60 €
  John    Printer    80 €
  → partial result: John 180 €, Karl 110 €

Worker 2 (rows 2, 4, 6):
  Peter   PC        200 €
  Mark    Phone      70 €
  John    Scanner   150 €
  → partial result: Peter 200 €, Mark 70 €, John 150 €

Head node merges the partial results:
  John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
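
The aggregation illustrated above corresponds to a simple GROUP BY, here on the hypothetical FactSales table from earlier:

SELECT Customer, SUM(Sales) AS Sales
FROM FactSales
GROUP BY Customer;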
Distributed Processing – By Column
Rows are distributed by the grouping column (Customer), so each customer ends up on exactly one worker; the head node only concatenates the per-worker results.

Worker 1 (rows 1, 3, 5, 6, 7):
  John    PC        100 €
  Karl    Printer    50 €
  Karl    Printer    60 €
  John    Scanner   150 €
  John    Printer    80 €
  → result: John 230 €, Karl 110 €

Worker 2 (rows 2, 4):
  Peter   PC        200 €
  Mark    Phone      70 €
  → result: Peter 200 €, Mark 70 €

Head node concatenates the results:
  John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
Distributed Processing – Non-Additive
A non-additive measure such as COUNT(DISTINCT Product) cannot be computed from per-worker counts; the head node has to merge the distinct values themselves.

Worker 1 (rows 1, 3, 5, 6, 7):
  John    PC        100 €
  Karl    Printer    50 €
  Karl    Printer    60 €
  John    Scanner   150 €
  John    Printer    80 €
  → distinct products: PC, Printer, Scanner (3)

Worker 2 (rows 2, 4):
  Peter   PC        200 €
  Mark    Phone      70 €
  → distinct products: PC, Phone (2)

Head node merges the distinct sets:
  PC, Printer, Scanner, Phone → 4 distinct products (not 3 + 2 = 5)
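
The corresponding non-additive query, again on the hypothetical FactSales table; the per-worker counts of 3 and 2 cannot simply be summed, the distinct values have to be merged first:

SELECT COUNT(DISTINCT Product) AS DistinctProducts
FROM FactSales;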