Databricks for the
SQL Developer
Gerhard Brueckl
About me
@gbrueckl
blog.gbrueckl.at www.paiqo.com
[email protected]
https://github.com/gbrueckl
Agenda
What is Databricks / Spark?
Why use Databricks for SQL workloads?
SQL with Databricks / Spark
Delta Lake
Advanced SQL techniques
What is Databricks?
Company that provides a Big Data processing
solution in the Cloud using Apache Spark
Founded in 2013
Creators of Apache® Spark™
Offers: Databricks on AWS, Azure Databricks
NO on-prem solution!
What is Apache Spark?
Open-source cluster computing framework
Runs on YARN, Mesos, …
Built for: Speed, Ease-of-Use, Extensibility
Support for multiple languages
Java, Scala, Python, R, SQL
Project of the Apache Foundation
Largest open-source data project
Apache Spark APIs
Spark unifies:
• Batch Processing
• Interactive SQL
• Real-time processing
• Machine Learning
• Deep Learning
• Graph Processing
How does it work?
The ‘Driver’ runs the ‘main’ function and coordinates the
various parallel operations on the worker nodes
The worker nodes read and write data from/to
Data Sources including HDFS
Worker nodes also cache transformed data in
memory as RDDs (Resilient Distributed Datasets)
The results of the operations are collected by the
driver and returned to the client.
Worker nodes and the Driver Node execute as
VMs in public clouds (AWS or Azure)
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Azure Databricks
[Architecture overview: a Collaborative Workspace for data engineers, data scientists and business analysts; deployment of production jobs & workflows (multi-stage pipelines, job scheduler, notifications & logs); and the Optimized Databricks Runtime Engine (Databricks I/O, Apache Spark, Serverless). Connects to IoT / streaming data, machine learning models, BI tools, cloud storage, data warehouses, Hadoop storage, data exports and REST APIs. Enhance productivity, build on a secure & trusted cloud, scale without limits.]
Why use Databricks for SQL workloads?
• Cloud-only solutions
• Works with structured and unstructured data
• Native Azure integration
• Open Standard (Apache Spark)
• Single Tool for all Workloads
• Extensible
• Scalability
• Native SQL integration
Batch processing only – no OLTP!!!
Spark SQL Fundamentals
• Tables are just references and metadata
  • Location
  • Column definitions
  • Partitions
  • …
  • (similar to external tables in PolyBase)
• Files must match the schema!
• No indexes
• (No Stored Procedures)
• (No Statistics)
Supported SQL Features
ANSI SQL
• SELECT and INSERT only!
• Joins
• Groupings/Aggregations
• Rollup/Cube/GroupingSets
• Subselects
• Window Functions
• Transformations/Functions
• Views
• Temporary tables
• …
Constantly evolving!
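
A minimal sketch of some of these features, assuming a hypothetical FactSales table with columns Customer, Product and Sales:

-- Rollup over two grouping columns
SELECT Customer, Product, SUM(Sales) AS TotalSales
FROM FactSales
GROUP BY ROLLUP (Customer, Product);

-- Window function: each sale's share of its customer's total
SELECT Customer, Product, Sales,
       Sales / SUM(Sales) OVER (PARTITION BY Customer) AS SalesShare
FROM FactSales;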
SQL SELECT and INSERT
SELECT
• Tables or Views
• on files directly!
INSERT
• Creates new files in the table's folder on storage
CREATE TABLE AS SELECT
• Persists the result of a SQL query
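
A few hedged examples of these statements; the table names and the mount path are made up for illustration:

-- SELECT directly on files, without registering a table first
SELECT * FROM parquet.`/mnt/adls/raw/DimProduct/`;

-- INSERT appends new files to the table's folder
INSERT INTO DimProduct
SELECT * FROM DimProductStaging;

-- CREATE TABLE AS SELECT persists the query result as a new table
CREATE TABLE DimProductRed
AS SELECT * FROM DimProduct WHERE Color = 'Red';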
Processing of a SQL Query
1. Client submits SQL query
2. Databricks queries the Meta Data Catalog
   • Checks syntax
   • Checks columns
   • Returns storage locations
3. Databricks queries the storage services for raw data
4. Data is loaded into the memory of the nodes
5. Data is processed on the nodes using Spark SQL

"Result"
• Data is written directly to storage services
OR
• Data is collected on the driver and returned to the client
Databricks Meta Data Store
= Apache Hive Metastore
Metadata of all SQL objects
Databases, Tables, Columns, …
Managed by Databricks
OR
Hosted externally
MSSQL / Azure SQL
MySQL / Azure MySQL
Can be Shared!
Types of Tables
Managed
• Stored inside Databricks (Azure Blob Storage)
• Filesystem not accessible from outside
• DROP TABLE also deletes the files!
Unmanaged
• Usually stored externally (Azure Blob Storage / Azure Data Lake Store / …)
• Can be shared with other services
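
A minimal sketch of both table types; the table names and the mount path are made up:

-- Managed table: Databricks controls the storage location; DROP TABLE deletes the files
CREATE TABLE DimProductManaged (ProductKey INT, Product STRING, Price DECIMAL(10,2));

-- Unmanaged (external) table: DROP TABLE removes only the metadata, the files stay in place
CREATE TABLE DimProductExternal (ProductKey INT, Product STRING, Price DECIMAL(10,2))
USING PARQUET
LOCATION '/mnt/adls/tables/DimProductExternal';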
DEMO
Delta Lake – delta.io
Delta Lake is an open-source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads.
• ACID compliant transactions
• Optimistic Concurrency Control
• Support for UPDATE / MERGE
• Time-Travel
• Schema enforcement and evolution
• Across multiple files/folders
• Batch & Streaming
• 100% compatible with Apache Spark
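
Time-Travel, for example, is plain SQL on Databricks; the version number and timestamp below are placeholders:

-- Query an older version of a Delta table
SELECT * FROM DimProductDelta VERSION AS OF 1;
SELECT * FROM DimProductDelta TIMESTAMP AS OF '2020-01-01';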
Delta Lake – CREATE TABLE
CREATE TABLE IF NOT EXISTS DimProductDelta
USING DELTA
PARTITIONED BY (ProductSubcategoryKey)
-- CLUSTERED BY (ProductKey) INTO 4 BUCKETS
LOCATION '/mnt/adls/tables/DimProductDelta'
TBLPROPERTIES ('myKey' = 'myValue')
Avoid defining Columns explicitly – handled by transaction log!
Clustering is not supported!
Delta Lake – UPDATE/DELETE/MERGE
Always results in new files! Even a DELETE!
Old files are invalidated via _delta_log
Operations are logged in _delta_log
Conflicts have to be handled by the User!
Can create A LOT of files!
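
A sketch of an upsert with MERGE, assuming a hypothetical staging table DimProductUpdates with the same schema as the target:

MERGE INTO DimProductDelta AS target
USING DimProductUpdates AS source
  ON target.ProductKey = source.ProductKey
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;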
Delta Lake – UPDATE

UPDATE DimProduct
SET Price = 1300
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):
  Notebook   900 €
  PC       1,500 €
  Tablet     500 €

After (part-02.parquet, 3 rows):
  Notebook   900 €
  PC       1,300 €
  Tablet     500 €

_delta_log
  000000000.json:  "add":    { "path": "part-01.parquet", ... }
  000000001.json:  "remove": { "path": "part-01.parquet", ... }
                   "add":    { "path": "part-02.parquet", ... }

Storage: part-01 (3 rows) and part-02 (3 rows) both remain on disk; only part-02 is referenced by the latest version.
Delta Lake – DELETE

DELETE FROM DimProduct
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):
  Notebook   900 €
  PC       1,500 €
  Tablet     500 €

After (part-02.parquet, 2 rows):
  Notebook   900 €
  Tablet     500 €

_delta_log
  000000001.json:  "add":    { "path": "part-01.parquet", ... }
  000000002.json:  "remove": { "path": "part-01.parquet", ... }
                   "add":    { "path": "part-02.parquet", ... }

Storage: part-01 (3 rows) and part-02 (2 rows) both remain on disk; only part-02 is referenced by the latest version.
Delta Lake – _delta_log
[Screenshots: the _delta_log folder and its JSON commit files, including the schema / statistics metadata and the entries written by an UPDATE.]
Delta Lake – Optimization
Manually with the OPTIMIZE command
• Collapses many small files into few big files
• Optimizes the current/latest version only!
• Bin-Packing / Ordering

OPTIMIZE events
WHERE date = 20200101
ZORDER BY (eventType)
Delta Lake – Clean-Up
Manually with VACUUM command
Automatically with every INSERT/UPDATE/MERGE
Default retention period is 7 days
Can be changed via TBLPROPERTIES
VACUUM events
[RETAIN num HOURS] [DRY RUN]
Delta Lake – Table Properties
Clean-Up Settings
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '240 HOURS');
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.logRetentionDuration' = '240 HOURS');
Blocks deletes and modifications of a table
'delta.appendOnly' = 'true'
Configures the number of columns for which statistics are collected
'delta.dataSkippingNumIndexedCols' = '5'
CREATE TABLE DimProductDelta
USING DELTA
TBLPROPERTIES ('key1' = 'val1', 'key2' = 'val2', ...)
Delta Lake – New Meta Tables / DMVs
Get information about schema, partitioning, table size, …
DESCRIBE DETAIL myDeltaTable;
Provides provenance information, including the operation, user, and so on, for
each write to a table
DESCRIBE HISTORY myDeltaTable;
DEMO
Advanced SQL – Extensions
User Defined Functions
Python or Scala
User Defined Aggregates
Scala only
Session-level only!
Can be used globally when packaged as a JAR
DEMO
Advanced SQL – Sampling
Sampling
SELECT * FROM myTable TABLESAMPLE (10 ROWS)
SELECT * FROM myTable TABLESAMPLE (5 PERCENT)
Returns a representative sample
Useful for large tables
Advanced SQL – Security
Cluster Setup
• Table Level Security requires Databricks Premium SKU
• Set on Cluster-Level
• Need to control access to the cluster
SQL Permissions
• Privileges: SELECT, CREATE, MODIFY, READ_METADATA, CREATE_NAMED_FUNCTION, ALL PRIVILEGES
• Objects: CATALOG, DATABASE, TABLE, VIEW, FUNCTION, ANONYMOUS FUNCTION, ANY FILE
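
A sketch of the corresponding GRANT statements on a Table-ACL-enabled cluster; the principal names are made up:

-- Grant read access on a single table
GRANT SELECT ON TABLE DimProductDelta TO `analysts`;

-- Grant read access on a whole database and inspect the grants
GRANT SELECT, READ_METADATA ON DATABASE sales TO `user@contoso.com`;
SHOW GRANT `user@contoso.com` ON DATABASE sales;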
Advanced SQL – External Tables
Connect to any JDBC source
Exposed as regular SQL table
CREATE TABLE myJdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
dbtable "<jdbcDatabase>.myTable",
user "<jdbcUsername>",
password "<jdbcPassword>"
)
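
Once created, the JDBC table can be queried and joined like any other table; the column names below are assumptions:

SELECT l.ProductKey, l.Product, j.Price
FROM DimProduct AS l
JOIN myJdbcTable AS j
  ON l.ProductKey = j.ProductKey;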
Distributed Processing – Round Robin
Rows are distributed to the workers round-robin; each worker pre-aggregates its rows and the head node combines the partial results.

Worker 1 (rows 1, 3, 5, 7):
  John    PC        100 €
  Karl    Printer    50 €
  Karl    Printer    60 €
  John    Printer    80 €
  → partial result: John 180 €, Karl 110 €

Worker 2 (rows 2, 4, 6):
  Peter   PC        200 €
  Mark    Phone      70 €
  John    Scanner   150 €
  → partial result: Peter 200 €, Mark 70 €, John 150 €

Head node merges the partial results:
  John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
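
The aggregation illustrated above corresponds to a simple GROUP BY, here on the hypothetical FactSales table from earlier:

SELECT Customer, SUM(Sales) AS Sales
FROM FactSales
GROUP BY Customer;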
Distributed Processing – By Column
Rows are distributed by the grouping column (Customer), so each customer ends up on exactly one worker; the head node only concatenates the per-worker results.

Worker 1 (rows 1, 3, 5, 6, 7):
  John    PC        100 €
  Karl    Printer    50 €
  Karl    Printer    60 €
  John    Scanner   150 €
  John    Printer    80 €
  → result: John 230 €, Karl 110 €

Worker 2 (rows 2, 4):
  Peter   PC        200 €
  Mark    Phone      70 €
  → result: Peter 200 €, Mark 70 €

Head node concatenates the results:
  John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
Distributed Processing – Non-Additive
A non-additive measure such as COUNT(DISTINCT Product) cannot be computed from per-worker counts; the head node has to merge the distinct values themselves.

Worker 1 (rows 1, 3, 5, 6, 7):
  John    PC        100 €
  Karl    Printer    50 €
  Karl    Printer    60 €
  John    Scanner   150 €
  John    Printer    80 €
  → distinct products: PC, Printer, Scanner (3)

Worker 2 (rows 2, 4):
  Peter   PC        200 €
  Mark    Phone      70 €
  → distinct products: PC, Phone (2)

Head node merges the distinct sets:
  PC, Printer, Scanner, Phone → 4 distinct products (not 3 + 2 = 5)
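
The corresponding non-additive query, again on the hypothetical FactSales table; the per-worker counts of 3 and 2 cannot simply be summed, the distinct values have to be merged first:

SELECT COUNT(DISTINCT Product) AS DistinctProducts
FROM FactSales;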