WITH SPARK SQL

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

delta.io | Documentation | GitHub | Delta Lake on Databricks

CREATE AND QUERY DELTA TABLES

Create and use managed database
-- A managed database is saved in the Hive metastore. The default database is named "default".
DROP DATABASE IF EXISTS dbName;
CREATE DATABASE dbName;
USE dbName -- This command lets you refer to tables as tableName instead of dbName.tableName.

Query Delta Lake table by table name (preferred)
/* You can refer to Delta tables by table name or by path.
   Table name is the preferred way, since named tables are
   managed in the Hive metastore (i.e., when you DROP a
   named table, the data is dropped also, which is not the
   case for path-based tables). */
SELECT * FROM [dbName.] tableName

Query Delta Lake table by path
SELECT * FROM delta.`path/to/delta_table` -- note backticks

Convert Parquet table to Delta Lake format in place
-- by table name
CONVERT TO DELTA [dbName.]tableName
  [PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]
-- by path
CONVERT TO DELTA parquet.`/path/to/table` -- note backticks
  [PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]

Create Delta Lake table as SELECT * with no upfront schema definition
CREATE TABLE [dbName.] tableName
USING DELTA
AS SELECT * FROM tableName | parquet.`path/to/data`
[LOCATION `/path/to/table`] -- using LOCATION = unmanaged table

Create table, define schema explicitly with SQL DDL
CREATE TABLE [dbName.] tableName (
  id INT [NOT NULL],
  name STRING,
  date DATE,
  int_rate FLOAT)
USING DELTA
[PARTITIONED BY (time, date)] -- optional

Copy new data into Delta Lake table (with idempotent retries)
COPY INTO [dbName.] targetTable
FROM (SELECT * FROM "/path/to/table")
FILEFORMAT = DELTA -- or CSV, Parquet, ORC, JSON, etc.
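To tie the DDL above together, here is a minimal PySpark sketch that fills in the bracketed placeholders: it creates a managed database and table, inserts two rows, and queries them through spark.sql. The names (lending, loans) and columns are illustrative assumptions, not part of the original sheet; a SparkSession with Delta Lake configured is assumed.

# Hypothetical example: create, load, and query a managed Delta table.
# Assumes `spark` is a SparkSession with Delta Lake configured.
spark.sql("CREATE DATABASE IF NOT EXISTS lending")
spark.sql("USE lending")
spark.sql("""
  CREATE TABLE IF NOT EXISTS loans (
    id INT,
    name STRING,
    date DATE,
    int_rate FLOAT)
  USING DELTA
""")
spark.sql("""
  INSERT INTO loans VALUES
    (8003, 'Kim Jones', DATE '2020-12-18', 3.875),
    (8004, 'Tim Jones', DATE '2020-12-20', 3.750)
""")
spark.sql("SELECT * FROM loans").show()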
DELTA LAKE DDL/DML: UPDATE, DELETE, INSERT, ALTER TABLE

Update rows that match a predicate condition
UPDATE tableName SET event = 'click' WHERE event = 'clk'

Delete rows that match a predicate condition
DELETE FROM tableName WHERE date < '2017-01-01'

Insert values directly into table
INSERT INTO tableName VALUES
  (8003, "Kim Jones", "2020-12-18", 3.875),
  (8004, "Tim Jones", "2020-12-20", 3.750);
-- Insert using a SELECT statement
INSERT INTO tableName SELECT * FROM sourceTable
-- Atomically replace all data in the table with new values
INSERT OVERWRITE loan_by_state_delta VALUES (...)

Upsert (update + insert) using MERGE
MERGE INTO target
USING updates
ON target.Id = updates.Id
WHEN MATCHED AND target.delete_flag = "true" THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET * -- star notation means all columns
WHEN NOT MATCHED THEN
  INSERT (date, Id, data) -- or, use INSERT *
  VALUES (date, Id, data)

Insert with deduplication using MERGE
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
  THEN INSERT *

Alter table schema: add columns
ALTER TABLE tableName ADD COLUMNS (
  col_name data_type
  [FIRST | AFTER colA_name])

Alter table: add or drop constraints
-- Add a NOT NULL constraint:
ALTER TABLE tableName CHANGE COLUMN col_name SET NOT NULL
-- Add a CHECK constraint:
ALTER TABLE tableName
  ADD CONSTRAINT dateWithinRange CHECK (date > '1900-01-01')
-- Drop a constraint:
ALTER TABLE tableName DROP CONSTRAINT dateWithinRange
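As a filled-in version of the MERGE template above, this sketch upserts a hypothetical loan_updates source into the hypothetical loans table from PySpark; both table names are assumptions, not part of the original sheet.

# Hypothetical upsert: loans is the target table, loan_updates the source.
spark.sql("""
  MERGE INTO loans AS target
  USING loan_updates AS updates
  ON target.id = updates.id
  WHEN MATCHED THEN
    UPDATE SET *
  WHEN NOT MATCHED THEN
    INSERT *
""")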
TIME TRAVEL

View the transaction log (aka Delta Log)
DESCRIBE HISTORY tableName

Query historical versions of Delta Lake tables
SELECT * FROM tableName VERSION AS OF 0
SELECT * FROM tableName@v0 -- equivalent to VERSION AS OF 0
SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18"

Find changes between two versions of a table
SELECT * FROM tableName VERSION AS OF 12
EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11

Rollback a table to an earlier version
-- RESTORE requires Delta Lake version 0.7.0+ & DBR 7.4+.
RESTORE tableName VERSION AS OF 0
RESTORE tableName TIMESTAMP AS OF "2020-12-18"
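A small sketch combining the commands above: look up the latest version from DESCRIBE HISTORY, then query the version before it. The loans table is hypothetical, and the table is assumed to have at least two versions.

# Hypothetical: find the current version of the loans table,
# then query the one before it (assumes at least two versions exist).
latest = spark.sql("DESCRIBE HISTORY loans").selectExpr("max(version) AS v").first()["v"]
previous_df = spark.sql(f"SELECT * FROM loans VERSION AS OF {latest - 1}")
previous_df.show()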
UTILITY METHODS

View table details
DESCRIBE DETAIL tableName
DESCRIBE FORMATTED tableName

Delete old files with Vacuum
VACUUM tableName [RETAIN num HOURS] [DRY RUN]

Clone a Delta Lake table
-- Deep clones copy data from the source; shallow clones do not.
CREATE TABLE [dbName.] targetName
  [SHALLOW | DEEP] CLONE sourceName [VERSION AS OF 0]
  [LOCATION "path/to/table"] -- specify location only for path-based tables

Interoperability with Python / DataFrames
-- Read a name-based table from the Hive metastore into a DataFrame
df = spark.table("tableName")
-- Read a path-based table into a DataFrame
df = spark.read.format("delta").load("/path/to/delta_table")

Run SQL queries from Python
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")

Modify data retention settings for Delta Lake table
-- logRetentionDuration: how long transaction log history is kept.
-- deletedFileRetentionDuration: how long ago a file must have been
-- deleted before becoming a candidate for VACUUM.
ALTER TABLE tableName
SET TBLPROPERTIES(
  delta.logRetentionDuration = "interval 30 days",
  delta.deletedFileRetentionDuration = "interval 7 days"
);
SHOW TBLPROPERTIES tableName;

PERFORMANCE OPTIMIZATIONS

Compact data files with Optimize and Z-Order
*Databricks Delta Lake feature
OPTIMIZE tableName
  [ZORDER BY (colNameA, colNameB)]

Auto-optimize tables
*Databricks Delta Lake feature
ALTER TABLE [table_name | delta.`path/to/delta_table`]
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)

Cache frequently queried data in Delta Cache
*Databricks Delta Lake feature
CACHE SELECT * FROM tableName
-- or:
CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0
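For the maintenance commands above, a cautious pattern is to preview a VACUUM before running it. The sketch below uses the hypothetical loans table and the default 7-day (168-hour) retention window.

# Hypothetical: preview which files VACUUM would remove, then actually remove them.
spark.sql("VACUUM loans RETAIN 168 HOURS DRY RUN").show(truncate=False)
spark.sql("VACUUM loans RETAIN 168 HOURS")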
WITH PYTHON

delta.io | Documentation | GitHub | API reference | Databricks
READS AND WRITES WITH DELTA LAKE

Read data from a pandas DataFrame
df = spark.createDataFrame(pdf)  # where pdf is a pandas DataFrame
# then save the DataFrame in Delta Lake format as shown below

Read data using Apache Spark™
# read by path
df = (spark.read.format("parquet"|"csv"|"json"|etc.)
  .load("/path/to/delta_table"))
# read table from Hive metastore
df = spark.table("events")

Save DataFrame in Delta Lake format
(df.write.format("delta")
  .mode("append"|"overwrite")
  .partitionBy("date")            # optional
  .option("mergeSchema", "true")  # optional - evolve schema
  .saveAsTable("events") | .save("/path/to/delta_table")
)

Streaming reads (Delta table as a streaming source)
# by path or by table name
df = (spark.readStream
  .format("delta")
  .schema(schema)
  .table("events") | .load("/delta/events")
)

Streaming writes (Delta table as a sink)
streamingQuery = (
  df.writeStream.format("delta")
  .outputMode("append"|"update"|"complete")
  .option("checkpointLocation", "/path/to/checkpoints")
  .trigger(once=True|processingTime="10 seconds")
  .table("events") | .start("/delta/events")
)
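Because the templates above list alternatives separated by "|", here is one concrete combination as a sketch: build a small DataFrame, append it to a path-based Delta table, and read it back. The path and sample row are illustrative assumptions.

# Hypothetical round trip: append a DataFrame to a path-based Delta table, then read it back.
from datetime import date

df = spark.createDataFrame(
    [(8003, "Kim Jones", date(2020, 12, 18), 3.875)],
    ["id", "name", "date", "int_rate"])

(df.write.format("delta")
   .mode("append")
   .partitionBy("date")          # optional
   .save("/tmp/delta/events"))   # hypothetical path

events_df = spark.read.format("delta").load("/tmp/delta/events")
events_df.show()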
CONVERT PARQUET TO DELTA LAKE

Convert Parquet table to Delta Lake format in place
deltaTable = DeltaTable.convertToDelta(spark,
  "parquet.`/path/to/parquet_table`")
# or, for a partitioned table:
partitionedDeltaTable = DeltaTable.convertToDelta(spark,
  "parquet.`/path/to/parquet_table`", "part int")

WORKING WITH DELTA TABLES

# A DeltaTable is the entry point for interacting with tables
# programmatically in Python, for example to perform updates or deletes.
from delta.tables import *
deltaTable = DeltaTable.forName(spark, tableName)
deltaTable = DeltaTable.forPath(spark, "/path/to/delta_table")

DELTA LAKE DDL/DML: UPDATES, DELETES, INSERTS, MERGES

Delete rows that match a predicate condition
# predicate using a SQL-formatted string
deltaTable.delete("date < '2017-01-01'")
# predicate using Spark SQL functions
deltaTable.delete(col("date") < "2017-01-01")

Update rows that match a predicate condition
# predicate using a SQL-formatted string
deltaTable.update(condition = "eventType = 'clk'",
                  set = { "eventType": "'click'" } )
# predicate using Spark SQL functions
deltaTable.update(condition = col("eventType") == "clk",
                  set = { "eventType": lit("click") } )

Upsert (update + insert) using MERGE
# Available options for merges [see documentation for details]:
# .whenMatchedUpdate(...) | .whenMatchedUpdateAll(...) |
# .whenNotMatchedInsert(...) | .whenMatchedDelete(...)
(deltaTable.alias("target").merge(
    source = updatesDF.alias("updates"),
    condition = "target.eventId = updates.eventId")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsert(
    values = {
      "date": "updates.date",
      "eventId": "updates.eventId",
      "data": "updates.data",
      "count": 1
    }
  ).execute()
)

Insert with deduplication using MERGE
(deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.uniqueId = newDedupedLogs.uniqueId")
  .whenNotMatchedInsertAll()
  .execute()
)
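The update and delete snippets above rely on col() and lit(); a self-contained version needs the imports shown below. The path and column names are illustrative assumptions.

# Hypothetical path-based table of click events; path and columns are illustrative.
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit

deltaTable = DeltaTable.forPath(spark, "/tmp/delta/clickstream")

# Normalize an event type, then drop old rows.
deltaTable.update(condition=col("eventType") == "clk",
                  set={"eventType": lit("click")})
deltaTable.delete(col("date") < "2017-01-01")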
TIME TRAVEL

View the transaction log (aka Delta Log)
fullHistoryDF = deltaTable.history()

Query historical versions of Delta Lake tables
# choose only one option: versionAsOf or timestampAsOf
df = (spark.read.format("delta")
  .option("versionAsOf", 0)
  .option("timestampAsOf", "2020-12-18")
  .load("/path/to/delta_table"))

Find changes between two versions of a table
df1 = spark.read.format("delta").load(pathToTable)
df2 = (spark.read.format("delta")
  .option("versionAsOf", 2)
  .load("/path/to/delta_table"))
df1.exceptAll(df2).show()

Rollback a table by version or timestamp
deltaTable.restoreToVersion(0)
deltaTable.restoreToTimestamp('2020-12-01')

UTILITY METHODS

Run Spark SQL queries in Python
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
spark.sql("DESCRIBE HISTORY tableName")

Compact old files with Vacuum
deltaTable.vacuum()     # vacuum files older than the default retention period (7 days)
deltaTable.vacuum(100)  # vacuum files not required by versions more than 100 hours old

Clone a Delta Lake table
deltaTable.clone(target="/path/to/delta_table/", isShallow=True, replace=True)

Get a DataFrame representation of a Delta Lake table
df = deltaTable.toDF()

Run SQL queries on Delta Lake tables
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")

PERFORMANCE OPTIMIZATIONS

Compact data files with Optimize and Z-Order
*Databricks Delta Lake feature
spark.sql("OPTIMIZE tableName [ZORDER BY (colA, colB)]")

Auto-optimize tables
*Databricks Delta Lake feature. For existing tables:
spark.sql("ALTER TABLE [table_name | delta.`path/to/delta_table`] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)")
# To enable auto-optimize for all new Delta Lake tables:
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Cache frequently queried data in Delta Cache
*Databricks Delta Lake feature
spark.sql("CACHE SELECT * FROM tableName")
# or:
spark.sql("CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0")
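Finally, the history() DataFrame can drive programmatic rollback. This sketch lists recent commits and restores a chosen version; the path and version number are illustrative assumptions.

# Hypothetical: inspect recent operations on a table, then restore to a chosen version.
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/tmp/delta/clickstream")

# history() is returned newest-first; show the five most recent commits.
(deltaTable.history(5)
    .select("version", "timestamp", "operation")
    .show(truncate=False))

# Roll back to a version picked from the history above (assumes version 2 exists).
deltaTable.restoreToVersion(2)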
