Tabular Iceberg-Spark Cheat-Sheet

This document provides an overview of Iceberg's capabilities for creating and altering tables, inserting and merging data, and working with catalogs and metadata tables in Spark SQL. It describes Iceberg's support for primitive and nested data types, partitioning transforms, schema evolution operations, and writing data from DataFrames.

Iceberg Spark 3.3

CREATE and ALTER TABLE

Example syntax

  CREATE TABLE IF NOT EXISTS logs (
    level string, event_ts timestamp, msg string, ...)
  USING iceberg
  PARTITIONED BY (level, hours(event_ts))
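
Iceberg tables can also be created directly from a query (CREATE TABLE ... AS SELECT). A minimal sketch, assuming a hypothetical source table raw_logs with matching columns:

  CREATE TABLE IF NOT EXISTS logs
  USING iceberg
  PARTITIONED BY (level, hours(event_ts))
  AS SELECT level, event_ts, msg FROM raw_logs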

Catalogs

Configure a catalog, called "sandbox"

  spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog
  spark.sql.catalog.sandbox.type=rest
  spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws
  spark.sql.catalog.sandbox.warehouse=sandbox
  spark.sql.catalog.sandbox.credential=...
  spark.sql.defaultCatalog=sandbox
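
Once a catalog is configured, tables are addressed with three-part names (catalog.database.table). A sketch, assuming a hypothetical examples.logs table in the sandbox catalog:

  SELECT count(1) FROM sandbox.examples.logs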

Working with multiple catalogs in SQL

See the session's current catalog and database
  SHOW CURRENT NAMESPACE

Set the current catalog and database
  USE sandbox.examples

List databases and tables
  SHOW DATABASES
  SHOW TABLES

Supported types

Primitive types:
  boolean, int, bigint, float, double, decimal(P,S),
  date, timestamp, string, binary
Note: Spark's timestamp type is Iceberg's timestamp with time zone type

Nested types:
  struct<name type, ...>, array<item_type>, map<key_type, value_type>
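
A sketch combining these types in a single table definition (hypothetical events table, illustrative only):

  CREATE TABLE IF NOT EXISTS events (
    id bigint,
    attrs map<string, string>,
    tags array<string>,
    location struct<lat float, long float>)
  USING iceberg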

Supported partition transforms

  column                Partition by the unmodified column value
  years(event_ts)       Year granularity, e.g. 2023
  months(event_ts)      Month granularity, e.g. 2023-03
  days(event_ts)        Day granularity, e.g. 2023-03-01
  hours(event_ts)       Hour granularity, e.g. 2023-03-01-10
  truncate(width, col)  Truncate strings or numbers in col
  bucket(width, col)    Hash col values into width buckets
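
For example, a sketch of a table partitioned with bucket and truncate transforms (hypothetical orders table, illustrative only):

  CREATE TABLE IF NOT EXISTS orders (
    id bigint, sku string, order_ts timestamp)
  USING iceberg
  PARTITIONED BY (bucket(16, id), truncate(4, sku), days(order_ts))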

Schema evolution (ALTER TABLE table ...)

  ADD COLUMN line_no int AFTER event_ts

  -- widen type (int to bigint, float to double, etc.)
  ALTER COLUMN line_no TYPE bigint

  ALTER COLUMN line_no COMMENT 'Line number'
  ALTER COLUMN line_no FIRST
  ALTER COLUMN line_no AFTER event_ts
  RENAME COLUMN msg TO message
  DROP COLUMN line_no

Adding/updating nested types
  ADD COLUMN location struct<lat float, long float>
  ADD COLUMN location.altitude float
Note: ALTER COLUMN can't modify struct types; add or update nested fields individually
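
These clauses attach to a full ALTER TABLE statement. For example, using the logs table from above:

  ALTER TABLE logs ADD COLUMN line_no int AFTER event_ts
  ALTER TABLE logs ALTER COLUMN line_no TYPE bigint
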
Alter partition spec

  ALTER TABLE ... ADD PARTITION FIELD days(event_ts) AS day
  ALTER TABLE ... DROP PARTITION FIELD days(event_ts)
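
Partition evolution is a metadata-only change: existing data files keep their old layout and only newly written data uses the new spec. A sketch switching the logs table from hourly to daily partitioning (illustrative only):

  ALTER TABLE logs DROP PARTITION FIELD hours(event_ts)
  ALTER TABLE logs ADD PARTITION FIELD days(event_ts) AS day
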
Setting distribution and sort order

Globally sort by event_ts
  ALTER TABLE logs WRITE ORDERED BY event_ts

Distribute by partitions to writers and locally sort by event_ts
  ALTER TABLE logs WRITE DISTRIBUTED BY PARTITION
  LOCALLY ORDERED BY event_ts

Remove write order
  ALTER TABLE logs WRITE UNORDERED
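
WRITE ORDERED BY also accepts multiple columns and sort directions; a sketch:

  ALTER TABLE logs WRITE ORDERED BY level ASC, event_ts DESC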

Table properties

Set table properties
  ALTER TABLE table SET TBLPROPERTIES ('prop'='val')

  format-version
    Format version: 1 or 2 (must be 2 for merge-on-read)
  history.expire.max-snapshot-age-ms
    Age limit for snapshot retention
  history.expire.min-snapshots-to-keep
    Minimum number of snapshots to retain
  write.(update|delete|merge).mode
    Mode by command: copy-on-write or merge-on-read
  write.(update|delete|merge).isolation-level
    Isolation level by command: snapshot or serializable
  read.split.target-size
    Target size, in bytes, for split combining for the table
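
For example, a sketch tightening snapshot retention on the logs table (values are illustrative; the age limit is in milliseconds, here 7 days = 604800000):

  ALTER TABLE logs SET TBLPROPERTIES (
    'history.expire.max-snapshot-age-ms'='604800000',
    'history.expire.min-snapshots-to-keep'='10')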

Writes

INSERT

  INSERT INTO table SELECT id, data FROM ...
  INSERT INTO table VALUES (1, 'a'), (2, 'b'), ...

MERGE

  MERGE INTO target_table t
  USING source_changes s ON t.id = s.id
  WHEN MATCHED AND s.operation = 'delete' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
  WHEN NOT MATCHED THEN INSERT (t.id, t.count)
    VALUES (s.id, s.count)

For performance, add filters to the ON clause for the target table
  ON t.id = s.id AND t.event_ts >= date_add(current_date(), -2)

MERGE uses write.merge.mode: copy-on-write vs merge-on-read
Note: When in doubt, use copy-on-write for the best read performance

To enable merge-on-read:
  ALTER TABLE target_table SET TBLPROPERTIES (
    'format-version'='2',
    'write.merge.mode'='merge-on-read')
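
To verify which mode a table currently uses, the standard Spark command applies (assuming the target_table from above):

  SHOW TBLPROPERTIES target_table ('write.merge.mode')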

UPDATE

  UPDATE table SET count = count + 1 WHERE id = 5

DELETE FROM

  DELETE FROM table WHERE id = 5

Dataframe writes

Create a writer
  writer = df.writeTo(tableName)
Note: In catalogs with multiple formats, add .using("iceberg")

Create from dataframe
  df.writeTo("catalog.db.table").partitionedBy($"col").create()

Append
  df.writeTo("catalog.db.table").append()

Overwrite
  df.writeTo("catalog.db.table").overwrite($"report_date" === d)
  df.writeTo("catalog.db.table").overwritePartitions()


Queries & metadata tables

Simple select example

  SELECT count(1) as row_count FROM logs
  WHERE event_ts >= date_add(current_date(), -7)
    AND event_ts < current_date()

Note: Filters automatically select files using partitions and value stats

Metadata tables

  -- lists all tags and branches
  db.table.refs
  -- all known revisions of the table
  db.table.snapshots
  -- history of the main branch
  db.table.history

Note: Must be loaded using the full table name
Others: partitions, manifests, files, data_files, delete_files

Inspecting tables
  DESCRIBE db.table
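
For example, recent revisions can be listed from the snapshots metadata table (committed_at, snapshot_id, and operation are standard columns of that table):

  SELECT committed_at, snapshot_id, operation
  FROM db.table.snapshots
  ORDER BY committed_at DESC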

Time travel

  SELECT ... FROM table FOR VERSION AS OF ref_or_id

  SELECT ... FROM table
  FOR TIMESTAMP AS OF '2022-04-14 11:00:00-07:00'

  -- Also works with metadata tables

Loading a table from a metadata file

  df = spark.read.format("iceberg").load(
    "s3://bucket/path/to/metadata.json")

Metadata columns

  _file       The file location containing the record
  _pos        The position within _file of the record
  _partition  The partition tuple used to store the record
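
Metadata columns can be selected alongside regular columns; a sketch using the logs table from above:

  SELECT _file, _pos, level, msg FROM logs WHERE level = 'ERROR'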

Functions

Call Iceberg transform functions
  SELECT catalog.system.truncate(10, name) FROM table
  SELECT catalog.system.bucket(16, id) FROM table

Inspect the Iceberg library version
  SELECT catalog.system.iceberg_version() as version

Stored procedures

Basic syntax
  CALL system.procedure_name(named_arg => value, ...)

Compaction

Compact data and rewrite all delete files
  CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    where => 'col1 = "value"',
    options => map('min-input-files', '2',
                   'delete-file-threshold', '1'))

Compact and sort
  CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'col1, col2 desc')

Compact and sort using z-order
  CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'zorder(col1, col2)')

Optimize table metadata
  CALL catalog.system.rewrite_manifests(table => 'table')

Roll back to previous snapshot or time
  CALL catalog.system.rollback_to_snapshot(
    table => 'table_name',
    snapshot_id => 9180664844100633321)

  CALL catalog.system.rollback_to_timestamp(
    table => 'table_name',
    timestamp => TIMESTAMP '2023-01-01 00:00:00.000')

iceberg.apache.org • spark.apache.org • tabular.io • docs.tabular.io • v0.4.4
