Tabular Iceberg-Spark Cheat-Sheet

This document provides an overview of Iceberg's capabilities for creating and altering tables, inserting and merging data, and working with catalogs and metadata tables in Spark SQL. It describes Iceberg's support for primitive and nested data types, partitioning transforms, schema evolution operations, and writing data from DataFrames.

Iceberg Spark 3.3

CREATE and ALTER TABLE

Example syntax

  CREATE TABLE IF NOT EXISTS logs (
    level string, event_ts timestamp, msg string, ...)
  USING iceberg
  PARTITIONED BY (level, hours(event_ts))
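
Iceberg tables can also be created directly from a query (CREATE TABLE ... AS SELECT). A minimal sketch, assuming a hypothetical source table raw_logs with matching columns:

  CREATE TABLE IF NOT EXISTS logs
  USING iceberg
  PARTITIONED BY (level, hours(event_ts))
  AS SELECT level, event_ts, msg FROM raw_logs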

Catalogs

Configure a catalog, called "sandbox"

  spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog
  spark.sql.catalog.sandbox.type=rest
  spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws
  spark.sql.catalog.sandbox.warehouse=sandbox
  spark.sql.catalog.sandbox.credential=...
  spark.sql.defaultCatalog=sandbox
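
Once a catalog is configured, tables are addressed with three-part names (catalog.database.table). A sketch, assuming a hypothetical examples.logs table in the sandbox catalog:

  SELECT count(1) FROM sandbox.examples.logs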

Working with multiple catalogs in SQL

See the session's current catalog and database
  SHOW CURRENT NAMESPACE

Set the current catalog and database
  USE sandbox.examples

List databases and tables
  SHOW DATABASES
  SHOW TABLES

Supported types

Primitive types:
  boolean, int, bigint, float, double, decimal(P,S),
  date, timestamp, string, binary
Note: Spark's timestamp type is Iceberg's timestamp with time zone type

Nested types:
  struct<name type, ...>, array<item_type>, map<key_type, value_type>
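
A sketch combining these types in a single table definition (hypothetical events table, illustrative only):

  CREATE TABLE IF NOT EXISTS events (
    id bigint,
    attrs map<string, string>,
    tags array<string>,
    location struct<lat float, long float>)
  USING iceberg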

Supported partition transforms

  column                Partition by the unmodified column value
  years(event_ts)       Year granularity, e.g. 2023
  months(event_ts)      Month granularity, e.g. 2023-03
  days(event_ts)        Day granularity, e.g. 2023-03-01
  hours(event_ts)       Hour granularity, e.g. 2023-03-01-10
  truncate(width, col)  Truncate strings or numbers in col
  bucket(width, col)    Hash col values into width buckets
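
For example, a sketch of a table partitioned with bucket and truncate transforms (hypothetical orders table, illustrative only):

  CREATE TABLE IF NOT EXISTS orders (
    id bigint, sku string, order_ts timestamp)
  USING iceberg
  PARTITIONED BY (bucket(16, id), truncate(4, sku), days(order_ts))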

Schema evolution (ALTER TABLE table ...)

  ADD COLUMN line_no int AFTER event_ts

  -- widen type (int to bigint, float to double, etc.)
  ALTER COLUMN line_no TYPE bigint

  ALTER COLUMN line_no COMMENT 'Line number'
  ALTER COLUMN line_no FIRST
  ALTER COLUMN line_no AFTER event_ts
  RENAME COLUMN msg TO message
  DROP COLUMN line_no

Adding/updating nested types
  ADD COLUMN location struct<lat float, long float>
  ADD COLUMN location.altitude float
Note: ALTER COLUMN can't modify struct types; add or update nested fields individually
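
These clauses attach to a full ALTER TABLE statement. For example, using the logs table from above:

  ALTER TABLE logs ADD COLUMN line_no int AFTER event_ts
  ALTER TABLE logs ALTER COLUMN line_no TYPE bigint
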
Alter partition spec

  ALTER TABLE ... ADD PARTITION FIELD days(event_ts) AS day
  ALTER TABLE ... DROP PARTITION FIELD days(event_ts)
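
Partition evolution is a metadata-only change: existing data files keep their old layout and only newly written data uses the new spec. A sketch switching the logs table from hourly to daily partitioning (illustrative only):

  ALTER TABLE logs DROP PARTITION FIELD hours(event_ts)
  ALTER TABLE logs ADD PARTITION FIELD days(event_ts) AS day
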
Setting distribution and sort order

Globally sort by event_ts
  ALTER TABLE logs WRITE ORDERED BY event_ts

Distribute by partitions to writers and locally sort by event_ts
  ALTER TABLE logs WRITE DISTRIBUTED BY PARTITION
  LOCALLY ORDERED BY event_ts

Remove write order
  ALTER TABLE logs WRITE UNORDERED
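
WRITE ORDERED BY also accepts multiple columns and sort directions; a sketch:

  ALTER TABLE logs WRITE ORDERED BY level ASC, event_ts DESC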

Table properties

Set table properties
  ALTER TABLE table SET TBLPROPERTIES ('prop'='val')

  format-version
    Format version: 1 or 2 (must be 2 for merge-on-read)
  history.expire.max-snapshot-age-ms
    Age limit for snapshot retention
  history.expire.min-snapshots-to-keep
    Minimum number of snapshots to retain
  write.(update|delete|merge).mode
    Mode by command: copy-on-write or merge-on-read
  write.(update|delete|merge).isolation-level
    Isolation level by command: snapshot or serializable
  read.split.target-size
    Target size, in bytes, for split combining for the table
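
For example, a sketch tightening snapshot retention on the logs table (values are illustrative; the age limit is in milliseconds, here 7 days = 604800000):

  ALTER TABLE logs SET TBLPROPERTIES (
    'history.expire.max-snapshot-age-ms'='604800000',
    'history.expire.min-snapshots-to-keep'='10')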

Writes

INSERT

  INSERT INTO table SELECT id, data FROM ...
  INSERT INTO table VALUES (1, 'a'), (2, 'b'), ...

MERGE

  MERGE INTO target_table t
  USING source_changes s ON t.id = s.id
  WHEN MATCHED AND s.operation = 'delete' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
  WHEN NOT MATCHED THEN INSERT (t.id, t.count)
    VALUES (s.id, s.count)

For performance, add filters to the ON clause for the target table
  ON t.id = s.id AND t.event_ts >= date_add(current_date(), -2)

MERGE uses write.merge.mode: copy-on-write vs merge-on-read
Note: When in doubt, use copy-on-write for the best read performance

To enable merge-on-read:
  ALTER TABLE target_table SET TBLPROPERTIES (
    'format-version'='2',
    'write.merge.mode'='merge-on-read')
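
To verify which mode a table currently uses, the standard Spark command applies (assuming the target_table from above):

  SHOW TBLPROPERTIES target_table ('write.merge.mode')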

UPDATE

  UPDATE table SET count = count + 1 WHERE id = 5

DELETE FROM

  DELETE FROM table WHERE id = 5

Dataframe writes

Create a writer
  writer = df.writeTo(tableName)
Note: In catalogs with multiple formats, add .using("iceberg")

Create from dataframe
  df.writeTo("catalog.db.table").partitionedBy($"col").create()

Append
  df.writeTo("catalog.db.table").append()

Overwrite
  df.writeTo("catalog.db.table").overwrite($"report_date" === d)
  df.writeTo("catalog.db.table").overwritePartitions()


Queries & metadata tables

Simple select example

  SELECT count(1) as row_count FROM logs
  WHERE event_ts >= date_add(current_date(), -7)
    AND event_ts < current_date()

Note: Filters automatically select files using partitions and value stats

Metadata tables

  -- lists all tags and branches
  db.table.refs
  -- all known revisions of the table
  db.table.snapshots
  -- history of the main branch
  db.table.history

Note: Must be loaded using the full table name
Others: partitions, manifests, files, data_files, delete_files

Inspecting tables
  DESCRIBE db.table
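
For example, recent revisions can be listed from the snapshots metadata table (committed_at, snapshot_id, and operation are standard columns of that table):

  SELECT committed_at, snapshot_id, operation
  FROM db.table.snapshots
  ORDER BY committed_at DESC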

Time travel

  SELECT ... FROM table FOR VERSION AS OF ref_or_id

  SELECT ... FROM table
  FOR TIMESTAMP AS OF '2022-04-14 11:00:00-07:00'

  -- Also works with metadata tables

Loading a table from a metadata file

  df = spark.read.format("iceberg").load(
    "s3://bucket/path/to/metadata.json")

Metadata columns

  _file       The file location containing the record
  _pos        The position within _file of the record
  _partition  The partition tuple used to store the record
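
Metadata columns can be selected alongside regular columns; a sketch using the logs table from above:

  SELECT _file, _pos, level, msg FROM logs WHERE level = 'ERROR'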

Functions

Call Iceberg transform functions
  SELECT catalog.system.truncate(10, name) FROM table
  SELECT catalog.system.bucket(16, id) FROM table

Inspect the Iceberg library version
  SELECT catalog.system.iceberg_version() as version

Stored procedures

Basic syntax
  CALL system.procedure_name(named_arg => value, ...)

Compaction

Compact data and rewrite all delete files
  CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    where => 'col1 = "value"',
    options => map('min-input-files', '2',
                   'delete-file-threshold', '1'))

Compact and sort
  CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'col1, col2 desc')

Compact and sort using z-order
  CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'zorder(col1, col2)')

Optimize table metadata
  CALL catalog.system.rewrite_manifests(table => 'table')

Roll back to previous snapshot or time
  CALL catalog.system.rollback_to_snapshot(
    table => 'table_name',
    snapshot_id => 9180664844100633321)

  CALL catalog.system.rollback_to_timestamp(
    table => 'table_name',
    timestamp => TIMESTAMP '2023-01-01 00:00:00.000')

iceberg.apache.org • spark.apache.org • tabular.io • docs.tabular.io • v0.4.4
