Iceberg Spark 3.3

CREATE AND ALTER TABLE

Example syntax
    CREATE TABLE IF NOT EXISTS logs (
        level string, event_ts timestamp, msg string, ...)
    USING iceberg
    PARTITIONED BY (level, hours(event_ts))

Catalogs
Configure a catalog, called “sandbox”
    spark.sql.catalog.sandbox=\
        org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.sandbox.type=rest
    spark.sql.catalog.sandbox.uri=\
        https://api.tabular.io/ws
    spark.sql.catalog.sandbox.warehouse=sandbox
    spark.sql.catalog.sandbox.credential=...
    spark.sql.defaultCatalog=sandbox
Working with multiple catalogs in SQL
See the session’s current catalog and database
    SHOW CURRENT DATABASE
Set the current catalog and database
    USE sandbox.examples
List databases and tables
    SHOW DATABASES
    SHOW TABLES
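With spark.sql.defaultCatalog set, unqualified names resolve through the default catalog, while fully qualified names work regardless. A minimal sketch (the examples database and logs table are the ones used elsewhere on this card):
    -- Both statements read the same table when sandbox is the default catalog
    SELECT count(1) FROM examples.logs
    SELECT count(1) FROM sandbox.examples.logs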
Supported types
Primitive types:
    boolean, int, bigint, float, double, decimal(P,S),
    date, timestamp, string, binary
Note: Spark’s timestamp type is Iceberg’s timestamp with time zone type
Nested types:
    struct<name type, ...>, array<item_type>,
    map<key_type, value_type>

Supported partition transforms
    column                Partition by the unmodified column value
    years(event_ts)       Year granularity, e.g. 2023
    months(event_ts)      Month granularity, e.g. 2023-03
    days(event_ts)        Day granularity, e.g. 2023-03-01
    hours(event_ts)       Hour granularity, e.g. 2023-03-01-10
    truncate(width, col)  Truncate strings or numbers in col
    bucket(width, col)    Hash col values into width buckets
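Transforms can be combined in a single PARTITIONED BY clause; a sketch (the events table and its columns are hypothetical):
    CREATE TABLE IF NOT EXISTS sandbox.examples.events (
        id bigint, account string, event_ts timestamp)
    USING iceberg
    PARTITIONED BY (days(event_ts), bucket(16, id), truncate(4, account))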
Schema evolution (ALTER TABLE table ...)
    ADD COLUMN line_no int AFTER event_ts
    -- widen type (int to bigint, float to double, etc.)
    ALTER COLUMN line_no TYPE bigint
    ALTER COLUMN line_no COMMENT 'Line number'
    ALTER COLUMN line_no FIRST
    ALTER COLUMN line_no AFTER event_ts
    RENAME COLUMN msg TO message
    DROP COLUMN line_no

Adding/updating nested types
    ADD COLUMN location struct<lat float, long float>
    ADD COLUMN location.altitude float
Note: ALTER COLUMN can’t modify struct types
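Once added, nested fields are addressed with dot notation in queries; a minimal sketch, assuming the location struct above was added to logs:
    SELECT location.lat, location.long FROM logs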
Alter partition spec
    ALTER TABLE ... ADD PARTITION FIELD days(event_ts) AS day
    ALTER TABLE ... DROP PARTITION FIELD days(event_ts)

Setting distribution and sort order
Globally sort by event_ts
    ALTER TABLE logs WRITE ORDERED BY event_ts
Distribute by partitions to writers and locally sort by event_ts
    ALTER TABLE logs WRITE DISTRIBUTED BY PARTITION
        LOCALLY ORDERED BY event_ts
Remove write order
    ALTER TABLE logs WRITE UNORDERED
WRITES

INSERT
    INSERT INTO table SELECT id, data FROM ...
    INSERT INTO table VALUES (1, 'a'), (2, 'b'), ...

MERGE
    MERGE INTO target_table t
    USING source_changes s ON t.id = s.id
    WHEN MATCHED AND s.operation = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
    WHEN NOT MATCHED THEN INSERT (t.id, t.count)
        VALUES (s.id, s.count)
For performance, add filters to the ON clause for the target table
    ON t.id = s.id AND t.event_ts >= date_add(current_date(), -2)
Uses write.merge.mode: copy-on-write vs merge-on-read
Note: When in doubt, use copy-on-write for the best read performance
To enable merge-on-read:
    ALTER TABLE target_table SET TBLPROPERTIES (
        'format-version'='2',
        'write.merge.mode'='merge-on-read')

UPDATE
    UPDATE table SET count = count + 1 WHERE id = 5

DELETE FROM
    DELETE FROM table WHERE id = 5
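In copy-on-write mode, a DELETE whose predicate aligns with partition boundaries can drop whole data files as a metadata-only operation. A sketch against this card’s logs table, which is partitioned by level:
    -- Drops entire files when every row in them matches the predicate
    DELETE FROM logs WHERE level = 'DEBUG'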
Dataframe writes
Create a writer
    writer = df.writeTo(tableName)
Note: In catalogs with multiple formats, add .using("iceberg")
Create from dataframe
    df.writeTo("catalog.db.table").partitionedBy($"col").create()
Append
    df.writeTo("catalog.db.table").append()
Overwrite
    df.writeTo("catalog.db.table").overwrite($"report_date" === d)
    df.writeTo("catalog.db.table").overwritePartitions()
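A SQL counterpart to overwritePartitions() is dynamic INSERT OVERWRITE; a sketch, assuming the session uses dynamic partition overwrite (spark.sql.sources.partitionOverwriteMode=dynamic) and a hypothetical staged_logs source:
    -- Replaces only the partitions that receive new rows
    INSERT OVERWRITE logs
    SELECT level, event_ts, msg FROM staged_logs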
QUERIES & METADATA TABLES

Simple select example
    SELECT count(1) as row_count FROM logs
    WHERE event_ts >= date_add(current_date(), -7)
      AND event_ts < current_date()
Note: Filters automatically select files using partitions and value stats

Metadata tables
    -- lists all tags and branches
    db.table.refs
    -- all known revisions of the table
    db.table.snapshots
    -- history of the main branch
    db.table.history
Note: Must be loaded using the full table name
Others: partitions, manifests, files, data_files, delete_files
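Metadata tables are queried like ordinary tables; a small sketch, assuming a db.table from the list above:
    -- Most recent snapshots and the operation that produced each
    SELECT committed_at, snapshot_id, operation
    FROM db.table.snapshots
    ORDER BY committed_at DESC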
Inspecting tables
    DESCRIBE db.table

Time travel
    SELECT ... FROM table FOR VERSION AS OF ref_or_id
    SELECT ... FROM table
        FOR TIMESTAMP AS OF '2022-04-14 11:00:00-07:00'
    -- Also works with metadata tables
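Since ref_or_id accepts either a snapshot ID or a named reference, a tag or branch name can stand in for the numeric ID; a sketch with a hypothetical tag:
    -- Pin a query to a tagged snapshot (tag name is illustrative)
    SELECT count(1) FROM logs FOR VERSION AS OF 'end-of-q1'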
Loading a table from a metadata file
    df = spark.read.format("iceberg").load(
        "s3://bucket/path/to/metadata.json")

Metadata columns
    _file       The file location containing the record
    _pos        The position within _file of the record
    _partition  The partition tuple used to store the record
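Metadata columns can be selected alongside data columns, which is handy for tracing rows back to files; a minimal sketch against logs:
    SELECT _file, _pos, level, msg FROM logs WHERE level = 'ERROR'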
Functions
Call Iceberg transform functions
    SELECT catalog.system.truncate(10, name) FROM table
    SELECT catalog.system.bucket(16, id) FROM table
Inspect the Iceberg library version
    SELECT catalog.system.iceberg_version() as version

Table properties
Set table properties
    ALTER TABLE table SET TBLPROPERTIES ('prop'='val')

    format-version                               Format version: 1 or 2
                                                 Note: Must be 2 for merge-on-read
    history.expire.max-snapshot-age-ms           Age limit for snapshot retention
    history.expire.min-snapshots-to-keep         Minimum number of snapshots to retain
    write.(update|delete|merge).mode             Mode by command: copy-on-write or merge-on-read
    write.(update|delete|merge).isolation-level  Isolation level by command: snapshot or serializable
    read.split.target-size                       Target size, in bytes, when combining input splits
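For example, snapshot retention could be tightened like this; the seven-day limit (604800000 ms) and the 50-snapshot floor are illustrative values, not defaults:
    ALTER TABLE logs SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms'='604800000',  -- 7 days
        'history.expire.min-snapshots-to-keep'='50')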
STORED PROCEDURES

Basic syntax
    CALL system.procedure_name(named_arg => value, ...)

Compaction
Compact data and rewrite all delete files
    CALL catalog.system.rewrite_data_files(
        table => 'table_name',
        where => 'col1 = "value"',
        options => map('min-input-files', '2',
            'delete-file-threshold', '1'))
Compact and sort
    CALL catalog.system.rewrite_data_files(
        table => 'table_name',
        strategy => 'sort',
        sort_order => 'col1, col2 desc')
Compact and sort using z-order
    CALL catalog.system.rewrite_data_files(
        table => 'table_name',
        strategy => 'sort',
        sort_order => 'zorder(col1, col2)')

Optimize table metadata
    CALL catalog.system.rewrite_manifests(table => 'table')

Roll back to previous snapshot or time
    CALL catalog.system.rollback_to_snapshot(
        table => 'table_name',
        snapshot_id => 9180664844100633321)
    CALL catalog.system.rollback_to_timestamp(
        table => 'table_name',
        timestamp => TIMESTAMP '2023-01-01 00:00:00.000')
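A rollback changes the table’s current snapshot, so it can be verified with the history metadata table described earlier:
    -- The newest row should show the snapshot that was rolled back to
    SELECT made_current_at, snapshot_id, is_current_ancestor
    FROM db.table.history
    ORDER BY made_current_at DESC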
iceberg.apache.org • spark.apache.org
tabular.io • docs.tabular.io
v0.4.4