DuckDB

Test-Driving the Lance Lakehouse Format in DuckDB

2026-05-21T00:00:00+00:00

With the lance extension, DuckDB users can query Lance datasets through the same familiar SQL interface (via the CLI or SDKs) and gain capabilities for AI and retrieval workloads. This blog post shows why Lance is a good option for workloads that need to store and query vectors, perform rich table operations, and serve AI-oriented access patterns, while still supporting scan-friendly analytical workloads at scale. With DuckDB, querying those kinds of datasets in SQL becomes straightforward.

This post refers to “retrieval workloads” and to “AI data patterns” or “AI datasets”. By “retrieval workloads,” we mean queries that find rows by similarity or keyword relevance, such as vector search and full-text search, rather than by exact filters or aggregations. By “AI data patterns,” we mean datasets that mix embeddings, images, or audio alongside scalar metadata.

What is Lance?

Lance is an open lakehouse format designed for modern ML and AI workloads. Unlike Parquet, Lance is a file format, a table format, and a lightweight catalog spec all at once. At the table format level, Lance supports versioning, schema evolution, indexes, and transactional updates through MVCC and ACID-style semantics. In practice, this means Lance is built for datasets that change over time and need more than read-only scans.

This matters because many AI datasets are no longer just rows of scalar values. They often contain embeddings, long-form text, images, audio, metadata for filtering, and indexes used for retrieval. A format that works well for these workloads needs to do more than store and scan columns efficiently. It also needs to support search, updates, and lifecycle operations without forcing users into managing multiple different systems.

The mental model is still familiar to users coming from Parquet: columnar data in an open format, queried with standard analytical tools. Lance's fragment-based layout stores data in small columnar chunks. This design enables efficient random access without trade-offs in scan performance or memory utilization, something that has historically been difficult to achieve in columnar formats, and that the Lance team examines in this 2025 paper on adaptive structural encodings.

On the data evolution side, adding columns or backfilling existing rows with new data only writes new files without touching existing ones. That makes schema changes lightweight in practice, which is useful for workflows where columns are added incrementally, such as appending derived features or embeddings to an existing dataset.
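
To make the idea concrete, here is a toy Python model of a dataset split into fragments, where adding a column only creates new files. The `Fragment` class, the `add_column` helper, and the file names are hypothetical and chosen for illustration; they do not mirror Lance's actual on-disk layout or API.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Toy stand-in for a fragment: a row range backed by immutable files."""
    fragment_id: int
    num_rows: int
    # Maps a data file name to the columns it stores (hypothetical naming).
    files: dict = field(default_factory=dict)


def add_column(fragments, column):
    """Attach a new column by writing one new file per fragment.

    Existing files are never rewritten; the function returns the names of
    the files it created.
    """
    created = []
    for frag in fragments:
        name = f"frag{frag.fragment_id}-{column}.data"
        frag.files[name] = [column]
        created.append(name)
    return created


fragments = [
    Fragment(0, 1000, {"frag0-base.data": ["id", "caption"]}),
    Fragment(1, 1000, {"frag1-base.data": ["id", "caption"]}),
]
# Remember what each fragment looked like before the schema change.
before = {f.fragment_id: dict(f.files) for f in fragments}
new_files = add_column(fragments, "embedding")
print(new_files)
```

In this sketch the cost of the schema change is proportional to the new column only, because the base files are left untouched.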

The Lance DuckDB Extension

The lance extension brings Lance into DuckDB as part of the SQL-based workflow. You can read Lance datasets directly, write to them via COPY, attach them as table namespaces, build indexes, and query them with regular DuckDB SQL. On top of that, the extension exposes Lance-native search functionality through SQL table functions.

This fits naturally with how DuckDB is already used: as a single, embedded SQL query engine that operates on many different data sources and file formats. With the Lance extension, DuckDB remains the familiar query engine, while Lance provides the storage, indexing, and search capabilities underneath, which is especially beneficial when your data is multimodal and includes embeddings.

Example Usage

Installing and using the extension is straightforward:

INSTALL lance;
LOAD lance;

SELECT *
FROM 'path/to/dataset.lance'
LIMIT 10;

DuckDB can also write Lance datasets directly:

COPY (
    SELECT *
    FROM (
        VALUES
            (1::BIGINT, 'duck', [0.9, 0.7, 0.1]::FLOAT[3]),
            (2::BIGINT, 'horse', [0.3, 0.1, 0.5]::FLOAT[3]),
            (3::BIGINT, 'dragon', [0.5, 0.2, 0.7]::FLOAT[3])
        ) AS t(id, animal, vec)
) TO 'path/to/out.lance' (FORMAT lance, mode 'overwrite');

Once the data is in Lance, DuckDB can query it with Lance-native search operators. For example, hybrid search combines vector similarity and keyword relevance in one SQL query:

SELECT id, text, _hybrid_score, _distance, _score
FROM lance_hybrid_search(
    'path/to/dataset.lance',
    'vec',
    [0.1, 0.2, 0.3, 0.4]::FLOAT[4],
    'text',
    'puppy',
    k = 10,
    prefilter = false,
    alpha = 0.5,
    oversample_factor = 4
)
ORDER BY _hybrid_score DESC;
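
One common way to blend two retrieval signals is a weighted sum of normalized scores. The Python sketch below illustrates that general idea only: the min-max normalization, the exact formula, and the reading of `alpha` as the weight on the vector signal are assumptions for illustration, not a description of how the extension computes `_hybrid_score`.

```python
def min_max(values):
    """Scale values into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(rows, alpha=0.5):
    """Rank rows by a weighted blend of vector and keyword signals.

    rows: list of (row_id, distance, text_score) tuples.
    alpha: weight on the vector signal (an assumed meaning, see the text).
    """
    # Smaller distance is better, so flip it into a similarity.
    vec_sim = [1.0 - d for d in min_max([r[1] for r in rows])]
    # Larger text score is better, so it only needs normalizing.
    txt_sim = min_max([r[2] for r in rows])
    scored = [
        (row[0], alpha * v + (1.0 - alpha) * t)
        for row, v, t in zip(rows, vec_sim, txt_sim)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up (row_id, distance, text_score) values for illustration.
rows = [("a", 0.1, 0.0), ("b", 0.5, 2.0), ("c", 0.9, 1.0)]
print([row_id for row_id, _ in hybrid_rank(rows, alpha=0.5)])
```

With a blend like this, moving `alpha` toward 1 favors the closest vectors, while moving it toward 0 favors the strongest keyword matches.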

The extension also exposes lance_vector_search(...) for vector similarity search and lance_fts(...) for full-text search, so users can choose the retrieval mode that fits their workload.

If you want table-style access instead of path-based access, you can attach a directory as a Lance namespace:

ATTACH 'path/to/dir' AS ns (TYPE lance);

SELECT count(*)
FROM ns.main.my_table;

Index creation also happens through SQL. For example, a vector index can be created directly on a Lance dataset:

CREATE INDEX vec_idx ON 'path/to/dataset.lance' (vec)
USING IVF_FLAT WITH (num_partitions = 1, metric_type = 'l2');

The extension surface goes well beyond read-only scans. In the current implementation, DuckDB can:

Read Lance datasets with direct path scans
Write and append Lance datasets with COPY ... TO ... (FORMAT lance)
Run vector, full-text, and hybrid search with SQL functions
Attach local directories or custom catalogs via REST namespaces
Create, update, delete, merge, and alter tables in attached namespaces
Create and manage vector, scalar, and full-text indexes
Run maintenance operations such as compaction, cleanup, and index optimization

Note that this extension is not just a file reader: it also gives DuckDB users a way to work with Lance as an operational table format from inside SQL.

Why Lance and DuckDB?

The combination of Lance and DuckDB is compelling for three reasons.

First, it gives users one SQL surface for analytics plus retrieval. The same DuckDB workflow can scan a dataset, filter it, join it with other tables, compute aggregates, and then run vector search or hybrid search over the result set. That is a good fit for AI applications where retrieval is only one step in a larger analytical pipeline.

Second, Lance is a table format for more than traditional analytics. Many AI pipelines need versioned datasets, updates, deletes, MERGE-style changes, index management, and schema evolution. The DuckDB extension exposes these capabilities through SQL, which means users do not need to leave the DuckDB environment just because their dataset is doing more than serving analytical reads.

Third, the workflow scales from local files to remote storage without changing the mental model. You can start with a local Lance dataset, then move to object storage.

The extension also supports REST namespaces, so DuckDB can connect to a remote Lance catalog (including LanceDB Enterprise) and treat it like an attached database. That makes the local-to-remote storage progression feel incremental rather than disruptive.

To sum up, DuckDB remains the familiar SQL engine, while Lance adds storage and indexing features that are especially useful when the same dataset powers both analytics and retrieval.

Performance Experiment

LAION is an open dataset of image/caption pairs scraped from the public web, originally released to support research on models like CLIP, which learn a shared embedding space for images and text. The full release spans billions of pairs. For this experiment, we used the lance-format/laion-1m subset on Hugging Face Hub, which is easy to reproduce locally.

Each row carries a caption, a 768-dimensional CLIP image embedding, the raw image bytes, and scalar metadata like width, height, and NSFW flags. This mix of scalar, text, vector, and blob data in a single table makes it a useful workload for comparing formats, and it is structurally different from the wide-but-flat schemas like TPC-H or ClickBench that are traditionally used for analytical benchmarks.

The public Hugging Face export used by the benchmark currently materializes 69,632 rows locally, not the full million-row source dataset. The runner first downloads the public Parquet shards, then builds all local artifacts from that same baseline: an LZ4-compressed Parquet file, an indexed DuckDB database, and a Lance dataset. Generated files are reused across runs, so the initial download is the only networked step.

The experiments were run on an Apple MacBook Pro with a 10-core M1 Max CPU and 32 GB of RAM, running DuckDB 1.5.2.

The benchmark was run using DuckDB as the query engine for the following three storage formats:

Parquet: DuckDB scanning the LZ4-compressed Parquet baseline directly, with no auxiliary indexes.
DuckDB indexed: the same baseline loaded into a DuckDB table, with DuckDB's vss (HNSW) and fts extensions layered on top, plus scalar indexes on filter columns. This is the typical “build it yourself in DuckDB” stack.
Lance native: the same baseline written to a Lance dataset with a vector index, a full-text index, and native blob storage, queried through the DuckDB lance extension.

The workloads are aligned by task across the three paths, even though the exact SQL differs by storage/indexing backend:

fts: find rows by keyword search over the caption text.
vector_exact: run nearest-neighbor search over the CLIP embedding column without using an approximate vector index.
vector_indexed: run nearest-neighbor search over the same embedding column using the available vector index.
hybrid: combine text search and vector search into one retrieval query, returning the best-ranked matches from both signals.
blob_read: fetch image bytes for selected rows, which exercises random access to large binary values rather than just scalar or vector columns.
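
For readers unfamiliar with the vector_exact task, exact nearest-neighbor search is a brute-force scan: compute the distance from the query to every row, sort, and keep the closest k. The Python sketch below shows this on the three sample rows from the COPY example earlier in the post. The query vector and the choice of L2 distance are assumptions for illustration; the benchmark itself runs SQL over 768-dimensional embeddings.

```python
import math

# The three sample rows written by the COPY example earlier in the post.
rows = [
    (1, "duck", [0.9, 0.7, 0.1]),
    (2, "horse", [0.3, 0.1, 0.5]),
    (3, "dragon", [0.5, 0.2, 0.7]),
]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(rows, query, k):
    """Brute force: score every row, sort by distance, keep the k closest."""
    scored = [(l2_distance(vec, query), row_id, name) for row_id, name, vec in rows]
    return sorted(scored)[:k]


nearest = exact_knn(rows, [0.5, 0.2, 0.6], k=2)
print([name for _, _, name in nearest])
```

The work grows linearly with the number of rows, which is why the vector_indexed workload uses an approximate index instead.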

Each workload is run five times by default, and the tables below report the average. The full scripts and SQL queries are in the laion_1m benchmark directory.
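
The measurement approach can be sketched in a few lines of Python. This is not the benchmark runner itself: the `average_ms` helper and the dummy workload are hypothetical stand-ins that only show the "run five times, report the average" procedure.

```python
import statistics
import time


def average_ms(workload, runs=5):
    """Run `workload` `runs` times and return the mean wall-clock time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


# A dummy workload that records each invocation.
calls = []
mean_ms = average_ms(lambda: calls.append(sum(range(10_000))), runs=5)
print(f"{mean_ms:.3f} ms averaged over {len(calls)} runs")
```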

Cold Results

The table below reports each workload run cold, in a fresh DuckDB process, so it captures process startup, file open, and first-query cost. This is closest to what a one-off script or a cron job would see.

Workload	Parquet	DuckDB indexed	Lance native
`fts`	12 ms	11 ms	21 ms
`vector_exact`	695 ms	61 ms	89 ms
`vector_indexed`	761 ms	104 ms	12 ms
`hybrid`	465 ms	80 ms	17 ms
`blob_read`	1559 ms	271 ms	278 ms

In the cold run, Lance stands out in the vector_indexed and hybrid workloads. DuckDB’s own format does well in vector_exact and fts, while the blob_read workload is pretty much on par. Parquet is not well optimized for vector searches or blob reads, but does well on a simple text search powered by regex.
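
To put numbers on these observations, the following Python snippet computes relative timings from the cold-run table. The figures are copied from the table; the `ratio` helper is only an illustration.

```python
# Cold-run timings in milliseconds, copied from the table above.
cold_ms = {
    "fts": {"parquet": 12, "duckdb": 11, "lance": 21},
    "vector_exact": {"parquet": 695, "duckdb": 61, "lance": 89},
    "vector_indexed": {"parquet": 761, "duckdb": 104, "lance": 12},
    "hybrid": {"parquet": 465, "duckdb": 80, "lance": 17},
    "blob_read": {"parquet": 1559, "duckdb": 271, "lance": 278},
}


def ratio(workload, slower, faster):
    """How many times longer `slower` takes than `faster` on a workload."""
    timings = cold_ms[workload]
    return timings[slower] / timings[faster]


for name in cold_ms:
    print(f"{name}: DuckDB indexed / Lance = {ratio(name, 'duckdb', 'lance'):.2f}")
```

A ratio above 1 means Lance was faster. For the cold run this gives roughly 8.7 for vector_indexed and 4.7 for hybrid, and values below 1 for fts, vector_exact, and blob_read.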

Warm Results

The warm results are from running all workloads in a single DuckDB session after a silent warmup pass, so caches, memory-mapped pages, and loaded indexes are already primed.

Workload	Parquet	DuckDB indexed	Lance native
`fts`	12 ms	10 ms	7 ms
`vector_exact`	703 ms	30 ms	50 ms
`vector_indexed`	755 ms	2 ms	5 ms
`hybrid`	471 ms	11 ms	8 ms
`blob_read`	1484 ms	266 ms	276 ms

When caches and indexes are already warm, both DuckDB and Lance are significantly faster than using Parquet on retrieval workloads.

Conclusion

Lance is a relatively new addition to the world of open lakehouse formats. It is designed for datasets that change over time, contain more than scalar values, and need to support both search and retrieval alongside traditional scan workloads. From DuckDB, the extension makes these capabilities available through SQL, while preserving the familiar embedded workflow. The benchmark results reflect, particularly in cold runs, how Lance is a good alternative to DuckDB’s own format for vector and hybrid search.

The Lance support in DuckDB was made possible through a collaboration between DuckLabs and LanceDB.

DuckDB 1.5.3: Not an Ordinary Patch Release

2026-05-20T00:00:00+00:00

In this blog post, we highlight a few important features shipped in DuckDB v1.5.3, the third patch release in DuckDB's v1.5 line. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

What's New

While DuckDB v1.5.3 is a patch release, its extensions bring various new features. We list these below.

Quack as a Core Extension

On May 12, we introduced Quack, our new remote protocol that turns DuckDB into a client-server database. If you are new to Quack and don't know where to start, check out the following resources:

For a high-level overview, see the Quack explainer page.
For the rationale and history behind Quack, along with an introduction of the protocol and its features, see the announcement blog post.
For the reference manual and setup guide, check out the Quack documentation.

Starting from DuckDB v1.5.3, we ship Quack as a core extension. This means that you can now start using Quack right away from any client running DuckDB: it will be transparently autoinstalled and autoloaded on first use.

DuckDB Server

CALL quack_serve(
    'quack:localhost',
    token = 'super_secret'
);

CREATE TABLE hello AS
    FROM VALUES ('world') v(s);

DuckDB Client

CREATE SECRET (
    TYPE quack,
    TOKEN 'super_secret'
);

ATTACH 'quack:localhost' AS remote;
FROM remote.hello;

Please note that Quack is still in beta state and breaking changes may happen in the protocol, in function names, etc. We plan to release the production-ready version of Quack together with DuckDB v2.0 in fall 2026.

DuckLake with Quack

DuckLake now supports DuckDB with Quack as its catalog database (ducklake#1151). Let the example speak for itself!

DuckDB Server

CALL quack_serve(
    'quack:localhost',
    token => 'oogieboogie'
);

DuckDB Client

INSTALL ducklake;

CREATE SECRET (
    TYPE quack, TOKEN 'oogieboogie'
);
ATTACH 'ducklake:quack:localhost'
    AS lake (DATA_PATH 'data');
USE lake;

CREATE TABLE pond (
    id INT,
    species VARCHAR,
    weight DOUBLE
);
INSERT INTO pond VALUES
    (1, 'mallard', 1.2),
    (2, 'pintail', 0.9);
INSERT INTO pond VALUES
    (3, 'wood duck', 0.7);
SELECT * FROM pond ORDER BY id;

AWS Extension Features

The AWS extension now supports the web_identity chain type for IAM Roles for Service Accounts (IRSA) support. This was made possible through a contribution by community member Marcel Steinbach (@mst).

The AWS extension now also supports IAM authentication for managed PostgreSQL databases running on RDS/Aurora. For more details, see the AWS RDS IAM Authentication section in the documentation.

`HTTP_PROXY` Variable for the httpfs Extension

Setting the HTTP_PROXY environment variable now sets the http_proxy DuckDB configuration option (duckdb#22541). This makes sure that extension installs also pass through the proxy, which can come in handy, for example, in environments behind firewalls.

Note that since the introduction of curl into DuckDB's network stack, curl automatically uses HTTP_PROXY and HTTPS_PROXY, so DuckDB now implicitly honors those variables when the httpfs extension is loaded with the default curl backend.

Iceberg

The DuckDB-Iceberg extension has shipped a number of features between DuckDB v1.5.2 and v1.5.3. Most notably:

MERGE INTO is now supported for Iceberg tables (iceberg#788)
The INSERT and UPDATE statements are now supported on partitioned Iceberg tables with a truncate or bucket transform (iceberg#879)
CTAS statements in DuckDB-Iceberg using ADBC are now possible (iceberg#974)
We added the iceberg_schema_properties, set_iceberg_schema_properties, and remove_iceberg_schema_properties functions to allow getting, setting, and removing Iceberg schema properties (iceberg#960)
ALTER TABLE support has been added for Iceberg tables (iceberg#932, iceberg#928, iceberg#924, iceberg#912, iceberg#904, iceberg#853, iceberg#985, iceberg#981)
Support for the GEOMETRY type has been added for Iceberg tables (iceberg#968, iceberg#902)

Development and Internals

Shipping jemalloc as a Statically Linked Library

The jemalloc allocator is now part of core DuckDB (duckdb#22603) as a static third-party library that is included and linked by default on Linux. Previously, jemalloc was a statically linked extension. The new packaging is cleaner, since other DuckDB extensions can be loaded dynamically.

`DISABLE_EXTENSION_LOAD` Flag

The DISABLE_EXTENSION_LOAD compile-time flag was fixed in duckdb#22019. When compiling DuckDB with this flag, loading extensions is disabled.

Coming Up

We have two events coming up in the next few weeks:

DuckCon #7. On June 24, we'll host our next user conference, DuckCon #7, in Amsterdam's beautiful Royal Tropical Institute.

Ubuntu Summit Talk. Next week, Gábor Szárnyas of DuckLabs will give a talk titled “DuckDB: Not Quack Science” at the Ubuntu Summit. Yes, his talk will include the new Quack protocol.

Conclusion

This post is a short summary of the changes in v1.5.3. As usual, you can find the full release notes on GitHub.

Quack: The DuckDB Client-Server Protocol

2026-05-12T00:00:00+00:00

Background: Database Architectures

When databases first emerged, there was no distinction between a ‘client’ and a ‘server’: the whole database just ran on a single computer. Somewhere in the 80s, Sybase was the first to introduce the concept of a database ‘server’ and a ‘client’ running on different computers. Ever since, it has just been assumed that every database system uses a client-server architecture along with a communication protocol between the two. This was convenient, because the single mutable state stays in a single place under the control of a server, and many clients can read and write data at the same time. There are of course drawbacks to this approach, most notably that those protocols can add a significant amount of overhead. If you’re curious to read more, we wrote a research paper on database protocols a while back.

Of course, there were always dissenters to the client-server architecture, most notably the ubiquitous SQLite in 2000, and of course DuckDB, first released in 2019. We made quite a lot of noise about implementing an in-process architecture, where there is no client-server, no protocol, just low-level API calls. This worked really well for interactive use cases in e.g., data science, where analysts would interact with their data for example in a Python notebook and their data was managed in a DuckDB instance running in the very same process. It also worked really well for the many use cases where DuckDB was just “glued” to an existing application to provide SQL functionality on data living in that application.

Being an in-process system works “less well” when multiple processes try to modify the same database file at the same time. There are a lot of use cases where this is relevant, for example, when inserting into the same database from a bunch of processes collecting telemetry while at the same time querying the same tables to drive a dashboard. There are very good technical reasons why we could not make this work, most notably, the fact that DuckDB keeps a bunch of state in main memory and would have to synchronize that state if multiple processes start making changes simultaneously.

And yes, there were workarounds. Of course you can whip up a custom Remote Procedure Call (RPC) solution where there is a process that holds the DuckDB database instance and offers a service to other processes to query and insert data. There are also multiple projects out there that retrofit client-server abilities to DuckDB, for example using the Arrow Flight SQL protocol. MotherDuck has their own custom client-server protocol. And of course, you can always (gasp) switch to a more traditional database system that had client-server support, for example the also-ubiquitous PostgreSQL. You can then even proceed to run a so-called “EleDucken”, DuckDB in said PostgreSQL using one of the various extensions out there that enable this, for example pg_duckdb.

The vast number of workarounds people built to bolt a client-server solution onto DuckDB has at the very least convinced us that this is something people cared about. We see DuckDB as a universal data wrangling tool. If this means having a client-server protocol in addition to the in-process capabilities – fine. If this ends up unlocking a vast new set of cases in which DuckDB can be useful – excellent! In the end we care deeply about user experience and perhaps less about having the last word on architecture. So we bit the bullet, eventually, finally, and today we are very happy to announce the result:

Introducing the Quack Protocol for DuckDB

What do two (or more) ducks do if they want to talk to each other? They quack! So it is quite natural to call the protocol that two DuckDB instances use to talk to each other “Quack”, too! We had the opportunity to design a database protocol from scratch in 2026 without having to consider any legacy, which is quite a luxury. We were able to learn from the existing protocols, including the more recent Arrow Flight SQL and others.

Before we dive into how Quack works internally, let's see how it works from a user perspective. First, you need two DuckDB instances. That’s right, DuckDB will act both as a client and as a server! The two instances can be on different computers a world apart (or in space) or just two different terminal windows on your laptop. Next, we need to install the Quack extension in both DuckDB instances. For now, Quack lives in the core_nightly repository and is available in DuckDB v1.5.2, the current release version:

DuckDB #1

CALL quack_serve(
    'quack:localhost',
    token = 'super_secret'
);

CREATE TABLE hello AS
    FROM VALUES ('world') v(s);

DuckDB #2

CREATE SECRET (
    TYPE quack,
    TOKEN 'super_secret'
);

ATTACH 'quack:localhost' AS remote;
FROM remote.hello;

This should show the content of the remote table hello, namely world, in DuckDB #2. Witchcraft! We can also copy data from the local instance to the remote one:

DuckDB #1

-- Step two
FROM hello2;

DuckDB #2

-- Step one
CREATE TABLE remote.hello2 AS
    FROM VALUES ('world2') v(s);

Similarly, you should see world2 in the output on DuckDB #1. Obviously those are the most basic examples we can think of. Tables can be much more complex, queries can be much more complex, data volumes can be quite vast (see below). There is also a way to just ship an entire verbatim query to the remote side using the query function, which is better for very complex queries on large datasets and offers more control over what exactly is executed remotely:

DuckDB #1

-- Waiting to serve data

DuckDB #2

FROM remote.query(
    'SELECT s FROM hello'
);

Of course there is much more to see here. Please consult our documentation for more details.

Protocol Design

HTTP-Based

Quack is built straight on the venerable HTTP, the Hypertext Transfer Protocol. From its humble beginnings at CERN, HTTP has become a de facto protocol layer on top of TCP and all the stuff below. The entire stack is optimized to transmit HTTP message streams efficiently. The protocol has surprisingly low overhead if implemented properly. Everyone and their little brother knows how to deal with HTTP in load balancing, authentication, firewalls, intrusion detection etc. It would be rather misguided not to build a database protocol on top of HTTP in 2026. HTTP also allows the DuckDB-Wasm distribution to speak Quack natively! So DuckDB running in a browser can, for example, directly connect to a DuckDB instance running on an EC2 server using Quack.

Request-Response Pattern

Interactions on Quack are always driven by the client in a request-response pattern. Connection requests, which authenticate with a token as seen above, are one example of a Quack message; authentication and authorization are covered in detail below. Subsequent messages are requests to execute a query and return the first part of the result, followed by fetch messages that retrieve the rest of a large result, possibly from multiple threads in parallel.
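
The execute-then-fetch flow can be sketched with a toy in-memory server in Python. Everything here is hypothetical: `ToyServer`, its `execute` and `fetch` methods, and the chunking scheme are illustrative stand-ins rather than the Quack wire protocol, and the client fetches sequentially instead of from multiple threads.

```python
class ToyServer:
    """In-memory stand-in for a server; NOT the Quack wire protocol."""

    def __init__(self, rows, chunk_size):
        self.rows = rows
        self.chunk_size = chunk_size
        self.cursors = {}        # result id -> offset of the next unread row
        self.requests_seen = 0

    def execute(self, query):
        """Handle an execute request: return the first chunk right away."""
        self.requests_seen += 1
        result_id = len(self.cursors)
        first = self.rows[: self.chunk_size]
        self.cursors[result_id] = self.chunk_size
        return result_id, first, len(self.rows) > self.chunk_size

    def fetch(self, result_id):
        """Handle a follow-up fetch request for the next chunk."""
        self.requests_seen += 1
        start = self.cursors[result_id]
        end = start + self.chunk_size
        self.cursors[result_id] = end
        return self.rows[start:end], end < len(self.rows)


def run_query(server, query):
    """Client side: every step is a request that the client initiates."""
    result_id, rows, more = server.execute(query)
    while more:
        chunk, more = server.fetch(result_id)
        rows = rows + chunk
    return rows


small = ToyServer(rows=[1, 2, 3], chunk_size=10)
large = ToyServer(rows=list(range(25)), chunk_size=10)
print(len(run_query(small, "FROM t")), small.requests_seen)
print(len(run_query(large, "FROM t")), large.requests_seen)
```

In this model a small result needs a single request, because the first chunk travels with the execute response, while a larger result adds one fetch request per remaining chunk.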

Serialization

Requests and responses are encoded using the new MIME type application/duckdb. This encoding leverages DuckDB’s internal efficient serialization primitives for complex structures like data types and result sets. We have been using the same primitives for example in our Write-Ahead Log (WAL) files for years, meaning they are fairly well-optimized and battle-tested.

Encryption

While we want Quack to “just work”, we are also wary of the security nightmares of attaching a database directly to the evil internet, as has happened before. This is why Quack will by default generate a random authentication token at server start-up, which then has to be given to the client. In addition, the Quack server will by default only bind to localhost (which can of course be overridden). Quack does not use SSL by default, because it is a bit silly to bring all that infrastructure and add dependencies just for localhost communication. We do not recommend opening up a DuckDB Quack endpoint directly to the Internet. Instead, if you choose to expose Quack to the World Wide Web, we strongly recommend that you put a common HTTP reverse proxy like nginx in front of it and have that proxy terminate SSL (e.g., with Let's Encrypt). The Quack client will assume SSL is enabled for non-local connections, but this can be overridden. We provide a guide for this in our documentation.
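
The client-side default described here (plain HTTP for local connections, SSL assumed otherwise, with an override) can be sketched in Python as follows. The function names and the exact rule for what counts as "local" are assumptions for illustration; the post does not specify how the Quack client makes this decision.

```python
import ipaddress


def is_local(host):
    """True for 'localhost' and for loopback IP addresses."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal, so treat it as a remote hostname


def use_tls(host, override=None):
    """Assume TLS for non-local hosts unless the caller overrides the choice."""
    if override is not None:
        return override
    return not is_local(host)


for host in ["localhost", "127.0.0.1", "db.example.com"]:
    print(host, use_tls(host))
```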

Round-Trips

We have been careful to optimize the number of protocol round trips or request/response pairs for queries. Once connected, a query can be completely handled with a single round trip. This is a critical optimization for latency-sensitive environments. At the same time, we have seriously optimized Quack for efficient bulk response transfer. As far as we know, Quack is currently the fastest way to shove tables through a socket, and millions of rows can be transferred in a few seconds. Below are a few benchmark results.

Authentication and Authorization

Authentication and authorization of database queries are an endless source of joy and complexity. We are likely unable to capture everyone’s use case, certainly not in a first release. The smart thing is therefore not to try. For Quack, we have chosen an auth model that ties into DuckDB’s philosophy of extensibility. There are hundreds of DuckDB extensions out there already. Quack ships with a default authentication method and no authorization restrictions, but both can be overridden by user-supplied code.

As you have seen above, the Quack server generates a default random authentication token on startup. When a client connects, it provides an authentication string. The server side will call an authentication callback. By default, it will compare the client-supplied token with the one that was randomly generated before. But this callback can be changed through configuration! You can bring your own authentication function that, for example, queries an LDAP directory, reads a text file, or just rolls the dice. Up to you. Similarly, the authorization function can be changed. The default authorization function just says “yes” to everything, but you can inspect each query a client attempts to execute, correlate the query to the previously used authentication string etc. Those callbacks can even be plain SQL macros! Please see our documentation for more details.
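
The callback model can be illustrated with a small Python sketch. The `ToyAuth` class, its attribute names, and the read-only policy are hypothetical and only mirror the behavior described above (random default token, replaceable authentication and authorization hooks); the real configuration interface is described in the Quack documentation.

```python
import hmac
import secrets


class ToyAuth:
    """Pluggable auth hooks; names are illustrative, not Quack's API."""

    def __init__(self):
        # Default: a random token generated at start-up.
        self.token = secrets.token_hex(16)
        self.authenticate = self._default_authenticate
        self.authorize = self._default_authorize

    def _default_authenticate(self, supplied):
        # Constant-time comparison avoids leaking token prefixes via timing.
        return hmac.compare_digest(supplied, self.token)

    def _default_authorize(self, auth_string, query):
        return True  # default: say "yes" to everything

    def handle(self, auth_string, query):
        if not self.authenticate(auth_string):
            return "rejected: authentication failed"
        if not self.authorize(auth_string, query):
            return "rejected: not authorized"
        return "ok"


def read_only(auth_string, query):
    """Example replacement policy: only allow queries that read data."""
    return query.lstrip().upper().startswith(("SELECT", "FROM"))


auth = ToyAuth()
print(auth.handle(auth.token, "DROP TABLE hello"))  # default policy allows it

# Swap in the read-only policy without touching the rest of the server.
auth.authorize = read_only
print(auth.handle(auth.token, "DROP TABLE hello"))
print(auth.handle(auth.token, "FROM hello"))
```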

Default Port

By default, a Quack server listens on port 9494, the number 94 being easy to remember as the year Netscape Navigator was released.

Benchmarks

We have set up two benchmarks to showcase the Quack protocol. Those benchmarks were run on AWS virtual machines running Ubuntu on Arm. We picked the m8g.2xlarge instance type, which has 8 vCPUs and 32 GB of RAM and, importantly, “up to 15 Gbps” network bandwidth. We recreated a real-world scenario where client and server are in the same data center, but on different machines. We made sure both instances were in the same “availability zone”. Ping time between the instances averaged around 0.280 ms.
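
As a back-of-the-envelope aid for reading the results, the Python snippet below converts the quoted link speed and ping time into bounds. It uses only the two figures stated above; the helper names are illustrative, and real throughput will be lower than the peak rate of an "up to" link.

```python
LINK_GBPS = 15    # "up to 15 Gbps" from the instance description
PING_MS = 0.280   # average ping (round-trip) time between the instances


def link_gigabytes_per_second(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def round_trip_cost_ms(round_trips, ping_ms=PING_MS):
    """Latency spent purely on waiting for request/response pairs."""
    return round_trips * ping_ms


def min_transfer_seconds(gigabytes, gbps=LINK_GBPS):
    """Lower bound on transfer time if the link ran at its peak rate."""
    return gigabytes / link_gigabytes_per_second(gbps)


print(link_gigabytes_per_second(LINK_GBPS))  # 1.875
print(round_trip_cost_ms(1), round_trip_cost_ms(5))
print(min_transfer_seconds(1.0))
```

Each extra round trip adds about 0.28 ms in this setup, which is why a protocol that needs a single round trip per query matters for small, latency-sensitive queries.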

Bulk Transfer

The first benchmark tests bulk transfer, the case where a fairly large number of rows should be transferred over the database protocol. If you’ve read the paper we linked above, you know that this is a case where traditional database protocols were struggling. We compare Quack with two systems: the widespread PostgreSQL protocol and the newer Arrow Flight SQL protocol. Arrow Flight SQL is provided by the GizmoSQL server, which also uses DuckDB internally. We transfer an increasing number of rows of the TPC-H lineitem table, all the way up to a whopping 60 million rows (7.6 GB in CSV format!) and report the median wall clock time over 5 runs. We expect the modern bulk-oriented protocols to far outclass the PostgreSQL protocol. Here are the results:

Runtimes of bulk transfer operations (lower is better)

Announcing the Program of DuckCon #7 Amsterdam

2026-05-08T00:00:00+00:00

We are excited to announce the program of DuckCon #7 Amsterdam, DuckDB's user conference. The event will be held on Wednesday, June 24, 2026, at the Royal Tropical Institute. The program runs from 15:00 to 20:00 CEST.

See the registration link and the full program on the DuckCon #7 event page.

Delta Grows Up: Writes, Unity Catalog and Time Travel

2026-05-07T00:00:00+00:00

Welcome back! While we here at DuckLabs are typically of the quacking persuasion, we’ve been busy as beavers, shoring up our Delta to prepare for what’s next… Unity Catalog! Let’s look at how DuckDB’s Delta and Unity Catalog extensions have grown up enough to shed the experimental tag, and see what has changed since our last update.

Time to Open the Delta

Before we jump in, let's review briefly. Delta is a foundational open table format and toolset for building and managing data lakes, related to Iceberg and other lakehouse formats. DuckDB supports Delta tables via its Delta Extension.

In that last update we highlighted performance wins, particularly file skipping via filter pushdowns, and metadata caching with snapshot pinning. Now we build on these, and add writes, time travel and Unity Catalog support, plus more performance gains!

Building Up the Delta (Lake): Writes

What fun are reads without writes? The big addition since we last chatted is INSERT support! It works as simply as you expect. Let's assume you have a Delta table ready to go. INSERT away, it's that simple:

-- Schema: (text VARCHAR, code BIGINT)
ATTACH './path/to/my_table' AS my_table (TYPE delta);

INSERT INTO my_table
VALUES ('Question 2', 2), ('The Answer', 42);

-- Bulk insert from a query
INSERT INTO my_table
FROM (SELECT text || ' (copy)', code + 100 FROM my_table);

Also worth calling out: multiple INSERTs within a BEGIN / COMMIT block are stored as a single Delta version, that is, one atomic commit and one new log entry. And, as you'll see later, this works with catalogs too! UPDATE, MERGE, and DELETE are not yet supported, but they are on our future work list.
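
The "one transaction, one version" behavior can be modeled with a toy in-memory log in Python. The `ToyDeltaTable` class is hypothetical and heavily simplified: it is neither the Delta protocol nor the extension's implementation, and only shows how buffered inserts end up in a single log entry.

```python
class ToyDeltaTable:
    """Toy model: each commit appends one entry to an in-memory log."""

    def __init__(self):
        self.log = []          # one list of rows per committed version
        self._pending = None   # rows buffered by an open transaction

    def begin(self):
        self._pending = []

    def insert(self, rows):
        if self._pending is not None:
            self._pending.extend(rows)    # buffered until commit
        else:
            self.log.append(list(rows))   # autocommit: its own version

    def commit(self):
        self.log.append(self._pending)
        self._pending = None


table = ToyDeltaTable()
table.insert([("Question 1", 1)])    # version 0
table.begin()
table.insert([("Question 2", 2)])
table.insert([("The Answer", 42)])
table.commit()                       # version 1: both inserts, one log entry
print(len(table.log))
```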

Time Travel

DuckDB's Delta extension now supports time travel. Any Delta table can be queried as of a particular version. DuckDB supports binding to a specific VERSION either at ATTACH time, or as part of an individual query.

Let's assume that we have built up the above my_table incrementally, with versions 0, 1, and 2 containing:

Version	Contents
0	`('Question 1', 1)`
1	+ `('Question 2', 2)`, `('The Answer', 42)`
2	+ `('Question 1 (copy)', 101)`, `('Question 2 (copy)', 102)`, `('The Answer (copy)', 142)`

You can attach normally and query arbitrary versions inline as needed. The most flexible approach:

ATTACH './path/to/my_table' AS my_table (TYPE delta);

SELECT count() FROM my_table AT (VERSION => 0); -- 1  (Question 1 only)
SELECT count() FROM my_table AT (VERSION => 1); -- 3  (after 1st insert)
SELECT count() FROM my_table;                   -- 6  (latest)

Or attach, pinned to a specific version, which is useful when you want a stable reference that never changes, regardless of future writes:

-- Always v1, no matter what gets written later
ATTACH './path/to/my_table' AS my_table_v1
    (TYPE delta, VERSION 1);

SELECT count() FROM my_table_v1;      -- → 3

-- Locked to whatever was latest at attach time
ATTACH './path/to/my_table' AS my_table_pinned
    (TYPE delta, PIN_SNAPSHOT);

SELECT count() FROM my_table_pinned;  -- → 6

Growing Up: No Longer a Kit 🦫

The DuckDB Delta extension is no longer a kit and has grown up quite a bit since a year ago. As you just saw, we added writes and time travel. These features open the door to something bigger: Unity Catalog coordination.

Unity Catalog Support atop the Delta

Data lake systems excel at scale. As your data assets multiply, you need a way to discover what exists, control who can access it, audit how it's being used, and coordinate writes across multiple engines. Data catalogs have evolved to address exactly these needs, sitting above the storage layer to manage the metadata, governance, and transactional bookkeeping that make large-scale data lakes effective. The OSS Unity Catalog team has a good overview if you'd like to go deeper; the concepts apply broadly regardless of which catalog you use.

What is Unity Catalog?

Unity Catalog (UC for short) is an open standard for governing data and AI assets, including tables, volumes, models, and functions, across engines and clouds. It turns your data lake into a lakehouse, and gives you a single place to discover, audit, and control access to your data, regardless of what's reading or writing it. DuckDB's Unity Catalog extension is built upon the Unity Catalog Open API. There are two main implementations: OSS Unity Catalog, which you can self-host (and Docker-ify in minutes), and Databricks Unity Catalog, the managed version. Like Delta, the DuckDB Unity Catalog extension has shed its experimental tag. Let's put both to work.

Getting Started: OSS Unity Catalog

We've set up a Docker image playground bundling OSS Unity Catalog and DuckDB together, so you can follow along with an easy Docker build-and-run setup. Grab it if you would like to walk through the samples or experiment on your own. (If you'd prefer to run OSS UC directly, the official image is the upstream of our playground.)

Let's start with Docker. Assuming you now have the image running, it already executed (roughly) the following steps in the build phase to prepare our playground:

# Create a schema
/home/unitycatalog/bin/uc schema create --catalog unity --name my_schema

# Create the "pets" table
/home/unitycatalog/bin/uc table create \
    --full_name        unity.my_schema.pets \
    --columns          "uuid STRING, name STRING, age INT, adopted BOOLEAN" \
    --format           DELTA \
    --storage_location file:///home/unitycatalog/etc/data/external/unity/my_schema/tables/pets

After that, we can test things out from DuckDB. To see for yourself, docker exec -it duckdb-playground duckdb will give you a DuckDB shell inside the container.

Before doing anything meaningful we'll need to set up a DuckDB secret. In this example the TOKEN value is ignored by the local OSS UC server, but the field is required. Create the secret, then you can immediately attach and read:

LOAD unity_catalog;

CREATE SECRET (
    TYPE     unity_catalog,
    TOKEN    'demo-ignored-token',
    ENDPOINT 'http://unitycatalog:8080'
);

ATTACH 'unity' AS my_catalog
    (TYPE unity_catalog, DEFAULT_SCHEMA 'my_schema');

SELECT name, age, adopted FROM my_catalog.pets ORDER BY name;
-- returns a single 'Seed' row

That's it! You just queried Unity-Catalog-managed, Delta-stored pets data.

Tip: Want to experiment with this on Databricks Unity Catalog? Setting up a Databricks Unity Catalog is out of scope for this blog, but if you have one ready to go, you will need these to get bootstrapped with DuckDB:

set ENDPOINT to your Workspace URL (typically: https://{instance}.cloud.databricks.com/)

set TOKEN appropriately (e.g. create a PAT with unity-catalog scope); getting the correct token depends entirely on your setup. To dive in, see Access Control in Unity Catalog.

With these in hand you can use DuckDB directly, or access the extensive UC Open API directly.

Next, let's complete the circle and write some data into our pets table:

INSERT INTO my_catalog.pets
    (uuid, name, age, adopted)
SELECT
    gen_random_uuid()::VARCHAR,
    ['Luna', 'Milo', 'Bella', 'Charlie', 'Max', 'Lucy', 'Cooper',
     'Daisy', 'Buddy', 'Lily', 'Rocky', 'Molly', 'Bear', 'Lola',
     'Duke', 'Sadie', 'Tucker', 'Zoe', 'Oliver', 'Stella'
    ][1 + (random() * 19)::INT],
    (1 + (random() * 14)::INT)::INT,
    random() > 0.5
FROM range(10);

SELECT count() FROM my_catalog.pets;

You can also easily find and see the created files; check the local data directory (also bind-mounted in Docker), and you should find both pre-existing files, and a new Parquet file containing the inserted rows. In my case it looks like this:

tree data

data
└── external
    └── unity
        └── my_schema
            └── tables
                └── pets
                    ├── _delta_log
                    │   ├── 00000000000000000000.json
                    │   ├── 00000000000000000001.json
                    │   └── 00000000000000000002.json
                    ├── duckdb-19cb47ae-9f35-4126-b67d-c94fcade68cc.parquet
                    └── duckdb-e3bb0336-f16a-4d21-9495-0fbf55c6cba8.parquet

7 directories, 5 files
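The file names under _delta_log follow the Delta protocol's convention of a 20-digit, zero-padded version number with a .json suffix. A small Python sketch of that mapping is shown below; the helper names `log_file_name` and `version_from_name` are made up for this example.

```python
def log_file_name(version: int) -> str:
    """Return the commit file name for a given Delta table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    # 20 digits, left-padded with zeros, followed by the .json suffix.
    return f"{version:020d}.json"


def version_from_name(name: str) -> int:
    """Parse the version number back out of a commit file name."""
    stem, _, ext = name.partition(".")
    if ext != "json" or len(stem) != 20 or not stem.isdigit():
        raise ValueError(f"not a commit file name: {name!r}")
    return int(stem)


for v in range(3):
    print(log_file_name(v))
```

The three printed names correspond to the three commit files in the listing above, one per table version.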

Catalog Managed Tables

With the basics out of the way, we can talk about Catalog Managed Tables (CMT). This is available today in both OSS and Databricks Unity Catalog.

The big feature in CMT is Catalog Commits, which enables coordinated concurrent writes. Without Catalog Commits, DuckDB writes go directly to the Delta log. While modern storage backends prevent outright lost writes, UC is left out of the loop entirely. Its metadata, audit trail, and statistics fall out of sync with the actual table state, and other engines querying through UC may see a stale view.

Catalog Commits (CC) fixes this: every write is staged and registered through UC before it becomes visible. UC acts as the commit arbiter, preserving the first writer's commit and sending a conflict error to later writers. This matters wherever multiple writers are appending simultaneously, e.g., parallel ETL pipelines, partitioned bulk loads, and concurrent analytical inserts. Each writer works independently; UC ensures exactly one commit lands per version and keeps its own catalog in sync with every one of them.
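The arbitration idea can be sketched as a compare-and-set on the version number. The Python example below is a toy model only: `ToyCatalog`, `register_commit`, and `race` are hypothetical names, and nothing here reflects the actual Unity Catalog API or wire protocol. It assumes that every writer targets the same next version and that the catalog accepts only the first registration for it.

```python
import threading


class ConflictError(Exception):
    """Raised when another writer already claimed the requested version."""


class ToyCatalog:
    """Hypothetical stand-in for a commit arbiter (not the Unity Catalog API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.latest_version = 0

    def register_commit(self, version):
        # Compare-and-set: only the next version can be claimed, and only once.
        with self._lock:
            if version != self.latest_version + 1:
                raise ConflictError(f"version {version} is already taken")
            self.latest_version = version


def race(num_writers):
    """Let `num_writers` threads race for the same version; return outcomes."""
    catalog = ToyCatalog()
    barrier = threading.Barrier(num_writers)
    outcomes = {}

    def writer(worker_id):
        target = catalog.latest_version + 1  # every writer sees the same base
        barrier.wait()                       # release all writers together
        try:
            catalog.register_commit(target)
            outcomes[worker_id] = "OK"
        except ConflictError:
            outcomes[worker_id] = "CONFLICT"

    threads = [threading.Thread(target=writer, args=(i,)) for i in range(num_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


results = race(8)
print(sorted(results.values()))
```

Which worker wins depends on thread scheduling, but in this model exactly one writer gets "OK" per contested version and all others see a conflict.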

Consistent reads and audit history are already inherent to Delta and UC respectively. CC doesn't add functionality; it just ensures UC stays in sync with every commit. Catalog Commits also coordinate per table; there is no cross-table atomicity. If you write to two tables in the same BEGIN / COMMIT block, each table commits independently.

To opt a table into CMT (and therefore CC), set the delta.feature.catalogManaged table property at creation time. This is done via Spark or the UC CLI, as DuckDB's Unity Catalog extension does not yet support CREATE TABLE DDL:

-- Via Spark
CREATE TABLE my_catalog.my_schema.concurrent_tbl (
    uuid    STRING  NOT NULL,
    name    STRING  NOT NULL,
    age     INT     NOT NULL,
    adopted BOOLEAN NOT NULL
)
TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported');

Once enabled, DuckDB writes go through UC's commit staging automatically — the INSERT syntax is unchanged:

INSERT INTO my_catalog.my_schema.concurrent_tbl
    (uuid, name, age, adopted)
VALUES (gen_random_uuid()::VARCHAR, 'Luna', 3, true);

Now each DuckDB writer stages its commit to a _staged_commits/ directory and registers it with UC before that data becomes visible. UC arbitrates: exactly one writer wins each version in a race; the others get a conflict error and can retry. Next, let's look at how UC handles the race.

Deeper Dive

Racing Commits

To see how Catalog Commits arbitrates, we launched 20 concurrent DuckDB writers, 8 at a time, all inserting into the same managed table:

seq 1 20 | xargs -P 8 -I{} scripts/unity/05-cmc/write-single {}

[worker 6] OK - inserted 5 rows
[worker 5] CONFLICT - another writer won this version, retry needed
[worker 2] CONFLICT - another writer won this version, retry needed
[worker 8] CONFLICT - another writer won this version, retry needed
[worker 7] CONFLICT - another writer won this version, retry needed
[worker 3] CONFLICT - another writer won this version, retry needed
[worker 1] OK - inserted 5 rows
[worker 4] CONFLICT - another writer won this version, retry needed
[worker 16] OK - inserted 5 rows
[worker 13] CONFLICT - another writer won this version, retry needed
[worker 15] CONFLICT - another writer won this version, retry needed
[worker 11] CONFLICT - another writer won this version, retry needed
[worker 14] CONFLICT - another writer won this version, retry needed
[worker 12] OK - inserted 5 rows
[worker 9] CONFLICT - another writer won this version, retry needed
[worker 10] CONFLICT - another writer won this version, retry needed
[worker 17] CONFLICT - another writer won this version, retry needed
[worker 20] CONFLICT - another writer won this version, retry needed
[worker 18] OK - inserted 5 rows
[worker 19] CONFLICT - another writer won this version, retry needed

Here we see 5 successful writes, and 15 signaled conflicts. Let's confirm in the data:

SELECT count() AS total_rows FROM my_catalog.my_schema.concurrent_tbl;

┌────────────┐
│ total_rows │
│   int64    │
├────────────┤
│         35 │
└────────────┘

10 seeded rows + (5 writes × 5 rows each) = 35 total rows. (In a real workload, you would retry the conflicted writes and land all 20 inserts.) Catalog Managed Table commits gave us a clear signal and clear semantics during highly concurrent writes, as promised.
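The retry remark can be made concrete with a small, deterministic Python model. It is a sketch under strong assumptions: all pending writers race for the same next version in every round, exactly one wins, and the losers simply try again. The function name `run_with_retries` and its parameters are hypothetical, and the real demo scripts are not reproduced here.

```python
def run_with_retries(num_writers=20, rows_per_write=5, seeded_rows=10):
    """Toy model of writers retrying after conflicts until all inserts land.

    Returns (total_rows, commits, attempts).
    """
    total_rows = seeded_rows
    commits = 0
    attempts = 0
    pending = list(range(1, num_writers + 1))
    while pending:
        # Every pending writer attempts a commit in this round.
        attempts += len(pending)
        # Exactly one attempt succeeds. Picking the first is arbitrary;
        # in practice the winner is whoever registers its commit first.
        pending.pop(0)
        commits += 1
        total_rows += rows_per_write
    return total_rows, commits, attempts


total_rows, commits, attempts = run_with_retries()
print(total_rows, commits, attempts)
```

With the defaults taken from the experiment above (10 seeded rows, 20 writers, 5 rows each), the model ends at 10 + 20 × 5 = 110 rows after 20 commits. The attempt count is a worst case for this model, since real writers rarely all collide in every round.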

Travel in Time, Faster

DuckDB's Delta snapshot loading is getting a speed boost: snapshots will load incrementally when possible, making time travel across nearby versions significantly faster. Consider a table where some initial queries are made against version 16:

ATTACH './path/to/table' AS t (TYPE delta, VERSION 16);
SELECT count() FROM t;  -- → 17

And now some work needs to be done against version 20. If we peek under the hood (warning: sneaky code follows), we'll see that none of the previously loaded Delta log metadata files were re-loaded:

SET enable_logging = true;
SET delta_kernel_logging = true;
CALL enable_logging('DeltaKernel', level = 'trace');

ATTACH './path/to/table' AS t (TYPE delta, VERSION 20);
SELECT count() FROM t;  -- → 21

-- Delta kernel logs 'Provisionally selecting ... .json'
-- whenever it reads a log file from scratch. We search for any such
-- message referencing a zero-padded log filename; zero matches
-- means the cached v16 snapshot was reused rather than rebuilt.
SELECT count() FROM duckdb_logs
WHERE type = 'DeltaKernel'
  AND message LIKE '%00000000000000000%.json%';
-- → 0

In Delta lakes with thousands or millions of snapshots, incremental loading provides a big win when working across multiple versions.
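To illustrate why incremental loading helps, here is a simplified Python model that counts how many commit files must be replayed to reach a version. It is only a sketch: `ToySnapshotCache` is a hypothetical class, checkpoints are ignored, and the assumption of one row added per version is chosen so that the counts line up with the example above (17 rows at version 16, 21 rows at version 20). It does not describe the Delta kernel's actual algorithm.

```python
class ToySnapshotCache:
    """Toy snapshot loader that reuses previously replayed commits."""

    def __init__(self, commits):
        self.commits = commits  # commits[v] = rows added by version v
        self.version = -1       # nothing loaded yet
        self.row_count = 0
        self.files_read = 0     # how many commit files were replayed so far

    def load(self, version):
        if version < self.version:
            # Going backwards: this toy model rebuilds from scratch.
            self.version, self.row_count = -1, 0
        # Replay only the commits after the cached version.
        for v in range(self.version + 1, version + 1):
            self.row_count += len(self.commits[v])
            self.files_read += 1
        self.version = version
        return self.row_count


# Assumption: versions 0..20, each adding exactly one row.
commits = [[("row", v)] for v in range(21)]

cache = ToySnapshotCache(commits)
rows_v16 = cache.load(16)
reads_after_v16 = cache.files_read
rows_v20 = cache.load(20)
extra_reads = cache.files_read - reads_after_v16
print(rows_v16, rows_v20, extra_reads)
```

In this model, moving from version 16 to version 20 replays 4 commit files, whereas a fresh load of version 20 would replay all 21.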

At the time of writing, incremental snapshot loading is supported in nightly builds. You can install it using:
FORCE INSTALL delta FROM core_nightly;
Please be aware that nightly builds are not intended for production use. The implementation will be included in the next stable release, v1.5.3.

Conclusions

A year ago, DuckDB could read Delta tables. Today it can insert data into them, travel through their history, and query and write through a governed catalog — without the experimental caveat on any of it. The combination of Delta for open storage, Unity Catalog for governance and coordination, and DuckDB for fast analytical queries is a stack you can build on.

There's more to come: DDL support to create and manage tables directly, delete/update/merge support, and multi-table atomicity for writes that span more than one table. In the meantime, the playground image linked above has everything you need to kick the tires. As always, feedback and bug reports are welcome on GitHub.

The DuckLake Spec Is so Simple, Even a Clanker Can Build One for Dataframes

2026-05-04T00:00:00+00:00

We are showcasing the simplicity of DuckLake's v1.0 specification by developing a dataframe reader/writer with AI.

Announcing DuckDB 1.5.2

2026-04-13T00:00:00+00:00

In this blog post, we highlight a few important fixes in DuckDB v1.5.2, the second patch release in DuckDB's v1.5 line. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

Data Lake and Lakehouse Formats

DuckLake

We are proud to release a stable, production-ready lakehouse specification and its reference implementation in DuckDB.

We published a detailed blog post on the DuckLake site but here's a quick summary: DuckLake v1.0 ships dozens of bugfixes and guarantees backward-compatibility. Additionally, it has a number of cool features: data inlining, sorted tables, bucket partitioning, and deletion buffers as Iceberg-compatible Puffin files. More on this in the announcement blog post.

Iceberg

The Iceberg extension ships a number of new features. It now supports the following:

GEOMETRY type
ALTER TABLE statement
Updates and deletes from partitioned tables
Truncate and bucket partitions

Last week, DuckLabs engineer Tom Ebergen gave a talk at the Iceberg Summit titled “Building DuckDB-Iceberg: Exploring the Iceberg Ecosystem”, where he shared his experiences with Iceberg.

Preliminary Jepsen Test Results

To make DuckDB as robust as possible, we started a collaboration with Jepsen. The preliminary test suite is available at https://github.com/duckdb/duckdb-jepsen.

The test suite has uncovered a bug that was triggered by INSERT INTO statements that perform conflict resolution on a primary key, and we have already shipped a fix in this release.

New Online Shell

The online WebAssembly shell at shell.duckdb.org received a complete overhaul. A highlight of the new shell is the ability to store and list files using the .files dot command and its variants.

Using the file storage feature, you can turn your browser session into a workbench: you can drag-and-drop files from your local file system to upload them, create new ones using DuckDB's COPY ... TO statement and download the results. For more information on this feature, use the .help command.

The new shell comes with a few built-in datasets: you're welcome to try them out and experiment. Your old links to shell.duckdb.org should still work but if you experience any problems, please submit an issue in the duckdb-web repository.

Benchmarks

We benchmarked DuckDB using the Linux v7 kernel on an r8gd.8xlarge instance with 32 vCPUs, 256 GiB RAM, and an NVMe SSD. We first ran the scale factor 300 test on Ubuntu 24.04 LTS, then upgraded to Ubuntu 26.04 beta. The composite TPC-H score showed a ~10% improvement, jumping from 778,041 to 854,676 when measured with TPC-H's QphH@Size metric.
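The quoted improvement can be checked directly from the two scores. The short Python snippet below computes the relative change; the variable names are chosen for this example only.

```python
score_before = 778_041  # composite score on Ubuntu 24.04 LTS
score_after = 854_676   # composite score on Ubuntu 26.04 beta

# Relative improvement, expressed as a percentage of the earlier score.
improvement_pct = (score_after - score_before) / score_before * 100
print(f"{improvement_pct:.2f}%")
```

The result rounds to 9.85%, which matches the roughly 10% figure quoted above.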

Coming Up

This quarter, we have quite a few exciting events lined up.

DuckCon #7. On June 24, we'll host our next user conference, DuckCon #7, in Amsterdam's beautiful Royal Tropical Institute.

AI Council Talk. On May 12, DuckDB co-creator Hannes Mühleisen will give a talk at AI Council 2026 titled “Super-Secret Next Big Thing for DuckDB”. Well, at this point, we cannot tell you more than that he will present the super-secret next big thing for DuckDB. But, if you cannot make it, don't worry: we'll publish the presentation afterwards.

Ubuntu Summit Talk. We already talked about performance on Ubuntu. In late May, Gábor Szárnyas of DuckLabs will give a talk titled “DuckDB: Not Quack Science” at the Ubuntu Summit.

Conclusion

This post is a short summary of the changes in v1.5.2. As usual, you can find the full release notes on GitHub.

DuckLake v1.0: The Lakehouse Format Built on SQL Reaches Production-Readiness

2026-04-13T00:00:00+00:00

We released the DuckLake v1.0 standard!

Data Inlining in DuckLake: Unlocking Streaming for Data Lakes

2026-04-02T00:00:00+00:00

DuckLake’s data inlining stores small updates directly in the catalog, eliminating the “small files problem” and making continuous streaming into data lakes practical. Our benchmark shows 926× faster queries and 105× faster ingestion when compared to Iceberg.

DuckDB Now Speaks Dutch!

2026-04-01T00:00:00+00:00

Historically speaking, SQL queries have always been formulated in English. The initial name of the language was even Structured English Query Language (SEQUEL), before it became SQL. Now, what if the Dutch hadn't traded away New Amsterdam (present-day New York)? Would we all have been writing SQL in Dutch instead?

Well, wonder no longer. Today we're releasing EendDB: a DuckDB extension that brings you the Gestructureerde Zoektaal, or GZT for short.

Want joins? We've got SAMENVOEGEN. Aggregates? GROEP PER. Window functions? Those work too — though you'll have to look up the Dutch keywords in the repository yourself.

You can try it out right now in DuckDB v1.5.1:

INSTALL eenddb FROM community;
LOAD eenddb;
CALL enable_dutch_parser();

MAAK TABEL eend (
    id        GEHEEL_GETAL,
    naam      TEKST,
    leeftijd  GEHEEL_GETAL,
    gewicht   KOMMAGETAL,
    soort     TEKST
);

TOEVOEGEN AAN eend WAARDEN
    (1, 'Donald',  29, 1.2, 'Wilde eend'),
    (2, 'Daffy',   35, 1.5, 'Zwarte eend'),
    (3, 'Daisy',   27, 1.1, 'Wilde eend'),
    (4, 'Scrooge', 75, 1.8, 'Wilde eend');

SELECTEER *
VAN eend
WAARBIJ gewicht > 1.2 EN naam ZOALS '%D%'
VOLGORDE PER leeftijd;

┌───────┬─────────┬──────────┬─────────┬─────────────┐
│  id   │  naam   │ leeftijd │ gewicht │    soort    │
│ int32 │ varchar │  int32   │  float  │   varchar   │
├───────┼─────────┼──────────┼─────────┼─────────────┤
│     2 │ Daffy   │       35 │     1.5 │ Zwarte eend │
└───────┴─────────┴──────────┴─────────┴─────────────┘

Of course, no query language is complete without joins and aggregates. Let's create a second table and count the ducks per soort:

MAAK TABEL soorten (soort TEKST, leefgebied TEKST);

TOEVOEGEN AAN soorten WAARDEN
    ('Wilde eend',  'Meren en rivieren'),
    ('Zwarte eend', 'Kustgebieden');

SELECTEER s.leefgebied, count(*) ALS aantal_eenden
VAN eend ALS e
LINKS SAMENVOEGEN soorten ALS s OP e.soort = s.soort
GROEP PER s.leefgebied
VOLGORDE PER aantal_eenden AFLOPEND;

┌───────────────────┬───────────────┐
│    leefgebied     │ aantal_eenden │
│      varchar      │     int64     │
├───────────────────┼───────────────┤
│ Meren en rivieren │             3 │
│ Kustgebieden      │             1 │
└───────────────────┴───────────────┘

After we are done playing around, we obviously have to clean up after ourselves. Rather than DROP a table, in Dutch we like to throw it away (“weggooien”):

GOOI_WEG TABEL eend;
GOOI_WEG TABEL soorten;

Under the hood, the parser is using DuckDB's new experimental parser, based on Parsing Expression Grammar.

For more examples, check out the repository on GitHub.
