DuckLake Ruby

🦆 DuckLake for Ruby

Run your own data lake with a SQL database and file/object storage

DuckLake::Client.new(
  catalog_url: "postgres://user:pass@host:5432/dbname",
  storage_url: "s3://my-bucket/"
)

Learn more

Note: DuckLake is not considered production-ready at the moment

Installation

First, install libduckdb. For Homebrew, use:

brew install duckdb

Then add this line to your application’s Gemfile:

gem "ducklake"

Getting Started

Create a client - this one stores everything locally

ducklake =
  DuckLake::Client.new(
    catalog_url: "sqlite:///ducklake.sqlite",
    storage_url: "data_files/",
    create_if_not_exists: true
  )

Create a table

ducklake.sql("CREATE TABLE events (id bigint, name text)")

Load data from a file

ducklake.sql("COPY events FROM 'data.csv'")

Confirm a new Parquet file was added to the data lake

ducklake.list_files("events")

Query the data

ducklake.sql("SELECT COUNT(*) FROM events").to_a

Catalog Database

Catalog information can be stored in:

Postgres: postgres://user@pass@host:5432/dbname
SQLite: sqlite:///path/to/dbname.sqlite
DuckDB: duckdb:///path/to/dbname.duckdb

Note: MySQL and MariaDB are not currently supported due to duckdb/ducklake#70 and duckdb/ducklake#210

There are two ways to set up the schema:

Run this script
Configure the client to do it

DuckLake::Client.new(create_if_not_exists: true, ...)

Data Storage

Data can be stored in:

Local files: data_files/
Amazon S3: s3://my-bucket/path/
Other providers: todo

Amazon S3

Credentials are detected in the standard AWS SDK locations, or you can pass them manually

DuckLake::Client.new(
  storage_options: {
    aws_access_key_id: "...",
    aws_secret_access_key: "...",
    region: "us-east-1"
  },
  ...
)

IAM permissions

Read: s3::ListBucket, s3::GetObject
Write: s3::ListBucket, s3::PutObject
Maintenance: s3::ListBucket, s3::GetObject, s3::PutObject, s3::DeleteObject

Operations

Create an empty table

ducklake.sql("CREATE TABLE events (id bigint, name text)")

Or a table from a file

ducklake.sql("CREATE TABLE events AS FROM 'data.csv'")

Load data from a file

ducklake.sql("COPY events FROM 'data.csv'")

You can also load data directly from other data sources

ducklake.attach("blog", "postgres://localhost:5432/blog")
ducklake.sql("INSERT INTO events SELECT * FROM blog.ahoy_events")

Or register existing data files

ducklake.add_data_files("events", "data.parquet")

Note: This transfers ownership to the data lake, so the file may be deleted as part of maintenance

Update data

ducklake.sql("UPDATE events SET name = ? WHERE id = ?", ["Test", 1])

Delete data

ducklake.sql("DELETE * FROM events WHERE id = ?", [1])

Transactions

Run multiple statements in a transaction

ducklake.transaction do
  # ...
end

Raise DuckLake::Rollback to rollback

Add commit info to a transaction

ducklake.transaction(commit_message: "...", commit_author: "...") do
  # ...
end

This will appear in snapshots

Schema Changes

Update the schema

ducklake.sql("ALTER TABLE events ADD COLUMN active BOOLEAN")

Set or remove a partitioning key

ducklake.sql("ALTER TABLE events SET PARTITIONED BY (name)")
# or
ducklake.sql("ALTER TABLE events RESET PARTITIONED BY")

Views

Create a view

ducklake.sql("CREATE VIEW events_view AS SELECT * FROM events")

Drop a view

ducklake.sql("DROP VIEW events_view")

Snapshots

Get snapshots

ducklake.snapshots

Get the current snapshot

ducklake.current_snapshot

Query the data at a specific snapshot version or time

ducklake.sql("SELECT * FROM events AT (VERSION => ?)", [3])
# or
ducklake.sql("SELECT * FROM events AT (TIMESTAMP => ?)", [Date.today - 7])

You can also specify a snapshot when creating the client

DuckLake::Client.new(snapshot_version: 3, ...)
# or
DuckLake::Client.new(snapshot_time: Date.today - 7, ...)

Maintenance

Merge files

ducklake.merge_adjacent_files

Expire snapshots

ducklake.expire_snapshots(older_than: Date.today - 7)

Clean up old files

ducklake.cleanup_old_files(older_than: Date.today - 7)

Rewrite files with a certain percentage of deleted rows

ducklake.rewrite_data_files(delete_threshold: 0.5)

Configuration

Get options

ducklake.options

Set an option globally

ducklake.set_option("parquet_compression", "zstd")

Or for a specific table

ducklake.set_option("parquet_compression", "zstd", table_name: "events")

Read-Only Mode

Note: This feature is experimental and does not prevent the DuckDB engine from writing files via sql

Attach the catalog in read-only mode

DuckLake::Client.new(read_only: true, ...)

Use read-only credentials for catalog database and storage provider and disable external access

You should also consider disabling community extensions

ducklake.sql("SET allow_community_extensions = false")

And locking the configuration

ducklake.sql("SET lock_configuration = true")

External Access

Restrict external access to the DuckDB engine

ducklake.disable_external_access

Allow specific directories and paths

ducklake.disable_external_access(
  allowed_directories: ["/path/to/directory"],
  allowed_paths: ["/path/to/file.txt"]
)

The storage URL is automatically included in allowed_directories

SQL Safety

Use parameterized queries when possible

ducklake.sql("SELECT * FROM events WHERE id = ?", [1])

For places that do not support parameters, use quote or quote_identifier

quoted_table = ducklake.quote_identifier("events")
quoted_file = ducklake.quote("path/to/data.csv")
ducklake.sql("COPY #{quoted_table} FROM #{quoted_file}")

Encryption

Note: This feature is experimental and must be set when creating the catalog

Encrypt Parquet files

DuckLake::Client.new(encryption: true, ...)

Polars

Note: This feature is experimental and does not currently work on tables with schema changes

Query the data with Ruby Polars

ducklake.polars("events")

Specify a snapshot

ducklake.polars("events", snapshot_version: 3)
# or
ducklake.polars("events", snapshot_time: Date.today - 7)

Reference

Get table info

ducklake.table_info

Get column info

ducklake.column_info("events")

Drop a table

ducklake.drop_table("events")
# or
ducklake.drop_table("events", if_exists: true)

List files

ducklake.list_files("events")

List files at a specific snapshot version or time

ducklake.list_files("events", snapshot_version: 3)
# or
ducklake.list_files("events", snapshot_time: Date.today - 7)

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

Report bugs
Fix bugs and submit pull requests
Write, clarify, or fix documentation
Suggest or add new features

To get started with development:

git clone https://github.com/ankane/ducklake-ruby.git
cd ducklake-ruby
bundle install

# Postgres
createdb ducklake_ruby_test
createdb ducklake_ruby_test2
bundle exec rake test:postgres

# MySQL and MariaDB
mysqladmin create ducklake_ruby_test
mysqladmin create ducklake_ruby_test2
bundle exec rake test:mysql

# SQLite
bundle exec rake test:sqlite

# DuckDB
bundle exec rake test:duckdb

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.github/workflows		.github/workflows
lib		lib
test		test
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Gemfile		Gemfile
LICENSE.txt		LICENSE.txt
README.md		README.md
Rakefile		Rakefile
ducklake.gemspec		ducklake.gemspec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DuckLake Ruby

Installation

Getting Started

Catalog Database

Data Storage

Amazon S3

Operations

Transactions

Schema Changes

Views

Snapshots

Maintenance

Configuration

Read-Only Mode

External Access

SQL Safety

Encryption

Polars

Reference

History

Contributing

About

Uh oh!

Releases

Packages

Languages

License

ankane/ducklake-ruby

Folders and files

Latest commit

History

Repository files navigation

DuckLake Ruby

Installation

Getting Started

Catalog Database

Data Storage

Amazon S3

Operations

Transactions

Schema Changes

Views

Snapshots

Maintenance

Configuration

Read-Only Mode

External Access

SQL Safety

Encryption

Polars

Reference

History

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages