🦆 DuckLake for Ruby
Run your own data lake with a SQL database and file/object storage
DuckLake::Client.new(
catalog_url: "postgres://user:pass@host:5432/dbname",
storage_url: "s3://my-bucket/"
)Note: DuckLake is not considered production-ready at the moment
First, install libduckdb. For Homebrew, use:
brew install duckdbThen add this line to your application’s Gemfile:
gem "ducklake"Create a client - this one stores everything locally
ducklake =
DuckLake::Client.new(
catalog_url: "sqlite:///ducklake.sqlite",
storage_url: "data_files/",
create_if_not_exists: true
)Create a table
ducklake.sql("CREATE TABLE events (id bigint, name text)")Load data from a file
ducklake.sql("COPY events FROM 'data.csv'")Confirm a new Parquet file was added to the data lake
ducklake.list_files("events")Query the data
ducklake.sql("SELECT COUNT(*) FROM events").to_aCatalog information can be stored in:
- Postgres:
postgres://user@pass@host:5432/dbname - SQLite:
sqlite:///path/to/dbname.sqlite - DuckDB:
duckdb:///path/to/dbname.duckdb
Note: MySQL and MariaDB are not currently supported due to duckdb/ducklake#70 and duckdb/ducklake#210
There are two ways to set up the schema:
- Run this script
- Configure the client to do it
DuckLake::Client.new(create_if_not_exists: true, ...)Data can be stored in:
- Local files:
data_files/ - Amazon S3:
s3://my-bucket/path/ - Other providers: todo
Credentials are detected in the standard AWS SDK locations, or you can pass them manually
DuckLake::Client.new(
storage_options: {
aws_access_key_id: "...",
aws_secret_access_key: "...",
region: "us-east-1"
},
...
)IAM permissions
- Read:
s3::ListBucket,s3::GetObject - Write:
s3::ListBucket,s3::PutObject - Maintenance:
s3::ListBucket,s3::GetObject,s3::PutObject,s3::DeleteObject
Create an empty table
ducklake.sql("CREATE TABLE events (id bigint, name text)")Or a table from a file
ducklake.sql("CREATE TABLE events AS FROM 'data.csv'")Load data from a file
ducklake.sql("COPY events FROM 'data.csv'")You can also load data directly from other data sources
ducklake.attach("blog", "postgres://localhost:5432/blog")
ducklake.sql("INSERT INTO events SELECT * FROM blog.ahoy_events")Or register existing data files
ducklake.add_data_files("events", "data.parquet")Note: This transfers ownership to the data lake, so the file may be deleted as part of maintenance
Update data
ducklake.sql("UPDATE events SET name = ? WHERE id = ?", ["Test", 1])Delete data
ducklake.sql("DELETE * FROM events WHERE id = ?", [1])Run multiple statements in a transaction
ducklake.transaction do
# ...
endRaise DuckLake::Rollback to rollback
Add commit info to a transaction
ducklake.transaction(commit_message: "...", commit_author: "...") do
# ...
endThis will appear in snapshots
Update the schema
ducklake.sql("ALTER TABLE events ADD COLUMN active BOOLEAN")Set or remove a partitioning key
ducklake.sql("ALTER TABLE events SET PARTITIONED BY (name)")
# or
ducklake.sql("ALTER TABLE events RESET PARTITIONED BY")Create a view
ducklake.sql("CREATE VIEW events_view AS SELECT * FROM events")Drop a view
ducklake.sql("DROP VIEW events_view")Get snapshots
ducklake.snapshotsGet the current snapshot
ducklake.current_snapshotQuery the data at a specific snapshot version or time
ducklake.sql("SELECT * FROM events AT (VERSION => ?)", [3])
# or
ducklake.sql("SELECT * FROM events AT (TIMESTAMP => ?)", [Date.today - 7])You can also specify a snapshot when creating the client
DuckLake::Client.new(snapshot_version: 3, ...)
# or
DuckLake::Client.new(snapshot_time: Date.today - 7, ...)Merge files
ducklake.merge_adjacent_filesExpire snapshots
ducklake.expire_snapshots(older_than: Date.today - 7)Clean up old files
ducklake.cleanup_old_files(older_than: Date.today - 7)Rewrite files with a certain percentage of deleted rows
ducklake.rewrite_data_files(delete_threshold: 0.5)Get options
ducklake.optionsSet an option globally
ducklake.set_option("parquet_compression", "zstd")Or for a specific table
ducklake.set_option("parquet_compression", "zstd", table_name: "events")Note: This feature is experimental and does not prevent the DuckDB engine from writing files via sql
Attach the catalog in read-only mode
DuckLake::Client.new(read_only: true, ...)Use read-only credentials for catalog database and storage provider and disable external access
You should also consider disabling community extensions
ducklake.sql("SET allow_community_extensions = false")ducklake.sql("SET lock_configuration = true")Restrict external access to the DuckDB engine
ducklake.disable_external_accessAllow specific directories and paths
ducklake.disable_external_access(
allowed_directories: ["/path/to/directory"],
allowed_paths: ["/path/to/file.txt"]
)The storage URL is automatically included in allowed_directories
Use parameterized queries when possible
ducklake.sql("SELECT * FROM events WHERE id = ?", [1])For places that do not support parameters, use quote or quote_identifier
quoted_table = ducklake.quote_identifier("events")
quoted_file = ducklake.quote("path/to/data.csv")
ducklake.sql("COPY #{quoted_table} FROM #{quoted_file}")Note: This feature is experimental and must be set when creating the catalog
Encrypt Parquet files
DuckLake::Client.new(encryption: true, ...)Note: This feature is experimental and does not currently work on tables with schema changes
Query the data with Ruby Polars
ducklake.polars("events")Specify a snapshot
ducklake.polars("events", snapshot_version: 3)
# or
ducklake.polars("events", snapshot_time: Date.today - 7)Get table info
ducklake.table_infoGet column info
ducklake.column_info("events")Drop a table
ducklake.drop_table("events")
# or
ducklake.drop_table("events", if_exists: true)List files
ducklake.list_files("events")List files at a specific snapshot version or time
ducklake.list_files("events", snapshot_version: 3)
# or
ducklake.list_files("events", snapshot_time: Date.today - 7)View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/ducklake-ruby.git
cd ducklake-ruby
bundle install
# Postgres
createdb ducklake_ruby_test
createdb ducklake_ruby_test2
bundle exec rake test:postgres
# MySQL and MariaDB
mysqladmin create ducklake_ruby_test
mysqladmin create ducklake_ruby_test2
bundle exec rake test:mysql
# SQLite
bundle exec rake test:sqlite
# DuckDB
bundle exec rake test:duckdb