Thanks to visit codestin.com
Credit goes to github.com

Skip to content

nshkrdotcom/duckdb_ex

Repository files navigation

DuckDB Elixir Client Logo

DuckDB Elixir

CI Elixir OTP Hex.pm Documentation License

A DuckDB client for Elixir, providing a powerful Relation API for analytical queries.

Status: 🚀 Core Relation API implemented and tested - Ready for evaluation

About

DuckdbEx brings DuckDB's analytical power to Elixir with:

  • Lazy Relation API: Composable, chainable query building inspired by Python's DuckDB API
  • CLI-based Architecture: Uses DuckDB CLI via erlexec for maximum portability
  • Idiomatic Elixir: Functional, pipe-friendly API with pattern matching
  • Comprehensive Testing: TDD approach with 71 passing tests
  • Python API Compatibility: Functions mirror the official Python API

Installation

# Add to mix.exs
def deps do
  [
    {:duckdb_ex, "~> 0.1.1"}
  ]
end

Quick Start

💡 New! Check out the examples/ directory for 8 comprehensive, runnable examples:

mix run examples/00_quickstart.exs
mix run examples/01_basic_queries.exs
mix run examples/02_tables_and_data.exs
# ... and more!

Basic Connection and Queries

# Connect to in-memory database
{:ok, conn} = DuckdbEx.Connection.connect(:memory)

# Or connect to a file
{:ok, conn} = DuckdbEx.Connection.connect("/path/to/database.duckdb")

# Execute SQL directly
{:ok, result} = DuckdbEx.Connection.execute(conn, "SELECT 42 as answer")

# Fetch all rows as maps
{:ok, rows} = DuckdbEx.Connection.fetch_all(conn, "SELECT * FROM users")
# => [%{"id" => 1, "name" => "Alice"}, ...]

# Fetch single row
{:ok, row} = DuckdbEx.Connection.fetch_one(conn, "SELECT * FROM users LIMIT 1")
# => %{"id" => 1, "name" => "Alice"}

# Close connection
DuckdbEx.Connection.close(conn)

Relation API - Lazy Query Building

The Relation API allows you to build complex queries through method chaining, with execution deferred until you fetch results:

# Create a relation (no execution yet)
relation = DuckdbEx.Connection.table(conn, "orders")

# Chain operations (still no execution)
result = relation
|> DuckdbEx.Relation.filter("amount > 100")
|> DuckdbEx.Relation.project(["customer_name", "amount", "order_date"])
|> DuckdbEx.Relation.order("amount DESC")
|> DuckdbEx.Relation.limit(10)
|> DuckdbEx.Relation.fetch_all()  # Executes here

# Result: Top 10 orders over $100
{:ok, rows} = result

Working with Relations

Creating Relations

# From a table or view
relation = DuckdbEx.Connection.table(conn, "products")

# From SQL
relation = DuckdbEx.Connection.sql(conn, "SELECT * FROM generate_series(1, 100)")

# From range (using DuckDB's range function)
relation = DuckdbEx.Connection.sql(conn, "SELECT * FROM range(10)")

Filtering Data

# Simple filter
relation
|> DuckdbEx.Relation.filter("price > 50")
|> DuckdbEx.Relation.fetch_all()

# Chain multiple filters (AND logic)
relation
|> DuckdbEx.Relation.filter("price > 50")
|> DuckdbEx.Relation.filter("category = 'Electronics'")
|> DuckdbEx.Relation.fetch_all()

# Complex conditions
relation
|> DuckdbEx.Relation.filter("price > 50 AND (category = 'Electronics' OR category = 'Computers')")
|> DuckdbEx.Relation.fetch_all()

Selecting Columns

# Select specific columns
relation
|> DuckdbEx.Relation.project(["name", "price"])
|> DuckdbEx.Relation.fetch_all()

# Use expressions
relation
|> DuckdbEx.Relation.project([
  "name",
  "price",
  "price * 1.1 as price_with_tax",
  "upper(category) as category_upper"
])
|> DuckdbEx.Relation.fetch_all()

Sorting and Limiting

# Order by column
relation
|> DuckdbEx.Relation.order("price DESC")
|> DuckdbEx.Relation.fetch_all()

# Multiple columns
relation
|> DuckdbEx.Relation.order("category ASC, price DESC")
|> DuckdbEx.Relation.fetch_all()

# Limit results
relation
|> DuckdbEx.Relation.limit(100)
|> DuckdbEx.Relation.fetch_all()

# Top-N query
relation
|> DuckdbEx.Relation.order("revenue DESC")
|> DuckdbEx.Relation.limit(10)
|> DuckdbEx.Relation.fetch_all()

Aggregations

Simple Aggregations

# Count all rows
relation
|> DuckdbEx.Relation.aggregate("count(*) as total")
|> DuckdbEx.Relation.fetch_all()
# => {:ok, [%{"total" => 1000}]}

# Multiple aggregations
relation
|> DuckdbEx.Relation.aggregate([
  "count(*) as count",
  "sum(amount) as total",
  "avg(amount) as average",
  "min(amount) as minimum",
  "max(amount) as maximum"
])
|> DuckdbEx.Relation.fetch_all()

GROUP BY Aggregations

# Group by single column
DuckdbEx.Connection.table(conn, "sales")
|> DuckdbEx.Relation.aggregate(
  "sum(amount) as total_sales",
  group_by: ["region"]
)
|> DuckdbEx.Relation.fetch_all()

# Group by multiple columns
DuckdbEx.Connection.table(conn, "sales")
|> DuckdbEx.Relation.aggregate(
  ["sum(amount) as total", "count(*) as count"],
  group_by: ["region", "year"]
)
|> DuckdbEx.Relation.fetch_all()

# With filtering and ordering
DuckdbEx.Connection.table(conn, "products")
|> DuckdbEx.Relation.filter("price > 10")  # WHERE clause
|> DuckdbEx.Relation.aggregate(
  ["sum(price) as total", "count(*) as count"],
  group_by: ["category"]
)
|> DuckdbEx.Relation.filter("total > 1000")  # HAVING clause
|> DuckdbEx.Relation.order("total DESC")
|> DuckdbEx.Relation.fetch_all()

Convenience Aggregate Methods

# Count rows
relation |> DuckdbEx.Relation.count() |> DuckdbEx.Relation.fetch_all()
# => {:ok, [%{"count" => 100}]}

# Sum a column
relation |> DuckdbEx.Relation.sum("amount") |> DuckdbEx.Relation.fetch_all()
# => {:ok, [%{"sum" => 45000}]}

# Average
relation |> DuckdbEx.Relation.avg("price") |> DuckdbEx.Relation.fetch_all()
# => {:ok, [%{"avg" => 42.5}]}

# Min/Max
relation |> DuckdbEx.Relation.min("temperature") |> DuckdbEx.Relation.fetch_all()
relation |> DuckdbEx.Relation.max("score") |> DuckdbEx.Relation.fetch_all()

Complete Examples

E-commerce Analytics

{:ok, conn} = DuckdbEx.Connection.connect(:memory)

# Create and populate table
DuckdbEx.Connection.execute(conn, """
  CREATE TABLE orders (
    order_id INTEGER,
    customer_name VARCHAR,
    product_category VARCHAR,
    amount DECIMAL(10,2),
    order_date DATE
  )
""")

DuckdbEx.Connection.execute(conn, """
  INSERT INTO orders VALUES
    (1, 'Alice', 'Electronics', 999.99, '2024-01-15'),
    (2, 'Bob', 'Books', 29.99, '2024-01-16'),
    (3, 'Alice', 'Electronics', 49.99, '2024-01-17'),
    (4, 'Charlie', 'Furniture', 599.99, '2024-01-18'),
    (5, 'Bob', 'Electronics', 299.99, '2024-01-19')
""")

# Analyze: Top customers by total spending in Electronics
{:ok, top_customers} =
  conn
  |> DuckdbEx.Connection.table("orders")
  |> DuckdbEx.Relation.filter("product_category = 'Electronics'")
  |> DuckdbEx.Relation.aggregate(
    ["sum(amount) as total_spent", "count(*) as order_count"],
    group_by: ["customer_name"]
  )
  |> DuckdbEx.Relation.filter("total_spent > 100")
  |> DuckdbEx.Relation.order("total_spent DESC")
  |> DuckdbEx.Relation.fetch_all()

# Result:
# [
#   %{"customer_name" => "Alice", "total_spent" => 1049.98, "order_count" => 2},
#   %{"customer_name" => "Bob", "total_spent" => 299.99, "order_count" => 1}
# ]

Time Series Analysis

# Daily sales aggregation with statistical measures
{:ok, daily_stats} =
  conn
  |> DuckdbEx.Connection.table("sales")
  |> DuckdbEx.Relation.aggregate(
    [
      "date_trunc('day', timestamp) as day",
      "sum(amount) as daily_total",
      "avg(amount) as daily_avg",
      "stddev_pop(amount) as daily_stddev",
      "count(*) as transaction_count"
    ],
    group_by: ["date_trunc('day', timestamp)"]
  )
  |> DuckdbEx.Relation.order("day DESC")
  |> DuckdbEx.Relation.limit(30)
  |> DuckdbEx.Relation.fetch_all()

Data Pipeline

defmodule DataPipeline do
  def process_sales_data(conn) do
    # Reusable base relation
    base = DuckdbEx.Connection.table(conn, "raw_sales")

    # High-value customers
    high_value = base
    |> DuckdbEx.Relation.filter("total_purchases > 1000")
    |> DuckdbEx.Relation.project(["customer_id", "email"])

    # Recent activity
    recent = base
    |> DuckdbEx.Relation.filter("order_date > '2024-01-01'")
    |> DuckdbEx.Relation.aggregate(
      "count(*) as recent_orders",
      group_by: ["customer_id"]
    )

    # Execute both queries
    {:ok, high_value_customers} = DuckdbEx.Relation.fetch_all(high_value)
    {:ok, recent_activity} = DuckdbEx.Relation.fetch_all(recent)

    {high_value_customers, recent_activity}
  end
end

Working with DuckDB Functions

# Use DuckDB's built-in functions
conn
|> DuckdbEx.Connection.sql("SELECT * FROM range(100)")
|> DuckdbEx.Relation.filter("range % 2 = 0")  # Even numbers only
|> DuckdbEx.Relation.project(["range", "range * range as squared"])
|> DuckdbEx.Relation.fetch_all()

# Generate test data
conn
|> DuckdbEx.Connection.sql("SELECT * FROM generate_series(1, 1000) as id")
|> DuckdbEx.Relation.project([
  "id",
  "random() as random_value",
  "case when id % 2 = 0 then 'even' else 'odd' end as parity"
])
|> DuckdbEx.Relation.aggregate(
  ["avg(random_value) as avg_random", "count(*) as count"],
  group_by: ["parity"]
)
|> DuckdbEx.Relation.fetch_all()

API Reference

DuckdbEx.Connection

  • connect(database, opts \\ []) - Open database connection
  • execute(conn, sql, params \\ []) - Execute SQL query
  • fetch_all(conn, sql) - Execute and fetch all rows
  • fetch_one(conn, sql) - Execute and fetch first row
  • close(conn) - Close connection
  • sql(conn, sql) - Create relation from SQL
  • table(conn, table_name) - Create relation from table

DuckdbEx.Relation

Transformations (lazy, return new relation):

  • project(relation, columns) - Select columns
  • filter(relation, condition) - Filter rows
  • limit(relation, n) - Limit results
  • order(relation, order_by) - Sort results
  • aggregate(relation, expressions, opts \\ []) - Aggregate data

Convenience Aggregates:

  • count(relation) - Count rows
  • sum(relation, column) - Sum column
  • avg(relation, column) - Average column
  • min(relation, column) - Minimum value
  • max(relation, column) - Maximum value

Execution (trigger query execution):

  • execute(relation) - Execute and return result struct
  • fetch_all(relation) - Execute and fetch all rows
  • fetch_one(relation) - Execute and fetch first row
  • fetch_many(relation, n) - Execute and fetch N rows

DuckdbEx.Result

  • fetch_all(result) - Get all rows as list of maps
  • fetch_one(result) - Get first row as map
  • fetch_many(result, n) - Get N rows as list of maps
  • row_count(result) - Get number of rows
  • columns(result) - Get column names
  • to_tuples(result) - Convert rows to tuples

Architecture

DuckdbEx uses the DuckDB CLI process via erlexec instead of native NIFs:

Advantages:

  • ✅ Maximum portability (works everywhere DuckDB CLI works)
  • ✅ No compilation needed
  • ✅ Easy to debug and maintain
  • ✅ Handles all DuckDB features automatically

Trade-offs:

  • JSON serialization overhead (minimal for analytical queries)
  • No zero-copy data transfer
  • Cannot implement native UDFs

This architecture is ideal for analytical workloads where query execution time dominates, and the JSON overhead is negligible compared to query processing.

Examples

The examples/ directory contains 8 comprehensive, runnable examples demonstrating all features:

Example Description Run With
00_quickstart.exs Your first DuckDB query mix run examples/00_quickstart.exs
01_basic_queries.exs Simple queries, math, strings, dates mix run examples/01_basic_queries.exs
02_tables_and_data.exs CREATE, INSERT, UPDATE, DELETE mix run examples/02_tables_and_data.exs
03_transactions.exs Transaction management mix run examples/03_transactions.exs
04_relations_api.exs Lazy query building mix run examples/04_relations_api.exs
05_csv_parquet_json.exs Reading/writing files mix run examples/05_csv_parquet_json.exs
06_analytics_window_functions.exs Advanced analytics mix run examples/06_analytics_window_functions.exs
07_persistent_database.exs File-based databases mix run examples/07_persistent_database.exs

See examples/README.md for detailed descriptions and more information.

Testing

# Run all tests
mix test

# Run specific test file
mix test test/duckdb_ex/relation_test.exs

# Run with coverage
mix test --cover

# Run with specific seed
mix test --seed 123456

Current Test Coverage: 114 tests, 100% pass rate (after performance optimization)

Development Status

✅ Implemented

Core Connection API:

  • Connection management (connect, close)
  • Query execution (execute) with deterministic completion detection
  • Result fetching (fetch_all, fetch_one)
  • Exception hierarchy (27 types)
  • Transaction management (begin, commit, rollback, transaction helper)
  • Checkpoint support
  • Read-only connections

Relation API - Basic Operations:

  • Relation creation (sql, table)
  • Projections (project)
  • Filtering (filter)
  • Ordering (order)
  • Limiting (limit)
  • Lazy evaluation

Relation API - Aggregations:

  • Generic aggregation (aggregate)
  • GROUP BY support
  • HAVING clause (via filter after aggregate)
  • Convenience methods (count, sum, avg, min, max)
  • Statistical functions (stddev, variance)

Relation API - Advanced:

  • Joins (inner, left, right, outer, cross)
  • Set operations (union, intersect, except)
  • Distinct operations

File Format Support:

  • CSV reading/writing (read_csv_auto, COPY TO)
  • Parquet reading/writing
  • JSON reading/writing
  • Direct file querying

Performance:

  • Optimized query execution (100-200x faster via completion markers)
  • Tests run in ~1 second (previously took minutes due to timeouts)

📋 Planned (Phase 2+)

  • Explorer DataFrame integration
  • Prepared statements
  • Extensions management
  • Streaming results

Contributing

This project follows strict Test-Driven Development (TDD):

  1. RED: Write failing tests first
  2. GREEN: Implement minimal code to pass tests
  3. REFACTOR: Improve code while keeping tests green
  4. DOCUMENT: Add comprehensive docs and examples

All contributions should:

  • Include comprehensive tests
  • Follow existing code style
  • Reference Python API where applicable
  • Maintain 100% test pass rate

Comparison with Python API

# Python DuckDB
import duckdb
conn = duckdb.connect()
rel = conn.table('users')
result = (rel
  .filter('age > 25')
  .project(['name', 'email'])
  .order('name')
  .limit(10)
  .fetchall())
# Elixir DuckDB
{:ok, conn} = DuckdbEx.Connection.connect(:memory)
{:ok, result} = conn
|> DuckdbEx.Connection.table("users")
|> DuckdbEx.Relation.filter("age > 25")
|> DuckdbEx.Relation.project(["name", "email"])
|> DuckdbEx.Relation.order("name")
|> DuckdbEx.Relation.limit(10)
|> DuckdbEx.Relation.fetch_all()

API is intentionally similar for easy migration!

Performance

DuckdbEx uses a completion marker approach for deterministic query completion detection instead of timeouts:

  • 100-200x faster query execution (7-12ms vs 1000-2000ms per query)
  • Full test suite runs in ~1 second (114 tests)
  • No arbitrary timeouts or guessing
  • Proper error handling for aborted transactions

How It Works

Instead of waiting for timeouts, we append a marker query after each command:

-- Your query
SELECT * FROM users;
-- Completion marker (added automatically)
SELECT '__DUCKDB_COMPLETE__' as __status__;

When we see the marker in the output, we know DuckDB is done. The marker is stripped before returning results to you.

Why This Approach

  • Deterministic: We know exactly when queries complete
  • Fast: No waiting for arbitrary timeouts
  • Reliable: Works for all query types (SELECT, DDL, DML)
  • Error-aware: Special handling for aborted transactions

Performance Considerations

  • DuckDB excels at analytical queries on large datasets
  • Relation API allows DuckDB to optimize entire query tree
  • JSON overhead is minimal compared to query execution time
  • Best for OLAP workloads, not OLTP

Requirements

  • Elixir 1.14+
  • Erlang/OTP 25+
  • DuckDB CLI installed and in PATH

License

MIT License - see LICENSE file for details.

Acknowledgments

Support

For questions and discussions:


Made with ❤️ for the Elixir and DuckDB communities