lea is a minimalist alternative to tools like dbt, SQLMesh, and Google's Dataform.
lea aims to be simple and opinionated, and yet offers the possibility to be extended. We happily use it every day at Carbonfact to manage our data warehouse. We will actively maintain it and add features, while welcoming contributions.
Right now lea is compatible with BigQuery (used at Carbonfact) and DuckDB (quack quack).
Use one of the following commands, depending on which warehouse you wish to use:
```sh
pip install lea-cli[duckdb]
pip install lea-cli[bigquery]
```

This installs the `lea` command. It also makes the `lea` Python library available.
lea is configured by setting environment variables. The following variables are available:
```sh
# General configuration
LEA_USERNAME=max

# DuckDB
LEA_WAREHOUSE=duckdb
LEA_DUCKDB_PATH=duckdb.db

# BigQuery
LEA_WAREHOUSE=bigquery
LEA_BQ_LOCATION=EU
LEA_BQ_PROJECT_ID=carbonfact-dwh
LEA_BQ_DATASET_NAME=kaya
LEA_BQ_SERVICE_ACCOUNT=<a JSON dump of the service account file>
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
```

These parameters can be provided in an `.env` file, or directly in the shell. Each command also has an `--env` flag to provide a path to an `.env` file.
The prepare command has to be run once to create whatever needs creating. For instance, when using BigQuery, a dataset has to be created:
```sh
lea prepare
```

This is the main command. It runs the queries in the `views` directory:

```sh
lea run
```

The queries are run concurrently. They are organized in a DAG, which is traversed in topological order. The DAG's structure is determined automatically by analyzing the dependencies between queries.
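As a minimal sketch of what topological execution order means, here is the idea with Python's stdlib `graphlib` (the view names and dependencies are invented for the demo; this is not lea's code):

```python
# Each view maps to the set of views its query reads from.
from graphlib import TopologicalSorter

dependencies = {
    "staging.users": set(),
    "staging.orders": set(),
    "core.users": {"staging.users"},
    "core.orders": {"staging.orders", "core.users"},
}

# static_order yields every view after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

In practice the traversal is concurrent: at any point, every view whose dependencies have all finished can run in parallel.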
Each query is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. For instance, the following file structure is valid:
```
views/
    schema_1/
        table_1.sql
        table_2.sql
    schema_2/
        table_3.sql
        table_4.sql
        sub_schema_2_1/
            table_5.sql
            table_6.sql
```
Each view will be named according to its location, following the warehouse convention:
| Warehouse | Dataset | Username | Schema | Table | Name |
|---|---|---|---|---|---|
| DuckDB | dataset | user | schema | table | `schema.table` (stored in `dataset_user.db`) |
| BigQuery | dataset | user | schema | table | `dataset_user.schema__table` |
The convention in lea to reference a table in a sub-schema is to use a double underscore (`__`):
| Warehouse | Dataset | Username | Schema | Sub-schema | Table | Name |
|---|---|---|---|---|---|---|
| DuckDB | dataset | user | schema | sub | table | `schema.sub__table` (stored in `dataset_user.db`) |
| BigQuery | dataset | user | schema | sub | table | `dataset_user.schema__sub__table` |
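The naming convention above can be made concrete with a small helper. This function is purely illustrative, not part of lea's API:

```python
# Map a file under views/ to its BigQuery-style name: sub-schemas are
# joined with double underscores, and a _<user> suffix marks development.
from pathlib import Path

def bigquery_view_name(path, dataset, username=None):
    parts = Path(path).with_suffix("").parts  # strip the .sql extension
    suffix = f"_{username}" if username else ""
    return f"{dataset}{suffix}." + "__".join(parts[1:])  # drop the views/ root

print(bigquery_view_name("views/schema_2/sub_schema_2_1/table_5.sql", "dataset", "max"))
# -> dataset_max.schema_2__sub_schema_2_1__table_5
```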
Schemas are expected to be placed under a views directory. This can be changed by providing an argument to the run command:
```sh
lea run /path/to/views
```

This argument also applies to other commands in lea.
By default, lea appends a _<user> suffix to schema names. This way you can have a development schema and a production schema. Use the --production flag to disable this behavior.
```sh
lea run --production
```

The `<user>` is determined automatically from the login name. It can be overridden by setting the `LEA_USERNAME` environment variable.
A single view can be run:
```sh
lea run --select core.users
```

Several views can be run:

```sh
lea run --select core.users --select core.orders
```

Similar to dbt, lea also supports graph operators:
```sh
lea run --select core.users+   # users and everything that depends on it
lea run --select +core.users   # users and everything it depends on
lea run --select +core.users+  # users, its ancestors, and its descendants
```

You can select all views in a schema:

```sh
lea run --select core/
```

This also works with sub-schemas:

```sh
lea run --select analytics.finance/
```

There are thus 8 possible operators:
- `schema.table` (table by itself)
- `schema.table+` (table with its descendants)
- `+schema.table` (table with its ancestors)
- `+schema.table+` (table with its ancestors and descendants)
- `schema/` (all tables in schema)
- `schema/+` (all tables in schema with their descendants)
- `+schema/` (all tables in schema with their ancestors)
- `+schema/+` (all tables in schema with their ancestors and descendants)
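To make the operator semantics concrete, here is a hypothetical re-implementation (this is not lea's actual code; the DAG below maps each view to its parents):

```python
def select(selector, deps):
    """Resolve a selector against a DAG given as {view: set_of_parents}."""
    with_ancestors = selector.startswith("+")
    with_descendants = selector.endswith("+")
    name = selector.strip("+")
    if name.endswith("/"):  # schema selector, e.g. "core/"
        seeds = {v for v in deps if v.startswith(name[:-1] + ".")}
    else:
        seeds = {name}
    result = set(seeds)
    frontier = list(seeds)
    while with_ancestors and frontier:  # walk up through parents
        for parent in deps.get(frontier.pop(), ()):
            if parent not in result:
                result.add(parent)
                frontier.append(parent)
    frontier = list(seeds)
    while with_descendants and frontier:  # walk down through children
        node = frontier.pop()
        for child, parents in deps.items():
            if node in parents and child not in result:
                result.add(child)
                frontier.append(child)
    return result

deps = {
    "staging.users": set(),
    "core.users": {"staging.users"},
    "analytics.kpis": {"core.users"},
}
print(select("+core.users", deps))  # adds the staging.users ancestor
print(select("core.users+", deps))  # adds the analytics.kpis descendant
```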
Combinations are possible:
```sh
lea run --select core.users+ --select +core.orders
```

There's an Easter egg that allows selecting views that have been committed or modified in the current Git branch:
```sh
lea run --select git
lea run --select git+  # includes all descendants
```

This becomes very handy when using lea in continuous integration. See dependency freezing for more information.
WAP (Write-Audit-Publish) is a data engineering pattern that ensures data consistency and reliability. It's the data engineering equivalent of blue-green deployment in the software engineering world.
By default, when you run a refresh, the tables get progressively overwritten. This isn't necessarily a good idea. Let's say you refresh table A. Then you refresh table B, which depends on A. If the refresh of B fails, you're left with a corrupted state. This is what the WAP pattern is trying to solve.
With lea, the WAP pattern works by creating temporary tables in the same schema as the original tables. The temporary tables are then populated with the new data. Once the temporary tables are ready, the original tables are swapped with the temporary tables. If the refresh fails, the swap doesn't happen, and the original tables are untouched.
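The swap described above can be sketched with stdlib `sqlite3` (lea targets BigQuery and DuckDB, but the mechanics are the same; the table and data here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER)")
con.execute("INSERT INTO users VALUES (1)")  # the current data

# Write: build the new data in a temporary table, leaving users untouched.
con.execute("CREATE TABLE users__tmp (id INTEGER)")
con.executemany("INSERT INTO users__tmp VALUES (?)", [(1,), (2,), (3,)])

# Audit: publish only if the new data passes a sanity check.
(count,) = con.execute("SELECT COUNT(*) FROM users__tmp").fetchone()
if count > 0:
    # Publish: swap the tables; readers never see a half-written state.
    con.execute("DROP TABLE users")
    con.execute("ALTER TABLE users__tmp RENAME TO users")

rows = con.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(rows)  # 3
```

If the audit fails, the temporary table is simply dropped and the original table is never touched.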
```sh
lea run --wap
```

The `lea run` command creates a `.cache.pkl` file during the run. This file is a checkpoint containing the state of the DAG. It is used to determine which queries to run next time: if some queries have failed, only those queries and their descendants will be run again. The `.cache.pkl` file is deleted once all queries have succeeded.
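A rough sketch of this checkpointing behaviour (the file name and format are made up, and unlike lea this toy version does not re-run failed queries' descendants):

```python
import pathlib
import pickle
import tempfile

CACHE = pathlib.Path(tempfile.gettempdir()) / "lea_cache_demo.pkl"
CACHE.unlink(missing_ok=True)  # start the demo from a clean slate

def run_queries(queries):
    """queries maps each name to whether it will succeed this run."""
    done = pickle.loads(CACHE.read_bytes()) if CACHE.exists() else set()
    for name, will_succeed in queries.items():
        if name in done:
            continue  # already succeeded in a previous run: skip it
        if will_succeed:
            done.add(name)
    if len(done) == len(queries):
        CACHE.unlink(missing_ok=True)  # all succeeded: drop the checkpoint
    else:
        CACHE.write_bytes(pickle.dumps(done))
    return set(done)

first = run_queries({"a": True, "b": False})  # "b" fails: checkpoint written
second = run_queries({"a": True, "b": True})  # "a" skipped; all clean now
```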
This checkpointing logic can be disabled with the --fresh flag.
```sh
lea run --fresh
```

The `--fail-fast` flag can be used to stop immediately if a query fails:
```sh
lea run --fail-fast
```

For debugging purposes, it is possible to print out a query and copy it to the clipboard:

```sh
lea run --select core.users --print | pbcopy
```

Tests are run with:

```sh
lea test
```

There are two types of tests:
- Singular tests: queries which return failing rows. They are stored in a `tests` directory.
- Assertion tests: comment annotations in the queries themselves:
  - `@NO_NULLS`: checks that all values in a column are not null.
  - `@UNIQUE`: checks that a column's values are unique.
  - `@UNIQUE_BY(<by>)`: checks that a column's values are unique within a group.
  - `@SET{<elements>}`: checks that a column's values are in a set of values.
Here's an example of a query annotated with assertion tests:
```sql
SELECT
    -- @UNIQUE
    -- @NO_NULLS
    user_id,
    -- @NO_NULLS
    address,
    -- @UNIQUE_BY(address)
    full_name,
    -- @SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
```

As with the run command, there is a `--production` flag to disable the `<user>` suffix, and therefore target production data.
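One way such annotations could be compiled into queries that return failing rows is sketched below. This is not lea's implementation: it only handles `@NO_NULLS` and `@UNIQUE` over a simplified column list.

```python
def assertion_queries(sql, table):
    """Turn -- @NO_NULLS / -- @UNIQUE comments into failing-row queries."""
    queries = []
    pending = []  # annotations waiting for the next column
    for line in sql.splitlines():
        line = line.strip()
        if line.startswith("--"):
            pending.append(line.lstrip("- "))
        elif line:
            column = line.rstrip(",")
            for tag in pending:
                if tag == "@NO_NULLS":
                    queries.append(f"SELECT * FROM {table} WHERE {column} IS NULL")
                elif tag == "@UNIQUE":
                    queries.append(
                        f"SELECT {column}, COUNT(*) FROM {table} "
                        f"GROUP BY {column} HAVING COUNT(*) > 1"
                    )
            pending = []
    return queries

sql = """
-- @UNIQUE
-- @NO_NULLS
user_id,
address
"""
for q in assertion_queries(sql, "core.users"):
    print(q)
```

Each generated query returns rows only when the assertion is violated, so a passing test is simply an empty result set.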
You can select a subset of views, which will run the tests that depend on them:

```sh
lea test --select-views core.users
```

It is possible to generate documentation for the queries. This is done by inspecting the schema of the generated views and extracting the comments in the queries.
```sh
lea docs --output-dir docs  # where to put the generated files
```

This will also create a Mermaid diagram in the docs directory. This diagram is a visualization of the DAG. See here for an example.
```sh
lea diff
```

This prints out a summary of the differences between development and production. Here is an example output:
```
core__users
+ 42 rows
+ age
+ email
- core__coupons
- 129 rows
- coupon_id
- amount
- user_id
- has_aggregation_key
core__orders
- discount
+ supplier
core__sales
+ 100 rows
```

This is handy in pull requests. For instance, at Carbonfact, we have a dataset for each pull request. We compare it to the production dataset and post the diff as a comment in the pull request. The diff is updated every time the pull request is updated. Check out this example for more information.
```sh
lea teardown
```

This deletes the schema created by `lea prepare`. This is handy during continuous integration. For example, you might create a temporary schema in a branch. You would typically want to delete it after testing is finished and/or when the branch is merged.
SQL queries can be templated with Jinja. A `.sql.jinja` extension is necessary for lea to recognise them.

You have access to an `env` variable within the template context, which is simply an access point to `os.environ`.
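To illustrate the idea of pulling environment variables into a query template, here is a stdlib `string.Template` stand-in for Jinja (the `LEA_SCHEMA` variable is made up for the demo; an actual `.sql.jinja` file would use Jinja's `{{ ... }}` syntax and the `env` context variable instead):

```python
import os
from string import Template

os.environ["LEA_SCHEMA"] = "analytics"  # would normally come from .env
query = Template("SELECT * FROM ${LEA_SCHEMA}.users").substitute(os.environ)
print(query)  # SELECT * FROM analytics.users
```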
You can write views with Python scripts. The only requirement is that the script contains a pandas DataFrame with the same name as the script. For instance, `users.py` should contain a `users` variable.
```python
import pandas as pd

users = pd.DataFrame(
    [
        {"id": 1, "name": "Max"},
        {"id": 2, "name": "Angie"},
    ]
)
```

The `lea run` command can be used to only refresh a subset of views. Let's say we have this DAG:
```
fee -> fi -> fo -> fum
```
Assuming `LEA_USERNAME=max`, running `lea run --select fo+` will:

- Execute `fo` and materialize it to `fo_max`.
- Execute `fum` and materialize it to `fum_max`.
This only works if `fee_max` and `fi_max` already exist. This might be the case if you've run a full refresh before. But if you're running a first refresh, then `fee_max` and `fi_max` won't exist! This is where the `--freeze-unselected` flag comes into play:
```sh
lea run --select fo+ --freeze-unselected
```

This means the main `fee` and `fi` tables will be used instead of `fee_max` and `fi_max`.
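The resolution rule can be sketched as follows. This is illustrative only, not lea's code; the `fee`/`fi`/`fo` names and `_max` suffix follow the example above:

```python
def resolve(table, selected, username, freeze_unselected):
    """Decide which physical table a reference points at."""
    if table in selected or not freeze_unselected:
        return f"{table}_{username}"  # developer copy
    return table  # frozen: read the production table directly

selected = {"fo", "fum"}
print(resolve("fo", selected, "max", True))   # fo_max (selected, refreshed)
print(resolve("fi", selected, "max", True))   # fi (unselected, frozen)
print(resolve("fi", selected, "max", False))  # fi_max (default behaviour)
```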
Dependency freezing is particularly useful when using lea in a CI/CD context. You can run the following command in a pull request:
```sh
lea run --select git+ --freeze-unselected
```

This will only run the modified views and their descendants. The dependencies of these modified views will be taken from production. This is akin to dbt's defer functionality, but without the need for managing artifacts.
The added benefit is that you are guaranteed to do a comparison with the same base tables when running lea diff. Check out this article to learn more.
Feel free to reach out to [email protected] if you want to know more and/or contribute!
We have suggested some issues as good places to get started.
lea is free and open-source software licensed under the Apache License, Version 2.0.