DBT Fundamentals
GLOSSARY
SOURCES
SEEDS
SNAPSHOTS
TESTS
ETL vs ELT
ETL: Extract → Transform → Load
• Transforms data prior to loading it (resource-efficient for the destination system).
• Requires transformation servers & middleware infrastructure.
• Suited for legacy systems and smaller data volumes.
• Pre-validated, clean data at time of loading; fewer downstream errors.
• Limited flexibility.
ELT: Extract → Load → Transform
• Loads raw data first, enabling multiple transformation versions from the same source.
• Leverages modern DWH computing power for transformations.
• Scales with big data (handles larger volumes efficiently).
• Raw data kept alongside transformed views (data integrity).
• More cost-effective (eliminates the need for separate transformation infrastructure).
[Diagram labels: Develop, Test/Document, Deploy]
DBT
Issues DBT solves:
• Lack of testing/documentation
• Rewriting stored procedures code
• Hard to understand transformation code
[Diagram: Data Loaders → Data Warehouses (Raw data → Transformed data) → BI Tools]
[Snapshot flowchart: New records? If yes, dbt adds the new records with 'dbt_valid_to' = null.]
Problems/Challenges
• Streaming data: dbt is primarily designed for batch processing (no native support for real-time data transformations).
• Best practices are hard to implement in dbt Core (limited compared with dbt Cloud).
• Overreliance on dbt may lead to bad practices (e.g. creating overly complex transformations, neglecting database-level optimizations, building too many intermediate tables).
• Custom DDL is hard to deal with (e.g. specific table properties, custom storage patterns, complex partitioning strategies, custom indexes, or materialized views with specific configurations).
SQLFluff
SQLFluff: a SQL linting tool that helps maintain consistent, high-quality SQL code in your data models.
SQL code rule checking: analyzes your SQL code against a predefined set of rules and best practices, such as:
• Keyword capitalization; proper indentation & formatting; appropriate spacing around operators; naming conventions; code structure/organization; query complexity & performance considerations.
Code fixing capabilities: automatically fixes many common problems in your SQL code:
• Reformats code to match style guidelines; fixes indentation issues; standardizes capitalization; corrects spacing; restructures queries for better readability.
Models (Incremental)
{% if ... %} ... {% endif %}: a Jinja statement wrapping the incremental logic (a where clause with an updated_at cutoff) in a conditional statement.
{{ this }}: a variable used to self-reference the model; it compiles to where the code is running (the model as it exists in the DWH).
is_incremental(): a built-in dbt function that returns true, so the incremental logic runs, if 3 conditions are met:
1. materialized = 'incremental'.
2. A table exists for this model in our DWH.
3. The --full-refresh flag is not passed (a full refresh overrides the incremental materialization and builds the table from scratch again).
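The three conditions above can be read as one boolean check. The sketch below is a plain-Python illustration only, not dbt's implementation; the function name runs_incrementally and its parameters are hypothetical.

```python
def runs_incrementally(materialized: str, table_exists: bool, full_refresh: bool) -> bool:
    """Mirror the three conditions listed for is_incremental() (illustrative only)."""
    return (
        materialized == "incremental"  # 1. model is configured as incremental
        and table_exists               # 2. a table already exists in the DWH
        and not full_refresh           # 3. --full-refresh was not passed
    )


# First run: no table yet, so the model is built from scratch.
first_run = runs_incrementally("incremental", table_exists=False, full_refresh=False)
# Later run: all three conditions hold, so the incremental where clause applies.
later_run = runs_incrementally("incremental", table_exists=True, full_refresh=False)
# --full-refresh overrides the incremental materialization.
forced_rebuild = runs_incrementally("incremental", table_exists=True, full_refresh=True)
```

Only the second call satisfies all three conditions, which is the case where the where clause inside the {% if %} block would be compiled into the query.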
Jinja
Jinja is a Python templating library that extends dbt's SQL capabilities. It lets you:
• Use control structures (e.g. 'if' statements, 'for' loops)
• Leverage variables and/or the results of one query in another query
• Abstract snippets of SQL into reusable macros
Jinja uses three kinds of delimiters:
• Expressions {{ ... }}: used to reference variables and/or call macros. Output a string.
• Statements {% ... %}: used for control flow (e.g. for loops, if statements, setting/modifying variables). Produce no string output.
• Comments {# ... #}
The functions {{ ref() }} and {{ source() }} are used for lineage & dependency management.
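To make the three delimiter kinds concrete, here is a small sketch that counts them in a model. It uses plain regular expressions rather than the Jinja library, and the model text (including stg_payments and the column names) is a made-up example.

```python
import re

# One pattern per delimiter kind described above.
DELIMITERS = {
    "expression": re.compile(r"\{\{.*?\}\}", re.DOTALL),
    "statement": re.compile(r"\{%.*?%\}", re.DOTALL),
    "comment": re.compile(r"\{#.*?#\}", re.DOTALL),
}


def count_delimiters(template: str) -> dict:
    """Count how many blocks of each delimiter kind appear in a template."""
    return {kind: len(pattern.findall(template)) for kind, pattern in DELIMITERS.items()}


# Hypothetical dbt model mixing all three delimiter kinds.
MODEL_SQL = """
{# hypothetical model: amount per payment method #}
{% set methods = ['card', 'cash'] %}
select
    order_id,
    {% for m in methods %}
    sum(case when method = '{{ m }}' then amount end) as {{ m }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1
"""

counts = count_delimiters(MODEL_SQL)
```

The statements (set, for, if, endif, endfor) steer how the SQL is generated, while only the expressions end up as text in the compiled query.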
dbt commands
dbt compile: compiles models to SQL and stores the result in the target folder.
Macros
Macros are reusable pieces of code (a.k.a. dbt functions), defined in .sql files inside the 'macros' folder. How to use macros?
• Write your own macro.
• Use open-source macros (with the dbt-utils library installed), callable inside the SQL statement.
Variables
Variables are defined inside the dbt_project.yml file, and can be scoped globally or to a specific imported package.
[Screenshot annotations: global var; var specific to my_dbt_project]
vars can be overwritten on the command line following the pattern
dbt run --vars '{"key": "value"}'
vars can be accessed in a SQL statement with the var() function.
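The scoping and override rules above amount to a lookup order. The sketch below assumes that order is: command-line --vars first, then package-scoped vars, then global vars, then the default passed to var(). That is my reading of dbt's documented precedence, so treat it as an assumption. The function resolve_var and the sample values are hypothetical.

```python
from collections import ChainMap

_MISSING = object()  # sentinel so that None can still be a valid default


def resolve_var(name, cli_vars, package_vars, global_vars, default=_MISSING):
    """Return the first value found, searching from most to least specific (assumed order)."""
    lookup = ChainMap(cli_vars, package_vars, global_vars)
    if name in lookup:
        return lookup[name]
    if default is not _MISSING:
        return default
    raise KeyError(f"Required var '{name}' not found")


# Hypothetical values as they might appear in dbt_project.yml and on the CLI.
global_vars = {"start_date": "2020-01-01", "currency": "USD"}
package_vars = {"start_date": "2021-01-01"}  # scoped to my_dbt_project
cli_vars = {"currency": "EUR"}               # dbt run --vars '{"currency": "EUR"}'

start_date = resolve_var("start_date", cli_vars, package_vars, global_vars)
currency = resolve_var("currency", cli_vars, package_vars, global_vars)
region = resolve_var("region", cli_vars, package_vars, global_vars, default="n/a")
```

Here start_date comes from the package scope, currency from the command line, and region falls back to the default.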
dbt tests
[Screenshot captions: a singular SQL test declared in the tests' schema.yml file; a test added in a SQL statement; a model that calculates the validity of an email; a test added in the model's yml file to validate the email logic.]
dbt tests' best practices: what type of tests, and where in the dbt project?
Check & run only models that depend on fresh sources.
[CI/CD flowchart boxes: New git tag created; Run CI build; CI build successful? (Yes); Ask for review; PR approved? (Yes); Merge PR. Note: not possible to deploy unless main is fixed.]
[Airflow DAG diagram: pre_dbt_workflow (EmptyOperator), dbt_build (BashOperator), post_dbt_workflow (EmptyOperator). Annotation: partial regular refresh (dim_customers & everything before).]
[Airflow DAG diagram: dbt_snapshot, dbt_seed, dbt_stg_customers, dbt_dim_customers (BashOperator each), post_dbt_workflow (EmptyOperator).]
[Cosmos diagram: DbtDag() and DbtTaskGroup(); labels "Dag", "Operator", "Runs inside Cosmos". Example jaffle_shop_cosmos_dag: country_codes_seed (DbtRunLocalOperator), stg_customers, dim_customers.]
NODE SELECTION
SYNTAX OVERVIEW
run: --select (-s), --exclude, --selector, --defer
test: --select (-s), --exclude, --selector, --defer
seed: --select (-s), --exclude, --selector
snapshot: --select (-s), --exclude, --selector
list: --select (-s), --exclude, --selector, --resource-type
compile: --select (-s), --exclude, --selector
freshness: --select (-s), --exclude, --selector
build: --select (-s), --exclude, --selector, --resource-type, --defer
docs generate: --select (-s), --exclude, --selector
EXCLUDING MODELS
dbt provides an --exclude flag with the same semantics as --select. Models specified with the --exclude flag will be removed from the set of models selected with --select.
Example:
• $ dbt run --select my_package.*+ --exclude my_package.a_big_model+
GRAPH OPERATORS
Plus operator(+)
• $ dbt run --select my_model+ - select my_model and all children
• $ dbt run --select +my_model - select my_model and all parents
• $ dbt run --select +my_model+ - select my_model, and all of its parents and children
N-plus operator
• $ dbt run --select my_model+1 - select my_model and its first-degree children
• $ dbt run --select 2+my_model - select my_model, its first-degree parents, and its second-degree parents ("grandparents")
• $ dbt run --select 3+my_model+4 - select my_model, its parents up to the 3rd degree, and its children down to the 4th degree
At operator(@)
• $ dbt run --select @my_model - select my_model, its children, and the parents of its children
Star operator(*)
• $ dbt run --select finance.base.* - run all of the models in models/finance/base
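The plus and n-plus operators above are graph walks with an optional depth limit. The sketch below illustrates that idea on a made-up five-model lineage; it is not dbt's selector code, and the graph, walk() and select() are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each model maps to its direct children.
CHILDREN = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["my_model"],
    "my_model": ["fct_orders"],
    "fct_orders": ["report"],
    "report": [],
}

# Invert the edges to get each model's direct parents.
PARENTS = {node: [] for node in CHILDREN}
for node, kids in CHILDREN.items():
    for kid in kids:
        PARENTS[kid].append(node)


def walk(start, edges, max_depth=None):
    """Breadth-first walk; nodes reachable within max_depth hops (None = unlimited)."""
    seen = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def select(model, parents_depth=0, children_depth=0):
    """Depth arguments: 0 = none, None = unlimited, n = up to n degrees."""
    selected = {model}
    if parents_depth != 0:
        selected |= walk(model, PARENTS, parents_depth)
    if children_depth != 0:
        selected |= walk(model, CHILDREN, children_depth)
    return selected


all_children = select("my_model", children_depth=None)  # my_model+
all_parents = select("my_model", parents_depth=None)    # +my_model
first_children = select("my_model", children_depth=1)   # my_model+1
first_parents = select("my_model", parents_depth=1)     # 1+my_model
```

The breadth-first order matters for the n-plus case: each model is first reached at its shortest distance from my_model, so the depth limit cuts off at the right degree.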
SET OPERATORS
Unions (space-delimited)
• $ dbt run --select +snowplow_sessions +fct_orders - run snowplow_sessions, all ancestors of snowplow_sessions, fct_orders, and all ancestors of fct_orders
Intersections (comma-separated)
• $ dbt run --select +snowplow_sessions,+fct_orders - run all the common ancestors of snowplow_sessions and fct_orders
• $ dbt run --select marts.finance,tag:nightly - run models that are in the marts/finance subdirectory and tagged nightly
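Unions and intersections behave like ordinary set operations on the nodes each selector matches. The sketch below shows this with Python sets. The lineage, including the shared parent stg_users, is invented for illustration.

```python
# Hypothetical lineage: each model maps to its direct parents.
PARENTS = {
    "raw_events": [],
    "raw_orders": [],
    "stg_users": [],
    "stg_events": ["raw_events"],
    "stg_orders": ["raw_orders"],
    "snowplow_sessions": ["stg_events", "stg_users"],
    "fct_orders": ["stg_orders", "stg_users"],
}


def with_ancestors(model):
    """Nodes matched by '+model': the model itself plus all of its ancestors."""
    selected = {model}
    for parent in PARENTS[model]:
        selected |= with_ancestors(parent)
    return selected


sessions = with_ancestors("snowplow_sessions")  # +snowplow_sessions
orders = with_ancestors("fct_orders")           # +fct_orders

union = sessions | orders         # space-delimited: +snowplow_sessions +fct_orders
intersection = sessions & orders  # comma-separated: +snowplow_sessions,+fct_orders
```

In this toy graph the union covers every model, while the intersection keeps only stg_users, the one ancestor both models share.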
STATE
Some methods require a manifest file to compare the current state of the project with another state, like the state of a previous invocation or the state of the project in production.
The path of this manifest can be passed using the --state flag.
DEFER
Defer allows you to build your project without having to build upstream resources. It requires a state.
It is commonly used for Slim CI:
dbt build -s "state:modified+" --defer --state path/to/artifacts
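The Slim CI selection can be pictured as "diff two states, then add everything downstream". The sketch below is a deliberate simplification: each state is reduced to a dict of node name to checksum, whereas a real manifest holds far more and dbt compares more than checksums. All names and values are hypothetical.

```python
# Simplified stand-ins for the current project state and the state passed via --state.
current = {"stg_customers": "a1", "dim_customers": "b2", "fct_orders": "c9", "dim_products": "d4"}
previous = {"stg_customers": "a1", "dim_customers": "b1", "fct_orders": "c9"}

# Hypothetical lineage: each model maps to its direct children.
CHILDREN = {
    "stg_customers": ["dim_customers"],
    "dim_customers": ["fct_orders"],
    "fct_orders": [],
    "dim_products": [],
}


def modified_nodes(current_state, previous_state):
    """Nodes that are new or whose checksum differs from the previous state."""
    return {
        node
        for node, checksum in current_state.items()
        if previous_state.get(node) != checksum
    }


def with_descendants(nodes):
    """Expand a selection with all downstream nodes (the trailing '+')."""
    selected = set(nodes)
    stack = list(nodes)
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected


modified = modified_nodes(current, previous)  # state:modified
to_build = with_descendants(modified)         # state:modified+
deferred = set(current) - to_build            # left unbuilt in this run
```

In this example stg_customers is unchanged and is not rebuilt; with --defer, references to it would be resolved against the state passed in --state, as I understand the feature.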
Bruno’s LinkedIn
/in/brunoszdl/
ZACH WILSON
DataExpert.io Founder
Zach’s LinkedIn
in/eczachly/
ALBERT CAMPILLO
Analytics Engineer | Technical infographist
Albert’s LinkedIn
in/albertcampillo/