A comprehensive learning repository for mastering dbt (Data Build Tool) with practical examples, tutorials, and end-to-end data pipeline implementations.
This repository demonstrates professional data transformation workflows using dbt with DuckDB, covering everything from basic concepts to advanced analytics and machine learning feature engineering.
- dbt - Data transformation framework
- DuckDB - Embedded analytical database
- Python 3.12 - Runtime environment
- uv - Fast Python package manager
### dbt_tutorial.md
- Introduction to dbt concepts
- Setting up your first project
- Basic model creation
- Testing and documentation
### ecommerce_pipeline_example.md
- Practical e-commerce use case
- Data modeling patterns
- Dimensional modeling
- Business logic implementation
### ecommerce_pipeline_tdd.md
- Test-Driven Development with dbt
- Data quality frameworks
- Testing strategies
- CI/CD integration
### ecommerce_analytics_end_to_end.md
A comprehensive tutorial demonstrating the complete data analytics lifecycle:
- Data Wrangling
  - Hierarchical indexing patterns
  - Data cleaning and standardization
  - Combining datasets (merge, concat, join)
  - Reshaping and pivoting data
- Data Aggregation
  - Split-Apply-Combine methodology
  - RFM (Recency, Frequency, Monetary) analysis
  - Pivot tables and cross-tabulation
  - Customer behavior metrics
- Exploratory Data Analysis (EDA)
  - Distribution analysis
  - Correlation analysis
  - Customer segmentation visualization
  - Product performance analysis
- Predictive Modeling Preparation (a minimal feature sketch follows this list)
  - Feature engineering for ML
  - Customer churn prediction features
  - Customer Lifetime Value (CLV) modeling
  - Training data preparation
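
To make the feature-engineering step concrete, here is a minimal sketch of the kind of query this stage builds toward. It assumes an upstream `int_customer_rfm` model exposing `customer_id`, `recency_days`, `frequency`, and `monetary_value`; the tutorial's actual columns and thresholds may differ.

```sql
-- Illustrative feature sketch, not the repository's actual ml_customer_features model.
select
    customer_id,
    recency_days,
    frequency,
    monetary_value,
    -- average order value as a simple spend feature
    monetary_value / nullif(frequency, 0)           as avg_order_value,
    -- naive churn label: no purchase in the last 180 days (threshold is arbitrary)
    case when recency_days > 180 then 1 else 0 end  as is_likely_churned
from {{ ref('int_customer_rfm') }}
```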
- Staging Layer: Clean, standardized data from raw sources
- Intermediate Models: RFM analysis, customer orders aggregation (see the RFM sketch after this list)
- Data Marts:
  - Customer metrics with CLV predictions
  - Customer segmentation (Champions, At Risk, etc.)
  - Product performance analytics
- ML Features: Ready-to-use features for:
  - Churn prediction models
  - CLV forecasting
  - Customer behavior analysis
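
As a rough illustration of the intermediate layer, an RFM model can be as small as the sketch below. It assumes `stg_ecommerce__orders` exposes `customer_id`, `order_date`, and `order_total`, and uses DuckDB's `date_diff`; the repository's real `int_customer_rfm` may differ.

```sql
-- int_customer_rfm (sketch): one row per customer with recency, frequency, monetary value.
with orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
)

select
    customer_id,
    date_diff('day', max(order_date), current_date) as recency_days,    -- R
    count(*)                                        as frequency,       -- F
    sum(order_total)                                as monetary_value   -- M
from orders
group by customer_id
```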
- Churn Prevention Campaigns
  - Identify high-value at-risk customers
  - Targeted win-back strategies
- Marketing Budget Allocation
  - CLV-based budget distribution
  - Segment-specific campaign planning
- Product Recommendations (a data-prep sketch follows this list)
  - Customer-product affinity matrices
  - Collaborative filtering data prep
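
For the recommendation use case, the data prep can start as a long-form customer-product purchase matrix. A hedged sketch, assuming order items carry `order_id`, `product_id`, and `quantity` while `customer_id` lives on the orders table:

```sql
-- Sketch: long-form customer x product interaction counts for collaborative filtering.
with order_items as (
    select * from {{ ref('stg_ecommerce__order_items') }}
),

orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
)

select
    o.customer_id,
    oi.product_id,
    sum(oi.quantity)            as units_purchased,
    count(distinct oi.order_id) as orders_containing_product
from order_items oi
join orders o on o.order_id = oi.order_id
group by 1, 2
```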
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python 3.12
uv python install 3.12

# Clone the repository
git clone <repository-url>
cd dbt_get_started

# Install all Python dependencies (dbt + PySpark + GE)
uv sync
```

```bash
# Run dbt commands from the repo root via uv
uv run dbt seed --project-dir my_first_project
uv run dbt run --project-dir my_first_project
uv run dbt test --project-dir my_first_project
uv run dbt docs generate --project-dir my_first_project
uv run dbt docs serve --project-dir my_first_project
```

```bash
# 1. Ensure dependencies are installed once
uv sync

# 2. Run the PySpark / Delta / GE pipeline via uv
uv run pyspark-pipeline

# Optional: call spark-submit explicitly (still via uv env)
uv run spark-submit --packages io.delta:delta-spark_2.12:3.1.0 pyspark_pipeline/pipeline.py
```

Outputs land in `lakehouse/{bronze,silver,intermediate,gold,ml}` as Delta tables, plus `artifacts/ml_customer_features.parquet` and `artifacts/segment_clv.png`. Great Expectations validations are baked into the script, so it fails fast if predicted CLV is negative or emails are malformed.
You can load the provided sample CSVs in two ways:
- Configure dbt to include the root-level `data_seeds/` directory:

  ```yaml
  # my_first_project/dbt_project.yml
  seed-paths: ["seeds", "../data_seeds"]
  ```

- Or copy the CSVs into the project's default seed folder:
  ```bash
  cp -R data_seeds my_first_project/seeds/raw
  ```

Then run:

```bash
cd my_first_project
dbt seed
```

```
dbt_get_started/
├── data_seeds/                        # Sample CSVs for seeding
│   ├── raw_customers.csv
│   ├── raw_orders.csv
│   ├── raw_order_items.csv
│   ├── raw_products.csv
│   └── raw_categories.csv
├── my_first_project/                  # Main dbt project
│   ├── models/
│   │   ├── staging/                   # Raw data cleaning & standardization
│   │   │   ├── stg_ecommerce__users.sql
│   │   │   ├── stg_ecommerce__orders.sql
│   │   │   ├── stg_ecommerce__order_items.sql
│   │   │   └── stg_ecommerce__products.sql
│   │   ├── intermediate/              # Business logic & aggregations
│   │   │   ├── int_customer_orders.sql
│   │   │   ├── int_customer_rfm.sql
│   │   │   └── int_product_metrics.sql
│   │   ├── marts/                     # Analytics-ready tables
│   │   │   ├── fct_customer_metrics.sql
│   │   │   ├── dim_customer_segments.sql
│   │   │   └── fct_product_performance.sql
│   │   └── ml/                        # ML feature engineering
│   │       └── ml_customer_features.sql
│   ├── seeds/                         # Optional local seeds folder
│   ├── tests/                         # Custom data quality tests
│   ├── dbt_project.yml                # Project configuration
│   └── my_dbt.duckdb                  # DuckDB database
├── pyspark_pipeline/                  # Local PySpark/Delta implementation
│   ├── __init__.py
│   └── pipeline.py                    # Entry point for ecommerce_pyspark_end_to_end tutorial
├── dbt_tutorial.md                    # Basic tutorial
├── ecommerce_pipeline_example.md      # E-commerce pipeline guide
├── ecommerce_pipeline_tdd.md          # TDD approach guide
├── ecommerce_analytics_end_to_end.md  # Complete analytics workflow
├── pyproject.toml                     # Python dependencies
└── README.md                          # This file
```
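
The staging models build directly on the seeded CSVs via `{{ ref() }}`. A minimal sketch of what `stg_ecommerce__users.sql` could look like (the column names here are assumptions, not the exact schema shipped in `data_seeds/`):

```sql
-- Sketch of a staging model over the raw_customers seed: rename, trim, and cast.
with source as (
    select * from {{ ref('raw_customers') }}
)

select
    id                        as customer_id,
    lower(trim(email))        as email,
    trim(first_name)          as first_name,
    trim(last_name)           as last_name,
    cast(created_at as date)  as signup_date
from source
```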
- `unique`
- `not_null`
- `accepted_values`
- `relationships`
### Custom Tests
```sql
-- tests/assert_positive_revenue.sql
select *
from {{ ref('fct_customer_metrics') }}
where predicted_clv_12m < 0
```
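
The same pattern covers other data-quality rules. For example, a sketch of a singular test that flags malformed emails, assuming the staging users model exposes an `email` column (the LIKE pattern is a deliberately simple check, not a full email validator):

```sql
-- tests/assert_valid_customer_emails.sql (illustrative)
select *
from {{ ref('stg_ecommerce__users') }}
where email is null
   or email not like '%_@_%.__%'
```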
```bash
# All tests (run from repo root)
uv run dbt test --project-dir my_first_project

# Specific model
uv run dbt test --project-dir my_first_project --select fct_customer_metrics

# By tag
uv run dbt test --project-dir my_first_project --select tag:customer
```

```sql
select
    customer_segment,
    count(*) as customers,
    avg(monetary_value) as avg_value,
    avg(predicted_clv_12m) as avg_clv
from {{ ref('dim_customer_segments') }}
group by 1
order by avg_clv desc;
```

```sql
select
    churn_risk,
    customer_segment,
    count(*) as at_risk_customers,
    sum(monetary_value) as revenue_at_risk
from {{ ref('fct_customer_metrics') }}
where churn_risk in ('High', 'Medium')
group by 1, 2;
```

```sql
select
    performance_tier,
    category,
    count(*) as product_count,
    sum(revenue) as total_revenue,
    sum(total_profit) as total_profit
from {{ ref('fct_product_performance') }}
group by 1, 2;
```

- Modularity: Keep models focused and reusable
- Documentation: Document all models in `schema.yml`
- Testing: Test at every layer (staging → marts)
- Naming Conventions:
  - `stg_` for staging models
  - `int_` for intermediate models
  - `fct_` for fact tables
  - `dim_` for dimension tables
  - `ml_` for ML features
- Materialization (see the incremental sketch below):
  - Views for staging (fast, always fresh)
  - Tables for marts (performance)
  - Incremental for large datasets
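
For the incremental case, the standard dbt pattern looks roughly like the sketch below; the model, columns, and `unique_key` are placeholders:

```sql
-- Sketch of an incremental model: only rows newer than the existing table are processed.
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_date,
    order_total
from {{ ref('stg_ecommerce__orders') }}

{% if is_incremental() %}
  -- on incremental runs, restrict to rows arriving after the current high-water mark
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```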
```bash
# Development workflow
dbt compile                    # Check SQL syntax
dbt run --select model_name    # Run specific model
dbt run --select +model_name   # Run model + upstream
dbt run --select model_name+   # Run model + downstream

# By tags
dbt run --select tag:staging
dbt run --select tag:ml

# Testing
dbt test
dbt test --select model_name

# Documentation
dbt docs generate
dbt docs serve --port 8080

# Full refresh
dbt run --full-refresh

# Production
dbt run --target prod
```

- Orchestration: Schedule with Airflow/Dagster/Prefect
- CI/CD: Automate testing on PR
- Monitoring: Set up data quality alerts
- Incremental Models: Use for large datasets
- Performance: Optimize materializations
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A start_date is required for Airflow to actually schedule the daily run
dag = DAG(
    'dbt_ecommerce_analytics',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='cd /path/to/project && dbt run',
    dag=dag,
)

dbt_test = BashOperator(
    task_id='dbt_test',
    bash_command='cd /path/to/project && dbt test',
    dag=dag,
)

# Run tests only after the models build successfully
dbt_run >> dbt_test
```

- Use `{{ ref() }}` for model dependencies
- Leverage incremental models for large tables
- Optimize with appropriate materializations
- Use CTEs for readability and performance (see the sketch below)
- Index columns used in joins (database-specific)
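
One common way to apply the CTE advice is the "import CTEs first, logic after" layout; a minimal sketch with placeholder columns:

```sql
-- Sketch: upstream models imported as CTEs, transformation logic kept separate.
with customers as (
    select * from {{ ref('stg_ecommerce__users') }}
),

orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
),

customer_orders as (
    select
        customer_id,
        count(*)         as order_count,
        sum(order_total) as lifetime_spend
    from orders
    group by customer_id
)

select
    c.customer_id,
    coalesce(co.order_count, 0)    as order_count,
    coalesce(co.lifetime_spend, 0) as lifetime_spend
from customers c
left join customer_orders co using (customer_id)
```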
Feel free to contribute by:
- Adding new examples
- Improving documentation
- Reporting issues
- Suggesting best practices
This is a learning repository for educational purposes.
- dbt Documentation
- dbt Discourse Community
- DuckDB Documentation
- Analytics Engineering Guide
After completing these tutorials, you'll be ready to:
- Build production data pipelines
- Implement data quality frameworks
- Create analytics-ready data models
- Prepare features for ML models
- Apply analytics engineering best practices
Happy learning!