dbt Learning Repository

A comprehensive learning repository for mastering dbt (Data Build Tool) with practical examples, tutorials, and end-to-end data pipeline implementations.

🎯 Overview

This repository demonstrates professional data transformation workflows using dbt with DuckDB, covering everything from basic concepts to advanced analytics and machine learning feature engineering.
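
dbt connects to DuckDB through the dbt-duckdb adapter. If you need to configure the connection yourself, a minimal profiles.yml could look roughly like the sketch below (the profile name and database path are assumptions; match them to the project's dbt_project.yml):

# profiles.yml (sketch only; adjust names and paths to your setup)
my_first_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: my_first_project/my_dbt.duckdb
      threads: 4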

🛠 Tech Stack

  • ✅ dbt - Data transformation framework
  • ✅ DuckDB - Embedded analytical database
  • ✅ Python 3.12 - Runtime environment
  • ✅ uv - Fast Python package manager

📚 Learning Materials

1. Basic Tutorial

File: dbt_tutorial.md

  • ✅ Introduction to dbt concepts
  • ✅ Setting up your first project
  • ✅ Basic model creation
  • ✅ Testing and documentation

2. E-commerce Pipeline Example

File: ecommerce_pipeline_example.md

  • ✅ Practical e-commerce use case
  • ✅ Data modeling patterns
  • ✅ Dimensional modeling
  • ✅ Business logic implementation

3. E-commerce TDD Approach

File: ecommerce_pipeline_tdd.md

  • ✅ Test-Driven Development with dbt
  • ✅ Data quality frameworks
  • ✅ Testing strategies
  • ✅ CI/CD integration

4. End-to-End Analytics Workflow ⭐

File: ecommerce_analytics_end_to_end.md

Comprehensive tutorial demonstrating the complete data analytics lifecycle:

📊 Modules Covered

  1. Data Wrangling

    • Hierarchical indexing patterns
    • Data cleaning and standardization
    • Combining datasets (merge, concat, join)
    • Reshaping and pivoting data
  2. Data Aggregation

    • Split-Apply-Combine methodology
    • RFM (Recency, Frequency, Monetary) analysis
    • Pivot tables and cross-tabulation
    • Customer behavior metrics
  3. Exploratory Data Analysis (EDA)

    • Distribution analysis
    • Correlation analysis
    • Customer segmentation visualization
    • Product performance analysis
  4. Predictive Modeling Preparation

    • Feature engineering for ML
    • Customer churn prediction features
    • Customer Lifetime Value (CLV) modeling
    • Training data preparation

🎓 What You'll Build

  • ✅ Staging Layer: Clean, standardized data from raw sources
  • ✅ Intermediate Models: RFM analysis, customer orders aggregation (see the sketch after this list)
  • ✅ Data Marts:
    • Customer metrics with CLV predictions
    • Customer segmentation (Champions, At Risk, etc.)
    • Product performance analytics
  • ✅ ML Features: Ready-to-use features for:
    • Churn prediction models
    • CLV forecasting
    • Customer behavior analysis
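
The intermediate RFM model referenced above (int_customer_rfm.sql) could look roughly like the sketch below. The staging model and column names (order_id, customer_id, order_date, order_total) are assumptions; check the repository's actual models.

-- int_customer_rfm.sql (illustrative sketch, DuckDB SQL)
with orders as (
    select * from {{ ref('stg_ecommerce__orders') }}
)

select
    customer_id,
    date_diff('day', max(order_date), current_date) as recency_days,   -- Recency
    count(distinct order_id)                        as frequency,      -- Frequency
    sum(order_total)                                as monetary_value  -- Monetary
from orders
group by customer_id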

💼 Business Use Cases

  1. Churn Prevention Campaigns

    • Identify high-value at-risk customers
    • Targeted win-back strategies
  2. Marketing Budget Allocation

    • CLV-based budget distribution
    • Segment-specific campaign planning
  3. Product Recommendations

    • Customer-product affinity matrices
    • Collaborative filtering data prep

🚀 Quick Start

Prerequisites

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python 3.12
uv python install 3.12

Setup

# Clone the repository
git clone <repository-url>
cd dbt_get_started

# Install all Python dependencies (dbt + PySpark + GE)
uv sync

# Run dbt commands from the repo root via uv
uv run dbt seed --project-dir my_first_project
uv run dbt run --project-dir my_first_project
uv run dbt test --project-dir my_first_project
uv run dbt docs generate --project-dir my_first_project
uv run dbt docs serve --project-dir my_first_project

PySpark + Delta Pipeline (matches ecommerce_pyspark_end_to_end.md)

# 1. Ensure dependencies are installed once
uv sync

# 2. Run the PySpark / Delta / GE pipeline via uv
uv run pyspark-pipeline

# Optional: call spark-submit explicitly (still via uv env)
uv run spark-submit --packages io.delta:delta-spark_2.12:3.1.0 pyspark_pipeline/pipeline.py

Outputs land in lakehouse/{bronze,silver,intermediate,gold,ml} as Delta tables plus artifacts/ml_customer_features.parquet and artifacts/segment_clv.png. Great Expectations validations are baked into the script so it fails fast if predicted CLV is negative or emails are malformed.
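
To poke at the exported features without Spark, DuckDB can read the parquet artifact directly. A quick sketch using the artifact path above (run it in a duckdb shell):

-- inspect the exported ML feature table
select *
from read_parquet('artifacts/ml_customer_features.parquet')
limit 10;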

🧪 Seed Data Options

You can load the provided sample CSVs in two ways:

  1. Configure dbt to include the root-level data_seeds/ directory:
# my_first_project/dbt_project.yml
seed-paths: ["seeds", "../data_seeds"]
  2. Or copy the CSVs into the project's default seed folder:
cp -R data_seeds my_first_project/seeds/raw

Then run:

cd my_first_project
dbt seed

📁 Project Structure

dbt_get_started/
├── data_seeds/                        # Sample CSVs for seeding
│   ├── raw_customers.csv
│   ├── raw_orders.csv
│   ├── raw_order_items.csv
│   ├── raw_products.csv
│   └── raw_categories.csv
├── my_first_project/                  # Main dbt project
│   ├── models/
│   │   ├── staging/                   # Raw data cleaning & standardization
│   │   │   ├── stg_ecommerce__users.sql
│   │   │   ├── stg_ecommerce__orders.sql
│   │   │   ├── stg_ecommerce__order_items.sql
│   │   │   └── stg_ecommerce__products.sql
│   │   ├── intermediate/              # Business logic & aggregations
│   │   │   ├── int_customer_orders.sql
│   │   │   ├── int_customer_rfm.sql
│   │   │   └── int_product_metrics.sql
│   │   ├── marts/                     # Analytics-ready tables
│   │   │   ├── fct_customer_metrics.sql
│   │   │   ├── dim_customer_segments.sql
│   │   │   └── fct_product_performance.sql
│   │   └── ml/                        # ML feature engineering
│   │       └── ml_customer_features.sql
│   ├── seeds/                         # Optional local seeds folder
│   ├── tests/                         # Custom data quality tests
│   ├── dbt_project.yml                # Project configuration
│   └── my_dbt.duckdb                  # DuckDB database
├── pyspark_pipeline/                  # Local PySpark/Delta implementation
│   ├── __init__.py
│   └── pipeline.py                    # Entry point for ecommerce_pyspark_end_to_end tutorial
├── dbt_tutorial.md                    # Basic tutorial
├── ecommerce_pipeline_example.md      # E-commerce pipeline guide
├── ecommerce_pipeline_tdd.md          # TDD approach guide
├── ecommerce_analytics_end_to_end.md  # ⭐ Complete analytics workflow
├── pyproject.toml                     # Python dependencies
└── README.md                          # This file

🔍 Key Concepts Demonstrated

Built-in Tests

dbt's built-in generic tests, declared in schema.yml (see the example after this list):

  • unique
  • not_null
  • accepted_values
  • relationships
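
A hedged sketch of how these tests are attached to columns (the model and column names here are assumptions based on the project layout):

# models/schema.yml (illustrative only)
version: 2

models:
  - name: fct_customer_metrics
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('stg_ecommerce__users')
              field: customer_id
      - name: churn_risk
        tests:
          - accepted_values:
              values: ['High', 'Medium', 'Low']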

Custom Tests

-- tests/assert_positive_revenue.sql
select *
from {{ ref('fct_customer_metrics') }}
where predicted_clv_12m < 0

A singular test like this fails when the query returns any rows.

Running Tests

# All tests (run from repo root)
uv run dbt test --project-dir my_first_project

# Specific model
uv run dbt test --project-dir my_first_project --select fct_customer_metrics

# By tag
uv run dbt test --project-dir my_first_project --select tag:customer

📊 Example Queries

Customer Segmentation

select
    customer_segment,
    count(*) as customers,
    avg(monetary_value) as avg_value,
    avg(predicted_clv_12m) as avg_clv
from {{ ref('dim_customer_segments') }}
group by 1
order by avg_clv desc;

Churn Risk Analysis

select
    churn_risk,
    customer_segment,
    count(*) as at_risk_customers,
    sum(monetary_value) as revenue_at_risk
from {{ ref('fct_customer_metrics') }}
where churn_risk in ('High', 'Medium')
group by 1, 2;

Product Performance

select
    performance_tier,
    category,
    count(*) as product_count,
    sum(revenue) as total_revenue,
    sum(total_profit) as total_profit
from {{ ref('fct_product_performance') }}
group by 1, 2;

🎯 Best Practices

  1. Modularity: Keep models focused and reusable
  2. Documentation: Document all models in schema.yml
  3. Testing: Test at every layer (staging → marts)
  4. Naming Conventions:
    • stg_ for staging models
    • int_ for intermediate models
    • fct_ for fact tables
    • dim_ for dimension tables
    • ml_ for ML features
  5. Materialization:
    • Views for staging (fast, always fresh)
    • Tables for marts (performance)
    • Incremental for large datasets (see the sketch below)
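
A hedged sketch of the incremental pattern (the model name and timestamp column are assumptions):

-- fct_orders_incremental.sql (illustrative only)
{{ config(materialized='incremental', unique_key='order_id') }}

select *
from {{ ref('stg_ecommerce__orders') }}
{% if is_incremental() %}
  -- on incremental runs, only process orders newer than what is already loaded
  where order_date > (select max(order_date) from {{ this }})
{% endif %}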

🔧 Common Commands

# Development workflow
dbt compile                    # Check SQL syntax
dbt run --select model_name    # Run specific model
dbt run --select +model_name   # Run model + upstream
dbt run --select model_name+   # Run model + downstream

# By tags
dbt run --select tag:staging
dbt run --select tag:ml

# Testing
dbt test
dbt test --select model_name

# Documentation
dbt docs generate
dbt docs serve --port 8080

# Full refresh
dbt run --full-refresh

# Production
dbt run --target prod

🚒 Deployment Considerations

For Production

  1. Orchestration: Schedule with Airflow/Dagster/Prefect
  2. CI/CD: Automate testing on PR (see the workflow sketch below)
  3. Monitoring: Set up data quality alerts
  4. Incremental Models: Use for large datasets
  5. Performance: Optimize materializations
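
For the PR testing mentioned above, a minimal GitHub Actions workflow might look like this sketch (the file path, action versions, and commands are assumptions; adapt them to your CI system):

# .github/workflows/dbt_ci.yml (illustrative only)
name: dbt CI
on: pull_request

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        run: pip install uv
      - name: Install dependencies
        run: uv sync
      - name: Seed, build, and test
        run: |
          uv run dbt seed --project-dir my_first_project
          uv run dbt build --project-dir my_first_project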

Example Airflow DAG

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'dbt_ecommerce_analytics',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),  # a start_date is required for scheduled DAGs
    catchup=False,                    # do not backfill past runs
)

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='cd /path/to/project && dbt run',
    dag=dag
)

dbt_test = BashOperator(
    task_id='dbt_test',
    bash_command='cd /path/to/project && dbt test',
    dag=dag
)

dbt_run >> dbt_test

📈 Performance Tuning

  • ✅ Use {{ ref() }} for model dependencies
  • ✅ Leverage incremental models for large tables
  • ✅ Optimize with appropriate materializations
  • ✅ Use CTEs for readability and performance
  • ✅ Index columns used in joins (database-specific)

🀝 Contributing

Feel free to contribute by:

  • ✅ Adding new examples
  • ✅ Improving documentation
  • ✅ Reporting issues
  • ✅ Suggesting best practices

📝 License

This is a learning repository for educational purposes.

🔗 Resources

💡 What's Next?

After completing these tutorials, you'll be ready to:

  • ✅ Build production data pipelines
  • ✅ Implement data quality frameworks
  • ✅ Create analytics-ready data models
  • ✅ Prepare features for ML models
  • ✅ Apply analytics engineering best practices

Happy learning! 🎉
