ELT Data Pipeline with dbt, Snowflake, and Airflow

This project implements a modern ELT data pipeline using dbt, Snowflake, and Apache Airflow. The pipeline extracts data from Snowflake's TPCH sample dataset, transforms it using dbt models, and orchestrates the workflow with Airflow.

Project Overview

This pipeline demonstrates a typical ELT architecture used in modern analytics engineering:

  1. Raw data source: data comes from Snowflake's snowflake_sample_data.tpch_sf1 dataset.

  2. Staging layer: dbt models clean and standardize raw source tables.

  3. Intermediate transformations: business logic and joins are applied to prepare analytics-ready tables.

  4. Data marts / fact tables: aggregated datasets optimized for analytics and reporting.

  5. Orchestration: Airflow schedules and executes dbt runs.

The result is a clean analytics-ready fact table (fact_orders) containing order-level metrics.


Architecture

Data Lineage (dbt)

Below is the transformation lineage generated by dbt.

[Image: dbt lineage graph (media/dbt_lineage.png)]

The pipeline follows a layered transformation approach:

Source Tables (Snowflake TPCH)
        │
        ▼
Staging Models
(stg_tpch_orders, stg_tpch_line_items)
        │
        ▼
Intermediate Models
(int_order_items, int_order_items_summary)
        │
        ▼
Fact Table
(fact_orders)
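
The layering above implies a build order: a model can only run after the models it reads from. The sketch below illustrates that idea with Python's standard-library topological sort. The model names come from the diagram, but the exact edges between them are an assumption made for illustration; dbt derives the real graph from the ref() calls in each model.

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it reads from.
# Model names are from the diagram; the edges themselves are assumed.
LINEAGE = {
    "stg_tpch_orders": set(),
    "stg_tpch_line_items": set(),
    "int_order_items": {"stg_tpch_orders", "stg_tpch_line_items"},
    "int_order_items_summary": {"int_order_items"},
    "fact_orders": {"stg_tpch_orders", "int_order_items_summary"},
}


def build_order(lineage):
    """Return model names so that every dependency precedes its dependents."""
    return list(TopologicalSorter(lineage).static_order())


order = build_order(LINEAGE)
print(order)
```

Under these assumed edges, both staging models always come before int_order_items, and fact_orders is always built last.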

Airflow DAG

The pipeline is orchestrated using Apache Airflow, which schedules and runs the dbt workflow.

[Image: Airflow DAG graph view (media/airflow_dag.png)]

The DAG performs:

  1. dbt dependency installation
  2. dbt model execution
  3. dbt tests

This ensures transformations and validations are executed automatically.
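
The three steps can be pictured as a fail-fast sequence of dbt CLI calls. The sketch below is not the DAG in dags/dbt_dag.py, which uses Astronomer Cosmos according to the Tech Stack table. It is a plain-subprocess stand-in that only shows the ordering and the stop-on-failure behaviour. The run_steps helper and its runner parameter are hypothetical names introduced here.

```python
import subprocess

# The three steps the DAG performs, written as dbt CLI invocations.
DBT_STEPS = [
    ["dbt", "deps"],  # install dbt package dependencies
    ["dbt", "run"],   # build the models
    ["dbt", "test"],  # run the data tests
]


def run_steps(steps, runner=subprocess.run):
    """Run each step in order and stop at the first non-zero exit code.

    `runner` is injectable so the sequencing can be exercised without
    dbt installed. Returns the steps that completed successfully.
    """
    completed = []
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            break  # later steps are skipped, as a failed upstream task would do
        completed.append(cmd)
    return completed
```

Calling run_steps(DBT_STEPS) from the dbt project directory would execute the real commands. Passing a fake runner makes it possible to check the ordering in isolation.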


Tech Stack

| Tool | Purpose |
| --- | --- |
| Snowflake | Cloud data warehouse |
| dbt | Data transformation and modeling |
| Apache Airflow 3 | Workflow orchestration |
| Astronomer Cosmos | Integrates dbt with Airflow |
| Python | Pipeline orchestration |
| uv | Python dependency and virtual environment management |
| Docker | Containerized environment |

Project Structure

project-root
│
├── dags/
│   └── dbt_dag.py
│
├── dbt_project/
│   ├── models/
│   │   ├── staging/
│   │   └── marts/
│   │
│   ├── macros/
│   └── tests/
│
├── media/
│   ├── airflow_dag.png
│   └── dbt_lineage.png
│
├── Dockerfile
├── requirements.txt
└── README.md

Data Pipeline Logic

Source Layer

The pipeline uses Snowflake's TPCH sample dataset:

  • orders
  • lineitem

These tables act as raw source data.


Staging Layer

The staging models standardize column naming and structure.

Example:

stg_tpch_orders
stg_tpch_line_items

Key operations:

  • column renaming
  • surrogate key generation
  • source tests (not null, uniqueness)
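
To illustrate surrogate key generation: the usual approach is to hash the natural-key columns into a single fixed-width value, which is what macros such as dbt_utils.generate_surrogate_key do in SQL. The Python sketch below shows the idea only. The function name, separator, and null placeholder are assumptions, not the project's actual macro.

```python
import hashlib

NULL_PLACEHOLDER = "_null_"  # hypothetical stand-in for missing key parts


def surrogate_key(*parts):
    """Hash the natural-key columns into one 32-character hex string."""
    text = "-".join(NULL_PLACEHOLDER if p is None else str(p) for p in parts)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# In TPCH, a line item is identified by its order key plus its line number.
key = surrogate_key(1, 3)
print(key)
```

The same inputs always produce the same key, and swapping the column order produces a different one, so the column order has to stay fixed once the key is in use.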

Intermediate Layer

Intermediate models combine and enrich staging tables.

Example:

int_order_items
int_order_items_summary

Key operations:

  • joins between orders and line items
  • calculation of discount metrics
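
A minimal sketch of what the join plus discount metric amounts to, written in Python rather than SQL. The column names and the sign convention (discounts stored as values less than or equal to zero) are assumptions for illustration; the actual logic lives in the dbt models.

```python
from decimal import Decimal

# Hypothetical sample rows standing in for the staging models.
orders = [{"order_key": 1, "order_date": "1996-01-02"}]
line_items = [
    {"order_key": 1, "line_number": 1,
     "extended_price": Decimal("100.00"), "discount_percentage": Decimal("0.10")},
    {"order_key": 1, "line_number": 2,
     "extended_price": Decimal("50.00"), "discount_percentage": Decimal("0.00")},
]


def discount_amount(extended_price, discount_percentage):
    """Assumed convention: a discount reduces revenue, so it is stored as <= 0."""
    return (-extended_price * discount_percentage).quantize(Decimal("0.01"))


def join_order_items(orders, line_items):
    """Inner-join line items to their order and attach the discount metric."""
    by_key = {o["order_key"]: o for o in orders}
    joined = []
    for item in line_items:
        order = by_key.get(item["order_key"])
        if order is None:
            continue  # inner join: drop items without a matching order
        amount = discount_amount(item["extended_price"], item["discount_percentage"])
        joined.append({**order, **item, "item_discount_amount": amount})
    return joined


order_items = join_order_items(orders, line_items)
```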

Fact Layer

The final analytics model:

fact_orders

This table includes:

  • order information
  • aggregated item sales
  • discount calculations
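
The aggregation step can be sketched as a group-by over line-item rows that yields one row per order. The metric names below are hypothetical, and the input rows are made-up sample data.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical joined rows: one per line item, already carrying a discount amount.
order_items = [
    {"order_key": 1, "extended_price": Decimal("100.00"),
     "item_discount_amount": Decimal("-10.00")},
    {"order_key": 1, "extended_price": Decimal("50.00"),
     "item_discount_amount": Decimal("0.00")},
    {"order_key": 2, "extended_price": Decimal("20.00"),
     "item_discount_amount": Decimal("-1.00")},
]


def summarize_orders(rows):
    """Collapse line-item rows to one entry per order with summed metrics."""
    totals = defaultdict(lambda: {
        "gross_item_sales_amount": Decimal("0"),
        "item_discount_amount": Decimal("0"),
    })
    for row in rows:
        order_total = totals[row["order_key"]]
        order_total["gross_item_sales_amount"] += row["extended_price"]
        order_total["item_discount_amount"] += row["item_discount_amount"]
    return {key: dict(value) for key, value in totals.items()}


fact_orders = summarize_orders(order_items)
```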

Data Quality Tests

The pipeline includes both generic and singular dbt tests.

Generic Tests

  • unique
  • not_null
  • relationships
  • accepted_values
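
In dbt these four tests are declared in YAML and compiled to SQL that selects the offending rows. As a rough illustration of what each one checks, here are Python equivalents that return the failures; an empty result means the test passes. The function signatures are made up for this sketch.

```python
def not_null(rows, column):
    """Rows fail when the column has no value."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Return values that occur more than once (nulls are ignored)."""
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value is None:
            continue
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


def relationships(child_rows, column, parent_rows, parent_column):
    """Child rows whose non-null value has no match in the parent column."""
    parents = {p[parent_column] for p in parent_rows}
    return [r for r in child_rows
            if r.get(column) is not None and r[column] not in parents]


def accepted_values(rows, column, values):
    """Rows whose non-null value is outside the allowed set."""
    return [r for r in rows
            if r.get(column) is not None and r[column] not in values]
```

For example, accepted_values(rows, "status_code", {"O", "F", "P"}) would flag any order whose status is not one of the three TPCH order statuses, assuming the staged column is named status_code.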

Singular Tests

Custom SQL tests validate business logic:

  • discount values cannot be positive
  • order dates must be within valid ranges
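
A singular dbt test is a SQL query that selects the rows violating a rule, and the test passes when the query returns nothing. The Python sketch below mirrors the two rules listed above. The column names and the 1990-01-01 lower bound are assumptions, since the project's actual test SQL defines what counts as a valid range.

```python
from datetime import date

MIN_ORDER_DATE = date(1990, 1, 1)  # assumed lower bound; adjust to your data


def positive_discounts(rows):
    """Failing rows: discount amounts that are greater than zero."""
    return [r for r in rows if r["item_discount_amount"] > 0]


def out_of_range_dates(rows, today):
    """Failing rows: order dates before the lower bound or in the future."""
    return [r for r in rows
            if r["order_date"] < MIN_ORDER_DATE or r["order_date"] > today]


# Hypothetical sample rows; the second one violates both rules.
sample = [
    {"order_key": 1, "order_date": date(1996, 1, 2), "item_discount_amount": -10},
    {"order_key": 2, "order_date": date(1985, 5, 1), "item_discount_amount": 5},
]
bad_discounts = positive_discounts(sample)
bad_dates = out_of_range_dates(sample, today=date(2024, 1, 1))
```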

Setup Guide

1. Clone the repository

git clone https://github.com/maithtruong/sales_pipeline.git
cd sales_pipeline

2. Create Python Environment (uv)

This project uses uv for environment management.

uv venv
source .venv/bin/activate

Install dependencies:

uv pip install -r requirements.txt

3. Configure Snowflake

Create the following Snowflake resources (see the SQL script in the project documentation):

  • warehouse
  • database
  • role
  • schema


4. Configure dbt Profile

Update profiles.yml with your Snowflake credentials.

Example configuration:

warehouse: dbt_wh
database: dbt_db
schema: dbt_schema
role: dbt_role

5. Run dbt Locally

Install dependencies:

dbt deps

Run models:

dbt run

Execute tests:

dbt test

6. Start Airflow

Build containers and start Airflow:

docker compose up --build

Open the Airflow UI:

http://localhost:8080

Trigger the dbt_dag.


Differences from the Original Tutorial

This implementation includes several modern improvements:

| Change | Description |
| --- | --- |
| uv instead of pip/venv | Faster dependency management |
| Latest dbt version | Updated syntax and compatibility |
| Airflow 3 | New scheduler and runtime improvements |

Learning Goals

This project demonstrates:

  • building a modern ELT pipeline
  • dbt modeling best practices
  • data testing and validation
  • workflow orchestration with Airflow
  • integration between Airflow and dbt

References

This project is based on the following tutorial:

https://www.youtube.com/watch?v=OLXkGB7krGo

Many thanks to the author for the excellent guide on building an ELT pipeline with dbt, Snowflake, and Airflow.
This implementation follows the tutorial while introducing some updates, including:

  • uv-based Python environment management
  • latest dbt version
  • Apache Airflow 3
