
Data Pipeline using Postgres + Python + Metabase

Architectural Diagram


Schema Diagram


Prerequisites

  • Docker
  • Docker Compose
  • Git

Services

The project consists of the following services:

Required:

  • Postgres
  • Adminer
  • Metabase (Analytics)
  • Prometheus (Monitoring)
  • Grafana (Visualization)

Run Locally

1. Clone the repository

git clone https://github.com/CliffLolo/etl-pg-python-metabase.git

2. Go to the project directory

cd etl-pg-python-metabase/

3. Create a .env File

In the root directory of the project, create a .env file. This file will securely store sensitive credentials and configuration settings required for the project.

4. Populate the .env File

Use the structure below to configure your environment variables:

# ==========================
# API Configuration
# ==========================
API_KEY="your_api_key_here"           # New York Times API Key
API_SECRET="your_api_secret_here"     # New York Times API Secret (if required)
BASE_URL="https://api.nytimes.com/svc/books/v3/lists/overview.json"  # NYT API Base URL

# ==========================
# Database Configuration
# ==========================
DB_HOST="localhost"                       # Database Host (e.g., localhost or IP address)
DB_PORT="5432"                            # Database Port (default for PostgreSQL is 5432)
DB_USER="your_db_username"                # Database Username
DB_PASSWORD="your_db_password"            # Database Password
DB_NAME="your_database_name"              # Database Name

# ==========================
# Grafana Configuration
# ==========================
GF_USER="your_grafana_username"           # Grafana Username
GF_PASSWORD="your_grafana_password"       # Grafana Password

# ==========================
# Data Extraction Settings
# ==========================
start_date="YYYY-MM-DD"              # Data extraction start date (e.g., 2021-01-01)
end_date="YYYY-MM-DD"                # Data extraction end date (e.g., 2023-12-31)


# ==========================
# Marquez Settings
# ==========================
MARQUEZ_NAMESPACE="your_marquez_namespace"          # Marquez namespace

Use .env-example for Reference

A .env-example file is provided in the project for guidance. Copy it and rename it to .env:

cp .env-example .env

Then, update it with your actual credentials.
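For reference, this is roughly how a Python script in the pipeline can pick up these settings. This is a minimal sketch assuming the python-dotenv package; the project's actual loader may read its configuration differently.

# Sketch: load settings from .env (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

API_KEY = os.getenv("API_KEY")
BASE_URL = os.getenv("BASE_URL")

DB_CONFIG = {
    "host": os.getenv("DB_HOST", "localhost"),
    "port": int(os.getenv("DB_PORT", "5432")),
    "user": os.getenv("DB_USER"),
    "password": os.getenv("DB_PASSWORD"),
    "dbname": os.getenv("DB_NAME"),
}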


Security Notice

⚠️ Important:
Ensure the .env file is NOT committed to version control. The .gitignore file should already exclude .env, but double-check to prevent accidental exposure of sensitive information.

# .gitignore
.env

5. Run the Docker Compose file

docker-compose up --build

6. Run the following command to check that all components are up and running:

docker compose ps

If all is well, each service will be running in its own container, with a load generator script configured to load data from 2021-2023 into the Postgres database.
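The sketch below only illustrates the general shape of such a load step (fetch one week's overview from the NYT Books API, then insert rows into Postgres). It is not the project's actual loader, and the staging_rankings table it writes to is hypothetical.

# Illustrative sketch of one extract-and-load step (not the project's actual code).
import os
import requests
import psycopg2

def fetch_overview(published_date):
    """Fetch the NYT Books API 'overview' payload for a single published date."""
    resp = requests.get(
        os.getenv("BASE_URL"),
        params={"published_date": published_date, "api-key": os.getenv("API_KEY")},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def load_rankings(payload):
    """Write list/book rows into a hypothetical staging_rankings table."""
    conn = psycopg2.connect(
        host=os.getenv("DB_HOST"), port=os.getenv("DB_PORT"),
        user=os.getenv("DB_USER"), password=os.getenv("DB_PASSWORD"),
        dbname=os.getenv("DB_NAME"),
    )
    with conn, conn.cursor() as cur:
        for book_list in payload["results"]["lists"]:
            for book in book_list["books"]:
                cur.execute(
                    "INSERT INTO staging_rankings (list_name, title, author, rank) "
                    "VALUES (%s, %s, %s, %s)",
                    (book_list["list_name"], book["title"], book["author"], book["rank"]),
                )
    conn.close()

if __name__ == "__main__":
    load_rankings(fetch_overview("2021-01-01"))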

7. Verify PostgreSQL data

Access the Postgres UI (Adminer) and confirm the presence of 7 tables: dim_book, dim_date, dim_list, dim_publisher, fact_book_rankings, fact_publisher_performance, and load_status, each populated with data. Use "postgres" as the username and password when logging in.
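If you prefer a scripted check over clicking through Adminer, a quick sanity-check sketch (using the "postgres" credentials above and the database name from your .env) might look like this:

# Sanity check: confirm each expected table exists and contains rows.
import os
import psycopg2

EXPECTED_TABLES = [
    "dim_book", "dim_date", "dim_list", "dim_publisher",
    "fact_book_rankings", "fact_publisher_performance", "load_status",
]

conn = psycopg2.connect(
    host="localhost", port=5432,
    user="postgres", password="postgres",
    dbname=os.getenv("DB_NAME", "postgres"),  # database name comes from your .env
)
with conn, conn.cursor() as cur:
    for table in EXPECTED_TABLES:
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        print(f"{table}: {cur.fetchone()[0]} rows")
conn.close()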

Business Intelligence: Metabase

  1. In a browser, go to the Metabase UI.

  2. Click Let's get started.


  3. Complete the first set of fields asking for your email address. This information isn't crucial for anything, but it does have to be filled in.


  4. On the Add your data page, select Postgres and fill in the following information:


  5. Proceed past the screens until you reach your primary dashboard.


  6. Click New.


  7. Click SQL query.


  8. From Select a database, select DeelCompanyDB.

  9. In the query editor, enter:

    SELECT *
    FROM load_status; 
  10. You can save the output and add it to a dashboard.

Insights

  1. Which book remained in the top 3 ranks for the longest time in 2022?
WITH consecutive_rankings AS (
    SELECT 
        b.title,
        b.author,
        fr.rank,
        d.full_date,
        COUNT(*) OVER (PARTITION BY b.book_key) as weeks_in_top_3
    FROM fact_book_rankings fr
    JOIN dim_book b ON fr.book_key = b.book_key
    JOIN dim_date d ON fr.date_key = d.date_key
    WHERE d.year = 2022 
    AND fr.rank <= 3
)
SELECT 
    title,
    author,
    weeks_in_top_3,
    MIN(full_date) as first_appearance,
    MAX(full_date) as last_appearance
FROM consecutive_rankings
GROUP BY title, author, weeks_in_top_3
ORDER BY weeks_in_top_3 DESC
LIMIT 1;
  2. Which are the top 3 lists to have the least number of unique books in their rankings for the entirety of the data?
SELECT 
    l.list_name,
    COUNT(DISTINCT b.book_key) as unique_books_count
FROM fact_book_rankings fr
JOIN dim_list l ON fr.list_key = l.list_key
JOIN dim_book b ON fr.book_key = b.book_key
GROUP BY l.list_key, l.list_name
ORDER BY unique_books_count ASC
LIMIT 3;
  3. Publishers are ranked based on how their respective books performed on this list. For each book, a publisher gets points based on the best rank a book got in a given period of time. The publisher gets 5 points if the book is ranked 1st, 4 for 2nd rank, 3 for 3rd rank, 2 for 4th and 1 point for 5th. Create a quarterly rank for publishers from 2021 to 2023, getting only the top 5 for each quarter.
WITH publisher_points AS (
    SELECT 
        p.publisher_key,
        p.publisher_name,
        d.year,
        d.quarter,
        d.quarter_name,
        SUM(CASE 
            WHEN fr.rank = 1 THEN 5
            WHEN fr.rank = 2 THEN 4
            WHEN fr.rank = 3 THEN 3
            WHEN fr.rank = 4 THEN 2
            WHEN fr.rank = 5 THEN 1
            ELSE 0
        END) as total_points,
        ROW_NUMBER() OVER (
            PARTITION BY d.year, d.quarter 
            ORDER BY SUM(CASE 
                WHEN fr.rank = 1 THEN 5
                WHEN fr.rank = 2 THEN 4
                WHEN fr.rank = 3 THEN 3
                WHEN fr.rank = 4 THEN 2
                WHEN fr.rank = 5 THEN 1
                ELSE 0 
            END) DESC
        ) as quarterly_rank
    FROM fact_book_rankings fr
    JOIN dim_book b ON fr.book_key = b.book_key
    JOIN dim_publisher p ON b.publisher_key = p.publisher_key
    JOIN dim_date d ON fr.date_key = d.date_key
    WHERE d.year BETWEEN 2021 AND 2023
    AND fr.rank <= 5
    GROUP BY p.publisher_key, p.publisher_name, d.year, d.quarter, d.quarter_name
)
SELECT 
    publisher_name,
    year,
    quarter_name,
    total_points
FROM publisher_points
WHERE quarterly_rank <= 5
ORDER BY year, quarter, quarterly_rank;
  4. Two friends Jake and Pete have podcasts where they review books. Jake's team reviews the book ranked first on every list, while Pete's team reviews the book ranked third. Both of them share books: if Jake's team wants to review a book, they first check with Pete's before buying, and vice versa. Which team bought what book in 2023?
WITH team_books AS (
    SELECT 
        b.title,
        b.author,
        l.list_name,
        d.full_date,
        fr.rank,
        CASE 
            WHEN fr.rank = 1 THEN 'Jake'
            WHEN fr.rank = 3 THEN 'Pete'
        END as reviewer,
        ROW_NUMBER() OVER (
            PARTITION BY b.book_key 
            ORDER BY d.full_date
        ) as first_appearance
    FROM fact_book_rankings fr
    JOIN dim_book b ON fr.book_key = b.book_key
    JOIN dim_list l ON fr.list_key = l.list_key
    JOIN dim_date d ON fr.date_key = d.date_key
    WHERE d.year = 2023
    AND fr.rank IN (1, 3)
)
SELECT 
    reviewer as purchased_by,
    title,
    author,
    list_name,
    full_date as purchase_date
FROM team_books
WHERE first_appearance = 1
ORDER BY full_date, list_name;

Monitoring: Prometheus and Grafana

1. Prometheus (Metrics Collection)

Prometheus collects metrics from PostgreSQL via the Postgres Exporter. Access the Prometheus UI in a browser (with the default port mapping, typically http://localhost:9090).

Check under Status → Target Health to ensure PostgreSQL metrics are being scraped.
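Target health can also be read from Prometheus's HTTP API. A small sketch, assuming the Prometheus port (9090) is published to the host:

# List scrape targets and their health via the Prometheus HTTP API.
import requests

resp = requests.get("http://localhost:9090/api/v1/targets", timeout=10)
resp.raise_for_status()

for target in resp.json()["data"]["activeTargets"]:
    print(target["scrapeUrl"], "->", target["health"])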

2. Grafana (Metrics Visualization)

Grafana visualizes metrics collected by Prometheus. Access the Grafana UI in a browser (with the default port mapping, typically http://localhost:3000).

Login Credentials:

  • Username: admin
  • Password: admin


3. Configure Grafana Data Source

  1. Log in to Grafana
  2. Navigate to Connections → Data Sources → Add Data Source


  3. Select Prometheus


  4. Set the URL to:

    http://prometheus:9090
    


  5. Click Save & Test

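The same data source can also be registered programmatically through Grafana's HTTP API. This is a hedged sketch, assuming Grafana is published on port 3000 and still uses the default admin/admin credentials:

# Create the Prometheus data source via Grafana's HTTP API.
import requests

payload = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",  # same in-network URL used in the manual setup
    "access": "proxy",
}
resp = requests.post(
    "http://localhost:3000/api/datasources",
    json=payload,
    auth=("admin", "admin"),
    timeout=10,
)
print(resp.status_code, resp.json())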

4. Import PostgreSQL Monitoring Dashboard

  1. Go to Dashboards → Import


  2. Enter Dashboard ID 9628
  3. Set Prometheus as the data source


  4. Visualize PostgreSQL performance metrics!

Pipeline Resilience Strategies

Fault Isolation:

Each service operates within its own container, enabling isolation of any issues that may arise. This ensures that if one component encounters a problem, it won't affect the functioning of the others.
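Because services start independently, a dependent service can briefly be unavailable (for example, Postgres may still be initialising when the loader starts). A complementary application-level pattern, sketched below with hypothetical parameters, is to retry the connection with a delay instead of failing immediately:

# Illustrative retry helper: tolerate a database container that is still starting.
import time
import psycopg2

def connect_with_retry(dsn, attempts=10, delay_seconds=3.0):
    """Try to connect to Postgres, retrying on transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay_seconds)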

Side note :)

  • Data Lineage Tracking with OpenLineage
  • Metadata Management & Visualization with Marquez
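For illustration only, an OpenLineage run event can be posted straight to Marquez's lineage endpoint. This sketch assumes Marquez is reachable on port 5000 and uses a hypothetical job name; see the OpenLineage spec and Marquez docs for the full event schema.

# Emit a minimal OpenLineage "COMPLETE" run event to Marquez (illustrative only).
import os
import uuid
from datetime import datetime, timezone
import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {
        "namespace": os.getenv("MARQUEZ_NAMESPACE", "default"),
        "name": "nyt_books_load",  # hypothetical job name
    },
    "producer": "https://github.com/CliffLolo/etl-pg-python-metabase",
}

resp = requests.post("http://localhost:5000/api/v1/lineage", json=event, timeout=10)
resp.raise_for_status()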


You have some infrastructure running in Docker containers, so don't forget to run docker-compose down to shut everything down!


Contributed by Clifford Frempong
