This project builds a data pipeline for loading the Chicago taxi trips dataset into BigQuery for subsequent analysis.
Project Steps:
- Terraform: Create a bucket in GCP and a dataset in BigQuery.
- Airflow: Pipeline that loads the data into the bucket and then creates an external table in BigQuery.
- dbt: Create models for use in subsequent analysis.
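To make the external table step more concrete, here is a minimal sketch of the table definition that BigQuery expects for a table backed by files in a bucket. It is not taken from the project's DAG: the function name `external_table_resource`, the Parquet source format, and the object pattern are assumptions for illustration.

```python
def external_table_resource(project_id, dataset_id, table_id, bucket,
                            object_pattern, source_format="PARQUET"):
    """Build a BigQuery table resource for an external table over GCS files.

    The file format and the object layout inside the bucket are assumptions;
    adjust them to match what the pipeline actually uploads.
    """
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "externalDataConfiguration": {
            "sourceFormat": source_format,
            # A wildcard URI lets one table cover every uploaded file.
            "sourceUris": [f"gs://{bucket}/{object_pattern}"],
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(external_table_resource(
        "my-project", "my_dataset", "my_table", "my-bucket", "raw/*.parquet"
    ))
```

A dictionary of this shape can be passed to the BigQuery API when creating the table, so the data stays in the bucket and is only read at query time.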
Terraform docs
You need to:
- configure a service account in GCP
- install Google Cloud SDK
- authenticate in GCP
- install Terraform
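Before running the scripts, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch and not part of the repository: the function name `missing_tools` is hypothetical, and it assumes the default executable names `gcloud` and `terraform`.

```python
import shutil

# Executables provided by the Google Cloud SDK and Terraform (assumed default names).
REQUIRED_TOOLS = ("gcloud", "terraform")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

This only checks that the tools are installed; it does not verify that you are authenticated in GCP.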
To create the infrastructure, run the following script:
bash run_terraform.sh
Review the planned changes and type yes to confirm.
To destroy the created infrastructure run:
bash destroy_terraform.sh
Airflow docs
You need to place the Google credentials in the ~/.google/credentials/ directory on your machine (either local or VM).
cd ~ && mkdir -p ~/.google/credentials/
mv <path/to/your/service-account-authkey>.json ~/.google/credentials/google_credentials.json
Before running the container, remember to update:
- the GCP_PROJECT_ID and GCP_GCS_BUCKET variable values in the .env file
- the DOWNLOAD_START_DATE, DOWNLOAD_END_DATE, BIGQUERY_DATASET, TABLE_ID variable values in dag__data_ingestion.py
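The credentials and .env requirements above can be checked before starting the containers. The sketch below is one possible pre-flight check, not part of the repository: the names `read_env_file` and `check_setup` are hypothetical, and it assumes the .env file uses plain KEY=VALUE lines and sits in the current directory.

```python
import json
from pathlib import Path

# Credentials location from the instructions above.
CREDENTIALS_PATH = Path.home() / ".google" / "credentials" / "google_credentials.json"
REQUIRED_ENV_KEYS = ("GCP_PROJECT_ID", "GCP_GCS_BUCKET")


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_setup(env_path=".env", credentials_path=CREDENTIALS_PATH):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    if not Path(env_path).is_file():
        problems.append(f"{env_path} not found")
    else:
        env = read_env_file(env_path)
        for key in REQUIRED_ENV_KEYS:
            if not env.get(key):
                problems.append(f"{key} is missing or empty in {env_path}")

    cred_file = Path(credentials_path)
    if not cred_file.is_file():
        problems.append(f"credentials file not found: {cred_file}")
    else:
        try:
            key_data = json.loads(cred_file.read_text())
        except json.JSONDecodeError:
            problems.append(f"credentials file is not valid JSON: {cred_file}")
        else:
            # Service account key files carry a "type": "service_account" field.
            if not isinstance(key_data, dict) or key_data.get("type") != "service_account":
                problems.append("credentials file is not a service account key")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("Problem:", problem)
```

The check does not cover the variables in dag__data_ingestion.py, since those are set in Python code rather than in a configuration file.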
Execution:
- Run the following command to build the image, initialize Airflow, and start all services:
bash run_airflow.sh
- Log in to the Airflow web UI at localhost:8080 with the default credentials admin/admin and run the DAG named dag__data_ingestion.
- To shut down all Airflow services, run:
bash shutdown_airflow.sh
dbt docs
Before running the models, install dbt-core or set up dbt Cloud. For more details, refer to the official documentation.
Commands to run dbt models:
dbt seed
dbt build
Models overview:
