Loan ETL Pipeline

This wiki provides comprehensive documentation on the Loan ETL Pipeline, which extracts, transforms, and loads (ETL) loan-related data into clean, structured formats suitable for analysis. The pipeline is implemented in Python with PySpark and uses Google Cloud Storage (GCS) to manage and store data.


Contents

  1. Overview
  2. Dependencies
  3. Configuration
  4. Pipeline Architecture
    • Bronze Processing
    • Silver Processing
  5. Core Functions
  6. Usage Examples
  7. References

Overview

The Loan ETL Pipeline is designed to process raw XML and CSV loan data, transforming it through various stages into a clean, analytics-ready format. The data flows through bronze and silver layers:

  • Bronze Layer: Initial data profiling, basic cleaning, and format unification.
  • Silver Layer: Further validation and transformation, producing a consistent and structured dataset for analysis.

The pipeline leverages PySpark for distributed processing, Google Cloud Storage for cloud storage management, and Cerberus for schema validation, ensuring data quality and consistency.
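
As a point of reference, a Spark session for this kind of pipeline might be created as shown below. This is a minimal sketch assuming the delta-spark package and GCS credentials configured in the environment; it is not the pipeline's actual entry point.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Illustrative session setup: enable Delta Lake support; GCS access is assumed
# to be handled by the environment (e.g. an attached service account).
builder = (
    SparkSession.builder.appName("loan-etl-pipeline")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()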


Dependencies

To run the Loan ETL Pipeline, the following libraries are required:

  • logging: Provides standard logging for tracking ETL steps and debugging.
  • pyspark: Essential for distributed data processing and transformation.
  • google-cloud-storage: Allows interaction with Google Cloud Storage for file operations.
  • lxml: For parsing and transforming XML data.
  • pandas: Used for data manipulation and validation tasks.
  • cerberus: For validating CSV data against defined schemas.
  • delta: For handling Delta Lake operations during data transformation.
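
Apart from logging, which ships with the Python standard library, these are available from PyPI (the Delta Lake bindings are published as delta-spark), so a typical environment can be prepared with something like:

pip install pyspark google-cloud-storage lxml pandas cerberus delta-spark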

Configuration

1. Set Job Parameters

The set_job_params function defines ETL configurations, including column names, data types, and validation rules. It also manages the transformation settings used across different modules.

from pyspark.sql.types import DateType, DoubleType, StringType

def set_job_params():
    config = {
        "DATE_COLUMNS": ["CS11", "CS12", "CS22"],
        "COLLATERAL_COLUMNS": {
            # Map each collateral column to its Spark type, e.g. (illustrative only):
            # "CS1": StringType(),
            # "CS11": DateType(),
            # "CS28": DoubleType(),
        },
    }
    return config

2. Logging Setup

Logging is configured to provide insights into the ETL process, allowing easy tracking of information, warnings, and errors.

import logging
import sys

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)

Pipeline Architecture

1. Bronze Processing

The bronze layer processes raw XML and CSV data files from GCS, performing initial transformations like unpivoting and basic data cleaning. Key functions include:

  • Data Retrieval: Fetches raw XML or CSV files from GCS.
  • Data Profiling: Validates and profiles data to separate valid and invalid records.
  • Bronze Table Creation: Stores processed data with basic transformations applied.
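
Put together, a bronze run roughly chains these steps. The snippet below is an illustrative driver built from the functions documented under Core Functions, with placeholder bucket names and prefixes; it is not the pipeline's actual orchestration code.

# XML deal details -> bronze table (with SCD-2 upsert against any existing data)
generate_deal_details_bronze(
    spark,
    raw_bucketname="source_bucket",
    data_bucketname="processed_bucket",
    source_prefix="data/raw",
    target_prefix="data/bronze/deal_details",
    file_key="deal_2024",
)

# CSV collateral files -> profiled bronze data, split into valid/invalid records
profile_bronze_data(
    raw_bucketname="source_bucket",
    data_bucketname="processed_bucket",
    source_prefix="data/raw",
    file_key="deal_2024",
    data_type="collaterals",
    ingestion_date="2024-01-01",
)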

2. Silver Processing

The silver layer processes bronze data into a cleaned and validated format. Key transformations include:

  • Data Unpivoting: Converts wide data to long format to streamline analysis.
  • Data Type Conversion: Applies correct data types and rounds numeric values.
  • Partitioned Storage: Writes transformed data to GCS in partitioned Parquet format for efficient querying.
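
The snippet below sketches these three steps on a hypothetical bronze DataFrame bronze_df; the column names (current_balance, original_balance) and output path are illustrative, not the pipeline's real schema.

from pyspark.sql import functions as F

# Wide -> long: each attribute becomes an (attribute, value) pair per record.
long_df = bronze_df.selectExpr(
    "dl_code",
    "stack(2, 'current_balance', current_balance, "
    "'original_balance', original_balance) as (attribute, value)",
)

# Apply the correct data type and round numeric values.
typed_df = long_df.withColumn("value", F.round(F.col("value").cast("double"), 2))

# Write partitioned Parquet to GCS for efficient querying.
(typed_df.write
    .mode("overwrite")
    .partitionBy("dl_code")
    .parquet("gs://processed_bucket/data/silver/collaterals"))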

Core Functions

1. set_job_params

Defines the configurations for data transformation, including column names and types.

2. Data Retrieval and Storage

  • get_raw_file: Retrieves XML files from GCS based on the specified file key.
  • get_old_df: Fetches previously processed data from the bronze layer for reference.

def get_raw_file(bucket_name, prefix, file_key):
    """Retrieve raw XML files from the bucket for the specified file key."""
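
Internally this amounts to a prefix listing against the raw bucket. A minimal sketch with google-cloud-storage (bucket, prefix, and file key values are placeholders) might look like:

from google.cloud import storage

# Illustrative only: list blobs under the prefix and keep XML files whose
# name contains the requested file key.
client = storage.Client()
xml_blobs = [
    blob.name
    for blob in client.list_blobs("source_bucket", prefix="data/raw")
    if "deal_2024" in blob.name and blob.name.endswith(".xml")
]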

3. XML Data Parsing and Transformation

  • create_dataframe: Reads XML data into a PySpark DataFrame, adding metadata and versioning.

def create_dataframe(spark, bucket_name, xml_file):
    """Parse XML data into a PySpark DataFrame with metadata and versioning."""
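
A minimal sketch of the XML-to-DataFrame step is shown below, assuming the file content has already been downloaded into xml_bytes and that records sit under a Deal element; the tag name and metadata columns are illustrative.

from lxml import etree
from pyspark.sql import functions as F

# Parse the XML payload and turn each record element into a row dictionary.
root = etree.fromstring(xml_bytes)
rows = [{child.tag: child.text for child in record} for record in root.iter("Deal")]

# Build the DataFrame and attach metadata/versioning columns.
df = (
    spark.createDataFrame(rows)
    .withColumn("ingestion_date", F.current_date())
    .withColumn("version", F.lit(1))
)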

4. Bronze Data Processing

  • generate_deal_details_bronze: Processes raw XML data to create the bronze table, performing a Type 2 Slowly Changing Dimension (SCD-2) upsert if legacy data exists.

def generate_deal_details_bronze(spark, raw_bucketname, data_bucketname, source_prefix, target_prefix, file_key):
    """Process raw XML data into the bronze table, upserting over existing data."""
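
A simplified SCD-2 style upsert with Delta Lake is sketched below; the table path, key column (deal_id), and flag columns are assumptions used for illustration rather than the pipeline's actual schema.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

bronze = DeltaTable.forPath(spark, "gs://processed_bucket/data/bronze/deal_details")
updates = new_df.withColumn("is_current", F.lit(True))

# Close out the currently active version of any deal present in the new batch...
(bronze.alias("old")
    .merge(updates.alias("new"),
           "old.deal_id = new.deal_id AND old.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false",
                            "valid_to": "current_date()"})
    .execute())

# ...then append the new version alongside the history.
(updates.write.format("delta").mode("append")
    .save("gs://processed_bucket/data/bronze/deal_details"))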

5. Data Profiling and Validation

  • profile_bronze_data: Profiles CSV data, validating records and separating valid and invalid entries based on defined schemas.

def profile_bronze_data(raw_bucketname, data_bucketname, source_prefix, file_key, data_type, ingestion_date):
    """Profile and validate CSV data in the bronze layer, separating valid and invalid records."""
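
Conceptually, the validation step pairs a Cerberus schema with each CSV row. The sketch below uses pandas and illustrative field and file names, whereas the real schemas are defined in the pipeline configuration.

import pandas as pd
from cerberus import Validator

# Illustrative schema; the pipeline defines its own schema per data type.
schema = {
    "dl_code": {"type": "string", "required": True},
    "current_balance": {"type": "float", "min": 0},
}
validator = Validator(schema, allow_unknown=True)

valid, invalid = [], []
for record in pd.read_csv("collaterals.csv").to_dict(orient="records"):
    (valid if validator.validate(record) else invalid).append(record)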

6. Silver Data Processing

  • generate_collateral_silver: Converts collateral data from bronze to silver, applying transformations and saving the final data in GCS.

def generate_collateral_silver(spark, bucket_name, source_prefix, target_prefix, dl_code, ingestion_date):
    """Transform bronze collateral data into the silver table and save it to GCS."""

7. Utility Functions

These are located in src.loan_etl_pipeline.utils and include:

  • replace_no_data: Replaces "no data" placeholders with None.
  • replace_bool_data: Converts "Y"/"N" strings to Boolean values.
  • cast_to_datatype: Ensures columns are cast to specified data types.
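
The sketches below show how such helpers typically look in PySpark; they are assumptions for illustration (including the "no data" placeholder string), not the actual implementations in src.loan_etl_pipeline.utils.

from pyspark.sql import functions as F

def replace_no_data(df, columns, placeholder="no data"):
    # Replace the placeholder string with nulls in the given columns.
    for col in columns:
        df = df.withColumn(
            col, F.when(F.col(col) == placeholder, F.lit(None)).otherwise(F.col(col))
        )
    return df

def replace_bool_data(df, columns):
    # Map "Y"/"N" strings to Booleans; any other value becomes null.
    for col in columns:
        df = df.withColumn(
            col, F.when(F.col(col) == "Y", True).when(F.col(col) == "N", False)
        )
    return df

def cast_to_datatype(df, column_types):
    # Cast each column to the Spark type declared in the job configuration.
    for col, dtype in column_types.items():
        df = df.withColumn(col, F.col(col).cast(dtype))
    return df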

Usage Examples

Running Bronze Profiling Job

profile_bronze_data(
    raw_bucketname="source_bucket",
    data_bucketname="processed_bucket",
    source_prefix="data/raw",
    file_key="deal_2024",
    data_type="collaterals",
    ingestion_date="2024-01-01"
)

Generating Silver Collateral Data

generate_collateral_silver(
    spark=spark_session,
    bucket_name="processed_bucket",
    source_prefix="data/bronze/collaterals",
    target_prefix="data/silver/collaterals",
    dl_code="DL1234",
    ingestion_date="2024-01-01"
)

References

For more information, consult the following documentation: