
Ed Tech CodePro Lead Classification MLOps Project

This end-to-end MLOps project demonstrates how to operationalize a machine learning pipeline for lead scoring. It covers model building, experiment tracking with MLflow, evaluation, packaging, and deployment, with a focus on the maintainability and scalability that real-world enterprise AI adoption demands.

Table of Contents

  • Project Overview
  • Features
  • Repository Structure
  • Getting Started
  • Airflow Setup Instructions
  • Pipeline Overview
  • Notebooks
  • Running the Project
  • License
  • Contact & Acknowledgements

Project Overview

This project implements a full-fledged MLOps pipeline to classify leads for CodePro, an Ed Tech company, using a modular architecture orchestrated by Apache Airflow. It integrates:

  • Data Ingestion & Cleaning: Automated scripts to validate, clean, and prepare raw lead data.
  • Model Training: A dedicated training pipeline that experiments with and refines models.
  • Inference & Prediction: A pipeline for real-time prediction and lead scoring.
  • Testing & Validation: Robust unit tests ensuring the integrity of each stage.

Features

  • Automated Data Processing: Clean and validate your data with a dedicated pipeline.
  • Model Training & Inference: Seamless pipelines for building and deploying machine learning models.
  • Airflow-Orchestrated Workflows: Manage complex dependencies and schedules with ease.
  • Interactive Analysis: Leverage Jupyter notebooks for exploratory data analysis and model experimentation.
  • Modular & Scalable: Easily extend or modify each component of the system.

Repository Structure

airflow
├── airflow.cfg
├── dags
│   ├── __init__.py
│   ├── lead_scoring_data_pipeline
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   ├── data
│   │   │   ├── leadscoring.csv
│   │   │   └── leadscoring_inference.csv
│   │   ├── data_validation_checks.py
│   │   ├── lead_scoring_data_cleaning.db
│   │   ├── lead_scoring_data_pipeline.py
│   │   ├── mappings
│   │   │   ├── city_tier_mapping.py
│   │   │   ├── interaction_mapping.csv
│   │   │   └── significant_categorical_level.py
│   │   ├── schema.py
│   │   └── utils.py
│   ├── lead_scoring_inference_pipeline
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   ├── lead_scoring_inference_pipeline.py
│   │   ├── prediction_distribution.txt
│   │   ├── schema.py
│   │   └── utils.py
│   ├── lead_scoring_training_pipeline
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   ├── lead_scoring_training_pipeline.py
│   │   └── utils.py
│   ├── master_pipeline_dag.py
│   └── unit_test
│       ├── __init__.py
│       ├── constants.py
│       ├── leadscoring_test.csv
│       ├── test_runner_dag.py
│       ├── test_with_pytest.py
│       └── unit_test_cases.db
notebooks
├── 01.data_cleaning.ipynb
├── 02.model_experimentation.ipynb
├── data
│   ├── cleaned_data.csv
│   └── leadscoring.csv
├── mappings
│   ├── city_tier_mapping.py
│   ├── interaction_mapping.csv
│   └── significant_categorical_level.py
└── profile_reports
    ├── cleaned_data_report.html
    └── raw_data_report.html
screenshots.pdf
webserver_config.py

Folder Highlights:

  • airflow: Contains Airflow configuration and DAGs for each pipeline component, including data cleaning, model training, inference, and unit testing.
  • notebooks: Jupyter notebooks and supplementary files for exploratory data analysis and model experimentation.
  • screenshots.pdf: Visual documentation of key pipeline outputs and dashboards.
  • webserver_config.py: Configuration file for setting up the Airflow web server.

Getting Started

Prerequisites

  • Python: Version 3.7 or higher
  • Apache Airflow: Installed and configured (see instructions below)
  • Jupyter Notebook: For running and modifying interactive notebooks
  • Other Dependencies: Listed in requirements.txt (if available)

Installation

  1. Clone the Repository:

    git clone https://github.com/mohiteamit/MLops-lead-scoring-system.git
    cd MLops-lead-scoring-system
  2. Set Up a Virtual Environment & Install Dependencies:

    python -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate
    pip install -r requirements.txt
  3. Configure Apache Airflow:
    Ensure you have a proper Airflow setup as described in the Airflow Setup Instructions.


Airflow Setup Instructions

Follow these steps to set up Airflow:

A. Update airflow.cfg

  1. Open the airflow.cfg file.
  2. Make the following changes:
    base_url = http://localhost:6007
    web_server_port = 6007

B. Run Initialization Commands

  1. Initialize the Airflow metadata database:

    airflow db init
  2. Create an admin user to access the Airflow UI:

    airflow users create \
        --username upgrad \
        --firstname upgrad \
        --lastname upgrad \
        --role Admin \
        --email [email protected] \
        --password admin

C. Start Airflow

  1. Start the Airflow web server:

    airflow webserver
  2. In a new terminal, start the scheduler:

    airflow scheduler

Clean Restart (Optional)

If you want to completely reset your Airflow setup and start fresh:

  1. Reset the database (this will erase all metadata):

    airflow db reset --yes
  2. Re-initialize the metadata database:

    airflow db init
  3. Recreate the admin user:

    airflow users create \
        --username upgrad \
        --firstname upgrad \
        --lastname upgrad \
        --role Admin \
        --email [email protected] \
        --password admin

Pipeline Overview

Data Pipeline

  • Location: airflow/dags/lead_scoring_data_pipeline/
  • Purpose: Ingest raw lead data, perform data cleaning, run validation checks, and prepare data for downstream processes.
  • Key Components:
    • Data validation scripts and cleaning utilities (a schema-check sketch follows below).
    • Mapping modules (e.g., city tier, interaction mapping) to enrich raw data.
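
As an illustration of the validation step, below is a minimal sketch of the kind of schema check data_validation_checks.py could perform, assuming pandas is available; the function name and column list are hypothetical, not taken from the repository:

    # Illustrative sketch only -- the actual checks live in data_validation_checks.py.
    import pandas as pd

    # Hypothetical column names; the real expectations are defined in schema.py.
    EXPECTED_COLUMNS = ["created_date", "city", "source"]

    def validate_raw_schema(csv_path: str) -> None:
        """Fail fast if the raw lead data is missing expected columns."""
        df = pd.read_csv(csv_path)
        missing = set(EXPECTED_COLUMNS) - set(df.columns)
        if missing:
            raise ValueError(f"Raw data is missing columns: {sorted(missing)}")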

Training Pipeline

  • Location: airflow/dags/lead_scoring_training_pipeline/
  • Purpose: Train machine learning models for lead classification using processed data.
  • Key Components:
    • Model training scripts (a sketch follows below).
    • Constants and utility functions to manage training configurations.
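
Since the project tracks experiments with MLflow, here is a hedged sketch of what a training task might look like; the model choice, target column, and metric are assumptions, not taken from the repository:

    # Illustrative sketch -- the real logic lives in lead_scoring_training_pipeline/utils.py.
    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def train_model(features_csv: str) -> None:
        """Train a classifier and log the run to MLflow."""
        df = pd.read_csv(features_csv)
        X = df.drop(columns=["app_complete_flag"])  # hypothetical target column
        y = df["app_complete_flag"]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        with mlflow.start_run():
            model = RandomForestClassifier(n_estimators=200, random_state=42)
            model.fit(X_train, y_train)
            auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
            mlflow.log_metric("test_auc", auc)
            mlflow.sklearn.log_model(model, "model")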

Inference Pipeline

  • Location: airflow/dags/lead_scoring_inference_pipeline/
  • Purpose: Generate predictions using the trained model.
  • Key Components:
    • Prediction scripts (sketched below).
    • Schema definitions to ensure prediction consistency.
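
A sketch of the prediction step under the same assumptions; the model URI and output column name are placeholders:

    # Illustrative sketch -- see lead_scoring_inference_pipeline/utils.py for the real code.
    import mlflow.sklearn
    import pandas as pd

    def score_leads(features_csv: str, model_uri: str) -> pd.DataFrame:
        """Load a trained model and attach predicted labels to the lead data."""
        model = mlflow.sklearn.load_model(model_uri)  # model_uri is a placeholder
        df = pd.read_csv(features_csv)
        df["predicted_label"] = model.predict(df)  # hypothetical output column
        return df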

Master Pipeline DAG

  • Location: airflow/dags/master_pipeline_dag.py
  • Purpose: Orchestrate the overall workflow by coordinating the data, training, and inference pipelines.
  • Key Components:
    • Workflow scheduling and dependency management (a condensed sketch follows below).
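
A condensed sketch of how a master DAG can chain the three pipelines, assuming Airflow 2.x and guessing the child DAG IDs from the folder names:

    # Illustrative sketch -- the actual orchestration is in master_pipeline_dag.py.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    with DAG(
        dag_id="master_pipeline",  # assumed DAG id
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        data = TriggerDagRunOperator(
            task_id="run_data_pipeline",
            trigger_dag_id="lead_scoring_data_pipeline",       # assumed child DAG id
            wait_for_completion=True,
        )
        train = TriggerDagRunOperator(
            task_id="run_training_pipeline",
            trigger_dag_id="lead_scoring_training_pipeline",   # assumed child DAG id
            wait_for_completion=True,
        )
        infer = TriggerDagRunOperator(
            task_id="run_inference_pipeline",
            trigger_dag_id="lead_scoring_inference_pipeline",  # assumed child DAG id
            wait_for_completion=True,
        )
        # Clean -> train -> predict, each child DAG blocking the next.
        data >> train >> infer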

Unit Testing

  • Location: airflow/dags/unit_test/
  • Purpose: Validate pipeline functionality and data integrity through automated tests.
  • Key Components:
    • Pytest-based test scripts and test data (see the sketch below).
    • A dedicated test runner DAG for executing tests within Airflow.
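
A minimal sketch of this testing pattern: compare the pipeline's output tables against reference rows stored in unit_test_cases.db. The table names and database paths here are assumptions:

    # Illustrative sketch -- the real cases are in unit_test/test_with_pytest.py.
    import sqlite3

    import pandas as pd

    OUTPUT_DB = "lead_scoring_data_cleaning.db"  # pipeline output (path assumed)
    REFERENCE_DB = "unit_test_cases.db"          # expected results

    def test_cleaned_data_matches_reference():
        """The cleaned table produced by the pipeline should equal the reference table."""
        with sqlite3.connect(OUTPUT_DB) as out_conn, sqlite3.connect(REFERENCE_DB) as ref_conn:
            actual = pd.read_sql("SELECT * FROM cleaned_data", out_conn)             # assumed table name
            expected = pd.read_sql("SELECT * FROM cleaned_data_expected", ref_conn)  # assumed table name
        pd.testing.assert_frame_equal(actual, expected)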

Notebooks

The notebooks directory provides interactive resources for:

  • Data Cleaning: (01.data_cleaning.ipynb) – Explore, visualize, and clean raw data.
  • Model Experimentation: (02.model_experimentation.ipynb) – Experiment with different models and parameters.
  • Supplementary Files:
    • Data Files: Raw and cleaned datasets.
    • Mapping Scripts: For data transformation.
    • Profile Reports: Detailed HTML reports summarizing data quality (see the sketch below).
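
The HTML profile reports can be regenerated with a data-profiling library; here is a sketch assuming ydata-profiling (formerly pandas-profiling), which may not be what the author actually used:

    # Illustrative sketch, assuming ydata-profiling; the repo does not state which profiler was used.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv("data/leadscoring.csv")
    ProfileReport(df, title="Raw Data Report").to_file("profile_reports/raw_data_report.html")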

Running the Project

  1. Start Airflow:
    Follow the Airflow Setup Instructions to initialize and run the Airflow webserver and scheduler.

  2. Trigger DAGs:
    From the Airflow UI, trigger the master pipeline DAG to run the complete end-to-end workflow—from data cleaning to model inference.

  3. Explore Notebooks:
    Open the notebooks in Jupyter to interactively analyze data and experiment with models.

  4. Run Tests:
    Execute unit tests to validate pipeline integrity:

    pytest dags/unit_test/test_with_pytest.py

License

This project is open-source and available under the MIT License.


Contact & Acknowledgements

For questions or further information, please reach out to the repository maintainer.
