davidheredia17/InsulinMetric


InsulinMetric: Predictive Clinical Portal for Diabetes Screening

Overview

InsulinMetric is an end-to-end Machine Learning project designed to act as an assisted diagnostic tool for early diabetes screening. It goes beyond a simple data science script by integrating a fully functional, clinical-style Graphical User Interface (GUI).

The core of the project is a Random Forest Classifier trained on biomedical data. The design emphasizes medical safety: it prioritizes Recall (Sensitivity) and lowers the decision threshold to minimize False Negatives, so that potential high-risk patients are not overlooked.

Dataset

This project uses the Pima Indians Diabetes Database.

  • Source: Kaggle / UCI Machine Learning Repository
  • Description: Diagnostic measurements used to predict the onset of diabetes. All patients in the original dataset are females at least 21 years old of Pima Indian heritage. (Note: Our custom UI extends this to simulate a general hospital environment by making the gender-dependent fields dynamic.)

Project Structure

InsulinMetric/
├── data/
│   └── diabetes.csv        # Raw dataset
├── output/                 # Auto-generated plots from model training
│   ├── 01_dirty_feature_importance.png
│   ├── 02_missing_values_heatmap.png
│   ├── 03_insulin_distribution.png
│   ├── 04_scaling_comparison.png
│   ├── 05_final_feature_importance.png
│   └── 06_evaluation_metrics.png
├── venv/                   # Isolated Python virtual environment (ignored in git)
├── interface.py            # Clinical GUI Application (Frontend)
├── main.py                 # ML Pipeline & Training Script (Backend)
├── medical_model.pkl       # Exported 'Brain' (Model + Scaler + Features)
└── requirements.txt        # Project dependencies

Installation & Requirements

To run this project locally, ensure you have an active virtual environment, then install the required dependencies:

# Activate virtual environment (Windows)
.\venv\Scripts\activate
# For Mac/Linux: source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Dependencies (requirements.txt):

pandas
numpy
scikit-learn
matplotlib
seaborn
joblib
tkcalendar

Machine Learning Pipeline (main.py)

The backend is structured into distinct, logical blocks representing a professional Data Science workflow:

Block 0: Configuration Panel

A centralized dictionary (CONFIG) manages hyperparameters, paths, and clinical thresholds (e.g., recall_threshold: 0.35). This makes the model easily tunable without diving deep into the code.
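As a rough illustration of what such a configuration panel could look like, here is a minimal sketch. Only `recall_threshold`, the 200 trees, the 5 folds, and the file names come from this README; the remaining keys and the `is_positive` helper are assumptions made for the example, not the actual contents of main.py.

```python
# Hypothetical configuration panel. Keys marked "assumed" are not stated in the README.
CONFIG = {
    "data_path": "data/diabetes.csv",
    "output_dir": "output",
    "model_path": "medical_model.pkl",
    "recall_threshold": 0.35,  # clinical decision threshold used for screening
    "n_estimators": 200,       # number of trees in the Random Forest
    "cv_folds": 5,             # folds used for cross-validation
    "test_size": 0.2,          # assumed
    "random_state": 42,        # assumed
}


def is_positive(probability: float, config: dict = CONFIG) -> bool:
    """Flag a patient as high-risk when the probability reaches the threshold."""
    return probability >= config["recall_threshold"]


print(is_positive(0.40), is_positive(0.30))
```

Keeping every tunable value in one dictionary means a threshold change touches a single line rather than several call sites.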

Block I: Data Loading & Diagnosis

Medical datasets often use 0 to represent missing data. The pipeline first identifies these "false zeros" in columns where a value of 0 is biologically implausible (like Insulin and Blood Pressure) and converts them to NaN, so that the algorithm does not learn incorrect biological baselines.
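The false-zero step described above can be sketched with pandas as follows. The column names follow the public Pima CSV; the exact list of columns treated this way in main.py is an assumption, and `mark_false_zeros` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumed list of columns where 0 cannot be a real measurement.
FALSE_ZERO_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def mark_false_zeros(df: pd.DataFrame, columns=FALSE_ZERO_COLUMNS) -> pd.DataFrame:
    """Return a copy of df where 0 in the given columns is replaced by NaN."""
    cleaned = df.copy()
    present = [c for c in columns if c in cleaned.columns]
    cleaned[present] = cleaned[present].replace(0, np.nan)
    return cleaned


# Toy rows for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "Pregnancies": [0, 2],     # 0 is a valid value here and is left untouched
    "Insulin": [0, 94],
    "BloodPressure": [72, 0],
})
cleaned = mark_false_zeros(toy)
print(cleaned.isna().sum())
```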

Block II: Smart Imputation

Instead of using a global mean, the project applies domain-specific logical imputation:

  • Age Quartiles: Missing Blood Pressure, BMI, and Skin Thickness are imputed based on the patient's age group.
  • Target-based: Missing Glucose and Insulin are imputed based on the actual outcome (Diabetic vs. Healthy), so that the imputed values do not blur the difference between the two classes.
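The two imputation rules above could be sketched like this. The README does not say which statistic is used per group, so the group median here is an assumption, as is the helper name `smart_impute`.

```python
import numpy as np
import pandas as pd

AGE_BASED = ["BloodPressure", "BMI", "SkinThickness"]
TARGET_BASED = ["Glucose", "Insulin"]


def smart_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with a group median (assumed statistic), per the rules above."""
    out = df.copy()
    # Assign each patient to one of four age quartiles.
    age_group = pd.qcut(out["Age"], q=4, labels=False, duplicates="drop")
    for col in AGE_BASED:
        if col in out.columns:
            group_median = out.groupby(age_group)[col].transform("median")
            out[col] = out[col].fillna(group_median)
    for col in TARGET_BASED:
        if col in out.columns:
            group_median = out.groupby("Outcome")[col].transform("median")
            out[col] = out[col].fillna(group_median)
    return out


# Toy rows for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "Age":           [21, 22, 30, 31, 40, 41, 60, 61],
    "BloodPressure": [60, np.nan, 70, 72, 80, np.nan, 90, 88],
    "Glucose":       [90, 95, np.nan, 100, 150, 160, np.nan, 170],
    "Outcome":       [0, 0, 0, 0, 1, 1, 1, 1],
})
imputed = smart_impute(toy)
print(imputed)
```

Because the second rule reads the Outcome column, it applies to labeled training data only.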

Block III: Scaling & Splitting

Data is split using stratify=y so that the training and test sets keep the same proportion of diabetic patients as the full dataset. Features are then standardized using Z-Score Scaling (StandardScaler), which puts high-magnitude variables (like Insulin) and low-magnitude variables (like the Diabetes Pedigree Function) on a common scale.
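A minimal sketch of the split-then-scale order described above, using synthetic stand-in data. The `test_size`, the `random_state`, and fitting the scaler on the training portion only are assumptions; the README only states that `stratify=y` and `StandardScaler` are used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one high-magnitude and one low-magnitude feature.
rng = np.random.default_rng(0)
X = rng.normal(loc=[120.0, 0.5], scale=[30.0, 0.3], size=(200, 2))
y = np.array([0] * 130 + [1] * 70)

# stratify=y keeps the positive rate equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data, then reuse it for the test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```

Reusing the fitted scaler on the test set (and later on GUI inputs) keeps every prediction on the scale the model saw during training.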

Block IV: Random Forest Training

An ensemble of decision trees (n_estimators=200) is trained. class_weight='balanced' weights each class inversely to its frequency, so that errors on the minority class (Diabetic patients) count more during training.
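The training call could look roughly like the sketch below. The data is a synthetic stand-in generated with `make_classification`, and `random_state` is an assumption; only `n_estimators=200` and `class_weight='balanced'` come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the eight Pima features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

model = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",  # reweights classes inversely to their frequency
    random_state=42,          # assumed, for reproducibility
)
model.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
print(model.feature_importances_.round(3))
```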

Block V: Clinical Evaluation

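The model is evaluated not just on Accuracy but primarily on Recall. A custom decision threshold of 0.35, lower than the usual 0.5, makes the model behave as a strict medical screening tool: more borderline patients are flagged as positive, which reduces False Negatives at the cost of more False Positives.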
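To make the effect of the threshold concrete, here is a small sketch that applies 0.5 and 0.35 to the same probabilities. The probability values are illustrative, not output from the trained model, and `predict_with_threshold` is a hypothetical helper; whether main.py uses `>=` or `>` is also an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score


def predict_with_threshold(probabilities, threshold=0.35):
    """Label a case positive (1) when its probability reaches the threshold."""
    return (np.asarray(probabilities) >= threshold).astype(int)


y_true = np.array([1, 1, 1, 0, 0, 0])
proba = np.array([0.80, 0.45, 0.36, 0.40, 0.20, 0.10])  # illustrative values

default_pred = predict_with_threshold(proba, threshold=0.5)
screen_pred = predict_with_threshold(proba, threshold=0.35)

recall_default = recall_score(y_true, default_pred)
recall_screen = recall_score(y_true, screen_pred)
tn, fp, fn, tp = confusion_matrix(y_true, screen_pred).ravel()

print(f"recall at 0.50: {recall_default:.2f}, recall at 0.35: {recall_screen:.2f}")
print(f"at 0.35: TN={tn} FP={fp} FN={fn} TP={tp}")
```

In this toy case the lower threshold recovers two positives that 0.5 would miss and adds one false positive, which is the trade-off a screening tool accepts.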

Block VI: Cross-Validation

To check stability and rule out "lucky splits," a 5-Fold Cross-Validation is performed, which shows how consistent the model's AUC is across different subsets of the data.
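A cross-validation check of this kind might be sketched as follows, again on synthetic stand-in data. The use of `StratifiedKFold` with shuffling and the `random_state` values are assumptions; the README only specifies 5 folds and AUC as the metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One ROC AUC value per fold; a small spread suggests a stable model.
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("AUC per fold:", auc_scores.round(3))
print(f"mean={auc_scores.mean():.3f} std={auc_scores.std():.3f}")
```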

Clinical User Interface (interface.py)

A custom desktop application built with Tkinter to simulate a hospital's ERP software.

Key UI Features:

  • Dynamic Fields: If the user selects "Male", the "Pregnancies" field is automatically hidden, and a value of 0.0 is passed to the model for that feature so the input keeps the shape the model expects.
  • Automated DPF Calculation: Instead of asking the user for a complex decimal (Diabetes Pedigree Function), the UI asks simple questions about their family history and maps the answers to a DPF value between 0.15 and 1.90.
  • Age Calculation: Automatically derives the patient's age from their Date of Birth.
  • Real-time Pipeline: Loads the StandardScaler saved during training and uses it to scale new user inputs on the fly before predicting.
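The input handling behind these features could be sketched without any GUI code as shown below. The family-history categories and the two intermediate DPF values are hypothetical (the README only gives the range 0.15 to 1.90), and `age_from_dob` and `build_features` are illustrative names, not functions confirmed to exist in interface.py. The feature order follows the columns of the public Pima CSV.

```python
from datetime import date

# Hypothetical mapping; only the endpoints 0.15 and 1.90 come from the README.
DPF_BY_FAMILY_HISTORY = {
    "none": 0.15,
    "distant_relative": 0.50,  # assumed value
    "one_parent": 1.00,        # assumed value
    "both_parents": 1.90,
}


def age_from_dob(dob: date, today: date) -> int:
    """Whole years elapsed between dob and today."""
    years = today.year - dob.year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1  # birthday has not happened yet this year
    return years


def build_features(sex, pregnancies, glucose, blood_pressure, skin_thickness,
                   insulin, bmi, family_history, dob, today):
    """Assemble one input row in the Pima column order."""
    if sex == "Male":
        pregnancies = 0.0  # hidden field, filled so the row keeps 8 values
    return [
        float(pregnancies), float(glucose), float(blood_pressure),
        float(skin_thickness), float(insulin), float(bmi),
        DPF_BY_FAMILY_HISTORY[family_history],
        float(age_from_dob(dob, today)),
    ]


row = build_features("Male", 3, 130, 80, 25, 100, 28.5, "both_parents",
                     date(1990, 6, 15), date(2024, 6, 15))
print(row)
```

A row built this way would still need to pass through the saved scaler before being handed to the model.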

Results & Visualizations

The following visualizations highlight the data cleaning process and the final performance of our clinical model.

1. The Impact of Dirty Data

This plot shows how untreated 0 values (false zeros) hide the true predictive power of key biological markers like Insulin.

Dirty Feature Importance

2. Missing Values Heatmap

Visualizing the exact locations of the medical NaN values before our smart imputation.

Missing Values Heatmap

3. Smart Imputation Distribution

The Insulin distribution after applying target-based imputation, ensuring smooth, realistic data curves.

Insulin Distribution

4. Data Standardization

This plot compares the raw data (left) with the Z-Score scaled data (right), where all features share a common scale.

Scaling Comparison

5. Final Feature Importance (Clean Data)

After smart imputation and scaling, Glucose and Insulin correctly emerge as the dominant biological predictors for the Random Forest.

Final Feature Importance

6. Clinical Performance (Confusion Matrix & ROC)

By lowering the decision threshold to 0.35, the model acts as a highly sensitive screening tool. The Confusion Matrix shows high Recall (most positive cases are identified), while the ROC curve summarizes the overall diagnostic power (AUC).

Evaluation Metrics
