InsulinMetric is an end-to-end Machine Learning project designed to act as an assisted diagnostic tool for early diabetes screening. It goes beyond a simple data science script by integrating a fully functional, clinical-style Graphical User Interface (GUI).
The core of the project is a Random Forest Classifier trained on biomedical data. A significant engineering emphasis was placed on medical safety: prioritizing Recall (Sensitivity) and adjusting decision thresholds to minimize False Negatives, ensuring that potential high-risk patients are not overlooked.
This project uses the Pima Indians Diabetes Database.
- Source: Kaggle / UCI Machine Learning Repository
- Description: Predicts the onset of diabetes based on diagnostic measures. All patients in the original dataset are females at least 21 years old of Pima Indian heritage. Note: the custom UI simulates a general hospital environment by making the gender field dynamic, while the model itself is trained only on this female cohort.
InsulinMetric/
│
├── data/
│   └── diabetes.csv                      # Raw dataset
├── output/                               # Auto-generated plots from model training
│   ├── 01_dirty_feature_importance.png
│   ├── 02_missing_values_heatmap.png
│   ├── 03_insulin_distribution.png
│   ├── 04_scaling_comparison.png
│   ├── 05_final_feature_importance.png
│   └── 06_evaluation_metrics.png
├── venv/                                 # Isolated Python virtual environment (ignored in git)
│
├── interface.py                          # Clinical GUI Application (Frontend)
├── main.py                               # ML Pipeline & Training Script (Backend)
├── medical_model.pkl                     # Exported 'Brain' (Model + Scaler + Features)
└── requirements.txt                      # Project dependencies
To run this project locally, ensure you have an active virtual environment, then install the required dependencies:
# Activate virtual environment (Windows)
.\venv\Scripts\activate
# For Mac/Linux: source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Dependencies (requirements.txt):
pandas
numpy
scikit-learn
matplotlib
seaborn
joblib
tkcalendar
The backend is structured into distinct, logical blocks representing a professional Data Science workflow:
A centralized dictionary (CONFIG) manages hyperparameters, paths, and clinical thresholds (e.g., recall_threshold: 0.35). This makes the model easily tunable without diving deep into the code.
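A minimal sketch of what such a dictionary might look like. Only `recall_threshold`, the 200 trees, the balanced class weights, the 5 folds, and the file paths are taken from this README; the other key names and the `test_size` and `random_state` values are assumptions for illustration.

```python
# Hypothetical layout of the central configuration. Key names other than
# recall_threshold are illustrative, not copied from main.py.
CONFIG = {
    "data_path": "data/diabetes.csv",
    "output_dir": "output",
    "model_path": "medical_model.pkl",
    "recall_threshold": 0.35,   # decision threshold used for screening
    "n_estimators": 200,
    "class_weight": "balanced",
    "cv_folds": 5,
    "test_size": 0.2,           # assumed value
    "random_state": 42,         # assumed value
}


def validate_config(config: dict) -> None:
    """Fail early if a clinical threshold is outside a sensible range."""
    if not 0.0 < config["recall_threshold"] < 1.0:
        raise ValueError("recall_threshold must be strictly between 0 and 1")
    if config["n_estimators"] < 1 or config["cv_folds"] < 2:
        raise ValueError("n_estimators must be >= 1 and cv_folds >= 2")


validate_config(CONFIG)
```

Keeping the threshold in one place means the training script and the GUI can read the same value instead of hard-coding it twice.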
Medical datasets often use 0 to represent missing data. The pipeline first identifies these "false zeros" in critical columns (like Insulin and Blood Pressure) and converts them to NaN, so that the algorithm does not learn incorrect biological baselines from them.
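The false-zero step could be sketched roughly as follows. The column names follow the public Pima CSV; the exact list of columns treated as "zero means missing" in `main.py` is an assumption here.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is biologically implausible (assumed list).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def mark_false_zeros(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Return a copy of df where zeros in the given columns become NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out


sample = pd.DataFrame({
    "Pregnancies": [0, 2],        # 0 is a legitimate value here and is kept
    "Glucose": [148, 0],
    "BloodPressure": [72, 66],
    "SkinThickness": [35, 0],
    "Insulin": [0, 94],
    "BMI": [33.6, 0.0],
})
cleaned = mark_false_zeros(sample)
print(cleaned.isna().sum())
```

Columns such as Pregnancies are deliberately left out of the list, since zero is a valid count there.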
Instead of using a global mean, the project applies domain-specific logical imputation:
- Age Quartiles: Missing Blood Pressure, BMI, and Skin Thickness are imputed based on the patient's age group.
- Target-based: Missing Glucose and Insulin are imputed per outcome class (Diabetic vs. Healthy), so that imputed values follow each class's own distribution instead of blurring the difference between the classes.
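Both imputation strategies above can be expressed with one group-wise helper. This is a sketch under assumptions: the README does not say which statistic is used, so the group median is chosen here, and `impute_by_group` is a hypothetical name. Outcome-based grouping is only possible where the label is known, that is, on the training data.

```python
import numpy as np
import pandas as pd


def impute_by_group(df: pd.DataFrame, columns, group_key) -> pd.DataFrame:
    """Fill NaN in each column with the median of the row's group (assumed statistic)."""
    out = df.copy()
    for col in columns:
        group_median = out.groupby(group_key)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out


df = pd.DataFrame({
    "Age": [21, 22, 30, 31, 45, 46, 60, 61],
    "BMI": [22.0, np.nan, 28.0, 30.0, np.nan, 34.0, 31.0, 29.0],
    "Insulin": [80.0, np.nan, 90.0, 200.0, 210.0, np.nan, 100.0, 220.0],
    "Outcome": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Age quartiles: four equally sized age groups, labelled 0..3.
age_group = pd.qcut(df["Age"], q=4, labels=False)
df = impute_by_group(df, ["BMI"], age_group)

# Target-based: group by the known outcome.
df = impute_by_group(df, ["Insulin"], df["Outcome"])
print(df)
```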
Data is split using stratify=y so that the training and test sets keep the same proportion of diabetic patients as the full dataset. Features are then standardized using Z-Score Scaling (StandardScaler) so that high-magnitude variables (like Insulin) do not overshadow low-magnitude variables (like the Diabetes Pedigree Function).
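A small illustration of the split and scaling step on synthetic data. The `test_size` and `random_state` values are assumptions, and fitting the scaler on the training portion only is common practice rather than something this README states.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one high-magnitude and one low-magnitude feature.
rng = np.random.default_rng(0)
X = rng.normal(loc=[120.0, 0.5], scale=[30.0, 0.3], size=(100, 2))
y = np.array([0] * 65 + [1] * 35)   # 35% positive class

# stratify=y keeps the 65/35 class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on the training data only, then reuse the same statistics for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("positive rate (train/test):", y_train.mean(), y_test.mean())
print("train means after scaling:", X_train_scaled.mean(axis=0).round(6))
```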
An ensemble of decision trees (n_estimators=200) is trained. class_weight='balanced' is utilized to force the algorithm to pay closer attention to the minority class (Diabetic patients).
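In scikit-learn terms, the training step described above might look like the sketch below. The data is synthetic (`make_classification`) and `random_state` is an assumption; only `n_estimators=200` and `class_weight='balanced'` come from this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the 8 diagnostic features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)

model = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",   # reweights classes inversely to their frequency
    random_state=42,           # assumed, for reproducibility
)
model.fit(X, y)

# Probability of the positive (diabetic) class for each sample.
proba_positive = model.predict_proba(X)[:, 1]
```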
The model is evaluated primarily on Recall rather than on Accuracy alone. A custom decision threshold (0.35, below the default of 0.5) makes it behave as a strict medical screening tool, accepting more False Positives in exchange for fewer False Negatives.
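The effect of the lower threshold can be seen on a toy example. The probabilities below are made up for illustration, and whether the project compares with `>=` or `>` is an assumption.

```python
import numpy as np
from sklearn.metrics import recall_score


def predict_with_threshold(proba_positive, threshold=0.35):
    """Flag a patient as positive when P(diabetic) reaches the threshold."""
    return (np.asarray(proba_positive) >= threshold).astype(int)


# Illustrative values only, not model output.
y_true = np.array([1, 1, 1, 0, 0, 0])
proba = np.array([0.90, 0.40, 0.36, 0.38, 0.20, 0.10])

default_pred = predict_with_threshold(proba, threshold=0.50)
screening_pred = predict_with_threshold(proba, threshold=0.35)

recall_default = recall_score(y_true, default_pred)
recall_screening = recall_score(y_true, screening_pred)
print(f"recall at 0.50: {recall_default:.2f}, recall at 0.35: {recall_screening:.2f}")
```

In this toy case the lower threshold recovers two positives that the default would miss, and also flags one healthy patient, which is the trade-off a screening tool accepts.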
To check stability and rule out "lucky splits," a 5-Fold Cross-Validation is performed, verifying that the model's AUC stays consistent across different subsets of the data.
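A cross-validation check of this kind could be sketched as follows, again on synthetic data. Using `StratifiedKFold` with shuffling and the `roc_auc` scorer is an assumption about how the 5 folds are built.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)

# Five stratified folds; each fold is scored by ROC AUC on its held-out part.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", auc_scores.round(3))
print(f"mean AUC: {auc_scores.mean():.3f} (std {auc_scores.std():.3f})")
```

A small standard deviation across folds is what indicates that the result does not hinge on one favourable split.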
The frontend is a custom desktop application built with Tkinter that simulates a hospital's ERP software.
Key UI Features:
- Dynamic Fields: If the user selects "Male", the "Pregnancies" field is hidden automatically and a value of 0.0 is passed to the model in its place, so the feature vector keeps the shape the model expects.
- Automated DPF Calculation: Instead of asking the user for a complex decimal (Diabetes Pedigree Function), the UI asks simple questions about their family history and maps it to the correct statistical weight (0.15 to 1.90).
- Age Calculation: Automatically derives the patient's age from their Date of Birth.
- Real-time Pipeline: Loads the StandardScaler saved during training and uses it to scale new user inputs on the fly before predicting.
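The logic behind these UI features, separated from the Tkinter widgets, might look like the sketch below. Everything named here is hypothetical: the form keys, the `build_row` helper, the bundle keys, and the intermediate DPF weights (only the 0.15 and 1.90 endpoints come from this README). A throwaway model trained on random data stands in for the real `medical_model.pkl`.

```python
import os
import tempfile
from datetime import date

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Family-history answer -> DPF weight. Middle values are assumed.
DPF_BY_FAMILY_HISTORY = {"none": 0.15, "distant_relative": 0.50,
                         "one_parent": 1.00, "both_parents": 1.90}


def age_from_dob(dob: date, today: date) -> int:
    """Completed years between date of birth and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def build_row(form: dict, today: date) -> np.ndarray:
    """Turn form answers into a 1 x 8 feature row in training column order."""
    values = {
        # Hidden field for male patients: inject 0.0 to keep the shape.
        "Pregnancies": 0.0 if form["sex"] == "Male" else float(form["pregnancies"]),
        "Glucose": form["glucose"], "BloodPressure": form["blood_pressure"],
        "SkinThickness": form["skin_thickness"], "Insulin": form["insulin"],
        "BMI": form["bmi"],
        "DiabetesPedigreeFunction": DPF_BY_FAMILY_HISTORY[form["family_history"]],
        "Age": age_from_dob(form["dob"], today),
    }
    return np.array([[float(values[name]) for name in FEATURES]])


# Stand-in for main.py: fit on random data and export model + scaler + features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, len(FEATURES)))
y_train = np.array([0, 1] * 30)
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(scaler.transform(X_train), y_train)
bundle_path = os.path.join(tempfile.mkdtemp(), "medical_model.pkl")
joblib.dump({"model": model, "scaler": scaler, "features": FEATURES}, bundle_path)

# GUI side: load the bundle, scale the new input with the saved scaler, predict.
bundle = joblib.load(bundle_path)
form = {"sex": "Male", "pregnancies": "", "glucose": 130, "blood_pressure": 80,
        "skin_thickness": 25, "insulin": 100, "bmi": 29.5,
        "family_history": "both_parents", "dob": date(1980, 5, 20)}
row = build_row(form, today=date(2024, 1, 1))
proba = bundle["model"].predict_proba(bundle["scaler"].transform(row))[0, 1]
flagged = proba >= 0.35
```

Reusing the scaler stored in the bundle, rather than fitting a new one, is what keeps GUI inputs on the same scale the model saw during training.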
The following visualizations highlight the data cleaning process and the final performance of our clinical model.
Showing how untreated 0 values (false zeros) hide the true predictive power of key biological markers like Insulin.
Visualizing the exact locations of the medical NaN values before our smart imputation.
The Insulin distribution after applying target-based imputation, ensuring smooth, realistic data curves.
Comparing the raw data (left) with the Z-Score scaled data (right) to ensure all features contribute equally during training.
After smart imputation and scaling, Glucose and Insulin correctly emerge as the dominant biological predictors for the Random Forest.
By lowering the decision threshold to 0.35, the model acts as a highly sensitive screening tool. The Confusion Matrix demonstrates exceptional Recall (identifying the vast majority of positive cases), while the ROC curve highlights the overall diagnostic power (AUC).





