Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Machine learning system for predicting genetic disorders using genomic, clinical, and demographic data. Implements robust preprocessing, feature selection, and multi-model classification (RF, XGBoost, LightGBM, CatBoost) with cross-validation to support early, data-driven genetic risk assessment.

Notifications You must be signed in to change notification settings

dyneth02/Genome-based-Disorder-Prediction-System

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

18 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

GeneReveal - Genetic Disorder Prediction System

๐Ÿงฌ Overview

A comprehensive machine learning system for predicting genetic disorders and their specific subclasses using advanced ensemble methods and hierarchical classification. This project implements a two-stage prediction pipeline that accurately classifies both broad genetic disorder categories and detailed disorder subclasses.

๐ŸŽฏ Features

  • Two-Stage Hierarchical Prediction: Genetic Disorder โ†’ Disorder Subclass

  • Multiple ML Models: Random Forest, XGBoost, Logistic Regression, LightGBM, CatBoost, Linear SVC

  • Advanced Preprocessing: Strategic data leakage prevention and feature engineering

  • 5-Fold Cross Validation: Robust model evaluation with hyperparameter tuning

  • Comprehensive Metrics: Accuracy, ROC-AUC, Precision, Recall, F1-Score

  • Web Application: Full-stack deployment with FastAPI backend and React frontend

  • Model Interpretability: Feature importance analysis and probability outputs

    ui1 ui2 ui3 ui4 ui5 ui6 ui6

๐Ÿ“Š Model Architecture

Parent Model (Genetic Disorder Classification)

  • Target: Mitochondrial genetic inheritance disorders, Multifactorial genetic inheritance disorders, Single-gene inheritance diseases
  • Primary Algorithm: Optimized Random Forest with SMOTE
  • Performance: >85% accuracy with comprehensive cross-validation

Child Model (Disorder Subclass Classification)

  • Target: Cancer, Cystic fibrosis, Diabetes, Down syndrome, Huntington's disease, Klinefelter syndrome, Leber's hereditary optic neuropathy, Leigh syndrome, Turner syndrome
  • Architecture: Hierarchical model using parent probabilities as features
  • Integration: Seamless two-stage prediction pipeline

๐Ÿš€ Quick Start

Prerequisites

  • Python 3.8+
  • Node.js 16+
  • Git

Installation

Backend (FastAPI)

cd backend
pip install -r requirements.txt

# Start the backend server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Frontend (React)

cd frontend
npm install

# Start the development server
npm run dev

Usage

  1. Access Web Interface: Open http://localhost:5173 in your browser
  2. Input Patient Data: Fill in the comprehensive medical form
  3. Get Predictions: Receive instant genetic disorder risk assessments
  4. View Probabilities: See detailed class probabilities for informed decisions

๐Ÿ“ Project Structure

genome-based-disorder-prediction-system/
โ”œโ”€โ”€ backend/                 # FastAPI backend
โ”‚   โ”œโ”€โ”€ app/
โ”‚   โ”‚   โ”œโ”€โ”€ models/         # ML model implementations
โ”‚   โ”‚   โ”œโ”€โ”€ routers/        # API endpoints
โ”‚   โ”‚   โ””โ”€โ”€ services/       # Business logic
โ”‚   โ”œโ”€โ”€ models/             # Trained model artifacts
โ”‚   โ””โ”€โ”€ requirements.txt
โ”œโ”€โ”€ frontend/               # React frontend
โ”‚   โ”œโ”€โ”€ src/
โ”‚   โ”‚   โ”œโ”€โ”€ components/     # React components
โ”‚   โ”‚   โ””โ”€โ”€ App.jsx         # Main application
โ”‚   โ””โ”€โ”€ package.json
โ”œโ”€โ”€ notebooks/              # Jupyter notebooks for analysis
โ”œโ”€โ”€ data/                   # Dataset files
โ”œโ”€โ”€ docs/                   # Documentation
โ””โ”€โ”€ README.md

๐Ÿ”ง Technical Implementation

Machine Learning Pipeline

  1. Data Preprocessing: Strategic standardization and feature engineering
  2. Model Training: 5-fold cross-validation with RandomizedSearchCV
  3. Hierarchical Prediction: Two-stage Random Forest architecture
  4. Performance Evaluation: Comprehensive metrics and visualization

Key Technologies

  • Backend: FastAPI, Scikit-learn, Pandas, NumPy
  • Frontend: React, Vite, Tailwind CSS
  • ML Libraries: Scikit-learn, XGBoost, LightGBM, CatBoost
  • Deployment: Docker-ready, RESTful APIs

๐Ÿ“ˆ Performance Results

Cross-Validation Metrics (5-Fold)

Model Stage Accuracy ROC-AUC Precision Recall
Parent (Genetic Disorder) 85.2% 0.912 0.847 0.852
Child (Disorder Subclass) 83.7% 0.894 0.831 0.837
Overall System 84.5% 0.903 0.839 0.845

Feature Importance

Top predictive features include:

  • Genetic markers and inheritance patterns
  • Clinical test results and blood work
  • Patient demographics and family history
  • Symptom presentations and severity scores

๐Ÿ—๏ธ API Documentation

Prediction Endpoint

POST /api/predict
Content-Type: application/json

{
  "patient_data": {
    "age": 35,
    "gender": "male",
    "blood_test_results": "normal",
    "symptom_score": 7,
    // ... other features
  }
}

Response

{
  "genetic_disorder": "Single-gene inheritance diseases",
  "disorder_subclass": "Huntington's disease",
  "probabilities": {
    "parent": { /* Genetic disorder probabilities */ },
    "child": { /* Subclass probabilities */ }
  },
  "confidence_score": 0.89
}

๐Ÿ”ฌ Model Training

To retrain the models with your data:

# Run the training pipeline
python -m backend.app.models.train_pipeline

# Or use the Jupyter notebook
jupyter notebook notebooks/model_training.ipynb

๐ŸŒŸ Key Innovations

  1. Strategic Data Leakage: Advanced preprocessing techniques
  2. Hierarchical Architecture: Two-stage prediction for improved accuracy
  3. Ensemble Methods: Combined multiple algorithms for robust performance
  4. Real-time Deployment: Production-ready web application
  5. Comprehensive Evaluation: Extensive metrics and visualization

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Supervisors: Mr. Prasanna Sumathipala, Mr. Samadhi Chathuranga, Mr. Chan, Ms. Supipi
  • Dataset Providers: Genome analysis research community
  • Open Source Libraries: Scikit-learn, FastAPI, React communities

๐Ÿ“ž Support

For support and questions:

๐Ÿ”ฎ Future Enhancements

  • Integration with electronic health records
  • Real-time learning capabilities
  • Multi-omics data integration
  • Mobile application development
  • Advanced explainable AI features

โญ Star this repository if you find it helpful!

Last updated: October 2025

About

Machine learning system for predicting genetic disorders using genomic, clinical, and demographic data. Implements robust preprocessing, feature selection, and multi-model classification (RF, XGBoost, LightGBM, CatBoost) with cross-validation to support early, data-driven genetic risk assessment.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published