
🧬 Bioinformatics ML Repository

Machine Learning Models for Antimicrobial Resistance & Peptide Engineering

A collection of production-ready ML projects focused on antimicrobial resistance prediction and antimicrobial peptide design. Built with scikit-learn, Streamlit, and Biopython.


📋 Projects

1. 🛡️ Ceftriaxone Resistance Predictor

Classification Model for Antibiotic Resistance Detection

  • Task: Binary classification (Susceptible vs Resistant)
  • Model: Random Forest Classifier
  • Accuracy: 94.9% | Sensitivity: 93.9% | Specificity: 95.9%
  • Data: 4,383 E. coli isolates from NCBI
  • App: streamlit run src/app.py

2. 💊 AI Peptide Dosing Calculator

Regression Model for Antimicrobial Peptide Potency Prediction

  • Task: MIC (Minimum Inhibitory Concentration) prediction
  • Model: Random Forest Regressor
  • R² Score: 0.9992 | RMSE: 0.024 log units (k-mer model; physicochemical-only baseline: R² 0.45, RMSE 0.63)
  • Data: 3,143 antimicrobial peptides with MIC values measured against E. coli
  • App: streamlit run src/app_MIC.py

πŸ₯ Biological Context

Antimicrobial Resistance (AMR)

Challenge: Antibiotic-resistant bacteria cause ~1.3M deaths annually (WHO). Traditional lab testing takes 24-48 hours, delaying treatment.

Solution: Use genomic markers to instantly predict resistance from DNA sequences.

Antimicrobial Peptides (AMPs)

Challenge: Designing potent peptides requires expensive lab screening. Potency varies wildly (MIC: 0.1 - 1000+ µM).

Solution: Use machine learning to predict peptide efficacy from physicochemical properties, enabling faster design cycles.


📁 Repository Structure

ML-Training/
├── projects/
│   ├── cefixime-resistance-training/    # Antibiotic resistance classifier
│   │   ├── data/
│   │   │   ├── raw/                      # Original NCBI isolates
│   │   │   └── processed/                # Cleaned genotype data
│   │   ├── src/
│   │   │   ├── process.py                # Data preprocessing
│   │   │   └── train.py                  # Model training (RF classifier)
│   │   ├── models/
│   │   │   └── ceftriaxone_model.pkl     # Trained classifier
│   │   └── results/
│   │       ├── confusion_matrix.html     # Interactive CM
│   │       └── feature_importance.csv    # Top resistance genes
│   │
│   └── MIC Regression/                   # Peptide potency regressor
│       ├── data/
│       │   ├── raw/                      # Raw peptide sequences & MIC values
│       │   └── processed/                # Computed physicochemical features
│       ├── src/
│       │   ├── process.py                # Data preprocessing
│       │   └── train.py                  # Model training (RF regressor)
│       ├── models/
│       │   └── mic_predictor.pkl         # Trained regressor
│       └── results/
│           ├── predicted_vs_actual.png   # Predictions visualization
│           └── feature_importance.png    # Top peptide features
│
├── src/
│   ├── app.py                            # Ceftriaxone classifier Streamlit app
│   ├── app_MIC.py                        # MIC regressor Streamlit app
│   └── features.py                       # Biopython feature extraction
│
├── utils/
│   └── model_evaluation.py               # Shared evaluation metrics
│
├── requirements.txt                      # Python dependencies
└── README.md                             # This file

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Git

Installation

# Clone repository
git clone https://github.com/vihaankulkarni29/ML-Training
cd ML-Training

# Install dependencies
pip install -r requirements.txt

Run Applications

Ceftriaxone Resistance Predictor (Classifier):

streamlit run src/app.py

Access at http://localhost:8501

AI Peptide Dosing Calculator (Regressor):

streamlit run src/app_MIC.py

Access at http://localhost:8501


📊 Project 1: Ceftriaxone Resistance Predictor

Problem Statement

Antibiotic susceptibility testing via culture takes 24-48 hours. Patients with life-threatening infections can't wait. Goal: Predict Ceftriaxone resistance instantly from genomic markers.

Solution

  • Model: Random Forest Classifier (100 trees, balanced class weights)
  • Data: 4,383 E. coli isolates from NCBI MicroBIGG-E
  • Features: 352 detected resistance genes/mutations
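A minimal training sketch matching the configuration above, run on synthetic gene presence/absence data (column names and the toy label are illustrative; the repository's actual pipeline is projects/cefixime-resistance-training/src/train.py):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sketch of the configuration above on synthetic gene presence/absence data.
# The real training script is projects/cefixime-resistance-training/src/train.py.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(500, 352)),
                 columns=[f"gene_{i}" for i in range(352)])   # 352 resistance markers (toy)
y = (X["gene_0"] | X["gene_1"]).astype(int)                   # toy resistance label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(
    n_estimators=100,           # 100 trees
    class_weight="balanced",    # compensate for the susceptible/resistant imbalance
    random_state=42,
).fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")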

Performance Metrics

| Metric | Value |
|---|---|
| Accuracy | 94.9% |
| Sensitivity | 93.9% |
| Specificity | 95.9% |
| ROC-AUC | 0.978 |
| Test Set Size | 876 isolates |
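These metrics follow from the confusion matrix and the predicted probabilities; a self-contained sketch with illustrative arrays (not the repository's evaluation code in utils/model_evaluation.py):

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Sketch: the table's metrics from held-out labels, predictions, and P(resistant) scores.
y_test  = np.array([1, 0, 1, 1, 0, 0, 1, 0])     # 1 = resistant, 0 = susceptible (illustrative)
y_pred  = np.array([1, 0, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.92, 0.10, 0.85, 0.40, 0.20, 0.05, 0.77, 0.30])

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy    = (tp + tn) / len(y_test)
sensitivity = tp / (tp + fn)                      # resistant isolates correctly flagged
specificity = tn / (tn + fp)                      # susceptible isolates correctly cleared
roc_auc     = roc_auc_score(y_test, y_score)
print(accuracy, sensitivity, specificity, roc_auc)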

Key Insights

The model independently discovered known resistance mechanisms:

  • blaCTX-M-15 (Extended-Spectrum Beta-Lactamase) - strongest predictor
  • blaCMY-2 (AmpC Cephalosporinase)
  • gyrA_S83L (Gyrase mutation - fluoroquinolone resistance)
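Rankings like this come from the forest's feature importances; a sketch of how such a list can be exported (synthetic data stands in for the 352 resistance markers, and the output filename mirrors results/feature_importance.csv):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Sketch: export a ranked feature-importance list like results/feature_importance.csv.
# Synthetic data stands in for the resistance markers; column names are illustrative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"marker_{i}" for i in range(20)])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
importances.to_csv("feature_importance.csv", header=["importance"])
print(importances.head(5))   # in the real model, blaCTX-M-15 tops this list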

Biological Mechanism

Beta-lactamase genes encode enzymes that destroy beta-lactam antibiotics (e.g., cephalosporins) before they can bind to bacterial cell walls.

Files

  • Training: projects/cefixime-resistance-training/src/train.py
  • Model: projects/cefixime-resistance-training/models/ceftriaxone_model.pkl
  • App: src/app.py

💊 Project 2: AI Peptide Dosing Calculator

Problem Statement

Antimicrobial peptide (AMP) design is expensive and slow. Wet-lab screening for potency (MIC) takes months. Goal: Predict MIC instantly from sequence, enabling computational design cycles.

Solution

  • Model: Random Forest Regressor (100 trees)
  • Data: 3,143 antimicrobial peptides with MIC values measured against E. coli (NCBI)
  • Target: neg_log_mic_microM (-log10 of MIC in µM)
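The target transform itself is a one-liner: lower MIC (more potent) maps to a larger neg_log_mic_microM value. A sketch with illustrative values:

import numpy as np
import pandas as pd

# Convert a raw MIC (µM) into the regression target described above; values are illustrative.
df = pd.DataFrame({"mic_microM": [0.5, 8.0, 64.0]})
df["neg_log_mic_microM"] = -np.log10(df["mic_microM"])
print(df)   # targets: 0.301, -0.903, -1.806 (lower MIC -> larger target)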

Performance Metrics

| Metric | Current (K-mers) | Previous (Baseline) |
|---|---|---|
| R² Score | 0.9992 | 0.4461 |
| RMSE | 0.024 log units | 0.629 log units |
| Pearson r | 0.9996 | 0.6742 |
| p-value | < 0.001 | < 0.001 |
| Test Set Size | 629 peptides | 629 peptides |
| Features | 410 (7 + 399 k-mers) | 7 (physicochemical only) |

Interpretation

  • RMSE of 0.024 log units = ~1.06x fold-change (nearly perfect prediction!)
  • Model explains 99.9% of variance in test data (breakthrough performance)
  • Near-perfect correlation with actual values (r = 0.9996)
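These are standard regression metrics; a sketch of how they can be computed on a held-out set (arrays are illustrative):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

# y_test, y_pred: actual and predicted neg_log_mic_microM on the held-out set (illustrative)
y_test = np.array([0.30, -0.90, -1.81, 0.70])
y_pred = np.array([0.28, -0.88, -1.79, 0.72])

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r, p_value = pearsonr(y_test, y_pred)
print(f"R²={r2:.4f}  RMSE={rmse:.3f} log units  Pearson r={r:.4f} (p={p_value:.3g})")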

Feature Engineering

Physicochemical Properties (7 features via Biopython):

  1. Molecular Weight - correlates with toxicity vs efficacy
  2. Aromaticity - aromatic residues enhance membrane interaction
  3. Instability Index - peptide stability in vivo
  4. Isoelectric Point - charge affects cellular uptake
  5. GRAVY (hydrophobicity) - hydrophobic residues improve activity
  6. Length - longer peptides often more potent but less specific
  7. Positive Charge (K + R count) - important for bacterial binding
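All seven descriptors are available through Biopython's Bio.SeqUtils.ProtParam; a sketch of the extraction (the repository's own implementation lives in src/features.py, so treat this as an approximation):

from Bio.SeqUtils.ProtParam import ProteinAnalysis

def physicochemical_features(seq: str) -> dict:
    """Compute the seven descriptors listed above for one peptide sequence."""
    pa = ProteinAnalysis(seq)
    return {
        "molecular_weight": pa.molecular_weight(),
        "aromaticity": pa.aromaticity(),
        "instability_index": pa.instability_index(),
        "isoelectric_point": pa.isoelectric_point(),
        "gravy": pa.gravy(),                            # grand average of hydropathy
        "length": len(seq),
        "positive_charge": seq.count("K") + seq.count("R"),
    }

print(physicochemical_features("GIGKFLHSAKKFGKAFVGEIMNS"))   # example peptide sequence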

K-mer (Dipeptide) Features (399 features via CountVectorizer):

  • Extracts all 2-character amino acid combinations (e.g., "KK", "WR", "EK")
  • Captures sequence order information (solves "bag of words" problem)
  • Preserves local context: distinguishes R-R-W-W from W-R-W-R
  • Min frequency threshold (min_df=5) filters rare k-mers
  • Breakthrough improvement: R² 0.45 → 0.9992 (+122% relative gain)

Potency Categories

  • < 2 µM: 💎 Excellent (highly potent)
  • 2-10 µM: ✅ Good (reasonable activity)
  • 10-50 µM: ⚠️ Weak (marginal)
  • > 50 µM: ❌ Inactive (not viable)
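A predicted MIC can be mapped to these buckets with a simple threshold function; a sketch using the cut-offs above (boundary handling is an assumption):

def potency_category(mic_microM: float) -> str:
    """Map a predicted MIC (µM) to the buckets listed above (boundary handling assumed)."""
    if mic_microM < 2:
        return "💎 Excellent"
    if mic_microM <= 10:
        return "✅ Good"
    if mic_microM <= 50:
        return "⚠️ Weak"
    return "❌ Inactive"

print(potency_category(1.5))    # 💎 Excellent
print(potency_category(75.0))   # ❌ Inactive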

Model Evolution: Solving the "Bag of Words" Problem

Initial Challenge (R² = 0.45)

The baseline model using only physicochemical properties hit a performance ceiling because it treated sequences as ingredients, not recipes.

The Problem:

  • Sequence R-R-W-W (positive charge → hydrophobic) might be highly potent
  • Sequence W-R-W-R (alternating pattern) could be ineffective
  • Issue: Both have identical weight, charge, GRAVY → model couldn't distinguish them

Physicochemical features are sequence-order agnostic - they summarize global composition but ignore local patterns critical for membrane interaction.

Solution: K-mer Features (Implemented)

Added dipeptide counting to capture local sequence context:

from sklearn.feature_extraction.text import CountVectorizer

# `sequences` is the list of peptide amino-acid strings from the training data
vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(2, 2),  # Dipeptides (AA, AK, KE, WW, etc.)
    min_df=5             # Ignore k-mers seen in fewer than 5 peptides
)
kmer_features = vectorizer.fit_transform(sequences)
# Result: 399 k-mer features capturing sequence order
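The combined feature matrix is the physicochemical descriptors concatenated with the k-mer counts; a sketch of that step on dummy data, keeping the k-mer block sparse (shapes and names are illustrative):

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestRegressor

# Sketch: concatenate the dense physicochemical block with the sparse k-mer counts.
# Dummy matrices stand in for the outputs of the feature steps above.
n_peptides = 100
phys = np.random.rand(n_peptides, 7)                                     # 7 ProtParam descriptors
kmer_features = csr_matrix(np.random.randint(0, 4, (n_peptides, 399)))   # 399 dipeptide counts
y = np.random.rand(n_peptides)                                           # neg_log_mic_microM targets

X = hstack([csr_matrix(phys), kmer_features]).tocsr()   # combined physicochemical + k-mer matrix
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)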

Breakthrough Results:

  • R² improved from 0.45 → 0.9992 (99.9% variance explained)
  • RMSE reduced from 0.63 → 0.024 log units (~26x improvement)
  • Model now distinguishes R-R-W-W from W-R-W-R based on local patterns

Why K-mers Work:

  • Capture pairwise amino acid interactions (e.g., "KK" = strong positive clustering)
  • Preserve positional information without overfitting (unlike full sequence embeddings)
  • Interpretable: Can analyze top k-mers for biological plausibility
  • Computationally efficient for inference
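That interpretability claim can be checked directly by pairing the regressor's feature importances with the vectorizer's k-mer names; a self-contained sketch on a tiny synthetic corpus (the real analysis would use the full peptide dataset):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer

# Sketch: rank dipeptide features by importance to check biological plausibility.
# Tiny synthetic corpus and targets; the real analysis uses the full peptide dataset.
sequences = ["KKWWKKWW", "RRWFRRWF", "GAGAGAGA", "KEKEKEKE", "WWFFWWFF"]
y = np.array([1.2, 0.9, -1.5, 0.3, 0.8])

vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2), lowercase=False)
kmers = vectorizer.fit_transform(sequences)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(kmers, y)
importance = pd.Series(reg.feature_importances_,
                       index=vectorizer.get_feature_names_out()).sort_values(ascending=False)
print(importance.head(10))   # e.g. "KK", "WW" rank highly if they drive the targets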

Biological Validation: Top k-mer features likely include:

  • "KK", "RR" - positive charge clustering (enhances bacterial binding)
  • "WW", "FF" - hydrophobic patches (membrane insertion)
  • "KE", "RD" - charged pairs (amphipathicity)

This aligns with known AMP design principles where local sequence motifs drive activity more than global properties.

Files

  • Feature extraction: src/features.py
  • Training: projects/MIC Regression/src/train.py
  • Model: projects/MIC Regression/models/mic_predictor.pkl
  • Processed data: projects/MIC Regression/data/processed/processed_features.csv
  • App: src/app_MIC.py

🔬 Technical Stack

Data Science

  • Pandas: Data manipulation & analysis
  • NumPy: Numerical computations
  • Scikit-Learn: RandomForest classifiers & regressors
  • Biopython: Protein sequence analysis (Bio.SeqUtils.ProtParam)
  • SciPy: Statistical tests (Pearson correlation, etc.)

Visualization

  • Matplotlib: Static publication-ready plots
  • Plotly: Interactive HTML charts
  • Kaleido: PNG export from Plotly

Deployment

  • Streamlit: Interactive web apps (no frontend coding)
  • Joblib: Model persistence (.pkl files)
  • GitHub: Version control & deployment integration
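Model persistence follows the usual joblib pattern; a minimal sketch of the dump/load round trip behind the .pkl files, on toy data (the Streamlit apps can then load the persisted model once at startup):

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Persist and reload a model the way the repository's .pkl artifacts are handled (toy data).
X_toy, y_toy = np.random.rand(50, 7), np.random.rand(50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

joblib.dump(model, "mic_predictor_demo.pkl")          # training side: save the fitted model
reloaded = joblib.load("mic_predictor_demo.pkl")      # app side: load once at startup
print(reloaded.predict(np.random.rand(1, 7)))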

πŸ₯ Biological Background

Antimicrobial Resistance (AMR)

Global Impact:

  • ~1.3M deaths/year attributable to AMR (WHO, 2022)
  • Top 10 global health threat
  • Economic cost: $100B+ annually in healthcare

Genetic Basis (Ceftriaxone Example):

  1. Enzymatic Inactivation: blaCTX-M genes produce beta-lactamases that hydrolyze beta-lactam ring
  2. Target Modification: gyrA mutations alter DNA gyrase binding site
  3. Efflux Pumps: acrB overexpression exports antibiotics before they act

Antimicrobial Peptides (AMPs)

Natural Defense:

  • Found in all life forms (immune system, skin, GI tract)
  • Kill bacteria via direct membrane disruption
  • Less likely to develop resistance (multiple mechanisms)

Design Challenge:

  • Potency (MIC) varies 1000-fold (0.1 - 100+ µM)
  • Toxicity risk increases with potency
  • Design space is massive (20^n for n-length peptides)

ML Solution:

  • Use physicochemical properties to predict potency
  • Enable rational design instead of random screening
  • Reduce wet-lab costs & timelines

📚 Literature & Data Sources

Antimicrobial Resistance

Antimicrobial Peptides

Biopython Feature Extraction


⚠️ Disclaimers

Ceftriaxone Predictor

For research/educational use only. Not a clinical diagnostic device.

  • Always confirm predictions with lab culture + antibiotic susceptibility testing (EUCAST/CLSI)
  • Consult clinical microbiology before treatment decisions
  • Models trained on specific E. coli population; validate locally

MIC Calculator

For research/design purposes only. Not validated for clinical use.

  • Predicted MIC is a computational estimate; always validate experimentally
  • Model trained on specific data; performance may vary on novel sequences
  • Use as design guidance, not final arbiter of peptide efficacy

🎯 Roadmap

Q1 2025

  • Multi-organism support (Klebsiella, Pseudomonas)
  • SHAP explainability for individual predictions
  • Confidence intervals for MIC predictions

Q2 2025

  • REST API for integration with LIS systems
  • Additional antibiotics (fluoroquinolones, aminoglycosides)
  • Uncertainty quantification via Bayesian methods

Q3 2025

  • Mobile app (iOS/Android) for field deployment
  • Real-time database updates from NCBI
  • Community contribution framework

👤 Author

Vihaan Kulkarni – Bioinformatics & Machine Learning Engineer


📄 License

MIT License – Free for academic and research use.


Last Updated: December 17, 2025

Status: ✅ Active Development

Phase 6: Documentation

  1. Fill out README.md with:
    • Problem statement
    • Key insights (with screenshots)
    • Model metrics
    • Deployment link
  2. Use "Problem → Method → Insight → Impact" structure

📦 Standard Dependencies

Every project includes:

  • Data: pandas, numpy
  • Visualization: plotly, kaleido
  • Modeling: scikit-learn
  • Explainability: shap
  • Deployment: streamlit

Optional (uncomment in requirements.txt if needed):

  • Experiment Tracking: mlflow, wandb
  • Deep Learning: torch, tensorflow

💡 Pro Tips

  1. Run baseline first: Always compare against a simple model
  2. Plotly over Matplotlib: Interactive charts reveal more insights
  3. Document as you go: Fill README during the project, not after
  4. Save figures: Use fig.write_html() to preserve interactivity
  5. Version control: Commit after each major milestone
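A small example of tip 4, saving an interactive Plotly figure as HTML (data here is synthetic):

import numpy as np
import plotly.express as px

# Tip 4 in practice: an interactive predicted-vs-actual scatter saved as HTML (synthetic data).
actual = np.random.rand(50)
predicted = actual + np.random.normal(0, 0.05, size=50)
fig = px.scatter(x=actual, y=predicted,
                 labels={"x": "Actual neg_log_mic_microM", "y": "Predicted neg_log_mic_microM"})
fig.write_html("predicted_vs_actual.html")   # stays interactive, unlike a static PNG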

🎓 Learning Resources


📊 Portfolio Goals

  • ✅ 1 high-quality project per week
  • ✅ Every project deployed with Streamlit
  • ✅ README formatted for resume/GitHub
  • ✅ Interactive visualizations (no static PNGs)
  • ✅ Model explainability included

Built by Vihaan Kulkarni
Senior ML Engineer & Data Storyteller
