🧬 Bioinformatics ML Repository

Machine Learning Models for Antimicrobial Resistance & Peptide Engineering

A collection of production-ready ML projects focused on antimicrobial resistance prediction and antimicrobial peptide design. Built with scikit-learn, Streamlit, and Biopython.

📋 Projects

1. 🛡️ Ceftriaxone Resistance Predictor

Classification Model for Antibiotic Resistance Detection

Task: Binary classification (Susceptible vs Resistant)
Model: Random Forest Classifier
Accuracy: 94.9% | Sensitivity: 93.9% | Specificity: 95.9%
Data: 4,383 E. coli isolates from NCBI
App: streamlit run src/app.py

2. 💊 AI Peptide Dosing Calculator

Regression Model for Antimicrobial Peptide Potency Prediction

Task: MIC (Minimum Inhibitory Concentration) prediction
Model: Random Forest Regressor
R² Score: 0.9992 | RMSE: 0.024 log units with k-mer features (physicochemical baseline: 0.45 | 0.63)
Data: 3,143 peptides with MIC values measured against E. coli
App: streamlit run src/app_MIC.py

🏥 Biological Context

Antimicrobial Resistance (AMR)

Challenge: Antibiotic-resistant bacteria cause ~1.3M deaths annually (WHO). Traditional lab testing takes 24-48 hours, delaying treatment.

Solution: Use genomic markers to instantly predict resistance from DNA sequences.

Antimicrobial Peptides (AMPs)

Challenge: Designing potent peptides requires expensive lab screening. Potency varies wildly (MIC: 0.1 - 1000+ µM).

Solution: Use machine learning to predict peptide efficacy from physicochemical properties and sequence k-mers, enabling faster design cycles.

📁 Repository Structure

ML-Training/
├── projects/
│   ├── cefixime-resistance-training/    # Antibiotic resistance classifier
│   │   ├── data/
│   │   │   ├── raw/                      # Original NCBI isolates
│   │   │   └── processed/                # Cleaned genotype data
│   │   ├── src/
│   │   │   ├── process.py                # Data preprocessing
│   │   │   └── train.py                  # Model training (RF classifier)
│   │   ├── models/
│   │   │   └── ceftriaxone_model.pkl    # Trained classifier
│   │   └── results/
│   │       ├── confusion_matrix.html     # Interactive CM
│   │       └── feature_importance.csv    # Top resistance genes
│   │
│   └── MIC Regression/                   # Peptide potency regressor
│       ├── data/
│       │   ├── raw/                      # Raw peptide sequences & MIC values
│       │   └── processed/                # Computed physicochemical features
│       ├── src/
│       │   ├── process.py                # Data preprocessing
│       │   └── train.py                  # Model training (RF regressor)
│       ├── models/
│       │   └── mic_predictor.pkl        # Trained regressor
│       └── results/
│           ├── predicted_vs_actual.png   # Predictions visualization
│           └── feature_importance.png    # Top peptide features
│
├── src/
│   ├── app.py                            # Ceftriaxone classifier Streamlit app
│   ├── app_MIC.py                        # MIC regressor Streamlit app
│   └── features.py                       # Biopython feature extraction
│
├── utils/
│   └── model_evaluation.py               # Shared evaluation metrics
│
├── requirements.txt                      # Python dependencies
└── README.md                             # This file

🚀 Quick Start

Prerequisites

Python 3.8+
Git

Installation

# Clone repository
git clone https://github.com/vihaankulkarni29/ML-Training
cd ML-Training

# Install dependencies
pip install -r requirements.txt

Run Applications

Ceftriaxone Resistance Predictor (Classifier):

streamlit run src/app.py

Access at http://localhost:8501

AI Peptide Dosing Calculator (Regressor):

streamlit run src/app_MIC.py

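Access at http://localhost:8501 (if the classifier app is still running, Streamlit picks the next free port, typically 8502, and prints the URL in the terminal)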

📊 Project 1: Ceftriaxone Resistance Predictor

Problem Statement

Antibiotic susceptibility testing via culture takes 24-48 hours. Patients with life-threatening infections can't wait. Goal: Predict Ceftriaxone resistance instantly from genomic markers.

Solution

Model: Random Forest Classifier (100 trees, balanced class weights)
Data: 4,383 E. coli isolates from NCBI MicroBIGG-E
Features: 352 detected resistance genes/mutations
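
A minimal sketch of the setup summarized above, assuming each isolate is encoded as a binary presence/absence vector over detected genes. The toy matrix, the labels, and `random_state` are illustrative assumptions, not the repository's data or its actual training script.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: rows are isolates, columns are gene presence (1) / absence (0).
gene_names = ["blaCTX-M-15", "blaCMY-2", "gyrA_S83L"]
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
])
# 1 = resistant, 0 = susceptible (toy rule: resistant when a bla gene is present).
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

# 100 trees with balanced class weights, as stated in the summary above.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)

# Estimated resistance probability for a new isolate carrying only blaCTX-M-15.
new_isolate = np.array([[1, 0, 0]])
prob_resistant = clf.predict_proba(new_isolate)[0, 1]
```

Balanced class weights reweight each class inversely to its frequency, which matters when resistant isolates are the minority.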

Performance Metrics

Metric	Value
Accuracy	94.9%
Sensitivity	93.9%
Specificity	95.9%
ROC-AUC	0.978
Test Set Size	876 isolates
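
For reference, the three headline metrics can be derived from confusion-matrix counts as sketched below. This assumes "resistant" is treated as the positive class; the counts are hypothetical and are not the repository's confusion matrix.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # resistant isolates correctly flagged
        "specificity": tn / (tn + fp),  # susceptible isolates correctly cleared
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=95, fp=5)
```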

Key Insights

The model independently discovered known resistance mechanisms:

blaCTX-M-15 (Extended-Spectrum Beta-Lactamase) - strongest predictor
blaCMY-2 (AmpC Cephalosporinase)
gyrA_S83L (Gyrase mutation - fluoroquinolone resistance; likely a co-resistance marker rather than a direct ceftriaxone mechanism)
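
A ranking like the one above would typically come from the forest's impurity-based feature importances. The sketch below illustrates the idea on synthetic data where one column fully determines the label; `noise_gene` and the data are hypothetical, and this is not necessarily how the repository produced `feature_importance.csv`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
gene_names = ["blaCTX-M-15", "blaCMY-2", "gyrA_S83L", "noise_gene"]  # last name is hypothetical

# Synthetic presence/absence matrix; the label is driven entirely by the first column.
X = rng.integers(0, 2, size=(200, len(gene_names)))
y = X[:, 0]

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)

# Sort genes by impurity-based importance, highest first.
ranked = sorted(
    zip(gene_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_gene = ranked[0][0]
```

Importance reflects predictive association, not causation, so a gene can rank highly simply because it co-occurs with a true resistance determinant.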

Biological Mechanism

Beta-lactamase genes encode enzymes that destroy beta-lactam antibiotics (e.g., cephalosporins) before they can bind the penicillin-binding proteins that build the bacterial cell wall.

Files

Training: projects/cefixime-resistance-training/src/train.py
Model: projects/cefixime-resistance-training/models/ceftriaxone_model.pkl
App: src/app.py

💊 Project 2: AI Peptide Dosing Calculator

Problem Statement

Antimicrobial peptide (AMP) design is expensive and slow. Wet-lab screening for potency (MIC) takes months. Goal: Predict MIC instantly from sequence, enabling computational design cycles.

Solution

Model: Random Forest Regressor (100 trees)
Data: 3,143 peptides with MIC values measured against E. coli (NCBI)
Target: neg_log_mic_microM (-log10 of MIC in µM)
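
The target transform and its inverse can be sketched as follows. The function names are illustrative, not taken from the repository.

```python
import math

def mic_to_neg_log(mic_um):
    """Convert an MIC in µM to -log10(MIC); larger values mean higher potency."""
    if mic_um <= 0:
        raise ValueError("MIC must be positive")
    return -math.log10(mic_um)

def neg_log_to_mic(neg_log_mic):
    """Invert the transform to recover the MIC in µM."""
    return 10 ** (-neg_log_mic)

potent = mic_to_neg_log(0.1)    # 0.1 µM maps to about 1.0
weak = mic_to_neg_log(100.0)    # 100 µM maps to about -2.0
```

Working in log units keeps a 1000-fold MIC range on a compact scale, so a fixed error in the target corresponds to a fixed fold-change in MIC.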

Performance Metrics

Metric	Current (K-mers)	Previous (Baseline)
R² Score	0.9992	0.4461
RMSE	0.024 log units	0.629 log units
Pearson r	0.9996	0.6742
p-value	< 0.001	< 0.001
Test Set Size	629 peptides	629 peptides
Features	406 (7 + 399 k-mers)	7 (physicochemical only)
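
As a reference for how the metrics in this table are defined, here is a dependency-free sketch. The input values are hypothetical -log10(MIC) numbers, and the repository may well compute these with scikit-learn and SciPy instead.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and Pearson r for paired observations."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "pearson_r": cov / math.sqrt(ss_tot * var_p),
    }

# Hypothetical values for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```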

Interpretation

RMSE of 0.024 log units = ~1.06x fold-change (nearly perfect prediction!)
Model explains 99.9% of variance in test data (breakthrough performance)
Near-perfect correlation with actual values (r = 0.9996)
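
The fold-change figure follows from exponentiating the RMSE, since the target is in log10 units. A small worked check, using the two RMSE values from the table above:

```python
def fold_error(rmse_log10):
    """Typical multiplicative error implied by an RMSE expressed in log10 units."""
    return 10 ** rmse_log10

kmer_fold = fold_error(0.024)      # about 1.06x
baseline_fold = fold_error(0.629)  # about 4.3x
```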

Feature Engineering

Physicochemical Properties (7 features via Biopython):

Molecular Weight - correlates with toxicity vs efficacy
Aromaticity - aromatic residues enhance membrane interaction
Instability Index - peptide stability in vivo
Isoelectric Point - charge affects cellular uptake
GRAVY (hydrophobicity) - hydrophobic residues improve activity
Length - longer peptides often more potent but less specific
Positive Charge - (K + R count) - important for bacterial binding
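
A minimal sketch of how the seven descriptors listed above could be computed with Biopython's `ProteinAnalysis`. The function name, the dictionary keys, and the example peptide are assumptions for illustration; `src/features.py` may be organized differently.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def physchem_features(sequence):
    """Compute the seven physicochemical descriptors for one peptide sequence."""
    seq = sequence.upper()
    analysis = ProteinAnalysis(seq)
    return {
        "molecular_weight": analysis.molecular_weight(),
        "aromaticity": analysis.aromaticity(),
        "instability_index": analysis.instability_index(),
        "isoelectric_point": analysis.isoelectric_point(),
        "gravy": analysis.gravy(),
        "length": len(seq),
        "positive_charge": seq.count("K") + seq.count("R"),
    }

features = physchem_features("KKWWRR")  # hypothetical example peptide
```

Sequences containing non-standard residue codes may need filtering first, since several of these calculations rely on lookup tables for the 20 standard amino acids.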

K-mer (Dipeptide) Features (399 features via CountVectorizer):

Extracts all 2-character amino acid combinations (e.g., "KK", "WR", "EK")
Captures sequence order information (solves "bag of words" problem)
Preserves local context: distinguishes R-R-W-W from W-R-W-R
Min frequency threshold (min_df=5) filters rare k-mers
Breakthrough improvement: R² 0.45 → 0.9992 (+122% relative gain)

Potency Categories

< 2 µM: 💎 Excellent (highly potent)
2-10 µM: ✅ Good (reasonable activity)
10-50 µM: ⚠️ Weak (marginal)
50 µM: ❌ Inactive (not viable)
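
These bands can be expressed as a small lookup function. The sketch assumes the last band means MIC above 50 µM and that shared boundaries (10 and 50) fall in the lower band; the source does not specify boundary handling.

```python
def potency_category(mic_um):
    """Map a predicted MIC (µM) to the potency bands listed above."""
    if mic_um < 2:
        return "Excellent"
    if mic_um <= 10:
        return "Good"
    if mic_um <= 50:
        return "Weak"
    return "Inactive"
```

For example, `potency_category(1.5)` returns `"Excellent"` and `potency_category(120)` returns `"Inactive"`.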

Model Evolution: Solving the "Bag of Words" Problem

Initial Challenge (R² = 0.45)

The baseline model using only physicochemical properties hit a performance ceiling because it treated sequences as ingredients, not recipes.

The Problem:

Sequence R-R-W-W (positive charge → hydrophobic) might be highly potent
Sequence W-R-W-R (alternating pattern) could be ineffective
Issue: Both have identical weight, charge, GRAVY → model couldn't distinguish them

Most of the physicochemical features are sequence-order agnostic: they summarize global composition but ignore the local patterns critical for membrane interaction.
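
The two example sequences make this concrete: they share the same residue composition but differ in their adjacent-residue pairs. A dependency-free sketch (helper names are illustrative):

```python
from collections import Counter

def composition(seq):
    """Residue counts: order-agnostic, like composition-based descriptors."""
    return Counter(seq)

def dipeptides(seq):
    """Counts of overlapping adjacent residue pairs, which depend on order."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

same_composition = composition("RRWW") == composition("WRWR")  # True
same_dipeptides = dipeptides("RRWW") == dipeptides("WRWR")     # False
```

Any descriptor computed purely from composition (weight, net charge, GRAVY) is therefore identical for both peptides.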

Solution: K-mer Features (Implemented)

Added dipeptide counting to capture local sequence context:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(2, 2),  # Dipeptides (AA, AK, KE, WW, etc.)
    min_df=5              # Ignore rare k-mers
)
kmer_features = vectorizer.fit_transform(sequences)
# Result: 399 k-mer features capturing sequence order

Breakthrough Results:

R² improved from 0.45 → 0.9992 (99.9% variance explained)
RMSE reduced from 0.63 → 0.024 log units (~26x improvement)
Model now distinguishes R-R-W-W from W-R-W-R based on local patterns

Why K-mers Work:

Capture pairwise amino acid interactions (e.g., "KK" = strong positive clustering)
Preserve local residue order in a compact, fixed-size representation (unlike full sequence embeddings)
Interpretable: Can analyze top k-mers for biological plausibility
Computationally efficient for inference

Biological Validation: Top k-mer features likely include:

"KK", "RR" - positive charge clustering (enhances bacterial binding)
"WW", "FF" - hydrophobic patches (membrane insertion)
"KE", "RD" - charged pairs (amphipathicity)

This aligns with known AMP design principles where local sequence motifs drive activity more than global properties.

Files

Feature extraction: src/features.py
Training: projects/MIC Regression/src/train.py
Model: projects/MIC Regression/models/mic_predictor.pkl
Processed data: projects/MIC Regression/data/processed/processed_features.csv
App: src/app_MIC.py

🔬 Technical Stack

Data Science

Pandas: Data manipulation & analysis
NumPy: Numerical computations
Scikit-Learn: RandomForest classifiers & regressors
Biopython: Protein sequence analysis (Bio.SeqUtils.ProtParam)
SciPy: Statistical tests (Pearson correlation, etc.)

Visualization

Matplotlib: Static publication-ready plots
Plotly: Interactive HTML charts
Kaleido: PNG export from Plotly

Deployment

Streamlit: Interactive web apps (no frontend coding)
Joblib: Model persistence (.pkl files)
GitHub: Version control & deployment integration

🏥 Biological Background

Antimicrobial Resistance (AMR)

Global Impact:

~1.3M deaths/year attributable to AMR (WHO, 2022)
Top 10 global health threat
Economic cost: $100B+ annually in healthcare

Genetic Basis (Ceftriaxone Example):

Enzymatic Inactivation: blaCTX-M genes produce beta-lactamases that hydrolyze the beta-lactam ring
Target Modification: gyrA mutations alter the DNA gyrase binding site (a fluoroquinolone mechanism that can co-occur with ceftriaxone resistance)
Efflux Pumps: acrB overexpression exports antibiotics before they act

Antimicrobial Peptides (AMPs)

Natural Defense:

Found in all life forms (immune system, skin, GI tract)
Kill bacteria via direct membrane disruption
Resistance is less likely to develop (multiple mechanisms)

Design Challenge:

Potency (MIC) varies 1000-fold or more (0.1 - 100+ µM)
Toxicity risk increases with potency
Design space is massive (20^n for n-length peptides)
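
To give a sense of scale for the 20^n figure, a short worked calculation, assuming the 20 standard amino acids:

```python
def design_space(length, alphabet_size=20):
    """Number of distinct peptides of a given length over the standard amino acids."""
    return alphabet_size ** length

n_10mers = design_space(10)  # 10_240_000_000_000, about 1e13
n_20mers = design_space(20)  # about 1e26
```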

ML Solution:

Use physicochemical properties and k-mer features to predict potency
Enable rational design instead of random screening
Reduce wet-lab costs & timelines

📚 Literature & Data Sources

Antimicrobial Resistance

NCBI MicroBIGG-E: https://microbiggdata.ncbi.nlm.nih.gov/ (genotypes + phenotypes)
EUCAST Guidelines: https://www.eucast.org/ (standard testing methods)
CARD Database: https://card.mcmaster.ca/ (resistance gene annotations)

Antimicrobial Peptides

APD (APD3): https://aps.unmc.edu/APD/ (AMP database)
BioPep: https://www.bipep.org/ (peptide bioactivity)

Biopython Feature Extraction

ProteinAnalysis documentation: https://biopython.org/wiki/Documentation

⚠️ Disclaimers

Ceftriaxone Predictor

For research/educational use only. Not a clinical diagnostic device.

Always confirm predictions with lab culture + antibiotic susceptibility testing (EUCAST/CLSI)
Consult clinical microbiology before treatment decisions
Models trained on specific E. coli population; validate locally

MIC Calculator

For research/design purposes only. Not validated for clinical use.

Predicted MIC is a computational estimate; always validate experimentally
Model trained on specific data; performance may vary on novel sequences
Use as design guidance, not final arbiter of peptide efficacy

🎯 Roadmap

Q1 2025

Multi-organism support (Klebsiella, Pseudomonas)
SHAP explainability for individual predictions
Confidence intervals for MIC predictions

Q2 2025

REST API for integration with LIS systems
Additional antibiotics (fluoroquinolones, aminoglycosides)
Uncertainty quantification via Bayesian methods

Q3 2025

Mobile app (iOS/Android) for field deployment
Real-time database updates from NCBI
Community contribution framework

👤 Author

Vihaan Kulkarni — Bioinformatics & Machine Learning Engineer

📄 License

MIT License — Free for academic and research use.

Last Updated: December 17, 2025

Status: ✅ Active Development

Phase 6: Documentation

Fill out README.md with:
- Problem statement
- Key insights (with screenshots)
- Model metrics
- Deployment link
Use "Problem → Method → Insight → Impact" structure

📦 Standard Dependencies

Every project includes:

Data: pandas, numpy
Visualization: plotly, kaleido
Modeling: scikit-learn
Explainability: shap
Deployment: streamlit

Optional (uncomment in requirements.txt if needed):

Experiment Tracking: mlflow, wandb
Deep Learning: torch, tensorflow

💡 Pro Tips

Run baseline first: Always compare against a simple model
Plotly over Matplotlib: Interactive charts reveal more insights
Document as you go: Fill README during the project, not after
Save figures: Use fig.write_html() to preserve interactivity
Version control: Commit after each major milestone

🎓 Learning Resources

📊 Portfolio Goals

✅ 1 high-quality project per week
✅ Every project deployed with Streamlit
✅ README formatted for resume/GitHub
✅ Interactive visualizations (no static PNGs)
✅ Model explainability included

Built by Vihaan Kulkarni
Senior ML Engineer & Data Storyteller
