Machine Learning Models for Antimicrobial Resistance & Peptide Engineering
A collection of production-ready ML projects focused on antimicrobial resistance prediction and antimicrobial peptide design. Built with scikit-learn, Streamlit, and Biopython.
Classification Model for Antibiotic Resistance Detection
- Task: Binary classification (Susceptible vs Resistant)
- Model: Random Forest Classifier
- Accuracy: 94.9% | Sensitivity: 93.9% | Specificity: 95.9%
- Data: 4,383 E. coli isolates from NCBI
- App: `streamlit run src/app.py`
Regression Model for Antimicrobial Peptide Potency Prediction
- Task: MIC (Minimum Inhibitory Concentration) prediction
- Model: Random Forest Regressor
- R² Score: 0.45 | RMSE: 0.63 log units (baseline with physicochemical features only)
- Data: 3,143 peptides with E. coli MIC values
- App: `streamlit run src/app_MIC.py`
Challenge: Antibiotic-resistant bacteria cause ~1.3M deaths annually (WHO). Traditional lab testing takes 24-48 hours, delaying treatment.
Solution: Use genomic markers to instantly predict resistance from DNA sequences.
Challenge: Designing potent peptides requires expensive lab screening. Potency varies wildly (MIC: 0.1 - 1000+ µM).
Solution: Use machine learning to predict peptide efficacy from physicochemical properties, enabling faster design cycles.
ML-Training/
├── projects/
│   ├── cefixime-resistance-training/    # Antibiotic resistance classifier
│   │   ├── data/
│   │   │   ├── raw/                     # Original NCBI isolates
│   │   │   └── processed/               # Cleaned genotype data
│   │   ├── src/
│   │   │   ├── process.py               # Data preprocessing
│   │   │   └── train.py                 # Model training (RF classifier)
│   │   ├── models/
│   │   │   └── ceftriaxone_model.pkl    # Trained classifier
│   │   └── results/
│   │       ├── confusion_matrix.html    # Interactive CM
│   │       └── feature_importance.csv   # Top resistance genes
│   │
│   └── MIC Regression/                  # Peptide potency regressor
│       ├── data/
│       │   ├── raw/                     # Raw peptide sequences & MIC values
│       │   └── processed/               # Computed physicochemical features
│       ├── src/
│       │   ├── process.py               # Data preprocessing
│       │   └── train.py                 # Model training (RF regressor)
│       ├── models/
│       │   └── mic_predictor.pkl        # Trained regressor
│       └── results/
│           ├── predicted_vs_actual.png  # Predictions visualization
│           └── feature_importance.png   # Top peptide features
│
├── src/
│   ├── app.py                           # Ceftriaxone classifier Streamlit app
│   ├── app_MIC.py                       # MIC regressor Streamlit app
│   └── features.py                      # Biopython feature extraction
│
├── utils/
│   └── model_evaluation.py              # Shared evaluation metrics
│
├── requirements.txt                     # Python dependencies
└── README.md                            # This file
- Python 3.8+
- Git
# Clone repository
git clone https://github.com/vihaankulkarni29/ML-Training
cd ML-Training
# Install dependencies
pip install -r requirements.txt
Ceftriaxone Resistance Predictor (Classifier):
streamlit run src/app.py
Access at http://localhost:8501
AI Peptide Dosing Calculator (Regressor):
streamlit run src/app_MIC.py
Access at http://localhost:8501
Antibiotic susceptibility testing via culture takes 24-48 hours. Patients with life-threatening infections can't wait. Goal: Predict Ceftriaxone resistance instantly from genomic markers.
- Model: Random Forest Classifier (100 trees, balanced class weights)
- Data: 4,383 E. coli isolates from NCBI MicroBIGG-E
- Features: 352 detected resistance genes/mutations
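The configuration listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's `train.py`: the matrix shape, the labelling rule, and the 80/20 stratified split are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the genotype table: rows are isolates, columns are
# binary presence/absence flags for resistance genes or mutations.
n_isolates, n_genes = 600, 40
X = rng.integers(0, 2, size=(n_isolates, n_genes))

# Hypothetical labelling rule: gene 0 or gene 1 makes an isolate resistant (1).
y = ((X[:, 0] == 1) | (X[:, 1] == 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,         # 100 trees, as listed above
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=0,
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
resistant_probability = clf.predict_proba(X_test)[:, 1]
print(f"accuracy on held-out synthetic isolates: {test_accuracy:.3f}")
```

`class_weight="balanced"` matters most when resistant and susceptible isolates are unevenly represented, since it keeps the minority class from being ignored during tree construction.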
| Metric | Value |
|---|---|
| Accuracy | 94.9% |
| Sensitivity | 93.9% |
| Specificity | 95.9% |
| ROC-AUC | 0.978 |
| Test Set Size | 876 isolates |
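For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from confusion-matrix counts, treating "Resistant" as the positive class. The counts are hypothetical and chosen for easy arithmetic; they are not the project's confusion matrix.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # resistant isolates correctly flagged
        "specificity": tn / (tn + fp),  # susceptible isolates correctly cleared
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
print(metrics)
```

In this setting a false negative (a resistant isolate called susceptible) is usually the costlier error, which is why sensitivity is reported alongside accuracy.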
The model independently discovered known resistance mechanisms:
- blaCTX-M-15 (Extended-Spectrum Beta-Lactamase) - strongest predictor
- blaCMY-2 (AmpC Cephalosporinase)
- gyrA_S83L (Gyrase mutation - fluoroquinolone resistance)
Beta-lactamase genes encode enzymes that destroy beta-lactam antibiotics (e.g., cephalosporins) before they can bind the penicillin-binding proteins that build the bacterial cell wall.
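A ranking like the one above is typically read from a fitted forest's `feature_importances_` attribute. The sketch below assumes that approach and uses synthetic data with hypothetical feature names, where only one feature carries signal; it does not reproduce the project's `feature_importance.csv`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical feature names; only the first one drives the synthetic label.
gene_names = ["gene_signal", "gene_noise_a", "gene_noise_b", "gene_noise_c"]
X = rng.integers(0, 2, size=(500, len(gene_names)))
y = X[:, 0]  # the label copies the signal gene exactly

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds the normalized mean impurity decrease per feature.
ranking = sorted(
    zip(gene_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:14s} {importance:.3f}")
```

A high importance indicates association with the label in the training data. It can reflect a marker that travels with a resistance gene rather than a direct mechanism, so rankings are best checked against known biology.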
- Training: `projects/cefixime-resistance-training/src/train.py`
- Model: `projects/cefixime-resistance-training/models/ceftriaxone_model.pkl`
- App: `src/app.py`
Antimicrobial peptide (AMP) design is expensive and slow. Wet-lab screening for potency (MIC) takes months. Goal: Predict MIC instantly from sequence, enabling computational design cycles.
- Model: Random Forest Regressor (100 trees)
- Data: 3,143 peptides with E. coli MIC values (NCBI)
- Target: `neg_log_mic_microM` (-log10 of MIC in µM)
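The target transform and its inverse can be written in a few lines. The function names below are hypothetical helpers for illustration; the project's preprocessing code may organize this differently.

```python
import math

def mic_to_target(mic_um):
    """Convert a MIC in µM to the regression target, -log10(MIC)."""
    if mic_um <= 0:
        raise ValueError("MIC must be positive")
    return -math.log10(mic_um)

def target_to_mic(target):
    """Invert the transform: recover the MIC in µM from -log10(MIC)."""
    return 10 ** (-target)

for mic in (0.1, 1.0, 10.0, 100.0):
    print(mic, mic_to_target(mic))
```

With this sign convention, larger target values mean lower MIC and therefore higher potency, and a difference of one unit corresponds to a tenfold change in MIC.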
| Metric | Current (K-mers) | Previous (Baseline) |
|---|---|---|
| R² Score | 0.9992 | 0.4461 |
| RMSE | 0.024 log units | 0.629 log units |
| Pearson r | 0.9996 | 0.6742 |
| p-value | < 0.001 | < 0.001 |
| Test Set Size | 629 peptides | 629 peptides |
| Features | 406 (7 + 399 k-mers) | 7 (physicochemical only) |
- RMSE of 0.024 log units = ~1.06x fold-change (nearly perfect prediction!)
- Model explains 99.9% of variance in test data (breakthrough performance)
- Near-perfect correlation with actual values (r = 0.9996)
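The sketch below shows how the reported metrics relate to each other, including the conversion from an RMSE in log10 units to a fold-change on the MIC scale. It uses plain Python and a few made-up values. The project's `utils/model_evaluation.py` may compute these differently, for example with scikit-learn or SciPy.

```python
import math

def regression_report(y_true, y_pred):
    """Compute RMSE, R^2 and Pearson r for paired log-scale values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    rmse = math.sqrt(ss_res / n)
    return {
        "rmse": rmse,
        "r2": 1 - ss_res / ss_tot,
        "pearson_r": cov / math.sqrt(ss_tot * var_p),
        "fold_error": 10 ** rmse,  # multiplicative error implied on the MIC scale
    }

# Illustrative values only, not project data.
y_true = [-1.0, 0.0, 0.5, 1.0]
y_pred = [-0.9, 0.1, 0.4, 1.1]
report = regression_report(y_true, y_pred)
print(report)

# The same conversion applied to the two RMSE values in the table above.
print(f"RMSE 0.024 -> {10 ** 0.024:.3f}x, RMSE 0.629 -> {10 ** 0.629:.2f}x")
```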
Physicochemical Properties (7 features via Biopython):
- Molecular Weight - correlates with toxicity vs efficacy
- Aromaticity - aromatic residues enhance membrane interaction
- Instability Index - peptide stability in vivo
- Isoelectric Point - charge affects cellular uptake
- GRAVY (hydrophobicity) - hydrophobic residues improve activity
- Length - longer peptides often more potent but less specific
- Positive Charge - (K + R count) - important for bacterial binding
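A rough sketch of computing these seven descriptors with Biopython's `ProteinAnalysis` follows. The function name, the dictionary keys, and the example peptide are assumptions for illustration; the actual `src/features.py` may name or order things differently.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def physicochemical_features(sequence):
    """Return the seven descriptors listed above for one peptide sequence."""
    analysis = ProteinAnalysis(sequence)
    return {
        "molecular_weight": analysis.molecular_weight(),
        "aromaticity": analysis.aromaticity(),          # fraction of F, W, Y
        "instability_index": analysis.instability_index(),
        "isoelectric_point": analysis.isoelectric_point(),
        "gravy": analysis.gravy(),                      # mean hydropathy
        "length": len(sequence),
        "positive_charge": sequence.count("K") + sequence.count("R"),
    }

# Hypothetical example peptide, chosen only to exercise the function.
features = physicochemical_features("KKLLKWLKKLL")
for name, value in features.items():
    print(name, value)
```

`ProteinAnalysis` expects uppercase one-letter codes for the standard amino acids, so sequences with gaps or non-standard residues would need cleaning first.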
K-mer (Dipeptide) Features (399 features via CountVectorizer):
- Extracts all 2-character amino acid combinations (e.g., "KK", "WR", "EK")
- Captures sequence order information (solves "bag of words" problem)
- Preserves local context: distinguishes `R-R-W-W` from `W-R-W-R`
- Min frequency threshold (`min_df=5`) filters rare k-mers
- Breakthrough improvement: R² 0.45 → 0.9992 (+122% relative gain)
- < 2 µM: Excellent (highly potent)
- 2-10 µM: Good (reasonable activity)
- 10-50 µM: Weak (marginal)
- > 50 µM: Inactive (not viable)
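These bands can be expressed as a small lookup function, for instance to label predictions in an app. The function name and the handling of the shared boundaries (10 µM and 50 µM fall in the lower band here) are assumptions, since the ranges above overlap at their edges.

```python
def potency_category(mic_um):
    """Map a predicted MIC (µM) to one of the potency bands listed above."""
    if mic_um < 2:
        return "Excellent"
    if mic_um <= 10:   # assumption: 10 µM counts as "Good"
        return "Good"
    if mic_um <= 50:   # assumption: 50 µM counts as "Weak"
        return "Weak"
    return "Inactive"

def category_from_target(neg_log_mic):
    """Convert a -log10(MIC) prediction back to µM, then categorize it."""
    return potency_category(10 ** (-neg_log_mic))

for mic in (0.5, 5, 25, 200):
    print(mic, potency_category(mic))
```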
Initial Challenge (R² = 0.45)
The baseline model using only physicochemical properties hit a performance ceiling because it treated sequences as ingredients, not recipes.
The Problem:
- Sequence `R-R-W-W` (positive charge → hydrophobic) might be highly potent
- Sequence `W-R-W-R` (alternating pattern) could be ineffective
- Issue: Both have identical weight, charge, GRAVY → model couldn't distinguish them
Physicochemical features are sequence-order agnostic - they summarize global composition but ignore local patterns critical for membrane interaction.
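The point can be checked directly with the two example sequences. In the plain-Python sketch below (helper names are illustrative), the residue composition of the two peptides is identical, so any feature derived from composition alone cannot separate them, while counts of adjacent residue pairs do differ.

```python
from collections import Counter

def composition(sequence):
    """Order-agnostic summary: how many times each residue occurs."""
    return Counter(sequence)

def dipeptides(sequence):
    """Order-aware summary: counts of each adjacent residue pair."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

a, b = "RRWW", "WRWR"
print(composition(a) == composition(b))  # True: same residues, same counts
print(dipeptides(a))
print(dipeptides(b))
```

`RRWW` yields the pairs RR, RW, and WW once each, whereas `WRWR` yields WR twice and RW once, so the two sequences become distinguishable as soon as pair counts are included.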
Solution: K-mer Features (Implemented)
Added dipeptide counting to capture local sequence context:
from sklearn.feature_extraction.text import CountVectorizer

# `sequences` is an iterable of peptide strings (one string per peptide)
vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(2, 2),  # Dipeptides (AA, AK, KE, WW, etc.)
    min_df=5             # Ignore k-mers found in fewer than 5 sequences
)
kmer_features = vectorizer.fit_transform(sequences)
# Result: 399 k-mer features capturing sequence order
Breakthrough Results:
- R² improved from 0.45 → 0.9992 (99.9% variance explained)
- RMSE reduced from 0.63 → 0.024 log units (~26x improvement)
- Model now distinguishes `R-R-W-W` from `W-R-W-R` based on local patterns
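One way to assemble the combined feature matrix is sketched below, with the k-mer vocabulary fitted on the training split only and then reused for the test split. This is an assumption about good practice, not a description of the project's `train.py`, which may order these steps differently. The sequences, the two stand-in descriptors, and `lowercase=False` are illustrative choices, and `min_df` is omitted because the toy set is tiny.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy peptide sequences (hypothetical, for illustration only).
sequences = ["KKLLKWLKKLL", "RRWWRRWW", "GIGKFLHSAKKF", "WRWRWRWR",
             "KWKLFKKIEK", "RRWWKKLL", "FLPLIGRVLSG", "KKWWKKWW"]

def simple_descriptors(seq):
    """Two stand-in global descriptors: length and K+R count."""
    return [len(seq), seq.count("K") + seq.count("R")]

train_seqs, test_seqs = train_test_split(sequences, test_size=0.25, random_state=0)

# Fit the dipeptide vocabulary on training sequences only, then reuse it.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2), lowercase=False)
kmers_train = vectorizer.fit_transform(train_seqs)
kmers_test = vectorizer.transform(test_seqs)

desc_train = csr_matrix(np.array([simple_descriptors(s) for s in train_seqs]))
desc_test = csr_matrix(np.array([simple_descriptors(s) for s in test_seqs]))

# Final matrices: [global descriptors | dipeptide counts]
X_train = hstack([desc_train, kmers_train]).tocsr()
X_test = hstack([desc_test, kmers_test]).tocsr()
print(X_train.shape, X_test.shape)
```

Fitting the vectorizer before splitting would let test sequences influence which k-mers survive the `min_df` filter. Checking for duplicate or near-duplicate sequences across the two splits is a further, related safeguard.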
Why K-mers Work:
- Capture pairwise amino acid interactions (e.g., `"KK"` = strong positive clustering)
- Preserve local residue order without overfitting (unlike full sequence embeddings)
- Interpretable: Can analyze top k-mers for biological plausibility
- Computationally efficient for inference
Biological Validation: Top k-mer features likely include:
- `"KK"`, `"RR"` - positive charge clustering (enhances bacterial binding)
- `"WW"`, `"FF"` - hydrophobic patches (membrane insertion)
- `"KE"`, `"RD"` - charged pairs (amphipathicity)
This aligns with known AMP design principles where local sequence motifs drive activity more than global properties.
- Feature extraction: `src/features.py`
- Training: `projects/MIC Regression/src/train.py`
- Model: `projects/MIC Regression/models/mic_predictor.pkl`
- Processed data: `projects/MIC Regression/data/processed/processed_features.csv`
- App: `src/app_MIC.py`
- Pandas: Data manipulation & analysis
- NumPy: Numerical computations
- Scikit-Learn: RandomForest classifiers & regressors
- Biopython: Protein sequence analysis (`Bio.SeqUtils.ProtParam`)
- SciPy: Statistical tests (Pearson correlation, etc.)
- Matplotlib: Static publication-ready plots
- Plotly: Interactive HTML charts
- Kaleido: PNG export from Plotly
- Streamlit: Interactive web apps (no frontend coding)
- Joblib: Model persistence (.pkl files)
- GitHub: Version control & deployment integration
Global Impact:
- ~1.3M deaths/year attributable to AMR (WHO, 2022)
- Top 10 global health threat
- Economic cost: $100B+ annually in healthcare
Genetic Basis (Ceftriaxone Example):
- Enzymatic Inactivation: blaCTX-M genes produce beta-lactamases that hydrolyze the beta-lactam ring
- Target Modification: gyrA mutations alter the DNA gyrase binding site (this confers fluoroquinolone rather than ceftriaxone resistance)
- Efflux Pumps: acrB overexpression exports antibiotics before they act
Natural Defense:
- Found in all life forms (immune system, skin, GI tract)
- Kill bacteria via direct membrane disruption
- Less likely to develop resistance (multiple mechanisms)
Design Challenge:
- Potency (MIC) varies 1000-fold (0.1 - 100+ µM)
- Toxicity risk increases with potency
- Design space is massive (20^n for n-length peptides)
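To give a sense of scale for the 20^n figure, the short calculation below counts the possible sequences for a few peptide lengths, assuming only the 20 standard amino acids. The function name is illustrative.

```python
def design_space_size(length, alphabet_size=20):
    """Number of distinct peptides of a given length over the standard amino acids."""
    return alphabet_size ** length

for n in (5, 10, 20):
    print(n, f"{design_space_size(n):.2e}")
```

Even at 10 residues the count exceeds ten trillion, which is why exhaustive wet-lab screening is impractical and a cheap computational filter is attractive.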
ML Solution:
- Use physicochemical properties to predict potency
- Enable rational design instead of random screening
- Reduce wet-lab costs & timelines
- NCBI MicroBIGG-E: https://microbiggdata.ncbi.nlm.nih.gov/ (genotypes + phenotypes)
- EUCAST Guidelines: https://www.eucast.org/ (standard testing methods)
- CARD Database: https://card.mcmaster.ca/ (resistance gene annotations)
- APD (APD3): https://aps.unmc.edu/APD/ (AMP database)
- BioPep: https://www.bipep.org/ (peptide bioactivity)
- `ProteinAnalysis` documentation: https://biopython.org/wiki/Documentation
For research/educational use only. Not a clinical diagnostic device.
- Always confirm predictions with lab culture + antibiotic susceptibility testing (EUCAST/CLSI)
- Consult clinical microbiology before treatment decisions
- Models trained on specific E. coli population; validate locally
For research/design purposes only. Not validated for clinical use.
- Predicted MIC is a computational estimate; always validate experimentally
- Model trained on specific data; performance may vary on novel sequences
- Use as design guidance, not final arbiter of peptide efficacy
- Multi-organism support (Klebsiella, Pseudomonas)
- SHAP explainability for individual predictions
- Confidence intervals for MIC predictions
- REST API for integration with LIS systems
- Additional antibiotics (fluoroquinolones, aminoglycosides)
- Uncertainty quantification via Bayesian methods
- Mobile app (iOS/Android) for field deployment
- Real-time database updates from NCBI
- Community contribution framework
Vihaan Kulkarni - Bioinformatics & Machine Learning Engineer
MIT License - Free for academic and research use.
Last Updated: December 17, 2025
Status: ✅ Active Development
- Fill out `README.md` with:
  - Problem statement
  - Key insights (with screenshots)
  - Model metrics
  - Deployment link
- Use "Problem → Method → Insight → Impact" structure
Every project includes:
- Data: `pandas`, `numpy`
- Visualization: `plotly`, `kaleido`
- Modeling: `scikit-learn`
- Explainability: `shap`
- Deployment: `streamlit`
Optional (uncomment in requirements.txt if needed):
- Experiment Tracking: `mlflow`, `wandb`
- Deep Learning: `torch`, `tensorflow`
- Run baseline first: Always compare against a simple model
- Plotly over Matplotlib: Interactive charts reveal more insights
- Document as you go: Fill README during the project, not after
- Save figures: Use `fig.write_html()` to preserve interactivity
- Version control: Commit after each major milestone
- ✅ 1 high-quality project per week
- ✅ Every project deployed with Streamlit
- ✅ README formatted for resume/GitHub
- ✅ Interactive visualizations (no static PNGs)
- ✅ Model explainability included
Built by Vihaan Kulkarni
Senior ML Engineer & Data Storyteller