This repository contains a machine learning pipeline for predicting antimicrobial resistance (AMR) in Acinetobacter baumannii isolates from k-mer-based genomic features.
The project demonstrates data preprocessing, feature selection, model training, and ensemble learning for accurate AMR classification.
Antimicrobial resistance (AMR) poses a major threat to global health.
This study applies machine learning to predict resistance phenotypes directly from genomic k-mer features, offering a faster alternative to traditional phenotype-based testing.
The pipeline uses:
- Random Forest for baseline model training
- Stacking Ensemble (RF + XGBoost + Logistic Regression) for improved performance
- Mutual Information-based Feature Selection to reduce dimensionality and overfitting
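
The mutual-information filter listed above could be sketched roughly as follows. This is an illustration, not the code in `amr_model.py`: it assumes scikit-learn's `SelectKBest` with `mutual_info_classif`, and the function name `select_top_kmers` and the default `k=1000` are hypothetical choices. The repository ships already-selected matrices, so the script itself may not perform this step.

```python
from functools import partial

from sklearn.feature_selection import SelectKBest, mutual_info_classif


def select_top_kmers(X_train, y_train, X_test, k=1000, random_state=42):
    """Keep the k features with the highest mutual information with the label.

    Hypothetical helper: the name, signature, and defaults are assumptions.
    """
    # Never ask for more features than the matrix actually has.
    k = min(k, X_train.shape[1])

    # Fix the random state so the MI estimate is reproducible.
    score_func = partial(mutual_info_classif, random_state=random_state)
    selector = SelectKBest(score_func=score_func, k=k)

    # Fit on the training split only, so the test split cannot
    # influence which k-mers are kept.
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)

    # Column indices of the retained features, for mapping back to k-mers.
    kept = selector.get_support(indices=True)
    return X_train_sel, X_test_sel, kept
```

Fitting the selector on the training split alone is the design choice that matters here: scoring features on the full dataset would leak test information into the model.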
| File | Description |
|---|---|
| `X_train_sel.csv` | Training feature matrix (selected k-mers) |
| `X_test_sel.csv` | Test feature matrix (selected k-mers) |
| `y_train.csv` | Training labels (resistant/susceptible) |
| `y_test.csv` | Test labels (resistant/susceptible) |
| `amr_model.py` | Python script for model training and evaluation |
| `README.md` | Project documentation |
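
As a minimal sketch, the four CSV files above could be loaded with pandas as shown below. The layout is an assumption, since the README does not describe it: each file is taken to have a header row and no index column, and each label file is taken to hold the label in its first column. The helper name `load_split` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_split(data_dir="."):
    """Load the train/test feature matrices and labels from data_dir.

    Assumes a header row in every file and the label in the first
    column of y_train.csv / y_test.csv (not confirmed by the README).
    """
    data_dir = Path(data_dir)
    X_train = pd.read_csv(data_dir / "X_train_sel.csv")
    X_test = pd.read_csv(data_dir / "X_test_sel.csv")
    y_train = pd.read_csv(data_dir / "y_train.csv").iloc[:, 0]
    y_test = pd.read_csv(data_dir / "y_test.csv").iloc[:, 0]

    # Basic sanity checks: rows must line up with labels, and both
    # matrices must expose the same selected k-mers in the same order.
    if len(X_train) != len(y_train) or len(X_test) != len(y_test):
        raise ValueError("Feature and label row counts do not match")
    if list(X_train.columns) != list(X_test.columns):
        raise ValueError("Train and test matrices have different columns")

    return X_train, X_test, y_train, y_test
```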
- Genomic data from Acinetobacter baumannii isolates were processed to extract k-mer features.
- Features were filtered using Mutual Information to retain the most informative 750–1000 k-mers.
- Random Forest classifier trained with balanced class weights: `RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)`
- Stacking ensemble combining:
  - Random Forest
  - XGBoost
  - Logistic Regression (as meta-learner)
- Evaluation metrics:
  - Accuracy
  - ROC-AUC
  - Classification report
  - Confusion matrix
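
The models and metrics listed above can be sketched as follows. This is one plausible arrangement, not necessarily what `amr_model.py` does: it assumes scikit-learn's `StackingClassifier` with Random Forest and XGBoost as base learners and Logistic Regression as the meta-learner, and it assumes labels are encoded as 0/1 with 1 meaning resistant. Only the Random Forest settings come from this README; the XGBoost settings, `cv=5`, and the helper names `build_models` and `evaluate` are illustrative. If `xgboost` is not installed, the sketch falls back to scikit-learn's `GradientBoostingClassifier`, which is a different implementation of gradient boosting.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

try:
    from xgboost import XGBClassifier
except ImportError:  # fallback only, so the sketch still runs without xgboost
    XGBClassifier = None


def build_models(random_state=42):
    """Return the baseline Random Forest and a stacking ensemble."""
    # Settings taken from the README.
    rf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=random_state
    )
    if XGBClassifier is not None:
        # Assumed hyperparameters; the README does not list them.
        booster = XGBClassifier(
            n_estimators=200, eval_metric="logloss", random_state=random_state
        )
    else:
        booster = GradientBoostingClassifier(random_state=random_state)

    # Base learners feed out-of-fold probabilities to the meta-learner.
    stack = StackingClassifier(
        estimators=[("rf", rf), ("xgb", booster)],
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method="predict_proba",
        cv=5,
    )
    return {"random_forest": rf, "stacking": stack}


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and compute the four metrics listed in the README."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of the positive class (label 1) for ROC-AUC.
    y_prob = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "report": classification_report(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
```

Calling `evaluate(model, X_train, y_train, X_test, y_test)` for each entry returned by `build_models()` gives a like-for-like comparison of the baseline and the ensemble on the same split.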
| Metric | Random Forest | Stacking Ensemble |
|---|---|---|
| Accuracy | ~78–80% | Slightly higher |
| ROC-AUC | Good separation | Improved robustness |
| Class Balance | Handled using `class_weight='balanced'` | Effective with stacking |
The ensemble model provided stable and improved results compared to single classifiers.
- Install required libraries: `pip install pandas scikit-learn xgboost`
- Clone the repository: `git clone https://github.com/tirth1305/AMR_Prediction_ML_Pipeline.git`, then `cd AMR_Prediction_ML_Pipeline`
- Run the model script: `python amr_model.py`
- View the output metrics in the terminal.
This work is inspired by:
Machine learning and feature extraction for rapid antimicrobial resistance prediction of Acinetobacter baumannii from whole-genome sequencing data (Frontiers in Microbiology, 2024)
Dataset source:
BV-BRC (Bacterial and Viral Bioinformatics Resource Center)
Tirth Patel
Department of Bioinformatics, Marwadi University
💼 Passionate about Genomics, Machine Learning, and Antimicrobial Resistance Research
This project is open-source under the MIT License.