This project demonstrates how machine learning can be applied to detect fraudulent credit card transactions in a real-world banking scenario.
Fraud detection is highly imbalanced: fraudulent transactions make up less than 0.2% of all data.
The challenge is to catch fraud while minimizing false alerts that overwhelm analysts and disrupt legitimate customers.
- 284,807 anonymized transactions from Kaggle: Credit Card Fraud Detection.
- Features
V1...V28are PCA-transformed for confidentiality. - Added features
TimeandAmount.
-
Data Wrangling & Preprocessing
- Checked for nulls, validated schema.
- Scaled
AmountandTime. - Stratified train/validation split to preserve fraud ratio.
-
Modeling
- Logistic Regression (baseline).
- Random Forest with calibrated probabilities.
- Logistic Regression with SMOTE (oversampling minority class).
-
Evaluation
- Metrics: Precision, Recall, F1, PR-AUC, ROC-AUC.
- Focused on precision/recall trade-offs instead of accuracy.
- Threshold tuning to balance fraud detection with false alert reduction.
-
Interpretability
- Logistic Regression coefficients (top predictive components).
- Random Forest feature importances.
- Business-friendly explanations of model signals.
-
Deployment Awareness
- Saved trained models and scaler with
joblib. - Produced a comparison table for business decision-making.
- Saved trained models and scaler with
The Random Forest model achieves a strong balance between recall (fraud caught) and precision (alerts that are correct).
The chart below shows how precision increases as recall decreases when the decision threshold is adjusted.
| Model @ point | Threshold | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.973 | 0.517 | 0.854 | 0.644 | 0.973 | 0.704 |
| Random Forest (best F1) | 0.365 | 0.901 | 0.813 | 0.855 | 0.970 | 0.836 |
| Random Forest (rec≥80%) | 0.406 | 0.908 | 0.805 | 0.853 | 0.970 | 0.836 |
| SMOTE + LogReg | 1.000 | 0.844 | 0.789 | 0.815 | 0.972 | 0.711 |
Key Insights:
- Logistic Regression: high recall (85%) but very noisy (52% precision).
- Random Forest: best balance — ~81% recall with ~90% precision.
- SMOTE + Logistic: better than baseline LR, but not as strong as Random Forest.
Business Takeaway:
A tuned Random Forest reduces false alerts by >90% while still catching >80% of fraud cases.
This makes fraud alerts more trustworthy for analysts and reduces unnecessary customer disruptions.
- Download dataset from Kaggle.
- Place
creditcard.csvin your working directory. - Install requirements:
pip install -r requirements.txt