The Small Business Administration (SBA), established in 1953, has played a crucial role in supporting small businesses and entrepreneurship in the United States. Through its loan guarantee programs, the SBA helps reduce lending risks, encouraging banks to extend credit to small enterprises that drive job creation and economic development. With a network of 10 regional offices and a workforce exceeding 8,000, the SBA facilitated over 103,000 small-business financings in fiscal year 2024, a 22% increase from the previous year, contributing a total capital impact of $56 billion.
However, not all loans are successfully repaid. Loan defaults present significant financial risks to lenders, making effective risk assessment a crucial part of the lending process. Traditionally, loan decisions were often based on credit history and qualitative assessments. Yet, as financial markets become more complex, data-driven approaches have become increasingly important. Data analysis allows lenders to move beyond intuition, leveraging historical trends and statistical models to improve the accuracy of credit risk evaluations and make more informed, objective lending decisions.
This project aims to develop a comprehensive predictive framework for small-business loan defaults using several machine learning algorithms, including Logistic Regression (Lasso and Ridge), Random Forest, XGBoost, and a Multilayer Perceptron neural network. Beyond classification accuracy, we incorporate a cost-benefit analysis through a customized cost matrix to better reflect the financial consequences of loan decisions. By aligning model outputs with profit-maximization goals, the final recommendation will not only predict default risks but also support lenders in optimizing their approval strategies for maximum financial return.
- Dataset: 899,000+ SBA loan records (1987–2014), enriched with macroeconomic indicators (GDP growth, interest rate, inflation, recession)
- Objective: Binary classification of loan status (Paid in Full vs. Charged Off); a target-construction sketch follows this list
- Unique Contribution: Custom cost matrix and profit-maximization framework to guide real-world lending decisions
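A minimal sketch of the target construction, assuming the public SBA extract encodes loan status in a `MIS_Status` column with values `P I F` and `CHGOFF` (the file name is also illustrative):

```python
import pandas as pd

# Load the SBA loan records; the file name is a placeholder for the actual extract.
df = pd.read_csv("SBAnational.csv", low_memory=False)

# Keep only resolved loans and build the binary target:
# 1 = Charged Off (default), 0 = Paid in Full.
df = df[df["MIS_Status"].isin(["P I F", "CHGOFF"])]
df["default"] = (df["MIS_Status"] == "CHGOFF").astype(int)
```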
Preprocessing (a pipeline sketch follows this list):
- One-Hot Encoding for categorical variables
- Standardization of numerical features
- Median imputation for missing values
- Feature engineering:
  - SBA guarantee ratio
  - Real estate collateral indicator
  - Recession flags and macroeconomic enrichment
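A minimal scikit-learn sketch of this preprocessing, using illustrative column names rather than the dataset's exact schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names are illustrative placeholders, not necessarily the dataset's exact schema.
categorical = ["State", "Industry", "UrbanRural"]
numerical = ["GrAppv", "SBA_Appv", "Term", "NoEmp", "GDP_growth", "interest_rate"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # median imputation
        ("scale", StandardScaler()),                   # standardization
    ]), numerical),
])

# Example engineered features (column names and the 240-month real-estate proxy are assumptions):
# df["sba_guarantee_ratio"] = df["SBA_Appv"] / df["GrAppv"]
# df["real_estate"] = (df["Term"] >= 240).astype(int)
```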
Models Implemented:
- Logistic Regression (Lasso, Ridge)
- Random Forest
- XGBoost (best performing; a training sketch follows this list)
- Multilayer Perceptron (highest ROI)
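A minimal training sketch for the XGBoost model, assuming `X` and `y` come from the preprocessing and target-construction sketches above; the hyperparameters are illustrative, not the tuned values from the report:

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X: preprocessed feature matrix, y: binary default indicator (see sketches above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1, eval_metric="auc")
model.fit(X_train, y_train)

p_default = model.predict_proba(X_test)[:, 1]  # probability of the positive class (default)
print("ROC AUC:", roc_auc_score(y_test, p_default))
```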
Evaluation Strategy (a threshold-tuning sketch follows this list):
- ROC AUC, Accuracy, Precision, Recall, F1-Score
- Custom Cost Matrix:
  - +5% profit for fully repaid loans
  - −25% cost for defaulted loans
  - −5% opportunity cost for false rejections
- Threshold tuning to optimize net profit
- Gain and Lift charts to identify the ROI-maximizing lending policy
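A sketch of the cost-matrix accounting and threshold tuning, under the assumption that the profits and costs are percentages of the loan amount and that the cutoff is applied to the model's P(Paid in Full) on a held-out set; the report's exact accounting may differ:

```python
import numpy as np

def net_profit(y_default, p_paid, loan_amt, threshold):
    """Net profit under the custom cost matrix (a sketch; the report's exact
    accounting may differ). All inputs are NumPy arrays of equal length:
      y_default -- 1 if the loan was charged off, 0 if paid in full
      p_paid    -- predicted P(Paid in Full), e.g. 1 - p_default from the model
      loan_amt  -- loan amount in dollars
    """
    approve = p_paid >= threshold
    repaid = y_default == 0
    profit = 0.05 * loan_amt[approve & repaid].sum()    # +5% on approved loans that repay
    profit -= 0.25 * loan_amt[approve & ~repaid].sum()  # -25% on approved loans that default
    profit -= 0.05 * loan_amt[~approve & repaid].sum()  # -5% opportunity cost of false rejections
    return profit

# Threshold tuning: sweep candidate cutoffs and keep the profit-maximizing one
thresholds = np.linspace(0.05, 0.95, 91)
profits = [net_profit(y_default, p_paid, loan_amt, t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(profits))]
```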
| Model | ROC AUC | Net Profit | ROI | Optimal Threshold |
|---|---|---|---|---|
| XGBoost | 0.978 | $3.566B | 3.49% | 0.27 |
| Random Forest | 0.975 | $3.484B | 3.41% | 0.27 |
| MLP (Neural Net) | 0.945 | $3.383B | 3.97% | 0.15 |
| Logistic (Lasso/Ridge) | ~0.892 | ~$2.71B | 2.66% | ~0.34–0.35 |
Key Insight: Approving only the top ~76–80% of applications based on repayment likelihood maximizes profit. This is operationalized using model-specific "P(Paid in Full)" thresholds.
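Continuing the sketch above (with `p_paid` and `best_threshold` as assumed variables), the threshold translates into an approval rate as follows:

```python
import numpy as np

# Share of applications approved at the profit-maximizing threshold
approval_rate = (p_paid >= best_threshold).mean()

# Equivalently, the P(Paid in Full) cutoff that approves the top ~78% of applications
cutoff_top_78 = np.quantile(p_paid, 0.22)
```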
| File Name | Description |
|---|---|
| `2024_business_analytics_competition.pdf` | Competition and problem brief |
| `ML Final Project - Group 4.pdf` | Presentation version of the final project deliverable |
| `[BANA4020_Group4] Final Project Report.pdf` | Full final report with methodology, results, and references |
| `best_mlp_model.joblib` | Serialized best-performing Multilayer Perceptron model |
| `knn.py` | Script to train and evaluate a K-Nearest Neighbors model |
| `lasso_model.joblib` | Trained Lasso Regression model saved using joblib |
| `logistic regression running file.py` | Script to train Logistic Regression models with L1/L2 regularization |
| `neural network.py` | Multilayer Perceptron (Neural Network) model training and evaluation |
| `random forest.py` | Random Forest training script with performance evaluation |
📌 Tip: Run any `.py` script to train its model after data preprocessing. Trained models are saved to `.joblib` files for reuse or deployment.
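A minimal sketch of reloading a saved model with joblib; `X_new` is a placeholder for new, already-preprocessed application data:

```python
import joblib

# Reload a serialized model, e.g. the MLP listed in the file table above.
mlp = joblib.load("best_mlp_model.joblib")

# Score new applications; X_new must be preprocessed exactly as the training data was.
# Which predict_proba column corresponds to "Paid in Full" depends on the target encoding.
probs = mlp.predict_proba(X_new)
```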
# 1. Clone the repo
git clone https://github.com/yourusername/sba-loan-default-prediction.git
# 2. Install requirements
pip install -r requirements.txt
# 3. Run preprocessing
python scripts/preprocess.py
# 4. Train model (example: XGBoost)
python scripts/train_xgb.py