This project analyzes healthcare fraud patterns using three large-scale datasets:
- CMS Medicare Data – Public provider billing records with cost and service metrics.
- Kaggle Healthcare Fraud Dataset – Real-world claims data with fraud labels.
- Synthea Synthetic Data – Comprehensive synthetic EHR data including patients, conditions, and claims.
The project has three goals:
- Explore service volume and financial metrics across providers and states.
- Detect patterns of excessive billing and service anomalies.
- Prepare datasets for machine learning models focused on fraud detection.
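As a minimal sketch of the first two goals, the snippet below totals services and payments by state and then flags providers whose payment per service sits above the upper Tukey fence (Q3 + 1.5 × IQR). The toy table and every column name (`provider_id`, `state`, `total_services`, `total_payment`) are assumptions made for illustration, not the actual CMS schema, and the IQR rule is one plausible anomaly check rather than the method the project necessarily used.

```python
import pandas as pd

# Toy stand-in for provider billing records; all column names are hypothetical.
billing = pd.DataFrame({
    "provider_id": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "total_services": [100, 120, 110, 90, 95, 105],
    "total_payment": [10_000.0, 12_500.0, 11_000.0, 9_000.0, 9_800.0, 95_000.0],
})

# Service volume and payment totals per state.
state_summary = billing.groupby("state")[["total_services", "total_payment"]].sum()

# Average payment per service for each provider.
billing["payment_per_service"] = billing["total_payment"] / billing["total_services"]

# Tukey's rule: flag values above Q3 + 1.5 * IQR as unusually high.
q1 = billing["payment_per_service"].quantile(0.25)
q3 = billing["payment_per_service"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
billing["excessive_billing"] = billing["payment_per_service"] > upper_fence

# Providers flagged for a closer look.
flagged = billing.loc[billing["excessive_billing"], "provider_id"].tolist()
print(state_summary)
print(flagged)
```

A flag from this rule only marks a provider for closer review; a high payment per service can also reflect specialty or case mix.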
Tools and methods:
- Python, Pandas, Seaborn, Scikit-learn, Matplotlib
- Data preprocessing, outlier handling, feature creation
- Visualization and statistical summary
- Data prepared for modeling with textbook methods from An Introduction to Statistical Learning
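The preprocessing steps listed above could look roughly like the sketch below: clip extreme values, derive a ratio feature, then fit a logistic regression, one of the classification methods covered in An Introduction to Statistical Learning. The data is randomly generated, the column names (`claim_count`, `total_billed`, `is_fraud`) are hypothetical, and the label is a toy rule built only so the example runs; none of it reflects the project's real data or results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for labeled claims; columns are hypothetical.
claims = pd.DataFrame({
    "claim_count": rng.integers(1, 50, size=n),
    "total_billed": rng.gamma(shape=2.0, scale=5_000.0, size=n),
})
# Toy label: top 20% of billed amounts are marked as fraud (illustration only).
threshold = claims["total_billed"].quantile(0.8)
claims["is_fraud"] = (claims["total_billed"] > threshold).astype(int)

# Outlier handling: clip billed amounts to the 1st-99th percentile range.
low, high = claims["total_billed"].quantile([0.01, 0.99])
claims["total_billed"] = claims["total_billed"].clip(lower=low, upper=high)

# Feature creation: average billed amount per claim.
claims["billed_per_claim"] = claims["total_billed"] / claims["claim_count"]

X = claims[["claim_count", "total_billed", "billed_per_claim"]]
y = claims["is_fraud"]

# Stratified split keeps the fraud rate similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predicted fraud probability for each held-out row.
fraud_proba = model.predict_proba(X_test)[:, 1]
```

Putting the scaler inside the pipeline means it is fit on the training split only, which avoids leaking test-set statistics into the model.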
Course Project – Statistical Learning (Spring 2025)
Team Members: Nhan, Tan, Andre