This project aims to predict insurance charges from a set of features such as age, gender, BMI, and other factors. The goal is to create a machine learning model that accurately estimates the charges for an individual based on their profile.
The dataset used for this project is insurance.csv. It includes the following features:
- age: Age of the individual
- sex: Gender
- bmi: Body Mass Index
- children: Number of children
- smoker: Whether the individual is a smoker
- region: The region where the individual resides
- charges: Medical insurance charges (target variable)
The preprocessing pipeline includes the following steps:
- Handling duplicated rows
- Handling missing values
- Correcting data types
- Encoding categorical variables (sex, smoker, region)
- Feature scaling (StandardScaler for age, bmi, charges)
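The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the `preprocess` helper is hypothetical, dropping rows is only one possible way to handle missing values, one-hot encoding is an assumed choice of encoder, and the sample rows are made-up values rather than rows from insurance.csv.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the listed steps to insurance.csv-style columns."""
    # Handle duplicated rows and missing values (assumption: drop both).
    df = df.drop_duplicates().dropna().copy()
    # Correct data types for the numeric columns.
    df = df.astype(
        {"age": "int64", "children": "int64", "bmi": "float64", "charges": "float64"}
    )
    # Encode categorical variables (assumption: one-hot, dropping the first level).
    df = pd.get_dummies(
        df, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
    )
    # Scale age, bmi and charges to zero mean and unit variance.
    numeric = ["age", "bmi", "charges"]
    df[numeric] = StandardScaler().fit_transform(df[numeric])
    return df.reset_index(drop=True)


# Made-up rows for illustration: one exact duplicate and one missing bmi.
raw = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 50, 60],
        "sex": ["female", "male", "male", "female", "female", "male"],
        "bmi": [25.0, 30.0, 35.0, 28.0, 28.0, None],
        "children": [0, 1, 3, 2, 2, 0],
        "smoker": ["yes", "no", "no", "no", "no", "yes"],
        "region": ["southwest", "southeast", "northwest", "southeast", "southeast", "northeast"],
        "charges": [17000.0, 2000.0, 4500.0, 9000.0, 9000.0, 30000.0],
    }
)
clean = preprocess(raw)
print(clean.head())
```

Because charges is scaled as well, predictions from a model trained on this frame are in standardized units; keeping the fitted scaler around would allow converting them back to the original scale.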
The model is Linear Regression. It is evaluated as follows:
- R-squared (R²) is the evaluation metric
- The model was validated with a new dataset called validation_dataset.csv
The mean R-squared score across 5 folds is 0.75. This suggests a good level of predictive power, which may be further improved with hyperparameter tuning, feature engineering, and exploring different model architectures.
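A mean R² across 5 folds can be computed along these lines. This is a sketch under stated assumptions: it uses scikit-learn's `cross_val_score` with a shuffled `KFold`, which may differ from how the project split its folds, and it runs on synthetic data, so it will not reproduce the 0.75 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the preprocessed features and target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation, scoring each held-out fold with R-squared.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()

print("R² per fold:", np.round(scores, 3))
print("Mean R²:", round(mean_r2, 3))
```

With the real data, `X` would hold the preprocessed feature columns and `y` the charges column.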