This project focuses on building and improving a machine learning model to predict wine quality based on physicochemical attributes. Through iterative steps including feature engineering, data balancing, and hyperparameter tuning, the model's performance was significantly enhanced.
- Source: Kaggle White Wine Quality Dataset
- Description: Contains physicochemical properties of white wines and their quality scores (3 to 9).
- Features: 11 physicochemical properties such as alcohol, pH, and residual sugar.
- Target Variable: Wine quality (integer scores).
- Objective: Establish a benchmark for comparison.
- Results:
- R² Score: 0.546
- RMSE: 0.593
- Insights: The baseline struggled with the imbalanced quality distribution and did not capture relationships between features.
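The notebook itself is not reproduced here, but the baseline step amounts to fitting an untuned regressor and scoring it with R² and RMSE. A minimal sketch, using synthetic data in place of the wine CSV and scikit-learn's `GradientBoostingRegressor` as a stand-in model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 11-feature wine dataset (assumption:
# the real notebook loads the Kaggle CSV instead).
X, y = make_regression(n_samples=500, n_features=11, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Untuned model, default hyperparameters — this is what "baseline" means here.
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# The two metrics reported throughout this README.
r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"R²: {r2:.3f}, RMSE: {rmse:.3f}")
```

The same two metrics computed this way are what the baseline and improved models are compared on below.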
- Objective: Enhance the dataset with new meaningful features to improve predictive accuracy.
- New Features Added:
  - `sugar_alcohol_ratio`
  - `volatile_acidity_pH_ratio`
  - Interaction terms and other domain-specific features.
- Impact: Improved the model's ability to capture complex relationships.
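The two named ratio features are straightforward column arithmetic in pandas. A sketch on a hypothetical mini-frame with the needed columns (the interaction term shown is an assumed example, as the README does not list the interaction features individually):

```python
import pandas as pd

# Toy frame with the columns the ratios need; the real notebook applies
# the same transforms to the full wine DataFrame.
df = pd.DataFrame({
    "residual sugar": [1.6, 6.9, 8.1],
    "alcohol": [8.8, 9.5, 10.1],
    "volatile acidity": [0.27, 0.30, 0.28],
    "pH": [3.00, 3.30, 3.26],
})

# The two engineered ratios named above.
df["sugar_alcohol_ratio"] = df["residual sugar"] / df["alcohol"]
df["volatile_acidity_pH_ratio"] = df["volatile acidity"] / df["pH"]

# Example interaction term (hypothetical — not listed explicitly in the README).
df["alcohol_x_pH"] = df["alcohol"] * df["pH"]
```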
- Objective: Balance the dataset to prevent bias toward majority classes.
- Method: Applied Synthetic Minority Oversampling Technique (SMOTE).
- Result: The resampled dataset ensured equal representation of all quality levels.
- Objective: Optimize XGBoost model parameters for better performance.
- Key Parameters Tuned:
  `n_estimators`, `max_depth`, `learning_rate`, `subsample`
- Outcome:
- R² Score: 0.954
- RMSE: 0.424
- Feature Importance: Identified key features using SHAP and XGBoost.
- Error Analysis: Highlighted areas for improvement, particularly for underrepresented quality levels.
| Metric | Baseline Model | Improved Model |
|---|---|---|
| R² Score | 0.546 | 0.954 |
| RMSE | 0.593 | 0.424 |
- Python Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `xgboost`, `scikit-learn`, `imbalanced-learn`, `shap`.
- Clone the repository and download the dataset.
- Install dependencies:
  `pip install -r requirements.txt`
- Run the notebook to execute the analysis and model training steps.
- Feature Engineering: Explore external data sources or new feature combinations.
- Advanced Models: Experiment with ensemble methods or neural networks.
- Explainability: Utilize advanced interpretability tools like SHAP or LIME.
- Validation: Test the model on an independent validation dataset.
This project demonstrates how iterative improvements in data preprocessing, feature engineering, and model optimization can lead to significant gains in predictive performance. The resulting model is robust and provides valuable insights into the factors influencing wine quality.
For questions or collaborations, feel free to reach out at [[email protected]].