Amazon Items Recommendation
Dor Meir, 22.2.2024
Introduction
Kaggle dataset, collected from Amazon reviews
Content
• 500,000+ reviews
• 100,000+ users
• 100,000+ items
• 12 features about items, reviews and users
Task
• Build a recommender system: recommend each user's top 5 rating-predicted items
Data Insights
• Target (rating) is left-skewed
• 77% of items have < 5 reviews, 87% of users have < 6 reviews
• After filtering: 15k users, 25k items
• 5% of ratings were unverified and were dropped
• Assume each userName is unique
• 14% of ratings come from the non-unique names “Amazon Customer” or “Kindle Customer”
• 73k duplicate ratings were dropped
• Train set (80% of the data) has 136k rows (cleaning steps sketched below)
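A minimal sketch of the cleaning steps above; the column names (verified, userName, item_id, rating), the file name and the exact filter thresholds are assumptions, not the dataset's actual schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical schema: boolean "verified", plus "userName", "item_id", "rating".
reviews = pd.read_csv("amazon_reviews.csv")

# Drop the ~5% unverified ratings and the ~73k duplicate ratings.
reviews = reviews[reviews["verified"]]
reviews = reviews.drop_duplicates(subset=["userName", "item_id", "rating"])

# Keep only sufficiently active users and items (thresholds are assumed).
user_counts = reviews["userName"].value_counts()
item_counts = reviews["item_id"].value_counts()
reviews = reviews[
    reviews["userName"].isin(user_counts[user_counts >= 6].index)
    & reviews["item_id"].isin(item_counts[item_counts >= 5].index)
]

# 80/20 train split.
train, test = train_test_split(reviews, test_size=0.2, random_state=42)
```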
Data Insights - continued
• 60% of items have < 2 brands, only 7% of ratings mention prices
• item_id defined as brand + itemName + price
• Top item: KIND’s Caramel snack, 500 reviews (0.4%)
• Median review:
• 2-word summary (“X stars”), 509-character description, no votes
• one-word text (“Good”), 5 images, 5 features, price of $13
• Brands: 8k unique, “KONG” covers 2% of reviews, 0.1% missing
• Categories: 9 unique (categories with < 1% of reviews were merged), 44% “Pet Supplies”
• Prices: 10 price groups, 14.6% missing, imputed by (sketched below):
1. 0.5% from older reviews of the same item_id (275 dates, no price change over time)
2. 12.8% from brand & category means (most prices are close to the mean), 1.3% from category means alone
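A minimal sketch of the two-step price imputation, assuming an items table with hypothetical columns item_id, brand, category and price:

```python
import pandas as pd

# Hypothetical schema: "item_id", "brand", "category", "price".
items = pd.read_csv("items.csv")

# 1. Reuse a known price from older reviews of the same item_id
#    (prices showed no change over time).
known_price = items.dropna(subset=["price"]).groupby("item_id")["price"].first()
items["price"] = items["price"].fillna(items["item_id"].map(known_price))

# 2. Fall back to the brand & category mean, then to the category mean alone.
brand_cat_mean = items.groupby(["brand", "category"])["price"].transform("mean")
cat_mean = items.groupby("category")["price"].transform("mean")
items["price"] = items["price"].fillna(brand_cat_mean).fillna(cat_mean)
```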
Data Insights - continued
• 450 features engineered (see the sketch below):
• Dummies for categories
• Dummy indicators for rows with missing values and outliers
• User-specific statistics: min, mean, median, max, std
• Item-specific statistics: min, mean, median, max, std
• Collaborative filtering: weighted averages of similar users’ ratings
• 80 features dropped due to high correlation (> 90%)
• Some features had different distributions across train, validation and test
• Features most correlated with the target (though all below 50%):
• Item-specific rating statistics: mean, median, max, std
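A rough sketch of the user/item statistics and the 90%-correlation filter, assuming a train table with hypothetical columns userName, item_id and rating plus the other engineered features:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical file with the engineered rows

def add_group_stats(df, key, col="rating"):
    """Attach min/mean/median/max/std of `col` per `key` as new features."""
    stats = df.groupby(key)[col].agg(["min", "mean", "median", "max", "std"])
    stats.columns = [f"{key}_{col}_{s}" for s in stats.columns]
    return df.join(stats, on=key)

train = add_group_stats(train, "userName")   # user-specific statistics
train = add_group_stats(train, "item_id")    # item-specific statistics

# Drop one feature from every pair correlated above 0.9.
corr = train.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
train = train.drop(columns=to_drop)
```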
Model Selection
• Mean user rating (baseline model)
• The benchmark. Had a surprisingly low error.
• Linear Regression
• The go-to regression model. Highly interpretable.
• Linear Regression with L1 regularization (Lasso)
• LR needed feature selection; Lasso kept only one important feature.
• Random Forest Regressor (different max_depth values)
• Lasso didn’t capture complex relationships; non-linear bagging with built-in feature selection.
• Light Gradient Boosting Machine (LightGBM)
• Non-linear boosting: handles complex relationships and feature selection, grows asymmetrical (leaf-wise) trees, very fast. (All candidates are sketched below.)
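A minimal sketch of the candidate models; the per-user-mean baseline is computed directly with pandas, and the file name, dropped columns and hyperparameter values shown are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

train = pd.read_csv("train.csv")  # hypothetical file with the engineered features
X_train = train.drop(columns=["rating", "userName", "item_id"])
y_train = train["rating"]

# Baseline: each user's mean training rating (global mean for unseen users).
user_mean = train.groupby("userName")["rating"].mean()
global_mean = train["rating"].mean()

candidates = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01),                       # alpha is an assumed value
    "rf_depth2": RandomForestRegressor(max_depth=2, random_state=42),
    "rf_full": RandomForestRegressor(max_depth=None, random_state=42),
    "lgbm": LGBMRegressor(),                          # library defaults
}
fitted = {name: model.fit(X_train, y_train) for name, model in candidates.items()}
```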
Comparison Methodology
• Metrics (computed as sketched below)
• Mean Absolute Error (MAE)
• Easy to interpret; large errors have a proportional (not amplified) impact.
• Mean Squared Error (MSE)
• Squaring gives bigger weight to bigger errors. A common loss function for linear regression.
• Mean Absolute Percentage Error (MAPE)
• Average absolute difference in %. Interpretable, but treats over- and underestimation asymmetrically.
• R² (Coefficient of Determination)
• In linear models, the % of variance explained. Mathematically it never decreases as features are added.
• MAE per target value, max error
• Distribution of error over rating bins (plus the max error); each range can reveal under- or overfitting.
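A minimal sketch of the metric report, assuming arrays y_true / y_pred of validation ratings and predictions:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    r2_score,
    max_error,
)

def report(y_true, y_pred):
    """Print the comparison metrics, including MAE per true rating value."""
    print("MAE :", mean_absolute_error(y_true, y_pred))
    print("MSE :", mean_squared_error(y_true, y_pred))
    print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
    print("R2  :", r2_score(y_true, y_pred))
    print("Max :", max_error(y_true, y_pred))
    for rating in np.unique(y_true):                  # error per rating bin
        mask = y_true == rating
        print(f"MAE @ rating {rating}:",
              mean_absolute_error(y_true[mask], y_pred[mask]))
```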
Comparison Methodology
• RF max_depth=2
• Had 3 important features, but MAE & MSE > baseline.
• RF max_depth=None
• MAE < baseline, MSE > baseline, i.e. more extreme errors. Smaller errors on ratings <= 3. Unlimited depth: overfit?
• LGBM defaults
• More evenly spread feature importance, but MAE & MSE > baseline.
• LGBM grid-searched on max_depth, learning_rate, n_estimators (sketched below)
• Similar errors, a little better on ratings <= 3.
• RF grid search on max_depth
• Higher complexity gave better MAE & MSE.
• Didn’t outperform max_depth=None.
• Not enough time to mitigate overfitting with more n_estimators or feature selection.
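A rough sketch of the two grid searches; the parameter grids, the 3-fold CV and the MAE scoring are assumptions, not the exact setup used:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor

train = pd.read_csv("train.csv")  # hypothetical file with the engineered features
X_train = train.drop(columns=["rating", "userName", "item_id"])
y_train = train["rating"]

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [2, 5, 10, 20, None]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
lgbm_search = GridSearchCV(
    LGBMRegressor(),
    param_grid={
        "max_depth": [3, 6, -1],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [100, 300, 500],
    },
    scoring="neg_mean_absolute_error",
    cv=3,
)
rf_search.fit(X_train, y_train)
lgbm_search.fit(X_train, y_train)
print(rf_search.best_params_, lgbm_search.best_params_)
```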
Model Selection - decision
• RF max_depth=None
• MAE < baseline, MSE > baseline (bigger errors), but better on the lower rating bins, so it filters unlikable items better.
• Even if we wanted to, we can’t pick the baseline, as it gives the same rating to every candidate item of a user.
• For lack of time, selected this model even though it might overfit.
• As expected, it was a little worse on the test data. Hoping the final training on the entire data will mitigate this a little.
• Top important features are similar in the model trained on the entire data (refit sketched below).
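A minimal sketch of the final refit on the entire data and the feature-importance check; the file name and dropped columns are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv("all_engineered_rows.csv")  # hypothetical: all engineered rows
X_all = data.drop(columns=["rating", "userName", "item_id"])
y_all = data["rating"]

final_model = RandomForestRegressor(max_depth=None, random_state=42)
final_model.fit(X_all, y_all)

# Compare the top importances with the train-only model's top features.
importances = pd.Series(final_model.feature_importances_, index=X_all.columns)
print(importances.sort_values(ascending=False).head(10))
```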
Service Deployment
• Dropped items with < 0.1% of the ratings.
• Similarity features took too long to compute for too little benefit, so they were dropped.
• Predict ratings for all pairs of users and items the user hasn’t purchased.
• Filtered the top 5 recommendations per user, exported to CSV.
• Filtered the top 5 recommendations in 5 different categories across all users, exported to CSV.
• Serving logic (flowchart; sketched below): if the user exists and has purchase history, return their 5 recommended items; otherwise return the 5 most popular items per category.
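A minimal sketch of the serving fallback, assuming hypothetical objects: the trained model, items and purchases dataframes, a build_features helper and a pre-computed popular_by_category table:

```python
def recommend(user, model, items, purchases, build_features, popular_by_category):
    """Return 5 items for `user`, falling back to the most popular per category."""
    known_users = set(purchases["userName"])
    if user not in known_users:
        # User doesn't exist or has no purchase history:
        # serve the 5 most popular items per category.
        return popular_by_category
    seen = set(purchases.loc[purchases["userName"] == user, "item_id"])
    candidates = items[~items["item_id"].isin(seen)]
    if candidates.empty:
        return popular_by_category
    preds = model.predict(build_features(user, candidates))
    return candidates.assign(pred=preds).nlargest(5, "pred")["item_id"].tolist()
```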
Run app
Questions?