Amazon Recommendations Project

The document outlines the development of a recommender system using a Kaggle dataset of over 500,000 Amazon reviews, focusing on predicting the top 5 rated items for users. Data insights reveal challenges such as a skewed rating distribution and a high number of duplicate ratings, leading to a refined dataset of 15,000 users and 25,000 items. Various models were evaluated, with a Random Forest Regressor selected for deployment despite potential overfitting, and recommendations were generated for users based on their purchase history.


Amazon items recommendation
Dor Meir, 22.2.2024
Introduction
• Kaggle dataset, collected from Amazon reviews
• Content
  • 500,000+ reviews
  • 100,000+ users
  • 100,000+ items
  • 12 features about items, reviews, and users
• Task
  • Build a recommender system: recommend each user's top 5 items by predicted rating

Data Insights
• Target (ratings) is left skewed
• 77% of items have < 5 reviews, 87% of users have < 6 reviews
• After filtering → 15k users, 25k items (see the sketch below)
• 5% of user ratings were unverified → dropped
• Assume each userName is unique
• 14% of ratings come from the non-unique "Amazon Customer" or "Kindle Customer" names
• 73k ratings are duplicates → dropped
• Train (80% of the data) is 136k rows
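A minimal pandas sketch of this filtering, assuming an illustrative file path and column names (userName, item_id, rating, verified) that may differ from the actual Kaggle schema:

```python
import pandas as pd

# Load the raw reviews (path and column names are illustrative assumptions)
reviews = pd.read_csv("amazon_reviews.csv")

# Drop unverified ratings (~5% in this project) and exact duplicate ratings (~73k rows)
reviews = reviews[reviews["verified"]]
reviews = reviews.drop_duplicates(subset=["userName", "item_id", "rating"])

# Keep items with >= 5 reviews and users with >= 6 reviews
item_counts = reviews["item_id"].value_counts()
user_counts = reviews["userName"].value_counts()
reviews = reviews[
    reviews["item_id"].isin(item_counts[item_counts >= 5].index)
    & reviews["userName"].isin(user_counts[user_counts >= 6].index)
]

# 80/20 train/test split of the filtered data
train = reviews.sample(frac=0.8, random_state=42)
test = reviews.drop(train.index)
```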
Data Insights - continued
• 60% of items have < 2 brands, 7% of ratings mention prices
  → Item_id = brand + itemName + price (see the sketch below)
• Top item: KIND's Caramel snack, 500 reviews (0.4%)
• Median review:
  • two-word summary ("X stars"), 509-character description, no votes
  • one-word text ("Good"), 5 images, 5 features, price of $13
• Brands: 8k unique, 2% of reviews are "KONG", 0.1% missing
• Categories: 9 unique (<1% were merged), 44% "Pet Supplies"
• Prices: 10 price groups, 14.6% missing, imputed by:
  1. 0.5% – older reviews of the same item_id (275 dates, no change over time)
  2. 12.8% – brand & category means (most prices are close to the mean), 1.3% – category means
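A sketch of the composite item key and the price imputation, assuming column names (brand, itemName, price, category) that may differ from the real data; step 1 is approximated here by grouping on brand + itemName:

```python
# Composite item key: brand + itemName + price
reviews["item_id"] = (
    reviews["brand"].fillna("unknown").astype(str) + "|"
    + reviews["itemName"].astype(str) + "|"
    + reviews["price"].astype(str)
)

# 1) Fill a missing price from other reviews of the same item (brand + itemName)
reviews["price"] = reviews["price"].fillna(
    reviews.groupby(["brand", "itemName"])["price"].transform("first")
)

# 2) Fall back to the brand & category mean, then to the category mean
reviews["price"] = reviews["price"].fillna(
    reviews.groupby(["brand", "category"])["price"].transform("mean")
)
reviews["price"] = reviews["price"].fillna(
    reviews.groupby("category")["price"].transform("mean")
)
```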
Data Insights - continued
• 450 features engineered (see the sketch below)
  • Dummies for categories
  • Dummies for rows with missing values and outliers
  • User-specific rating statistics: min, mean, median, max, std
  • Item-specific rating statistics: min, mean, median, max, std
  • Collaborative filtering: weighted averages of similar users' ratings
• 80 features dropped due to high correlation (> 90%)
• Some features had different distributions for train, validation, and test
• Highest-correlated features with the target (but less than 50%):
  • Item-specific rating statistics: mean, median, max, std
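A sketch of a few of these features and the correlation filter, continuing from the train split above (the collaborative-filtering similarity features are omitted for brevity, and column names remain assumptions):

```python
import numpy as np
import pandas as pd

# Per-user and per-item rating statistics, joined back onto the training rows
user_stats = train.groupby("userName")["rating"].agg(["min", "mean", "median", "max", "std"])
item_stats = train.groupby("item_id")["rating"].agg(["min", "mean", "median", "max", "std"])
train = train.join(user_stats.add_prefix("user_rating_"), on="userName")
train = train.join(item_stats.add_prefix("item_rating_"), on="item_id")

# Dummy variables for categories
train = pd.get_dummies(train, columns=["category"], prefix="cat")

# Drop one feature from every pair correlated above 0.9
numeric = train.select_dtypes(include=np.number)
corr = numeric.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
train = train.drop(columns=to_drop)
```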
Model Selection
• Mean user rating (baseline model)
  • Benchmark. Had surprisingly low error.
• Linear Regression
  • The go-to regression model. Highly interpretable.
• Linear Regression with L1 regularization (Lasso)
  • LR needed feature selection. Only one important feature remained.
• Random Forest Regressor (different max_depths)
  • Lasso didn't capture complex relationships. Non-linear bagging + feature selection.
• Light Gradient Boosting Machine (LightGBM)
  • Non-linear boosting, captures complex relationships + feature selection, asymmetric (leaf-wise) trees, very fast. (A comparison sketch follows below.)
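A sketch of how these candidates might be fit and compared against the mean-user-rating baseline; X_train/X_val/y_train/y_val stand for the engineered feature matrices and targets from the split above, and the hyperparameters shown are illustrative, not the project's exact settings:

```python
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error

models = {
    "linear_regression": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(max_depth=None, n_jobs=-1, random_state=42),
    "lightgbm": LGBMRegressor(random_state=42),
}

# Baseline: each user's mean training rating, already present as a feature
baseline_mae = mean_absolute_error(y_val, X_val["user_rating_mean"])
print(f"baseline MAE: {baseline_mae:.4f}")

for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    print(f"{name} MAE: {mae:.4f}")
```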
Comparison Methodology
• Metrics (see the evaluation sketch below)
  • Mean Absolute Error (MAE)
    • Rather easy to interpret; large errors have a proportionally large impact.
  • Mean Squared Error (MSE)
    • Squaring gives bigger weight to bigger errors. Common LR loss function.
  • Mean Absolute Percentage Error (MAPE)
    • Average absolute difference in %. Interpretable, but overestimation is downplayed.
  • R² (Coefficient of Determination)
    • In linear models, the % of variance explained. Mathematically increases with the number of features.
  • Mean absolute error per target value, and max error
    • Distribution of the error over rating bins (plus the max error); each range hints at underfitting/overfitting.
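A sketch of an evaluation helper covering these metrics, including the per-rating-bin MAE; it only assumes arrays of true and predicted ratings:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

def evaluate(y_true, y_pred):
    """Return the comparison metrics plus the MAE within each true-rating bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_per_bin = (
        pd.DataFrame({"rating": y_true, "abs_err": np.abs(y_true - y_pred)})
        .groupby("rating")["abs_err"].mean().to_dict()
    )
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
        "max_error": max_error(y_true, y_pred),
        "MAE_per_rating": mae_per_bin,
    }
```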
Comparison Methodology
• RF max_depth=2
  • Had 3 important features, but MAE & MSE > baseline.
• RF max_depth=None
  • MAE < baseline, MSE > baseline – more extreme errors. Smaller errors on ratings <= 3. Unlimited depth – overfit?
• LGBM defaults
  • More evenly spread feature importance, MAE & MSE > baseline.
• LGBM grid-searched on max_depth, learning_rate, n_estimators
  • Similar errors, a little better on ratings <= 3.
• RF grid search on max_depth (see the sketch below)
  • Higher complexity → better MAE & MSE
  • Didn't outperform max_depth=None
  • Not enough time to mitigate overfitting with more n_estimators or feature selection
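A sketch of the max_depth grid search, with an illustrative grid and MAE as the selection score (the exact grid used in the project isn't stated):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 5, 10, 20, None]}  # illustrative values
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```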

Model Selection - decision
• RF max_depth=None
  • MAE < baseline, MSE > baseline (bigger errors), better on smaller bins → filters unlikable items better.
  • Even if we wanted to, we can't pick the baseline, as it gives the same rating to all of a user's candidate items.
  • For lack of time, selected this model even though it might overfit.
  • As expected, it was a little worse on the test data. Hoping the final train on the entire data will mitigate this a little.
  • Top important features are similar in the model trained on the entire data.

Service Deployment
• Dropped items with < 0.1% of ratings.
• Similarity features took too long to compute for too little benefit → dropped.
• Predict ratings for all pairs of users and items the users haven't purchased (see the sketch below).
• Filtered the top 5 recommendations per user, exported to csv.
• Filtered the top 5 recommendations per category (5 categories) across all users, exported to csv.
• Serving flow:
  • User exists and has purchase history → 5 recommended items
  • User exists but has no history → 5 most popular items per category
  • User doesn't exist → 5 most popular items per category
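A sketch of the batch scoring and top-5 export; build_features is a hypothetical helper that rebuilds the model's feature matrix for candidate user-item pairs, and model is the chosen Random Forest:

```python
import pandas as pd

users = train["userName"].unique()
items = train["item_id"].unique()
purchased = set(zip(train["userName"], train["item_id"]))

rows = []
for user in users:
    # Score only items this user has not purchased yet
    candidates = [item for item in items if (user, item) not in purchased]
    features = build_features(user, candidates)   # hypothetical helper
    preds = pd.Series(model.predict(features), index=candidates)
    top5 = preds.nlargest(5)
    rows.extend(
        {"user": user, "item": item, "predicted_rating": score}
        for item, score in top5.items()
    )

pd.DataFrame(rows).to_csv("top5_recommendations.csv", index=False)
```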

Service Deployment
Run app

Questions?
