Team Members: Charlie Chen, Chuanyang Jin, Gavin Yang, Hongyi Zheng (alphabetical order)
This is our project for NYU x Peak Datathon 2022. Take a look at our demonstration slides and notebook!
In this challenge, we are asked to build a recommendation system from the dataset provided by Olist.
As we show in more detail in the following sections, only 10k of the 100k users in the dataset bought two or more items. Moreover, the side information for users and items is limited. This makes classical machine learning recommendation pipelines especially hard to apply, because they generally depend on sequential purchase history and user profiles (for which we only have geospatial information). It is therefore important to formally define our own problem statement for the challenge and how we will approach it.
For the sake of sanity, we will use
Why not state the problem as a cold-start problem?
If we do not take the purchase history into consideration, we only have geospatial information, which is not sufficient for recommendation.
We aim to solve the problem from a probabilistic point of view with the help of deep learning. To be precise,
The first term is the customer-product relation. The second term is the product-product relation. The third term can simply be interpreted as the intrinsic product features.
We hope to recommend products that are either:
- Popular in general, or
- Especially favored by customers from the same state.
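As a rough illustration of these two criteria, the sketch below estimates general popularity and per-state popularity from raw purchase counts. The function name, the input layout, and the toy data are assumptions made for this example; they are not taken from our notebook.

```python
from collections import Counter


def popularity_scores(purchases):
    """Estimate popularity from (customer_state, product_id) pairs.

    Returns (general, by_state), where general[p] is the share of all
    purchases that were product p, and by_state[s][p] is the share of
    purchases made in state s that were product p.
    """
    purchases = list(purchases)
    total_counts = Counter(product for _, product in purchases)
    state_counts = {}
    for state, product in purchases:
        state_counts.setdefault(state, Counter())[product] += 1

    n = len(purchases)
    general = {p: c / n for p, c in total_counts.items()}
    by_state = {
        s: {p: c / sum(counts.values()) for p, c in counts.items()}
        for s, counts in state_counts.items()
    }
    return general, by_state


# Hypothetical toy data: three purchases in SP, one in RJ.
toy = [("SP", "a"), ("SP", "a"), ("SP", "b"), ("RJ", "b")]
general, by_state = popularity_scores(toy)
```

In this toy data, product `a` is the most popular within SP, while product `b` accounts for every purchase made in RJ, which is the kind of state-level preference the second criterion is meant to capture.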
Assuming that whether a user likes a product is positively correlated with whether the user will buy it, we deduce that
Very few customers purchased more than one item, but plenty of vendors sell multiple products. This motivates us to measure category similarity based on the assumption that if two categories of products are sold by the same seller, they tend to be more similar.
We construct an embedding for each category via Item2Vec, and train the embeddings so that the cosine similarity of two embeddings represents similarity between two categories.
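Item2Vec applies skip-gram with negative sampling to sets of items, treating every pair of items in the same set as a (target, context) pair. The sketch below covers only the data preparation (grouping categories by seller and emitting pairs) and the cosine similarity used to compare embeddings; the training step is omitted. The function names and the toy listings are assumptions made for this example.

```python
import math
from itertools import permutations


def seller_baskets(listings):
    """Group category names by seller from (seller_id, category) pairs."""
    baskets = {}
    for seller, category in listings:
        baskets.setdefault(seller, set()).add(category)
    return baskets


def training_pairs(baskets):
    """Emit every ordered pair of distinct categories sold by the same seller."""
    pairs = []
    for categories in baskets.values():
        pairs.extend(permutations(sorted(categories), 2))
    return pairs


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy listings: two sellers with overlapping categories.
listings = [
    ("s1", "toys"), ("s1", "baby"),
    ("s2", "toys"), ("s2", "baby"), ("s2", "electronics"),
]
pairs = training_pairs(seller_baskets(listings))
```

Categories that co-occur under many sellers (here `toys` and `baby`) appear in more training pairs, so a skip-gram model trained on them would tend to pull their embeddings closer together.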
Apart from the two metrics above, we argue the following regarding the recommendation score:
- A customer is more willing to buy a product with a higher average rating.
- As we have shown before, the greater the distance between the customer and the product's seller, the less likely the customer is to buy the product. Therefore, the score should be negatively correlated with the distance. We analyze the correlation and choose to use a log scale.
- A customer is more willing to buy a product whose price is similar to that of their previous purchase. We represent the price difference by the ratio between $\text{price}$ and $\text{previous price}$ on a log scale. Under this assumption, a product priced at $3 \times \text{previous price}$ and a product priced at $\frac{1}{3} \times \text{previous price}$ are equally likely to be recommended to the customer.
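The log-scale features described above can be sketched as follows. The exact transformations (`log1p` for distance, the absolute value of the log price ratio) and the function names are assumptions chosen for illustration; the point is only to show why a threefold price increase and a threefold price decrease are treated symmetrically.

```python
import math


def log_distance(distance_km):
    """Distance on a log scale; log1p keeps a zero distance finite (assumed choice)."""
    return math.log1p(distance_km)


def log_price_gap(price, previous_price):
    """Absolute log ratio between a candidate price and the previous purchase price."""
    return abs(math.log(price / previous_price))


# A product three times as expensive and one a third as expensive
# are equally far from the previous price on this scale.
gap_up = log_price_gap(300.0, 100.0)
gap_down = log_price_gap(100.0 / 3.0, 100.0)
```

Because $\log 3 = -\log \frac{1}{3}$, taking the absolute value makes the two gaps equal, while an identical price gives a gap of zero.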
To make use of all the above metrics, we propose a score formula as follows:
We perform a grid search to find the optimal hyperparameters.
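A generic grid search of this kind can be sketched as below. The hyperparameter names `alpha` and `beta`, the candidate values, and the toy objective are placeholders invented for this example; in practice the objective would be whatever validation metric is computed from the score formula.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and keep the one with the highest value.

    grid: dict mapping a hyperparameter name to a list of candidate values.
    evaluate: callable taking a dict of parameters and returning a float
              (higher is better).
    """
    names = list(grid)
    best_params, best_value = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = evaluate(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    """Hypothetical stand-in for a validation metric, peaked at alpha=0.5, beta=2."""
    return -((params["alpha"] - 0.5) ** 2) - ((params["beta"] - 2) ** 2)


best_params, best_value = grid_search(
    {"alpha": [0.0, 0.5, 1.0], "beta": [1, 2, 3]},
    toy_objective,
)
```

The cost grows with the product of the grid sizes, so the number of candidate values per hyperparameter should stay small.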