
Towards Open-World Recommendation:

An Inductive Model-based Collaborative Filtering Approach

Qitian Wu 1 2, Hengrui Zhang 1, Xiaofeng Gao 1 2, Junchi Yan 1 2, Hongyuan Zha 3

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University. 2 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. 3 School of Data Science, Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen. Correspondence to: Xiaofeng Gao <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). arXiv:2007.04833v2 [cs.IR] 9 Jun 2021.

Abstract

Recommendation models can effectively estimate underlying user interests and predict one's future behaviors by factorizing an observed user-item rating matrix into products of two sets of latent factors. However, the user-specific embedding factors can only be learned in a transductive way, which makes it difficult to handle new users on-the-fly. In this paper, we propose an inductive collaborative filtering framework that contains two representation models. The first model follows conventional matrix factorization and factorizes a group of key users' rating matrix to obtain meta latents. The second model resorts to attention-based structure learning: it estimates hidden relations from query users to key users and learns to leverage the meta latents to inductively compute embeddings for query users via neural message passing. Our model enables inductive representation learning for users and meanwhile guarantees equivalent representation capacity as matrix factorization. Experiments demonstrate that our model achieves promising results for recommendation on few-shot users with limited training ratings and on new unseen users, both of which are commonly encountered in open-world recommender systems.

1. Introduction

As information explosion has become a major factor affecting human life, recommender systems, which can filter useful information and contents of users' potential interest, play an increasingly indispensable role. Recommendation problems can be generally formalized as matrix completion (MC), where one has a user-item rating matrix whose entries, which stand for interactions of users with items (ratings or click behaviors), are partially observed. The goal of MC is to predict missing entries (unobserved or future potential interactions) in the matrix based on the observed ones.

Modern recommender systems need to meet two important requirements for practical utility. First of all, recommendation models should have enough expressiveness to capture diverse user interests and preferences so that the systems can accomplish personalized recommendation. Existing methods based on collaborative filtering (CF) [1] or, interchangeably, matrix factorization (MF) have shown great power for this problem by factorizing the rating matrix into two classes of latent factors (i.e., embeddings) for users and items, respectively, and further leveraging the dot-product of the two factors to predict potential ratings (Hu et al., 2008; Koren et al., 2009; Rendle et al., 2009; Srebro et al., 2004; Zheng et al., 2016). Equivalently, for each user, these methods take a one-hot user encoding as input and assume a user-specific embedding function that maps the user index to a latent factor. Such learnable latent factors can represent a user's preferences in a low-dimensional space. Recent works extend MF with complex architectures, like multi-layer perceptrons (Dziugaite & Roy, 2015), recurrent units (Monti et al., 2017), autoregressive models (Zheng et al., 2016), graph neural networks (van den Berg et al., 2017), etc., achieving state-of-the-art results on both benchmark datasets and commercial systems (Covington et al., 2016; Ying et al., 2018; Kang et al., 2020).

[1] In recent literature, collaborative filtering (CF) approaches often refer to model-based CF, i.e., matrix factorization, while its memory-based counterpart, a heuristic approach, adopts similarity methods like KNN for recommendation.

The second requirement stems from a key observation in real-world scenarios: recommender systems often interact with a dynamic open world where new users, who are not exposed to models during training, may appear in the test stage. This requires that models trained on one set of users manage to adapt to unseen users. However, the above-mentioned CF models fail in this situation since the embedding factors are parameterized for specific users and need to be learned collaboratively with all other users in a transductive setting (see Fig. 1 for illustration). Brute-force ways include:
1) retrain a model with an augmented rating matrix; 2) consider incremental learning (Hu et al., 2008) for new users' embeddings. The former requires much extra time cost, which would be unacceptable for online systems, while the latter is prone to over-fitting and disables on-the-fly inference. There are quite a few studies that propose inductive matrix completion models using user features (Jain & Dhillon, 2013; Xu et al., 2013; Cheng et al., 2016; Ying et al., 2018; Zhong et al., 2018). Their different thinking paradigm, as shown in Fig. 1(b), is to target a user-sharing mapping from user features to user representations, instead of from one-hot user indices. Since the feature space is shared among users, such methods are able to adapt a trained model to unseen users. Nevertheless, feature-driven models may suffer from limited expressiveness when the features are of low quality and correlate weakly with the labels. For example, users with the same age and occupation (commonly used features) may have distinct ratings on movies. Unfortunately, high-quality features that can unveil user interests for personalized recommendation are often hard to collect due to increasing privacy concerns.

Figure 1. (a) The conventional (model-based) collaborative filtering model assumes user-specific embeddings for user representation, which are parameterized in a user-specific manner and limit the model's capability to handle new unseen users. (b) Feature-driven models manage to deal with new users and achieve inductive learning via modeling a user-sharing mapping from user features to representations, but may lack enough expressiveness for diverse user interests. (c) Our proposed inductive CF model absorbs the advantages of both worlds, achieving inductive representation learning for users without compromising representation capacity.

A natural question arises: can we build a recommendation model that guarantees enough expressiveness for personalized preferences and also enables inductive learning? This question remains unexplored so far. In fact, simultaneously meeting the two requirements is a non-trivial challenge when high-quality user features are unavailable. First, to achieve either of them, one often needs to compromise on the other. The user-specific embedding vectors, which assume independent parametrization for different users, can give sufficient capacity for learning distinct user preferences from historical rating patterns (Mikolov et al., 2013). To make inductive learning possible, one needs to construct a shared input feature space among users out of the rating matrix, as an alternative to one-hot user encodings; however, the newly constructed features have relatively insufficient expressive power. Second, the computation based on the new feature space often brings extra time and space costs, which limits the model's scalability on large-scale datasets.

In this paper, we propose an InDuctive Collaborative Filtering model (IDCF) [2] as a general CF framework that achieves inductive learning for user representations and meanwhile guarantees enough expressiveness and scalability, as shown in Fig. 1. Our approach involves two representation models. The first is a conventional matrix factorization model that factorizes a group of key users' rating matrix to obtain their user-specific embeddings, which we call meta latents. On top of that, we further design a relation learning model, specified as a multi-head attention mechanism, that learns hidden graphs between key and query users w.r.t. their historical rating behaviors. The uncovered relation graphs enable neural message passing among users in the latent space and inductive computation of user-specific representations for query users.

Furthermore, we develop two training strategies for practical cases in what we frame as open-world recommendation: inductive learning for interpolation and inductive learning for extrapolation. In the first case, query users are disjoint from key users in training, and the model is expected to provide decent performance on few-shot query users with limited training ratings. In the second case, query users are the same as key users in training, and the trained model aims to tackle zero-shot test users that are unseen before. We show that our inductive model guarantees equivalent capacity as matrix factorization and provides superior expressiveness compared to other inductive models (feature-driven, item-based and graph-based). Empirically, we conduct experiments on five datasets for recommendation (with both explicit and implicit feedback). The comprehensive evaluation demonstrates that IDCF 1) consistently outperforms various inductive models by a large margin on recommendation for few-shot users, and 2) achieves superior results on recommendation for new unseen users (with few historical ratings not used in training). Moreover, compared with transductive models, IDCF provides very close reconstruction error and can even outperform them when training ratings become scarce.

[2] The codes are available at https://github.com/qitianwu/IDCF.
The contributions of this paper are summarized as follows.

1) We propose a new inductive collaborative filtering framework that can inductively compute user representations for new users based on a set of pretrained meta latents, which is suitable for open-world recommender systems. The new approach can serve as a brand new learning paradigm for inductive representation learning.

2) We show that a general version of our model can minimize the reconstruction loss to the same level as a vanilla matrix factorization model under a mild condition. Empirically, IDCF gives very close RMSE to transductive CF models.

3) IDCF achieves very competitive and superior RMSE/NDCG/AUC results on few-shot and new unseen users compared with various inductive models on explicit feedback and implicit feedback data.

As a general model-agnostic framework, IDCF can flexibly incorporate various off-the-shelf MF models (e.g., MLP-based, GNN-based, RNN-based, attention-based, etc.) as backbones, as well as combine with user profile features for hybrid recommendation.

2. Background

Consider a general matrix completion (MC) problem which deals with a user-item rating matrix R = {r_ui}_{M×N}, where M and N are the numbers of users and items, respectively. For explicit feedback, r_ui records the rating value of user u on item i. For implicit feedback, r_ui is a binary entry indicating whether user u rated (or clicked on, reviewed, liked, purchased, etc.) item i or not. The recommendation problem can be generally formalized as: given partially observed entries in R, one needs to estimate the missing values in the matrix.

Existing recommendation models are mostly based on collaborative filtering (CF) or, interchangeably, matrix factorization (MF), where user u (resp. item i) corresponds to a d-dimensional latent factor (i.e., embedding) p_u (resp. q_i). Then one has a prediction model r̂_ui = f(p_u, q_i), where f can be specified as a simple dot product or some complex architecture, like neural networks, graph neural networks, etc. One advantage of CF models is that the user-specific embedding p_u (as learnable parameters) can provide enough expressive power for learning diverse personal preferences from user historical behaviors, and decent generalization ability through collaborative learning with all the users and items. However, such user-specific parametrization limits the model to transductive learning. In practical situations, one cannot have information for all the users that may appear in the future when collecting training data. When it comes to new users in the test stage, the model has to be retrained and cannot deliver on-the-fly inference.
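To make the transductive limitation concrete, below is a minimal sketch of the dot-product CF model described above, assuming a PyTorch implementation; it is illustrative only and is not taken from the released IDCF code. The user factors live in an embedding table indexed by user id, which is exactly why a user absent from training has no row to look up and cannot be scored without retraining. The same form, with f possibly replaced by a network, is what Section 3.1 pretrains on the key users to obtain the meta latents.

```python
import torch
import torch.nn as nn

class DotProductMF(nn.Module):
    """Transductive MF: r_hat(u, i) = <p_u, q_i> + b_u + b_i."""
    def __init__(self, num_users, num_items, dim=64):
        super().__init__()
        self.P = nn.Embedding(num_users, dim)   # user-specific factors p_u (the transductive part)
        self.Q = nn.Embedding(num_items, dim)   # item factors q_i
        self.b_u = nn.Embedding(num_users, 1)
        self.b_i = nn.Embedding(num_items, 1)

    def forward(self, u, i):
        # u, i: LongTensors of user / item indices for observed entries
        dot = (self.P(u) * self.Q(i)).sum(-1)
        return dot + self.b_u(u).squeeze(-1) + self.b_i(i).squeeze(-1)

# one training step on a mini-batch of observed (u, i, r) triples, MSE loss for explicit feedback
model = DotProductMF(num_users=6040, num_items=3706)   # ML-1M sizes, see Table 1
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
u = torch.tensor([0, 1, 2]); i = torch.tensor([10, 20, 30]); r = torch.tensor([4.0, 3.0, 5.0])
loss = nn.functional.mse_loss(model(u, i), r)
opt.zero_grad(); loss.backward(); opt.step()
```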
3. Methodology

We propose the InDuctive Collaborative Filtering (IDCF) model. Our high-level methodology stems from a key observation: there exist one (or multiple) latent relational graphs among users that represent preference proximity and behavioral interactions. For instance, social networks and following networks in social media can be seen as realizations of such relational graphs, but in most cases the graph structures are unobserved and implicitly affect users' behaviors. If we can identify the graph structures, we can leverage the idea of message passing (Scarselli et al., 2009; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2020), propagating learned embeddings from one group of users to others, especially in an inductive manner.

We formulate our model through two sets of users: 1) key users (denoted by U_k), for which we learn their embeddings by matrix factorization and use them as meta latents; 2) query users (denoted by U_q), for which we consider neural message passing to inductively compute their embeddings. Assume |U_k| = M_k and |U_q| = M_q. Correspondingly, we have two rating matrices: R_k = {r_ui}_{M_k×N} (given by U_k) and R_q = {r_u'i}_{M_q×N} (given by U_q). Based on this, we further consider two scenarios.

Scenario I: Inductive learning for interpolation. U_k ∩ U_q = ∅, i.e., query users are disjoint from key users. In training, the model learns to leverage the meta latents of key users to compute representations for another group of query users in a supervised way. The model is expected to perform robustly on few-shot query users with limited training ratings.

Scenario II: Inductive learning for extrapolation. U_k = U_q, i.e., key and query users are the same. In training, the model learns to use the meta latents given by key users to represent themselves in a self-supervised way. The trained model then aims to deal with zero-shot test users (with limited observed ratings not used in training) that are unseen before.

Notice that in the above two cases, we assume no side information, such as user profile features (age, occupation, etc.), social networks, or item content features, besides the observed user-item rating matrix. We frame the problem as open-world recommendation, which requires the model to deal with few-shot and zero-shot users. We present our model framework for the two settings in Fig. 2 and go into the details in the following.

Figure 2. Framework of inductive collaborative filtering for open-world recommendation. We consider two scenarios, inductive learning for interpolation (query users are different from key users) and extrapolation (query users are the same as key users), which aim to handle few-shot query users and zero-shot new users in the test stage. In both cases, the learning procedure contains pretraining and adaption. The pretraining learns initial user meta latents via matrix factorization over key users' ratings. The adaption optimizes a relation model, which estimates hidden relations from key to query users and inductively computes embeddings for query users. In particular, in the extrapolation case, we introduce a self-supervised contrastive loss that enforces similarity between the meta latents and the inductively computed embeddings for the same users.

3.1. Matrix Factorization Model

We first pretrain a (transductive) matrix factorization model for U_k using R_k, denoted as r̂_ui = f_θ(p_u, q_i), where p_u ∈ R^d denotes a user-specific embedding for user u in U_k, q_i ∈ R^d denotes an item-specific embedding for item i, and f_θ can be a simple dot-product or a network with parameters θ. Section 5.1 gives details for two specifications of f_θ, using a neural network and a graph convolution network, as used in our implementation. Denote P_k = {p_u}_{M_k×d} and Q = {q_i}_{N×d}; the objective becomes

    min_{P_k, Q, θ} D_{S_k}(R̂_k, R_k),    (1)

where we define R̂_k = {r̂_ui}_{M_k×N}, D_{S_k}(R̂_k, R_k) = (1/T_k) Σ_{(u,i)∈S_k} l(r_ui, r̂_ui), and S_k ∈ ([M_k] × [N])^{T_k} is a set of size T_k containing the indices of observed entries in R_k. Here l(r_ui, r̂_ui) can be the MSE loss for explicit user feedback or the cross-entropy loss for implicit user feedback.

We treat the pretrained factors P_k as meta latents and use them to inductively compute user latents for query users via a relation model.

3.2. Inductive Relation Model

Assume C = {c_uu'}_{M_k×M_q}, where c_uu' ∈ R denotes a weighted edge from user u ∈ U_k to user u' ∈ U_q, and define c_u' = [c_1u', c_2u', ..., c_{M_k}u']^T as the u'-th column of C. Then we express the embedding of user u' as p̃_u' = c_u'^T P_k, a weighted sum of the embeddings of key users. The rating can be predicted by r̂_u'i = f_θ(p̃_u', q_i), and the problem of model optimization becomes

    min_{C, Q} D_{S_q}(R̂_q, R_q),    (2)

where we define R̂_q = {r̂_u'i}_{M_q×N}, D_{S_q}(R̂_q, R_q) = (1/T_q) Σ_{(u',i)∈S_q} l(r_u'i, r̂_u'i), and S_q ∈ ([M_q] × [N])^{T_q} is a set of size T_q containing the indices of observed entries in R_q. The essence of the above method is taking attentive pooling as message passing from key to query users. We first justify this idea by analyzing its capacity and then propose a parameterized model enabling inductive learning.

Theoretical Justification. If we use the dot-product for f_θ in the MF model, then r̂_u'i = p̃_u'^T q_i. We compare Eq. (2) with using matrix factorization over R_q:

    min_{P̃_q, Q} D_{S_q}(R̂_q, R_q),    (3)

where P̃_q = {p̃_u'}_{M_q×d}, and have the following theorem.

Theorem 1. Assume Eq. (3) can achieve D_{S_q}(R̂_q, R_q) < ε and the optimal P_k given by Eq. (1) satisfies column-full-rank. Then there exists at least one solution for C in Eq. (2) such that D_{S_q}(R̂_q, R_q) < ε.

The only condition, that P_k is column-full-rank, can be trivially guaranteed since d ≪ N. The theorem shows that the proposed model can minimize the reconstruction loss of MC to at least the same level as matrix factorization, which gives sufficient capacity for learning personalized user preferences from historical rating patterns.

Parametrization. We have shown that using attentive pooling does not sacrifice model capacity compared to MF/CF models under a mild condition. However, directly optimizing over C is intractable due to its O(M_k M_q) parameter space, and c_u' is a user-specific high-dimensional vector which disallows inductive learning. Hence, we parametrize C with an attention network, reducing parameters and enabling inductive learning. Concretely, we estimate the adjacency score between user u' and u as

    c_u'u = e^T [W_q d_u' ⊕ W_k p_u] / Σ_{u0∈U_k} e^T [W_q d_u' ⊕ W_k p_u0],    (4)

where e ∈ R^{2d×1}, W_q ∈ R^{d×d}, W_k ∈ R^{d×d} are trainable parameters, ⊕ denotes concatenation, and d_u' = Σ_{i∈I_u'} q_i. Here I_u' = {i | r_u'i > 0} includes the historically rated items of user u'. The attention network captures first-order user proximity on the behavioral level and also maintains second-order proximity: users with similar historical ratings on items would have similar relations to other users. Besides, if I_u' is empty (for extreme cold-start recommendation), we can randomly select a group of items from all the candidates. Also, if users' profile features are available, we can harness them as d_u'. We provide details in Appendix C. Yet, in the main body of our paper, we focus on learning from the user-item rating matrix, i.e., the common setting for CF approaches.
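The snippet below sketches a single head of this relation model (a PyTorch illustration of ours, not the released implementation; in particular, the softmax normalization used here in place of the plain ratio in Eq. (4) is our assumption, made for numerical stability). It forms d_u' by summing the pretrained item latents of the query user's rated items, scores the key users with the additive attention of Eq. (4), and returns p̃_u' as the attention-weighted sum of the key users' meta latents. The full model in Eqs. (5)-(6) additionally samples key users per head, applies per-head value transforms, and projects the concatenated heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationHead(nn.Module):
    """Single-head relation model: attention from one query user to the key users (cf. Eq. 4)."""
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)   # transforms d_u' (query side)
        self.W_k = nn.Linear(dim, dim, bias=False)   # transforms meta latents p_u (key side)
        self.e = nn.Linear(2 * dim, 1, bias=False)   # scoring vector e

    def forward(self, d_query, P_key):
        # d_query: [B, d]   summed item latents of each query user's rated items
        # P_key:   [M_k, d] frozen meta latents of the (sampled) key users
        q = self.W_q(d_query).unsqueeze(1).expand(-1, P_key.size(0), -1)   # [B, M_k, d]
        k = self.W_k(P_key).unsqueeze(0).expand(d_query.size(0), -1, -1)   # [B, M_k, d]
        scores = self.e(torch.cat([q, k], dim=-1)).squeeze(-1)             # [B, M_k]
        c = F.softmax(scores, dim=-1)      # attention weights c_{u'u}; each row sums to one
        return c @ P_key                   # p~_u' = sum_u c_{u'u} p_u (weighted sum of meta latents)

# usage: d_u' is the sum of pretrained item latents over the query user's rated items I_u'
Q_items = torch.randn(3706, 64)                      # frozen item latents from pretraining (illustrative)
rated = [5, 17, 42]                                  # I_u' for one query user
d_query = Q_items[rated].sum(dim=0, keepdim=True)    # [1, 64]
P_key = torch.randn(2048, 64)                        # frozen meta latents of sampled key users (illustrative)
p_tilde = RelationHead(64)(d_query, P_key)           # inductively computed embedding, [1, 64]
```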
The normalization in Eq. (4) requires computation over all the key users, which limits scalability to large datasets. Therefore, we use a sampling strategy to control the number of key users in the relation graph of each query user, and further consider multi-head attention where each head independently samples a different subset of key users. The attention score given by the l-th head is

    c^(l)_u'u = (e^(l))^T [W_q^(l) d_u' ⊕ W_k^(l) p_u] / Σ_{u0∈U_k^(l)} (e^(l))^T [W_q^(l) d_u' ⊕ W_k^(l) p_u0],    (5)

where U_k^(l) denotes a subset of key users sampled from U_k. Each attention head independently aggregates the embeddings of a different subset of key users, and the final inductive representation for user u' is given by

    p̃_u' = W_o [ ⊕_{l=1}^{L} Σ_{u∈U_k^(l)} c^(l)_u'u W_v^(l) p_u ],    (6)

where W_o ∈ R^{d×Ld} and W_v^(l) ∈ R^{d×d}. To keep the notation clean, we denote p̃_u' = h_w(d_u') and w = ∪_{l=1}^{L} {e^(l), W_q^(l), W_k^(l), W_v^(l)} ∪ {W_o}.

With fixed meta latents P_k and Q, we can consider optimization for our inductive relation model:

    min_{w, θ} D_{S_q}(R̂_q, R_q).    (7)

We found that using a fixed Q here contributes to much better performance than optimizing it in the second stage.

Next, we analyze the generalization ability of the inductive model on query users. We consider f_θ as the dot-product operation and assume c_u'u ∈ R+ to simplify the analysis. We show that the generalization error D(R̂_q, R_q) = E_{(u',i)}[l(r_u'i, r̂_u'i)] on query users is bounded in terms of the number of key users and the number of observed ratings of query users.

Theorem 2. Assume 1) D is L-Lipschitz, 2) for any r̂_u'i ∈ R̂_q we have |r̂_u'i| ≤ B, and 3) the L1-norm of c_u' is bounded by H. Then with probability at least 1 − δ over the random choice of S_q ∈ ([M_q] × [N])^{T_q}, it holds for any R̂_q that the gap between D(R̂_q, R_q) and D_{S_q}(R̂_q, R_q) is bounded by

    O( 2LHB √(2 M_q ln M_k / T_q) + √(ln(1/δ) / T_q) ).    (8)

The theorem shows that the generalization error bound depends on the size of U_k. Theorems 1 and 2 show that the configuration of U_k has an important effect on model capacity and generalization ability. On one hand, we need to make the key users in U_k 'representative' of diverse user behavior patterns on item consumption in order to guarantee enough representation capacity. On the other hand, we need to control the size of U_k to maintain generalization ability.

3.3. Model Optimization

The complete training process is comprised of pretraining and adaption. In the first stage, we train an MF model in the transductive setting via Eq. (1) and obtain the embeddings P_k, Q and the network f_θ. The adaption involves optimizing the inductive relation model h_w and finetuning the prediction network f_θ via Eq. (7). In particular, for inductive learning for extrapolation (i.e., U_k = U_q), we further consider a self-supervised contrastive loss that pursues similarity between the inductively computed user embeddings and the ones given by the MF model. Concretely, the objective is defined as

    min_{w, θ} D_{S_q}(R̂_q, R_q) + λ L_C(P_k, P̃_q),    (9)

where λ is a trade-off hyper-parameter and

    L_C(P_k, P̃_q) = (1/M_q) Σ_u log [ exp(p_u^T p̃_u) / Σ_{u'∈U_q} exp(p_u^T p̃_u') ].    (10)

The summation in the denominator can be approximated by in-batch negative samples with mini-batch training.
4. Comparison with Existing Works

We discuss related works and highlight our differences.

Feature-driven Recommendation. Collaborative filtering (CF) models do not assume any side information other than the rating matrix, but they cannot be trained in an inductive way due to the learnable user-specific embedding p_u. To address the issue, previous works leverage side information, e.g., user profile features, to achieve inductive learning (Jain & Dhillon, 2013; Xu et al., 2013; Cheng et al., 2016; Ying et al., 2018; Zhong et al., 2018). Define user features (like age, occupation) as a_u and item features (like movie genre, director) as b_i. The feature-driven model targets a prediction model r̂_ui = g(a_u, b_i). Since the space of a_u is shared among users, a model trained on one group of users can adapt to other users without retraining. However, feature-driven models often provide limited performance since the shared feature space is not expressive enough compared to user-specific embedding factors (see Appendix A for more discussion). Another issue is that high-quality features are hard to collect in practice. A key advantage of IDCF is its capability for inductive representation learning without using features.

Inductive Matrix Completion. There are few existing works that attempt to handle inductive matrix completion using only the user-item rating matrix. F-EAE (Hartford et al., 2018) puts forward an exchangeable matrix layer that takes a whole rating matrix as input and inductively outputs predictions for missing ratings. However, the scalability of F-EAE is limited since it requires the whole rating matrix as input for training and inference, while IDCF enables mini-batch training and efficient inference. Besides, IGMC (Zhang & Chen, 2020) proposes to use local subgraphs of user-item pairs in a bipartite graph of rating information as input features, and further adopts graph neural networks to encode the subgraph structures for rating prediction. The model achieves inductive learning by replacing users' one-hot indices with shared input features (i.e., index-free local subgraph structures). However, the expressiveness of IGMC is limited since the local subgraph structures can be indistinguishable for users with distinct behaviors (see Appendix A for more discussion), and the issue becomes worse for implicit feedback data. By contrast, IDCF has equivalent expressiveness to the original CF models. Another drawback of F-EAE and IGMC is that they cannot output user representations. Differently, IDCF maintains the ability to give user-specific representations, which reflect users' preferences and can be used for downstream tasks (like user behavior modeling (Liu et al., 2020), user-controllable recommendation (Ma et al., 2019; Cen et al., 2020), targeted advertisement and influence maximization (Khalil et al., 2017; Manchanda et al., 2019), etc.).

Item-based CF Models. Previous works use item embeddings as representations for users. (Cremonesi et al., 2010; Kabbur et al., 2013) adopt a combination of the items rated by a user to compute the user embedding, freeing the model from learning parametric user-specific embeddings. Furthermore, there are quite a few auto-encoder architectures for the recommendation problem, which take a user's rating vector (ratings on all the items) as input, estimate the user embedding (as latent variables), and decode the missing values in the rating vector (Sedhain et al., 2015; Liang et al., 2018). With item embeddings and a user's rating history, these methods enable inductive learning for user representations and can adapt to new users on-the-fly. On the methodological level, IDCF has the following differences. First, IDCF assumes learnable embeddings for both users and items, which maintains equivalent representation capacity as general MF/CF models (as proved in Theorem 1); item-based models only consider learnable embeddings for items and may suffer from limited representation capacity. Second, IDCF learns message-passing graphs among users with a relation model to obtain better user representations, instead of directly aggregating a given observed set of item embeddings as item-based models do.

5. Experiment

In this section, we apply the proposed model IDCF to several real-world recommendation datasets to verify and dissect its effectiveness. Before going into the experiment results, we first introduce the experiment setup, including dataset information, evaluation protocol, and implementation details.

5.1. Experiment Setups

Datasets. We consider five common recommendation benchmarks: Douban, Movielens-100K (ML-100K), Movielens-1M (ML-1M), Amazon-Books and Amazon-Beauty. Douban, ML-100K and ML-1M have explicit user ratings on movies. Amazon-Books and Amazon-Beauty contain implicit user feedback (records of users' interactions with items). For Douban and ML-100K, we use the training/testing splits provided by (Monti et al., 2017). For ML-1M [3], we follow previous works (van den Berg et al., 2017; Hartford et al., 2018; Zhang & Chen, 2020) and use a 9:1 training/testing split. For the two Amazon datasets, we use the last ten interactions of each user for testing and the remaining for training. We leave out 5% of the training data as a validation set for early stopping. Note that the raw Amazon-Books and Amazon-Beauty datasets [4] are very large and sparse, and we filter out infrequent items and users with less than five ratings. The statistics of the datasets used in our experiments are summarized in Table 1.

[3] https://grouplens.org/datasets/movielens/
[4] http://jmcauley.ucsd.edu/data/amazon/

Table 1. Statistics of the five datasets used in our experiments. Amazon-Books and Amazon-Beauty contain implicit user feedback, while Douban, ML-100K and ML-1M have explicit feedback (ratings range within [1, 2, 3, 4, 5]).

Dataset        | # Users | # Items | # Ratings | Density | # Key/Query Users | # Training/Test Instances
Douban         | 3,000   | 3,000   | 0.13M     | 0.0152  | 2,131/869         | 80,000/20,000
Movielens-100K | 943     | 1,682   | 0.10M     | 0.0630  |                   | 123,202/13,689
Movielens-1M   | 6,040   | 3,706   | 1.0M      | 0.0447  | 5,114/926         | 900,199/100,021
Amazon-Books   | 52,643  | 91,599  | 2.1M      | 0.0012  | 49,058/3,585      | 2,405,036/526,430
Amazon-Beauty  | 2,944   | 57,289  | 0.08M     | 0.0004  | 780/2,164         | 53,464/29,440

Evaluation. For datasets with explicit feedback, the goal is to predict users' ratings on items, i.e., to estimate the missing values in the user-item rating matrix. The task can be seen as a multi-class classification or regression problem. We use RMSE and NDCG to evaluate the overall reconstruction error and the personalized ranking performance. RMSE counts the overall l2 distance from the predicted ratings to the ground truth, while NDCG is an averaged score that measures the consistency between the ranking of the predicted ratings and that of the ground truth for each user. For datasets with implicit feedback, the goal is to predict whether a user interacts with an item. The task is essentially a one-class classification problem. Since the data are very sparse and only contain positive instances, we uniformly sample five items as negative samples for each clicked item and adopt AUC and NDCG to measure the global and the personalized ranking accuracy, respectively. AUC is short for Area Under the ROC Curve, which measures the global consistency between the ranking of all the predicted user-item interactions and the ground truth (which ranks all the 1's before the 0's). More details on the evaluation metrics are provided in Appendix D.2.
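For concreteness, here is one way to compute the per-user NDCG and the global AUC described above. This is a sketch of ours using NumPy and scikit-learn; the exact gain/discount convention and any truncation follow Appendix D.2 of the paper rather than this snippet.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def dcg(rels):
    # discounted cumulative gain of a relevance list in the order being evaluated
    rels = np.asarray(rels, dtype=float)
    return float(np.sum((2.0 ** rels - 1.0) / np.log2(np.arange(2, len(rels) + 2))))

def user_ndcg(pred, truth):
    """NDCG for one user: how consistent the predicted ranking is with the ground-truth one."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    order = np.argsort(-pred)                       # items ranked by predicted score
    ideal = np.sort(truth)[::-1]                    # ground-truth ratings in the ideal order
    return dcg(truth[order]) / dcg(ideal)

# explicit feedback: RMSE over test entries, NDCG averaged over users
pred, truth = np.array([4.2, 3.1, 4.9]), np.array([4.0, 3.0, 5.0])
rmse = float(np.sqrt(np.mean((pred - truth) ** 2)))
ndcg = user_ndcg(pred, truth)

# implicit feedback: AUC over all predicted interactions (1 = observed click, 0 = sampled negative)
labels = np.array([1, 0, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.4, 0.7, 0.1])
auc = roc_auc_score(labels, scores)
```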
Implementations. We consider two specifications for f_θ in our model: IDCF-NN, which adopts a multi-layer perceptron for f, and IDCF-GC, which uses a graph convolution network for f.

Feedforward Neural Network as Matrix Factorization Model (IDCF-NN). We follow the architecture in NNMF (Dziugaite & Roy, 2015) and use a neural network for f_θ. Here we combine a three-layer neural network and a shallow dot-product operation. Concretely,

    f_θ(p_u, q_i) = (p_u^T q_i + nn([p_u ∥ q_i ∥ p_u ⊙ q_i])) / 2 + b_u + b_i,    (11)

where nn is a three-layer neural network using tanh activation, ∥ denotes concatenation, ⊙ denotes the element-wise product, and b_u, b_i are bias terms for user u and item i, respectively.

Graph Convolution Network as Matrix Factorization Model (IDCF-GC). We follow the architecture in GCMC (van den Berg et al., 2017) and adopt a graph convolution network for f_θ. Besides the user-specific embedding for user u and the item-specific embedding for item i, we consider the embeddings of user u's rated items and of the users who rated item i, i.e., the one-hop neighbors of u and i in the user-item bipartite graph. Denote N_{u,m} = {i | r_ui = m} as user u's rated items with rating value m, and N_{i,m} = {u | r_ui = m} as the users who rated item i with rating value m, for m ≥ 1. We use graph convolution to aggregate information from neighbors,

    m_{u,m} = ReLU( (1/|N_{u,m}|) Σ_{i∈N_{u,m}} W_{q,m} q_i ),
    n_{i,m} = ReLU( (1/|N_{i,m}|) Σ_{u∈N_{i,m}} W_{p,m} p_u ),    (12)

and combination functions m_u = FC({m_{u,m}}_m), n_i = FC'({n_{i,m}}_m), where FC denotes a fully-connected layer. Then we define the output function

    f(p_u, q_i, {p_u}_{u∈N_i}, {q_i}_{i∈N_u}) = nn'([p_u ⊙ q_i ∥ p_u ⊙ m_u ∥ n_i ⊙ q_i ∥ n_i ⊙ m_u]) + b_u + b_i,    (13)

where nn' is a three-layer neural network using the ReLU activation function. In Appendix D.1, we provide more details on the hyper-parameter settings. Moreover, we specify l(r̂_ui, r_ui) as the MSE loss (resp. cross-entropy loss) for explicit (resp. implicit) feedback data.
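As a concrete reference for Eq. (11), the following is a sketch of the IDCF-NN prediction function (PyTorch assumed; hidden widths and other details not stated in the text are illustrative placeholders, and the actual hyper-parameters are those in Appendix D.1).

```python
import torch
import torch.nn as nn

class IDCFNNPredictor(nn.Module):
    """Eq. (11): average of a dot-product term and a three-layer tanh MLP, plus user/item biases."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(                    # input: [p_u || q_i || p_u * q_i]
            nn.Linear(3 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, p_u, q_i, b_u, b_i):
        dot = (p_u * q_i).sum(-1, keepdim=True)                      # p_u^T q_i
        mlp = self.mlp(torch.cat([p_u, q_i, p_u * q_i], dim=-1))     # nn([p_u || q_i || p_u ⊙ q_i])
        return 0.5 * (dot + mlp).squeeze(-1) + b_u + b_i

# scoring a batch of 8 (user, item) pairs, e.g. with inductively computed p~_u' for query users
p_u, q_i = torch.randn(8, 64), torch.randn(8, 64)
b_u, b_i = torch.zeros(8), torch.zeros(8)
r_hat = IDCFNNPredictor()(p_u, q_i, b_u, b_i)
```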
5.2. Comparative Results

5.2.1. Interpolation for Few-Shot Users

Setup. We study the performance on few-shot query users with limited training ratings. We split the users in each dataset into two sets: users with more than δ training ratings, denoted as U1, and those with less than δ training ratings, denoted as U2. We set δ = 30 for the Douban, ML-100K and ML-1M datasets, and δ = 20 for the Amazon datasets. For IDCF, we adopt the training algorithm of inductive learning for interpolation, with U1 as key users and U2 as query users. We use the training ratings of U1 and U2 as input data for pretraining and adaption, respectively. We compare with several competitors, including 1) powerful transductive methods PMF (Salakhutdinov & Mnih, 2007), NNMF (Dziugaite & Roy, 2015) and GCMC (van den Berg et al., 2017), 2) inductive feature-driven methods NIMC (Zhong et al., 2018) and BOMIC (Ledent et al., 2020), and 3) the inductive matrix completion model IGMC (Zhang & Chen, 2020). We train each competitor with the training ratings of U1 and U2.

Table 2. Test RMSE and NDCG for all the users (All) and few-shot users (FS) on Douban, ML-100K and ML-1M. We highlight the best scores among all the (resp. inductive) models with bold (resp. underline). Inductive indicates whether the method can achieve inductive learning. Feature indicates whether the method relies on user features.

Method         | Inductive | Feature | Douban RMSE (All/FS) | Douban NDCG (All/FS) | ML-100K RMSE (All/FS) | ML-100K NDCG (All/FS) | ML-1M RMSE (All/FS) | ML-1M NDCG (All/FS)
PMF            | No  | No  | 0.737/0.718 | 0.939/0.954 | 0.932/1.003 | 0.858/0.843 | 0.851/0.946 | 0.919/0.940
NNMF           | No  | No  | 0.729/0.705 | 0.939/0.952 | 0.925/0.987 | 0.895/0.878 | 0.848/0.940 | 0.920/0.937
GCMC           | No  | No  | 0.731/0.706 | 0.938/0.956 | 0.911/0.989 | 0.900/0.886 | 0.837/0.947 | 0.923/0.939
NIMC           | Yes | Yes | 0.732/0.745 | 0.928/0.931 | 1.015/1.065 | 0.832/0.824 | 0.873/0.995 | 0.889/0.904
BOMIC          | Yes | Yes | 0.735/0.747 | 0.923/0.925 | 0.931/1.001 | 0.828/0.815 | 0.847/0.953 | 0.905/0.924
F-EAE          | Yes | No  | 0.738/-     | -/-         | 0.920/-     | -/-         | 0.860/-     | -/-
IGMC           | Yes | No  | 0.721/0.728 | -/-         | 0.905/0.997 | -/-         | 0.857/0.956 | -/-
IDCF-NN (ours) | Yes | No  | 0.738/0.712 | 0.939/0.956 | 0.931/0.996 | 0.896/0.880 | 0.844/0.952 | 0.922/0.940
IDCF-GC (ours) | Yes | No  | 0.733/0.712 | 0.940/0.956 | 0.905/0.981 | 0.901/0.884 | 0.839/0.944 | 0.924/0.940

Table 2 reports the RMSE and NDCG on the test ratings of all the users and of the few-shot users in U2 on Douban, ML-100K and ML-1M. As we can see, IDCF-NN (resp. IDCF-GC) gives RMSE and NDCG on all the users and on the query users that are very close to NNMF (resp. GCMC), which suggests that our inductive model can achieve similar reconstruction error and ranking accuracy as its transductive counterpart. The results validate our theoretical analysis in Section 3: IDCF possesses the same representation capacity as the matrix factorization model. Even though IDCF enables inductive representation learning via a parameter-sharing relation model, it does not sacrifice any representation power. Compared with the inductive methods, IDCF-GC achieves the best RMSE and NDCG for query users in most cases. The results demonstrate the superiority of IDCF over other feature-driven and inductive matrix completion models.

Table 3. Test AUC and NDCG for few-shot query users (Query) and new users (New) on Amazon-Books and Amazon-Beauty.

Method   | Books AUC (Query/New) | Books NDCG (Query/New) | Beauty AUC (Query/New) | Beauty NDCG (Query/New)
PMF      | 0.917/-     | 0.888/-     | 0.779/-     | 0.769/-
NNMF     | 0.919/-     | 0.891/-     | 0.790/-     | 0.763/-
NGCF     | 0.916/-     | 0.896/-     | 0.793/-     | 0.775/-
PinSAGE  | 0.923/-     | 0.901/-     | 0.790/-     | 0.775/-
FISM     | -/0.752     | -/0.792     | -/0.613     | -/0.678
MultVAE  | -/0.738     | -/0.701     | -/0.644     | -/0.679
IDCF-NN  | 0.944/0.939 | 0.928/0.920 | 0.792/0.750 | 0.783/0.774
IDCF-GC  | 0.938/0.946 | 0.921/0.930 | 0.801/0.791 | 0.772/0.791

Table 3 shows the AUC and NDCG on the test interactions of the few-shot users in U2 on Amazon-Books and Amazon-Beauty. Since the Amazon datasets have no user features, the feature-driven competitors cannot be applied, and IGMC fails to work with implicit feedback. We therefore compare with two other graph-based recommendation models, NGCF (Wang et al., 2019) and PinSAGE (Ying et al., 2018). As shown in Table 3, IDCF-NN and IDCF-GC significantly outperform the transductive models in the implicit feedback setting, with 2.3%/3.0% (resp. 1.0%/1.0%) improvement of AUC/NDCG on Amazon-Books (resp. Amazon-Beauty). The two Amazon datasets are both very sparse, with rating density of approximately 0.001. One implication is that our inductive model can provide better performance than transductive models for users with few training ratings.

5.2.2. Extrapolation for New Users

Setup. We then investigate the model's generalization performance on new users that are unseen in training. We assume the model is only exposed to the training ratings of U1 and test its performance on the test ratings of U2. Concretely, for IDCF, we leverage the training algorithm of inductive learning for extrapolation with U1 as both key and query users. The two-stage training uses the training ratings of U1. We compare with the inductive models NIMC, BOMIC, IGMC and the item-based CF models FISM (Kabbur et al., 2013) and MultVAE (Liang et al., 2018).

Table 4. Test RMSE and NDCG for new users on Douban, ML-100K and ML-1M.

Method  | Douban RMSE/NDCG | ML-100K RMSE/NDCG | ML-1M RMSE/NDCG
NIMC    | 0.766/0.921 | 1.089/0.864 | 1.059/0.883
BOMIC   | 0.764/0.920 | 1.088/0.859 | 1.057/0.879
FISM    | 1.910/0.824 | 1.891/0.760 | 2.283/0.771
MultVAE | 2.783/0.823 | 2.865/0.758 | 2.981/0.792
IGMC    | 0.743/-     | 1.051/-     | 0.997/-
IDCF-NN | 0.749/0.955 | 1.078/0.877 | 0.994/0.941
IDCF-GC | 0.723/0.955 | 1.011/0.881 | 0.957/0.942

Table 4 reports results on the test ratings of the new users in U2 on Douban, ML-100K and ML-1M. Notably, IDCF-GC outperforms the best competitors by a large margin, with RMSE (resp. NDCG) improvements of 2.6% (resp. 3.6%) on Douban, 3.8% (resp. 2.0%) on ML-100K and 4.0% (resp. 6.7%) on ML-1M. Also, Table 3 reports the test AUC and NDCG for new users on the two Amazon datasets. The results show that both IDCF-NN and IDCF-GC outperform the other competitors, with 25.7% (resp. 17.4%) and 22.8% (resp. 16.4%) improvements of AUC (resp. NDCG) on Amazon-Books and Amazon-Beauty, respectively. Such results demonstrate the superior power of IDCF for addressing new users in open-world recommendation.

5.3. Further Discussion

Sparse Data and Few-shot Users. A successful recommender system is supposed to handle data sparsity and few-shot users with few historical ratings. Here we construct sparse datasets by using 50%, 20%, 10%, 5%, 1% and 0.1% of the training ratings in Movielens-1M, and then compare the test RMSEs of query users in Fig. 3(a). Also, Fig. 3(b) compares the test RMSEs for users with different numbers of historical ratings under 50% sparsity. As shown in Fig. 3(a), as the dataset becomes sparser, the RMSE of every model degrades, but the degradation of our inductive models IDCF-NN and IDCF-GC is much smaller compared with the transductive models NNMF and GCMC. In Fig. 3(b), we find that users with more historical ratings usually have better RMSE scores than few-shot users.
By contrast, our inductive models IDCF-NN and IDCF-GC exhibit a much smoother decrease and even outperform the transductive methods NNMF and GCMC for users with very few ratings. In the extreme case with less than five historical ratings, IDCF-GC achieves a 2.5% improvement in RMSE over the best transductive method GCMC.

Figure 3. Evaluation. (a) Overall RMSE w.r.t. the sparsity ratio. (b) User-specific RMSE w.r.t. the number of a user's training ratings. (c) Attention weights of query users (y-axis) on key users (x-axis). (d) Key users' accumulated attention weights w.r.t. the number of historical ratings. (e) Scalability test.

Attention Weight Distribution. In Fig. 3(c) we visualize the attention weights of IDCF-NN from query users to key users on ML-1M. There is an interesting phenomenon that some key users appear to be very 'important': most query users have high attention scores on them. It indicates that the embeddings of these key users are informative and can provide powerful expressiveness for query users' preferences. In Fig. 3(d) we further plot key users' accumulated attention weights (the sum of attention scores over all the query users) w.r.t. the number of historical ratings. We can see that key users with more historical ratings are more likely to have large attention weights from query users, though they are also more likely to have low attention scores. This observation is consistent with the intuition that the interests of users with more historical ratings are easier for the model to identify. Also, the results give an important hint for selecting useful key users: informative key users are more likely to exist among users with more historical ratings. In Appendix E, we compare different ways of splitting key and query users and provide more discussion on this point.

Scalability Test. We further investigate the scalability of IDCF-GC compared with two GNN-based counterparts, IGMC and GCMC. We measure the training time per epoch on ML-1M using a GTX 1080Ti with 11G memory. Here we truncate the dataset and use different numbers of ratings for training. The results are shown in Fig. 3(e). As we can see, when the dataset size becomes large, the training times per epoch of the three models all grow linearly. IDCF takes roughly twice the training time of GCMC, while IGMC is approximately ten times slower than IDCF. Nevertheless, although IDCF costs more training time than GCMC, the latter cannot tackle new unseen users without retraining the model at test stage.

More Insights on IDCF's Effectiveness. The relation model aims at learning graph structures among users that explore more useful proximity information and maximize the benefits of message passing for inductive representation learning. Existing graph-based (recommendation) models rely on a given observed graph, on top of which a GNN model plays the role of a strong inductive bias. Such an inductive bias may be beneficial in some cases (if the graph is fully observed and possesses the homophily property) and harmful in others (if the graph has noisy or missing links, or has heterophilous structures). Differently, IDCF enables mutual reinforcement between structure learning and message passing with the guidance of supervised signals from downstream tasks, in a fully data-driven manner.

6. Conclusions and Outlook

We have proposed an inductive collaborative filtering framework that learns hidden relational graphs among users to allow effective message passing in the latent space. It accomplishes inductive computation of user-specific representations without compromising representation capacity and scalability. Our model achieves state-of-the-art performance on inductive collaborative filtering for recommendation with few-shot and zero-shot users, which are commonly encountered in open-world recommendation.

The core idea of IDCF opens a new way for the next generation of representation learning: one can consider a pretrained representation model for one set of existing entities whose representations (through some simple transformations) can be generalized to efficiently compute inductive representations for others, enabling the model to flexibly handle new entities in the wild.

Acknowledgement

We would like to thank the anonymous reviewers for their valuable feedback and suggestions that helped improve this work. This work was supported by the National Key R&D Program of China [2020YFB1707903], the National Natural Science Foundation of China [61872238, 61972250], Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Tencent Marketing Solution Rhino-Bird Focused Research Program [FR202001], and the CCF-Tencent Open Fund [RAGR20200105].
References

Cen, Y., Zhang, J., Zou, X., Zhou, C., Yang, H., and Tang, J. Controllable multi-interest framework for recommendation. In KDD, pp. 2942–2951, 2020.

Chen, Y., Wu, L., and Zaki, M. J. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In NeurIPS, 2020.

Cheng, H., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. Wide & deep learning for recommender systems. In DLRS, pp. 7–10, 2016.

Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In RecSys, pp. 191–198, 2016.

Cremonesi, P., Koren, Y., and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, pp. 39–46. ACM, 2010.

Dziugaite, G. K. and Roy, D. M. Neural network matrix factorization. CoRR, abs/1511.06443, 2015.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In ICML, pp. 1263–1272, 2017.

Hamilton, W. L., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034, 2017.

Hartford, J. S., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. In ICML, pp. 1914–1923, 2018.

Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In ICDM, pp. 263–272, 2008.

Jain, P. and Dhillon, I. S. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013.

Kabbur, S., Ning, X., and Karypis, G. FISM: factored item similarity models for top-n recommender systems. In KDD, pp. 659–667. ACM, 2013.

Kang, W., Cheng, D. Z., Chen, T., Yi, X., Lin, D., Hong, L., and Chi, E. H. Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. In WWW, pp. 562–566, 2020.

Khalil, E. B., Dai, H., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. In NeurIPS, pp. 6348–6358, 2017.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

Ledent, A., Alves, R., and Kloft, M. Orthogonal inductive matrix completion. CoRR, abs/2004.01653, 2020.

Lee, H., Im, J., Jang, S., Cho, H., and Chung, S. Melu: Meta-learned user preference estimator for cold-start recommendation. In KDD, pp. 1073–1082, 2019.

Liang, D., Krishnan, R. G., Hoffman, M. D., and Jebara, T. Variational autoencoders for collaborative filtering. In WWW, pp. 689–698. ACM, 2018.

Liu, H., Lu, J., Zhao, X., Xu, S., Peng, H., Liu, Y., Zhang, Z., Li, J., Jin, J., Bao, Y., and Yan, W. Kalman filtering attention for user behavior modeling in CTR prediction. In NeurIPS, 2020.

Ma, J., Zhou, C., Cui, P., Yang, H., and Zhu, W. Learning disentangled representations for recommendation. In NeurIPS, pp. 5712–5723, 2019.

Manchanda, S., Mittal, A., Dhawan, A., Medya, S., Ranu, S., and Singh, A. K. Learning heuristics over large graphs via deep reinforcement learning. CoRR, abs/1903.03332, 2019.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In ICLR, Workshop, 2013.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2012.

Monti, F., Bronstein, M. M., and Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. In NeurIPS, pp. 3697–3707, 2017.

Qian, T., Liang, Y., and Li, Q. Solving cold start problem in recommendation with attribute graph neural networks. CoRR, abs/1912.12398, 2019.

Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. BPR: bayesian personalized ranking from implicit feedback. In UAI, pp. 452–461, 2009.

Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In NeurIPS, pp. 1257–1264, 2007.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
Sedhain, S., Menon, A. K., Sanner, S., and Xie, L. Autorec: Autoencoders meet collaborative filtering. In WWW (Companion Volume), pp. 111–112. ACM, 2015.

Srebro, N., Alon, N., and Jaakkola, T. S. Generalization error bounds for collaborative prediction with low-rank matrices. In NeurIPS, pp. 1321–1328, 2004.

van den Berg, R., Kipf, T. N., and Welling, M. Graph convolutional matrix completion. CoRR, abs/1706.02263, 2017.

Wang, X., He, X., Wang, M., Feng, F., and Chua, T. Neural graph collaborative filtering. In SIGIR, pp. 165–174, 2019.

Xu, M., Jin, R., and Zhou, Z. Speedup matrix completion with side information: Application to multi-label learning. In NeurIPS, pp. 2301–2309, 2013.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In SIGKDD, pp. 974–983. ACM, 2018.

Zhang, M. and Chen, Y. Inductive matrix completion based on graph neural networks. In ICLR, 2020.

Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. In ICML, pp. 764–773, 2016.

Zhong, K., Song, Z., Jain, P., and Dhillon, I. S. Nonlinear inductive matrix completion based on one-layer neural networks. CoRR, abs/1805.10477, 2018.
An Inductive Model-based Collaborative Filtering Approach

Appendix Comparison of Expressiveness. We provide a comparison


with feature-driven and local-graph-based inductive (ma-
A. Discussion on Representation Power and trix completion) models through two cases in Fig. 4 so as
Expressiveness to highlight the superior expressiveness of IDCF. Here we
assume ratings are within {−1, 1} (positive, denoted by
We compare IDCF with other related models from two per-
red line, and negative, denoted by black line). The solid
spectives in order to shed more lights on the advantages and
lines are observed ratings for training and dash lines are test
differences of our model.
ratings. In Fig. 4(a), we consider test ratings (u1 , i2 ) and
Comparison of Representation Power. We first provide a comparison with related works at the methodological level to clarify the essential differences between our work and others. In Fig. 5, we present an intuitive comparison with the general CF model, the item-based CF model and the local-graph-based inductive CF model. As shown in Fig. 5(a), general collaborative filtering assumes user-specific embeddings and learns them collaboratively among all the users in one dataset; such learnable user embeddings rule out inductive learning. The item-based model, shown in Fig. 5(b), leverages the embeddings of a user's historically rated items to compute the user's embedding via some pooling method. Its learnable parameters only lie in the item space, so it suffers from limited representation capacity compared to the general CF model, which assumes both user and item embeddings. Besides, the local-graph-based inductive model (e.g., (Zhang & Chen, 2020)), shown in Fig. 5(c), extracts the local subgraph within the 1-hop neighbors of each user-item pair (i.e., the rated items of the user and the users who rated the item) from a bipartite graph of all the observed user-item ratings, and uses GNNs to encode such graph structures for rating prediction. Note that this model requires the local subgraphs to contain no user or item indices, so it cannot output user-specific representations, and its representation power is limited since it cannot represent diverse user preferences with arbitrary rating histories on items. Differently, our model IDCF, shown in Fig. 5(d), adopts item-based embeddings as initial states for query users, computes attention scores over key users, and aggregates the key users' embeddings (i.e., meta latents) to estimate user-specific embeddings for query users. It thus retains the ability to produce user representations with enough representation power and meanwhile achieves inductive learning.

Comparison of Expressiveness. In Fig. 4(a), we consider test ratings (u1, i2) and (u2, i2). For local-graph-based models, the 1-hop local subgraph of (u1, i2) (resp. (u2, i2)) consists of {i1, i2, u1, u3} (resp. {i1, i2, u2, u3}), and the subgraph structures differ in the two cases due to the positive (resp. negative) edge for (u1, i1) (resp. (u2, i1)). The local-graph-based models can therefore give the right predictions for the two ratings by relying on the different structures. CF models and IDCF also work smoothly in this case, relying on the different rating histories of u1 and u2. However, feature-driven models will fail once u1 and u2 have the same features, even though the two users have different rating histories. In Fig. 4(b), we consider test ratings (u1, i3) and (u2, i3). The 1-hop local subgraph of (u1, i3) (resp. (u2, i3)) consists of {i1, i2, i3, u1, u3} (resp. {i1, i2, i3, u2, u3}), and the subgraph structures are identical. Thus, the local-graph-based models fail to distinguish the two inputs and give the same prediction for (u1, i3) and (u2, i3). Differently, CF models and IDCF can recognize that u3 has rating patterns similar to u1 and different from u2, which pushes the embedding of u1 (resp. u2) close to (resp. distant from) that of u3 and guides the model to the right prediction. Note that the first case is common when the feature space is small, while the second case is common when the rating patterns of users are not distinct enough throughout a dataset, which induces similar local subgraph structures. Therefore, IDCF enjoys superior expressiveness over feature-driven and local-graph-based inductive (matrix completion) models, and it maintains expressiveness as good as that of (transductive) CF models.

[Figure 4: two toy rating graphs, panels (a) and (b).]
Figure 4. Comparison of expressiveness. Feature-driven and local-graph-based models fail in (a) and (b), respectively. IDCF works effectively in both cases with superior expressiveness.
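To make the mechanism sketched in Fig. 5(d) concrete, the snippet below computes an inductive embedding for one query user by attending over key users' meta latents, using a mean-pooled embedding of the query user's rated items as the initial state. This is a minimal NumPy illustration of the idea rather than the exact IDCF architecture; the dimensions, the single dot-product attention head and the random inputs are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d, num_key, num_item = 8, 50, 200

item_emb = rng.normal(size=(num_item, d))   # shared item embeddings (transductively trained)
key_meta = rng.normal(size=(num_key, d))    # key users' meta latents (pretrained)

rated_items = np.array([3, 17, 42, 99])     # query user's historically rated items
init_state = item_emb[rated_items].mean(0)  # item-based initial state for the query user

# attention of the query user's initial state over key users' meta latents
scores = key_meta @ init_state / np.sqrt(d)  # one dot-product head (simplification)
alpha = softmax(scores)                      # attention weights over key users
p_query = alpha @ key_meta                   # inductive user-specific embedding

r_hat = p_query @ item_emb[7]                # predicted preference for item 7
print(p_query.shape, float(r_hat))
```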
B. Proofs in Section 3

B.1. Proof of Theorem 1

Proof. The proof is trivial by construction. Denote the optimal $P_2$ for Eq. (3) by $P_2^*$. Since $P_1$ given by Eq. (1) has full column rank, for any column vector $p_{u'}^*$ in $P_2^*$ ($u' \in U_2$) there exists $c_{u'}^*$ such that $c_{u'}^{*\top} P_1 = p_{u'}^*$. Hence, $C^* = [c_{u'}^*]_{u' \in U_2}$ is a solution for Eq. (2) and gives $D_{S_q}(\hat{R}_2, R_2) < \epsilon$. $\square$
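The construction above can also be checked numerically: when the key users' latent factors span the full d-dimensional space, any query-user factor can be written as a combination of key-user factors. The following toy check (plain NumPy, made-up sizes) solves for such a combination with least squares; it is an illustration of the argument, not part of the training procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_key, num_query = 8, 40, 5

P1 = rng.normal(size=(num_key, d))         # key-user factors; rank d with high probability
P2_star = rng.normal(size=(num_query, d))  # query-user factors to be matched

# solve P1^T c_{u'} = p*_{u'} for every query user (exact when rank(P1) = d)
C, *_ = np.linalg.lstsq(P1.T, P2_star.T, rcond=None)   # shape (num_key, num_query)

recon = C.T @ P1                           # reconstructed query-user factors
print(np.allclose(recon, P2_star))         # True: the combination reproduces P2* exactly
```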

[Figure 5: schematic comparison of model families. (a) General CF model: user one-hot id mapped to a learnable user embedding, combined with item embeddings. (b) Item-based model: user rating vector mapped to the embeddings of rated items and pooled into a user embedding. (c) Local-graph-based inductive model: a user-item pair with its user-item interaction graph of 1-hop neighbors (user and item rating vectors) encoded by a GNN. (d) Inductive collaborative filtering (ours): key user one-hot ids mapped to meta latents; the query user rating vector gives a query user initial state that feeds an attention model over key users to produce the query user embedding.]
Figure 5. Comparison with related works on methodological level. We highlight the advantage of the proposed model IDCF: it can 1) inductively compute user representations (or embeddings) and meanwhile 2) maintain enough capacity as general CF models.

B.2. Proof of Theorem 2

Proof. Fix a true rating matrix $R_2$ to be learned and a probability distribution $\mathcal{P}$ over $[M_q] \times [N]$ which is unknown to the learner; we consider the problem under the framework of standard PAC learning. We can treat the matrix $R_2$ as a function $(u', i) \mapsto r_{u'i}$. Let $\mathcal{R}$, a set of matrices in $\mathbb{R}^{M_q \times N}$, denote the hypothesis class of this problem. Then the input to the learner is a sample of $R_2$ denoted as
$$T = \{(u'_t, i_t, r_{u'_t i_t}) \mid (u'_t, i_t) \in S_q\},$$
where $S_q = \{(u'_t, i_t)\} \in ([M_q] \times [N])^{T_2}$ is a set with size $T_2$ containing indices of the observed entries in $R_2$, and each $(u', i)$ in $S_q$ is independently chosen according to the distribution $\mathcal{P}$. When using $T$ as training examples, the learner minimizes the error $D_{S_q}(\hat{R}_2, R_2) = \frac{1}{T_2} \sum_{(u', i) \in S_q} l(r_{u'i}, \hat{r}_{u'i})$. We are interested in the generalization error of the learner, which is defined as
$$D(\hat{R}_2, R_2) = \mathbb{E}_{(u', i) \sim \mathcal{P}}\left[l(r_{u'i}, \hat{r}_{u'i})\right].$$
The (empirical) Rademacher complexity of $\mathcal{R}$ w.r.t. the sample $T$ is defined as
$$\mathrm{Rad}_T(\mathcal{R}) = \mathbb{E}_\sigma\left[\sup_{\hat{R}_2 \in \mathcal{R}} \frac{1}{T_2} \sum_{t=1}^{T_2} \sigma_t \hat{r}_{u'_t i_t}\right],$$
where $\sigma_t \in \{-1, 1\}$ is a random variable with probability $\Pr(\sigma_t = 1) = \Pr(\sigma_t = -1) = \frac{1}{2}$. Assume $l(\cdot, \cdot)$ is $L$-Lipschitz w.r.t. the first argument and $|l(\cdot, \cdot)|$ is bounded by a constant. Then a general result for the generalization bound of $\mathcal{R}$ is:

Lemma 1 (Generalization bound (Mohri et al., 2012)). For a sample $T$ with a random choice of $S_q \in ([M_q] \times [N])^{T_2}$, it holds that for any $\hat{R}_2 \in \mathcal{R}$ and confidence parameter $0 < \delta < 1$,
$$\Pr\left(D(\hat{R}_2, R_2) \le D_{S_q}(\hat{R}_2, R_2) + G\right) \ge 1 - \delta, \quad (14)$$
where
$$G = 2L \cdot \mathrm{Rad}_T(\mathcal{R}) + O\left(\sqrt{\frac{\ln(1/\delta)}{T_2}}\right).$$
Based on the lemma, we need to further estimate the Rademacher complexity in our model to complete the proof. In our model, $\hat{R}_2 = C^\top P_k Q$ and the entry $\hat{r}_{u'i}$ is given by $\hat{r}_{u'i} = p_{u'}^\top q_i = c_{u'}^\top P_k q_i$ (where $c_{u'}$ is the $u'$-th column vector of $C$). Define $\mathcal{C}$ as a set of matrices,
$$\mathcal{C} = \left\{A \in [0, 1]^{M_q \times M_k} : \|a_{u'}\|_1 = \sum_{u=1}^{M_k} |a_{u'u}| = 1\right\}.$$
Then we have
$$T_2 \cdot \mathrm{Rad}_T(\mathcal{R}) = \mathbb{E}_\sigma\left[\sup_{C \in \mathcal{C}} \sum_{t=1}^{T_2} \sigma_t c_{u'_t}^\top P_k q_{i_t}\right] = \mathbb{E}_\sigma\left[\sup_{C \in \mathcal{C}} \sum_{u'=1}^{M_q} c_{u'}^\top \left(\sum_{t: u'_t = u'} \sigma_t \hat{R}_{k, *i_t}\right)\right] \quad (\text{since } \hat{R}_{k, *i_t} = P_k q_{i_t})$$
$$\le H \cdot \mathbb{E}_\sigma\left[\sup_{A \in \mathcal{C}} \sum_{u'=1}^{M_q} a_{u'}^\top \left(\sum_{t: u'_t = u'} \sigma_t \hat{R}_{k, *i_t}\right)\right] = H \cdot \mathbb{E}_\sigma\left[\sum_{u'=1}^{M_q} \max_{u \in [M_k]} \left(\sum_{t: u'_t = u'} \sigma_t \hat{r}_{u i_t}\right)\right]. \quad (15)$$
The last equation is due to the fact that $a_{u'}$ is a probability distribution for choosing entries in $\hat{R}_{k, *i_t}$, the $i_t$-th column of matrix $\hat{R}_k$.

In fact, we can treat the $\max_{u \in [M_k]}$ inside the sum over all $u' \in U_2$ as a mapping $\kappa$ from $u' \in [M_q]$ to $u \in [M_k]$. Let $\mathcal{K} = \{\kappa : [M_q] \to [M_k]\}$ be the set of all mappings from $[M_q]$ to $[M_k]$; then the above formula can be written as
$$\mathbb{E}_\sigma\left[\sum_{u'=1}^{M_q} \max_{u \in [M_k]} \sum_{t: u'_t = u'} \sigma_t \hat{r}_{u i_t}\right] \quad (16)$$
$$= \mathbb{E}_\sigma\left[\sup_{\kappa \in \mathcal{K}} \sum_{u'=1}^{M_q} \sum_{t: u'_t = u'} \sigma_t \hat{r}_{\kappa(u'), i_t}\right] \quad (17)$$
$$= \mathbb{E}_\sigma\left[\sup_{\kappa \in \mathcal{K}} \sum_{t=1}^{T_2} \sigma_t \hat{r}_{\kappa(u'_t), i_t}\right] \quad (18)$$
$$\le B\sqrt{T_2} \cdot \sqrt{2 M_q \log M_k}. \quad (19)$$
The last inequality is according to the Massart lemma. Hence, we have
$$\mathrm{Rad}_T(\mathcal{R}) \le H B \sqrt{\frac{2 M_q \log M_k}{T_2}}. \quad (20)$$
Incorporating Eq. (20) into Eq. (14), we arrive at the result in this theorem. $\square$
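The quantity bounded in Eqs. (16)-(19) can be estimated numerically for intuition: the supremum over row-stochastic attention weights is attained at a single key user per query user, so a Monte Carlo estimate only needs a per-user maximum. The toy sketch below (plain NumPy, with arbitrary sizes, a bound B on the predicted ratings and 200 Rademacher draws as assumptions) is an illustrative check of the inequality, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(2)
M_q, M_k, N, T2, B = 20, 15, 100, 400, 1.0

R_hat_k = rng.uniform(-B, B, size=(M_k, N))  # key users' predicted ratings, |r| <= B
users = rng.integers(0, M_q, size=T2)        # sampled query-user indices u'_t
items = rng.integers(0, N, size=T2)          # sampled item indices i_t

def sup_term(sigma):
    # sum over query users of max over key users of sum_t sigma_t * r_hat[u, i_t]  (Eq. 16)
    total = 0.0
    for uq in range(M_q):
        mask = users == uq
        if mask.any():
            total += (sigma[mask] * R_hat_k[:, items[mask]]).sum(axis=1).max()
    return total

draws = [sup_term(rng.choice([-1.0, 1.0], size=T2)) for _ in range(200)]
estimate = float(np.mean(draws))
massart_bound = B * np.sqrt(T2) * np.sqrt(2 * M_q * np.log(M_k))   # right-hand side of Eq. (19)
print(estimate <= massart_bound, round(estimate, 2), round(massart_bound, 2))
```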
C. Extensions of IDCF

IDCF can be extended to a feature-based setting and can deal with extreme cold-start recommendation where test users have no historical ratings. Here, we provide details of the feature-based IDCF (IDCF-HY), which is in fact a hybrid model that considers both user features and one-hot user indices. Furthermore, we discuss transfer learning and meta-learning, which can be leveraged to enhance our framework, as future study.

C.1. Hybrid Model with Features (IDCF-HY)

Assume $a_u$ denotes user $u$'s raw feature vector, i.e., a concatenation of all the features (often including binary, categorical and continuous variables), where categorical features can be denoted by one-hot or multi-hot vectors. If one has $m$ user features in total, then $a_u$ can be written as
$$a_u = [a_{u1} \| a_{u2} \| a_{u3} \| \cdots \| a_{um}].$$
Then we consider user-sharing embedding functions $y_j(\cdot)$ which embed each feature into a $d$-dimensional embedding vector:
$$y_u = [y_1(a_{u1}) \| y_2(a_{u2}) \| y_3(a_{u3}) \| \cdots \| y_m(a_{um})].$$
Similarly, for item features $b_i = [b_{i1} \| b_{i2} \| b_{i3} \| \cdots \| b_{in}]$, we have the embedding representation
$$z_i = [z_1(b_{i1}) \| z_2(b_{i2}) \| z_3(b_{i3}) \| \cdots \| z_n(b_{in})].$$
Also, we assume a user-specific index embedding $p_u$ and an item-specific index embedding $q_i$ for user $u$ and item $i$, respectively, as in Section 3. The prediction for user $u$'s rating on item $i$ can be
$$\hat{r}_{ui} = g_\theta(p_u, y_u, q_i, z_i), \quad (21)$$
where $g_\theta$ can be a shallow neural network with parameters denoted by $\theta$. To keep notation clean, we denote $Y = \{y_1, y_2, \cdots, y_m\}$ and $Z = \{z_1, z_2, \cdots, z_n\}$. Then for key users in $U_k$ with rating matrix $R_k$, we consider the optimization problem
$$\min_{P_k, Q, Y, Z, \theta} D_{S_k}(\hat{R}_k, R_k), \quad (22)$$
based on which we obtain the learned feature embedding functions $Y, Z$ as well as the transductive embedding matrices $P_k, Q$, which we further use to compute inductive embeddings for query users.

For query users, feature embeddings can be obtained from the learned $Y$ and $Z$ in Eq. (22), i.e., $y_{u'} = [y_1(a_{u'1}) \| \cdots \| y_m(a_{u'm})]$, where $a_{u'}$ is the raw feature vector of user $u'$. Then we have a relation learning model $h_w$ that consists of a multi-head attention function and uses the user feature as input, $d_{u'} = y_{u'}$. The inductive user-specific representation can be given by $p_{u'} = h_w(d_{u'})$ (i.e., Eq. (5) and Eq. (6)), similar to the CF setting in Section 3. The rating of user $u'$ on item $i$ can be predicted by $\hat{r}_{u'i} = g_\theta(p_{u'}, y_{u'}, q_i, z_i)$. Also, the optimization for the inductive relation model is
$$\min_{w, \theta} D_{S_q}(\hat{R}_q, R_q). \quad (23)$$

C.2. Extreme Cold-Start Recommendation

For cold-start recommendation where test users have no historical ratings, we have no information about the users unless side information is available. In such a case, most CF models fail at personalized recommendation and degrade to a trivial model which outputs the same result (or the same distribution) for all users based on the popularity of items. For IDCF, the set $I_{u'}$ in Eq. (4) would be empty for users with no historical ratings; in this situation, we can randomly select a group of key users to construct $I_{u'}$ for computing attentive scores with key users. Another option is to directly use the average embedding of all key users as the estimated embedding for query users, in which case the model degrades to ItemPop (using the number of users who rated an item for prediction).

On the other hand, if side information (such as user profile features) is available, our hybrid model IDCF-HY can leverage user features for computing inductive embeddings, which enables extreme cold-start recommendation. We apply this method to cold-start recommendation on Movielens-1M using features, and the results are given in Appendix E.

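As a concrete illustration of the hybrid scoring in Eq. (21), the sketch below embeds categorical user and item fields, concatenates them with index embeddings, and feeds the result through a small MLP standing in for $g_\theta$. All sizes, the two-layer MLP and the random inputs are assumptions for illustration rather than the paper's exact configuration; the cold-start fallback mentioned in C.2 is noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 8, 3, 2                      # embedding size, #user fields, #item fields
n_users, n_items, n_cat = 100, 50, 10  # toy cardinalities (one vocabulary per field)

# one embedding table per user/item feature field (the functions y_j and z_j)
Y = [rng.normal(size=(n_cat, d)) for _ in range(m)]
Z = [rng.normal(size=(n_cat, d)) for _ in range(n)]
P = rng.normal(size=(n_users, d))      # user index embeddings p_u
Q = rng.normal(size=(n_items, d))      # item index embeddings q_i

# g_theta: a small MLP on the concatenated representation (a stand-in architecture)
in_dim = (m + n + 2) * d
W1, b1 = rng.normal(size=(in_dim, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

def g_theta(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer
    return (h @ W2 + b2).item()

def predict(u, i, a_u, b_i):
    y_u = np.concatenate([Y[j][a_u[j]] for j in range(m)])  # user feature embedding y_u
    z_i = np.concatenate([Z[j][b_i[j]] for j in range(n)])  # item feature embedding z_i
    p_u = P[u]  # for an extreme cold-start user this could fall back to an average of key users' latents
    q_i = Q[i]
    return g_theta(np.concatenate([p_u, y_u, q_i, z_i]))

print(predict(u=4, i=7, a_u=[1, 5, 2], b_i=[3, 8]))
```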
C.3. Transfer Learning & Meta-Learning

Another extension of IDCF is to consider transfer learning on cross-domain recommendation tasks, or to treat recommendation for different users as different tasks as in (Lee et al., 2019). Transfer learning and meta-learning have shown power in learning generalizable models that can adapt to new tasks. In our framework, we can also take advantage of transfer learning (few-shot or zero-shot learning) or meta-learning algorithms to train our relation learning model $h_w$. For example, if a model-agnostic meta-learning algorithm is used for the second-stage optimization, we can first compute a one-step (or multi-step) gradient update independently for each user (or each group of clustered users) in a batch and then average these updates as one global update for the model. The meta-learning can be applied over different groups of users or over cross-domain datasets.
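The averaged per-user update described above can be sketched as follows on a toy squared-loss model. A single linear predictor and one inner gradient step stand in for the relation model $h_w$ and for the model-agnostic meta-learning update, so this is an illustrative outline under those assumptions rather than the training code used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_users, inner_lr, outer_lr = 8, 6, 0.1, 0.05
w = rng.normal(size=dim)                      # shared parameters of the (toy) model

# per-user toy tasks: features X_u and ratings y_u
tasks = [(rng.normal(size=(20, dim)), rng.normal(size=20)) for _ in range(n_users)]

def grad(w, X, y):
    return 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error

for step in range(50):
    updates = []
    for X, y in tasks:                        # one inner gradient step per user
        w_u = w - inner_lr * grad(w, X, y)
        updates.append(w_u - w)               # the per-user update direction
    w = w + outer_lr * np.mean(updates, axis=0)   # average them as one global update

print(round(np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks]), 4))
```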
D. Details in Implementations

We provide implementation details that are not presented in Section 5 for reproducibility.

D.1. Hyper-parameter Settings

We present details of the hyper-parameter settings on the different datasets. We use L = 4 attention heads for our inductive relation learning model on all the datasets. For Douban and ML-100K, each attention head randomly samples 200 key users for computing attention weights. For ML-1M, we set the sample size as 500; for Amazon-Books and Amazon-Beauty, we set it as 2000. We use the Adam optimizer, and learning rates are searched within [0.1, 0.01, 0.001, 0.0001]. For pretraining, we consider L2 regularization for user and item embeddings; the regularization weights are searched within [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2]. The mini-batch sizes are searched within [64, 256, 512, 1024, 2048] to keep a proper balance between training efficiency and performance. For the adaptation stage, the regularization weight $\lambda$ is set as 10 for the five datasets. Besides, different architecture hyper-parameters are used in the three implementations.

IDCF-NN. For Douban and ML-100K, we use embedding dimension d = 16 and neural size 48-32-32-1 for $f_\theta$. For ML-1M, ML-10M and Amazon-Books, we use d = 32 and neural size 96-64-64-1 for $f_\theta$.

IDCF-GC. For Douban and ML-100K, we use embedding dimension d = 32 and neural size 128-32-32-1 for $f_\theta$. For ML-1M, ML-10M and Amazon-Books, we use d = 64 and neural size 256-64-64-1 for $f_\theta$.

IDCF-HY. We use embedding size d = 32 for each feature in ML-1M as well as for the user-specific and item-specific index embeddings. The neural size of $g_\theta$ is set as 320-64-64-1.

D.2. Evaluation Metrics

We provide details of our adopted evaluation metrics. In our experiments, we follow evaluation protocols commonly used in previous works in the different settings. The three metrics used in our paper are as follows.

• RMSE: Root Mean Square Error is a commonly used metric for explicit feedback data and measures the averaged L2 distance between predicted ratings and ground-truth ratings:
$$RMSE = \sqrt{\frac{\sum_{(u, i) \in I^+} (\hat{r}_{ui} - r_{ui})^2}{|I^+|}}. \quad (24)$$

• AUC: Area Under the ROC Curve is a metric for implicit feedback data. It measures the general consistency between a ranking list of predicted scores and the ground-truth ranking with 1's before 0's. More specifically, AUC counts the average area under the true-positive v.s. false-positive curve:
$$AUC = \frac{\sum_{(u, i) \in I^+} \sum_{(u', i') \in I^-} \delta(\hat{r}_{u, i} > \hat{r}_{u', i'})}{|I^+| \, |I^-|}, \quad (25)$$
where $I^+ = \{(u, i) \mid r_{ui} > 0\}$ and $I^- = \{(u', i') \mid r_{u'i'} = 0\}$ denote the sets of observed user-item interaction pairs and unobserved user-item pairs, respectively. The indicator $\delta(\hat{r}_{u, i} > \hat{r}_{u', i'})$ returns 1 when $\hat{r}_{u, i} > \hat{r}_{u', i'}$ and 0 otherwise. Since we only have ground-truth positive examples (clicked items) for users, we negatively sample five items as negative examples (non-clicked items) for each user-item rating in the dataset, which composes the set $I^-$.

• NDCG: Normalized Discounted Cumulative Gain measures the usefulness, or gain, of a recommended item based on its position in the result list. NDCG can be used to evaluate the model on both explicit and implicit feedback data. The gain is accumulated from the top of the recommendation list to the bottom, with the gain of each result discounted at lower ranks. The NDCG metric is computed for each user and measures the averaged performance of personalized recommendation. Given the ranking list of recommended items for a user $u$, denoted as $\hat{K}_u$, its DCG is defined as
$$DCG@K = \sum_{i \in \hat{K}_u} \frac{rel_i}{\log_2(i + 1)}, \quad (26)$$
where $rel_i$ is the graded relevance of item $i$ for user $u$. For explicit feedback, $rel_i$ is the ground-truth rating of user $u$ on item $i$. For implicit feedback, $rel_i = 1$ for an observed user-item interaction and $rel_i = 0$ otherwise. The normalized discounted cumulative gain, or NDCG, is computed as
$$NDCG@K = \frac{DCG@K}{IDCG@K}, \quad (27)$$
where $IDCG@K$ is the ideal discounted cumulative gain, $IDCG@K = \sum_{i \in K_u} \frac{rel_i}{\log_2(i + 1)}$, and $K_u$ represents the ground-truth ranking list of relevant items (ordered by their ground-truth ratings/interactions of user $u$).
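For reference, the three metrics can be computed as in the short NumPy sketch below. The sampled negative set, the rank-based discounting and the cutoff K follow the descriptions above, but the array shapes and the toy inputs are illustrative assumptions.

```python
import numpy as np

def rmse(preds, targets):
    return float(np.sqrt(np.mean((preds - targets) ** 2)))

def auc(pos_scores, neg_scores):
    # fraction of (positive, negative) pairs that are ranked consistently, as in Eq. (25)
    comparisons = pos_scores[:, None] > neg_scores[None, :]
    return float(comparisons.mean())

def ndcg_at_k(pred_scores, relevances, k):
    # DCG over the top-k predicted items divided by the ideal DCG, Eqs. (26)-(27)
    order = np.argsort(-pred_scores)[:k]
    gains = relevances[order]
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float((gains * discounts).sum())
    ideal = np.sort(relevances)[::-1][:len(order)]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

# toy usage
rng = np.random.default_rng(5)
print(rmse(rng.normal(3.5, 1.0, 100), rng.integers(1, 6, 100)))
print(auc(rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 250)))   # 5 sampled negatives per positive
print(ndcg_at_k(rng.normal(size=20), rng.integers(0, 2, 20).astype(float), k=10))
```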

E. More Experiment Results

E.1. Impacts of different splits for key and query users

In our experiments in Section 5, we basically consider users with more than δ training ratings as U1 and the remaining users as U2, based on which we construct key users and query users to study the model's performance on few-shot query users (for inductive interpolation) and new unseen users (for inductive extrapolation). Here we provide a further discussion of two splitting strategies and study their impact on model performance.

• Threshold: we select users with more than δ training ratings as U1 and users with fewer than δ training ratings as U2.

• Random: we set a ratio γ ∈ (0, 1) and randomly sample γ × 100% of the users in the dataset as U1. The remaining users are grouped as U2.

We consider δ = [20, 30, 40, 50, 60, 70] and γ = [0.97, 0.85, 0.75, 0.68, 0.62, 0.57] (which gives exactly the same ratio of |U1| to |U2| as the corresponding δ in the threshold split; e.g., γ = 0.97 in the random split yields the same sizes of U1 and U2 as δ = 20 in the threshold split) on the Movielens-1M dataset. For each split, we consider two settings for key and query users: 1) inductive learning for interpolation on few-shot query users, i.e., the first-stage training is on the training ratings of key users Uk = U1 and the second-stage training is on the training ratings of query users Uq = U2; 2) inductive learning for extrapolation on zero-shot new users, i.e., both training stages use the training ratings of the same group of users Uk = Uq = U1. We test the model on the test ratings of users in U2. The results of IDCF-NN are presented in Table 5, where we report test RMSEs on all the users, few-shot query users and zero-shot new users.

As we can see from Table 5, with the threshold split, as δ increases (we have fewer key users and more query users, and both have more training ratings on average), the test RMSEs for query users decrease. The reason is two-fold: 1) since key users have more training ratings, the transductive model can learn better representations; 2) since query users have more training ratings, the inductive model has better generalization ability. On the other hand, with different splitting thresholds, the test RMSEs for new users remain at a similar level. These results demonstrate that the performance of IDCF on new unseen users is not sensitive to the splitting threshold. However, with the random split, when γ decreases (we again have fewer key users and more query users, but their average numbers of training ratings stay unchanged), the RMSEs for new users suffer an obvious increase. One possible reason is that with a smaller ratio of key users under the random split, the 'informative' key users in the dataset are more likely to be left out. (Recall that, as shown in Fig. 3(c), there exist some important key users that receive high attention weights from query users.) If such key users are missing, the performance is affected due to insufficient expressive power of the inductive representation model.

Comparing the threshold split with the random split, we find that when using the same ratio of key users to query users (i.e., the same column in Table 5), the RMSEs on new users with the threshold split are always better than those with the random split. This observation again shows that key users with more historical ratings are more informative for inductive representation learning on query users, and it echoes the results in Fig. 3(d), which show that important key users who receive large attention weights from query users tend to be users with sufficient historical ratings.

E.2. Ablation Studies

In Table 6 we present the results of the ablation study on the ML-1M and Amazon-Books datasets. We compare IDCF-GC with 1) RD-Item (using randomized item embeddings), 2) Trans-User (directly optimizing Pq in Eqn. (2)) and 3) Meta-Path (using the meta-path user-item-user in the observed user-item bipartite graph to determine users' neighbors for message passing). The results show that RD-Item performs much worse than IDCF-GC, since the randomized item embeddings may provide wrong signals for both graph learning and the final prediction. IDCF-GC also significantly outperforms Trans-User by a large margin; the reason is that directly optimizing Pq leads to serious over-fitting since query users have few training data. Furthermore, we can see that Meta-Path provides inferior performance compared to IDCF-GC in Table 1. The reason is that the meta-path can only identify limited relations from the observed bipartite graph, which often has missing or noisy links, while IDCF learns and explores useful semantic relations for sufficient message passing.
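The two splitting schemes compared in E.1 and Table 5 can be expressed compactly as below. The per-user rating counts, the chosen δ and the dataset size are placeholder assumptions; the γ used for the random split is matched to the threshold split's group sizes, as in the example above.

```python
import numpy as np

rng = np.random.default_rng(6)
num_ratings = rng.poisson(45, size=6040)      # toy per-user training-rating counts (ML-1M-sized)

def threshold_split(counts, delta):
    u1 = np.where(counts > delta)[0]          # key-user candidates: more than delta ratings
    u2 = np.where(counts <= delta)[0]         # few-shot query / new users
    return u1, u2

def random_split(counts, gamma):
    users = rng.permutation(len(counts))
    cut = int(gamma * len(counts))
    return users[:cut], users[cut:]           # gamma fraction as U1, the rest as U2

u1_t, u2_t = threshold_split(num_ratings, delta=20)
gamma = len(u1_t) / len(num_ratings)          # match the |U1| : |U2| ratio of the threshold split
u1_r, u2_r = random_split(num_ratings, gamma)
print(len(u1_t), len(u2_t), len(u1_r), len(u2_r))
```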

Table 5. Test RMSEs on all the users (All), few-shot query users (FS) and new users (New) of IDCF-NN in Movielens-1M using different splits for key and query users. (Lower RMSE is better.)

Threshold split
  δ            |   20      30      40      50      60      70
  All (RMSE)   | 0.8440  0.8437  0.8439  0.8440  0.8444  0.8451
  FS  (RMSE)   | 0.9785  0.9525  0.9213  0.9166  0.9202  0.9160
  New (RMSE)   | 0.9945  0.9942  0.9902  0.9883  0.9911  0.9929

Random split
  γ            |  0.97    0.85    0.75    0.68    0.62    0.57
  All (RMSE)   | 0.8446  0.8536  0.8587  0.8637  0.8669  0.8689
  FS  (RMSE)   | 0.8863  0.8848  0.8760  0.8805  0.8824  0.8855
  New (RMSE)   | 0.9901  0.9923  0.9956  0.1001  1.0198  1.0262

Table 6. Ablation studies on the ML-1M and Amazon-Books datasets. (Lower RMSE and higher AUC/NDCG are better.)

              |          ML-1M            |        Amazon-Books
              |    Query    |     New     |    Query    |     New
  Method      | RMSE  NDCG  | RMSE  NDCG  | AUC   NDCG  | AUC   NDCG
  IDCF-GC     | 0.944 0.940 | 0.957 0.942 | 0.938 0.946 | 0.921 0.930
  RD-Item     | 1.014 0.843 | 1.023 0.835 | 0.665 0.701 | 0.832 0.821
  Trans-User  | 0.992 0.876 |   -     -   | 0.845 0.821 |   -     -
  Meta-Path   | 0.959 0.912 | 0.981 0.892 | 0.910 0.916 | 0.882 0.901

[Figure 6: (a) Test RMSEs; (b) Training time per epoch.]
Figure 6. Performance comparison for extreme cold-start recommendation on ML-1M with user profile features.
E.3. Cold-Start with User Features

We also investigate whether our inductive model can handle extreme cold-start users with no historical ratings. (In some literature, cold-start users also refer to users with few historical ratings for training and/or inference; here we consider extreme cold-start recommendation for users with no historical rating for either training or inference.) Note that cold-start users are different from, and more challenging than, new (unseen) users: for new users, the model can still use historical ratings as input features during inference, even though it cannot be trained on these ratings. To enable cold-start recommendation, we leverage user attribute features in Movielens-1M. We use the dataset provided by (Lee et al., 2019), which contains attribute features and a split into warm-start and cold-start users. For IDCF, we adopt the training algorithm of inductive learning for extrapolation and treat the warm-start users as both key and query users. We use the warm-start users' training ratings for model training and the cold-start users' test ratings for testing. We compare with the Wide&Deep network (Cheng et al., 2016), GCMC (using feature vectors) and two recently proposed methods for cold-start recommendation: the graph-based model AGNN (Qian et al., 2019) and the meta-learning model MeLU (Lee et al., 2019).

Fig. 6(a) gives the test RMSEs of all the models. It shows that our IDCF-HY outperforms the competitors, achieving a 2.6% RMSE improvement over the best baseline, MeLU, even on this difficult zero-shot recommendation task. The result indicates that IDCF is a promising approach for handling new users with no historical behavior in real-world dynamic systems. We also compare the training time per epoch of each method in Fig. 6(b); IDCF-HY is much faster than MeLU and as efficient as AGNN and GCMC.
