[Figure 1: Interaction sequence S_u of a user u, v_{u,1}, ..., v_{u,t}, segmented into windows over the sequence.]
able to achieve state-of-the-art performance by just a few fine-tuning steps [6, 30] or even without any training [5, 35] (i.e., zero-shot transfer), there are essential differences. The representations learned by the pre-trained language model seem universal since the training domain and the application domain (e.g., text prediction and generation) share the same language and vocabulary, supporting the effective reuse of the word representations. However, in cross-domain recommendation, the items are distinct across domains in recommendation datasets (e.g., grocery items vs. movies). Therefore, forming such a generalizable correspondence is nearly impossible if we learn representations for each item within each domain. Recent work explores pre-trained models for sequential recommendation [7, 12, 19] within the same application (e.g., online retail). However, they assume access to metadata of items (e.g., item descriptions), which is domain-dependent and often not generalizable to other domains. These models cannot learn universal representations of items; instead, they bypass the representation learning problem by using additional item-side information.

Our Insight: There exist item popularity shifts in the user's sequence, as indicated in Figure 1. The item popularity shifts can be explained as temporal shifts in the user's preferences. For example, a user might be interested in buying some common office goods such as pens, papers, and notebooks, but afterward, they might look for other less common office goods such as a whiteboard or a desk. Previous works try to learn users' preferences from the past sequence but ignore the crucial aspect of item popularity dynamics, which could indicate the user's changing preferences. We know that the marginal distributions of user and item activities are heavy-tailed across datasets, supported by prior work in network science [3, 4] and by experiments in recommender systems [40]. In addition, recent work in recommender systems suggests that the popularity dynamics of items are also crucial for predicting users' behaviors [21].

Present Work: In this paper, we develop universal, transferable item representations for the zero-shot, cross-domain setting based on the popularity dynamics of items. We explicitly model the popularity dynamics of items and propose a novel pre-trained sequential recommendation framework: PrepRec. We learn universal item representations based on their popularity dynamics instead of their explicit item IDs or auxiliary information. We encode the relative time interval between two consecutive interactions via a relative-time encoding and the position of each interaction in the sequence via a positional encoding. Using physical time ensures that the predictions are not anti-causal, i.e., do not use future interactions to predict the present. We propose a popularity dynamics-aware transformer architecture for learning universal sequence representations. We show that it is possible to build a pre-trained sequential recommender system capable of cross-domain and cross-application transfer without any auxiliary information. Our key contributions are as follows:

Universal item and sequence representations: We are the first to learn universal item and sequence representations for sequential recommendation without any auxiliary information by exploiting item popularity dynamics. In contrast, prior research learns item representations for each item ID or through item auxiliary information. We learn universal item representations by modeling item popularity dynamics at two temporal resolutions: coarse and fine-grained. We learn universal sequence representations using a carefully designed popularity dynamics-aware transformer architecture. These universal item and sequence representations make possible pre-trained sequential recommender systems capable of cross-domain and cross-application transfer without any auxiliary information.

Zero-shot transfer without auxiliary information: We propose a new, challenging setting for pre-trained sequential recommender systems: zero-shot cross-domain and cross-application transfer without any auxiliary information. In contrast, previous pre-trained sequential recommenders require overlapping users [61] or application-dependent auxiliary information [7, 12, 18, 19, 56], and are few-shot adapted to related domains within the same application [18, 19, 56]. Our work establishes a performance baseline for cross-domain sequential recommenders that use compatible (i.e., same language/modality) auxiliary information across domains, as such metadata can only improve the performance of cross-domain transfer.

With extensive experiments, we empirically show that PrepRec has excellent generalizability across domains and applications. Remarkably, had we trained a state-of-the-art model from scratch for the target domain, instead of zero-shot transfer using PrepRec, the maximum performance gain over PrepRec would have been only 4%. In addition, we show that PrepRec is complementary to state-of-the-art sequential recommenders, and with a post-hoc interpolation, PrepRec can outperform the state-of-the-art sequential recommender system on average by 11.8% in Recall@10 and 22% in NDCG@10. We attribute the improvements to the performance gains over long-tail items, which we show in the qualitative analysis. With this work, we set a baseline for pre-trained sequential recommenders and show that popularity dynamics not only enable us to build a pre-trained sequential recommender system capable of zero-shot transfer but also significantly boost the performance of sequential recommendation.

2 RELATED WORK
Sequential Recommendation: Sequential recommenders model user behavior as a sequence of interactions and aim to predict the next item that a user will interact with. Early sequential recommenders adopt Markov chains [39, 42] and basic neural network architectures [16, 17, 47, 50, 51]. With the success of Transformers [52] in modeling sequential data, [22, 27, 45] adopt the transformer architecture for sequential recommendation. Additionally, [27] considers the timestamps of each interaction and proposes a time-aware attention mechanism. [32, 46, 60] separate interaction sequences and categorize them to show the long-term and short-term interests of users. Temporal sequential recommenders [25, 58, 62] model the change in users' preferences. These works, while achieving state-of-the-art performance, only focus on the regular sequential recommendation task and cannot transfer to other domains.

Cross-domain Recommendation: Cross-domain recommendation literature leverages the information-rich domain to improve the recommendation performance on the data-sparse domain [20, 28, 33]. However, most of these works assume user or item overlap [20, 28, 33, 63, 64] for effective knowledge transfer. Other cross-domain literature focuses on the cold-start problem [8, 10, 11, 26, 29, 31, 53, 57, 64]. In addition, multi-domain recommenders [1, 43]
leverage multi-domain data to gain insights into user preferences and item characteristics.

Pre-trained Sequential Recommenders: Recently, pre-trained recommenders have caught the attention of the community. ZESRec [7] is capable of zero-shot sequential recommendation; however, it only works for closely related domains and requires item metadata. PeterRec [61] requires overlapping users in both domains. On the other hand, finetuning-based models, e.g., MISSRec [56], UnisRec [19], and VQ-Rec [18], are not designed for zero-shot sequential recommendation, work only within the same application (e-commerce), and rely on application-dependent auxiliary information. [54] investigates the joint and marginal activity distributions of users and items, but is not suitable for the sequential recommendation task.

To summarize, prior works on sequential recommendation focus on learning high-quality representations for each item in the training set and are not generalizable across domains. Pre-trained sequential recommenders are evaluated on closely related domains and platforms and rely heavily on application-dependent auxiliary information of items.

3 PROBLEM DEFINITION
In this section, we formally define the research problems this paper addresses (i.e., regular sequential recommendation and zero-shot sequential recommendation) and introduce our notation.

In sequential recommendation, denote M as the implicit feedback matrix, U = {u_1, u_2, ..., u_{|U|}} as the set of users, and V = {v_1, v_2, ..., v_{|V|}} as the set of items. The goal of sequential recommendation is to learn a scoring function that predicts the next item v_{u,t} given a user u's interaction history S_u = {v_{u,1}, v_{u,2}, ..., v_{u,t-1}}. Note that in this paper, since we model time explicitly, we assume access to the timestamp of each interaction, including the next item interaction. We argue that this is a reasonable assumption since the timestamp of the next interaction is always available in practice. For example, if Alice logs in to Netflix, Netflix will always know when Alice logs in and can predict the next movie for Alice. Formally, we define the scoring function as F(v^t | S_u, M), where t is the time of the prediction.

Zero-shot Sequential Recommendation: Given two domains M and M' over U, V and U', V' respectively, we study the zero-shot recommendation problem in the scenario where the domains are different (M ∩ M' = ∅), users are disjoint (U ∩ U' = ∅), and item sets are unique (V ∩ V' = ∅). The goal is to produce a scoring function F' without training on M' directly. In other words, the scoring function F' has to be trained on a different interaction matrix M. Furthermore, we assume there is no metadata associated with users or items, which makes the problem particularly challenging but crucial to study. We want to set a baseline for pre-trained sequential recommenders; using metadata can only improve and simplify the problem.

4 PREPREC FRAMEWORK
We first introduce the model architecture of PrepRec (§ 4.1), then the training procedure (§ 4.2). Finally, we formally define the zero-shot inference process (§ 4.3).

4.1 Model Architecture
The first step of building a pre-trained sequential recommender is to learn universal item representations. Our solution is to exploit item popularity statistics to learn universal item representations. We learn to represent items at a given timestamp through the changes in their popularity histories over different periods, i.e., popularity dynamics. We propose a popularity dynamics-aware Transformer architecture that obtains the representation of users' behavior sequences through item popularity dynamics.

4.1.1 Item Popularity Encoder. We learn to represent items based on their popularity dynamics, i.e., changes in their popularity histories. Intuitively, popularity can be calculated over two horizons: long-term and short-term. Long-term horizons reflect the overall popularity of items, whereas short-term horizons should capture the recent trends in the domain. For example, the long-term popularity of a winter coat measures how popular the coat is in general, while its short-term popularity reflects more temporal changes, e.g., season, weather conditions, and fashion trends. Therefore, consider an item v_j that has an interaction at time t, denoted as v_j^t; we define two popularity representations for v_j^t: popularity p_j^t ∈ R^k over a coarse period (e.g., month) and popularity h_j^t ∈ R^k over a fine period (e.g., week).

To calculate p_j^t and h_j^t, we first calculate the popularities of v_j^t over the two horizons, denoted as a_j^t ∈ R^+ (coarse-period number of interactions) and b_j^t ∈ R^+ (fine-period number of interactions). Specifically, we calculate them as:

    a_j^t = \sum_{m=1}^{t} \gamma^{t-m} c_a(v_j^m), \qquad b_j^t = c_b(v_j^t)    (1)

where γ ∈ R^+ is a pre-defined discount factor and c_a(v_j^m) is the number of interactions of v_j over a coarse time period m. Similarly, c_b(v_j^t) denotes the number of interactions of v_j over a fine period t. We do not impose the discounting factor when computing b_j^t since we want it to capture the current popularity information, whereas a_j^t captures the cumulative popularity of an item over a longer horizon.

To make item popularity comparable across domains, we calculate the percentiles of a_j^t and b_j^t relative to the corresponding coarse and fine popularity distributions over all items at time t, denoted as P(a_j^t) ∈ R^+ and P(b_j^t) ∈ R^+, respectively.

We now encode the popularity percentiles P(a_j^t) and P(b_j^t) into k-dimensional vector representations p_j^t and h_j^t, respectively. Denote the popularity encoder as E_p : R^+ → R^k, which takes in a percentile value. Given the popularity percentile P(a_j^t) ∈ R^+ over a coarse time period t, the coarse-level popularity vector representation p_j^t ∈ R^k is computed as follows:

    p_j^t = E_p(P(a_j^t))

    (p_j^t)_i = 1 - \{P/(k-1)\}   if i = \lfloor P/(k-1) \rfloor
    (p_j^t)_i = \{P/(k-1)\}       if i = \lfloor P/(k-1) \rfloor + 1
    (p_j^t)_i = 0                 otherwise

where P is shorthand for P(a_j^t), ⌊·⌋ denotes the floor, {·} denotes the fractional part of a number, and (p_j^t)_i denotes the i-th index of p_j^t. For example, if k = 11, E_p(40.1) = [0, 0, 0, 0, 0.99, 0.01, 0, 0, 0, 0, 0]. The interpretation of this would be considering the 10 deciles for i ∈ {0, 1, ..., 9, 10} as basis vectors, and this popularity encoding as a linear combination of the nearest (in percentile space) two basis vectors. The fine-level popularity vector is calculated identically, i.e., h_j^t = E_p(P(b_j^t)). In this example, we have fixed the vector representation size to be 11, but this approach is fully generalizable to other sizes and would just require changing the multipliers in the encoding function. We also experimented with sinusoidal encodings of the same size, but found that the linear encoding empirically performed better.
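As a concrete illustration of § 4.1.1, the sketch below implements the discounted popularity counts of Equation (1), a cross-item percentile, and the linear percentile encoder E_p in plain NumPy. The function names, the 0–100 percentile scale, and the handling of ties are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def coarse_popularity(coarse_counts, t, gamma=0.5):
    """a_j^t = sum_{m=1..t} gamma^(t-m) * c_a(v_j^m)  (Equation 1, discounted coarse count).

    coarse_counts[m-1] holds c_a(v_j^m), the number of interactions with item j
    in coarse period m (e.g., a 10-day window)."""
    m = np.arange(1, t + 1)
    return float(np.sum(gamma ** (t - m) * np.asarray(coarse_counts[:t], dtype=float)))

def fine_popularity(fine_counts, t):
    """b_j^t = c_b(v_j^t): undiscounted interaction count of item j in fine period t."""
    return float(fine_counts[t - 1])

def percentile(value, values_of_all_items):
    """Percentile (0-100) of one item's popularity relative to all items at time t."""
    values = np.asarray(values_of_all_items, dtype=float)
    return 100.0 * np.mean(values <= value)

def encode_percentile(P, k=11):
    """E_p: spread the percentile P over the two nearest of k basis indices (linear encoding)."""
    x = P / (k - 1)          # with k = 11 and P in [0, 100], x lies in [0, 10]
    i = int(np.floor(x))
    frac = x - i
    vec = np.zeros(k)
    vec[i] = 1.0 - frac
    if i + 1 < k:
        vec[i + 1] = frac
    return vec

# reproduces the worked example above: E_p(40.1) puts 0.99 at index 4 and 0.01 at index 5
assert np.allclose(encode_percentile(40.1)[[4, 5]], [0.99, 0.01])
```

Running the snippet checks that E_p(40.1) with k = 11 matches the example above.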
[Figure 2: PrepRec architecture. The input sequence S_u of user u is passed through the Item Popularity Encoder (§ 4.1.1); the resulting item representations e_{u,1}^t, ..., e_{u,L}^t are summed with the relative time interval encodings T_{r_{u,1}}, ..., T_{r_{u,L}} and the positional encodings P_1, ..., P_L, and fed into the Popularity Dynamics-Aware Transformer (§ 4.1.5) built from multi-head self-attention, addition & LayerNorm, dropout, and point-wise feed-forward blocks; the resulting sequence representation q_u is combined with the item representation e_j^t via a dot product to produce the prediction ŷ(v_j^t | S_u) (§ 4.1.6).]
4.1.2 Universal Item Representation. We now define the popularity dynamics of v_j at time t over the coarse period (long-term horizon) to be P_j^t = {p_j^1, p_j^2, ..., p_j^{t-1}}, and over the fine period (short-term horizon) as H_j^t = {h_j^1, h_j^2, ..., h_j^{t-1}}. We use t - 1 to constrain access to future interactions and prevent information leakage, i.e., we do not have access to the popularity statistics of v_j at time t if we are at time t. For example, if an interaction happens on the second Wednesday in February, we consider the coarser and finer time periods up until the end of January and the end of the first week in February, respectively. To limit computation, we constrain window sizes m, n for P and H respectively. Formally, the coarse popularity dynamics of v_j at time t is P_j^t = {p_j^{t-m}, p_j^{t-m+1}, ..., p_j^{t-1}}, and the fine popularity dynamics of v_j at time t is H_j^t = {h_j^{t-n}, h_j^{t-n+1}, ..., h_j^{t-1}}. Finally, we compute the embedding of item v_j at time t via the universal item representation encoder, defined as a function E(P_j^t, H_j^t) that learns to encode the popularity dynamics P_j^t and H_j^t into a d-dimensional vector representation e_j^t. Specifically, we have:

    e_j^t = E(P_j^t, H_j^t) = W_p [ (\|_{i=t-m}^{t-1} p_j^i) \| (\|_{i=t-n}^{t-1} h_j^i) ]    (2)

where ∥ denotes the concatenation operation and W_p ∈ R^{d×k(m+n)} is a learnable weight matrix. In addition, we define ∥_{i=t-m}^{t-1} p_j^i ≔ p_j^{t-m} ∥ p_j^{t-m+1} ∥ ... ∥ p_j^{t-1} and ∥_{i=t-n}^{t-1} h_j^i ≔ h_j^{t-n} ∥ h_j^{t-n+1} ∥ ... ∥ h_j^{t-1}. The item popularity dynamics encoder can effectively capture the popularity change of items over different time periods. Most importantly, it does not take explicit item IDs or auxiliary information as input to compute the item embeddings. Instead, it learns to represent items through their popularity dynamics, which are universal across domains and applications.

4.1.3 Relative Time Interval. We also consider the time interval between two consecutive interactions when modeling sequences. Differences in time intervals might indicate differences in the users' behaviors. While previous works explore absolute time intervals [27], different domains exhibit diverse time scales, which makes modeling absolute time intervals ungeneralizable. Therefore, we propose to encode relative time intervals when modeling sequences. Given an interaction sequence S_u = {v_{u,1}, v_{u,2}, ..., v_{u,L}} of user u, we define the time interval between v_{u,j} and v_{u,j+1} as t_{u,j} = t(v_{u,j+1}) - t(v_{u,j}), where t(v_{u,j}) is the time that user u interacts with item v_{u,j}. We then rank the time intervals of user u. Define the rank of the relative time interval t_{u,j} as r_{u,j} = rank(t_{u,j}). The relative time interval encoding of interval t_{u,j} is then defined as T_{r_{u,j}} ∈ R^d, where T ∈ R^{L×d}, following the same setup as [52], is a fixed sinusoidal encoding matrix defined as:

    T_{i,2j} = \sin(i / L^{2j/d}), \qquad T_{i,2j+1} = \cos(i / L^{2j/d})    (3)

We also tried a learnable time interval encoding, but it yielded worse performance. We hypothesize that the sinusoidal encoding is more generalizable across domains and that the learnable encoding is more prone to overfitting.

4.1.4 Positional Encoding. As we will see in § 4.1.5, the self-attention mechanism does not take the positions of the items into account. Therefore, following [52], we also inject a fixed positional encoding for each position in a user's sequence. Denote the positional embedding of a position l as P_l ∈ R^d, where P ∈ R^{L×d}. We compute P using the same formula as in Equation (3). Again, we also tried a learnable positional encoding as presented in [22, 45], but it yielded worse results.

4.1.5 Popularity Dynamics-Aware Transformer. We follow previous works in sequential recommendation [22, 27, 45] and propose an extension to the self-attention mechanism by incorporating universal item representations (§ 4.1.2), relative time intervals (§ 4.1.3), and positional encoding (§ 4.1.4).
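Before turning to the transformer itself, here is a minimal sketch of the two fixed building blocks introduced above: the item-representation encoder of Equation (2) and the sinusoidal table of Equation (3) shared by the relative-time and positional encodings. Shapes and names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def encode_item(coarse_encodings, fine_encodings, W_p):
    """Equation (2): e_j^t = W_p [ (||_{i=t-m}^{t-1} p_j^i) || (||_{i=t-n}^{t-1} h_j^i) ].

    coarse_encodings: list of m vectors p_j^i (each in R^k), the coarse window
    fine_encodings:   list of n vectors h_j^i (each in R^k), the fine window
    W_p:              learnable matrix of shape (d, k * (m + n))
    """
    concat = np.concatenate(list(coarse_encodings) + list(fine_encodings))
    return W_p @ concat                      # d-dimensional e_j^t

def sinusoidal_table(L, d):
    """Equation (3): T_{i,2j} = sin(i / L^{2j/d}), T_{i,2j+1} = cos(i / L^{2j/d}).

    The same construction is reused for the positional table P (§ 4.1.4)."""
    i = np.arange(L)[:, None]                # row index
    j = np.arange(d // 2)[None, :]           # column pairs (2j, 2j+1)
    angle = i / np.power(float(L), 2.0 * j / d)
    table = np.zeros((L, d))
    table[:, 0::2] = np.sin(angle)
    table[:, 1::2] = np.cos(angle)
    return table

# example sizes: k = 11 as in the worked example above; m = 12, n = 4, d = 50, L = 200 follow § 5.2
k, m, n, d, L = 11, 12, 4, 50, 200
W_p = np.random.randn(d, k * (m + n)) * 0.01
e_jt = encode_item([np.zeros(k)] * m, [np.zeros(k)] * n, W_p)
T = sinusoidal_table(L, d)
```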
Firstly, we transform the user sequence {v_{u,1}, v_{u,2}, ..., v_{u,|S_u|}} of each user u into a fixed-length sequence S_u = {v_{u,1}, v_{u,2}, ..., v_{u,L}} by truncating the oldest interactions or padding, where L is a pre-defined hyper-parameter controlling the maximum length of the sequence. Given a user sequence S_u = {v_{u,1}, v_{u,2}, ..., v_{u,L}}, we compute its input matrix E_u as:

    E_u = [ e_{u,1}^{t} + T_{r_{u,1}} + P_1 ;
            e_{u,2}^{t'} + T_{r_{u,2}} + P_2 ;
            ... ;
            e_{u,L}^{t*} + T_{r_{u,L}} + P_L ]    (4)

where e_{u,1}^{t}, e_{u,2}^{t'}, ..., e_{u,L}^{t*} are computed from Equation (2), and T_{r_{u,1}}, T_{r_{u,2}}, ..., T_{r_{u,L}} and P_1, P_2, ..., P_L are computed following the procedures in § 4.1.3 and § 4.1.4, respectively.

Multi-Head Self-Attention. We adopt the widely used multi-head self-attention mechanism [52], i.e., Transformers. Specifically, it consists of multiple multi-head self-attention layers (denoted as MHAttn(·)) and point-wise feed-forward networks (FFN(·)). The multi-head self-attention mechanism is defined as:

    z_u = MHAttn(E_u)
    MHAttn(E_u) = Concat(head_1, ..., head_h) W^O    (5)
    head_i = Attn(E_u W_i^Q, E_u W_i^K, E_u W_i^V)

where E_u is the input matrix computed from Equation (4), h is a tunable hyper-parameter indicating the number of attention heads, and W_i^Q, W_i^K, W_i^V ∈ R^{d×d/h} are the learnable weight matrices, and […] the training. Therefore, we follow previous works [22, 27, 45] and apply layer normalization [2] and residual connections to each multi-head self-attention layer and point-wise feed-forward network. Formally, we have:

    g(x) = x + Dropout(g(LayerNorm(x)))    (8)

where g(x) is either the multi-head self-attention layer or the point-wise feed-forward network. Therefore, for every multi-head self-attention layer and point-wise feed-forward network, we first apply layer normalization to the input, then apply the multi-head self-attention layer or point-wise feed-forward network, and finally apply dropout and add the input x to the layer output. The LayerNorm function is defined as:

    LayerNorm(x) = α ⊙ (x - μ) / \sqrt{σ^2 + ε} + β    (9)

where ⊙ denotes the element-wise product, μ and σ are the mean and standard deviation of x, α and β are learnable parameters, and ε is a small constant to avoid numerical instability.
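A small PyTorch-style sketch of how the input matrix of Equation (4) and the pre-layer-norm residual wrapper of Equations (8)–(9) could be assembled; module names and the placeholder tensors are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PreLNResidual(nn.Module):
    """x -> x + Dropout(g(LayerNorm(x))), Equations (8)-(9); g is the attention or FFN sublayer."""
    def __init__(self, d, sublayer, dropout=0.3):      # dropout 0.3 as reported in § 5.2
        super().__init__()
        self.norm = nn.LayerNorm(d)                    # learnable alpha (weight) and beta (bias)
        self.sublayer = sublayer
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(self.sublayer(self.norm(x)))

L, d = 200, 50                 # max sequence length and dimension from § 5.2
e = torch.randn(L, d)          # e_{u,l}^t from Equation (2); random placeholder values here
T = torch.randn(L, d)          # relative-time table (Equation (3)), indexed by interval rank
P = torch.randn(L, d)          # positional table (Equation (3)), indexed by position
ranks = torch.randperm(L)      # r_{u,j}: rank of the j-th relative time interval
E_u = e + T[ranks] + P         # Equation (4), one row per position in the sequence

ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))   # point-wise feed-forward
block = PreLNResidual(d, ffn)
out = block(E_u)               # shape (L, d): one residual sub-block of the transformer
```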
Table 2: Zero-shot recommendation results. Results for cross-domain, cross-application zero-shot transfer. S→T means we
pre-train PrepRec using S’s data (columns) and evaluate on T’s data (rows). We follow the zero-shot inference setting in § 4.3.
Reference models are trained from scratch on the target dataset. The best-performing zero-shot transfer results of each dataset
are in bold. We empirically show PrepRec achieves remarkable zero-shot generalization performance across domains.
4.3 Zero-shot Inference
Suppose we are given a pre-trained model F trained on M, where F is the scoring function learned from the source domain M. Denote the interaction matrix of the target domain as M'. We first compute the popularity dynamics of each item in M' over a coarser period and a finer period. Then, we apply the pre-trained model F to M' and compute the prediction score as:

    ŷ(v_{j'}^t | S_{u'}) = F(v_{j'}^t | S_{u'}, M')    (12)

Note that in this procedure, we use the pre-trained model F trained on the source domain M to predict the next item v_{j'}^t that user u' will interact with in the target domain M'. We do not use any auxiliary information in either domain. In addition, none of the parameters in F are updated during the zero-shot inference process.

To summarize, in this section, we showed how to develop a pre-trained sequential recommender system based on the popularity dynamics of items. We enforce the structure of each interaction in the sequence through the positional encoding and introduce a relative time encoding for modeling time intervals between two consecutive interactions. In addition, we showed the training process and formally defined the zero-shot inference procedure. In the next section, we present experiments to evaluate PrepRec.

5 EXPERIMENTS
We present extensive experiments on five real-world datasets to evaluate the performance of PrepRec, following the problem settings in § 3. We introduce the following research questions (RQ) to guide our experiments: (RQ1) How well can PrepRec perform on zero-shot cross-domain and cross-application transfer? (RQ2) Why should we model popularity dynamics in sequential recommendation? (RQ3) What affects the performance of PrepRec?

5.1 Datasets and Preprocessing
We evaluate our proposed method on five real-world datasets across different applications, with varying sizes and density levels. Amazon [34] is a series of product rating datasets obtained from Amazon.com, split by product categories. We consider the Office and Tool product domains in our study. Douban [44] consists of three datasets across different domains, collected from Douban.com, a Chinese review website. We work with the Movie and Music datasets. Epinions [48, 49] is a dataset crawled from the product review site Epinions. We utilize the ratings dataset for our study.

We present dataset statistics in Table 1. We compute the density as the ratio of the number of interactions to the number of users times the number of items. The Douban datasets (i.e., Movie and Music) are the densest and have no auxiliary information available, while the Amazon review datasets (i.e., Office and Tool) are the sparsest.

Table 1: Dataset statistics

    Dataset    #users    #items   #actions   avg length   density
    Office     101,133   27,500   0.74M      7.3          0.03%
    Tool       240,464   73,153   1.96M      8.1          0.01%
    Movie      70,404    40,210   11.55M     164.2        0.41%
    Music      20,539    10,121   0.66M      32.2         0.32%
    Epinions   30,989    20,382   0.54M      17.5         0.09%

For fair evaluation, we follow the same preprocessing procedure as previous works [22, 45], i.e., we binarize the explicit ratings to implicit feedback. In addition, for each user, we sort interactions by their timestamp and use the second most recent action for validation, the most recent action for testing, and the rest for training.

5.2 Baselines and Experimental Setup
Baselines: Our baselines (supplementary materials contain detailed descriptions) include classic general recommendation models (e.g., MostPop, BPR [38], NCF [15], LightGCN [14]) and state-of-the-art sequential recommendation models (e.g., Caser [50], SasRec [22], BERT4Rec [45], TiSasRec [27], CL4SRec [59]).

Following previous works [15, 22, 24, 45], we adopt the leave-one-out evaluation method: for each user, we pair the test item with 100 unobserved items according to the user's interaction history. Then we rank the test item for the user among the 101 total items. We use two standard evaluation metrics for top-k recommendation: Recall@k (R@k) and Normalized Discounted Cumulative Gain@k (N@k). Our model explicitly utilizes popularity information; therefore, we also present results where we sample the negatives based on their popularities, i.e., popular items have higher probabilities of being sampled as negatives. We report the average of R@k and N@k over all the test interactions.

We use publicly available implementations for the baselines. For fair evaluation, we set the dimension size d to 50, the maximum sequence length L to 200, and the batch size to 128 in all models. We use an Adam optimizer, tune the learning rate in the range {10^-4, 10^-3, 10^-2}, and set the weight decay to 10^-5. We use a dropout regularization rate of 0.3 for all models. We set γ = 0.5 in Equation (1); we discuss the reason in the supplementary materials. We define the coarse and fine periods to be 10 and 2 days respectively, and we fix the window sizes to m = 12 and n = 4 for all datasets (§ 4.1.2). We train PrepRec for a maximum of 80 epochs. All experiments are conducted on a Tesla V100 using PyTorch. We repeat each experiment 5 times with different random seeds and report the average performance.
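The leave-one-out evaluation described above reduces to ranking each held-out item against its 100 sampled negatives. A minimal sketch of the two metrics (illustrative code, not the authors' evaluation script):

```python
import numpy as np

def recall_ndcg_at_k(pos_score, neg_scores, k=10):
    """Rank one held-out test item against its 100 sampled negatives."""
    scores = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    rank = int(np.sum(scores[1:] > scores[0]))            # negatives scored above the test item
    hit = 1.0 if rank < k else 0.0                        # Recall@k with a single positive
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0   # NDCG@k with one relevant item
    return hit, ndcg

def evaluate(per_user_scores, k=10):
    """per_user_scores: iterable of (pos_score, neg_scores) pairs, one per test interaction."""
    hits, ndcgs = zip(*(recall_ndcg_at_k(p, n, k) for p, n in per_user_scores))
    return float(np.mean(hits)), float(np.mean(ndcgs))

# e.g. evaluate([(2.3, np.random.randn(100)), (0.1, np.random.randn(100))])
```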
Table 3: Regular sequential recommendation results, RQ2 (§ 5.4.1). We make bold the best results and mark the best baseline results with '∗'. Interp represents the interpolation results between PrepRec and BERT4Rec. PrepRec Δ denotes the performance difference between PrepRec and the best results among the selected baselines, and similarly for Interp Δ. PrepRec achieves comparable performance to the state-of-the-art sequential recommenders, on average only 0.2% worse than the best-performing sequential recommenders in R@10, while having only a fraction of the model size (Table 5). After a simple post-hoc interpolation, we outperform the state-of-the-art sequential recommenders by 11.8% in R@10 on average.
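The Interp rows referenced above come from the post-hoc score interpolation detailed in § 5.4.1, a convex combination of the two models' scores; a minimal sketch, assuming both models produce comparable per-candidate scores:

```python
import numpy as np

def interpolate_scores(preprec_scores, bert4rec_scores, alpha=0.5):
    """y_intp(v | S_u) = alpha * y_PrepRec(v | S_u) + (1 - alpha) * y_BERT4Rec(v | S_u)."""
    a = np.asarray(preprec_scores, dtype=float)
    b = np.asarray(bert4rec_scores, dtype=float)
    return alpha * a + (1.0 - alpha) * b
```

With α = 0.5 (the value used for all datasets), the interpolated scores are ranked with the same leave-one-out protocol as above.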
5.3 Zero-shot Transferability (RQ1)
5.3.1 Zero-shot Transfer Results. We follow the zero-shot inference setting introduced in § 4.3 and report the results in Table 2. We also include the results of PrepRec and the best-performing sequential recommenders trained on the target dataset for reference. In the zero-shot setting, PrepRec shows minimal performance reduction in the target datasets (i.e., 6% maximum and 2% average reduction in R@10). The best zero-shot transfer results from PrepRec only fall short of the selected sequential recommendation baselines by up to 4%, and even outperform them (by up to 6.5%) on Epinions and Office. We found that PrepRec trained on Douban-Movie and Amazon-Tools shows the highest generalizability, even outperforming the target-trained models on Music (0.811 vs. 0.782 on R@10). We conjecture that this is because Movie is the largest dataset in terms of the number of interactions. Overall, these results show PrepRec's effectiveness in zero-shot transfer without any training on interaction data or side information. In addition, this experiment also demonstrates that the popularity dynamics-based item and sequence representations are generalizable across domains.

5.3.2 Robustness to Noise. We further investigate the robustness of PrepRec to possible noise in zero-shot transfer by adding Gaussian noise ∼ N(0, σ) to the item popularity statistics and evaluating the zero-shot transfer performance on Douban-Music and Epinions from Douban-Movie. We randomly choose some percentage of items in the sequence to add noise to, as indicated in Figure 3. We find that PrepRec is relatively robust to noise, maintaining robust performance across different noise levels when 20% of the interactions are noised. We attribute this to the model's ability to learn from the overall popularity dynamics, which is less affected by noise in individual item popularity statistics. In addition, when the noise level is relatively low, e.g., σ ≤ 5, even if 100% of the sequence is noised, PrepRec still holds its performance, indicating that significant item popularity shifts exist in the sequence (Figure 1).

[Figure 3: Zero-shot transfer performance on Epinions (left) and Music (right) under Gaussian noise, for noise standard deviations of 1, 2, 5, 10, and 20 (rows) and 20%–100% of noised interactions (columns).]

Table 4: Regular sequential recommendation results (§ 5.4.1) with popularity-based negative sampling. PrepRec can learn discriminative item and sequence representations despite depending only on popularity statistics.

    Dataset          Music           Office          Epinions
    Metric           R@10    N@10    R@10    N@10    R@10    N@10
    MostPop          0.197   0.139   0.099   0.046   0.163   0.110
    SasRec [22]      0.749   0.519   0.453   0.291   0.658   0.442
    BERT4Rec [45]    0.747   0.519   0.461   0.299   0.655   0.456
    PrepRec          0.739   0.523   0.443   0.280   0.762   0.551
    PrepRec Δ        -1.3%   +0.7%   -2.2%   -6.3%   +15.8%  +17.2%

Table 5: Comparison of model sizes (i.e., number of learnable parameters in millions) over different datasets. PrepRec is 12 to 90x smaller.

    Dataset     Office    Tool      Movie     Music     Epinions
    SasRec      1.331M    3.581M    2.044M    0.542M    1.054M
    BERT4Rec    2.687M    7.233M    4.126M    1.094M    2.127M
    TiSasRec    1.367M    3.617M    2.127M    0.578M    1.09M
    PrepRec     0.045M    0.045M    0.045M    0.045M    0.045M

[…] dynamics in sequential recommendation, and how much does it explain the performance of state-of-the-art sequential recommenders? Therefore, we propose the following experiments to investigate the importance of popularity dynamics in sequential recommendation.

5.4.1 Regular Sequential Recommendation (RQ2). We show comparisons of PrepRec against state-of-the-art sequential recommenders on the regular sequential recommendation task (Table 3), i.e., all models are trained from scratch. PrepRec achieves competitive performance, within 2% in R@10 and 5% in N@10 of the state-of-the-art baselines. On Epinions, PrepRec even outperforms all baselines by 7.3%, which is particularly impressive since PrepRec has significantly fewer model parameters (Table 5).

PrepRec explicitly models popularity information, and MostPop demonstrates decent performance compared to the remaining baselines; we therefore conduct an additional experiment (Table 4) where we sample the unobserved (negative) items based on their popularity [45]. As shown in Table 4, MostPop's performance dropped significantly, while PrepRec shows even more competitive performance on some datasets (e.g., Music and Epinions). This suggests that PrepRec learns discriminative item and sequence representations.

PrepRec learns item representations through popularity dynamics, which is conceptually different from learning representations specific to each item ID. Therefore, we propose a simple post-hoc interpolation to investigate how much popularity dynamics can explain the performance of state-of-the-art sequential recommenders. We interpolate the scores from PrepRec with the scores from BERT4Rec as follows: ŷ_intp(v_j^t | S_u) = α · ŷ_O(v_j^t | S_u) + (1 - α) · ŷ_S(v_j^t | S_u), where ŷ_O(v_j^t | S_u) and ŷ_S(v_j^t | S_u) are the scores from PrepRec (Equation (10)) and BERT4Rec, respectively. We set α = 0.5 for all datasets. After interpolation, the performance significantly boosts, by up to 34.9% in N@10. Gains are largest in the medium- and low-density datasets (Epinions, Amazon), indicating that our model complements existing methods in sparse datasets where item embeddings are less informative. Therefore, it is crucial to consider popularity dynamics to maximize performance.

5.4.2 Qualitative Analysis on Regular Sequential Recommendation. We analyze the performance of PrepRec in detail. We separate test items into equally sized groups based on their popularity in the training set and then compute the average R@10 and N@10 for each group (Figure 4). PrepRec achieves better performance on the item group with the least interactions, i.e., long-tail items, while SasRec and BERT4Rec show stronger performance on popular items. Long-tail item recommendation is a particularly challenging task explored by many previous works [41] and requires recommenders able to learn high-quality representations with just a few interactions. This corresponds to our observation that PrepRec is more robust to data sparsity and can learn discriminative item and sequence representations (§ 5.4.1), showing that long-tail item recommendation can benefit from PrepRec's popularity dynamics-based item representations.

[Figure 4: Recommendation results for different item popularity groups (§ 5.4.2), where Group 1 represents the least popular items and Group 5 the most popular items; the panels show Recall@10 on Office, Music, and Epinions for PrepRec, BERT4Rec, and SasRec. PrepRec achieves better performance on long-tail items while having competitive performance on popular items.]

5.5 What affects PrepRec performance? (RQ3)

Table 6: Ablation study of PrepRec's different variants.

    Dataset                           Music           Office          Epinions
    Metric                            R@10    N@10    R@10    N@10    R@10    N@10
    PrepRec                           0.782   0.573   0.536   0.344   0.795   0.580
    w/o Relative Time T (§ 4.1.3)     0.734   0.514   0.541   0.334   0.782   0.562
    w/o Positional P (§ 4.1.4)        0.765   0.544   0.530   0.332   0.772   0.554
    w/o Popularity Dynamics P         0.800   0.594   0.530   0.341   0.761   0.560
    w/o Popularity Dynamics H         0.705   0.582   0.525   0.337   0.730   0.533
    Sinusoidal Popularity Encoding    0.779   0.570   0.529   0.340   0.772   0.561

5.5.1 Ablation Study. Here, we assess the importance of different components crucial to PrepRec, i.e., the relative time encoding (§ 4.1.3), the positional encoding (§ 4.1.4), the popularity encoder E_p (§ 4.1.1), and the resolutions for popularity dynamics (§ 4.1.2). We find that removing the relative time encoding T results in the largest performance drop on both the Music and Office datasets. This suggests that the relative time encoding is crucial for effectively capturing the popularity dynamics. Removing the positional encoding P results in a maximum of a 2.2% drop in R@10 on the Office dataset, indicating that positional encoding is important for capturing sequential information. In addition, changing E_p to the non-linear sinusoidal encoding shows worse performance on all datasets, meaning that the linear encoding is more suitable for capturing the popularity dynamics. On the Music dataset, removing the coarse popularity encoding P improves the performance by 2% in R@10, while removing the fine popularity encoding H results in a 7.5% drop in R@10. This suggests that the music domain is more sensitive to recent trends in popularity. Coarse and fine popularity encodings complement each other on the other datasets.

5.5.2 Effect of discounting factor γ. We examine the effect of different preprocessing weights γ used in the popularity calculation (§ 4.1.1). In particular, γ = 1 corresponds to the cumulative popularity, or in other words, at a given time period t, the overall number of interactions up to period t. On the other hand, γ = 0 corresponds to the current popularity, i.e., percentiles are calculated over interactions just in t, the same as b_j^t in Equation (1). When γ = 0.5, it can be interpreted as interactions being exponentially weighted by time, with a half-life of 1 time period.
Dataset Music Office Epinions datasets’ total interactions. After further processing, we follow the
Metric R@10 N@10 R@10 N@10 R@10 N@10 same experimental setup in § 5.2. We fine-tune PrepRec and re-
𝛾 = 0 (Curr-pop) 0.749 0.542 0.512 0.328 0.689 0.496 train the baselines from scratch on the target dataset and report the
𝛾 = 0.25 0.764 0.529 0.538 0.338 0.761 0.562 results in Table 9. We find that PrepRec , after fine-tuning, outper-
𝛾 = 0.5∗ (weighted -pop) 0.782 0.573 0.536 0.344 0.795 0.580 forms the selected baselines on Office and Epinions by up to 12.9%,
𝛾 = 0.75 0.755 0.520 0.543 0.336 0.747 0.519 indicating that PrepRec is capable of learning from the limited data
𝛾 = 1 (cumul-pop) 0.695 0.452 0.530 0.330 0.733 0.505 and can be further fine-tuned to achieve better performance.
Table 7: Recommendation results for varying the discounting 5.7 Discussion
factor 𝛾 in § 4.1.2. 𝛾 = 0.5 is the default setting, denoted by PrepRec demonstrates the strong ability for zero-shot transfer. We
′ ∗′ . We find that 𝛾 = 0.5 generally outperforms the other two argue that PrepRec is particularly useful in the following scenarios:
settings 1) initial sequential model when the data in the domain is sparse; 2)
and 27% N@10 over cumul-pop in the dense Music dataset, and backbone for developing more complex sequential recommenders
the largest gains over curr-pop in the sparser Office (5% N@10 (i.e., prediction interpolation) 3) online recommendation settings.
and 4% N@10) and Epinions (15% R@10 and 17%N@10) datasets. PrepRec captures the popularity shifts in the sequence and is
We suspect this is due to cumulative measures in denser datasets complementary to state-of-the-art sequential recommenders. It is
failing to capture recent trends due to the large historical presence, worth noting that item popularity dynamics might not capture ev-
while current-only measures in sparser datasets convey too little erything in users’ preferences, but we believe they are orthogonal
or noisy information and lose the information of long-term trends. components towards capturing user preferences, which could ex-
curr-pop shows decent performance on the Music dataset, suggest- plain why the interpolation results substantially outperform both
ing that Music trends might be more cyclical and thus the current PrepRec and the selected state-of-the-art baselines (Table 3).
popularity is more informative. Additionally, time granularity is also crucial for popularity dy-
namics, and sequence analysis requires careful consideration of the
time horizon. Intuitively, when the dataset time precision is less
Dataset Music Office Epinions
accurate, i.e., weeks or days, we expect the performance to decrease
Metric R@10 N@10 R@10 N@10 R@10 N@10
as the sequential information and popularity dynamics become
Fine:2 days; Coarse:10 days ∗ 0.782 0.573 0.536 0.344 0.795 0.580 muddled. If the time precision in the training data increases, we can
Fine:4 days; Coarse:15 days 0.778 0.553 0.537 0.341 0.790 0.574 expect more accurate user sequences and more accurate measures
Fine:7 days; Coarse:30 days 0.760 0.509 0.526 0.334 0.757 0.543
of popularity dynamics. In general, time precision will not signif-
Table 8: Recommendation results for varying time horizons. icantly impact the performance of PrepRec in most scenarios as
Fine and coarse time horizons are used for short-term and in practice, online platforms can record precise time data for each
long-term popularity dynamics respectively (§ 4.1.1). user-item interaction. We will include more discussion in the arXiv
5.5.3 Effect of Different Time Horizons. We study the effect of version of this paper [55].
different time horizons to PrepRec . We found that in general, long- 6 CONCLUSION
term horizons of 30 days and short-term horizons of 7 days perform In this paper, using the critical insight of popularity dynamics in the
worse than the other settings. This is likely because the long-term user’s sequence, we developed a novel pre-trained sequential rec-
horizon might lead to the lack of resolutions in popularity statistics. ommendation framework, PrepRec, for the zero-shot, cross-domain
We also find that depending on the dataset, the effect of different setting without any auxiliary information. PrepRec learned trans-
time horizons also varies. For example, both Music and Epinions ferable, universal item representations via popularity dynamics-
show larger performance decrease from short to long-term horizons aware transformers. We empirically showed that PrepRec can
than Office. This could be because Music and Epinions are more achieve excellent zero-shot transfer to a target domain, compa-
sensitive to recent trends than Office, or their data are denser in rable to state-of-the-art sequential recommenders trained on the
terms of time granularity. target domain. With extensive within-domain experiments, we
found performance gains of 11.8% when we interpolated PrepRec ’s
5.6 Fine-tune Capability
results with state-of-the-art sequential recommenders, indicating
Dataset Movie→Music Tool→Office Tool→Epinions that PrepRec is learning complementary information. We posit
Metric R@10 N@10 R@10 N@10 R@10 N@10 that popularity dynamics are crucial for developing generalizable
PrepRec 0.803 0.591 0.472 0.300 0.489 0.264
sequential recommenders.
SasRec 0.815 0.599 0.437 0.290 0.433 0.245 As part of future work, we plan to investigate: 1) developing
BERT4Rec 0.816 0.602 0.407 0.249 0.433 0.255 more complex sequential recommenders by using PrepRec as a
backbone (i.e., prediction interpolation and auxiliary information),
Table 9: Recommendation results for fine-tuning PrepRec. and 2) exploring online recommendation settings.
We fine-tune PrepRec and retrain the baselines from scratch
on the target dataset. Acknowledgments
We also investigate PrepRec’s fine-tune capability. To ensure This work was generously supported by the National Science Foun-
Additionally, time granularity is crucial for popularity dynamics, and sequence analysis requires careful consideration of the time horizon. Intuitively, when the dataset's time precision is coarser, e.g., weeks or days, we expect performance to decrease as the sequential information and popularity dynamics become muddled. If the time precision in the training data increases, we can expect more accurate user sequences and more accurate measures of popularity dynamics. In general, time precision will not significantly impact the performance of PrepRec in most scenarios, since in practice online platforms can record precise timestamps for each user-item interaction. We will include more discussion in the arXiv version of this paper [55].

6 CONCLUSION
In this paper, using the critical insight of popularity dynamics in the user's sequence, we developed a novel pre-trained sequential recommendation framework, PrepRec, for the zero-shot, cross-domain setting without any auxiliary information. PrepRec learns transferable, universal item representations via popularity dynamics-aware transformers. We empirically showed that PrepRec can achieve excellent zero-shot transfer to a target domain, comparable to state-of-the-art sequential recommenders trained on the target domain. With extensive within-domain experiments, we found performance gains of 11.8% when we interpolated PrepRec's results with those of state-of-the-art sequential recommenders, indicating that PrepRec learns complementary information. We posit that popularity dynamics are crucial for developing generalizable sequential recommenders.
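At its simplest, the interpolation mentioned above blends the two models' item scores before ranking. The exact formula is not given in this section, so the snippet below is only a generic sketch under the assumption of a min-max-normalised convex combination with a validation-tuned weight alpha; the names are ours.

```python
import numpy as np

def interpolate_scores(preprec_scores, base_scores, alpha=0.5):
    """Convex combination of two models' item scores for one user.

    Both inputs have shape (num_items,); `alpha` would be tuned on a
    validation split. Min-max normalisation keeps the score scales comparable.
    """
    def minmax(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return alpha * minmax(preprec_scores) + (1.0 - alpha) * minmax(base_scores)

# Example: blend PrepRec's scores with a SASRec-style model's scores,
# then rank and keep the top 10 items.
rng = np.random.default_rng(0)
blended = interpolate_scores(rng.random(500), rng.random(500), alpha=0.5)
top10 = np.argsort(-blended)[:10]
```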
As part of future work, we plan to investigate: 1) developing more complex sequential recommenders by using PrepRec as a backbone (i.e., prediction interpolation and auxiliary information), and 2) exploring online recommendation settings.

Acknowledgments
This work was generously supported by the National Science Foundation (NSF) under grant number 2312561. We would also like to thank the anonymous reviewers for their valuable feedback.
References
[1] Alejandro Ariza-Casabona, Bartlomiej Twardowski, and Tri Kurniawan Wijaya. 2023. Exploiting graph structured cross-domain representation for multi-domain recommendation. In European Conference on Information Retrieval. Springer, 49–65.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[3] Albert-Laszlo Barabasi. 2005. The origin of bursts and heavy tails in human dynamics. Nature 435, 7039 (2005), 207–211.
[4] Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science 286, 5439 (1999), 509. http://search.ebscohost.com/login.aspx?direct=true&db=tfh&AN=2405932&site=ehost-live
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
[7] Hao Ding, Yifei Ma, Anoop Deoras, Yuyang Wang, and Hao Wang. 2021. Zero-shot recommender systems. arXiv preprint arXiv:2105.08318 (2021).
[8] Manqing Dong, Feng Yuan, Lina Yao, Xiwei Xu, and Liming Zhu. 2020. Mamo: Memory-augmented meta-optimization for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 688–697.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[10] Xiaoyu Du, Xiang Wang, Xiangnan He, Zechao Li, Jinhui Tang, and Tat-Seng Chua. 2020. How to learn item representation for cold-start multimedia recommendation?. In Proceedings of the 28th ACM International Conference on Multimedia. 3469–3477.
[11] Philip J Feng, Pingjun Pan, Tingting Zhou, Hongxiang Chen, and Chuanjiang Luo. 2021. Zero shot on the cold-start problem: Model-agnostic interest learning for recommender systems. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 474–483.
[12] Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. 2021. Pre-training graph neural networks for cold-start users and items representation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 265–273.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[14] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. CoRR abs/2002.02126 (2020). arXiv:2002.02126 https://arxiv.org/abs/2002.02126
[15] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
[16] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[17] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM conference on recommender systems. 241–248.
[18] Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023. Learning vector-quantized item representation for transferable sequential recommenders. In Proceedings of the ACM Web Conference 2023. 1162–1171.
[19] Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
[20] Guangneng Hu, Yu Zhang, and Qiang Yang. 2018. Conet: Collaborative cross networks for cross-domain recommendation. In Proceedings of the 27th ACM international conference on information and knowledge management. 667–676.
[21] Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2020. A re-visit of the popularity baseline in recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1749–1752.
[22] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
[23] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[24] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 426–434.
[25] Yehuda Koren. 2009. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 447–456.
[26] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-Learned User Preference Estimator for Cold-Start Recommendation. In KDD. 1073–1082.
[27] Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th international conference on web search and data mining. 322–330.
[28] Pan Li and Alexander Tuzhilin. 2020. Ddtcdr: Deep dual transfer cross domain recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining. 331–339.
[29] Weiming Liu, Jiajie Su, Chaochao Chen, and Xiaolin Zheng. 2021. Leveraging distribution alignment via stein path for cross-domain cold-start recommendation. Advances in Neural Information Processing Systems 34 (2021), 19223–19234.
[30] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[31] Yuanfu Lu, Yuan Fang, and Chuan Shi. 2020. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1563–1573.
[32] Fuyu Lv, Taiwei Jin, Changlong Yu, Fei Sun, Quan Lin, Keping Yang, and Wilfred Ng. 2019. SDM: Sequential deep matching model for online large-scale recommender system. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2635–2643.
[33] Tong Man, Huawei Shen, Xiaolong Jin, and Xueqi Cheng. 2017. Cross-domain recommendation: An embedding and mapping approach. In IJCAI, Vol. 17. 2464–2470.
[34] Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 188–197.
[35] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
[37] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[38] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI. AUAI Press, 452–461.
[39] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on World wide web. 811–820.
[40] Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. 2006. Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market. Science 311, 5762 (2006), 854–856. http://www.jstor.org/stable/3843620
[41] Aravind Sankar, Junting Wang, Adit Krishnan, and Hari Sundaram. 2021. ProtoCF: Prototypical Collaborative Filtering for Few-Shot Recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys '21). Association for Computing Machinery, New York, NY, USA, 166–175. https://doi.org/10.1145/3460231.3474268
[42] Guy Shani, David Heckerman, Ronen I Brafman, and Craig Boutilier. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, 9 (2005).
[43] Xiang-Rong Sheng, Liqin Zhao, Guorui Zhou, Xinyao Ding, Binding Dai, Qiang Luo, Siran Yang, Jingshan Lv, Chi Zhang, Hongbo Deng, et al. 2021. One model to serve all: Star topology adaptive recommender for multi-domain ctr prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4104–4113.
[44] Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. 2019. Session-based social recommendation via dynamic graph attention networks. In Proceedings of the Twelfth ACM international conference on web search and data mining. 555–563.
[45] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
[46] Qiaoyu Tan, Jianwei Zhang, Ninghao Liu, Xiao Huang, Hongxia Yang, Jingren Zhou, and Xia Hu. 2021. Dynamic memory based attention network for sequential recommendation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4384–4392.
[47] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st workshop on deep learning for recommender systems. 17–22.
[48] J. Tang, H. Gao, H. Liu, and A. Das Sarma. 2012. eTrust: Understanding trust evolution in an online world. 253–261.
[49] J. Tang, H. Gao, and H. Liu. 2012. mTrust: Discerning multi-faceted trust in a connected world. In Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 93–102.
[50] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
[51] Trinh Xuan Tuan and Tu Minh Phuong. 2017. 3D convolutional networks for session-based recommendation with content features. In Proceedings of the eleventh ACM conference on recommender systems. 138–146.
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[53] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. Dropoutnet: Addressing cold start in recommender systems. In Advances in neural information processing systems. 4957–4966.
[54] Junting Wang, Adit Krishnan, Hari Sundaram, and Yunzhe Li. 2023. Pre-trained Neural Recommenders: A Transferable Zero-Shot Framework for Recommendation Systems. arXiv:2309.01188 [cs.IR]
[55] Junting Wang, Praneet Rathi, and Hari Sundaram. 2024. A Pre-trained Sequential Recommendation Framework: Popularity Dynamics for Zero-shot Transfer. arXiv:2401.01497 [cs.IR] https://arxiv.org/abs/2401.01497
[56] Jinpeng Wang, Ziyun Zeng, Yunxiao Wang, Yuting Wang, Xingyu Lu, Tianxiang Li, Jun Yuan, Rui Zhang, Hai-Tao Zheng, and Shu-Tao Xia. 2023. MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia (Ottawa, ON, Canada) (MM '23). Association for Computing Machinery, New York, NY, USA, 6548–6557. https://doi.org/10.1145/3581783.3611967
[57] Yinwei Wei, Xiang Wang, Qi Li, Liqiang Nie, Yan Li, Xuanping Li, and Tat-Seng Chua. 2021. Contrastive learning for cold-start recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 5382–5390.
[58] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining. 495–503.
[59] Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 1259–1273.
[60] Haochao Ying, Fuzhen Zhuang, Fuzheng Zhang, Yanchi Liu, Guandong Xu, Xing Xie, Hui Xiong, and Jian Wu. 2018. Sequential recommender system based on hierarchical attention network. In IJCAI International Joint Conference on Artificial Intelligence.
[61] Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation. Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval (2020).
[62] Chenyi Zhang, Ke Wang, Hongkun Yu, Jianling Sun, and Ee-Peng Lim. 2014. Latent factor transition for dynamic collaborative filtering. In Proceedings of the 2014 SIAM international conference on data mining. SIAM, 452–460.
[63] Cheng Zhao, Chenliang Li, Rong Xiao, Hongbo Deng, and Aixin Sun. 2020. CATN: Cross-domain recommendation for cold-start users via aspect transfer network. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 229–238.
[64] Yongchun Zhu, Kaikai Ge, Fuzhen Zhuang, Ruobing Xie, Dongbo Xi, Xu Zhang, Leyu Lin, and Qing He. 2021. Transfer-meta framework for cross-domain recommendation to cold-start users. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1813–1817.