Chapter 4
Semantics-Aware Content-Based
Recommender Systems
Marco de Gemmis, Pasquale Lops, Cataldo Musto, Fedelucio Narducci,
and Giovanni Semeraro
4.1 Introduction
Content-based recommender systems (CBRSs) rely on item and user descriptions
(content) to build item representations and user profiles to suggest items similar
to those a target user already liked in the past. The basic process of producing
content-based recommendations consists in matching up the attributes of the target
user profile, in which preferences and interests are stored, with the attributes of the
items. The result is a relevance score that predicts the target user’s level of interest
in those items. Usually, attributes for describing an item are features extracted from
metadata associated with that item, or textual features extracted directly from the item
description. The content extracted from metadata is often too short and not sufficient
to correctly define the user interests, while the use of textual features involves
a number of complications when learning a user profile due to natural language
ambiguity. Polysemy, synonymy, multi-word expressions, and named entity recognition
and disambiguation are inherent problems of traditional keyword-based profiles,
which cannot go beyond lexical and syntactic structures to infer the user's interest
in topics.
The ever-increasing interest in semantic technologies and the availability of
several open knowledge sources, such as Wikipedia, DBpedia, Freebase, and
BabelNet, have fueled recent progress in the field of CBRSs. Novel research works
have introduced semantic techniques that shift from a keyword-based to a concept-
based representation of items and user profiles. These observations motivate the
integration of proper techniques for deep content analytics borrowed
from Natural Language Processing (NLP) and Semantic Technologies, which is one
of the most innovative lines of research in semantic recommender systems [61].
We roughly classify semantic techniques into top-down and bottom-up
approaches. Top-down approaches rely on the integration of external knowledge,
such as machine-readable dictionaries, taxonomies (or IS-A hierarchies), thesauri or
ontologies (with or without value restrictions and logical constraints), for annotating
items and representing user profiles in order to capture the semantics of the target
user's information needs. The main motivation behind top-down approaches is the
challenge of providing recommender systems with the linguistic and common-sense
knowledge, as well as the cultural background, that characterize the human ability
to interpret documents expressed in natural language and to reason on their meaning.
On the other hand, bottom-up approaches exploit the so-called geometric
metaphor of meaning to represent complex syntagmatic and paradigmatic relations
between words in high-dimensional vector spaces. According to this metaphor, each
word (and each document as well) can be represented as a point in a vector space.
The peculiarity of these models is that the representation is learned by analyzing
the context in which the word is used, in such a way that terms (or documents) similar
to each other are close in the space. For this reason, bottom-up approaches are also
called distributional models. One of the great virtues of these approaches is that they
are able to induce the semantics of terms by analyzing their use in large corpora
of textual documents using unsupervised mechanisms, as evidenced by the recent
advances of machine translation techniques [52, 83].
This chapter describes a variety of semantic approaches, both top-down and
bottom-up, and shows how to leverage them to build a new generation of semantic
CBRSs that we call semantics-aware content-based recommender systems.
4.2 Overview of Content-Based Recommender Systems
This section provides an overview of the basic principles for building CBRSs
and of the main techniques for representing items, learning user profiles, and providing
recommendations. The most important limitations of CBRSs are also discussed,
while the semantic techniques useful to tackle those limitations are introduced in
the next sections.
The high level architecture of a content-based recommender system is depicted
in Fig. 4.1. The recommendation process is performed in three steps, each of which
is handled by a separate component:
• CONTENT ANALYZER—When information has no structure (e.g. text), some
kind of pre-processing step is needed to extract structured relevant information.
The main responsibility of the component is to represent the content of items
(e.g. documents, Web pages, news, product descriptions, etc.) coming from
information sources in a form suitable for the next processing steps.
Fig. 4.1 High level architecture of a content-based recommender
Data items
are analyzed by feature extraction techniques in order to shift item representation
from the original information space to the target one (e.g. Web pages represented
as keyword vectors). This representation is the input to the PROFILE LEARNER
and FILTERING COMPONENT;
• PROFILE LEARNER—This module collects data representative of the user
preferences and tries to generalize this data, in order to construct the user
profile. Usually, the generalization strategy is realized through machine learning
techniques [86], which are able to infer a model of user interests starting from
items liked or disliked in the past. For instance, the PROFILE LEARNER of a Web
page recommender can implement a relevance feedback method [113] in which
the learning technique combines vectors of positive and negative examples into a
prototype vector representing the user profile. Training examples are Web pages
on which a positive or negative feedback has been provided by the user;
• FILTERING COMPONENT—This module exploits the user profile to suggest
relevant items by matching the profile representation against that of items to
be recommended. The result is a binary or continuous relevance judgment
(computed using some similarity metrics [57]), the latter case resulting in a
ranked list of potentially interesting items. In the above-mentioned example, the
matching is realized by computing the cosine similarity between the prototype
vector and the item vectors, as sketched below.
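The following minimal Python sketch is a purely illustrative outline of how the three components could interact; the class names, the naive tokenization, and the toy data are assumptions introduced only for this example, not part of any existing system.

```python
# Minimal, purely illustrative sketch of the three-component architecture.
# All class names and the toy data are assumptions, not an existing library API.
from collections import Counter
from math import sqrt


class ContentAnalyzer:
    """Turns raw item text into a structured (bag-of-words) representation."""
    def analyze(self, text):
        tokens = text.lower().split()          # naive tokenization, no stemming
        return Counter(tokens)                 # term -> frequency


class ProfileLearner:
    """Builds a prototype profile by accumulating liked/disliked item vectors."""
    def learn(self, rated_items):
        profile = Counter()
        for vector, rating in rated_items:     # rating: +1 (like) / -1 (dislike)
            for term, freq in vector.items():
                profile[term] += rating * freq
        return profile


class FilteringComponent:
    """Ranks new items by cosine similarity with the user profile."""
    def recommend(self, profile, new_items, top_n=2):
        scored = [(self._cosine(profile, v), name) for name, v in new_items]
        return sorted(scored, reverse=True)[:top_n]

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0


# Toy usage: two rated items train the profile, two new items are ranked.
analyzer = ContentAnalyzer()
rated = [(analyzer.analyze("space opera with epic battles"), +1),
         (analyzer.analyze("romantic comedy in paris"), -1)]
profile = ProfileLearner().learn(rated)
new_items = [("item1", analyzer.analyze("epic space battles saga")),
             ("item2", analyzer.analyze("comedy about paris lovers"))]
print(FilteringComponent().recommend(profile, new_items))
```

In a real system, the CONTENT ANALYZER would apply the feature extraction and weighting techniques described in the next sections rather than raw term counts.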
The first step of the recommendation process is the one performed by the
CONTENT ANALYZER, which usually borrows techniques from Information Retrieval
systems [6, 118]. Item descriptions coming from the Information Source are processed
by the CONTENT ANALYZER, which extracts features (keywords, n-grams, concepts,
. . . ) from unstructured text to produce a structured item representation, stored in the
repository Represented Items.
In order to construct and update the profile of the active user $u_a$ (the user for
whom recommendations must be provided), her reactions to items are collected in
some way and recorded in the repository Feedback. These reactions, called
annotations [51] or feedback, together with the related item descriptions, are exploited
during the process of learning a model useful to predict the actual relevance of
newly presented items. Users can also explicitly define their areas of interest as an
initial profile without providing any feedback. Typically, it is possible to distinguish
between two kinds of relevance feedback: positive information (inferring features
liked by the user) and negative information (i.e., inferring features the user is
not interested in [58]). Two different techniques can be adopted for recording
the user's feedback. When a system requires the user to explicitly evaluate items, this
technique is usually referred to as “explicit feedback”; the other technique, called
“implicit feedback”, does not require any active user involvement, in the sense
that feedback is derived from monitoring and analyzing the user's activities. Explicit
evaluations indicate how relevant or interesting an item is to the user [111]. Explicit
feedback has the advantage of simplicity, albeit the adoption of numeric/symbolic
scales increases the cognitive load on the user and may not be adequate for capturing
the user's feelings about items. Implicit feedback methods are based on assigning a
relevance score to specific user actions on an item, such as saving, discarding,
printing, bookmarking, etc. The main advantage is that they do not require direct
user involvement, even though bias is likely to occur, e.g., due to interruptions such
as a phone call while reading.
In order to build the profile of the active user $u_a$, the training set $TR_a$ for $u_a$ must
be defined. $TR_a$ is a set of pairs $\langle I_k, r_k \rangle$, where $r_k$ is the rating provided by $u_a$ on the
item representation $I_k$. Given a set of item representations labeled with ratings, the
PROFILE LEARNER applies supervised learning algorithms to generate a predictive
model—the user profile—which is usually stored in a profile repository for later
use by the FILTERING COMPONENT. After the user profile has been learned, the
FILTERING COMPONENT predicts whether a new item is likely to be of interest
for the active user, by comparing features in the item representation to those in the
representation of user preferences (stored in the user profile).
User tastes usually change over time; therefore, up-to-date information must be
maintained and provided to the PROFILE LEARNER in order to automatically
update the user profile. Further feedback is gathered on generated recommendations
by letting users state their satisfaction or dissatisfaction with items in the
recommendation list $L_a$. After
gathering that feedback, the learning process is performed again on the new training
set, and the resulting profile is adapted to the updated user interests. The iteration
of the feedback-learning cycle over time enables the system to take into account the
dynamic nature of user preferences.
4.2.1 Keyword-Based Vector Space Model
Most content-based recommender systems use relatively simple retrieval models,
such as keyword matching or the Vector Space Model (VSM). VSM is a spatial
representation of text documents. In that model, each document is represented by a
vector in an n-dimensional space, where each dimension corresponds to a term from
the overall vocabulary of a given document collection.
Formally, every document is represented as a vector of term weights, where
each weight indicates the degree of association between the document and the
term. Let $D = \{d_1, d_2, \ldots, d_N\}$ denote a set of documents, or corpus, and let
$T = \{t_1, t_2, \ldots, t_n\}$ be the dictionary, that is to say the set of words in the corpus.
$T$ is obtained by applying some standard natural language processing operations,
such as tokenization, stopwords removal, and stemming [6]. Each document $d_j$ is
represented as a vector in an $n$-dimensional vector space, so $\vec{d_j} = \langle w_{1j}, w_{2j}, \ldots, w_{nj} \rangle$,
where $w_{kj}$ is the weight for term $t_k$ in document $d_j$.
Document representation in the VSM raises two issues: weighting the terms and
measuring the feature vector similarity. The most commonly used term weighting
scheme, TF-IDF (Term Frequency-Inverse Document Frequency) weighting, is
based on empirical observations regarding text [117]:
• rare terms are not less relevant than frequent terms (IDF assumption);
• multiple occurrences of a term in a document are not less relevant than single
occurrences (TF assumption);
• long documents are not preferred to short documents (normalization
assumption).
In other words, terms that occur frequently in one document (TF=term-
frequency), but rarely in the rest of the corpus (IDF=inverse-document-frequency),
are more likely to be relevant to the topic of the document. In addition, normalizing
the resulting weight vectors prevents longer documents from having a better chance
of retrieval. These assumptions are well exemplified by the TF-IDF function:
$$\text{TF-IDF}(t_k, d_j) = \underbrace{\text{TF}(t_k, d_j)}_{\text{TF}} \cdot \underbrace{\log \frac{N}{n_k}}_{\text{IDF}} \tag{4.1}$$
where $N$ denotes the number of documents in the corpus, and $n_k$ denotes the number
of documents in the collection in which the term $t_k$ occurs at least once.
$$\text{TF}(t_k, d_j) = \frac{f_{k,j}}{\max_z f_{z,j}} \tag{4.2}$$
where the maximum is computed over the frequencies $f_{z,j}$ of all terms $t_z$ that
occur in document $d_j$. In order for the weights to fall in the $[0,1]$ interval and for
the documents to be represented by vectors of equal length, the weights obtained by
Eq. (4.1) are usually normalized by cosine normalization:
$$w_{k,j} = \frac{\text{TF-IDF}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \text{TF-IDF}(t_s, d_j)^2}} \tag{4.3}$$
which enforces the normalization assumption.
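As a concrete illustration of Eqs. (4.1)–(4.3), the following Python sketch computes cosine-normalized TF-IDF weights over a tiny invented corpus; the corpus and the function names are assumptions made only for this example.

```python
# Illustrative TF-IDF weighting following Eqs. (4.1)-(4.3).
# The toy corpus and the function names are assumptions made for this sketch.
from math import log, sqrt

corpus = [
    ["the", "matrix", "science", "fiction", "movie"],
    ["romantic", "movie", "set", "in", "paris"],
    ["science", "fiction", "saga", "about", "space"],
]

N = len(corpus)
vocabulary = sorted({t for doc in corpus for t in doc})
# n_k: number of documents in which term t_k occurs at least once
doc_freq = {t: sum(1 for doc in corpus if t in doc) for t in vocabulary}

def tf(term, doc):
    # Eq. (4.2): term frequency normalized by the most frequent term in the document
    counts = {t: doc.count(t) for t in set(doc)}
    return counts.get(term, 0) / max(counts.values())

def tf_idf(term, doc):
    # Eq. (4.1): TF times log(N / n_k)
    return tf(term, doc) * log(N / doc_freq[term])

def weights(doc):
    # Eq. (4.3): cosine-normalized TF-IDF vector over the whole vocabulary
    raw = [tf_idf(t, doc) for t in vocabulary]
    norm = sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

for i, doc in enumerate(corpus):
    print(f"d{i + 1}", [round(w, 2) for w in weights(doc)])
```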
As stated earlier, a similarity measure is required to determine the closeness
between two documents. Many similarity measures have been derived to describe
the proximity of two vectors; among those measures, cosine similarity is the most
widely used:
$$sim(d_i, d_j) = \frac{\sum_k w_{ki} \cdot w_{kj}}{\sqrt{\sum_k w_{ki}^2} \cdot \sqrt{\sum_k w_{kj}^2}} \tag{4.4}$$
In content-based recommender systems relying on VSM, both user profiles and
items are represented as weighted term vectors. Predictions of a user’s interest in a
particular item can be derived by computing the cosine similarity.
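A minimal sketch of this matching step is given below; the profile and item vectors are assumed to be TF-IDF weights over a shared, hypothetical vocabulary, and Eq. (4.4) is used to rank the items.

```python
# Illustrative ranking of items by cosine similarity with a user profile, Eq. (4.4).
# The profile and item vectors are assumed TF-IDF weights over a shared vocabulary.
from math import sqrt

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Hypothetical vectors over the vocabulary [science, fiction, romance, paris, space]
user_profile = [0.8, 0.7, 0.1, 0.0, 0.5]
items = {
    "item_a": [0.6, 0.6, 0.0, 0.0, 0.5],   # science-fiction item
    "item_b": [0.0, 0.1, 0.9, 0.7, 0.0],   # romance item
}

ranking = sorted(items, key=lambda i: cosine(user_profile, items[i]), reverse=True)
print([(i, round(cosine(user_profile, items[i]), 2)) for i in ranking])
```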
4.2.2 Methods for Learning User Profiles
Machine learning techniques generally used in the task of inducing content-based
profiles are well suited for text categorization [119]. In a machine learning
approach to text categorization, an inductive process automatically builds a text
classifier from a set of training documents, i.e. documents labeled with the
categories they belong to.
The problem of learning user profiles can be cast as a binary text categorization
task: each document has to be classified as interesting or not with respect to the
user preferences. Therefore, the set of categories is $C = \{c_+, c_-\}$, where $c_+$ is
the positive class (user-likes) and $c_-$ the negative one (user-dislikes). Classifiers
can also be adopted with a set of categories that is not binary. Besides the use
of classifiers, other machine learning algorithms, such as linear regression, can be
adopted to predict numerical ratings. The most used learning algorithms in content-
based recommender systems are based on probabilistic methods, relevance feedback
and k-nearest neighbors [6].
4.2.2.1 Probabilistic Methods
Naïve Bayes is a probabilistic approach to inductive learning, and belongs to the
general class of Bayesian classifiers. These approaches generate a probabilistic
model based on previously observed data. The model estimates the a posteriori
probability, $P(c|d)$, of document $d$ belonging to class $c$. This estimation is based
on the a priori probability, $P(c)$, the probability of observing a document in class
$c$, $P(d|c)$, the probability of observing the document $d$ given $c$, and $P(d)$, the
probability of observing the instance $d$. Using these probabilities, the Bayes theorem
is applied to calculate $P(c|d)$:
$$P(c|d) = \frac{P(c)\,P(d|c)}{P(d)} \tag{4.5}$$
To classify the document $d$, the class with the highest probability is chosen:
$$c = \operatorname*{argmax}_{c_j} \frac{P(c_j)\,P(d|c_j)}{P(d)}$$
$P(d)$ is generally omitted, as it is equal for all $c_j$. As we do not know the values
of $P(d|c)$ and $P(c)$, we estimate them by observing the training data. However,
estimating $P(d|c)$ in this way is problematic, as it is very unlikely to see the
same document more than once: the observed data is generally not enough to
be able to generate good probabilities. The naïve Bayes classifier overcomes this
problem by simplifying the model through the independence assumption: all the
words or tokens in the observed document d are conditionally independent of each
other given the class. Individual probabilities for the words in a document are
estimated one by one, rather than for the complete document as a whole. The conditional
independence assumption is clearly violated in real-world data, however, despite
these violations, empirically the naïve Bayes classifier does a good job in classifying
text documents [12, 70].
There are two commonly used working models of the naïve Bayes classifier,
the multivariate Bernoulli event model and the multinomial event model [77]. Both
models treat a document as a vector of values over the corpus vocabulary, V, where
each entry in the vector represents whether a word occurred in the document, hence
both models lose information about word order. The multivariate Bernoulli event
model encodes each word as a binary attribute, i.e., whether a word appeared or not,
while the multinomial event model counts how many times the word appeared in
the document. Empirically, the multinomial naïve Bayes formulation was shown to
outperform the multivariate Bernoulli model. This effect is particularly noticeable
for large vocabularies [77]. The way the multinomial event model uses its document
vector to calculate $P(c_j|d_i)$ is as follows:
$$P(c_j|d_i) = P(c_j) \prod_{t_k \in V_{d_i}} P(t_k|c_j)^{N(d_i, t_k)} \tag{4.6}$$
where $N(d_i, t_k)$ is defined as the number of times the word or token $t_k$ appears in
document $d_i$. Notice that, rather than taking the product over all the words in the
corpus vocabulary $V$, only the subset $V_{d_i}$ of the vocabulary, containing the words
that appear in document $d_i$, is used. A key step in implementing naïve Bayes
is estimating the word probabilities $P(t_k|c_j)$. To make the probability estimates
more robust with respect to infrequently encountered words, a smoothing method
is used to modify the probabilities that would have been obtained by simple event
counting. One important effect of smoothing is that it avoids assigning probability
values equal to zero to words not occurring in the training data for a particular
class. A rather simple smoothing method relies on the common Laplace estimates
(i.e., adding one to all the word counts for a class). A more interesting method is
Witten-Bell [129].
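To make Eq. (4.6) and Laplace smoothing concrete, the following self-contained Python sketch classifies a document as user-likes or user-dislikes; the training data are invented, and log-probabilities are used to avoid numerical underflow, a standard implementation detail not discussed above.

```python
# Illustrative multinomial naive Bayes with Laplace (add-one) smoothing, Eq. (4.6).
# Training data are invented; log-probabilities avoid numeric underflow.
from collections import Counter, defaultdict
from math import log

training = [
    ("space opera with epic space battles", "likes"),
    ("deep space exploration documentary", "likes"),
    ("romantic comedy set in paris", "dislikes"),
    ("tearful romance about lost love", "dislikes"),
]

# Estimate P(c) from class frequencies and collect per-class word counts
class_docs = Counter(c for _, c in training)
word_counts = defaultdict(Counter)
for text, c in training:
    word_counts[c].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for c in class_docs:
        score = log(class_docs[c] / len(training))   # log P(c)
        total = sum(word_counts[c].values())
        for w in text.split():
            if w not in vocabulary:
                continue  # ignore out-of-vocabulary words
            # Laplace-smoothed estimate of P(t_k | c_j)
            p = (word_counts[c][w] + 1) / (total + len(vocabulary))
            score += log(p)   # repeated words contribute the exponent N(d_i, t_k)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("epic space battles saga"))   # expected: likes
print(predict("romance in paris"))          # expected: dislikes
```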
Although the performance of naïve Bayes is not as good as that of some other
statistical learning methods, such as nearest-neighbor classifiers or support vector
machines, it has been shown that it can perform surprisingly well in classification
tasks where the computed probability is not important [40]. Another advantage of the
naïve Bayes approach is that it is very efficient and easy to implement compared to
other learning methods.
4.2.2.2 Relevance Feedback
Relevance feedback is a technique adopted in Information Retrieval that helps users
to incrementally refine queries based on previous search results. It consists of the
users feeding back into the system decisions on the relevance of retrieved documents
with respect to their information needs.
Relevance feedback and its adaptation to text categorization, the well-known
Rocchio’s formula [113], are commonly adopted by content-based recommender
systems. The general principle is to let users rate documents suggested by the
recommender system with respect to their information need. This form of feedback
can subsequently be used to incrementally refine the user profile or to train the
learning algorithm that infers the user profile as a classifier. Some linear classifiers
consist of an explicit profile (or prototypical document) of the category [119].
Rocchio's method is used for inducing linear, profile-style classifiers. This algorithm
represents documents as vectors, so that documents with similar content have similar
vectors. Each component of such a vector corresponds to a term in the document,
typically a word. The weight of each component is computed using the TF-IDF
term weighting scheme. Learning is achieved by combining document vectors (of
positive and negative examples) into a prototype vector for each class in the set
of classes $C$. To classify a new document $d$, the similarity between each prototype
vector and the document vector representing $d$ is computed for each class (for example,
by using the cosine similarity measure); then $d$ is assigned to the class whose prototype
vector has the highest similarity value.
More formally, Rocchio's method computes a classifier $\vec{c_i} = \langle \omega_{1i}, \ldots, \omega_{|T|i} \rangle$ for
the category $c_i$ ($T$ is the vocabulary, that is, the set of distinct terms in the training
set) by means of the formula:
$$\omega_{ki} = \beta \cdot \sum_{d_j \in POS_i} \frac{w_{kj}}{|POS_i|} - \gamma \cdot \sum_{d_j \in NEG_i} \frac{w_{kj}}{|NEG_i|} \tag{4.7}$$
where $w_{kj}$ is the TF-IDF weight of the term $t_k$ in document $d_j$, $POS_i$ and $NEG_i$ are
the sets of positive and negative examples in the training set for the specific class $c_i$,
and $\beta$ and $\gamma$ are control parameters that allow setting the relative importance of all positive
and negative examples. To assign a class $\tilde{c}$ to a document $d_j$, the similarity between
each prototype vector $\vec{c_i}$ and the document vector $\vec{d_j}$ is computed, and $\tilde{c}$ will be the
$c_i$ with the highest value of similarity. The Rocchio-based classification approach
does not have any theoretic underpinning, and there are no guarantees on performance
or convergence [108].
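The sketch below shows how the prototype vector of Eq. (4.7) could be computed for a single class from pre-computed TF-IDF vectors, and how a new document could then be scored against it with cosine similarity; the beta and gamma values and the toy vectors are arbitrary assumptions.

```python
# Illustrative Rocchio prototype computation, Eq. (4.7), for one class.
# Vectors are assumed pre-computed TF-IDF weights; beta/gamma values are arbitrary.
from math import sqrt

def rocchio_prototype(pos, neg, beta=16.0, gamma=4.0):
    dim = len(pos[0])
    proto = [0.0] * dim
    for v in pos:                      # contribution of positive examples
        for k in range(dim):
            proto[k] += beta * v[k] / len(pos)
    for v in neg:                      # contribution of negative examples
        for k in range(dim):
            proto[k] -= gamma * v[k] / len(neg)
    return proto

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy TF-IDF vectors of liked (POS) and disliked (NEG) documents
POS = [[0.7, 0.5, 0.0, 0.1], [0.6, 0.6, 0.1, 0.0]]
NEG = [[0.0, 0.1, 0.8, 0.6]]

prototype = rocchio_prototype(POS, NEG)
new_doc = [0.5, 0.4, 0.1, 0.0]
print("similarity to the 'likes' prototype:", round(cosine(prototype, new_doc), 2))
```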
4.2.2.3 Nearest Neighbors
Nearest neighbor algorithms, also called lazy learners, simply store training data in
memory, and classify a new unseen item by comparing it to all stored items by using
a similarity function. The “nearest neighbor” or the “k-nearest neighbors” items are
determined, and the class label for the unclassified item is derived from the class
labels of the nearest neighbors. A similarity function is needed; for example, the
cosine similarity measure is adopted when items are represented using the VSM.
Nearest neighbor algorithms are quite effective, albeit their most important drawback
is inefficiency at classification time: since they do not have a true training phase,
they defer all the computation to the moment a new item has to be classified.
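A minimal k-nearest-neighbors sketch is shown below; the stored vectors, their labels, the value of k, and the majority-vote rule are toy assumptions for illustration.

```python
# Illustrative k-nearest-neighbors classification with cosine similarity.
# Training vectors, labels and k are toy assumptions for this sketch.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# "Lazy learner": training simply stores the labeled vectors in memory
training = [
    ([0.7, 0.5, 0.0, 0.1], "likes"),
    ([0.6, 0.6, 0.1, 0.0], "likes"),
    ([0.0, 0.1, 0.8, 0.6], "dislikes"),
    ([0.1, 0.0, 0.7, 0.7], "dislikes"),
]

def knn_classify(item, k=3):
    # Rank stored examples by similarity to the unseen item
    neighbors = sorted(training, key=lambda ex: cosine(item, ex[0]), reverse=True)[:k]
    # Majority vote over the labels of the k nearest neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify([0.5, 0.4, 0.1, 0.0]))   # expected: likes
```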
4.2.3 Advantages and Drawbacks of Content-Based Filtering
The adoption of the content-based recommendation paradigm has several advan-
tages when compared to the collaborative one:
• USER INDEPENDENCE—Content-based recommenders exploit solely ratings
provided by the active user to build her own profile. Instead, collaborative
filtering methods need ratings from other users in order to find the “nearest
neighbors” of the active user, i.e., users that have similar tastes since they
rated the same items similarly. Then, only the items that are most liked by the
neighbors of the active user will be recommended;
• TRANSPARENCY—Explanations on how the recommender system works can be
provided by explicitly listing content features or descriptions that caused an item
to occur in the list of recommendations. Those features are indicators to consult
in order to decide whether to trust a recommendation. Conversely, collaborative
systems are black boxes since the only explanation for an item recommendation
is that unknown users with similar tastes liked that item;
• NEW ITEM—Content-based recommenders are capable of recommending items
not yet rated by any user. As a consequence, they do not suffer from the first-rater
problem, which affects collaborative recommenders that rely solely on users'
preferences to make recommendations: until a new item is rated by a substantial
number of users, a collaborative system is not able to recommend it.
Nonetheless, content-based systems have several shortcomings:
• LIMITED CONTENT ANALYSIS—Content-based techniques have a natural limit
in the number and type of features that are associated, whether automatically
or manually, with the objects they recommend. Domain knowledge is often
required, e.g., for movie recommendations the system needs to know the actors
and directors, and sometimes domain ontologies are also needed. No content-
based recommendation system can provide suitable suggestions if the analyzed
content does not contain enough information to discriminate items the user
likes from items the user does not like. Some representations capture only
certain aspects of the content, but there are many others that would influence
a user's experience. For instance, often there is not enough information in the
word frequency to model the user's interest in jokes or poems, and techniques
from affective computing would be more appropriate. Again, for Web pages,
feature extraction techniques from text completely ignore aesthetic qualities
and additional multimedia information. Furthermore, CBRSs based on a string
matching approach suffer from problems of:
– POLYSEMY, the presence of multiple meanings for one word;
– SYNONYMY, multiple words with the same meaning;
– MULTI-WORD EXPRESSIONS, the difficulty of assigning the correct properties to
a sequence of two or more words whose properties are not predictable from
the properties of the individual words;
– ENTITY IDENTIFICATION or NAMED ENTITY RECOGNITION, the difficulty of
locating and classifying elements in text into pre-defined categories such as the
names of persons, organizations, locations, expressions of times, quantities,
monetary values, etc.
– ENTITY LINKING or NAMED ENTITY DISAMBIGUATION, the difficulty of
determining the identity (often called the reference) of entities mentioned in
text.
• OVER-SPECIALIZATION—Content-based recommenders have no inherent
method for finding something unexpected. The system suggests items whose
scores are high when matched against the user profile, hence the user is going
to be recommended items similar to those already rated. This drawback is also
called the lack-of-serendipity problem, to highlight the tendency of content-based
systems to produce recommendations with a limited degree of novelty. To give
an example, when a user has only rated movies directed by Stanley Kubrick,
she will be recommended just that kind of movie. A “perfect” content-based
technique would rarely find anything novel, limiting the range of applications for
which it would be useful.
• NEW USER—Enough ratings have to be collected before a content-based rec-
ommender system can really understand user preferences and provide accurate
recommendations. Therefore, when few ratings are available, as for a new user,
the system will not be able to provide reliable recommendations.