CONTENT-BASED RECOMMENDER SYSTEMS
Systems implementing a content-based recommendation approach analyze a set of documents
and/or descriptions of items previously rated by a user, and build a model or profile of user
interests based on the features of the objects rated by that user.
The profile is a structured representation of user interests, which is used to recommend new
interesting items. The recommendation process basically consists of matching the attributes
of the user profile against the attributes of a content object. The result is a relevance
judgment that represents the user's level of interest in that object.
If a profile accurately reflects user preferences, it is of tremendous advantage for the
effectiveness of an information access process. For instance, it could be used to filter search
results by deciding whether a user is interested in a specific Web page or not and, in the
negative case, preventing it from being displayed.
High Level Architecture of Content-based Systems
Content-based Information Filtering (IF) systems need proper techniques for representing the
items and producing the user profile, and some strategies for comparing the user profile with
the item representation. The high-level architecture of a content-based recommender system
is depicted in the following figure. The recommendation process is performed in three steps,
each of which is handled by a separate component:
CONTENT ANALYZER – When information has no structure (e.g., text), some kind of
preprocessing step is needed to extract structured relevant information. The main responsibility
of the component is to represent the content of items (e.g. documents, Web pages, news,
product descriptions, etc.) coming from information sources in a form suitable for the next
processing steps. Data items are analyzed by feature extraction techniques in order to shift
item representation from the original information space to the target one (e.g., Web pages
represented as keyword vectors). This representation is the input to the PROFILE LEARNER
and FILTERING COMPONENT;
PROFILE LEARNER – This module collects data representative of the user preferences
and tries to generalize this data, in order to construct the user profile. Usually, the
generalization strategy is realized through machine learning techniques, which are able to
infer a model of user interests starting from items liked or disliked in the past. For instance,
the PROFILE LEARNER of a Web page recommender can implement a relevance feedback
method in which the learning technique combines vectors of positive and negative examples
into a prototype vector representing the user profile. Training examples are Web pages on
which a positive or negative feedback has been provided by the user;
FILTERING COMPONENT – This module exploits the user profile to suggest relevant
items by matching the profile representation against that of items to be recommended. The
result is a binary or continuous relevance judgment (computed using some similarity
metrics), the latter case resulting in a ranked list of potentially interesting items. In the above
mentioned example, the matching is realized by computing the cosine similarity between the
prototype vector and the item vectors.
The first step of the recommendation process is performed by the CONTENT ANALYZER,
which usually borrows techniques from Information Retrieval systems. Item descriptions
coming from the Information Source are processed by the CONTENT ANALYZER, which
extracts features (keywords, n-grams, concepts, . . .) from unstructured text to produce a
structured item representation, stored in the repository Represented Items.
In order to construct and update the profile of the active user ua (the user for whom
recommendations must be provided), her reactions to items are collected in some way and
recorded in the repository Feedback. These reactions, called annotations or feedback,
together with the related item descriptions, are exploited during the process of learning a
model useful to predict the actual relevance of newly presented items. Users can also
explicitly define their areas of interest as an initial profile without providing any feedback.
Advantages and Drawbacks of Content-based Filtering
Advantages :
The adoption of the content-based recommendation paradigm has several advantages when
compared to the collaborative one:
USER INDEPENDENCE - Content-based recommenders exploit solely ratings provided by
the active user to build her own profile. Instead, collaborative filtering methods need ratings
from other users in order to find the “nearest neighbours” of the active user, i.e., users that
have similar tastes since they rated the same items similarly. Then, only the items that are
most liked by the neighbours of the active user will be recommended;
TRANSPARENCY - Explanations on how the recommender system works can be provided
by explicitly listing content features or descriptions that caused an item to occur in the list of
recommendations. Those features are indicators to consult in order to decide whether to trust
a recommendation. Conversely, collaborative systems are black boxes since the only
explanation for an item recommendation is that unknown users with similar tastes liked that
item;
NEW ITEM - Content-based recommenders are capable of recommending items not yet
rated by any user. As a consequence, they do not suffer from the first-rater problem, which
affects collaborative recommenders that rely solely on users' preferences to make
recommendations: until a new item has been rated by a substantial number of users, a
collaborative system is not able to recommend it.
Drawbacks:
LIMITED CONTENT ANALYSIS - Content-based techniques have a natural limit in the
number and type of features that can be associated, automatically or manually, with the items
they recommend. Domain knowledge is often needed; for example, for movie
recommendations the system needs to know the actors and directors, and sometimes domain
ontologies are also needed.
No content-based recommendation system can provide suitable suggestions if the analysed
content does not contain enough information to discriminate items the user likes from items
the user does not like.
Some representations capture only certain aspects of the content, but there are many others
that would influence a user’s experience.
For instance, often there is not enough information in the word frequency to model the user
interests in jokes or poems, while techniques for affective computing would be most
appropriate. Again, for Web pages, feature extraction techniques from text completely ignore
aesthetic qualities and additional multimedia information.
To sum up, neither automatic nor manual assignment of features to items may be sufficient
to define the distinguishing aspects of items that turn out to be necessary for the elicitation of
user interests.
OVER-SPECIALIZATION - Content-based recommenders have no inherent method for
finding something unexpected. The system suggests items whose scores are high when
matched against the user profile, hence the user is going to be recommended items similar to
those already rated. This drawback is also called serendipity problem to highlight the
tendency of the content-based systems to produce recommendations with a limited degree of
novelty. To give an example, when a user has only rated movies directed by Stanley Kubrick,
she will be recommended only that kind of movie. A “perfect” content-based technique
would rarely find anything novel, limiting the range of applications for which it would be
useful.
NEW USER - Enough ratings have to be collected before a content-based recommender
system can really understand user preferences and provide accurate recommendations.
Therefore, when few ratings are available, as for a new user, the system will not be able to
provide reliable recommendations.
CONTENT REPRESENTATION AND CONTENT SIMILARITY
The simplest way to describe catalogue items is to maintain an explicit list of features for
each item (also often called attributes, characteristics, or item profiles).
Book Knowledge Base
Vector-Space Model :
Term Frequency and Inverse Document Frequency:
• Simple keyword representation has its problems
• in particular when automatically extracted:
• not every word has similar importance
• longer documents have a higher chance to have an overlap with the
user profile
• Standard measure: TF-IDF
• Encodes text documents in multi-dimensional Euclidean space
• weighted term vector
• TF: measures how often a term appears in a document (term density)
• assuming that important terms appear more often
• normalization has to be done in order to take document length into
account
• IDF: Aims to reduce the weight of terms that appear in all documents
TF-IDF calculation:
Term frequency describes how often a certain term appears in a document (assuming that
important words appear more often).
We search for the normalized term frequency value TF(i, j) of keyword i in document j. Let
freq(i, j) be the absolute number of occurrences of i in j. Given a keyword i, let
OtherKeywords(i, j) denote the set of the other keywords appearing in j, and compute the
maximum frequency maxOthers(i, j) as max(freq(z, j)), z ∈ OtherKeywords(i, j). Finally,
calculate TF(i, j) as

TF(i, j) = freq(i, j) / maxOthers(i, j)
Inverse document frequency is the second measure that is combined with term frequency. It
aims at reducing the weight of keywords that appear very often in all documents. The idea is
that those generally frequent words are not very helpful to discriminate among documents,
and more weight should therefore be given to words that appear in only a few documents.
Let N be the number of all recommendable documents and n(i) be the number of documents
from N in which keyword i appears. The inverse document frequency for i is typically
calculated as

IDF(i) = log(N / n(i))
The combined TF-IDF weight for a keyword i in document j is computed as the product of
these two measures:

TF-IDF(i, j) = TF(i, j) · IDF(i)
In the TF-IDF model, the document is, therefore, represented not as a vector of Boolean
values for each keyword but as a vector of the computed TF-IDF weights.
Example TF-IDF representation:
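The example table itself is not reproduced here. As a stand-in, the following minimal Python
sketch computes the normalized TF, the IDF, and the combined TF-IDF weight exactly as
defined above; the three-document corpus and its keywords are purely illustrative.

import math
from collections import Counter

# Hypothetical mini-corpus: each document is a list of extracted keywords.
docs = {
    "doc1": ["recommender", "systems", "use", "content", "features"],
    "doc2": ["collaborative", "filtering", "uses", "ratings"],
    "doc3": ["content", "features", "describe", "items", "content"],
}

def tf(keyword, doc):
    # Normalized term frequency: freq(i, j) / maxOthers(i, j)
    counts = Counter(doc)
    others = [counts[z] for z in counts if z != keyword]
    return counts[keyword] / (max(others) if others else 1)

def idf(keyword, all_docs):
    # Inverse document frequency: log(N / n(i))
    n_i = sum(1 for d in all_docs.values() if keyword in d)
    return math.log(len(all_docs) / n_i) if n_i else 0.0

def tf_idf(keyword, doc_id, all_docs):
    return tf(keyword, all_docs[doc_id]) * idf(keyword, all_docs)

print(round(tf_idf("content", "doc3", docs), 3))   # 2.0 * log(3/2) ~= 0.811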
Improving the vector space model:
i. Stop words and stemming:
A straightforward method is to remove so-called stop words. In the English
language these are, for instance, prepositions and articles such as “a”, “the”, or
“on”, which can be removed from the document vectors because they will appear
in nearly all documents.
Another commonly used technique is called stemming or conflation, which aims
to replace variants of the same word by their common stem (root word). The word
“stemming” would, for instance, be replaced by “stem”, “went” by “go”, and so
forth (a small preprocessing sketch follows this list).
ii. Size cutoffs:
Another straightforward method to reduce the size of the document representation
and hopefully remove “noise” from the data is to use only the n most informative
words.
iii. Phrases:
A further possible improvement with respect to representation accuracy is to use
“phrases as terms”, which are more descriptive for a text than single words alone.
Phrases, or composed words such as “United Nations”, can be encoded as
additional dimensions in the vector space.
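The following is a minimal sketch of the stop-word removal and stemming step from item i
above. The tiny stop-word list and the naive suffix-stripping rules are illustrative only; a real
system would use a complete stop-word list and a proper stemmer such as the Porter stemmer.

# Illustrative stop-word list; real lists contain a few hundred words.
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "and", "to"}

def naive_stem(word):
    # Crude stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ming", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("The stemming of words on a page"))   # ['stem', 'word', 'page']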
Limitations:
The described approach of extracting and weighting individual keywords from
the text has another important limitation: it does not take into account the context
of the keyword and, in some cases, may not capture the “meaning” of the
description correctly.
SIMILARITY-BASED RETRIEVAL
Whereas the item selection problem in collaborative filtering can be described as “recommend
items that similar users liked”, content-based recommendation is commonly described as
“recommend items that are similar to those the user liked in the past”.
i. Nearest neighbors:
The prediction for a not-yet-seen item d is based on letting the k most similar items for
which a rating exists “vote” for d. If, for instance, four out of k = 5 of the most similar
items were liked by the current user, the system may guess that the chance that d will also
be liked is relatively high. Besides varying the neighborhood size k, several other
variations are possible, such as binarization of ratings, using a minimum similarity
threshold, or weighting of the votes based on the degree of similarity.
The kNN method has been implemented, for example, as part of a multi-strategy user-profiling
technique in which the system maintains profiles of short-term (ephemeral) and long-term
interests. The short-term profile, as described earlier, allows the system to provide the user
with information on topics of recent interest. The long-term model collects information over a
longer period of time (e.g., several months) and also seeks to identify the most informative
words in the documents by determining the terms that consistently receive high TF-IDF scores
in a larger document collection.
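A minimal sketch of this kNN voting scheme, assuming the rated items and the not-yet-seen
item are represented as sparse TF-IDF keyword vectors stored in Python dicts and compared
with cosine similarity; the data structures and function names are illustrative, not taken from a
particular library.

import math

def cosine(a, b):
    # Cosine similarity between two sparse keyword-weight vectors (dicts).
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_like_probability(unseen, rated_items, k=5):
    # rated_items: list of (vector, liked) pairs rated by the current user.
    # The k most similar rated items "vote"; the result is the fraction of
    # "like" votes, i.e., the estimated chance that the unseen item is liked.
    if not rated_items:
        return 0.0
    neighbours = sorted(rated_items,
                        key=lambda pair: cosine(unseen, pair[0]),
                        reverse=True)[:k]
    return sum(1 for _, liked in neighbours if liked) / len(neighbours)

Variations such as binarized ratings, a minimum similarity threshold, or similarity-weighted
votes can be realized by filtering or reweighting the neighbours list.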
ii. Relevance feedback – Rocchio’s method:
Another method that is based on the vector-space model and was developed in the context
of the pioneering information retrieval (IR) system in the late 1960s is Rocchio’s relevance
feedback method.
The relevance feedback loop used in this method helps the system improve and
automatically extend the query as follows. The main idea is to first split the already rated
documents into two groups, D+ and D−, of liked (interesting/relevant) and disliked
documents and to calculate a prototype (or average) vector for each of these categories. This
prototype can also be seen as a sort of centroid of a cluster for the relevant and nonrelevant
document sets. The current query Qi, which is represented as a multidimensional term
vector just like the documents, is then repeatedly refined to Qi+1 by a weighted addition
of the prototype vector of the relevant documents and a weighted subtraction of the vector
representing the nonrelevant documents. As an effect, the query vector should consistently
move toward the set of relevant documents, as depicted schematically in the following
figure.
The proposed formula for computing the modified query Qi+1 from Qi is defined as
follows:

Qi+1 = α · Qi + β · (1/|D+|) · Σ_{d+ ∈ D+} d+ − γ · (1/|D−|) · Σ_{d− ∈ D−} d−
The variables α, β, and γ are used to fine-tune the behavior of the “move” toward the more
relevant documents. The value of α describes how strongly the last (or original) query should
be weighted, and β and γ correspondingly capture how strongly positive and negative
feedback should be taken into account in the improvement step.
Average vectors for relevant and nonrelevant documents.
Relevance feedback: after feedback, the original query is moved toward the cluster of the
relevant documents.
Overall, the relevance feedback retrieval method and its variations are used in many
application domains. It has been shown that the method, despite its simplicity, can lead to
good retrieval improvements in real-world settings.
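The refinement step above can be sketched in a few lines of Python. Vectors are represented
as dicts of term weights; the default values for α, β, and γ below are common illustrative
choices rather than values prescribed by the method.

def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # One refinement step: Q_{i+1} = alpha*Q_i + beta*centroid(D+) - gamma*centroid(D-)
    def centroid(docs):
        c = {}
        for d in docs:
            for term, weight in d.items():
                c[term] = c.get(term, 0.0) + weight / len(docs)
        return c
    pos, neg = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(pos) | set(neg)
    return {t: alpha * query.get(t, 0.0)
               + beta * pos.get(t, 0.0)
               - gamma * neg.get(t, 0.0)
            for t in terms}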
Other text classification methods
Another way of deciding whether or not a document will be of interest to a user is to view the
problem as a classification task, in which the possible classes are “like” and “dislike”. Once
the content-based recommendation task has been formulated as a classification problem,
various standard (supervised) machine learning techniques can, in principle, be applied such
that an intelligent system can automatically decide whether a user will be interested in a
certain document. Supervised learning means that the algorithm relies on the existence
of training data, in our case a set of (manually labelled) document-class pairs.
i. Probabilistic methods :
The most prominent classification methods developed in early text classification systems are
probabilistic ones. These approaches are based on the naive Bayes assumption of conditional
independence (with respect to term occurrences) and have also been successfully deployed in
content-based recommenders.
Classification based on Boolean feature vector
The basic formula to compute the posterior probability for document classification is Bayes’
theorem:

P(Label | X) = P(X | Label) · P(Label) / P(X)

where, under the conditional independence assumption, the likelihood factorizes into a
product over the individual features: P(X | Label) = ∏ P(xi | Label).
Calculation:
To determine the correct class, we can compute the class-conditional probabilities for the
feature vector X of Document 6 again as follows:
P(X|Label=1) = P(recommender=1|Label=1) ×
P(intelligent=1|Label=1) ×
P(learning=0|Label=1) × P(school=0|Label=1)
= 3/3 × 2/3 × 1/3 × 2/3
≈ 0.148
The same can be done for the case Label = 0.
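The following short Python sketch reproduces this calculation. The four class-conditional
probabilities are taken directly from the example above (the underlying training table is not
reproduced in this text), and the product reflects the conditional independence assumption.

# Class-conditional probabilities for the feature vector X of Document 6,
# as given in the worked example (training table not reproduced here).
factors = [3/3,   # P(recommender=1 | Label=1)
           2/3,   # P(intelligent=1 | Label=1)
           1/3,   # P(learning=0   | Label=1)
           2/3]   # P(school=0     | Label=1)

p_x_given_like = 1.0
for p in factors:          # naive Bayes: the likelihood factorizes
    p_x_given_like *= p

print(round(p_x_given_like, 3))   # 0.148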
There are three advantages of this Bayesian classifier:
i. good accuracy;
ii. the components of the classifier can be easily updated when new data are available;
iii. the learning time complexity remains linear in the number of examples.
Other linear classifiers and machine learning
When viewing the content-based recommendation problem as a classification problem,
various other machine learning techniques can be employed. At a more abstract level, most
learning methods aim to find coefficients of a linear model to discriminate between relevant
and nonrelevant documents.
The following figure sketches the basic idea in a simplified setting in which the available
documents are characterized by only two dimensions. If there are only two dimensions, the
classifier can be represented by a line. The idea can, however, also easily be generalized to
the multidimensional space in which a two-class classifier then corresponds to a hyperplane
that represents the decision boundary.
A linear classifier in two-dimensional space.
In two-dimensional space, the line that we search for has the form w1x1 + w2x2 = b, where x1
and x2 correspond to the vector representation of a document (using, e.g., TF-IDF weights)
and w1, w2, and b are the parameters to be learned.
The classification of an individual document is based on checking whether, for a certain
document, w1x1 + w2x2 > b, which can be done very efficiently. In n-dimensional space, a
generalized equation using weight and feature vectors instead of only two values is used, so
the classification function becomes

w · x > b

where w is the learned weight vector and x is the feature vector of the document.
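A minimal sketch of this decision rule, with illustrative (hand-picked rather than learned)
parameter values:

import numpy as np

def classify(w, b, x):
    # Linear decision rule: "relevant" iff w . x > b, the n-dimensional
    # generalization of w1*x1 + w2*x2 > b.
    return np.dot(w, x) > b

w = np.array([0.8, 0.3])          # term weights (illustrative, not learned here)
b = 0.5                           # decision threshold (illustrative)
doc = np.array([0.9, 0.1])        # TF-IDF weights of a document
print(classify(w, b, doc))        # True -> document predicted as relevant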
COMPARE AND CONTRAST BETWEEN COLLABORATIVE AND
CONTENT-BASED RECOMMENDER SYSTEMS
Basis of Recommendation: Collaborative filtering is based on user-user or item-item
similarities computed from ratings/behavior; content-based filtering is based on item features
and user preferences.
Requires Item Metadata: Collaborative filtering does not; content-based filtering does.
User Profile: In collaborative filtering the profile is implicitly learned from user behavior
(e.g., rating history); in content-based filtering it is explicitly created from the features of
items the user interacted with.
Cold Start Problem (New User): Collaborative filtering finds it difficult to recommend for
new users; for content-based filtering the problem is less severe if item features are known.
Cold Start Problem (New Item): Collaborative filtering cannot recommend new items with no
interaction data; content-based filtering can, as long as item features are available.
Sparsity Issue: Collaborative filtering often suffers from sparse user-item matrices;
content-based filtering is less affected, as recommendations are based on content.
Scalability: Collaborative filtering may not scale well for large datasets; content-based
filtering is easier to scale using item features.
Personalization: Collaborative filtering is highly personalized through peer user behavior;
content-based filtering is highly personalized through the individual user's preferences.
Serendipity: Collaborative filtering is high, as it can recommend unexpected but relevant
items; content-based filtering may be lower, as it tends to recommend similar items.
Explainability: Collaborative filtering is hard to explain ("people like you liked this");
content-based filtering is easier to explain ("recommended because it has features you liked").
Example Use Case: Collaborative filtering powers MovieLens/Netflix recommendations based
on other users' ratings; content-based filtering powers Amazon's item suggestions based on
product features you viewed or liked.
KNOWLEDGE BASED RECOMMENDER SYSTEMS
Knowledge-based recommender systems help us to tackle the challenges imposed by both
collaborative and content-based recommender systems. The advantage of these systems is
that no ramp-up problems exist, because no rating data are needed for the calculation of
recommendations. Recommendations are calculated independently of individual user ratings:
either in the form of similarities between customer requirements and items or on the basis of
explicit recommendation rules.
Two basic types of knowledge-based recommender systems are constraint-based and case-
based systems. Both approaches are similar in terms of the recommendation process: the user
must specify the requirements, and the system tries to identify a solution. If no solution can
be found, the user must change the requirements. The system may also provide explanations
for the recommended items. These recommenders, however, differ in the way they use the
provided knowledge: case-based recommenders focus on the retrieval of similar items on the
basis of different types of similarity measures, whereas constraint-based recommenders rely
on an explicitly defined set of recommendation rules. In constraint-based systems, the set of
recommended items is determined by, for instance, searching for a set of items that fulfil the
recommendation rules. Case-based systems, on the other hand, use similarity metrics to
retrieve items that are similar (within a predefined threshold) to the specified customer
requirements.
Knowledge representation and reasoning
Knowledge-based systems rely on detailed knowledge about item characteristics. An example
for knowledge representation of a given product is given in the following table:
Example product assortment: digital cameras
The recommendation problem consists of selecting items from this catalog that match the
user’s needs, preferences, or hard requirements. The user’s requirements can, for instance, be
expressed in terms of desired values or value ranges for an item feature, such as “the price
should be lower than 300 €” or in terms of desired functionality, such as “the camera should
be suited for sports photography”.
CONSTRAINT BASED APPROACH
A classical constraint satisfaction problem (CSP) can be described by a tuple
(V, D, C) where
V is a set of variables,
D is a set of finite domains for these variables, and
C is a set of constraints that describes the combinations of values the variables
can simultaneously take.
A solution to a CSP corresponds to an assignment of a value to each variable
in V in a way that all constraints are satisfied.
Constraint-based recommender systems can build on this formalism and exploit a
recommender knowledge base that typically includes two different sets of variables (V = VC ∪
VPROD), one describing potential customer requirements and the other describing product
properties. Three different sets of constraints (C = CR ∪ CF ∪ CPROD) define which items
should be recommended to a customer in which situation. Examples of such variables and
constraints for a digital camera recommender are shown in the following table:
Example recommendation task (VC, VPROD, CR, CF, CPROD, REQ) and the corresponding
recommendation result (RES)
Customer properties (VC) describe the possible customer requirements. The customer
property max-price denotes the maximum price acceptable for the customer, the property
usage denotes the planned usage of photos (print versus digital organization), and
photography denotes the predominant type of photos to be taken; categories are, for example,
sports or portrait photos.
Product properties (VPROD) describe the properties of products in an assortment; for example,
mpix denotes possible resolutions of a digital camera.
Compatibility constraints (CR) define allowed instantiations of customer properties – for
example, if large-size photoprints are required, the maximal accepted price must be higher
than 200.
Filter conditions (CF) define under which conditions which products should be selected – in
other words, filter conditions define the relationships between customer properties and
product properties. An example filter condition is large-size photoprints require resolutions
greater than 5 mpix.
Product constraints (CPROD) define the currently available product assortment.
An example constraint defining such a product assortment is depicted in the above table.
Each conjunction in this constraint completely defines a product (item) – all product
properties have a defined value.
The task of identifying a set of products matching a customer’s wishes and needs is denoted
as a recommendation task. The customer requirements REQ can be encoded as unary
constraints over the variables in VC and VPROD – for example, max-price = 300.
Formally, each solution to the CSP (V = VC ∪ VPROD , D, C = CR ∪ CF ∪ CPROD ∪ REQ)
corresponds to a consistent recommendation.
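The following minimal Python sketch illustrates such a recommendation task. The product
data, the requirement values, and the single filter condition (large-size photoprints require
more than 5 megapixels) are illustrative, and the constraint check is simplified to direct
filtering rather than a general CSP solver.

# Hypothetical digital-camera assortment (the product constraints CPROD).
products = [
    {"id": "p1", "price": 148, "mpix": 8.0, "opt-zoom": "4x", "waterproof": False},
    {"id": "p2", "price": 182, "mpix": 8.0, "opt-zoom": "5x", "waterproof": True},
    {"id": "p3", "price": 278, "mpix": 9.1, "opt-zoom": "10x", "waterproof": False},
]

# Customer requirements REQ, encoded as values for customer properties.
requirements = {"max-price": 200, "usage": "large-print"}

def satisfies(product, req):
    # Requirement on the price (unary constraint on a product property).
    if product["price"] > req["max-price"]:
        return False
    # Filter condition CF: large-size photoprints require more than 5 mpix.
    if req.get("usage") == "large-print" and product["mpix"] <= 5.0:
        return False
    return True

print([p["id"] for p in products if satisfies(p, requirements)])
# -> ['p1', 'p2'], the consistent recommendations RES for this REQ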
Cases and similarities
In case-based recommendation approaches, items are retrieved using similarity measures that
describe to which extent item properties match some given user’s requirements. The
similarity of an item p to the requirements r ∈ REQ is often defined as shown in the
following formula:

sim(p, REQ) = Σ_{r ∈ REQ} wr · sim(p, r) / Σ_{r ∈ REQ} wr

In this context, sim(p, r) expresses for each item attribute value φr(p) its distance to the
customer requirement r ∈ REQ – for example, φmpix(p1) = 8.0. Furthermore, wr is the
importance weight for requirement r.
In real-world scenarios, there are properties a customer would like to maximize – for
example, the resolution of a digital camera. There are also properties that customers want to
minimize – for example, the price of a digital camera or the risk level of a financial service.
In the first case we are talking about “more-is-better” (MIB) properties; in the second case
the corresponding properties are denoted with “less-is-better” (LIB).
To take those basic properties into account in our similarity calculations,we introduce the
following formulae for calculating local similarities.
First, in the case of MIB properties, the local similarity between p and r is calculated as
follows:

sim(p, r) = (φr(p) − min(r)) / (max(r) − min(r))

The local similarity between p and r in the case of LIB properties is calculated as follows:

sim(p, r) = (max(r) − φr(p)) / (max(r) − min(r))

Here min(r) and max(r) denote the minimum and maximum possible values of the attribute
referred to by requirement r.
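A small Python sketch of these similarity calculations; the attribute ranges, the importance
weights, and the camera values are illustrative and would normally come from the product
catalog and the customer's stated requirements.

def local_sim(value, lo, hi, mib=True):
    # Local similarity w.r.t. the attribute's value range [lo, hi];
    # mib=True means "more is better", otherwise "less is better".
    if hi == lo:
        return 1.0
    return (value - lo) / (hi - lo) if mib else (hi - value) / (hi - lo)

def similarity(product, weights, ranges, mib_flags):
    # Weighted overall similarity: sum(w_r * sim(p, r)) / sum(w_r)
    num = den = 0.0
    for attr, w in weights.items():
        lo, hi = ranges[attr]
        num += w * local_sim(product[attr], lo, hi, mib_flags[attr])
        den += w
    return num / den

camera = {"mpix": 8.0, "price": 278}          # illustrative item p
print(round(similarity(camera,
                       weights={"mpix": 2.0, "price": 1.0},
                       ranges={"mpix": (5.0, 12.0), "price": (100, 500)},
                       mib_flags={"mpix": True, "price": False}), 3))
# -> 0.471: resolution treated as MIB, price as LIB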
Defaults
Proposing default values. Defaults are an important means to support customers in the
requirements specification process, especially in situations in which they are unsure about
which option to select or simply do not know technical details. Defaults can support
customers in choosing a reasonable alternative (an alternative that realistically fits the current
preferences). For example, if a customer is interested in printing large-format pictures from
digital images, the camera should support a resolution of more than 5.0 megapixels (default).
The negative side of the coin is that defaults can also be abused to manipulate consumers to
choose certain options. For example, users can be stimulated to buy a park distance control
functionality in a car by presenting the corresponding default value.
Defaults can be specified in various ways:
• Static defaults: In this case, one default is specified per customer property – for example,
default(usage) = large-print, because typically users want to generate posters from high-
quality pictures.
• Dependent defaults: In this case a default is defined for different combinations of potential
customer requirements – for example, default(usage=small-print, max-price) = 300.
• Derived defaults: Whereas the first two default types are strictly based on a declarative
approach, this third type exploits existing interaction logs for the automated derivation of
default values.
Interacting with constraint-based recommenders
In our example, a given set of requirements REQ = {r1 : price <= 150, r2 :
opt-zoom = 5x, r3 : sound = yes, r4 : waterproof = yes} cannot be fulfilled by any of the
products in P = {p1, p2, p3, p4, p5, p6, p7, p8} because
σ[price<=150, opt-zoom=5x, sound=yes, waterproof=yes](P) = ∅.
Calculating diagnoses for unsatisfiable requirements
In the context of our problem setting, a diagnosis is a minimal set of user requirements whose
repair (adaptation) will allow the retrieval of a recommendation.
Given P = {p1, p2, . . . , pn} and REQ = {r1, r2, . . . , rm} where σ[REQ](P) = ∅, a
knowledge-based recommender system would calculate a set of diagnoses Δ = {d1, d2, . . . ,
dk} where σ[REQ−di](P) ≠ ∅ for all di ∈ Δ. A diagnosis is a minimal set of elements {r1, r2,
. . . , rk} = d ⊆ REQ that have to be repaired in order to restore consistency with the given
product assortment, so that at least one solution can be found: σ[REQ−d](P) ≠ ∅. Following
the basic principles of model-based diagnosis (MBD), the calculation of diagnoses di ∈ Δ is
based on the determination and resolution of conflict sets. A conflict set CS (Junker 2004) is
defined as a subset {r1, r2, . . . , rl} ⊆ REQ such that σ[CS](P) = ∅. A conflict set CS is
minimal if and only if (iff) there does not exist a CS' with CS' ⊂ CS.
Calculating conflict sets. A recent and general method for the calculation of conflict sets is
QuickXPlain, an algorithm that calculates one conflict set at a time for a given set of
constraints. Its divide-and-conquer strategy helps to significantly accelerate the performance
compared with other approaches (for details see, e.g., Junker 2004).
QuickXPlain has two input parameters: first, P is the given product assortment
P = {p1, p2, . . . , pm}; second, REQ = {r1, r2, . . . , rn} is the set of requirements analyzed by
the conflict detection algorithm.
QuickXPlain is based on a recursive divide-and-conquer strategy that divides the set of
requirements into the subsets REQ1 and REQ2. If both subsets contain about 50 percent of
the requirements (the splitting factor is n/2), all the requirements contained in REQ2 can be
deleted (ignored) after a single consistency check if σ[REQ1](P) = ∅. The splitting factor of
n/2 is generally recommended; however, other factors can be defined. In the best case (e.g.,
all elements of the conflict belong to subset REQ1) the algorithm requires log2(n/u) + 2u
consistency checks; in the worst case, the number of consistency checks is 2u(log2(n/u) + 1),
where u is the number of elements contained in the conflict set.
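The algorithm listing itself is not reproduced in this text. The following Python sketch shows
the recursive divide-and-conquer idea on a toy example: requirements are (label, predicate)
pairs, a set of requirements is consistent if at least one product satisfies all of them, and the
two-camera assortment is purely illustrative.

def consistent(requirements, products):
    # sigma[requirements](P) != empty set
    return any(all(pred(p) for _, pred in requirements) for p in products)

def quickxplain(requirements, products):
    # Returns one minimal conflict set, or [] if REQ is satisfiable.
    if not requirements or consistent(requirements, products):
        return []
    return _qx([], False, requirements, products)

def _qx(background, delta_added, reqs, products):
    if delta_added and not consistent(background, products):
        return []
    if len(reqs) == 1:
        return list(reqs)
    k = len(reqs) // 2                     # splitting factor n/2
    r1, r2 = reqs[:k], reqs[k:]
    cs2 = _qx(background + r1, bool(r1), r2, products)
    cs1 = _qx(background + cs2, bool(cs2), r1, products)
    return cs1 + cs2

products = [{"price": 148, "opt-zoom": "4x", "waterproof": False},
            {"price": 182, "opt-zoom": "5x", "waterproof": True}]
reqs = [("price<=150",  lambda p: p["price"] <= 150),
        ("opt-zoom=5x", lambda p: p["opt-zoom"] == "5x"),
        ("waterproof",  lambda p: p["waterproof"])]

print([label for label, _ in quickxplain(reqs, products)])
# -> ['price<=150', 'opt-zoom=5x'], one minimal conflict set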