
Don’t Look Stupid:

Avoiding Pitfalls when Recommending Research Papers


Sean M. McNee, Nishikant Kapoor, Joseph A. Konstan
GroupLens Research
Department of Computer Science and Engineering
University of Minnesota
Minneapolis, Minnesota 55455 USA
{mcnee, nkapoor, konstan}@cs.umn.edu

ABSTRACT
If recommenders are to help people be more productive, they need to support a wide variety of real-world information seeking tasks, such as those found when seeking research papers in a digital library. There are many potential pitfalls, including not knowing what tasks to support, generating recommendations for the wrong task, or even failing to generate any meaningful recommendations whatsoever. We posit that different recommender algorithms are better suited to certain information seeking tasks. In this work, we perform a detailed user study with over 130 users to understand these differences between recommender algorithms through an online survey of paper recommendations from the ACM Digital Library. We found that pitfalls are hard to avoid. Two of our algorithms generated 'atypical' recommendations—recommendations that were unrelated to their input baskets. Users reacted accordingly, providing strong negative results for these algorithms. Results from our 'typical' algorithms show some qualitative differences, but since users were exposed to two algorithms, the results may be biased. We present a wide variety of results, teasing out differences between algorithms. Finally, we succinctly summarize our most striking results as "Don't Look Stupid" in front of users.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information filtering, relevance feedback, retrieval models

General Terms
Algorithms, Experimentation, Human Factors

Keywords
Personalization, Recommender Systems, Human-Recommender Interaction, Collaborative Filtering, Content-based Filtering, Information Seeking, Digital Libraries

1. INTRODUCTION
Recommender systems are supposed to help users navigate through complex information spaces by suggesting which items a user should avoid and which items a user should consume. They have proven to be successful in many domains, including Usenet netnews [15], movies [10], music [28], and jokes [7], among others. Even more, recommenders have transitioned from a research curiosity into products and services used every day, including Amazon.com, Yahoo! Music, TiVo, and even Apple's iTunes Music Store.

Yet, with this growing usage, there is a feeling that recommenders are not living up to their initial promise. Recommenders have mostly been applied to lower-density information spaces—spaces where users are not required to make an intensive effort to understand and process recommended information (e.g., movies, music, and jokes) [13]. Moreover, recommenders have supported a limited number of tasks (e.g., a movie recommender can only help find a movie to watch). Can recommenders help people be productive, or only help people make e-commerce purchasing decisions? Herlocker et al. stated it best when they said, "There is an emerging understanding that good recommendation accuracy alone does not give users of recommender systems an effective and satisfying experience. Recommender systems must provide not just accuracy, but also usefulness." [9] (Emphasis in original)

But what is usefulness? We believe a useful recommendation is one that meets a user's current, specific need. It is not a binary measure, but rather a concept for determining how people use a recommender, what they use one for, and why they are using one. Current systems, such as e-commerce websites, have predefined a user's need into their business agendas—they decide if a system is useful for a user! Users have their own opinions about the recommendations they receive, and we believe that if recommenders are to make personalized recommendations, they should listen to users' personalized opinions.

There are many recommender pitfalls. These include not building user confidence (trust failure), not generating any recommendations (knowledge failure), generating incorrect recommendations (personalization failure), and generating recommendations to meet the wrong need (context failure), among others. Avoiding these pitfalls is difficult yet critical for the continued growth and acceptance of recommenders as knowledge management tools.
This concern is of even greater importance as recommenders move into denser information spaces—spaces where users face serious challenges and the cost of failure is high. One such space is that of researchers seeking peer-reviewed, published work from a digital library. In previous work, we generated recommendations for computer science research papers [21]. In that paper and in follow-up work, we found that not only could recommenders generate high quality recommendations in this domain, but also users felt that various recommender algorithms generated qualitatively different, yet meaningful, recommendation lists [21, 31]. Even though this work was in the denser information space of a digital library, it only supported one task: find more citations for a given paper.

To make recommenders more useful, we believe they must support multiple information-seeking tasks. To test this, we build on previous results. We hypothesize that the differences between algorithms are quantifiable and predictable, and that these differences are meaningful and valuable for end users. That is, some algorithms are better suited for particular user information-seeking tasks. We believe that matching users, and their specific tasks, to the appropriate algorithm will increase user satisfaction, efficiency, and the usefulness of the recommender system.

We will use Human-Recommender Interaction theory as our grounding to test these hypotheses. Human-Recommender Interaction (HRI) is a framework and process model for understanding recommender algorithms, users, and information seeking tasks [18-20]. It takes a user-centric view of the recommendation process, shifting the focus of attention from the system and associated algorithms to the (possibly repeated) interactions users have with such systems. By describing recommendation lists using descriptive keywords (called Aspects), HRI provides a language to articulate the kinds of items that would best meet a user's information seeking task. For example, an experienced user may want a high level of Boldness and Saliency from a recommender, whereas a novice may want more Transparent and Affirming recommendations. See Figure 1-1 for an overview of HRI Aspects.

[Figure 1-1: Aspects of Human-Recommender Interaction. The Aspects are divided into three 'Pillars'.]

In this paper, we report on a user study of over 130 users on research paper recommendations we generated from the ACM Digital Library using several different recommender algorithms. We asked users about the suitability of these algorithms for different information seeking tasks, as well as for their opinions on the recommendation lists across multiple dimensions, dimensions based on HRI.

2. DIGITAL LIBRARIES, INFORMATION SEEKING TASKS, AND RECOMMENDERS
A main tenet of HRI is that recommenders need to tailor recommendation lists not just to a user, but to a user's information seeking task. To understand this, we will review information seeking theory, present example tasks in a DL environment, reflect on how recommenders change digital libraries, and finally explain how to apply HRI in this domain.

Information seeking theory provides us with a framework to understand user information needs and context. Models of information seeking behavior, including Taylor's Four Stages of Information Need, Wilson's Mechanisms and Motivations model, Dervin's theory of Sense Making, and Kuhlthau's Information Search Process [5, 16], reveal the ways in which emotion, uncertainty, and compromise affect the quality and nature of a user's information search and its results. More recently, Information Foraging theory suggests how users 'hunt' for information by analyzing the current cues, or scents, in their environment [24].

These theories ground our analysis of what forms of information seeking behavior appear in a digital library environment: users come to a DL looking for research information, but their confidence, emotional state, and comfort/familiarity with the system affect their information seeking behavior as much as their intellectual desire to solve their information need. One method for gathering this information in traditional libraries was the 'reference interview', where the librarian and user had a continuing discussion about the user's information seeking task [30]. As librarians have been replaced by search boxes, users may have faster access to raw information, but without the support and help they may need. By leveraging the opinions of other DL users as well as content authors, recommenders can bring that 'human touch' into digital libraries.

In particular, HRI suggests that users and recommenders have a similar interaction with the end goal of generating meaningful recommendations. In order to apply HRI to digital libraries, such as the ACM Digital Library, we need an understanding of possible user tasks in this domain. We will review a few examples here.

Find references to fit a document. This is our baseline task. Given a document and list of references, what additional references might be appropriate to consider and review? Documents of interest include paper drafts, theses, grant proposals, and book chapters, among others.

Maintain awareness in a research field. Dedicated researchers continually need a stream of information on the events in their research area. It is not always clear which new items are interesting, or what older items have become relevant for various reasons. This task includes maintaining a prioritized list of papers to read, pushing a "paper of the week", or triggering an alert whenever a new paper of interest is added to the collection.
Find people to send paper preprints and e-prints. Junior academics, for example, often send copies of their papers to senior colleagues for advice and support. Building these connections helps research communities grow by linking new and established researchers together, but often new researchers do not know who might be interested in their work.

Inform collection management at a research library. With declining government support for universities and increasing journal subscription costs, nearly all public university libraries in the United States (and many private and corporate libraries across the world) must make careful decisions about which subscriptions to start, renew, or discontinue. Gathering the needed information is an elaborate and expertise-intensive process, including journal usage analysis (both in print and electronically), journal importance in the field, journal availability through other libraries, journal relevance to faculty and students at the university, and the cost of the subscription. Moreover, finding information on journals for which the library does not have a subscription can be very difficult.

These tasks are only suggestive of what users might want from a digital library. Since each task has specific needs, we believe incorporating a tuned recommender into a DL can help better meet these tasks. The HRI Process Model suggests how we can tune recommenders to meet user tasks; it is done through the language of the HRI Aspects. Aspects are selected that best describe the kinds of results users expect to see from the recommender for this task [20]. The HRI Aspects can then be matched to recommender algorithms to select the most appropriate algorithm for the given task from a family of algorithms with known recommendation properties.

For example, the initial task, "Find references to fit a document", implies the user is trying to 'grow' a list. Thus, important Aspects could include Saliency (the emotional reaction a user has to a recommendation), Spread (the diversity of items from across the DL), Adaptability (how a recommender changes as a user changes), and Risk (recommending items based on confidence). Through these Aspects we can map this task to the appropriate algorithms, as algorithms are mapped to Aspects via a variety of metrics [18]. One of the strengths of HRI is that it only claims the mappings exist. While it provides suggestions for creating the mappings automatically, we believe user input is critical to get the mappings correct. In this paper, we add users into this feedback loop by asking them to evaluate these mappings independently.

3. RECOMMENDING PAPERS
We focused on four recommender algorithms in this domain: three collaborative and one content-based. We selected User-Based Collaborative Filtering (CF), a Naïve Bayesian Classifier, a version of Probabilistic Latent Semantic Indexing (PLSI), and a textual TF/IDF-based algorithm. We chose these because they represent a spread of recommendation approaches and are well known in the research community. Table 3-1 has a summary of the four algorithms.

To generate paper recommendations, a content-based algorithm would mine the text of each paper and correlate an input stream of words to mined papers. The collaborative algorithms, on the other hand, ignore paper text. Further, instead of relying on user opinion to generate recommendations, previous work has mined the citations between papers to populate the ratings matrix for CF recommendations [21, 31]. In this model, each paper "votes" for the citations on its reference list. Thus, papers are "users" and citations are "items"—papers receive citation recommendations. When an author puts a citation in a paper she writes, it is an implicit vote for that citation. All of the collaborative algorithms use this data to train models and generate recommendations. While Bayes and PLSI could be trained on many different features, we are only considering them as collaborative algorithms and training them with citation data.

In this collaborative model, a paper is a collection of citations; the content is ignored. Moreover, any collection of citations can be sent to the recommender algorithm for recommendations (a 'pseudo-paper', if you will). For example, we could send one paper's worth of citations or one author's entire bibliography of citations to the recommender. The interaction with the recommender is the same in both cases, but the meaning behind the interaction could be quite different.

Table 3-1: Summary of Recommender Algorithm Properties

Algorithm      Run-time Speed   Pre-Process   Expected Rec. Type
User-User CF   Slow             None          High Serendipity
Naïve Bayes    Fast             Slow          High Ratability
PLSI           Very Fast        Very Slow     "Local" Serendipity
TF/IDF         Very Slow        Fast          High Similarity

3.1 User-based Collaborative Filtering
User-based Collaborative Filtering (CF) has been widely used in recommender systems for over 12 years. It relies on opinions to generate recommendations: it assumes that people similar to you also have similar opinions on items, thus their opinions are recommendations to you. The most well known algorithm is User-based collaborative filtering (a.k.a. User-User CF, or Resnick's algorithm), a k-nearest neighbor recommendation algorithm [10, 26]. To generate recommendations for a target user, a neighborhood of k similar users is calculated, usually using Pearson correlation, and the highest weighted-average opinions of the neighborhood are returned as a recommendation list.

User-based collaborative filtering has many positives and negatives. It is a well-known and studied algorithm, and it has typically been among the most accurate predictors of user ratings. Because of this, it can be considered the "gold standard" of recommender algorithms. Conceptually, it is easy to understand and easy to implement. It has a fast and efficient startup, but it can be slow during run-time, especially over sparse datasets. The choice of neighborhood size and similarity metric can affect coverage and recommendation quality. It is felt in the community that User-based CF can generate 'serendipitous recommendations' because of the mixing of users across the dataset, but the structure of the dataset greatly affects the algorithm—it is possible that for some users, a User-based CF algorithm will never generate high quality recommendations.

There are many other variations of collaborative filtering, including item-based CF [27], content-boosted CF [8, 22], CF augmented with various machine learning optimizations [2, 4, 12], as well as hybrid approaches [3]. In this paper, we chose User-based CF for its simplicity, high quality recommendations, and familiarity in the recommender systems community.
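To make the mapping concrete, a minimal sketch of this citation-based User-User CF is given below. It is not the SUGGEST implementation used later in the study; the `citations` dictionary, the choice of a set-based cosine similarity over unary citation data (the general description above mentions Pearson correlation over ratings), and all identifiers are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt

def cosine(a: set, b: set) -> float:
    """Cosine similarity between two binary citation vectors, stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))

def recommend_citations(basket: set, citations: dict, k: int = 50, n: int = 5):
    """User-user CF over citation data: papers are 'users', cited papers are 'items'.

    basket    -- set of paper ids the target 'pseudo-paper' already cites
    citations -- dict mapping paper id -> set of paper ids it cites
    k         -- neighborhood size (the study used 50 neighbors)
    n         -- length of the returned recommendation list
    """
    # Find the k papers whose reference lists are most similar to the basket.
    neighbors = sorted(
        ((cosine(basket, cited), paper) for paper, cited in citations.items()),
        reverse=True,
    )[:k]

    # Aggregate the neighbors' citations, weighted by neighbor similarity.
    scores = defaultdict(float)
    for sim, paper in neighbors:
        for item in citations[paper]:
            if item not in basket:          # never recommend what is already cited
                scores[item] += sim

    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Because any set of citations can be passed as `basket`, the same code path serves a single paper's reference list or an author's entire bibliography, which is the 'pseudo-paper' flexibility described above.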
Based on the results of previous work, we expect User-based CF to perform very well at generating research paper recommendations. Since we are guaranteed that each item in a digital library rates other items (all papers cite other papers!), we expect a good level of 'cross-pollination' for serendipitous recommendations as well as high coverage. While the size of the ACM Digital Library is not overwhelmingly large, issues of scale and sparsity are relevant if User-based CF is applied to larger databases, such as the PubMed digital library from NIH, containing over 16 million citations [23].

3.2 Naïve Bayes Classifier
If we assume independence of co-citation events in a research paper, then a Naïve Bayes Classifier [1] can be used to generate citation recommendations. All co-citation pairs are positive training examples. Posterior probabilities are the likelihood that an item is co-cited with the items (citations) in the target paper class. Items that are strongly classified as belonging to the target paper class are returned as recommendations (usually a top-n list).

A Bayes Classifier also has many pros and cons. It is a well-known and established algorithm in the machine learning literature. The independence assumption between co-citation pairs is a large one to make, but even in domains where this assumption fails, the classifier still performs quite well [6]. The classifier requires a model to generate recommendations, and this model must be re-built when new data is entered—a potentially slow process. Generating recommendations, however, is quick.

Because, at heart, a Naïve Bayesian Classifier is a classification algorithm, we have specific expectations for it. Classifiers calculate 'most likely events'; they determine what classes items belong to. From a user's point of view in a recommender, a classifier returns the next most likely item the user will rate. This 'ratability' property of classifiers may not match a user's expectations in a recommender. As such, we believe that the recommendations would be more "mainstream", and not as serendipitous as those from User-based CF. Finally, full coverage could make a difference in low-data situations.
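The study used a custom implementation of this classifier; the sketch below shows one plausible formulation of the idea under the stated independence assumption, scoring each candidate by a Laplace-smoothed prior plus the (naively independent) likelihood of being co-cited with each basket item. The exact factorization and smoothing used in the study are not specified, so treat this as an illustration rather than the authors' code.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log

def train_cocitation_counts(citations: dict):
    """Count how often papers are cited, and how often pairs are cited together.

    citations -- dict mapping paper id -> set of paper ids it cites
    Every co-citation pair in a reference list is a positive training example.
    """
    item_count = Counter()
    pair_count = defaultdict(Counter)
    for cited in citations.values():
        for a in cited:
            item_count[a] += 1
        for a, b in combinations(cited, 2):
            pair_count[a][b] += 1
            pair_count[b][a] += 1
    return item_count, pair_count

def naive_bayes_recommend(basket, item_count, pair_count, n=5, alpha=1.0):
    """Score each candidate by how likely it is to be co-cited with the basket,
    assuming (naively) that co-citation events are independent."""
    total = sum(item_count.values())
    vocab = len(item_count)
    scores = {}
    for cand in item_count:
        if cand in basket:
            continue
        # Laplace-smoothed prior for the candidate class.
        score = log((item_count[cand] + alpha) / (total + alpha * vocab))
        for b in basket:
            # Laplace-smoothed P(basket item | candidate).
            score += log((pair_count[cand][b] + alpha) /
                         (item_count[cand] + alpha * vocab))
        scores[cand] = score
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Note how the prior term rewards globally well-co-cited papers; when the basket contributes little signal, the ranking collapses toward the most co-cited items in the dataset, a behavior the study itself observed (Section 4.4.1).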
3.3 Probabilistic Latent Semantic Indexing
Probabilistic Latent Semantic Indexing (PLSI) is a relatively new algorithm to be applied to recommender systems [11]. PLSI is a probabilistic dimensionality reduction algorithm with a strong mathematical foundation. In PLSI, the ratings space (citation space) is modeled using a set of independent latent classes. A user (i.e. paper) can have probabilistic memberships in multiple classes. PLSI uses a variant of the EM algorithm to optimize the class conditional parameters through a variational probability distribution for each rating (citing) instance based on previous model parameters. Items (citations) with the highest probabilities relative to the latent classes are recommended.

Similar in spirit to the Bayes classifier, PLSI has similar benefits and drawbacks. It is mathematically rigorous, using latent classes to find probabilistic relationships between items. Model creation takes a very long time, requiring exceptional computing resources, but runtime is very efficient. It will also have 100% coverage. The latent classes are the key feature of this algorithm; thus performance and quality are highly dependent on the number of latent classes used.

This is a relatively new algorithm in this domain. We believe the latent aspect of this algorithm makes it more like User-based CF: generating interesting and serendipitous recommendations. In some ways, PLSI is like a 'soft' clustering algorithm, creating clusters around the latent classes. By performing local maximizations, the EM nature of the algorithm could reinforce more "popular" connections between items at the expense of "further reaching" (and potentially more interesting) connections; thus recommendations could be interesting locally, finding unexpected papers closely related to the input, but not interesting for finding a broad set of items from across the dataset.
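Training a large PLSI model with tempered EM is the expensive part and is omitted here; the scoring step that follows from the description above is simple. Assuming already-fitted parameters P(z | paper) and P(item | z), with names and shapes below being illustrative rather than the study's implementation, recommendations are the items with the highest mixture probability:

```python
import numpy as np

def plsi_recommend(p_z_given_paper, p_item_given_z, item_ids, basket, n=5):
    """Rank items for one paper under an already-trained PLSI model.

    p_z_given_paper -- length-K vector, P(z | paper) for the target paper/basket
    p_item_given_z  -- K x I matrix, P(item | z) for each latent class z
    item_ids        -- list of I item (paper) ids, aligned with the matrix columns
    basket          -- ids already in the input basket (never recommended back)

    The mixture below is the standard PLSI decomposition
        P(item | paper) = sum_z P(item | z) * P(z | paper);
    fitting these parameters with (tempered) EM is not shown.
    """
    p_item = p_z_given_paper @ p_item_given_z        # length-I vector of P(item | paper)
    order = np.argsort(-p_item)                      # highest probability first
    recs = [item_ids[i] for i in order if item_ids[i] not in basket]
    return recs[:n]
```

With 1000 latent classes, as in the study, nearly all of the cost lives in fitting the two parameter tables; the ranking itself is a single matrix-vector product, which is consistent with the slow model creation and fast runtime noted above.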
3.4 Content-based Filtering with TF/IDF
Content-based filtering for papers uses the full text to generate recommendations. One of the most popular and well-used content filtering algorithms is Porter-stemmed Term Frequency/Inverse Document Frequency (TF/IDF) [25]. By using the frequency of stemmed words in a document compared to the entire corpus, this algorithm recommends items based on how well they match on important keywords.

TF/IDF is well known in the information retrieval community, and is considered the standard vector-space model (a.k.a. "bag of words") approach to information processing. Since all papers contain content, they can be immediately processed, and thus recommended. This contrasts with the collaborative methods, where the strength of the citations can unduly influence recommendations towards older, more cited works. Yet, "bag of words" content analysis is limited by semantic differences (e.g. "car" is not equated to "automobile"). It also may return many irrelevant results, recommending papers that mention the search terms only in passing. These issues are not as important in scientific literature, as papers in one area have a generally agreed-upon vocabulary. Finally, TF/IDF does have an associated cost in pre-processing time and storage space, like other model-based algorithms.

Prior results show that when Content-based Filtering works, it performs very well at generating highly similar results. But more often than not, it fails to generate relevant results. This cherry-picking behavior limits its usefulness, especially when searching for novel or obscure work. On the other hand, this algorithm may excel at more conservative user tasks, especially those that start with a fair amount of information (e.g. "I know many of the conference papers in this research area, but I don't know about journal papers or workshops"). In general, we expect it to generate a narrow, yet predictable kind of recommendation list.
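The study used the Bow Toolkit for its Porter-stemmed TF/IDF recommender; as a rough sketch of the same idea using a different, commonly available library (scikit-learn, an assumption on our part, with the Porter stemming step omitted), ranking candidate papers by TF/IDF cosine similarity to the basket's text looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_recommend(basket_text, corpus_texts, corpus_ids, n=5):
    """Content-based filtering: rank corpus documents by TF/IDF cosine similarity
    to the concatenated text of the input basket.

    basket_text  -- single string: titles/abstracts of the basket papers
    corpus_texts -- list of strings, one per candidate paper
    corpus_ids   -- ids aligned with corpus_texts
    """
    vectorizer = TfidfVectorizer(stop_words="english")   # Porter stemming omitted here
    doc_matrix = vectorizer.fit_transform(corpus_texts)  # corpus term weights
    query_vec = vectorizer.transform([basket_text])      # basket in the same space
    sims = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = sims.argsort()[::-1]                         # most similar first
    return [corpus_ids[i] for i in ranked[:n]]
```

In practice the basket's own papers would first be removed from `corpus_texts` so they are not recommended back to the user.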
4. THE USER STUDY
Previous simulation experiments suggest recommender algorithms behave differently from each other on a variety of metrics [18]. In this study, we tackle the following research questions:

• How will real users describe the recommendations generated by different recommender algorithms?

• How will users rate the ability of recommender algorithms to meet the needs of specific user tasks?

4.1 Experimental Design
In the experiment, we generated two recommendation lists of five items each, called 'List A' and 'List B'. These recommendation lists were based on a 'basket' of papers that the user provided to us.
Users were randomly assigned two algorithms; the display of the lists was counterbalanced to cancel any possible order effects. Algorithm recommendation lists were pre-screened to make sure there was not strong overlap between result lists. At most, lists had 2 out of 5 recommendations in common. Since order is important in the returned results, if there was overlap, the list that had the shared result 'higher up on the list' retained the item. In the rare event of a tie, List A always won.
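The tie-breaking rule in this paragraph can be written down directly. The sketch below is a literal transcription of it; how the study refilled a shortened list afterwards is not described, so that step is left out.

```python
def resolve_overlap(list_a, list_b):
    """Resolve items shared between two ranked 5-item recommendation lists.

    A shared item stays on the list where it appears higher up (smaller index);
    on a tie it stays on List A. Returns both lists with the losing copies removed.
    """
    shared = set(list_a) & set(list_b)
    keep_a, keep_b = [], []
    for item in list_a:
        if item in shared and list_b.index(item) < list_a.index(item):
            continue                      # List B ranked it higher: B keeps it
        keep_a.append(item)
    for item in list_b:
        if item in shared and list_a.index(item) <= list_b.index(item):
            continue                      # List A ranked it at least as high: A keeps it
        keep_b.append(item)
    return keep_a, keep_b
```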
Our dataset is based on a snapshot of the ACM Digital Library, containing over 24,000 papers. For each paper, we have the text of the paper's abstract; each paper cites at least two other papers in the dataset and is cited by at least two other papers in the dataset.

Users were recruited from several different pools. A subject either had to be an author of papers appearing in the dataset or be "very familiar" with the literature in a particular research area. Users were mined from the DL itself: a random selection of users who authored more than five papers in the dataset were sent an email invitation to participate. We also sent invitations to several computer science mailing lists asking for participants.

As a summary, the following algorithms were used in the online experiment:

1. User-based collaborative filtering at 50 neighbors; we used the SUGGEST Recommender as our implementation [14]

2. Porter-stemmed TF/IDF content-based filtering over paper titles, authors, keywords, and abstracts; we used the Bow Toolkit as our implementation [17]

3. Naïve Bayes Classifier trained using co-citation between papers as its feature; we used a custom implementation

4. Probabilistic Latent Semantic Indexing with 1000 classes; we used a unary co-occurrence latent semantic model and tempered our EM algorithm to avoid overfitting

4.2 Experiment Walkthrough
After consenting, subjects were asked about their status as a researcher/academic (novice, expert, or outside the field of computer science) and about their familiarity with the ACM Digital Library. Next, subjects were asked to create their 'basket', a list of papers to be sent to the recommender engines. This list is seeded by the name of an author who has published in the ACM Digital Library. There were options for selecting 'yourself' as an author or selecting 'someone else'. Figure 4-1 shows the author selection page.

[Figure 4-1: The Author Selection Page]

After confirming the author selection, the subject was presented with a list of papers by that author in our dataset. The subject was allowed to prune any papers he did not want in the basket. If the subject stated he was an author himself, he saw a listing of papers he had written as well as a listing of papers he had cited from our dataset. If the subject selected another author, he was only presented with the listing of papers that author had published in our dataset. This decision had great implications for our results, as we will discuss later.

[Figure 4-2: The Paper and Citation Selection Page]

After pruning was finished, we presented the user with a selection of two possible information-seeking tasks; see Table 4-1. After selecting a task, the subject received two recommendation lists. See Figure 4-2 and Figure 4-3 for an overview of the paper selection and recommendation interfaces. There were two pages of questions associated with the recommendation lists. On the first page, we asked comparative questions: user opinion about the kinds of papers appearing on the lists. On the second page, we asked task-related questions: rating the suitability of each list to meet the subject's chosen information seeking task.

[Figure 4-3: The Recommendation List Screen]
Table 4-1: Available Information Seeking Tasks

Group: Author, with my own work
• My bibliography is a list of papers that are important to my research. Locate papers that I might want to cite in the future (or perhaps that should have cited me!). These other papers would be related to my work and would help me in my current research efforts.
• Given my bibliography as a description of what I find interesting in research, locate papers I would find interesting to read. These papers would not necessarily be work in my area, and could suggest potential collaborators or new research questions to explore in the future.

Group: Someone else's publications
• You would like to build a list of papers to describe a research area, and you started with this input basket. You are concerned about covering the area and want to make sure you do not miss any important papers. Which papers would next appear on this list?
• You are looking to expand your research interests into new or related areas, hoping to find interesting papers to read. Your input basket represents some papers in your current research area and interests. What papers might expand your knowledge, find new collaborators, or suggest new research areas to explore in the future?

Table 4-2: Survey Questions for Both Pages

Page 1: Comparative Questions
1-1  I would consider most of the papers on this recommendation list as authoritative works in their field.
1-2  Most of the papers on this recommendation list are papers that I have previously read, I am familiar with, or I have heard of.
1-3  This recommendation list is closely tuned-tailored-personalized to my given input basket, and is not a generic list of papers in this research area.
1-4  This list feels like a good recommendation list. I like it. It resonates with me.
1-5  This recommendation list contains papers that I was not expecting to see, but are good recommendations considering my input basket.
1-6  This list contains a good spread of papers from this research area and is not overly specialized, given the input basket.

Page 2: Task-related Questions
2-1  In your opinion, which recommendation list generated a better set of recommendations for you and your task?
2-2  How satisfied are you with the recommendations in the list you preferred?
2-3  Pretend you were going to perform the alternate task, the one you did not choose (see above). In your opinion, which recommendation list generated a better set of recommendations for this other task?
2-4  We have the ability to send you a text file containing the standard ACM citation format for these recommendation lists. Would you like to keep a copy of these recommendation lists?
2-5  If so, which format would you prefer?

Note: For referencing, questions may be referred to by number or by the bolded text in the question. No words were bold in the online survey.

A summary of all questions is in Table 4-2. On the first page, there were five possible responses to each question: 'Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree', and 'Not Sure'. Question 2-1, asking which list was better for the chosen task, had two options: 'List A' and 'List B'. We forced users to choose between the lists. In Question 2-2, we ask for the user's satisfaction with that selection, and responses ranged from 'Very Satisfied' to 'Very Dissatisfied', with a 'Not Sure' option. Question 2-3 was a hypothetical question, asking users to ponder the better list for the alternate task, and it had 'List A', 'List B', and 'Not Sure' as its possible responses. Note that for the hypothetical question we allowed a 'Not Sure' response. In Question 2-4, we offered users an option to receive a copy of the citations we generated for them, asking which citations they would prefer: 'List A Only', 'List B Only', 'Both Lists', or 'Neither List'. Finally, in Question 2-5, we asked them which format they would prefer: the 'ACM DL BibTex Format' or the 'Standard ACM Citation Format'. While at first glance Questions 2-4 and 2-5 appear to be a thank-you service for users participating in our study, we wanted to study the difference between stating they are satisfied with a recommendation list and choosing to receive a copy of the list for future use—the difference between action and words. We believe action is a much stronger statement of satisfaction.

4.3 Results
138 subjects completed the survey over the 3-week experimental period. There were 18 students, 117 professors/researchers, and 7 non-computer scientists. All but six subjects were familiar with the ACM Digital Library; 104 used it on a regular basis. Each subject was assigned to two algorithms; see Table 4-3. We will first present overall survey results, followed by algorithm-pair interactions, and conclude with results by user-selected task.

4.3.1 Survey Results: Comparative Questions
Figure 4-4 and Figure 4-5 summarize the results for the six questions on the first page. Answers have been binned into three categories: 'Agree', 'Disagree', and 'Not Sure'. As the figures show, there is a dramatic difference in user opinion about the four algorithms. Users tended to like ('agree' with) User CF and TF/IDF, whereas users tended to dislike ('disagree' with) Bayes and PLSI. Results between these pairs are exceptionally significant (p << 0.01). We have found a pitfall; we discuss these unusual results in a separate in-depth analysis.

Focusing on User CF and TF/IDF, differences are statistically significant for Question 1-2 (p < 0.01) and almost so for Question 1-1 (p = 0.11). That is, User CF is more authoritative and generates more familiar results than TF/IDF. The trends on the other four questions also suggest that User CF generates recommendations that are more 'interesting'.
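The paper reports these significance levels without naming the statistical test. Purely as an illustration (not necessarily the authors' analysis) of how binned 'Agree'/'Disagree'/'Not Sure' counts for two algorithms could be compared, a chi-square test of independence with placeholder counts might look like this:

```python
from scipy.stats import chi2_contingency

# Hypothetical binned counts for one comparative question.
# Rows: algorithms; columns: 'Agree', 'Disagree', 'Not Sure'.
# These numbers are placeholders, not the study's data.
observed = [
    [60, 20, 20],   # e.g. User CF
    [25, 55, 20],   # e.g. Bayes
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```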
Table 4-3: Number of Users per Experimental Condition

Algorithm Pair                     Users
User-User CF and TF/IDF            24
User-User CF and Naïve Bayesian    28
User-User CF and PLSI              25
TF/IDF and Naïve Bayesian          24
TF/IDF and PLSI                    18
Naïve Bayesian and PLSI            23

[Figure 4-4: Displayed from Left to Right, Results for the Comparative Questions on 'Authoritative Works', 'Familiarity', and 'Personalization']

[Figure 4-5: Displayed from Left to Right, Results for the Comparative Questions 'Is a Good Recommendation', 'Not Expecting', and 'Good Spread']

4.3.2 Survey Results: Task-Related Questions
There were two kinds of information-seeking tasks: "find closely related papers" and "find more distant paper relationships". Users were required to select one of these tasks before receiving recommendations. Figure 4-6 shows how users judged the suitability of each of the four algorithms to their chosen task. Please note: users were forced to answer Question 2-1; there was no 'Not Sure' option. Continuing the above trend, users chose User CF and TF/IDF over Bayes and PLSI at about a 3:1 ratio (p < 0.01). Question 2-2 asked users how satisfied they were with the algorithm they chose. Users were not pleased with Bayes or PLSI (all results for Bayes were 'Disagree'!), and were happy ('agree') with User CF and TF/IDF just over half of the time.

Question 2-3 asked which algorithm would be best for the task the user did not select. When asked about this alternate task, 56% chose the same algorithm, 16% chose the other algorithm, and 28% were not sure. This trend carried across all four algorithms, except for Bayes, where 'Not Sure' was selected 50% of the time.

The final questions asked users if they wanted to keep a copy of the recommendations. While offered as a service, it provides an insight into user satisfaction. 32% of all users elected to 'keep a copy'. 67% of satisfied users chose to, while only one user who was 'dissatisfied' chose to. Finally, users chose to keep recommendations generated from User CF and TF/IDF over Bayes and PLSI again at a 3:1 ratio (p < 0.01).

[Figure 4-6: Results for Questions 2-1 and 2-2, User Opinion on Suitability of Algorithms for the Chosen User Task]

4.3.3 Results across Algorithm Pairs
Users answered questions in the context of a pair of algorithms. The comparative questions showed minor levels of interaction. When either User CF or TF/IDF was paired with Bayes or PLSI, users tended to score algorithms towards the extremes, most notably for Questions 1-2 ("familiarity"), 1-3 ("personalized"), and 1-4 ("good recommendation list"). There were no discernible effects when User CF was paired with TF/IDF or when Bayes was paired with PLSI.

When answering task-related questions, User CF and TF/IDF dominated over Bayes and PLSI. When shown together, User CF was selected 90% of the time over Bayes and 95% over PLSI. TF/IDF was selected 88% and 94%, respectively. Compared against each other, User CF was selected more frequently (60%) than TF/IDF. Bayes and PLSI were preferred an equal number of times when placed together. More interestingly, when paired against Bayes and PLSI, User CF received higher praise, earning all of its 'Very Satisfied' scores in these cases. TF/IDF saw no such increase.

When asked for algorithm preferences for the alternate task, users rarely switched algorithms. Of the switches between algorithms, users switched to PLSI 9% of the time and switched to Bayes 20% of the time. Users who first chose PLSI or Bayes always switched to either User CF or TF/IDF. Between User CF and TF/IDF, users were twice as likely to switch from User CF to TF/IDF as vice-versa.

4.3.4 Results by User-selected Task
There was a 60%-40% split between subjects selecting the "closely related papers" task or the "distant relationships" task. Responses showed some significant differences across tasks for different algorithms. For example, User-CF was more authoritative for the "close" task (p < 0.05), whereas TF/IDF was more familiar for the "distant" task (p < 0.005). Further, PLSI was more personalized for "narrow" (p < 0.01), and both User-CF and PLSI were more unexpected for "distant" (p < 0.05 and p < 0.075, respectively).

Task selection was a 2x2 grid, where subjects also chose "Self" vs. "Someone else". 85% of subjects selected "Self". The above results are also significant for "Self"-only subjects. Trends in the "Someone else"-only subjects also support the results above.
Table 4-4: Overlap for Top 10 Recommendation Lists by Basket Size. Computed as the average of the intersection / union for all possible pairs in a basket range. The lower the score, the fewer shared items; 0 means no overlap.

Algorithm    <5       5-15     15-30    30+
Bayes        0.22     0.36     0.20     0.04
PLSI         0.01     0.01     0.002    0.004
TF/IDF       0.002    0        0        0.03
User         0        0        0.001    0.04

4.4 Analysis and Discussion
Before discussing the results of this study, we first perform a closer analysis of the atypical results we found and review our results for any potential order bias.

4.4.1 Analysis of Atypical Results
The Bayes and PLSI algorithms did not perform as expected. Prior work suggests that a Naïve Bayesian Classifier should be similar to User-based CF in this domain [21], and PLSI has been shown to be of high quality in other domains [11]. What happened?

A careful review of our logs reveals that Bayes was generating similar recommendation lists for all users. Looking at top-10 recommendation lists, on average 20% of the recommended papers were identical for all users. Instead of returning personalized recommendations, Bayes returned the most highly co-cited items in the dataset. Changing the input basket size did not have an effect: see Table 4-4. Trend data suggests that as overlap increased, user satisfaction decreased across all questions, most notably for 'personalization' and 'good spread', but also for task usage as well.

PLSI was doing something equally odd. While it returned personalized recommendations, a random sampling from the logs revealed seemingly nonsensical recommendations. For example, an input basket composed of CHI papers received recommendations for operating system queuing analysis. We were unable to determine any patterns in the PLSI behavior.

In this work, the input basket size varied greatly (average of 26 items, stddev 32.6), with many baskets containing five or fewer items. Much as search engines need to return relevant results with little input [29], recommenders in this domain need to adapt to a variety of user models. User CF and TF/IDF were able to do so; Bayes and PLSI were not.

Yet, it is not accurate to say the algorithms 'failed'—some users did receive high quality recommendations. Rather, Bayes and PLSI were inconsistent. While Bayes averaged an overlap score of 0.20, the standard deviation was 0.26. We believe the quality of the recommendations generated by these two algorithms depended on how well-connected the items in the input basket were to the rest of the dataset, especially for Bayes. For example, in our tests, adding one well-cited item to a questionable Bayesian basket radically improved the results. Adding such items to PLSI was not always as helpful. In PLSI, we believe many of the latent classes were 'junk classes'; thus, to improve recommendations, items from good classes need to be in the input basket.
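The overlap score in Table 4-4 is defined in its caption as the average intersection-over-union across all pairs of users' top-10 lists within a basket-size range. A small helper that reproduces that definition (a sketch; the study's own bucketing code is not shown) could be:

```python
from itertools import combinations

def average_list_overlap(top10_lists):
    """Average pairwise overlap of top-10 recommendation lists, as in Table 4-4:
    |intersection| / |union| for every pair of lists, averaged over all pairs.
    A score of 0 means no two lists shared an item."""
    pairs = list(combinations(top10_lists, 2))
    if not pairs:
        return 0.0
    return sum(len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs) / len(pairs)
```

Applying it to the Bayes lists from users with baskets of 5-15 items, for example, is what produces the 0.36 cell in the table.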
4.4.2 Order Bias
The experiment was cross-balanced, so each algorithm appeared an equal number of times as List A and List B. With this balance, we were able to detect an order bias, where people preferred List A (on the left) to List B (on the right). Specifically, equal numbers of people selected List A and List B as their preferred list (Question 2-1). But those who chose List B were less satisfied with their selection (Question 2-2) (p < 0.10).

4.4.3 Comparative and User Task Analysis
The atypical results of the two recommender algorithms have skewed the results of the survey, making a detailed comparative analysis difficult. For User CF and TF/IDF, users recognized the recommendations and deemed them authoritative in their research area, but the lists did not contain unexpected results, and users were not sure if the recommendations came from a wide spread of the dataset. The differences between User CF and all other algorithms for authority and familiarity were significant. These results are different from previously published work, where User CF generated more unexpected recommendations than TF/IDF. We do reinforce the result that User CF generates authoritative paper recommendations. Further, previous work suggested that TF/IDF had a higher level of user satisfaction, whereas here both algorithms received positive scores. Of course, the interaction effects may have influenced user responses, especially for User CF.

TF/IDF proved to be equally useful for all four given user tasks, with around a 54% satisfaction rating for all tasks. User CF showed a higher satisfaction rating (60%) for the 'find closely related papers' task. These results also reinforce previous findings, but only in the context of the possible interaction effects.

Finally, we showed that user opinion of algorithms varied by task. That is, depending on the current task, users perceived differences among algorithms across multiple dimensions. This is the first experimental evidence in support of HRI, suggesting that the qualities of algorithms important to users vary according to task.

4.4.4 Dataset Limitations
While our dataset was of high quality, it contained several limitations. The two striking ones are the scope of the data and the range of the data. Only items for which the ACM holds copyright were contained in this dataset; many relevant papers were not included. Due to the nature of the DL, the bulk of items were published in the last 15 years. Finally, items had to cite other items and be cited by other items in the dataset. This limitation excluded much of the newest published work. We received several emails from users complaining about these limitations, including requests to add their work to our dataset, statements that they had changed research fields, and concerns about only receiving recommendations for older papers.

Further, these limitations may have affected PLSI's and the Naïve Bayesian Classifier's performance. While all papers were connected to each other, many were only weakly so. Both of these algorithms use strong statistical models as a basis for their recommendations. When calculated over a dataset such as this one, the meaningful statistics could be 'washed out' by the other calculations. In Bayes, a paper with a strong prior probability could influence posterior probabilities, or perhaps the naïve assumption was one we should not make in this domain. In PLSI, the local maximization calculations could reinforce stronger global calculations in place of the weaker, yet meaningful, local connections. Finally, it is worth asking whether measuring co-citations is the correct mapping for recommenders in this domain. A complete analysis and discussion, however, is outside the scope of this paper.
5. IMPLICATIONS AND FUTURE WORK
As previously mentioned, Bayes and PLSI perform well as recommenders in offline simulation experiments. Specifically, both have scored well on accuracy metrics using a leave-n-out methodology [1, 11]. As we have argued elsewhere (i.e. [19, 20]), such offline measures may not translate into useful measures for end users.

In particular, Bayes recommended users a combination of personalized and non-personalized recommendations. In many cases, the personalized recommendations were good suggestions. In fact, in offline leave-n-out scenarios, it did not matter that the recommendation lists contained a mixture of highly personalized and non-personalized items; as long as the withheld item appeared on the list, the algorithm scored well on the metric [20]. Users, however, were not satisfied with these recommendation lists. These results suggest that the research community's dependence on offline experiments has created a disconnect between algorithms that score well on accuracy metrics and algorithms that users will find useful.

This argument is even more subtle. In our testing of HRI, we performed an offline analysis of recommendation lists using a series of non-accuracy-based metrics. If the Naïve Bayes Classifier were generating a mixture of personalized and non-personalized results, a personalization metric should have revealed this problem. In our results, Bayes generated very personalized responses [20]. The difference between then and now is the input baskets. The baskets then were based on the citations from a single paper; usually such lists contain a mixture of well and loosely connected papers, lists for which Bayes could generate personalized recommendations. The input baskets here were different—they were based on papers written by a single author. This reveals not just the importance of the dataset but also the importance of the input basket when analyzing algorithms. The disconnect is even larger than we thought.

In previous work, we argued that showing one good recommendation in a list of five was enough to satisfy users. It is not that simple: showing one horrible recommendation in five is enough for users to lose confidence in the recommender. We call this the Don't Look Stupid principle: only show recommendation lists to users when you have some confidence in their usefulness. While this principle applies most dramatically when talking about Bayes and PLSI in our results, we believe it is just as important when dealing with users' information seeking tasks. A recommendation list is bad when it is not useful to the user, independent of why it is bad.

To understand this principle in context, we can use HRI. This experiment was a one-time online user survey. These users had no previous experience with the recommender algorithms, and they were given an information seeking task. Because of this, many HRI Aspects are not relevant to our discussion, but a few become very important, such as Correctness, Saliency, Trust, and Expectations of Usefulness. By being asked to be in an experiment, users had a heightened awareness of the recommendation algorithms; they expected the algorithms to be useful and they expected them to generate correct results. Indeed, we received several emails from users worried about the poor recommendations they received from either Bayes or PLSI. We had no time to build trust with our users, nor did the users gain a sense of the algorithms' personality. When users received nonsensical results, we believe they had a strong emotional reaction. The results went against their expectations of being in an experiment to receive personalized recommendations. Thus, the users provided the strong negative feedback.

There are many threads of possible future work. First, we need a deeper understanding of Bayes and PLSI in this domain. How much of the difficulty experienced in this work is related to properties of the algorithms, properties of the dataset and input baskets, or implications of how these algorithms were applied in this domain? Second, while this study provides evidence for the tenet of HRI that specific recommender algorithms are better suited to certain information needs, more work needs to be done. Yet it does raise one interesting question from our HRI analysis: could a recommender be 'stupid' in front of a user with whom the recommender has already built a relationship? This work must be done with real users; offline analysis is not enough. Finally, the performance of Bayes and PLSI in this domain suggests that dataset properties and input basket selection greatly influence recommendation lists, implying the need for a study comparing multiple datasets across multiple algorithms.

6. CONCLUSIONS
Recommending research papers in a digital library environment can help researchers become more productive. Human-Recommender Interaction argues that recommenders need to be approached from a user-centric perspective in order to remain relevant, useful, and effective in both this and other domains. HRI suggests that tailoring a recommender to a user's information need is a way to this end. To test these ideas, we ran a study of 138 users using four recommender algorithms over the ACM Digital Library. Instead of validating our research questions, we ran into a large pitfall and discovered a more telling result: Don't Look Stupid. Recommenders that generated nonsensical results were not liked by users, even when the nonsensical recommendations were intermixed with meaningful results. These results suggest that it is critically important to select the correct recommender algorithm for the domain and the users' information seeking tasks. Further, the evaluation must be done with real users, as current accuracy metrics cannot detect these problems.

7. ACKNOWLEDGMENTS
We would like to thank the ACM for providing us with a snapshot of the ACM Digital Library. Thanks to GroupLens Research and the University of Minnesota Libraries for their help and support, especially Shilad Sen for helping with PLSI. This research is funded by a grant from the University of Minnesota Libraries and by the National Science Foundation, grants DGE 95-54517, IIS 96-13960, IIS 97-34442, IIS 99-78717, and IIS 01-02229.

8. REFERENCES
[1] J.S. Breese, D. Heckerman and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43-52, 1998.
[2] J. Browning and D.J. Miller, "A Maximum Entropy Approach for Collaborative Filtering", J. VLSI Signal Process. Syst., vol. 37(2-3), pp. 199-209, 2004.
[3] R. Burke, "Hybrid Recommender Systems: Survey and Experiments", User Modeling and User-Adapted Interaction, vol. 12(4), pp. 331-370, 2002.
[4] J. Canny, "Collaborative Filtering with Privacy Via Factor Analysis", in Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 238-245, 2002.
[5] D.O. Case, Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior, San Diego: Academic Press, 2002, pp. 350.
[6] P. Domingos and M. Pazzani, "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", in Proc. of the 13th International Conference on Machine Learning (ICML 96), pp. 105-112, 1996.
[7] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, "Eigentaste: A Constant Time Collaborative Filtering Algorithm", Inf. Retr., vol. 4(2), pp. 133-151, 2001.
[8] N. Good, J.B. Schafer, J.A. Konstan, A. Borchers, B. Sarwar, J. Herlocker and J. Riedl, "Combining Collaborative Filtering with Personal Agents for Better Recommendations", in Proc. of the Sixteenth National Conference on Artificial Intelligence, pp. 439-446, 1999.
[9] J.L. Herlocker, J.A. Konstan, L.G. Terveen, and J.T. Riedl, "Evaluating Collaborative Filtering Recommender Systems", ACM Trans. Inf. Syst., vol. 22(1), pp. 5-53, 2004.
[10] J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "An Algorithmic Framework for Performing Collaborative Filtering", in Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 230-237, 1999.
[11] T. Hofmann, "Latent Semantic Models for Collaborative Filtering", ACM Trans. Inf. Syst., vol. 22(1), pp. 89-115, 2004.
[12] Z. Huang, H. Chen, and D. Zeng, "Applying Associative Retrieval Techniques to Alleviate the Sparsity Problem in Collaborative Filtering", ACM Trans. Inf. Syst., vol. 22(1), pp. 116-142, 2004.
[13] I. Im and A. Hars, "Finding Information just for You: Knowledge Reuse using Collaborative Filtering Systems", in Proc. of the Twenty-Second International Conference on Information Systems, pp. 349-360, 2001.
[14] G. Karypis, "SUGGEST Top-N Recommendation Engine", http://www-users.cs.umn.edu/~karypis/suggest/, 2000.
[15] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon, and J. Riedl, "GroupLens: Applying Collaborative Filtering to Usenet News", Commun. ACM, vol. 40(3), pp. 77-87, 1997.
[16] C.C. Kuhlthau, Seeking Meaning: A Process Approach to Library and Information Services, Westport, CT: Libraries Unlimited, 2004, pp. 247.
[17] A.K. McCallum, "Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering", http://www-2.cs.cmu.edu/mccallum/bow/, 1996.
[18] S.M. McNee, Meeting User Information Needs in Recommender Systems, Ph.D. Dissertation, University of Minnesota, 2006.
[19] S.M. McNee, J. Riedl and J.A. Konstan, "Being Accurate is Not Enough: How Accuracy Metrics have Hurt Recommender Systems", in Ext. Abs. of the 2006 ACM Conference on Human Factors in Computing Systems, pp. 997-1001, 2006.
[20] S.M. McNee, J. Riedl and J.A. Konstan, "Making Recommendations Better: An Analytic Model for Human-Recommender Interaction", in Ext. Abs. of the 2006 ACM Conference on Human Factors in Computing Systems, pp. 1003-1008, 2006.
[21] S.M. McNee, I. Albert, D. Cosley, P. Gopalkrishnan, S.K. Lam, A.M. Rashid, J.A. Konstan and J. Riedl, "On the Recommending of Citations for Research Papers", in Proc. of the 2002 ACM Conference on Computer Supported Cooperative Work, pp. 116-125, 2002.
[22] P. Melville, R.J. Mooney and R. Nagarajan, "Content-Boosted Collaborative Filtering for Improved Recommendations", in Proc. of the Eighteenth National Conference on Artificial Intelligence, pp. 187-192, 2002.
[23] National Institutes of Health (NIH), "Entrez PubMed", http://www.ncbi.nlm.nih.gov/entrez/, 2006.
[24] P. Pirolli, "Computational Models of Information Scent-Following in a very Large Browsable Text Collection", in CHI '97: Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3-10, 1997.
[25] M.F. Porter, "An algorithm for suffix stripping", in Readings in Information Retrieval, K.S. Jones and P. Willett eds., San Francisco, CA: Morgan Kaufmann Publishers Inc., 1997, ch. 6, pp. 313-316.
[26] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "GroupLens: An Open Architecture for Collaborative Filtering of Netnews", in Proc. of the 1994 ACM Conf. on Computer Supported Cooperative Work, pp. 175-186, 1994.
[27] B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "Item-Based Collaborative Filtering Recommendation Algorithms", in Proc. of the Tenth International Conference on the World Wide Web, pp. 285-295, 2001.
[28] U. Shardanand and P. Maes, "Social Information Filtering: Algorithms for Automating "Word of Mouth"", in Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210-217, 1995.
[29] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, "Analysis of a very Large Web Search Engine Query Log", SIGIR Forum, vol. 33(1), pp. 6-12, 1999.
[30] R.S. Taylor, "Question-Negotiation and Information Seeking in Libraries", College and Research Libraries, vol. 29, pp. 178-194, May 1968.
[31] R. Torres, S.M. McNee, M. Abel, J.A. Konstan and J. Riedl, "Enhancing Digital Libraries with TechLens+", in Proc. of the 2004 Joint ACM/IEEE Conference on Digital Libraries, pp. 228-236, 2004.
