Discovering User Interests From Web Browsing Behav
Ting-Peng Liang
National Sun Yat-sen University
All content following this page was uploaded by Ting-Peng Liang on 05 June 2014.
Rucker and Polanco [16] proposed a system that analyzes the structure of bookmarks to determine the interests of an individual. In fact, bookmarks reveal not only the person’s interests but also the way information is organized.

3. A Time-based Approach to User Profiling

The key to information filtering and recommendation is user profiling. In general, user profiles can be obtained from self-reporting or analysis of browsing behavior. Although self-reporting may be considered more accurate in some cases, it is often tedious and copes poorly with dynamic changes. Therefore, much research has focused on identifying user interests from the browsing data collected on-line. In this section, we present a time-based approach that determines user interests based on the time they spent viewing objects with known attributes. The underlying assumption of the method is that the more an object contains the information of interest to a user, the longer the user would view the object. Because errors may exist when the browsing time is too long or too short, we use the average reading speed and recency weight to adjust the interest level. The method can be described briefly in the following:
The reader profile is represented as a combination of attributes and interest levels. The interest level of a particular attribute is determined by the time previously spent browsing items that have the attribute and may be adjusted by recency and other factors. Therefore, given an object Rj[Aij(pij)], the approach is defined as follows:
Definition 1: Interest level of an object
The interest level of an object is an indicator of the extent to which a user is interested in the object. The interest level is calculated by the following equation:
σj(Rj) = f(Tj/Tj*),
Where: σj(Rj) is the interest level of the object, Rj;
Tj is the time spent by the user on reading Rj; T0 <= Tj <= Tu, T0 is the lower bound for a browsing time to be considered reasonable, and Tu is the upper bound for a browsing time to be considered reasonable.
Tj* is the estimated reasonable reading time based on previous average reading speed;
f is a function that calculates the interest level. It may be a linear or a sigmoid function.
Definition 2: Interest level of an attribute
The interest level of an attribute is the aggregation of the interest levels of objects that have the attribute. It is calculated as follows:
σi = Σj[σj(Rj)*Aij(pij)], where
σi is the interest level of attribute Ai;
Aij(pij) shows that the likelihood of object Rj having the attribute Ai is pij.
Definition 3: Recency adjustment
The recency adjustment gives higher weights to objects accessed recently than to those accessed earlier. The recency weight of an object can be calculated by the following equation:
γj(Rj) = g(Dj),
Where: γj(Rj) is the recency weight of the object, Rj;
Dj is the number of days elapsed since reading Rj; D0 <= Dj <= Du, D0 is the lower bound for an elapsed day to be considered for adjustment, and Du is the upper bound for an elapsed day to be considered for adjustment.
g is a function that calculates the recency weight. It may be a linear or a sigmoid function.
Definition 4: Adjusted interest level of an attribute
If the interest level needs to be adjusted by the recency weight, then the interest level becomes:
σi = Σj[σj(Rj)*γj(Rj)*Aij(pij)].
Definition 5: User profile
The profile of a user is a combination of object attributes and their associated interest levels. It can be represented as: U([Ai(σi)]), where U is a user and Ai is a set of attributes.

4. Application to Personal News Recommendation

In this section, an application of the time-based mechanism to personal news recommendation over the Internet is described. News services are popular because the Internet provides an efficient way for news distribution. It can also be personalized at a very low cost. Therefore, it is an excellent domain for testing the method. Since each news report contains certain characteristics, the system needs a module to determine the attributes of the content that are of interest. Therefore, four modules are essential: structure analysis, reader profile analysis, rating for recommendation, and learning (as shown in Figure 1).

4.1 Structure analysis

The first step for personalized news service is to analyze the product, i.e., the structure of the news. The foundation of structure analysis is to identify keywords in a report and to build the keyword dictionary. Keywords in each report are identified based on the property of the words. Miller [9] developed a comprehensive keyword dictionary (called WordNet) that includes many nouns and verbs. McQuail [8] points out that keywords are those words associated with who, what, where, when, and why. Names mentioned in a report are also important keywords [14,15,19].
[Figure 1 (diagram not reproduced). Boxes: News, Structure Analysis, Rating, Learning, Recommended News]
In our mechanism, we use nouns with a particular emphasis on the role and issues mentioned in the report. In the following “state of the web” report, for example, the keywords marked in italics can be identified.

State of the Web: Inside.com Says 'No Thanks' to Yahoo!
Could this be the beginning of the end of the portal strategy as we know it?
By James J. Cramer
You might not have noticed that Inside.com isn't on Yahoo! (Nasdaq: YHOO - news) anymore. You may never even have heard of Inside.com. But you have to understand that this lone decision, by Steve Brill, the head of Inside.com, is sending shock waves throughout the portal world. Here's why.
When the Web first started to be commercial, outfits like Yahoo! and America Online needed to have content to wrap around their ads. They first tried to grow it and pay for it. Then, an epiphany struck Bob Pittman at AOL: Content providers needed eyeballs so badly that they would pay AOL to be there! That shift in strategy was the death knell for almost all original content providers on the Web because if you didn't have money, you couldn't pay, and the only people who could pay were established players and players that tapped the public markets. (Source: http://www.yahoo.com/)

After identifying keywords, we further analyze the position and frequency of keywords. Major steps include:
(1) Determine whether synonyms exist, including America Online = AOL, President Clinton = Bill Clinton, Yahoo = YHOO, etc.
(2) Calculate the frequency of keywords. In the previous example, the words Inside.com and Yahoo each appear four times and AOL appears three times.
(3) Position adjustment: Since the title and first paragraph often contain more important information in a report, the frequencies of keywords that appear in the title are multiplied by 10 and those of keywords that appear in the first paragraph are multiplied by 3. As a result, the adjusted frequency of Inside.com is 19, while Yahoo is 17, and AOL is 3. This significantly differentiates the relative importance of different keywords in the report.
(4) Noise elimination: To simplify the structure, a minimum frequency may be set to remove unimportant words. For instance, we may set a rule that keywords whose adjusted frequency is lower than 20% of that of the most important keyword are removed. Then, only keywords whose adjusted frequencies are higher than 4 are considered valid and will be recorded to represent the structure of the report.
(5) Conversion to ratios: All keyword frequencies are then converted into a ratio, which is the frequency of a keyword/sum(frequencies of all keywords). The structure of the report is the collection of valid keywords along with their respective frequency ratios. The structure of the above example is [Inside.com (.31), Yahoo (.28), portal (.21), web (.20)].

4.2 Analysis of Reader Profile

Based on the algorithm specified in Section 3, we can analyze the interest profile of a user in the following procedures:
(1) Calculate the average reading speed of the user: The computer keeps a record of the time a user spent reading each report. These data are aggregated and
adjusted by the length of the reports. The average reading speed is calculated by dividing the total number of words by the total reading time.
(2) Calculate the interest level: The interest level of a report is calculated from the time spent in reading it. Interest level is represented by the ratio of the actual reading time to the estimated reading time, where the estimated reading time = total words in the report/average reading speed. A mapping table is built to determine the interest levels. In our system, we set a range between 3 and 250 seconds as the reasonable range for reading a news report. The incidents outside this range are considered exceptional and are assigned an interest level of 0. If the time ratio of a reading is below .25 (i.e., the actual reading time is less than 25% of the estimated reading time), the case is considered fast browsing and is assigned an interest level of 1. A ratio between .25 and .75 is assigned the value of 2, between .75 and 1.25 is assigned 3, between 1.25 and 1.75 is assigned 4, and above 1.75 is assigned 5.
(3) Conduct recency adjustment: Since we can reasonably assume that reports read recently reflect a reader’s interest more accurately, the system gives a weight of 2 to reports that were read within D1 days, 1.75 to reports read between D1 and D2 days, 1.5 to those read between D2 and D3 days, and 1 to those read more than D3 days ago.
(4) Calculate the adjusted interest level: The adjusted interest level of a reader on a report is the product of the interest level from step (2) and the recency weight from step (3).
(5) Build the profile: The interest profile of a user is obtained by multiplying the news structure by the interest level. For example, suppose a user has read two reports that have the following structure and his interest levels of the reports A001 and A015 are 4 and 2 respectively, then the resulting interest profile is [Web (2.0), Yahoo (1.72), Inside.com (1.24), Portal (.84), Merger (.20)]. This indicates that the user is most interested in reports related to Web, followed by Yahoo, Inside.com, Portal, and Merger.
[Example] The following are structures of two reports that had been read by a user:
[A001: Inside.com (.31), Yahoo (.28), Portal (.21), Web (.20)]
[A015: Web (.60), Yahoo (.30), Merger (.10)]

4.3 Rating and Recommendation

Rating and recommendation determine the matching between a new report and a reader. If the matching level is higher than the threshold value, the report will be recommended to the reader. Otherwise, it is dropped. Steps for matching reports with a reader include:
(1) Determine the structure of the report. For example, we have a new report A032, whose structure is [A032: Portal (0.7), Merger (0.3)].
(2) Calculate the matching level: The matching level is calculated by aggregating the interest levels of different keywords. In the example, the matching level is 0.648 (=0.7*0.84+0.3*0.2).
(3) Recommend news based on matching levels: A hurdle can be set to screen out reports with low matching levels. The reports whose matching levels are higher than the threshold value will be recommended. In this step, guidelines on the number of news recommended and the distribution of news among different categories can be used to enhance the accuracy of recommendation.

4.4 Learning

The learning module is designed to adjust various weights. It is not the focus of the paper and hence is omitted.

5. Empirical Study

In order to evaluate the news recommendation mechanism, an experimental study was performed. The benchmarks were the regular headline approach (HLA) and the self-reported interests (SRI) approach. Prototype systems that present news by the HLA, SRI, and browsing behavior analysis (BBA) approaches were developed for the experiment.
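As a rough illustration of the structure analysis in Section 4.1 (position adjustment, noise elimination, and conversion to ratios), the following Python sketch reproduces the numbers of the “state of the web” example. The function names, the split of each keyword's occurrences across title, first paragraph, and remaining text, and the adjusted frequencies of 13 for portal and 12 for web are assumptions made for illustration; the last two were chosen only so that the ratios agree with the structure reported in the text.

```python
# Weights and cutoff as stated in Section 4.1.
TITLE_WEIGHT = 10            # occurrences in the title
FIRST_PARAGRAPH_WEIGHT = 3   # occurrences in the first paragraph
NOISE_CUTOFF = 0.20          # fraction of the top keyword's adjusted frequency


def adjusted_frequency(in_title, in_first_paragraph, elsewhere):
    """Position-adjusted frequency of one keyword (step 3)."""
    return (TITLE_WEIGHT * in_title
            + FIRST_PARAGRAPH_WEIGHT * in_first_paragraph
            + elsewhere)


def report_structure(adjusted):
    """Noise elimination (step 4), then conversion to ratios (step 5)."""
    top = max(adjusted.values())
    valid = {k: v for k, v in adjusted.items() if v >= NOISE_CUTOFF * top}
    total = sum(valid.values())
    return {k: round(v / total, 2) for k, v in valid.items()}


# (title, first paragraph, elsewhere) counts after merging synonyms.
# Assumed split, chosen to match the adjusted frequencies in the text.
counts = {
    "Inside.com": (1, 3, 0),
    "Yahoo": (1, 2, 1),
    "AOL": (0, 0, 3),
}
adjusted = {k: adjusted_frequency(*c) for k, c in counts.items()}

# Assumed adjusted frequencies for the two remaining keywords.
adjusted.update({"portal": 13, "web": 12})

structure = report_structure(adjusted)
print(adjusted)   # Inside.com 19, Yahoo 17, AOL 3, portal 13, web 12
print(structure)  # AOL falls below 20% of 19 and is dropped
```

Under these assumptions the sketch yields the ratios .31, .28, .21, and .20 for Inside.com, Yahoo, portal, and web, as in the example of Section 4.1.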
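The reader profile analysis of Section 4.2 and the matching step of Section 4.3 can be sketched in Python as follows. This is a minimal illustration, not the system used in the experiment: the function names are hypothetical, the handling of values that fall exactly on a band boundary is an assumption, and the values used for D1, D2, D3, the reading speed, and the recommendation threshold are placeholders, since the text does not state them.

```python
def interest_level(seconds, words, words_per_second,
                   min_seconds=3, max_seconds=250):
    """Map one reading to an interest level from 0 to 5 (Section 4.2, step 2)."""
    if not min_seconds <= seconds <= max_seconds:
        return 0  # outside the reasonable range: exceptional reading
    ratio = seconds / (words / words_per_second)  # actual / estimated time
    # Upper bounds of the ratio bands for levels 1 to 4 (half-open bands assumed).
    for level, upper in enumerate((0.25, 0.75, 1.25, 1.75), start=1):
        if ratio < upper:
            return level
    return 5


def recency_weight(days_ago, d1, d2, d3):
    """Recency weight (step 3); d1 < d2 < d3 are system parameters."""
    if days_ago <= d1:
        return 2.0
    if days_ago <= d2:
        return 1.75
    if days_ago <= d3:
        return 1.5
    return 1.0


def build_profile(readings):
    """Sum level * ratio per keyword over (structure, level) pairs (step 5)."""
    profile = {}
    for structure, level in readings:
        for keyword, ratio in structure.items():
            profile[keyword] = profile.get(keyword, 0.0) + level * ratio
    return {k: round(v, 2) for k, v in profile.items()}


def matching_level(structure, profile):
    """Aggregate the profile's interest levels over a new report's keywords."""
    return round(sum(r * profile.get(k, 0.0) for k, r in structure.items()), 3)


# Placeholder reading: 400-word report read in 130 s at 4 words per second.
level = interest_level(130, 400, 4.0)               # ratio 1.3 -> level 4
adjusted = level * recency_weight(1, d1=2, d2=5, d3=10)  # step 4, placeholder days

# Example from the text: interest levels 4 (A001) and 2 (A015).
a001 = {"Inside.com": 0.31, "Yahoo": 0.28, "Portal": 0.21, "Web": 0.20}
a015 = {"Web": 0.60, "Yahoo": 0.30, "Merger": 0.10}
profile = build_profile([(a001, 4), (a015, 2)])

a032 = {"Portal": 0.7, "Merger": 0.3}
match = matching_level(a032, profile)               # 0.7*0.84 + 0.3*0.2
THRESHOLD = 0.5                                     # hypothetical hurdle
print(profile, match, match > THRESHOLD)
```

With the two example reports, the sketch gives the profile [Web (2.0), Yahoo (1.72), Inside.com (1.24), Portal (.84), Merger (.20)] and a matching level of 0.648 for report A032, in agreement with the worked example.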
satisfaction includes four dimensions: information content, customized services, user interface, and system value. Satisfaction on information content is measured by three questions adapted from Doll [3]: (1) whether the system finds the news that the reader wants to read, (2) whether the system filters out the news that the reader does not want, and (3) whether the system captures the right category of interest to the reader.
Satisfaction on customized services is measured by three questions adapted from the personalized service portion of SERVQUAL [12]. They are: (1) whether the system provides personal attention, (2) whether the system captures my interests, and (3) whether the system provides customized services. Satisfaction on user interface is measured by four questions adapted from Doll [3]. They are (1) whether the system is easy to use, (2) whether the system is friendly, (3) whether the interface is properly formatted, and (4) whether the presentation is clear. System value asks about whether the system is useful and is quick to find interesting news. Table 3 summarizes the measured dimensions. Finally, a question is designed to assess the overall satisfaction of the user on the system. All questions are on a 7-point Likert scale with 1 being least agreed and 7 being most agreed.

Table 3. Dimensions for Measuring System Satisfaction
Information content     Customized service      User interface          System value
- Find the wanted       - Personal attention    - Ease of use           - Useful
- Filter the unwanted   - Capture interests     - Friendly              - Quicker
- Find right category   - Customized service    - Right format
                                                - Clear presentation

The reliability test of the questionnaire using Cronbach’s alpha shows that the instrument is generally acceptable because most alpha values are higher than 0.6 (in Table 4).

Table 4. The Reliability Data
Dimension              Cronbach α
Information content    0.7018
Customized services    0.7714
User interface         0.8861
System value           0.6792

5.4 Experimental Design and Procedures

Since collecting browsing data requires consecutive uses of a website, the subjects were asked to participate in the experiment for four days. A total of 96 volunteer subjects were recruited at the beginning. They were divided into two groups, one of which viewed HLA and SRI (Group I) and the other viewed HLA and BBA (Group II). Nine of them dropped out during the experiment. So, we had a total of 87 effective subjects, with 43 in Group I and 44 in Group II.
Subjects in both groups were asked to view HLA in the first three days and fill out a satisfaction questionnaire after the second day. On the fourth day, subjects in Group I viewed SRI and those in Group II viewed BBA. They all filled out questionnaires again to indicate their satisfaction with the experimental system. Due to the difference in the recommendation approach, subjects in Group I needed to indicate their interests in the report on a 1-7 scale (7 to be the most interesting) after each reading, while the subjects in Group II did not have to do so.
In order to be close to the real world, the news adopted for the experiment was actual news provided by China Times (www.chinatimes.com.tw). During the experimental period (June 7 – June 10, 2000), all news available on the website of China Times before 9:00 am were downloaded to the experimental system, organized based on different approaches, and then presented to the experimental subjects. The average number of news per day was 255, distributed into 13 categories, with an average of 44 chosen as headline news by the editors and put in the homepage in the HLA approach. Table 5 shows the distribution of reports in the 13 categories: headline news (HDL), politics, social, international, China, finance, stock, technology, medical, entertainment, sports, art, and comments.
The experimental procedures include the following:
Day 1: The subject logged onto the website to read the description of the experiment (approximately 5 minutes), filled out personal data (5 minutes), learned the system (5 minutes), and read the news (20 minutes) arranged by the HLA approach.
Day 2: Logged onto the system, read the news for 20 minutes, and filled out the satisfaction questionnaire.
Day 3: Logged onto the system, read the news for 20 minutes.
Day 4: Logged onto the system, read the news for 20 minutes that were arranged by the SRI approach for Group I and by the BBA approach for Group II, and then filled out the satisfaction questionnaire.
Table 5. Distribution of News in the Experiment
       HDL  Polit  Social  Int’l  China  Finance  Stock  Tech  Med  Ent.  Sport  Art  Comm  Total
6/7     41     12      23     13     13       45     12     8   10    14     15   10    14    230
6/8     45      9      24      7     16       49     36    13    8    13     21   10    14    265
6/9     45     11      22     22     16       48     33    11   10    17     24    9    11    279
6/10    44     11      21     19     16       33     22     9    9    14     21   11    14    244
Avg     44     11      23     15     15       44     26    10    9    15     20   10    13    255
Table 6 shows that the most preferred category was entertainment news, chosen by 27.6% of the
subjects. The least preferred categories were art and comments. Table 7 shows that the major reason
for the subjects to use web news was that it was free. The second reason was their fondness for using
the web. Table 8 shows that the web was the most favored news source for 40% of the subjects,
whereas TV was the next most favored.
Descriptive data (mean and standard deviation) of the browsing behavior are shown in Table 9.
The system recorded the number of reports read by the subjects (NRR), number of news accepted by
the subjects (NRA), number of news shown on the homepage (NNS), system processing time (SPT),
selecting time (ST), reading time (RT), and time for rating the read news (TRN, for HLA and SRI).
The statistics indicate that the HLA system presents 41 news titles to all subjects in the homepage,
whereas SRI and BBA present an average of 17.77 and 17.61 recommended news titles in the
homepage to each subject, respectively. Standard deviations among the subjects are 7.42 and 5.97 for
SRI and BBA. That is, the recommendation systems are more selective than the standard headline
news version. This allows the subject to spend more time on reading news (RT) and less time on
selecting news (ST). The objective performance indices calculated from the browsing behavior are
shown in Table 10.
Since all subjects used the HLA system before using SRI or BBA, the paired t-test is used to test
the performance difference between SRI and HLA, and between BBA and HLA, but the independent t-test
is used to test the difference between SRI and BBA.
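The paired and independent t statistics mentioned above can be sketched in Python as follows. This is a minimal illustration on made-up ratings, not the data of the experiment; the function names are hypothetical, and the pooled-variance form of the independent test is an assumption, since the text does not say which variant was used.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for the same subjects
    measured under two systems (second minus first)."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def independent_t(x, y):
    """Pooled-variance two-sample t statistic and degrees of freedom
    for two separate groups of subjects."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Made-up satisfaction ratings for four subjects under two systems.
hla_scores = [5, 4, 6, 5]
new_scores = [6, 5, 6, 7]
t_paired, df_paired = paired_t(hla_scores, new_scores)

# Made-up scores for two separate groups.
group_one = [1, 2, 3]
group_two = [2, 3, 4, 5]
t_indep, df_indep = independent_t(group_one, group_two)

print(t_paired, df_paired)
print(t_indep, df_indep)
```

A paired test on n subjects has n - 1 degrees of freedom, which gives 42 for the 43 subjects of Group I and 43 for the 44 subjects of Group II; an independent test across both groups has 43 + 44 - 2 = 85.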
Table 12. Results of paired t-test on Satisfaction for SRI and HLA (df=42)
                Mean SRI   Mean HLA   Difference   t-value   Significance
Content         5.8488     5.2558     0.5930***    4.379     0.000
Customization   5.7558     4.6628     1.0930***    7.474     0.000
Interface       5.6802     5.4244     0.2558**     2.632     0.012
Value           5.9767     5.3488     0.6279***    3.699     0.001
Overall         5.8605     5.2558     0.6047***    6.800     0.000
Note: ** denotes p<0.05; *** denotes p<0.01
Table 13. Results of paired t-test on Performance for BBA and HLA (df=43)
       Mean BBA   Mean HLA   Difference   t-value   Significance
AR     0.3786     0.0690     0.3099***    9.244     0.000
NHR    0.4573     0.2236     0.2337***    3.939     0.000
THR    0.4622     0.2440     0.2183***    3.173     0.003
EUR    0.9819     0.9729     0.009***     5.15      0.000
ERR    0.8001     0.7711     0.029*       1.92      0.061
Note: * denotes p<0.10; *** denotes p<0.01
Table 14. Results of paired t-test on Satisfaction for BBA and HLA (df=43)
                Mean BBA   Mean HLA   Difference   t-value   Significance
Content         5.6250     5.3068     0.3182*      1.956     0.057
Customization   5.4091     4.5682     0.8409***    4.650     0.000
Interface       5.5739     5.3125     0.2614**     2.168     0.036
Value           5.5455     5.4773     0.0678       0.380     0.706
Overall         5.7727     5.3409     0.4318***    3.772     0.000
Note: * denotes p<0.10; ** denotes p<0.05; *** denotes p<0.01
Table 15. Results of t-test on Performance for SRI and BBA (df=85)
       Mean SRI   Mean BBA   Difference   t-value   Significance
AR     0.3943     0.3787     0.016        0.332     0.741
NHR    0.4539     0.4573     0.003        0.069     0.945
THR    0.4469     0.4622     0.0153       0.272     0.786
EUR    0.9332     0.9819     0.048***     11.608    0.000
ERR    0.7749     0.8001     0.025        1.546     0.126
Note: * denotes p<0.10; *** denotes p<0.01
Table 16. Results of t-test on Satisfaction for SRI and BBA (df=85)
                Mean SRI   Mean BBA   Difference   t-value   Significance
Content         5.8488     5.6250     0.2238       1.506     0.136
Customization   5.7558     5.4091     0.3467**     2.004     0.048
Interface       5.6802     5.5739     0.1064       0.633     0.529
Value           5.9767     5.5455     0.2933*      1.740     0.086
Overall         5.8605     5.7727     0.087        0.377     0.512
Note: * denotes p<0.10; ** denotes p<0.05; *** denotes p<0.01