CI-6226
Lecture 12. Link Analysis
Information Retrieval and Analysis
Vasily Sidorov
1
What Should We Learn Today?
▪ Anchor text: What exactly are links on the web and
why are they important for IR?
▪ Citation analysis: the mathematical foundation of
PageRank and link-based ranking
▪ PageRank: the original algorithm that was used for
link-based ranking on the web
▪ Hubs & Authorities: an alternative link-based ranking
algorithm
2
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
3
The Web as a Directed Graph
▪ Assumption 1: A hyperlink is a quality signal
▪ The hyperlink 𝑑1 → 𝑑2 indicates that 𝑑1 ’s author deems 𝑑2 high-
quality and relevant
▪ Assumption 2: The anchor text describes the content of 𝑑2
▪ We use “anchor text” somewhat loosely here for “the text
surrounding the hyperlink”
▪ Example:
You can find cheap cars <a href="http://…">here</a>
anchor text: “You can find cheap cars here”
[Figure: page 𝑑1 contains the hyperlink and its anchor text; the link points to page 𝑑2]
4
[text of 𝑑2 ] vs. [text of 𝑑2 ] + [anchor text → 𝑑2 ]
▪ Searching on [text of 𝑑2 ] + [anchor
text → 𝑑2 ] is often more effective
than searching on [text of 𝑑2 ] only
▪ Example: Query IBM
▪ Matches IBM’s copyright page
▪ Matches many spam pages
▪ Matches IBM Wikipedia article
▪ May not match IBM home page,
as it’s mainly graphics
▪ Searching on [anchor text → 𝑑2 ]
works better here
5
[Figure: many pages with anchor text containing “IBM”, all pointing to www.ibm.com]
6
Indexing Anchor Text
▪ Anchor text is often a better description of a page’s
content than the page itself
▪ Anchor text can be weighted more highly than document
text
▪ Based on Assumptions 1 & 2
7
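To make the weighting idea concrete, here is a small Python sketch, not from the slides: a hypothetical combined score in which anchor-text matches count more than body-text matches. The helpers tf_body and tf_anchor and the weight values are assumptions for illustration only.

```python
# Hypothetical scoring sketch: anchor text weighted above body text.
# tf_body(page, t) and tf_anchor(page, t) are assumed term-frequency
# lookups over the page body and over anchor text pointing to the page.
def score(query_terms, page, tf_body, tf_anchor,
          w_body: float = 1.0, w_anchor: float = 2.0) -> float:
    return sum(
        w_body * tf_body(page, t) + w_anchor * tf_anchor(page, t)
        for t in query_terms
    )
```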
Google Bombs
▪ A Google bomb is a search with “bad” results due to
maliciously manipulated anchor text, e.g.:
▪ In 1999, a search for “more evil than Satan himself” (or
“evil empire”) resulted in the Microsoft homepage as the
top result
▪ In 2000, the text “dumb ***” was linked to a site selling
George W. Bush-related merchandise
▪ Can score anchor text with weight depending on the
authority of the anchor page’s website
▪ E.g., if we were to assume that content from
cnn.com or yahoo.com is authoritative, then trust
the anchor text from them
8
Examples
9
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
10
Origins of PageRank: Citation analysis (1)
▪ Citation analysis: analysis of citations in the scientific
literature
▪ Example citation: “Miller (2001) has shown that physical activity
alters the metabolism of estrogens”
▪ We can view “Miller (2001)” as a hyperlink linking two scientific
articles
▪ One application of these “hyperlinks” in the scientific
literature:
▪ Measure the similarity of two articles by the overlap of other
articles citing them
▪ This is called co-citation similarity
▪ Co-citation similarity on the web: Google’s “Similar” feature
11
Origins of PageRank: Citation analysis (2)
▪ Another application:
▪ Citation frequency can be used to measure the impact of
an article/author/journal
12
Origins of PageRank: Citation analysis (2)
▪ Another application
▪ Citation frequency can be used to measure the impact of an
article
▪ Simplest measure: Each article gets one vote – not very accurate
▪ On the web
▪ Citation frequency = inlink count
▪ A high inlink count does not necessarily mean high quality ...
mainly because of link spam
▪ Better measure
▪ weighted citation frequency or citation rank
▪ An article’s vote is weighted according to its citation impact
▪ This is basically PageRank
▪ A PageRank-like approach was first developed in the context
of citation analysis by Pinski and Narin (1976)
13
Origins of PageRank: Summary
▪ We can use the same formal representation for
▪ citations in the scientific literature
▪ hyperlinks on the web
▪ Appropriately weighted citation frequency is an
excellent measure of quality ...
▪ ... both for web pages and for scientific publications.
▪ Next: PageRank algorithm for computing weighted
citation frequency on the web
14
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
15
Query-Independent Ordering
▪ First generation:
▪ Using link counts as simple measures of popularity
▪ Two basic suggestions:
▪ Undirected popularity:
▪ Each page gets a score = the number of in-links plus the number of out-links (e.g., 3 + 2 = 5 for page A in the figure)
▪ Directed popularity:
▪ Score of a page = number of its in-links (3 for page A)
[Figure: example graph with pages A–F; page A has 3 in-links and 2 out-links]
16
PageRank
▪ PageRank (𝑃𝑅) of page 𝐶:
𝑃𝑅(𝐶) = 𝑃𝑅(𝐴)/2 + 𝑃𝑅(𝐵)/1
▪ More generally,
𝑃𝑅(𝑢) = ∑𝑣∈𝐵𝑢 𝑃𝑅(𝑣)/𝐿𝑣 ,
where 𝐵𝑢 is the set of pages that point to 𝑢, and
𝐿𝑣 is the number of outgoing links from page 𝑣 (not counting duplicate links)
[Figure: three-page graph — 𝐴 links to 𝐵 and 𝐶, 𝐵 links to 𝐶, 𝐶 links to 𝐴]
17
PageRank: 𝑃𝑅(𝐶) = 𝑃𝑅(𝐴)/2 + 𝑃𝑅(𝐵)/1
▪ Don’t know PageRank values at start
▪ Assume equal values (1/3 in this case), then iterate:
▪ First iteration:
▪ 𝑃𝑅(𝐶) = 0.33/2 + 0.33 = 0.5, 𝑃𝑅(𝐴) = 0.33, 𝑃𝑅(𝐵) = 0.17
▪ Second iteration:
▪ 𝑃𝑅(𝐶) = 0.33/2 + 0.17 = 0.33, 𝑃𝑅(𝐴) = 0.5, 𝑃𝑅(𝐵) = 0.17
▪ Third iteration:
▪ 𝑃𝑅(𝐶) = 0.42, 𝑃𝑅(𝐴) = 0.33, 𝑃𝑅(𝐵) = 0.25
▪ Converges to
𝑃𝑅(𝐶) = 0.4, 𝑃𝑅(𝐴) = 0.4, 𝑃𝑅(𝐵) = 0.2
18
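The iteration above is easy to run in code. A minimal Python sketch (not from the slides), using the link structure 𝐴→𝐵, 𝐴→𝐶, 𝐵→𝐶, 𝐶→𝐴 inferred from the figure:

```python
# Iterate the PageRank update for the three-page example until it
# converges; link structure assumed from the figure.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {page: 1 / 3 for page in links}           # start with equal values

for _ in range(50):
    new_pr = {
        # PR(u) = sum over pages v that link to u of PR(v) / L_v
        u: sum(pr[v] / len(out) for v, out in links.items() if u in out)
        for u in links
    }
    if all(abs(new_pr[u] - pr[u]) < 1e-9 for u in links):
        break
    pr = new_pr

print({u: round(score, 2) for u, score in pr.items()})
# converges to {'A': 0.4, 'B': 0.2, 'C': 0.4}
```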
Model Behind PageRank: Random Walk
▪ Imagine a “web surfer” doing a random walk
▪ Start at a random page
▪ At each step, go out of the current page along one of the
links on that page, equiprobably
▪ If it visits some nodes more often than others, these are nodes with many links coming in from other frequently visited nodes
▪ long-term visit rate
▪ PageRank = long-term visit rate
▪ Pages visited more often in the random walk are more important
[Figure: small example web graph with pages A, B, C, D]
19
Formalization of Random Walk: Markov
Chains
▪ A Markov chain consists of 𝑁 states, plus an 𝑁 × 𝑁
transition probability matrix 𝑃 (state = page)
▪ At each step, we are on exactly one of the pages
▪ For 1 ≤ 𝑖, 𝑗 ≤ 𝑁, the matrix entry 𝑃𝑖𝑗 tells us the
probability of 𝑗 being the next page, given we are currently
on page 𝑖
▪ Clearly, for all 𝑖: ∑𝑗 𝑃𝑖𝑗 = 1, summing 𝑗 from 1 to 𝑁 (each row of 𝑃 sums to 1)
20
Example graph and its link matrix
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 1 1 0 0 0 0
𝒅𝟐 1 0 1 1 0 0 0
𝒅𝟑 0 0 0 1 1 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 1 1
𝒅𝟔 0 0 0 1 1 0 1
21
Transition Probability Matrix
Link matrix:
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 1 1 0 0 0 0
𝒅𝟐 1 0 1 1 0 0 0
𝒅𝟑 0 0 0 1 1 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 1 1
𝒅𝟔 0 0 0 1 1 0 1

Transition probability matrix (each row of the link matrix divided by its row sum):
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 0.5 0.5 0 0 0 0
𝒅𝟐 0.33 0 0.33 0.33 0 0 0
𝒅𝟑 0 0 0 0.5 0.5 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 0.5 0.5
𝒅𝟔 0 0 0 0.33 0.33 0 0.33

∑𝑗 𝑃𝑖𝑗 = 1 for every row 𝑖
22
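The transition probability matrix can be derived mechanically from the link matrix. A small numpy sketch (assuming, as in this graph, that no row of the link matrix is all zeros):

```python
import numpy as np

# Link matrix of the example graph (slide 21).
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
], dtype=float)

# Divide each row by its out-degree to make it row-stochastic.
P = A / A.sum(axis=1, keepdims=True)
print(np.round(P, 2))
```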
Long-Term Visit Rate
▪ Recall: PageRank = long-term visit rate
▪ Long-term visit rate of page 𝑑 is the probability that a web
surfer is at page 𝑑 at a given point in time.
▪ Next:
▪ what properties of the web graph must hold for the long-
term visit rate to be well defined?
▪ First, a special case: The web graph must not contain
dead ends.
23
Dead Ends
▪ The reality: the web is full of dead ends
▪ Random walk can get stuck in dead ends
▪ If there are dead ends, long-term visit rates are not
well-defined
[Figure: a random walk arriving at a page with no out-links]
24
Teleporting — to get out of dead ends
▪ Teleporting:
▪ At a dead end, jump to a random web page with probability 1Τ𝑁
▪ At a non-dead end, with probability 10%, jump to a random web
page
▪ to each with a probability of 0.1Τ𝑁
▪ 10% is a parameter, the teleportation rate
▪ With remaining probability (90%), go out on a random hyperlink
▪ Example: if the page has 4 outgoing links, randomly choose one of them with probability (1 − 0.1)/4 = 0.225
▪ Note: “jumping” from dead end is independent of
teleportation rate
25
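Putting the two teleporting rules together, a possible numpy sketch (the helper name and structure are assumptions, not from the slides):

```python
import numpy as np

def add_teleporting(A: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Build the teleporting transition matrix from link matrix A;
    alpha is the teleportation rate."""
    N = len(A)
    P = np.zeros_like(A, dtype=float)
    for i, row in enumerate(A):
        if row.sum() == 0:
            # Dead end: jump to every page with probability 1/N,
            # independent of the teleportation rate.
            P[i] = 1.0 / N
        else:
            # Follow a random out-link with probability 1 - alpha;
            # teleport to a random page with probability alpha.
            P[i] = (1 - alpha) * row / row.sum() + alpha / N
    return P
```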
Results of Teleporting
▪ With teleporting, we cannot get stuck in a dead end.
▪ But even without dead ends, a graph may not have
well-defined long-term visit rates
▪ More generally, we require that the Markov chain be
ergodic
▪ Ergodic: positive recurrent aperiodic state of stochastic
systems; tending in probability to a limiting form that is
independent of the initial conditions
26
Ergodic Markov Chains
▪ A Markov chain is ergodic if
▪ It has a path from any state to any other
▪ there exists a positive integer 𝑇0 such that
▪ there is a non-zero probability of ending up in any state at any time 𝑡 > 𝑇0
▪ Two technical conditions
▪ Irreducibility
▪ Roughly: there is a path from any page to any other page
▪ Aperiodicity
▪ Roughly: The pages cannot be partitioned into sets such that all
state transitions occur cyclically from one set to another
27
Ergodic Markov Chains
▪ Theorem
▪ For any ergodic Markov chain, there is a stable long-term
visit rate for each state
▪ This is the steady-state probability distribution
▪ Over a long time period, we visit each state in
proportion to this rate. It doesn’t matter where we
start
▪ Teleporting makes the web graph ergodic
▪ WebGraph + teleporting has a steady-state probability
distribution
▪ Each page in the WebGraph + teleporting has a PageRank
28
Formalization of “visit”: Probability Vector
▪ A probability vector 𝑥 = (𝑥1, …, 𝑥𝑁) tells us where the random walk is at any point in time. In this example, we are now at state 𝑖:
( 0 0 0 ⋯ 1 ⋯ 0 0 0 )
  1 2 3 ⋯ 𝑖 ⋯ 𝑁−2 𝑁−1 𝑁
▪ More generally, the random walk is on page 𝑖 with probability 𝑥𝑖 , and ∑𝑖 𝑥𝑖 = 1:
( 0.05 0.01 0 ⋯ 0.2 ⋯ 0.01 0.05 0.03 )
  1 2 3 ⋯ 𝑖 ⋯ 𝑁−2 𝑁−1 𝑁
29
Change in Probability Vector
▪ If the probability vector is 𝑥 = (𝑥1, …, 𝑥𝑁) at this step, what is it at the next step?
▪ Recall that row 𝑖 of the transition probability matrix 𝑃 tells us where we go next from state 𝑖
▪ So, from 𝑥, our next state is distributed as 𝑥𝑃
30
Steady State in Vector Notation
▪ The steady state in vector notation is simply a vector 𝜋 = (𝜋1, …, 𝜋𝑁) of probabilities
▪ We use 𝜋 to distinguish it from the notation for the probability vector 𝑥
▪ 𝜋𝑖 is the long-term visit rate (or PageRank) of page 𝑖
▪ So we can think of PageRank as a very long vector —
one component per page
31
How do we compute the steady state
vector?
▪ Or — how do we compute PageRank?
▪ Recall: 𝜋 = (𝜋1, …, 𝜋𝑁) is the PageRank vector, the vector of steady-state probabilities …
▪ If the distribution in this step is 𝑥, then the distribution in the next step is 𝑥𝑃
▪ But 𝜋 is the steady state
▪ Therefore, 𝜋 = 𝜋𝑃
▪ Solving this matrix equation gives us 𝜋
▪ 𝜋 is the principal left eigenvector for 𝑃, i.e., left eigenvector
with the largest eigenvalue
▪ All transition probability matrices have largest eigenvalue 1
32
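One way to solve 𝜋 = 𝜋𝑃 directly is to compute the principal left eigenvector numerically. A sketch (not from the slides), assuming numpy and a matrix small enough for dense eigendecomposition:

```python
import numpy as np

def steady_state(P: np.ndarray) -> np.ndarray:
    """Return pi with pi = pi P: the left eigenvector of P for the
    eigenvalue closest to 1, renormalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)    # right eigvecs of P^T = left eigvecs of P
    k = np.argmin(np.abs(eigvals - 1.0))     # index of eigenvalue ~ 1
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()                     # normalize (also fixes the sign)
```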
One way of computing PageRank
▪ Start with any distribution 𝑥, e.g., the uniform distribution
▪ After one step, we’re at 𝑥𝑃
▪ After two steps, we’re at 𝑥𝑃²
▪ After 𝑘 steps, we’re at 𝑥𝑃ᵏ
▪ Algorithm: multiply 𝑥 by increasing powers of 𝑃 until convergence
▪ This is called the power method
▪ Regardless of where we start, we eventually reach the steady state 𝜋
33
Example web graph and transition
probability matrix
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 0.5 0.5 0 0 0 0
𝒅𝟐 0.33 0 0.33 0.33 0 0 0
𝒅𝟑 0 0 0 0.5 0.5 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 0.5 0.5
𝒅𝟔 0 0 0 0.33 0.33 0 0.33
34
Transition probability matrix, and transition
matrix with teleporting
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 0.5 0.5 0 0 0 0
𝒅𝟐 0.33 0 0.33 0.33 0 0 0
𝒅𝟑 0 0 0 0.5 0.5 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 0.5 0.5
𝒅𝟔 0 0 0 0.33 0.33 0 0.33
Teleportation rate: 0.14
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0.02 0.02 0.88 0.02 0.02 0.02 0.02
𝒅𝟏 0.02 0.45 0.45 0.02 0.02 0.02 0.02
𝒅𝟐 0.31 0.02 0.31 0.31 0.02 0.02 0.02
𝒅𝟑 0.02 0.02 0.02 0.45 0.45 0.02 0.02
𝒅𝟒 0.02 0.02 0.02 0.02 0.02 0.02 0.88
𝒅𝟓 0.02 0.02 0.02 0.02 0.02 0.45 0.45
𝒅𝟔 0.02 0.02 0.02 0.31 0.31 0.02 0.31
35
Power Method
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0.02 0.02 0.88 0.02 0.02 0.02 0.02
𝒅𝟏 0.02 0.45 0.45 0.02 0.02 0.02 0.02
𝒅𝟐 0.31 0.02 0.31 0.31 0.02 0.02 0.02
𝒅𝟑 0.02 0.02 0.02 0.45 0.45 0.02 0.02
𝒅𝟒 0.02 0.02 0.02 0.02 0.02 0.02 0.88
𝒅𝟓 0.02 0.02 0.02 0.02 0.02 0.45 0.45
𝒅𝟔 0.02 0.02 0.02 0.31 0.31 0.02 0.31
36
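Running the power method on this matrix should approximately reproduce the PageRank vector on the next slide. A sketch using the rounded entries above; because of rounding the rows do not sum to exactly 1, so we renormalize at each step:

```python
import numpy as np

# Teleporting transition matrix from the slide (rounded values).
P = np.array([
    [0.02, 0.02, 0.88, 0.02, 0.02, 0.02, 0.02],
    [0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02],
    [0.31, 0.02, 0.31, 0.31, 0.02, 0.02, 0.02],
    [0.02, 0.02, 0.02, 0.45, 0.45, 0.02, 0.02],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.88],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45],
    [0.02, 0.02, 0.02, 0.31, 0.31, 0.02, 0.31],
])

x = np.full(7, 1 / 7)            # start from the uniform distribution
for _ in range(200):             # power method: x <- xP until convergence
    x_next = x @ P
    x_next /= x_next.sum()       # renormalize (rounding error in P)
    if np.allclose(x, x_next, atol=1e-12):
        break
    x = x_next

print(np.round(x, 2))  # close to the PageRank vector on the next slide
```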
Example Web Graph
PageRank
𝑑0 0.05
𝑑1 0.04
𝑑2 0.11
𝑑3 0.25
𝑑4 0.21
𝑑5 0.04
𝑑6 0.31
37
PageRank Summary
▪ Preprocessing
▪ Given graph of links, build matrix 𝑃
▪ Apply teleportation
▪ From modified matrix, compute 𝜋
▪ 𝜋𝑖 is the PageRank of page 𝑖
▪ Query processing
▪ Retrieve pages satisfying the query
▪ Rank them by their PageRank
▪ Return re-ranked list to the user
38
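As a sketch of the query-processing step (retrieve and pagerank are hypothetical stand-ins for the index lookup and the precomputed 𝜋, not from the slides):

```python
# Rank the pages matching a query by their precomputed PageRank.
def rank_by_pagerank(query, retrieve, pagerank):
    hits = retrieve(query)                      # pages satisfying the query
    return sorted(hits, key=lambda d: pagerank[d], reverse=True)
```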
Topic-Specific PageRank
▪ Goal
▪ PageRank values that depend on
query topic
▪ Teleporting
▪ Selects a topic (say, one of the
16 top level ODP categories)
based on a query & user-specific
distribution over the categories
▪ Teleport to a page uniformly at
random within the chosen topic
39
Topic-Specific PageRank
▪ Offline: Compute PageRank for individual topics
▪ Query independent as before
▪ Each page has multiple PageRank scores
▪ one for each ODP category, with teleportation only to that
category
▪ Online: Query context classified into (distribution of
weights over) topics
▪ Generate a dynamic PageRank score for each page
▪ Weighted sum of topic-specific PageRanks
40
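The online combination step is just a weighted sum. A sketch with hypothetical names (topic_pr holds one precomputed PageRank vector per topic; the weights come from classifying the query context):

```python
import numpy as np

def dynamic_pagerank(topic_pr: dict, weights: dict) -> np.ndarray:
    """Weighted sum of topic-specific PageRank vectors."""
    return sum(w * topic_pr[t] for t, w in weights.items())

# e.g., a query judged 70% "Sports" and 30% "Health":
# scores = dynamic_pagerank(topic_pr, {"Sports": 0.7, "Health": 0.3})
```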
PageRank Issues
▪ Real users are not random surfers
▪ Examples of non-random surfing: back button, bookmarks,
directories — and search!
▪ → Markov model is not the best model of surfing
▪ Simple PageRank ranking produces bad results for many
pages
▪ Consider the query [travel to Italy]
▪ The wikipedia.org home page (i) has a very high PageRank, and (ii) contains both travel and Italy
▪ If we rank all Boolean hits according to PageRank, then the
Wikipedia home page would be top-ranked
▪ In practice:
▪ Rank according to weighted combination of raw text match,
anchor text match, PageRank & other factors
41
How Important is PageRank
▪ Frequent claim:
▪ PageRank is the most important component of web
ranking
▪ The reality:
▪ There are several components that are at least as
important: e.g., anchor text, phrases, proximity, tiered
indexes ...
▪ Rumor has it that PageRank in its original form (as
presented here) now has a negligible impact on ranking!
▪ However, variants of a page’s PageRank are still an
essential part of ranking
▪ Addressing link spam is difficult and crucial
42
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
43
HITS: Hyperlink-Induced Topic Search
▪ Premise: there are two different types of relevance on the
web
▪ Relevance type 1: Hubs
▪ A hub page has a good list of links to pages answering the information
need
▪ E.g., for query [chicago bulls]: Bob’s list of recommended resources on
the Chicago Bulls sports team
▪ Relevance type 2: Authorities
▪ An authority page is a direct answer to the information need
▪ The home page of the Chicago Bulls sports team
▪ By definition: Links to authority pages occur repeatedly on hub pages
▪ Most approaches to search (including PageRank ranking) don’t
make the distinction between these two very different types
of relevance
44
Hubs and authorities: Definition
▪ A good hub page for a topic links to many authority
pages for that topic
▪ A good authority page for a topic is linked to by many
hub pages for that topic
▪ Circular definition — we will turn this into an
iterative computation
45
Example for hubs and authorities
Hubs: thesmartlocal.com, hungrygowhere.com, burpple.com, danielfooddiary.com
Authorities: pscafe.com, breadtalk.com.sg, oldchangkee.com, thaiexpress.com.sg
46
How to compute hub and authority scores
▪ Do a regular web search first
▪ Call the search result the root set
▪ Find all pages that are linked to by, or link to, pages in the root set
▪ Call this larger set the base set
▪ Finally, compute hubs and authorities for the base
set (which we’ll view as a small web graph)
47
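A sketch of this procedure with hypothetical helpers (search, outlinks, and inlinks stand in for the search engine and the link index; none are real APIs):

```python
# Build the base set from the root set of a regular web search.
def build_base_set(query: str, search, outlinks, inlinks) -> set:
    root = set(search(query))            # the root set
    base = set(root)
    for page in root:
        base |= set(outlinks(page))      # pages the root set links to
        base |= set(inlinks(page))       # pages that link to the root set
    return base
```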
Root set and base set (1)
Root set
The root set
48
Root set and base set (1)
Root set
Nodes that root set nodes link to
49
Root set and base set (1)
Root set
Nodes that link to root set nodes
50
Root set and base set (1)
Root set
Base set
The base set
51
Root set and base set (2)
▪ Root set typically has 200–1000 nodes
▪ Base set may have up to 5000 nodes
▪ Computation of base set, as shown on previous slide:
▪ Follow outlinks by parsing the pages in the root set
▪ Find 𝑑’s inlinks by searching for all pages containing a link
to 𝑑
52
Hub and Authority Scores
▪ Compute for each page 𝑑 in the base set
▪ a hub score ℎ(𝑑) and an authority score 𝑎(𝑑)
▪ Initialization:
▪ for all 𝑑: ℎ(𝑑) = 1, 𝑎(𝑑) = 1
▪ Iteratively update all ℎ(𝑑), 𝑎(𝑑)
▪ After convergence:
▪ Output pages with highest ℎ scores as top hubs
▪ Output pages with highest 𝑎 scores as top authorities
▪ We output two ranked lists
53
Iterative Update
▪ For all 𝑑: ℎ(𝑑) = ∑𝑑→𝑦 𝑎(𝑦)
[Figure: 𝑑 links out to 𝑦1, 𝑦2, 𝑦3]
▪ For all 𝑑: 𝑎(𝑑) = ∑𝑦→𝑑 ℎ(𝑦)
[Figure: 𝑦1, 𝑦2, 𝑦3 link in to 𝑑]
▪ Iterate these steps until convergence is achieved
54
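In matrix form, with 𝐴 the adjacency matrix of the base set, the updates are ℎ = 𝐴𝑎 and 𝑎 = 𝐴ᵀℎ. A numpy sketch (not from the slides), including the scaling discussed on the next slide:

```python
import numpy as np

def hits(A: np.ndarray, iters: int = 50):
    """HITS on the base set; A[i, j] = 1 iff page i links to page j."""
    h = np.ones(len(A))        # initial hub scores
    a = np.ones(len(A))        # initial authority scores
    for _ in range(iters):
        h = A @ a              # h(d) = sum of a(y) over pages y that d links to
        a = A.T @ h            # a(d) = sum of h(y) over pages y that link to d
        h /= h.sum()           # scale down after each iteration;
        a /= a.sum()           # only relative values matter
    return h, a
```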
Details
▪ Scaling
▪ To prevent the 𝑎() and ℎ() values from getting too big, can
scale down after each iteration
▪ Scaling factor doesn’t really matter
▪ We care about the relative (as opposed to absolute) values
of the scores
▪ In most cases, the algorithm converges after a few
iterations
▪ See IIR Section 21.3 for details of computation
55
PageRank vs. HITS: Discussion
▪ PageRank can be pre-computed, HITS has to be
computed at query time
▪ HITS is too expensive in most application scenarios
▪ PageRank and HITS make two different design choices
concerning
▪ the eigen problem formalization
▪ the set of pages to apply the formalization to
▪ These two are orthogonal (We could also apply HITS to
the entire web and PageRank to a small base set)
▪ Claim: On the web, a good hub is almost always also a good authority
▪ The actual difference between PageRank ranking and
HITS ranking is therefore not as large as one might expect
56
Resources
▪ IIR Chapter 21
▪ Papers
▪ http://www2004.org/proceedings/docs/1p309.pdf
▪ http://www2004.org/proceedings/docs/1p595.pdf
▪ http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
▪ http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
57