CI-6226
Lecture 12. Link Analysis
Information Retrieval and Analysis
Vasily Sidorov
1
What Should We Learn Today?
▪ Anchor text: What exactly are links on the web and
why are they important for IR?
▪ Citation analysis: the mathematical foundation of
PageRank and link-based ranking
▪ PageRank: the original algorithm that was used for
link-based ranking on the web
▪ Hubs & Authorities: an alternative link-based ranking
algorithm
2
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
3
The Web as a Directed Graph
▪ Assumption 1: A hyperlink is a quality signal
▪ The hyperlink 𝑑1 → 𝑑2 indicates that 𝑑1 ’s author deems 𝑑2 high-
quality and relevant
▪ Assumption 2: The anchor text describes the content of 𝑑2
▪ We use “anchor text” somewhat loosely here for “the text
surrounding the hyperlink”
▪ Example:
You can find cheap cars <a href="http://…">here</a>
anchor text: “You can find cheap cars here”
[Figure: page 𝑑1 contains the hyperlink and its anchor text; the link points to page 𝑑2]
4
[text of 𝑑2 ] vs. [text of 𝑑2 ] + [anchor text → 𝑑2 ]
▪ Searching on [text of 𝑑2 ] + [anchor
text → 𝑑2 ] is often more effective
than searching on [text of 𝑑2 ] only
▪ Example: Query IBM
▪ Matches IBM’s copyright page
▪ Matches many spam pages
▪ Matches IBM Wikipedia article
▪ May not match IBM home page,
as it’s mainly graphics
▪ Searching on [anchor text → 𝑑2 ]
works better here
5
[Figure: many pages with anchor text containing “IBM”, all pointing to www.ibm.com]
6
Indexing Anchor Text
▪ Anchor text is often a better description of a page’s
content than the page itself
▪ Anchor text can be weighted more highly than document
text
▪ Based on Assumptions 1 & 2
7
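To make the weighting idea concrete, here is a small Python sketch, not from the slides: a hypothetical combined score in which anchor-text matches count more than body-text matches. The helpers tf_body and tf_anchor and the weight values are assumptions for illustration only.

```python
# Hypothetical scoring sketch: anchor text weighted above body text.
# tf_body(page, t) and tf_anchor(page, t) are assumed term-frequency
# lookups over the page body and over anchor text pointing to the page.
def score(query_terms, page, tf_body, tf_anchor,
          w_body: float = 1.0, w_anchor: float = 2.0) -> float:
    return sum(
        w_body * tf_body(page, t) + w_anchor * tf_anchor(page, t)
        for t in query_terms
    )
```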
Google Bombs
▪ A Google bomb is a search with “bad” results due to
maliciously manipulated anchor text, e.g.:
▪ In 1999, a search for “more evil than Satan himself” (or
“evil empire”) resulted in the Microsoft homepage as the
top result
▪ In 2000, the text “dumb ***” was linked to a site selling
George W. Bush-related merchandise
▪ Can score anchor text with weight depending on the
authority of the anchor page’s website
▪ E.g., if we were to assume that content from
cnn.com or yahoo.com is authoritative, then trust
the anchor text from them
8
Examples
9
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
10
Origins of PageRank: Citation analysis (1)
▪ Citation analysis: analysis of citations in the scientific
literature
▪ Example citation: “Miller (2001) has shown that physical activity
alters the metabolism of estrogens”
▪ We can view “Miller (2001)” as a hyperlink linking two scientific
articles
▪ One application of these “hyperlinks” in the scientific
literature:
▪ Measure the similarity of two articles by the overlap of other
articles citing them
▪ This is called co-citation similarity
▪ Co-citation similarity on the web: Google’s “Similar” feature
11
Origins of PageRank: Citation analysis (2)
▪ Another application:
▪ Citation frequency can be used to measure the impact of
an article/author/journal
12
Origins of PageRank: Citation analysis (2)
▪ Another application
▪ Citation frequency can be used to measure the impact of an
article
▪ Simplest measure: Each article gets one vote – not very accurate
▪ On the web
▪ Citation frequency = inlink count
▪ A high inlink count does not necessarily mean high quality ...
mainly because of link spam
▪ Better measure
▪ weighted citation frequency or citation rank
▪ An article’s vote is weighted according to its citation impact
▪ This is basically PageRank
▪ A PageRank-like approach was first developed in the context
of citation analysis by Pinski and Narin (1976)
13
Origins of PageRank: Summary
▪ We can use the same formal representation for
▪ citations in the scientific literature
▪ hyperlinks on the web
▪ Appropriately weighted citation frequency is an
excellent measure of quality ...
▪ ... both for web pages and for scientific publications.
▪ Next: PageRank algorithm for computing weighted
citation frequency on the web
14
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
15
Query-Independent Ordering
▪ First generation:
▪ Using link counts as simple measures of popularity
▪ Two basic suggestions:
▪ Undirected popularity:
▪ Each page gets a score = the number of in-links plus the number of out-links (e.g., 3 + 2 = 5 for page A in the figure)
▪ Directed popularity:
▪ Score of a page = number of its in-links (3 for page A)
[Figure: example graph with pages A–F; page A has 3 in-links and 2 out-links]
16
PageRank
▪ PageRank (𝑃𝑅) of page 𝐶:
𝑃𝑅(𝐶) = 𝑃𝑅(𝐴)/2 + 𝑃𝑅(𝐵)/1
▪ More generally,
𝑃𝑅(𝑢) = ∑𝑣∈𝐵𝑢 𝑃𝑅(𝑣)/𝐿𝑣 ,
where 𝐵𝑢 is the set of pages that point to 𝑢, and
𝐿𝑣 is the number of outgoing links from page 𝑣 (not counting duplicate links)
[Figure: three-page graph — 𝐴 links to 𝐵 and 𝐶, 𝐵 links to 𝐶, 𝐶 links to 𝐴]
17
PageRank: 𝑃𝑅(𝐶) = 𝑃𝑅(𝐴)/2 + 𝑃𝑅(𝐵)/1
▪ Don’t know PageRank values at start
▪ Assume equal values (1/3 in this case), then iterate:
▪ First iteration:
▪ 𝑃𝑅(𝐶) = 0.33/2 + 0.33 = 0.5, 𝑃𝑅(𝐴) = 0.33, 𝑃𝑅(𝐵) = 0.17
▪ Second iteration:
▪ 𝑃𝑅(𝐶) = 0.33/2 + 0.17 = 0.33, 𝑃𝑅(𝐴) = 0.5, 𝑃𝑅(𝐵) = 0.17
▪ Third iteration:
▪ 𝑃𝑅(𝐶) = 0.42, 𝑃𝑅(𝐴) = 0.33, 𝑃𝑅(𝐵) = 0.25
▪ Converges to
𝑃𝑅(𝐶) = 0.4, 𝑃𝑅(𝐴) = 0.4, 𝑃𝑅(𝐵) = 0.2
18
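The iteration above is easy to run in code. A minimal Python sketch (not from the slides), using the link structure 𝐴→𝐵, 𝐴→𝐶, 𝐵→𝐶, 𝐶→𝐴 inferred from the figure:

```python
# Iterate the PageRank update for the three-page example until it
# converges; link structure assumed from the figure.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {page: 1 / 3 for page in links}           # start with equal values

for _ in range(50):
    new_pr = {
        # PR(u) = sum over pages v that link to u of PR(v) / L_v
        u: sum(pr[v] / len(out) for v, out in links.items() if u in out)
        for u in links
    }
    if all(abs(new_pr[u] - pr[u]) < 1e-9 for u in links):
        break
    pr = new_pr

print({u: round(score, 2) for u, score in pr.items()})
# converges to {'A': 0.4, 'B': 0.2, 'C': 0.4}
```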
Model Behind PageRank: Random Walk
▪ Imagine a “web surfer” doing a random walk
▪ Start at a random page
▪ At each step, go out of the current page along one of the
links on that page, equiprobably
▪ If it visits some nodes more often than others, these are nodes with many links coming in from other frequently visited nodes
▪ long-term visit rate
▪ PageRank = long-term visit rate
▪ Pages visited more often in the random walk are more important
[Figure: small example web graph with pages A, B, C, D]
19
Formalization of Random Walk: Markov
Chains
▪ A Markov chain consists of 𝑁 states, plus an 𝑁 × 𝑁
transition probability matrix 𝑃 (state = page)
▪ At each step, we are on exactly one of the pages
▪ For 1 ≤ 𝑖, 𝑗 ≤ 𝑁, the matrix entry 𝑃𝑖𝑗 tells us the
probability of 𝑗 being the next page, given we are currently
on page 𝑖
▪ Clearly, for all 𝑖: ∑𝑗 𝑃𝑖𝑗 = 1, summing 𝑗 from 1 to 𝑁 (each row of 𝑃 sums to 1)
20
Example graph and its link matrix
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 1 1 0 0 0 0
𝒅𝟐 1 0 1 1 0 0 0
𝒅𝟑 0 0 0 1 1 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 1 1
𝒅𝟔 0 0 0 1 1 0 1
21
Transition Probability Matrix
Link matrix:
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 1 1 0 0 0 0
𝒅𝟐 1 0 1 1 0 0 0
𝒅𝟑 0 0 0 1 1 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 1 1
𝒅𝟔 0 0 0 1 1 0 1

Transition probability matrix (each row of the link matrix divided by its row sum):
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 0.5 0.5 0 0 0 0
𝒅𝟐 0.33 0 0.33 0.33 0 0 0
𝒅𝟑 0 0 0 0.5 0.5 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 0.5 0.5
𝒅𝟔 0 0 0 0.33 0.33 0 0.33

∑𝑗 𝑃𝑖𝑗 = 1 for every row 𝑖
22
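The transition probability matrix can be derived mechanically from the link matrix. A small numpy sketch (assuming, as in this graph, that no row of the link matrix is all zeros):

```python
import numpy as np

# Link matrix of the example graph (slide 21).
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
], dtype=float)

# Divide each row by its out-degree to make it row-stochastic.
P = A / A.sum(axis=1, keepdims=True)
print(np.round(P, 2))
```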
Long-Term Visit Rate
▪ Recall: PageRank = long-term visit rate
▪ Long-term visit rate of page 𝑑 is the probability that a web
surfer is at page 𝑑 at a given point in time.
▪ Next:
▪ what properties of the web graph must hold for the long-
term visit rate to be well defined?
▪ First, a special case: The web graph must not contain
dead ends.
23
Dead Ends
▪ The reality: the web is full of dead ends
▪ Random walk can get stuck in dead ends
▪ If there are dead ends, long-term visit rates are not
well-defined
[Figure: a random walk arriving at a page with no out-links]
24
Teleporting — to get out of dead ends
▪ Teleporting:
▪ At a dead end, jump to a random web page with probability 1Τ𝑁
▪ At a non-dead end, with probability 10%, jump to a random web
page
▪ to each with a probability of 0.1Τ𝑁
▪ 10% is a parameter, the teleportation rate
▪ With remaining probability (90%), go out on a random hyperlink
▪ Example: if the page has 4 outgoing links, randomly choose one of them with probability (1 − 0.1)/4 = 0.225
▪ Note: “jumping” from dead end is independent of
teleportation rate
25
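Putting the two teleporting rules together, a possible numpy sketch (the helper name and structure are assumptions, not from the slides):

```python
import numpy as np

def add_teleporting(A: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Build the teleporting transition matrix from link matrix A;
    alpha is the teleportation rate."""
    N = len(A)
    P = np.zeros_like(A, dtype=float)
    for i, row in enumerate(A):
        if row.sum() == 0:
            # Dead end: jump to every page with probability 1/N,
            # independent of the teleportation rate.
            P[i] = 1.0 / N
        else:
            # Follow a random out-link with probability 1 - alpha;
            # teleport to a random page with probability alpha.
            P[i] = (1 - alpha) * row / row.sum() + alpha / N
    return P
```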
Results of Teleporting
▪ With teleporting, we cannot get stuck in a dead end.
▪ But even without dead ends, a graph may not have
well-defined long-term visit rates
▪ More generally, we require that the Markov chain be
ergodic
▪ Ergodic: positive recurrent aperiodic state of stochastic
systems; tending in probability to a limiting form that is
independent of the initial conditions
26
Ergodic Markov Chains
▪ A Markov chain is ergodic if
▪ It has a path from any state to any other
▪ there exists a positive integer 𝑇0 such that
▪ there is a non-zero probability of ending up in any state at any time 𝑡 > 𝑇0
▪ Two technical conditions
▪ Irreducibility
▪ Roughly: there is a path from any page to any other page
▪ Aperiodicity
▪ Roughly: The pages cannot be partitioned into sets such that all
state transitions occur cyclically from one set to another
27
Ergodic Markov Chains
▪ Theorem
▪ For any ergodic Markov chain, there is a stable long-term
visit rate for each state
▪ This is the steady-state probability distribution
▪ Over a long time period, we visit each state in
proportion to this rate. It doesn’t matter where we
start
▪ Teleporting makes the web graph ergodic
▪ WebGraph + teleporting has a steady-state probability
distribution
▪ Each page in the WebGraph + teleporting has a PageRank
28
Formalization of “visit”: Probability Vector
▪ A probability vector 𝑥 = (𝑥1, …, 𝑥𝑁) tells us where the random walk is at any point in time. In this example, we are now at state 𝑖:
( 0 0 0 ⋯ 1 ⋯ 0 0 0 )
  1 2 3 ⋯ 𝑖 ⋯ 𝑁−2 𝑁−1 𝑁
▪ More generally, the random walk is on page 𝑖 with probability 𝑥𝑖 , and ∑𝑖 𝑥𝑖 = 1:
( 0.05 0.01 0 ⋯ 0.2 ⋯ 0.01 0.05 0.03 )
  1 2 3 ⋯ 𝑖 ⋯ 𝑁−2 𝑁−1 𝑁
29
Change in Probability Vector
▪ If the probability vector is 𝑥 = (𝑥1, …, 𝑥𝑁) at this step, what is it at the next step?
▪ Recall that row 𝑖 of the transition probability matrix 𝑃 tells us where we go next from state 𝑖
▪ So, from 𝑥, our next state is distributed as 𝑥𝑃
30
Steady State in Vector Notation
▪ The steady state in vector notation is simply a vector 𝜋 = (𝜋1, …, 𝜋𝑁) of probabilities
▪ We use 𝜋 to distinguish it from the notation for the probability vector 𝑥
▪ 𝜋𝑖 is the long-term visit rate (or PageRank) of page 𝑖
▪ So we can think of PageRank as a very long vector —
one component per page
31
How do we compute the steady state
vector?
▪ Or — how do we compute PageRank?
▪ Recall: 𝜋 = (𝜋1, …, 𝜋𝑁) is the PageRank vector, the vector of steady-state probabilities …
▪ If the distribution in this step is 𝑥, then the distribution in the next step is 𝑥𝑃
▪ But 𝜋 is the steady state
▪ Therefore, 𝜋 = 𝜋𝑃
▪ Solving this matrix equation gives us 𝜋
▪ 𝜋 is the principal left eigenvector for 𝑃, i.e., left eigenvector
with the largest eigenvalue
▪ All transition probability matrices have largest eigenvalue 1
32
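One way to solve 𝜋 = 𝜋𝑃 directly is to compute the principal left eigenvector numerically. A sketch (not from the slides), assuming numpy and a matrix small enough for dense eigendecomposition:

```python
import numpy as np

def steady_state(P: np.ndarray) -> np.ndarray:
    """Return pi with pi = pi P: the left eigenvector of P for the
    eigenvalue closest to 1, renormalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)    # right eigvecs of P^T = left eigvecs of P
    k = np.argmin(np.abs(eigvals - 1.0))     # index of eigenvalue ~ 1
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()                     # normalize (also fixes the sign)
```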
One way of computing PageRank
▪ Start with any distribution 𝑥, e.g., the uniform distribution
▪ After one step, we’re at 𝑥𝑃
▪ After two steps, we’re at 𝑥𝑃²
▪ After 𝑘 steps, we’re at 𝑥𝑃ᵏ
▪ Algorithm: multiply 𝑥 by increasing powers of 𝑃 until convergence
▪ This is called the power method
▪ Regardless of where we start, we eventually reach the steady state 𝜋
33
Example web graph and transition
probability matrix
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 0.5 0.5 0 0 0 0
𝒅𝟐 0.33 0 0.33 0.33 0 0 0
𝒅𝟑 0 0 0 0.5 0.5 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 0.5 0.5
𝒅𝟔 0 0 0 0.33 0.33 0 0.33
34
Transition probability matrix, and transition
matrix with teleporting
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0 0 1 0 0 0 0
𝒅𝟏 0 0.5 0.5 0 0 0 0
𝒅𝟐 0.33 0 0.33 0.33 0 0 0
𝒅𝟑 0 0 0 0.5 0.5 0 0
𝒅𝟒 0 0 0 0 0 0 1
𝒅𝟓 0 0 0 0 0 0.5 0.5
𝒅𝟔 0 0 0 0.33 0.33 0 0.33
Teleportation rate: 0.14
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0.02 0.02 0.88 0.02 0.02 0.02 0.02
𝒅𝟏 0.02 0.45 0.45 0.02 0.02 0.02 0.02
𝒅𝟐 0.31 0.02 0.31 0.31 0.02 0.02 0.02
𝒅𝟑 0.02 0.02 0.02 0.45 0.45 0.02 0.02
𝒅𝟒 0.02 0.02 0.02 0.02 0.02 0.02 0.88
𝒅𝟓 0.02 0.02 0.02 0.02 0.02 0.45 0.45
𝒅𝟔 0.02 0.02 0.02 0.31 0.31 0.02 0.31
35
Power Method
𝒅𝟎 𝒅𝟏 𝒅𝟐 𝒅𝟑 𝒅𝟒 𝒅𝟓 𝒅𝟔
𝒅𝟎 0.02 0.02 0.88 0.02 0.02 0.02 0.02
𝒅𝟏 0.02 0.45 0.45 0.02 0.02 0.02 0.02
𝒅𝟐 0.31 0.02 0.31 0.31 0.02 0.02 0.02
𝒅𝟑 0.02 0.02 0.02 0.45 0.45 0.02 0.02
𝒅𝟒 0.02 0.02 0.02 0.02 0.02 0.02 0.88
𝒅𝟓 0.02 0.02 0.02 0.02 0.02 0.45 0.45
𝒅𝟔 0.02 0.02 0.02 0.31 0.31 0.02 0.31
36
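Running the power method on this matrix should approximately reproduce the PageRank vector on the next slide. A sketch using the rounded entries above; because of rounding the rows do not sum to exactly 1, so we renormalize at each step:

```python
import numpy as np

# Teleporting transition matrix from the slide (rounded values).
P = np.array([
    [0.02, 0.02, 0.88, 0.02, 0.02, 0.02, 0.02],
    [0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02],
    [0.31, 0.02, 0.31, 0.31, 0.02, 0.02, 0.02],
    [0.02, 0.02, 0.02, 0.45, 0.45, 0.02, 0.02],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.88],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45],
    [0.02, 0.02, 0.02, 0.31, 0.31, 0.02, 0.31],
])

x = np.full(7, 1 / 7)            # start from the uniform distribution
for _ in range(200):             # power method: x <- xP until convergence
    x_next = x @ P
    x_next /= x_next.sum()       # renormalize (rounding error in P)
    if np.allclose(x, x_next, atol=1e-12):
        break
    x = x_next

print(np.round(x, 2))  # close to the PageRank vector on the next slide
```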
Example Web Graph
PageRank
𝑑0 0.05
𝑑1 0.04
𝑑2 0.11
𝑑3 0.25
𝑑4 0.21
𝑑5 0.04
𝑑6 0.31
37
PageRank Summary
▪ Preprocessing
▪ Given graph of links, build matrix 𝑃
▪ Apply teleportation
▪ From modified matrix, compute 𝜋
▪ 𝜋𝑖 is the PageRank of page 𝑖
▪ Query processing
▪ Retrieve pages satisfying the query
▪ Rank them by their PageRank
▪ Return re-ranked list to the user
38
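As a sketch of the query-processing step (retrieve and pagerank are hypothetical stand-ins for the index lookup and the precomputed 𝜋, not from the slides):

```python
# Rank the pages matching a query by their precomputed PageRank.
def rank_by_pagerank(query, retrieve, pagerank):
    hits = retrieve(query)                      # pages satisfying the query
    return sorted(hits, key=lambda d: pagerank[d], reverse=True)
```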
Topic-Specific PageRank
▪ Goal
▪ PageRank values that depend on
query topic
▪ Teleporting
▪ Selects a topic (say, one of the
16 top level ODP categories)
based on a query & user-specific
distribution over the categories
▪ Teleport to a page uniformly at
random within the chosen topic
39
Topic-Specific PageRank
▪ Offline: Compute PageRank for individual topics
▪ Query independent as before
▪ Each page has multiple PageRank scores
▪ one for each ODP category, with teleportation only to that
category
▪ Online: Query context classified into (distribution of
weights over) topics
▪ Generate a dynamic PageRank score for each page
▪ Weighted sum of topic-specific PageRanks
40
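The online combination step is just a weighted sum. A sketch with hypothetical names (topic_pr holds one precomputed PageRank vector per topic; the weights come from classifying the query context):

```python
import numpy as np

def dynamic_pagerank(topic_pr: dict, weights: dict) -> np.ndarray:
    """Weighted sum of topic-specific PageRank vectors."""
    return sum(w * topic_pr[t] for t, w in weights.items())

# e.g., a query judged 70% "Sports" and 30% "Health":
# scores = dynamic_pagerank(topic_pr, {"Sports": 0.7, "Health": 0.3})
```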
PageRank Issues
▪ Real users are not random surfers
▪ Examples of non-random surfing: back button, bookmarks,
directories — and search!
▪ → Markov model is not the best model of surfing
▪ Simple PageRank ranking produces bad results for many
pages
▪ Consider the query [travel to Italy]
▪ The wikipedia.org home page (i) has a very high PageRank, and (ii) contains both travel and Italy
▪ If we rank all Boolean hits according to PageRank, then the
Wikipedia home page would be top-ranked
▪ In practice:
▪ Rank according to weighted combination of raw text match,
anchor text match, PageRank & other factors
41
How Important is PageRank
▪ Frequent claim:
▪ PageRank is the most important component of web
ranking
▪ The reality:
▪ There are several components that are at least as
important: e.g., anchor text, phrases, proximity, tiered
indexes ...
▪ Rumor has it that PageRank in its original form (as
presented here) now has a negligible impact on ranking!
▪ However, variants of a page’s PageRank are still an
essential part of ranking
▪ Addressing link spam is difficult and crucial
42
Today’s Lecture
▪ Anchor Text
▪ Citation Analysis
▪ PageRank
▪ HITS: Hubs & Authorities
43
HITS: Hyperlink-Induced Topic Search
▪ Premise: there are two different types of relevance on the
web
▪ Relevance type 1: Hubs
▪ A hub page has a good list of links to pages answering the information
need
▪ E.g., for query [chicago bulls]: Bob’s list of recommended resources on
the Chicago Bulls sports team
▪ Relevance type 2: Authorities
▪ An authority page is a direct answer to the information need
▪ The home page of the Chicago Bulls sports team
▪ By definition: Links to authority pages occur repeatedly on hub pages
▪ Most approaches to search (including PageRank ranking) don’t
make the distinction between these two very different types
of relevance
44
Hubs and authorities: Definition
▪ A good hub page for a topic links to many authority
pages for that topic
▪ A good authority page for a topic is linked to by many
hub pages for that topic
▪ Circular definition — we will turn this into an
iterative computation
45
Example for hubs and authorities
Hubs: thesmartlocal.com, hungrygowhere.com, burpple.com, danielfooddiary.com
Authorities: pscafe.com, breadtalk.com.sg, oldchangkee.com, thaiexpress.com.sg
46
How to compute hub and authority scores
▪ Do a regular web search first
▪ Call the search result the root set
▪ Find all pages that are linked to by, or link to, pages in the root set
▪ Call this larger set the base set
▪ Finally, compute hubs and authorities for the base
set (which we’ll view as a small web graph)
47
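A sketch of this procedure with hypothetical helpers (search, outlinks, and inlinks stand in for the search engine and the link index; none are real APIs):

```python
# Build the base set from the root set of a regular web search.
def build_base_set(query: str, search, outlinks, inlinks) -> set:
    root = set(search(query))            # the root set
    base = set(root)
    for page in root:
        base |= set(outlinks(page))      # pages the root set links to
        base |= set(inlinks(page))       # pages that link to the root set
    return base
```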
Root set and base set (1)
Root set
The root set
48
Root set and base set (1)
Root set
Nodes that root set nodes link to
49
Root set and base set (1)
Root set
Nodes that link to root set nodes
50
Root set and base set (1)
Root set
Base set
The base set
51
Root set and base set (2)
▪ Root set typically has 200–1000 nodes
▪ Base set may have up to 5000 nodes
▪ Computation of base set, as shown on previous slide:
▪ Follow outlinks by parsing the pages in the root set
▪ Find 𝑑’s inlinks by searching for all pages containing a link
to 𝑑
52
Hub and Authority Scores
▪ Compute for each page 𝑑 in the base set
▪ a hub score ℎ(𝑑) and an authority score 𝑎(𝑑)
▪ Initialization:
▪ for all 𝑑: ℎ(𝑑) = 1, 𝑎(𝑑) = 1
▪ Iteratively update all ℎ(𝑑), 𝑎(𝑑)
▪ After convergence:
▪ Output pages with highest ℎ scores as top hubs
▪ Output pages with highest 𝑎 scores as top authorities
▪ We output two ranked lists
53
Iterative Update
▪ For all 𝑑: ℎ(𝑑) = ∑𝑑→𝑦 𝑎(𝑦)
[Figure: 𝑑 links out to 𝑦1, 𝑦2, 𝑦3]
▪ For all 𝑑: 𝑎(𝑑) = ∑𝑦→𝑑 ℎ(𝑦)
[Figure: 𝑦1, 𝑦2, 𝑦3 link in to 𝑑]
▪ Iterate these steps until convergence is achieved
54
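In matrix form, with 𝐴 the adjacency matrix of the base set, the updates are ℎ = 𝐴𝑎 and 𝑎 = 𝐴ᵀℎ. A numpy sketch (not from the slides), including the scaling discussed on the next slide:

```python
import numpy as np

def hits(A: np.ndarray, iters: int = 50):
    """HITS on the base set; A[i, j] = 1 iff page i links to page j."""
    h = np.ones(len(A))        # initial hub scores
    a = np.ones(len(A))        # initial authority scores
    for _ in range(iters):
        h = A @ a              # h(d) = sum of a(y) over pages y that d links to
        a = A.T @ h            # a(d) = sum of h(y) over pages y that link to d
        h /= h.sum()           # scale down after each iteration;
        a /= a.sum()           # only relative values matter
    return h, a
```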
Details
▪ Scaling
▪ To prevent the 𝑎() and ℎ() values from getting too big, can
scale down after each iteration
▪ Scaling factor doesn’t really matter
▪ We care about the relative (as opposed to absolute) values
of the scores
▪ In most cases, the algorithm converges after a few
iterations
▪ See IIR Section 21.3 for details of computation
55
PageRank vs. HITS: Discussion
▪ PageRank can be pre-computed, HITS has to be
computed at query time
▪ HITS is too expensive in most application scenarios
▪ PageRank and HITS make two different design choices
concerning
▪ the eigen problem formalization
▪ the set of pages to apply the formalization to
▪ These two are orthogonal (We could also apply HITS to
the entire web and PageRank to a small base set)
▪ Claim: On the web, a good hub is almost always also a good authority
▪ The actual difference between PageRank ranking and
HITS ranking is therefore not as large as one might expect
56
Resources
▪ IIR Chapter 21
▪ Papers
▪ http://www2004.org/proceedings/docs/1p309.pdf
▪ http://www2004.org/proceedings/docs/1p595.pdf
▪ http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
▪ http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
57