IR on Web
Search Engines
Slides adapted from Dr Haddawy's material
IR on the Web
● Search engines use well-known techniques from IR.
● But IR algorithms were developed for relatively small and coherent
collections of documents, e.g. newspaper articles.
● The Web is massive, much less coherent, changes rapidly, and is
spread over geographically distributed computers.
● Selectivity Problem: Traditional techniques measure the similarity of
the query text with document texts. But the tiny queries over vast
collections that are typical of Web search engines prevent
similarity-based approaches from filtering sufficient numbers of
irrelevant pages out of the search results.
Challenges for Web Searching
● Distributed data
● Volatile data: 40% of the web changes every month
● Exponential growth
● Unstructured and redundant data: 30% of web pages
are near duplicates
● Unedited data
● Multiple formats
● Many different kinds of users
Challenges for Web Searching
● Web search queries are SHORT
– ~2-3 words on average
● User expectations are quite high
– Many say “the first item shown should be what I want to see”!
Web is a complex graph
[Diagram: multiple sites (Site 1, Site 2, Site 3, Site 5, Site 6), each containing several pages, with hyperlinks running both within and across sites]
Search Engine Architecture
● Spider
– Crawls the web to find pages; follows hyperlinks
● Indexer
– Produces data structures for fast searching of all words in the pages
● Retriever
– Query interface
– Database lookup to find hits (2 billion documents; 4 TB RAM, many terabytes of disk)
– Ranking
(A skeleton of these three components is sketched below.)
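A minimal sketch of how the three components might fit together. All names and signatures here are hypothetical, purely for illustration (the spider uses a canned page so the sketch runs end to end):

```python
# Hypothetical skeleton of the three components; not from any real engine.

def spider(seed_urls):
    """Crawl from the seeds, yielding (url, page_text) pairs.
    (A fuller crawler sketch appears later in these notes.)"""
    canned = {"http://a.example": "web search engines index the web"}
    for url in seed_urls:
        if url in canned:
            yield url, canned[url]

def indexer(pages):
    """Build an inverted index: word -> set of URLs containing it."""
    index = {}
    for url, text in pages:
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def retriever(index, query):
    """Database lookup: intersect the hit sets for the query words."""
    hits = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*hits) if hits else set()

index = indexer(spider(["http://a.example"]))
print(retriever(index, "web search"))  # {'http://a.example'}
```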
Typical Search Engine Architecture
[Diagram: Crawlers fetch pages from the Web into a Page Repository; the Indexer builds text and structure Indexes from them; the Query Engine evaluates user Queries against the indexes, and the Ranking module orders the Results returned to the user.]
Manpower and Hardware: Google
● 85 people
– 50% technical, 14 Ph.D. in Computer Science
● Equipment
– 2,500 Linux machines
– 80 terabytes of spinning disks
– 30 new machines installed daily
● Reported by Larry Page, Google, March 2000
– At that time, Google was handling 5.5 million searches per day
– Increase rate was 20% per month
● By fall 2002, Google had grown to over 400 people and 10,000 Linux servers (the world's largest Linux cluster).
Crawlers (Spiders, Bots)
● Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
● Start with an initial page P0. Find URLs on P0 and add them to a queue.
● When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat.
● Issues (see the crawler sketch below):
– Which page to look at next?
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
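A minimal breadth-first crawler sketch using only Python's standard library. The fixed delay and page budget are simplistic stand-ins for the issues listed above:

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10, delay=1.0):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and max_pages > 0:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable or broken page; skip it
        max_pages -= 1
        yield url, html  # hand the page to the indexing program
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # crude politeness: avoid overloading a site
```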
Page Visit Order
● Animated examples of breadth-first vs depth-first search on trees:
– http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
[Figure: the tree structure to be traversed]
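The difference in visit order is easy to see in code. This is a toy link graph (not from the slides); a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order:

```python
from collections import deque

# Toy link structure: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

def bfs(start):
    """Visit pages level by level: a queue (FIFO) yields breadth-first order."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start):
    """Follow links as deep as possible first: a stack (LIFO) yields depth-first order."""
    order, stack, seen = [], [start], {start}
    while stack:
        page = stack.pop()
        order.append(page)
        for nxt in reversed(links[page]):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs("A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```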
Indexing
• Arrangement of data to permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
• Sorting helps in searching (see the binary search sketch below)
– You probably use this when looking something up in the
telephone book or dictionary. For instance, "cold fusion" is
probably near the front, so you open maybe 1/4 of the way in.
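The same idea in code: binary search over the sorted word list, a small sketch using Python's standard bisect module:

```python
import bisect

words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]

def contains(sorted_list, word):
    """Binary search: halve the search range each step, like opening
    a dictionary roughly where the word should be."""
    i = bisect.bisect_left(sorted_list, word)
    return i < len(sorted_list) and sorted_list[i] == word

print(contains(words, "hen"))  # True, found in ~log2(10) ≈ 3-4 comparisons
print(contains(words, "cow"))  # False
```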
Inverted Files
● A FILE is a list of words by position:
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
– Last entry is the last word
● An inverted file is a list of positions by word!

INVERTED FILE (word → positions):
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (44)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
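A small sketch of building such a positional inverted file (word → list of positions) for a single document:

```python
from collections import defaultdict

def invert(text):
    """Map each word to the (1-based) positions where it occurs."""
    inverted = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        inverted[word].append(position)
    return inverted

doc = "a file is a list of words by position"
for word, positions in sorted(invert(doc).items()):
    print(word, positions)
# a [1, 4]
# by [8]
# file [2]
# ...
```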
Inverted Files for Multiple Documents
● The LEXICON stores, for each WORD, the number of documents it occurs in (NDOCS) and a pointer (PTR) into the WORD INDEX
● The WORD INDEX stores, for each document containing the word: the document ID (DOCID), the occurrence count (OCCUR), and the positions (POS 1, POS 2, ...)
● Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, . . .

LEXICON (WORD, NDOCS): jezebel 20; jezer 3; jezerit 1; jeziah 1; jeziel 1; jezliah 1; jezoar 1; jezrahliah 1; jezreel 39

WORD INDEX entries (DOCID OCCUR POS 1 POS 2 ...):
34 6 1 118 2087 3922 3981 5002
44 3 215 2291 3010
56 4 5 22 134 992
566 3 203 245 287
67 1 132
107 4 322 354 381 405
232 6 15 195 248 1897 1951 2192
677 1 481
713 3 42 312 802
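A sketch of the same structure in code: a lexicon mapping each word to per-document postings of (DOCID, positions). The documents here are toy stand-ins:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docid: text}. Returns {word: {docid: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for docid, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word][docid].append(pos)
    return index

# Toy stand-ins for real documents:
docs = {34: "jezebel spoke and jezebel answered", 44: "the story of jezebel"}
for docid, positions in build_index(docs)["jezebel"].items():
    print(docid, len(positions), positions)  # DOCID, OCCUR, POS 1 POS 2 ...
# 34 2 [1, 4]
# 44 1 [4]
```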
Ranking (Scoring) Hits
● Hits must be presented in some order
● What order?
– Relevance, recentness, popularity, reliability?
● Some ranking methods (combined in the toy scorer below):
– Presence of keywords in title of document
– Closeness of keywords to start of document
– Frequency of keyword in document
– Link popularity (how many pages point to this one)
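A toy scorer combining these heuristics. The weights are arbitrary, purely for illustration; real engines tune such signals carefully:

```python
def score(query, title, body, in_links):
    """Combine simple ranking heuristics with hand-picked (illustrative) weights."""
    words = body.lower().split()
    s = 0.0
    for term in query.lower().split():
        if term in title.lower():
            s += 3.0                              # keyword present in title
        if term in words:
            s += 2.0 / (words.index(term) + 1)    # closeness to document start
            s += words.count(term) / len(words)   # keyword frequency
    s += 0.1 * in_links                           # link popularity
    return s
```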
Ranking: Google
1. Vector space ranking with corrections for document
length
2. Extra weighting for specific fields, e.g., title, URLs, etc.
3. PageRank
The balance between 1, 2, and 3 is not made public.
Google’s PageRank Algorithm
● Assumption: a link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
– The “quality” of a page is related to the number of links that point to it (its in-degree)
● Apply recursively: the quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
PageRank Algorithm (Brin & Page, 1998)
PageRank
● Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
● to a randomly chosen web page with probability d
● to a randomly chosen successor of the current page with probability 1 − d
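A quick simulation of this walk (a sketch; d here is the teleport probability, as in the slide's formulation). With enough steps, the visit frequencies approach the PageRank values:

```python
import random

def simulate_surfer(links, d=0.15, steps=100_000):
    """Random surfer: teleport anywhere with probability d,
    else follow a random out-link of the current page."""
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if random.random() < d or not links[page]:
            page = random.choice(pages)           # jump to a random page
        else:
            page = random.choice(links[page])     # follow a random successor
    return {p: v / steps for p, v in visits.items()}
```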
PageRank Formula
$$\mathrm{PageRank}(p) = \frac{d}{n} + (1 - d) \sum_{(q,\,p)\in E} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

where n is the total number of nodes in the graph and E is the set of links
● Google reportedly uses a damping factor of about 0.85; since d in this formulation is the teleport probability, that corresponds to d ≈ 0.15
● PageRank is a probability distribution over web pages
PageRank Example
● Suppose page P is linked to by page A, which has 4 out-links, and by page B, which has 3 out-links. Then the PageRank of P is
(1 − d)∗[PageRank(A)/4 + PageRank(B)/3] + d/n
(A power-iteration sketch of this computation follows.)
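A compact power-iteration sketch of the formula above, on a toy graph with no dangling pages; it repeatedly applies the update until the ranks stabilize:

```python
def pagerank(links, d=0.15, iterations=50):
    """Iterate PageRank(p) = d/n + (1-d) * sum over in-links (q,p)
    of PageRank(q)/outdegree(q), starting from a uniform distribution."""
    n = len(links)
    ranks = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new_ranks = {p: d / n for p in links}
        for q, successors in links.items():
            for p in successors:
                new_ranks[p] += (1 - d) * ranks[q] / len(successors)
        ranks = new_ranks
    return ranks

# Toy graph: A and B both link to P; P links back to A.
print(pagerank({"A": ["P"], "B": ["P", "A"], "P": ["A"]}))
```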
Robot Exclusion
● You may not want certain pages indexed even though they remain viewable by browsers; note that robots.txt only requests exclusion and cannot access-protect a directory
● Some crawlers conform to the Robot Exclusion Protocol, but compliance is voluntary; one way to actually enforce exclusion is a firewall
● Compliant crawlers look for the file robots.txt at the highest directory level in the domain. If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
● A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS" CONTENT="NOINDEX"> (see the example below)
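For example, a robots.txt like the one in the comments below asks all crawlers to skip a directory, and Python's standard urllib.robotparser can check it before fetching. The crawler name and the /private/ path are made up for illustration, and the slide's example domain may not resolve today:

```python
# robots.txt placed at the site root, e.g. www.ecom.cmu.edu/robots.txt:
#
#   User-agent: *
#   Disallow: /private/
#
# A well-behaved crawler checks the file before fetching a page:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.ecom.cmu.edu/robots.txt")  # domain from the slide
rp.read()  # fetch and parse robots.txt
if rp.can_fetch("MyCrawler", "http://www.ecom.cmu.edu/private/page.html"):
    pass  # allowed: OK to crawl this URL
```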