IR on Web
Search Engines
Slides adapted from Dr Haddawy's material
IR on the Web
● Search engines use well-known techniques from IR.
● But IR algorithms were developed for relatively small and coherent
collections of documents, e.g. newspaper articles.
● The Web is massive, much less coherent, changes rapidly, and is
spread over geographically distributed computers.
● Selectivity Problem: Traditional techniques measure the similarity of
the query text with document texts. But the tiny queries over vast
collections that are typical of Web search engines prevent
similarity-based approaches from filtering sufficient numbers of
irrelevant pages out of the search results.
Challenges for Web Searching
● Distributed data
● Volatile data: 40% of the web changes every month
● Exponential growth
● Unstructured and redundant data: 30% of web pages
are near duplicates
● Unedited data
● Multiple formats
● Many different kinds of users
Challenges for Web Searching
● Web search queries are SHORT
– ~2-3 words on average
● User expectations are quite high
– Many say “the first item shown should be what I want to see”!
Web is a complex graph
[Diagram: multiple sites (Site 1, Site 2, Site 3, Site 5, Site 6), each containing several pages, with hyperlinks running both within and across sites]
Search Engine Architecture
● Spider
– Crawls the web to find pages; follows hyperlinks
● Indexer
– Produces data structures for fast searching of all words in the pages
● Retriever
– Query interface
– Database lookup to find hits (2 billion documents; 4 TB RAM, many terabytes of disk)
– Ranking
(A skeleton of these three components is sketched below.)
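A minimal sketch of how the three components might fit together. All names and signatures here are hypothetical, purely for illustration (the spider uses a canned page so the sketch runs end to end):

```python
# Hypothetical skeleton of the three components; not from any real engine.

def spider(seed_urls):
    """Crawl from the seeds, yielding (url, page_text) pairs.
    (A fuller crawler sketch appears later in these notes.)"""
    canned = {"http://a.example": "web search engines index the web"}
    for url in seed_urls:
        if url in canned:
            yield url, canned[url]

def indexer(pages):
    """Build an inverted index: word -> set of URLs containing it."""
    index = {}
    for url, text in pages:
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def retriever(index, query):
    """Database lookup: intersect the hit sets for the query words."""
    hits = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*hits) if hits else set()

index = indexer(spider(["http://a.example"]))
print(retriever(index, "web search"))  # {'http://a.example'}
```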
Typical Search Engine Architecture
[Diagram: Crawlers fetch pages from the Web into a Page Repository; the Indexer builds text and structure Indexes from them; the Query Engine evaluates user Queries against the indexes, and the Ranking module orders the Results returned to the user.]
Manpower and Hardware: Google
● 85 people
– 50% technical, 14 Ph.D. in Computer Science
● Equipment
– 2,500 Linux machines
– 80 terabytes of spinning disks
– 30 new machines installed daily
● Reported by Larry Page, Google, March 2000
– At that time, Google was handling 5.5 million searches per day
– Increase rate was 20% per month
● By fall 2002, Google had grown to over 400 people and 10,000 Linux servers (the world's largest Linux cluster).
Crawlers (Spiders, Bots)
● Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
● Start with an initial page P0. Find URLs on P0 and add them to a queue.
● When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat.
● Issues (see the crawler sketch below):
– Which page to look at next?
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
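A minimal breadth-first crawler sketch using only Python's standard library. The fixed delay and page budget are simplistic stand-ins for the issues listed above:

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10, delay=1.0):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and max_pages > 0:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable or broken page; skip it
        max_pages -= 1
        yield url, html  # hand the page to the indexing program
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # crude politeness: avoid overloading a site
```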
Page Visit Order
● Animated examples of breadth-first vs depth-first search on trees:
– http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
[Figure: the tree structure to be traversed]
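The difference in visit order is easy to see in code. This is a toy link graph (not from the slides); a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order:

```python
from collections import deque

# Toy link structure: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

def bfs(start):
    """Visit pages level by level: a queue (FIFO) yields breadth-first order."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start):
    """Follow links as deep as possible first: a stack (LIFO) yields depth-first order."""
    order, stack, seen = [], [start], {start}
    while stack:
        page = stack.pop()
        order.append(page)
        for nxt in reversed(links[page]):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs("A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```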
Indexing
• Arrangement of data to permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
• Sorting helps in searching (see the binary search sketch below)
– You probably use this when looking something up in the
telephone book or dictionary. For instance, "cold fusion" is
probably near the front, so you open maybe 1/4 of the way in.
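The same idea in code: binary search over the sorted word list, a small sketch using Python's standard bisect module:

```python
import bisect

words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]

def contains(sorted_list, word):
    """Binary search: halve the search range each step, like opening
    a dictionary roughly where the word should be."""
    i = bisect.bisect_left(sorted_list, word)
    return i < len(sorted_list) and sorted_list[i] == word

print(contains(words, "hen"))  # True, found in ~log2(10) ≈ 3-4 comparisons
print(contains(words, "cow"))  # False
```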
Inverted Files
● A FILE is a list of words by position:
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
– Last entry is the last word
● An inverted file is a list of positions by word!

INVERTED FILE (word → positions):
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (44)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
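A small sketch of building such a positional inverted file (word → list of positions) for a single document:

```python
from collections import defaultdict

def invert(text):
    """Map each word to the (1-based) positions where it occurs."""
    inverted = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        inverted[word].append(position)
    return inverted

doc = "a file is a list of words by position"
for word, positions in sorted(invert(doc).items()):
    print(word, positions)
# a [1, 4]
# by [8]
# file [2]
# ...
```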
Inverted Files for Multiple Documents
● The LEXICON stores, for each WORD, the number of documents it occurs in (NDOCS) and a pointer (PTR) into the WORD INDEX
● The WORD INDEX stores, for each document containing the word: the document ID (DOCID), the occurrence count (OCCUR), and the positions (POS 1, POS 2, ...)
● Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, . . .

LEXICON (WORD, NDOCS): jezebel 20; jezer 3; jezerit 1; jeziah 1; jeziel 1; jezliah 1; jezoar 1; jezrahliah 1; jezreel 39

WORD INDEX entries (DOCID OCCUR POS 1 POS 2 ...):
34 6 1 118 2087 3922 3981 5002
44 3 215 2291 3010
56 4 5 22 134 992
566 3 203 245 287
67 1 132
107 4 322 354 381 405
232 6 15 195 248 1897 1951 2192
677 1 481
713 3 42 312 802
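A sketch of the same structure in code: a lexicon mapping each word to per-document postings of (DOCID, positions). The documents here are toy stand-ins:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docid: text}. Returns {word: {docid: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for docid, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word][docid].append(pos)
    return index

# Toy stand-ins for real documents:
docs = {34: "jezebel spoke and jezebel answered", 44: "the story of jezebel"}
for docid, positions in build_index(docs)["jezebel"].items():
    print(docid, len(positions), positions)  # DOCID, OCCUR, POS 1 POS 2 ...
# 34 2 [1, 4]
# 44 1 [4]
```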
Ranking (Scoring) Hits
● Hits must be presented in some order
● What order?
– Relevance, recentness, popularity, reliability?
● Some ranking methods (combined in the toy scorer below):
– Presence of keywords in title of document
– Closeness of keywords to start of document
– Frequency of keyword in document
– Link popularity (how many pages point to this one)
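A toy scorer combining these heuristics. The weights are arbitrary, purely for illustration; real engines tune such signals carefully:

```python
def score(query, title, body, in_links):
    """Combine simple ranking heuristics with hand-picked (illustrative) weights."""
    words = body.lower().split()
    s = 0.0
    for term in query.lower().split():
        if term in title.lower():
            s += 3.0                              # keyword present in title
        if term in words:
            s += 2.0 / (words.index(term) + 1)    # closeness to document start
            s += words.count(term) / len(words)   # keyword frequency
    s += 0.1 * in_links                           # link popularity
    return s
```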
Ranking: Google
1. Vector space ranking with corrections for document
length
2. Extra weighting for specific fields, e.g., title, URLs, etc.
3. PageRank
The balance between 1, 2, and 3 is not made public.
Google’s PageRank Algorithm
● Assumption: a link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
– The “quality” of a page is related to the number of links that point to it (its in-degree)
● Apply recursively: the quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
PageRank Algorithm (Brin & Page, 1998)
PageRank
● Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
● to a randomly chosen web page with probability d
● to a randomly chosen successor of the current page with probability 1 − d
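A quick simulation of this walk (a sketch; d here is the teleport probability, as in the slide's formulation). With enough steps, the visit frequencies approach the PageRank values:

```python
import random

def simulate_surfer(links, d=0.15, steps=100_000):
    """Random surfer: teleport anywhere with probability d,
    else follow a random out-link of the current page."""
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if random.random() < d or not links[page]:
            page = random.choice(pages)           # jump to a random page
        else:
            page = random.choice(links[page])     # follow a random successor
    return {p: v / steps for p, v in visits.items()}
```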
PageRank Formula
$$\mathrm{PageRank}(p) = \frac{d}{n} + (1 - d) \sum_{(q,\,p)\in E} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

where n is the total number of nodes in the graph and E is the set of links
● Google reportedly uses a damping factor of about 0.85; since d in this formulation is the teleport probability, that corresponds to d ≈ 0.15
● PageRank is a probability distribution over web pages
PageRank Example
● Suppose page P is linked to by page A, which has 4 out-links, and by page B, which has 3 out-links. Then the PageRank of P is
(1 − d)∗[PageRank(A)/4 + PageRank(B)/3] + d/n
(A power-iteration sketch of this computation follows.)
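A compact power-iteration sketch of the formula above, on a toy graph with no dangling pages; it repeatedly applies the update until the ranks stabilize:

```python
def pagerank(links, d=0.15, iterations=50):
    """Iterate PageRank(p) = d/n + (1-d) * sum over in-links (q,p)
    of PageRank(q)/outdegree(q), starting from a uniform distribution."""
    n = len(links)
    ranks = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new_ranks = {p: d / n for p in links}
        for q, successors in links.items():
            for p in successors:
                new_ranks[p] += (1 - d) * ranks[q] / len(successors)
        ranks = new_ranks
    return ranks

# Toy graph: A and B both link to P; P links back to A.
print(pagerank({"A": ["P"], "B": ["P", "A"], "P": ["A"]}))
```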
Robot Exclusion
● You may not want certain pages indexed even though they remain viewable by browsers; note that robots.txt only requests exclusion and cannot access-protect a directory
● Some crawlers conform to the Robot Exclusion Protocol, but compliance is voluntary; one way to actually enforce exclusion is a firewall
● Compliant crawlers look for the file robots.txt at the highest directory level in the domain. If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
● A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS" CONTENT="NOINDEX"> (see the example below)
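For example, a robots.txt like the one in the comments below asks all crawlers to skip a directory, and Python's standard urllib.robotparser can check it before fetching. The crawler name and the /private/ path are made up for illustration, and the slide's example domain may not resolve today:

```python
# robots.txt placed at the site root, e.g. www.ecom.cmu.edu/robots.txt:
#
#   User-agent: *
#   Disallow: /private/
#
# A well-behaved crawler checks the file before fetching a page:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.ecom.cmu.edu/robots.txt")  # domain from the slide
rp.read()  # fetch and parse robots.txt
if rp.can_fetch("MyCrawler", "http://www.ecom.cmu.edu/private/page.html"):
    pass  # allowed: OK to crawl this URL
```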