Unit5 Irt

Uploaded by

sec22ad063

SEARCHING THE WEB

Introduction

Information Retrieval (IR) is the process of obtaining relevant information from large collections
of unstructured data, typically text. With the growth of the World Wide Web, web search has
become one of the most prominent applications of IR, helping users find relevant documents
among billions of web pages.

Imagine you're looking for the best pizza place near you. Instead of visiting every restaurant,
you type "best pizza near me" in Google. Within seconds, you get a list of recommendations.
How does this happen?

This is the power of Web Search—a process where search engines like Google, Bing, or Yahoo
scan the internet, find relevant web pages, and rank them based on how useful they are to your
query.

The process works in three main steps:

1. Web Crawling

Crawling is the process where automated programs called web crawlers or spiders
systematically browse the internet to discover and collect data from web pages.

● These crawlers visit websites, read their content, and follow links to other pages.

● This helps search engines stay updated with new and changed content on the web.
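
The link-following behaviour described above can be sketched on a toy example. This is only a minimal illustration: the `TOY_WEB` dictionary, its URLs, and the `crawl` function are made-up names, and a real crawler would download pages over the network and extract links from their HTML instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of URLs it links to.
# A real crawler would fetch each page and parse the links out of it.
TOY_WEB = {
    "a.example/home": ["a.example/menu", "b.example/blog"],
    "a.example/menu": ["a.example/home"],
    "b.example/blog": ["c.example/recipes", "a.example/menu"],
    "c.example/recipes": [],
}

def crawl(seed, web=TOY_WEB):
    """Visit pages breadth-first from the seed, visiting each page once."""
    visited = []            # pages in the order they were visited
    seen = {seed}           # guards against revisiting pages (link cycles)
    queue = deque([seed])   # frontier of pages still to visit
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("a.example/home"))
```

Starting from one seed page, the crawler reaches all four pages even though two of them link back to pages it has already seen.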

2. Indexing

Indexing is the process of organizing and storing the collected web data in a
structured format (like a digital library) so that it can be quickly retrieved during a
search.

● The system breaks down each page into keywords and maps them to the pages they
appear on.

● This structure is called an inverted index and is essential for fast searches.

3. Ranking and Searching

Ranking and Searching refers to the process where the search engine, after
receiving a user’s query, retrieves relevant pages from the index and sorts (ranks)
them based on various factors like relevance, popularity, and freshness.

● The goal is to show the most useful results at the top of the search results page.

For example:

● Searching "how to make a pizza" will show recipes, cooking videos, and articles.

● Searching "buy pizza online" will display pizza delivery websites.

This is because search engines use algorithms to understand what you are looking for and show
the most useful results.
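
One very simple way to rank pages is to count how often the query words occur in each page. The sketch below assumes that scoring rule purely for illustration; real search engines combine many more signals. The `docs` collection and the `rank` function are made-up names.

```python
import re

# Hypothetical mini-collection; names and texts are made up for illustration.
docs = {
    "recipe": "How to make a pizza: pizza dough, pizza sauce, and cheese",
    "delivery": "Buy pizza online and get fast delivery",
    "salad": "How to make a fresh salad",
}

def tokenize(text):
    # Lowercase and keep only word characters, dropping punctuation.
    return re.findall(r"\w+", text.lower())

def rank(query, docs=docs):
    """Score each document by how often the query words occur in it."""
    terms = tokenize(query)
    scores = {}
    for name, text in docs.items():
        words = tokenize(text)
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scores[name] = score
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(rank("make pizza"))
# [('recipe', 4), ('delivery', 1), ('salad', 1)]
```

The recipe page comes first because it mentions the query words most often; pages with no query words are left out.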

STEPS IN SEARCHING THE WEB:

1. Storing Web Pages

● Imagine we have a few web pages, each with some text content.
● We store these pages like entries in a dictionary, where each page has a name and some
text.

2. Creating an Inverted Index

● We take each word from the web pages and create an index that tells us:
○ Which words appear in which pages.
● For example, if the word "pizza" appears in Page 1 and Page 3, the index will show:
○ "pizza": [Page1, Page3]
● This helps us quickly find all the pages where a certain word is used.

3. Searching for a Word or Phrase

● When a user types a search query, like "best pizza", we:

○ Look into the inverted index.
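○ Find the pages that contain the query words. The sample code below returns only pages
that contain every word, so "best pizza" matches pages with both "best" and "pizza".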
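
A multi-word query can combine the per-word page lists in two ways: keep pages that contain any of the words (OR) or only pages that contain all of them (AND). The sketch below contrasts the two; the small `index` and the function names `search_any` and `search_all` are assumptions made for this example.

```python
# Hypothetical inverted index: word -> set of page names (illustrative data).
index = {
    "best": {"Page1", "Page4"},
    "pizza": {"Page1", "Page3"},
}

def search_any(query, index=index):
    """OR semantics: pages containing at least one query word."""
    result = set()
    for word in query.lower().split():
        result |= index.get(word, set())
    return result

def search_all(query, index=index):
    """AND semantics: pages containing every query word."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(sorted(search_any("best pizza")))  # ['Page1', 'Page3', 'Page4']
print(sorted(search_all("best pizza")))  # ['Page1']
```

OR gives more results but less precise ones; AND gives fewer results that match the whole query.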
4. Visualizing the Search Results

● To make results easier to understand, we draw a bar chart.

● Each bar represents a web page.
● The height of the bar shows how many words are in that page.
○ In this basic system, word count is only a rough stand-in for how detailed a page is; it
does not measure relevance or quality.

5. Displaying Relevant Pages

● Finally, we show the user which pages matched their search.

● The user sees:
○ Names of the matching pages.
○ Possibly a short snippet from each page (like in Google).
○ Optional: The bar graph to help compare content richness.
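
The short snippet mentioned above could be produced by showing a few words around the first query word found in the page. This is a minimal sketch under that assumption; `make_snippet`, its `width` parameter, and the sample page text are made up for illustration and do not describe how any real search engine builds snippets.

```python
import re

def make_snippet(text, query, width=3):
    """Return a few words around the first query-word match, or None."""
    words = text.split()
    terms = set(re.findall(r"\w+", query.lower()))
    for i, word in enumerate(words):
        # Compare on the bare word so "Pizza," still matches "pizza".
        if re.sub(r"\W+", "", word.lower()) in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None

# Hypothetical page text, used only for this example.
page = "Our family kitchen serves the best pizza in town with cheese and toppings"
print(make_snippet(page, "pizza"))
# ... serves the best pizza in town with ...
```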

Sample code:

import re
import matplotlib.pyplot as plt

# Web pages: page name -> text content
pages = {
    "Page1": "Best pizza in town with cheese and toppings",
    "Page2": "Delicious burgers and hot dogs are available here",
    "Page3": "Pizza, pasta, and garlic bread for food lovers",
    "Page4": "Healthy salads and smoothies for a perfect diet",
}

def tokenize(text):
    # Lowercase and keep only word characters, so "Pizza," is indexed as "pizza"
    return re.findall(r"\w+", text.lower())

# Build inverted index: word -> set of page names that contain it
index = {}
for name, text in pages.items():
    for word in tokenize(text):
        index.setdefault(word, set()).add(name)

# Search function: return the pages that contain every query word
def search(query):
    words = tokenize(query)
    matches = [index.get(w, set()) for w in words]
    return set.intersection(*matches) if matches else set()

# Plotting function: one bar per matching page, height = number of words
def show_results(results):
    names = sorted(results)
    lengths = [len(tokenize(pages[p])) for p in names]
    plt.bar(names, lengths, color="skyblue")
    plt.title("Matching Pages")
    plt.xlabel("Page")
    plt.ylabel("Number of words")
    plt.show()

# Run
q = input("Search: ")
found = search(q)
if found:
    print("Results:")
    for p in sorted(found):
        print(f"{p}: {pages[p]}")
    show_results(found)
else:
    print("No matches found.")

OUTPUT:

User Input:

Search: pizza

Results:

Page1: Best pizza in town with cheese and toppings

Page3: Pizza, pasta, and garlic bread for food lovers

IR AND WEB SEARCH

Web Search:
Web search refers to the process of using search engines (like Google, Bing, etc.) to find and
retrieve information from the World Wide Web. It involves searching through a vast array of web
pages, documents, and media across the internet using queries that typically consist of a few
keywords. The results are ranked based on relevance, authority, and other factors. Web search is
typically user-friendly and designed for quick, informal searches.
1. Languages: Indexes documents in many languages without additional subject analysis.
2. File Types: Indexes several file types, including some without text.
3. Document Length: Documents vary widely in length, with longer documents often split
into parts.
4. Document Structure: Web documents are semi-structured (HTML).
5. Spam: Web search engines decide which documents are suitable for indexing.
6. Amount of Data, Size of Databases: The actual size of the Web is unknown, and
complete indexing is impossible.
7. Type of Queries: Users typically enter short queries (2-3 words) with little search
knowledge.
8. User Interface: Easy-to-use interfaces for general users.
9. Ranking: Relevance ranking is necessary due to the large number of hits.
10. Search Functions: Limited query options.

Information Retrieval (IR):

Information Retrieval (IR) is the process of retrieving information from a large collection of data
(like a database, digital library, or document repository) based on a user's query. In contrast to
web search, a classical IR system usually queries a specific, well-defined collection of documents
or data. It uses techniques such as indexing, ranking, and query processing to find relevant
documents. Such systems are typically more complex to use and appear in research, legal
databases, and other specialized areas where the data is structured or specific.
1. Languages: Focuses on a single language or uses a consistent vocabulary across different
languages.
2. File Types: Primarily indexes consistent formats (e.g., PDF) or bibliographic info.
3. Document Length: Documents vary, but to a smaller degree than in Web Search.
4. Document Structure: Allows indexing of structured documents.
5. Spam: Suitable documents are predefined in database design.
6. Amount of Data, Size of Databases: The exact amount of data is known and defined by
formal criteria.
7. Type of Queries: Users are familiar with search syntax and use longer, more specific
queries.
8. User Interface: More complex, requiring user expertise.
9. Ranking: Relevance ranking is not always necessary since users constrain results.
10. Search Functions: Uses complex query languages to narrow down results.

Ranking in Search Engines

When you type something in a search engine (like "best phones in 2025"), the search engine
finds thousands or even millions of pages that match your keywords. But not all of them are
equally useful or trustworthy.
So, the search engine needs a ranking system to decide which results appear at the top and
which go lower down.

There are two major types of ranking used:

1. Static Ranking
2. Dynamic Ranking

1. Static Ranking (Permanent/Fixed Importance)

What is it?

Static ranking is like giving a page a permanent reputation score. It doesn’t change every time
someone searches—it’s calculated before the user searches.

What does it depend on?

● Popularity – If many people visit the site regularly.

● Backlinks – If other websites link to it (like references in a research paper).
● Website Age – Older websites are often considered more trustworthy.
● Domain Authority – Government sites (.gov), university sites (.edu), and well-known
companies often have higher authority.

Real-Life Example:

● The official website of Harvard University is likely to rank highly when you search for
"top universities" because it is trusted, has been around a long time, and is linked from
many places.

Advantages:

● Stable & Reliable: Since it's based on long-term trust (like backlinks and domain
authority), results don’t change often, ensuring users get high-quality, verified content.
● Faster Computation: Ranking scores are pre-computed, so search engines can retrieve
results more quickly.
● Less Prone to Manipulation: Difficult to artificially boost since it's based on long-term
reputation and quality.

Disadvantages:

● Not Always Up-to-Date: It might rank old content higher even if there’s newer, more
relevant information.
● Ignores User Behavior: Doesn’t adapt based on what users are actually clicking or
engaging with.
● Less Responsive to Trends: Can’t capture current events, viral news, or changing user
interests quickly.
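
A static score can be pictured as a number computed once per page from query-independent signals and then simply looked up at search time. The sketch below assumes a plain weighted sum of backlinks, website age, and domain trust; the page data, the caps, and the weights (0.5, 0.3, 0.2) are invented for illustration and are not how any particular search engine works.

```python
# Hypothetical page features, made up for illustration only.
PAGES = {
    "university.example.edu": {"backlinks": 900, "age_years": 25, "trusted_domain": True},
    "newblog.example.com": {"backlinks": 12, "age_years": 1, "trusted_domain": False},
    "shop.example.com": {"backlinks": 300, "age_years": 10, "trusted_domain": False},
}

def static_score(features):
    """Combine query-independent signals into one score between 0 and 1."""
    backlinks = min(features["backlinks"] / 1000, 1.0)   # capped at 1000 links
    age = min(features["age_years"] / 20, 1.0)           # capped at 20 years
    trust = 1.0 if features["trusted_domain"] else 0.0
    # Assumed weights: backlinks 0.5, age 0.3, domain trust 0.2.
    return round(0.5 * backlinks + 0.3 * age + 0.2 * trust, 3)

# Computed once, ahead of time; a search only needs a dictionary lookup.
STATIC_SCORES = {url: static_score(f) for url, f in PAGES.items()}

for url, score in sorted(STATIC_SCORES.items(), key=lambda kv: -kv[1]):
    print(url, score)
```

Because the scores do not depend on the query, the old, well-linked site stays on top no matter what is trending, which matches both the advantages and the disadvantages listed above.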

2. Dynamic Ranking (Real-Time Popularity)

What is it?

Dynamic ranking changes depending on current trends or user behavior. It’s calculated at the
time of the search.

What does it depend on?

● Click-Through Rate (CTR) – If many users are clicking on a link when it shows up in
search results.
● User Engagement – If people stay longer on a page, scroll through, and interact with it.
● Freshness – Newer content might rank higher temporarily, especially for news or
trending topics.
● Location & Time – A local event or recent news may appear at the top based on where
you are and when you search.

Real-Life Example:

● A news article about a breaking earthquake might not be from a famous site, but
because it’s new and everyone is clicking on it, it appears at the top of the search results.

Advantages:

● Up-to-Date Results: Ranks fresh, trending, or viral content higher when it matters most.
● User-Centered: Adapts to what people are actually searching and clicking, improving
relevance.
● Responsive to Trends: Great for time-sensitive content like news, sports, or social
media.

Disadvantages:

● Can Be Manipulated: High click-through rates can sometimes be artificially inflated to
push results up.
● Unstable Results: Rankings can change frequently, making the experience unpredictable.
● Higher Computational Cost: Requires real-time data analysis, which is
resource-intensive.
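
A dynamic score, by contrast, is computed when the search happens, from signals such as click-through rate and freshness. The sketch below assumes a simple formula: CTR plus a freshness value that halves every 24 hours. The function name `dynamic_score`, the weights (0.6 and 0.4), the half-life, and the sample results are all assumptions made for this example.

```python
def dynamic_score(clicks, impressions, age_hours, half_life_hours=24.0):
    """Query-time score from click-through rate and freshness (both 0..1)."""
    ctr = clicks / impressions if impressions else 0.0
    # Freshness halves every `half_life_hours`; brand-new content scores 1.0.
    freshness = 0.5 ** (age_hours / half_life_hours)
    # Assumed weights: CTR 0.6, freshness 0.4.
    return 0.6 * ctr + 0.4 * freshness

# Hypothetical results for one query: (name, clicks, impressions, age in hours)
results = [
    ("breaking-news", 400, 1000, 2),
    ("old-reference", 300, 1000, 24 * 365),
    ("yesterday-report", 100, 1000, 24),
]

ranked = sorted(results, key=lambda r: dynamic_score(r[1], r[2], r[3]), reverse=True)
for name, *_ in ranked:
    print(name)
```

With these made-up numbers the two-hour-old article ranks first, and the day-old report beats the year-old page even though the older page has more clicks, because its freshness has decayed to almost zero.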

Example Scenario to Understand Better

Static Ranking:

Let’s say you search "admission requirements for MIT"

● The official MIT website shows up at the top — not because everyone is clicking on it
now, but because it is trustworthy and authoritative.
● Its rank is static and doesn't change much.

Dynamic Ranking:

Now you search "latest cricket match winner"

● A new article from a sports website that just covered the match might show up at the top,
even if the website isn’t world-famous.
● This is because many people are clicking on it right now, and it has fresh content —
that’s dynamic ranking in action.
