Unit5 Irt

unit5-irt

Uploaded by

sec22ad063
Copyright
© © All Rights Reserved

SEARCHING THE WEB

Introduction

Information Retrieval (IR) is the process of obtaining relevant information from large collections
of unstructured data, typically text. With the growth of the World Wide Web, web search has
become one of the most prominent applications of IR, helping users find relevant documents
among billions of web pages.

Imagine you're looking for the best pizza place near you. Instead of visiting every restaurant,
you type "best pizza near me" in Google. Within seconds, you get a list of recommendations.
How does this happen?

This is the power of Web Search—a process where search engines like Google, Bing, or Yahoo
scan the internet, find relevant web pages, and rank them based on how useful they are to your
query.

The process works in three main steps:

1. Web Crawling

Crawling is the process where automated programs called web crawlers or spiders
systematically browse the internet to discover and collect data from web pages.

●​ These crawlers visit websites, read their content, and follow links to other pages.​

●​ This helps search engines stay updated with new and changed content on the web.
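
The crawling idea above can be sketched with a tiny in-memory "web". This is only an illustration: the `TOY_WEB` link graph and the `crawl` function are assumptions made for this example, and a real crawler would download pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch pages over HTTP; this toy graph is an assumption.
TOY_WEB = {
    "a.example/home": ["a.example/menu", "b.example/blog"],
    "a.example/menu": ["a.example/home"],
    "b.example/blog": ["c.example/recipe", "a.example/menu"],
    "c.example/recipe": [],
}

def crawl(seed, web):
    """Visit pages breadth-first from the seed, following each link once."""
    visited = []          # pages in the order they were visited
    seen = {seed}         # avoids queueing the same page twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("a.example/home", TOY_WEB))
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other.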

2. Indexing

Indexing is the process of organizing and storing the collected web data in a
structured format (like a digital library) so that it can be quickly retrieved during a
search.

●​ The system breaks down each page into keywords and maps them to the pages they
appear on.​

●​ This structure is called an inverted index and is essential for fast searches.

3. Ranking and Searching

Ranking and Searching refers to the process where the search engine, after
receiving a user’s query, retrieves relevant pages from the index and sorts (ranks)
them based on various factors like relevance, popularity, and freshness.

●​ The goal is to show the most useful results at the top of the search results page.

For example:

● Searching "how to make a pizza" will show recipes, cooking videos, and articles.

● Searching "buy pizza online" will display pizza delivery websites.

This is because search engines use algorithms to understand what you are looking for and show
the most useful results.
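
As a rough illustration of ranking, the sketch below scores each page by how often the query words occur in it and sorts the pages by that score. The `PAGES` collection and the scoring rule are assumptions made for this example. Real search engines combine many more signals, such as the popularity and freshness mentioned above.

```python
import re

# Hypothetical mini-collection, for illustration only.
PAGES = {
    "Recipe": "How to make a pizza at home: pizza dough, pizza sauce, cheese",
    "Delivery": "Buy pizza online and get fast delivery",
    "Salads": "Healthy salads and smoothies",
}

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def rank(query, pages):
    """Score each page by how many times the query words occur in it."""
    terms = set(tokenize(query))
    scores = {}
    for name, text in pages.items():
        score = sum(1 for word in tokenize(text) if word in terms)
        if score > 0:
            scores[name] = score
    # Highest score first; ties are broken by page name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(rank("make pizza", PAGES))
```

Pages that contain none of the query words are left out of the ranked list entirely.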

STEPS IN SEARCHING THE WEB:

1. Storing Web Pages

●​ Imagine we have a few web pages, each with some text content.
●​ We store these pages like entries in a dictionary, where each page has a name and some
text.

2. Creating an Inverted Index

●​ We take each word from the web pages and create an index that tells us:
○​ Which words appear in which pages.
●​ For example, if the word "pizza" appears in Page 1 and Page 3, the index will show:
○​ "pizza": [Page1, Page3]
●​ This helps us quickly find all the pages where a certain word is used.

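3. Searching for a Word or Phrase

● When a user types a search query, like "best pizza", we:
○ Look up each query word in the inverted index.
○ Find the pages that contain those words. The sample code below keeps only pages that contain every query word ("best" and "pizza"); a looser search would also accept pages that contain just one of them.
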
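The difference between matching every query word and matching any query word can be sketched as follows. The small `INDEX` mirrors the sample pages used in these notes, and the `lookup` function with its `mode` parameter is a hypothetical helper written for this example.

```python
# Hypothetical inverted index (word -> set of page names), small enough to check by hand.
INDEX = {
    "best": {"Page1"},
    "pizza": {"Page1", "Page3"},
    "pasta": {"Page3"},
}

def lookup(query, index, mode="all"):
    """Return pages matching every query word (mode="all") or any of them (mode="any")."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()                         # empty query matches nothing
    if mode == "all":
        return set.intersection(*postings)   # AND: page must contain every word
    return set.union(*postings)              # OR: page may contain any word

print(sorted(lookup("best pizza", INDEX)))              # pages with both words
print(sorted(lookup("best pizza", INDEX, mode="any")))  # pages with at least one word
```

Matching every word gives fewer but more precise results, while matching any word gives more results that then need ranking.
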
4. Visualizing the Search Results

● To make results easier to understand, we draw a bar chart.
● Each bar represents a web page.
● The height of the bar shows how many words are in that page.
○ Word count is used here as a rough sign of how detailed a page is (a simplification used in basic search systems).

5. Displaying Relevant Pages

● Finally, we show the user which pages matched their search.
● The user sees:
○ Names of the matching pages.
○ Possibly a short snippet from each page (like in Google).
○ Optional: The bar graph to help compare content richness.
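
The "short snippet" mentioned above could be produced in many ways. The sketch below is one simple possibility, not how any particular search engine does it: the hypothetical `make_snippet` function returns a few words around the first query word found in the page text.

```python
def make_snippet(text, query, width=3):
    """Return a few words around the first query word found in the text."""
    words = text.split()
    # Compare words case-insensitively and without surrounding punctuation.
    terms = {w.strip(".,!?").lower() for w in query.split()}
    for i, word in enumerate(words):
        if word.strip(".,!?").lower() in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return ""  # none of the query words occur in this text

print(make_snippet("Pizza, pasta, and garlic bread for food lovers", "garlic"))
```

The `width` parameter (an assumption of this sketch) controls how many words of context are kept on each side of the match.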

Sample code:

import re
import matplotlib.pyplot as plt

# Web pages: page name -> text content
pages = {
    "Page1": "Best pizza in town with cheese and toppings",
    "Page2": "Delicious burgers and hot dogs are available here",
    "Page3": "Pizza, pasta, and garlic bread for food lovers",
    "Page4": "Healthy salads and smoothies for a perfect diet"
}

# Lowercase the text and drop punctuation, so "Pizza," matches "pizza"
def tokenize(text):
    return re.findall(r"\w+", text.lower())

# Build inverted index: word -> set of page names that contain it
index = {}
for name, text in pages.items():
    for word in tokenize(text):
        index.setdefault(word, set()).add(name)

# Search function: returns the pages that contain every word of the query
def search(query):
    matches = [index.get(w, set()) for w in tokenize(query)]
    return set.intersection(*matches) if matches else set()

# Plotting function: one bar per matching page, height = number of words
def show_results(results):
    names = sorted(results)
    lengths = [len(pages[p].split()) for p in names]
    plt.bar(names, lengths, color='skyblue')
    plt.title("Matching Pages")
    plt.show()

# Run
q = input("Search: ")
found = search(q)
if found:
    print("Results:")
    for p in sorted(found):  # sorted so the output order is predictable
        print(f"{p}: {pages[p]}")
    show_results(found)
else:
    print("No matches found.")

OUTPUT:

User Input:

Search: pizza

Results:

Page1: Best pizza in town with cheese and toppings

Page3: Pizza, pasta, and garlic bread for food lovers

IR AND WEB SEARCH

Web Search:
Web search refers to the process of using search engines (like Google, Bing, etc.) to find and
retrieve information from the World Wide Web. It involves searching through a vast array of web
pages, documents, and media across the internet using queries that typically consist of a few
keywords. The results are ranked based on relevance, authority, and other factors. Web search is
typically user-friendly and designed for quick, informal searches.
1.​ Languages: Indexes documents in many languages without additional subject analysis.
2.​ File Types: Indexes several file types, including some without text.
3.​ Document Length: Documents vary widely in length, with longer documents often split
into parts.
4.​ Document Structure: Web documents are semi-structured (HTML).
5.​ Spam: Web search engines decide which documents are suitable for indexing.
6.​ Amount of Data, Size of Databases: The actual size of the Web is unknown, and
complete indexing is impossible.
7.​ Type of Queries: Users typically enter short queries (2-3 words) with little search
knowledge.
8.​ User Interface: Easy-to-use interfaces for general users.
9.​ Ranking: Relevance ranking is necessary due to the large number of hits.
10.​Search Functions: Limited query options.

Information Retrieval (IR):

Information Retrieval (IR), in the classical sense used in this comparison, is the process of
retrieving information from a large collection of data (like a database, digital library, or
document repository) based on a user's query. Unlike web search, classical IR usually works on
a specific, well-defined collection of documents or data. It uses techniques such as indexing,
ranking, and query processing to find relevant documents. Classical IR systems are typically
more complex to use and are found in research, legal databases, and other specialized areas
where the data is structured or specific.
1.​ Languages: Focuses on a single language or uses a consistent vocabulary across different
languages.
2.​ File Types: Primarily indexes consistent formats (e.g., PDF) or bibliographic info.
3.​ Document Length: Documents vary, but to a smaller degree than in Web Search.
4.​ Document Structure: Allows indexing of structured documents.
5.​ Spam: Suitable documents are predefined in database design.
6.​ Amount of Data, Size of Databases: The exact amount of data is known and defined by
formal criteria.
7.​ Type of Queries: Users are familiar with search syntax and use longer, more specific
queries.
8.​ User Interface: More complex, requiring user expertise.
9.​ Ranking: Relevance ranking is not always necessary since users constrain results.
10.​Search Functions: Uses complex query languages to narrow down results.

Ranking in Search Engines

When you type something in a search engine (like "best phones in 2025"), the search engine
finds thousands or even millions of pages that match your keywords. But not all of them are
equally useful or trustworthy.​
So, the search engine needs a ranking system to decide which results appear at the top and
which go lower down.

There are two major types of ranking used:


1.​ Static Ranking
2.​ Dynamic Ranking

1. Static Ranking (Permanent/Fixed Importance)

What is it?

Static ranking is like giving a page a permanent reputation score. It doesn’t change every time
someone searches—it’s calculated before the user searches.

What does it depend on?

● Popularity – If many people visit the site regularly.
● Backlinks – If other websites link to it (like references in a research paper).
● Website Age – Older websites are often considered more trustworthy.
● Domain Authority – Government sites (.gov), university sites (.edu), and well-known
companies often have higher authority.

Real-Life Example:

● The official website of Harvard University tends to rank highly when you
search for "top universities" because it’s trusted, has been around a long time, and is
linked from many places.

Advantages:

●​ Stable & Reliable: Since it's based on long-term trust (like backlinks and domain
authority), results don’t change often, ensuring users get high-quality, verified content.
●​ Faster Computation: Ranking scores are pre-computed, so search engines can retrieve
results more quickly.
● Less Prone to Manipulation: Difficult to artificially boost since it's based on long-term
reputation and quality.

Disadvantages:

●​ Not Always Up-to-Date: It might rank old content higher even if there’s newer, more
relevant information.
●​ Ignores User Behavior: Doesn’t adapt based on what users are actually clicking or
engaging with.
●​ Less Responsive to Trends: Can’t capture current events, viral news, or changing user
interests quickly.
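
A minimal sketch of the static ranking idea, using only the backlink factor: scores are computed once, before any search happens, and are only looked up at search time. The `LINKS` graph, the site names, and the plain backlink count are assumptions made for this example. Real engines use more refined link-based measures and combine them with the other factors listed above.

```python
# Hypothetical link graph: each site maps to the sites it links to.
LINKS = {
    "university.example": ["news.example"],
    "news.example": ["university.example"],
    "blog.example": ["university.example", "news.example"],
    "shop.example": ["university.example"],
}

def backlink_scores(links):
    """Count, for every site, how many other sites link to it."""
    scores = {site: 0 for site in links}
    for source, targets in links.items():
        for target in set(targets):          # count each linking site once
            if target != source:             # ignore self-links
                scores[target] = scores.get(target, 0) + 1
    return scores

# Computed once, ahead of any query, then simply looked up at search time.
STATIC_SCORES = backlink_scores(LINKS)

def order_by_static_score(matching_sites):
    """Sort matching sites by precomputed score, highest first."""
    return sorted(matching_sites, key=lambda s: (-STATIC_SCORES.get(s, 0), s))

print(order_by_static_score(["shop.example", "news.example", "university.example"]))
```

Because `STATIC_SCORES` does not depend on the query, the same order is returned no matter when or by whom the search is made.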

2. Dynamic Ranking (Real-Time Popularity)

What is it?

Dynamic ranking changes depending on current trends or user behavior. It’s calculated at the
time of the search.

What does it depend on?

●​ Click-Through Rate (CTR) – If many users are clicking on a link when it shows up in
search results.
●​ User Engagement – If people stay longer on a page, scroll through, and interact with it.
●​ Freshness – Newer content might rank higher temporarily, especially for news or
trending topics.
●​ Location & Time – A local event or recent news may appear at the top based on where
you are and when you search.

Real-Life Example:

●​ A news article about a breaking earthquake might not be from a famous site, but
because it’s new and everyone is clicking on it, it appears at the top of the search results.

Advantages:

●​ Up-to-Date Results: Ranks fresh, trending, or viral content higher when it matters most.
●​ User-Centered: Adapts to what people are actually searching and clicking, improving
relevance.
●​ Responsive to Trends: Great for time-sensitive content like news, sports, or social
media.

Disadvantages:

● Can Be Manipulated: High click-through rates can sometimes be artificially inflated to
push results up.
●​ Unstable Results: Rankings can change frequently, making the experience unpredictable.
●​ Higher Computational Cost: Requires real-time data analysis, which is
resource-intensive.
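
A minimal sketch of the dynamic ranking idea, using two of the factors above (click-through rate and freshness). The formula, the 24-hour half-life, and the `CANDIDATES` numbers are all assumptions made for illustration. The notes do not specify how real engines weigh these signals.

```python
def dynamic_score(clicks, impressions, age_hours, half_life_hours=24.0):
    """Multiply click-through rate by a freshness factor that halves every half_life_hours."""
    ctr = clicks / impressions if impressions else 0.0
    freshness = 0.5 ** (age_hours / half_life_hours)
    return ctr * freshness

# Hypothetical usage statistics gathered for one query.
CANDIDATES = [
    {"page": "old-famous-site", "clicks": 300, "impressions": 1000, "age_hours": 72},
    {"page": "breaking-news", "clicks": 200, "impressions": 1000, "age_hours": 0},
]

def rank_now(candidates):
    """Score every candidate at query time and return them best first."""
    return sorted(
        candidates,
        key=lambda c: dynamic_score(c["clicks"], c["impressions"], c["age_hours"]),
        reverse=True,
    )

for c in rank_now(CANDIDATES):
    print(c["page"])
```

Here the fresh article ranks first even though the older page has the higher click-through rate, because the older page's score is reduced by its age. The score must be recomputed as clicks and time change, which is where the extra computational cost comes from.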

Example Scenario to Understand Better

Static Ranking:

Let’s say you search "admission requirements for MIT"

●​ The official MIT website shows up at the top — not because everyone is clicking on it
now, but because it is trustworthy and authoritative.
●​ Its rank is static and doesn't change much.

Dynamic Ranking:

Now you search "latest cricket match winner"

●​ A new article from a sports website that just covered the match might show up at the top,
even if the website isn’t world-famous.
●​ This is because many people are clicking on it right now, and it has fresh content —
that’s dynamic ranking in action.
