Unit V Notes Adbt Adbt

The document discusses Advanced Database Technology, focusing on Information Retrieval (IR) and web search. It outlines various IR models, types of queries, text preprocessing techniques, evaluation measures, and current trends in web search, emphasizing the importance of user intent and personalization. Additionally, it highlights the role of analytics in improving business performance through data analysis.

Uploaded by

Vedhapriya BCA


Advanced Database Technology (Anna University)


MC4202 ADVANCED DATABASE TECHNOLOGY

UNIT V INFORMATION RETRIEVAL AND WEB SEARCH

Information Retrieval (IR) deals with the organization, storage, retrieval, and
evaluation of information from document repositories, particularly textual information.
More precisely, IR is the activity of finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within large
collections stored on computers. For example, information retrieval takes place when a
user enters a query into a search system and receives a list of matching documents.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has
asked for in the form of a query. Documents and queries are represented in a similar
manner, so that document selection and ranking can be formalized by a matching function
that returns a retrieval status value (RSV) for each document in the collection. Many IR
systems represent document contents by a set of descriptors, called terms, belonging to a
vocabulary V. An IR model determines the query-document matching function according to
one of four main approaches: the Boolean, vector space, probabilistic, and semantic models.

Retrieval Models
The classical IR models are the simplest and easiest to implement. They are based on
mathematical foundations that are well understood. Boolean, vector space, and
probabilistic are the three classical (statistical) IR models; together with the semantic
model they make up the four main approaches.
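
A matching function that returns an RSV for each document can be sketched for the vector space model. The snippet below is only a minimal illustration, assuming whitespace tokenization and raw term counts as weights; the names rsv_cosine and rank and the sample documents are hypothetical and do not come from the notes.

```python
import math
from collections import Counter

def rsv_cosine(query, document):
    """RSV as the cosine similarity of raw term-frequency vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return ids of documents with a non-zero RSV, best match first."""
    scored = [(rsv_cosine(query, text), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

# Hypothetical three-document collection.
docs = {
    "d1": "database systems store structured data",
    "d2": "information retrieval finds relevant documents",
    "d3": "web search engines use information retrieval",
}
print(rank("information retrieval", docs))  # ['d2', 'd3']
```

A Boolean model would replace the score with a true/false test, and a probabilistic model would replace it with an estimate of the probability of relevance.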


Types of retrieval model:

• Classical IR Model. It is the simplest and easiest IR model to implement. ...
• Non-Classical IR Model. It is the complete opposite of the classical IR model. ...
• Alternative IR Model. ...

Related indexing and term-weighting concepts:

• Inverted Index. ...
• Stop Word Elimination. ...
• Stemming. ...
• Term Weighting. ...
• Term Frequency (tf_ij)

TYPES OF QUERIES IN IR SYSTEMS:

During indexing, many keywords are associated with the document set, including
words, phrases, creation dates, author names, and document types. An IR system uses them
to build an inverted index, which is then consulted during search. The queries formulated
by users are compared against the set of index keywords. Most IR systems also allow
Boolean and other operators for building complex queries. A query language with these
operators lets users express their information need more precisely.
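
The indexing step described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and in-memory sets as posting lists; the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_search(index, query):
    """Keyword query: the terms are implicitly ANDed together."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical document set.
docs = {
    1: "web search engines crawl pages",
    2: "search engines build an inverted index",
    3: "an inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(keyword_search(index, "inverted index"))  # [2, 3]
print(keyword_search(index, "search index"))    # [2]
```

The index is built once and consulted for every query, so a search touches only the posting sets of the query terms rather than every document.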
1. Keyword Queries:
• Simplest and most commonly used queries.
• The user enters a combination of keywords to retrieve documents.
• The keywords are implicitly connected by a logical AND operator.
• All retrieval models provide support for keyword queries.
2. Boolean Queries:
• Some IR systems allow the Boolean operators AND, OR, NOT, +, - and parentheses ( )
to be combined with keywords.
• No ranking is involved, because a document either satisfies such a query or does
not.
• A document is retrieved for a Boolean query if the query evaluates to true for it,
that is, on an exact match.
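
Boolean retrieval maps naturally onto set operations over posting sets. The sketch below assumes a tiny hand-written index (the terms and document ids are hypothetical) and shows why no ranking arises: each document is either in the result set or not.

```python
# Hypothetical posting sets: term -> ids of documents containing the term.
index = {
    "database": {1, 2, 4},
    "retrieval": {2, 3},
    "web": {3, 4},
}
all_docs = {1, 2, 3, 4}

def posting(term):
    """Documents containing the term (empty set for unknown terms)."""
    return index.get(term, set())

def negate(docs):
    """NOT: every document in the collection that is not in docs."""
    return all_docs - docs

# database AND NOT web
q1 = posting("database") & negate(posting("web"))
# (retrieval OR web) AND database
q2 = (posting("retrieval") | posting("web")) & posting("database")

print(sorted(q1))  # [1, 2]
print(sorted(q2))  # [2, 4]
```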
3. Phrase Queries:
• When documents are represented using an inverted keyword index for searching, the
relative order of terms in the document is lost.
• To perform exact phrase retrieval, phrases must be encoded in the inverted index or
handled by a separate mechanism.
• A phrase query consists of a sequence of words that make up a phrase.
• The phrase is generally enclosed within double quotes.
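
One way to keep word order available is a positional index, which stores the positions of each term in each document. The sketch below is an illustration under that assumption, not necessarily the encoding the notes have in mind; the function names and documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents where the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term i of the phrase must sit exactly i positions after the first term.
            if all(start + i in index[t].get(doc_id, []) for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical documents.
docs = {
    1: "information retrieval on the web",
    2: "retrieval of information",
    3: "web information retrieval systems",
}
index = build_positional_index(docs)
print(phrase_search(index, '"information retrieval"'.strip('"')))  # [1, 3]
```

Document 2 contains both words but not in sequence, so a plain keyword query would return it while the phrase query does not.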
4. Proximity Queries:
• Proximity search accounts for how close multiple terms should be to each other
within a record.
• The most commonly used proximity search option is a phrase search, which requires
the terms to be in exact order.


• Other proximity operators can specify how close terms should be to each other. Some
also specify the order of the search terms.
• Search engines use various operator names such as NEAR, ADJ (adjacent), or
AFTER.
• However, supporting complex proximity operators is expensive because it requires
time-consuming preprocessing of documents, so it is better suited to smaller
document collections than to the web.
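
A NEAR-style operator can be sketched as a distance check on token positions. This is a simplified illustration that scans one text directly rather than using a precomputed positional index; the function name near, its parameters, and the sample sentence are hypothetical, and real engines define NEAR, ADJ, and AFTER in their own ways.

```python
def near(text, term_a, term_b, k, ordered=False):
    """True if term_a and term_b occur within k positions of each other.

    With ordered=True, term_b must come after term_a (an AFTER-like check).
    """
    tokens = text.lower().split()
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    for a in pos_a:
        for b in pos_b:
            gap = b - a
            if ordered and 0 < gap <= k:
                return True
            if not ordered and 0 < abs(gap) <= k:
                return True
    return False

text = "search engines rank pages by relevance to the query"
print(near(text, "search", "pages", 3))                 # True
print(near(text, "search", "query", 3))                 # False
print(near(text, "pages", "search", 3, ordered=True))   # False
```

With k = 1 and ordered=True the check reduces to adjacency, which is the two-word case of a phrase search.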
5. Wildcard Queries:
• Wildcard queries support regular expressions and pattern-matching-based searching in text.
• Retrieval models do not directly support this query type.
• IR systems may implement certain kinds of wildcard search.
Example: matching words that share a prefix followed by any trailing characters.
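
A trailing-wildcard query such as comput* can be answered by a prefix lookup in the sorted vocabulary. The sketch below assumes that only a single trailing * is supported; the function name and the vocabulary are hypothetical.

```python
from bisect import bisect_left

def prefix_matches(sorted_vocab, pattern):
    """Return vocabulary terms matching a pattern with one trailing '*'."""
    if not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("this sketch supports only a single trailing *")
    prefix = pattern[:-1]
    # All terms sharing the prefix are contiguous in a sorted list.
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    for term in sorted_vocab[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

vocab = sorted(["compute", "computer", "computing", "database", "retrieval", "retrieve"])
print(prefix_matches(vocab, "comput*"))   # ['compute', 'computer', 'computing']
print(prefix_matches(vocab, "retriev*"))  # ['retrieval', 'retrieve']
```

The matched terms are then looked up in the inverted index like ordinary keywords, with their posting sets combined by OR.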
6. Natural Language Queries:
• Only a few natural language search engines aim to understand the structure and
meaning of queries written in natural language, generally as a question or a
narrative.
• The system tries to formulate answers to these queries from the retrieved results.
• Semantic models can provide support for this query type.

TEXT PREPROCESSING: Text preprocessing is the initial phase of text mining. Various
preprocessing techniques are used to prepare text documents for categorization: filtering,
sentence splitting, stemming, stop word removal, and token frequency counting. Filtering
applies a set of rules for removing duplicate strings and irrelevant text.
The main text preprocessing steps are:
1. Tokenization.
2. Lowercasing.
3. Stop word removal.
4. Stemming.
5. Lemmatization.

Tokenization splits a text into smaller units called tokens, typically words, numbers,
and punctuation marks, which the later preprocessing steps operate on. This is unrelated to
tokenization in data security, where sensitive values are replaced by surrogate tokens to
protect them.


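Stemming and lemmatization are text normalization (sometimes called word
normalization) techniques in Natural Language Processing that prepare text, words, and
documents for further processing. Stemming strips affixes by rule and may produce a stem
that is not a real word, whereas lemmatization maps each word to its dictionary form (lemma).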

Stop word removal: Stop word removal is one of the most commonly used
preprocessing steps across NLP applications. The idea is simply to remove the words that
occur commonly across all documents in the corpus. Articles and pronouns are typically
classified as stop words.

Preprocessing is an essential step because it prepares the text data for mining. If it
is skipped, the data remains inconsistent and does not yield good analytics results.

Text preprocessing is used to clean up text data: convert words to their root forms (by
stemming or lemmatization) and filter out unwanted digits, punctuation, and stop words.

Some of the common text preprocessing / cleaning steps are:

• Lowercasing.
• Removal of punctuation.
• Removal of stop words.
• Removal of frequent words.
• Removal of rare words.
• Stemming.
• Lemmatization.
• Removal of emojis.
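
Several of these steps can be chained into a small pipeline. The sketch below is illustrative only: the stop word list is a tiny hypothetical sample, naive_stem is a crude suffix stripper rather than a real stemming algorithm, and lemmatization is left out because it needs a dictionary.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it", "they"}

def naive_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Lowercase, then keep only alphabetic runs: this tokenizes the text and
    # drops digits, punctuation, and emojis in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("The crawlers are indexing 3 pages, and the pages are ranked."))
# ['crawler', 'index', 'page', 'page', 'rank']
```

The output still contains the repeated token page, which is what a subsequent token frequency count would operate on.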

EVALUATION MEASURES


Evaluation measures for an information retrieval system are used to assess how well the
search results satisfy the user's query intent. The field has used various quantitative
metrics for this purpose, based either on observed user behavior or on scores from prepared
benchmark test sets. Besides benchmarking with such measures, an evaluation of an IR system
should also validate the measures themselves, i.e. assess how well they measure what they
are intended to measure and how well the system fits its intended use case. [1] Metrics are
often split into two types: online metrics look at users' interactions with the search
system, while offline metrics measure theoretical relevance, in other words how likely each
result, or the search engine results page (SERP) as a whole, is to meet the information
needs of the user.

Online metrics
Online metrics are generally created from search logs. The metrics are often used to determine
the success of an A/B test.
Session abandonment rate
Session abandonment rate is a ratio of search sessions which do not result in a click.
Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific link to the number of total
users who view a page, email, or advertisement. It is commonly used to measure the success of
an online advertising campaign for a particular website as well as the effectiveness of email
campaigns.[2]
Session success rate
Session success rate measures the ratio of user sessions that lead to a success. Defining
"success" often depends on context. For search, a successful result is often measured using
dwell time as the primary factor along with secondary user interactions: for instance, the
user copying the result URL, or copying and pasting from the snippet, counts as a successful
result.
Zero result rate
Zero result rate (ZRR) is the ratio of search engine results pages (SERPs) that are returned
with zero results. The metric indicates either a recall issue or that the information being
searched for is not in the index.
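
Three of these ratios can be computed from a search log as sketched below. The log format is hypothetical: it assumes one query per session and that each record already carries a success flag (for example, derived from dwell time).

```python
# Hypothetical search log: one record per session.
sessions = [
    {"results": 10, "clicks": 2, "success": True},
    {"results": 0,  "clicks": 0, "success": False},
    {"results": 8,  "clicks": 0, "success": False},
    {"results": 5,  "clicks": 1, "success": True},
]

def online_metrics(sessions):
    """Compute session-level ratios from the log records."""
    n = len(sessions)
    if n == 0:
        raise ValueError("no sessions to evaluate")
    return {
        # Sessions that did not result in any click.
        "abandonment_rate": sum(s["clicks"] == 0 for s in sessions) / n,
        # Result pages that came back empty.
        "zero_result_rate": sum(s["results"] == 0 for s in sessions) / n,
        # Sessions flagged as successful.
        "success_rate": sum(s["success"] for s in sessions) / n,
    }

print(online_metrics(sessions))
# {'abandonment_rate': 0.5, 'zero_result_rate': 0.25, 'success_rate': 0.5}
```

In an A/B test the same function would be applied to the sessions of each variant and the resulting ratios compared.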

Offline metrics

Offline metrics are generally created from relevance judgment sessions where the judges score
the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g.,
relevance from 0 to 5) scales can be used to score each document returned in response to a
query. In practice, queries may be ill-posed, and there may be different shades of relevance.
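
How judged relevance turns into a score can be sketched with two widely used offline measures. The notes do not name specific measures, so the choice of precision at k (for binary judgments) and discounted cumulative gain (for graded judgments) is an assumption, as are the sample judgments.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that the judges marked relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def dcg(grades):
    """Discounted cumulative gain: graded relevance discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

# Binary judgments: the set of documents judged relevant (hypothetical).
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranking, relevant, 2))  # 0.5

# Multi-level judgments (0 to 5 scale) for the same four ranks (hypothetical).
print(round(dcg([3, 2, 0, 1]), 3))
```

DCG rewards placing highly graded documents near the top: the same grades in reverse order give a lower score.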
WEB SEARCH
A web search engine is a specialized software system that searches for data on the
Web. The results of a user query are returned as a list (known as hits). The hits
can include web pages, images, and other types of files.
Some search engines also search and return data available in public
databases or open directories. Search engines differ from web directories in that web
directories are maintained by human editors, whereas search engines work
algorithmically or by a combination of algorithmic and human input.


Web search engines are large data mining applications. Data mining techniques are
used in all components of a search engine: crawling (e.g., deciding which pages must
be crawled and how often), indexing (e.g., selecting the pages to be indexed and
determining to what extent the index must be constructed), and searching (e.g.,
determining how pages must be ranked, which advertisements must be added, and how
the search results can be customized or made "context aware").
ANALYTICS
Analytics is the systematic computational analysis of data or statistics. [1] It is used for the
discovery, interpretation, and communication of meaningful patterns in data. It also entails
applying data patterns toward effective decision-making. It can be valuable in areas rich with
recorded information; analytics relies on the simultaneous application of statistics, computer
programming, and operations research to quantify performance.
Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include descriptive analytics, diagnostic
analytics, predictive analytics, prescriptive analytics, and cognitive analytics.[2] Analytics may
apply to a variety of fields such as marketing, management, finance, online systems,
information security, and software services. Since analytics can require extensive computation
(see big data), the algorithms and software used for analytics harness the most current
methods in computer science, statistics, and mathematics.

CURRENT TRENDS IN WEB SEARCH

1. Voice search will become even more relevant

Voice search is already an integral part of our daily lives: we ask Siri where the closest gas
station is or say "Hey Google, which Thai restaurant is the highest rated in my town?" At the
moment, optimizing for these kinds of voice searches is recommended especially for
e-commerce sites or websites whose users are likely to have their hands full. For example, if
you run a recipe blog, you want your users to find out how long to let the dough rest
without having to type on the phone with potentially dirty hands.
2. Your site search can no longer offer zero results pages
A zero results page for your user means a lost client for you. But what seems like a problem
can be a great opportunity to increase your revenue. For example, suppose a user searches for
Ralph Lauren winter shoes and you do not stock them. You cannot offer that product, but you
can show results for other relevant products, such as summer shoes by Ralph Lauren or winter
shoes by other brands.
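
One possible way to avoid an empty results page is to relax the query one constraint at a time. The sketch below assumes a hypothetical catalogue with brand, season, and category attributes and keeps the category fixed; the product names and the function search are invented for illustration.

```python
# Hypothetical product catalogue.
products = [
    {"name": "RL summer loafers", "brand": "ralph lauren", "season": "summer", "category": "shoes"},
    {"name": "Acme winter boots", "brand": "acme", "season": "winter", "category": "shoes"},
    {"name": "RL winter coat", "brand": "ralph lauren", "season": "winter", "category": "coat"},
]

def search(products, brand, season, category):
    """Return ('exact', names) or, if nothing matches, ('related', names)."""
    def match(**wanted):
        return [p["name"] for p in products
                if all(p[key] == value for key, value in wanted.items())]

    exact = match(brand=brand, season=season, category=category)
    if exact:
        return "exact", exact
    # Fallback: drop the season, then drop the brand, but keep the category.
    related = match(brand=brand, category=category) + match(season=season, category=category)
    return "related", related

print(search(products, "ralph lauren", "winter", "shoes"))
# ('related', ['RL summer loafers', 'Acme winter boots'])
```

Returning a label alongside the results lets the page state clearly that it is showing related products rather than exact matches.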
3. Search will become more personalized than ever
With personalization, you can offer relevant results for each user based on their preferences
and prior search behavior. For example, an HR person might already have downloaded a PDF
targeted at HR managers from the website. Based on this behavior, they would be classified
as a B2B user and could get more B2B-oriented results in their search.
4. Site search will feel less like search and more intuitive
A good site search is the one you do not even think about as a user. You use it so intuitively
that you don’t need to assess what you are doing – you just do it. In 2022, site search will
look even less like classical search.
