Unit v notes adbt - adbt
Advanced Database Technology (Anna University)
MC4202 ADVANCED DATABASE TECHNOLOGY
UNIT V INFORMATION RETRIEVAL AND WEB SEARCH
An Information Retrieval (IR) system can be defined as a software program that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information Retrieval is the activity of obtaining material
of an unstructured nature, usually text, that satisfies an information need from within large
collections stored on computers. For example, information retrieval takes place when a user
enters a query into a search system and receives the matching documents.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked
for in the form of a query. Documents and queries are represented in a similar manner, so
that document selection and ranking can be formalized by a matching function that returns a
retrieval status value (RSV) for each document in the collection. Many IR systems represent
document contents by a set of descriptors, called terms, belonging to a vocabulary V. An IR
model determines the query-document matching function according to four main approaches: the
Boolean, vector space, and probabilistic models, and the semantic model.
Retrieval Models
The classical IR model is the simplest and the easiest to implement. It is based on
mathematical knowledge that is easily recognized and understood. Boolean, vector space,
and probabilistic are the three classical IR models. These three are the main statistical
models; the semantic model is the fourth approach.
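The idea of a matching function that returns a retrieval status value (RSV) for each document can be illustrated with a small sketch. The scoring rule used here (the number of distinct query terms found in a document) is an assumption chosen for simplicity and is not one of the classical models; the names rsv and rank and the sample documents are hypothetical.

```python
def rsv(query, document):
    """Toy retrieval status value: number of distinct query terms in the document."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    return len(query_terms & document_terms)


def rank(query, docs):
    """Return ids of documents with a non-zero RSV, highest score first."""
    scored = [(rsv(query, text), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Sort by descending score; ties are broken by ascending document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = {
    1: "database storage and indexing",
    2: "web search and information retrieval",
    3: "information storage",
}
ranking = rank("information retrieval", docs)
print(ranking)
```

Documents and queries are both reduced to sets of terms, which mirrors the point that they are represented in a similar manner.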
Types of retrieval model:
Classical IR Model. It is the simplest and easy to implement IR model. ...
Non-Classical IR Model. It is completely opposite to classical IR model. ...
Alternative IR Model. ...
Inverted Index. ...
Stop Word Elimination. ...
Stemming. ...
Term Weighting. ...
Term Frequency (tf_ij)
TYPES OF QUERIES IN IR SYSTEMS:
During indexing, many keywords are associated with the document set, including words,
phrases, creation dates, author names, and document types. The IR system uses them to
build an inverted index, which is then consulted during the search. The queries formulated
by users are compared to the set of index keywords. Most IR systems also allow the use of
Boolean and other operators to build a complex query. A query language with these
operators makes it possible to express a user’s information need more precisely.
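The inverted index mentioned above can be sketched as a mapping from each term to the documents that contain it. This is a minimal illustration, assuming whitespace tokenization and lowercasing only; the function name build_inverted_index and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "database systems store structured data",
    2: "information retrieval systems search text",
    3: "web search engines rank text documents",
}
index = build_inverted_index(docs)
print(index["systems"])
print(index["text"])
```

At search time only the posting lists of the query terms are consulted, so the documents themselves do not have to be scanned.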
1. Keyword Queries:
Simplest and most common queries.
The user enters only keyword combinations to retrieve documents.
The keywords are implicitly connected by a logical AND operator.
All retrieval models provide support for keyword queries.
2. Boolean Queries:
Some IR systems allow the Boolean operators +, -, AND, OR, NOT, and ( ) to be combined
with keyword formulations.
No ranking is involved because a document either satisfies such a query or does not
satisfy it.
A document is retrieved for a Boolean query if the query evaluates to true as an exact
match in the document.
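Boolean queries map naturally onto set operations over posting sets. The sketch below assumes a tiny hand-made index (the postings dictionary and the helper docs_with are hypothetical) and shows AND, OR, and AND NOT; a document is either in the result set or not, so no ranking is produced.

```python
# Hypothetical posting sets: term -> ids of the documents containing that term.
postings = {
    "database": {1, 4},
    "retrieval": {2, 4},
    "web": {3, 4},
}


def docs_with(term):
    """Return the posting set of a term, or an empty set for an unknown term."""
    return postings.get(term, set())


# database AND retrieval: set intersection
both = docs_with("database") & docs_with("retrieval")
# database OR web: set union
either = docs_with("database") | docs_with("web")
# retrieval AND NOT web: set difference
without_web = docs_with("retrieval") - docs_with("web")

print(sorted(both), sorted(either), sorted(without_web))
```

A plain keyword query with several terms corresponds to the AND case, since the keywords are connected by a logical AND.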
3. Phrase Queries:
When documents are represented using an inverted keyword index for searching, the
relative order of terms in the document is lost.
To perform exact phrase retrieval, phrases are encoded in the inverted index or
implemented differently.
A phrase query consists of a sequence of words that make up a phrase.
It is generally enclosed within double quotes.
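One common way to keep word order available is a positional index, which stores the positions of each term in each document. The sketch below is an illustration under that assumption, not necessarily how a given IR system encodes phrases; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build a mapping term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents in which the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Term number i of the phrase must appear at position start + i.
            if all(start + i in index[term][doc_id] for i, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {1: "web search engine", 2: "search the web", 3: "a fast web search"}
index = build_positional_index(docs)
print(phrase_search(index, "web search"))
```

Document 2 contains both words but not in the required order, so it is not returned for the phrase "web search".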
4. Proximity Queries:
Proximity search accounts for how close multiple terms should be to each other within
a record.
The most commonly used proximity search option is a phrase search, which requires the
terms to be in exact order.
Other proximity operators can specify how close terms should be to each other. Some
also specify the order of the search terms.
Search engines use various operator names such as NEAR, ADJ (adjacent), or
AFTER.
However, supporting complex proximity operators is expensive because it requires
time-consuming pre-processing of documents, so it is better suited to smaller
document collections than to the web.
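A NEAR-style check can be sketched by comparing token positions. The exact semantics of NEAR, ADJ, and AFTER differ between search engines, so the function below, its name near, and its k and ordered parameters are assumptions made for illustration.

```python
def near(text, term_a, term_b, k, ordered=False):
    """Return True if term_a and term_b occur within k token positions of each other.

    With ordered=True, term_a must additionally come before term_b.
    """
    tokens = text.lower().split()
    positions_a = [i for i, token in enumerate(tokens) if token == term_a]
    positions_b = [i for i, token in enumerate(tokens) if token == term_b]
    if ordered:
        return any(0 < j - i <= k for i in positions_a for j in positions_b)
    return any(abs(i - j) <= k for i in positions_a for j in positions_b)


text = "information retrieval deals with the storage of information"
print(near(text, "retrieval", "storage", 4))
print(near(text, "retrieval", "storage", 3))
print(near(text, "storage", "retrieval", 4, ordered=True))
```

This version scans the text at query time; a real system would read positions from a pre-built index, which is the pre-processing cost mentioned above.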
5. Wildcard Queries:
Wildcard queries support regular expressions and pattern matching-based searching in text.
Retrieval models do not directly support this query type.
IR systems may nevertheless implement certain kinds of wildcard search.
Example: a trailing wildcard, where a query such as data* matches words that start with
data, for instance database.
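A trailing wildcard amounts to a prefix search over the index vocabulary. The sketch below assumes the vocabulary is kept as a sorted list and uses binary search to find the first candidate; the function name prefix_matches and the sample vocabulary are hypothetical.

```python
from bisect import bisect_left


def prefix_matches(sorted_vocab, prefix):
    """Return all terms in a sorted vocabulary that start with the given prefix."""
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    for term in sorted_vocab[start:]:
        # Terms sharing the prefix are contiguous in sorted order.
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


vocab = sorted(["compile", "compute", "computer", "data", "database", "index"])
print(prefix_matches(vocab, "comput"))
print(prefix_matches(vocab, "data"))
```

Each matching term can then be looked up in the inverted index as if the user had typed it directly.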
6. Natural Language Queries:
There are only a few natural language search engines that aim to understand the
structure and meaning of queries written in natural language text, generally as question
or narrative.
The system tries to formulate answers for these queries from retrieved results.
Semantic models can provide support for this query type.
TEXT PREPROCESSING: Text preprocessing is an initial phase in text mining. Various
preprocessing techniques are applied when categorizing text documents: filtering, sentence
splitting, stemming, stop word removal, and token frequency counting. Filtering applies
a set of rules for removing duplicate strings and irrelevant text.
The various text preprocessing steps are:
1. Tokenization.
2. Lower casing.
3. Stop words removal.
4. Stemming.
5. Lemmatization.
Tokenization splits a text into smaller units called tokens, typically words, numbers,
and punctuation marks, so that the later preprocessing steps can work on individual tokens.
It should not be confused with tokenization in data security, where sensitive data is
replaced by a substitute token in order to protect it.
Stemming and Lemmatization are Text Normalization (sometimes called Word
Normalization) techniques in the field of Natural Language Processing that are used to
prepare text, words, and documents for further processing. Stemming strips word endings
by rule, while lemmatization maps a word to its dictionary form (lemma).
Stop words removal: Stop word removal is one of the most commonly used
preprocessing steps across different NLP applications. The idea is simply to remove the
words that occur commonly across all the documents in the corpus. Typically, articles and
pronouns are classified as stop words.
Preprocessing is an essential step because it makes the text data ready for mining. If it
is not applied, the data would be very inconsistent and would not yield good analytics
results.
Text preprocessing is used to clean up text data: convert words to their roots (in other
words, lemmatize) and filter out unwanted digits, punctuation, and stop words.
Some of the common text preprocessing / cleaning steps are:
Lower casing.
Removal of Punctuations.
Removal of Stop words.
Removal of Frequent words.
Removal of Rare words.
Stemming.
Lemmatization.
Removal of emojis.
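Several of the steps listed above can be chained into a small pipeline. The sketch below covers tokenization, lower casing, stop word removal, and a very crude suffix-stripping stemmer; the stop word list, the suffix list, and the function names are assumptions for illustration, and a real system would use an established stemmer or lemmatizer instead.

```python
import re

# Hypothetical, deliberately short stop word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "to", "it"}
# Suffixes tried in this order by the crude stemmer below.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lowercase, remove stop words, then stem."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # tokenization; drops punctuation
    tokens = [token.lower() for token in tokens]          # lower casing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [crude_stem(token) for token in tokens]        # stemming


result = preprocess("The engines are indexing the crawled pages.")
print(result)
```

The order of the steps matters: lower casing comes before stop word removal so that "The" and "the" are treated alike.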
EVALUATION MEASURES
Evaluation measures for an information retrieval system are used to assess how well the
search results satisfied the user's query intent. The field of information retrieval has used
various types of quantitative metrics for this purpose, based on either observed user behavior
or on scores from prepared benchmark test sets. Besides benchmarking by using this type of
measure, an evaluation for an information retrieval system should also include a validation of
the measures used, i.e. an assessment of how well the measures capture what they are intended
to measure and how well the system fits its intended use case. [1] Metrics are often split into
two types: online metrics look at users' interactions with the search system, while offline
metrics measure theoretical relevance, in other words how likely each result, or the search
engine results page (SERP) as a whole, is to meet the information needs of the user.
Online metrics
Online metrics are generally created from search logs. The metrics are often used to determine
the success of an A/B test.
Session abandonment rate
Session abandonment rate is the ratio of search sessions that do not result in a click to all
search sessions.
Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific link to the number of total
users who view a page, email, or advertisement. It is commonly used to measure the success of
an online advertising campaign for a particular website as well as the effectiveness of email
campaigns.[2]
Session success rate
Session success rate measures the ratio of user sessions that lead to a success. Defining
"success" often depends on context, but for search a successful result is often measured
using dwell time as a primary factor along with secondary user interaction. For instance, the
user copying the result URL is considered a successful result, as is copying and pasting from
the snippet.
Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages (SERPs) that returned
zero results. The metric indicates either a recall issue or that the information being
searched for is not in the index.
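The online metrics above are simple ratios over search logs. The sketch below assumes a hypothetical log with one record per search session (number of results shown and number of clicks) and treats each session as a single results page; the field names and the view and click counts are invented for illustration.

```python
# Hypothetical search log: one record per search session.
sessions = [
    {"results": 10, "clicks": 2},
    {"results": 10, "clicks": 0},
    {"results": 0, "clicks": 0},
    {"results": 5, "clicks": 1},
]


def rate(count, total):
    """Return count / total, or 0.0 when there is nothing to divide by."""
    return count / total if total else 0.0


total_sessions = len(sessions)
# Sessions that did not result in any click.
abandonment_rate = rate(sum(1 for s in sessions if s["clicks"] == 0), total_sessions)
# Sessions whose results page was empty.
zero_result_rate = rate(sum(1 for s in sessions if s["results"] == 0), total_sessions)

# Click-through rate for one specific link: users who clicked / users who viewed.
link_views, link_clicks = 200, 50
click_through_rate = rate(link_clicks, link_views)

print(abandonment_rate, zero_result_rate, click_through_rate)
```

Session success rate is left out because it needs a definition of success, such as a dwell time threshold, that depends on the application.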
Offline metrics
Offline metrics are generally created from relevance judgment sessions where the judges score
the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g.,
relevance from 0 to 5) scales can be used to score each document returned in response to a
query. In practice, queries may be ill-posed, and there may be different shades of relevance.
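Relevance judgments become an offline score once they are aggregated over a ranked result list. The sketch below averages the judgments of the top k results; with binary judgments this average is the measure commonly called precision at k, which is named here as an example and is not taken from these notes. The function name and the sample judgments are hypothetical.

```python
def mean_judgment_at_k(judgments, k):
    """Average the relevance judgments of the first k results in ranked order."""
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0


# Judgments are listed in ranked order, best-ranked result first.
binary_judgments = [1, 0, 1, 1, 0]   # 1 = relevant, 0 = non-relevant
graded_judgments = [5, 2, 0, 4]      # multi-level scale from 0 to 5

precision_at_4 = mean_judgment_at_k(binary_judgments, 4)
mean_grade_at_4 = mean_judgment_at_k(graded_judgments, 4)
print(precision_at_4, mean_grade_at_4)
```

The same function serves both scales, which shows that the choice between binary and multi-level judgments changes the meaning of the score rather than the way it is computed.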
WEB SEARCH
A web search engine is a specialized computer server that searches for data on the
Web. The search results of a user query are returned as a list (known as hits). The hits
can include web pages, images, and different types of files.
Various search engines also search and return data available in public
databases or open directories. Search engines differ from web directories in that web
directories are maintained by human editors, whereas search engines work
algorithmically or by a combination of algorithmic and human input.
Web search engines are large data mining applications. Data mining techniques are
used in all elements of search engines, ranging from crawling (e.g., deciding which
pages must be crawled and the crawling frequencies), to indexing (e.g., selecting pages
to be indexed and determining to which extent the index must be constructed), and
searching (e.g., determining how pages must be ranked, which advertisements must be
added, and how the search results can be customized or made “context aware”).
ANALYTICS
Analytics is the systematic computational analysis of data or statistics. [1] It is used for the
discovery, interpretation, and communication of meaningful patterns in data. It also entails
applying data patterns toward effective decision-making. It can be valuable in areas rich with
recorded information; analytics relies on the simultaneous application of statistics, computer
programming, and operations research to quantify performance.
Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include descriptive analytics, diagnostic
analytics, predictive analytics, prescriptive analytics, and cognitive analytics.[2] Analytics may
apply to a variety of fields such as marketing, management, finance, online systems,
information security, and software services. Since analytics can require extensive computation
(see big data), the algorithms and software used for analytics harness the most current
methods in computer science, statistics, and mathematics.
CURRENT TRENDS IN WEB SEARCH
1. Voice search will become even more relevant
Voice search is already an integral part of our daily lives: we ask Siri where the closest gas
station is or say “Hey Google, which Thai restaurant is the highest rated in my town?” At the
moment, optimizing for these kinds of voice searches is especially recommended for
ecommerce sites or websites whose users are likely to have their hands full. For example, if
you run a recipe blog, you want your users to find out how long to let the dough rest
without having to type on the phone with their potentially dirty hands.
2. Your site search can no longer offer zero results pages
A zero result page for your user means a lost client for you. But what seems like a problem
can be a great opportunity to increase your revenue. Suppose a user searches for Ralph Lauren
winter shoes and you cannot offer them. You can still show them results for
other relevant products such as summer shoes by Ralph Lauren or winter shoes by other
brands.
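One simple way to avoid a zero result page is to relax one query filter at a time when the exact search finds nothing. The sketch below follows the shoe example; the catalogue records, the field names, the placeholder brand, and the function search_with_fallback are hypothetical, and real site search products may use quite different strategies.

```python
# Hypothetical product catalogue; "Other Brand" is a placeholder name.
catalogue = [
    {"brand": "Ralph Lauren", "season": "summer", "type": "shoes"},
    {"brand": "Other Brand", "season": "winter", "type": "shoes"},
    {"brand": "Ralph Lauren", "season": "winter", "type": "coat"},
]


def matches(item, filters):
    """Return True if the item has the requested value for every filter."""
    return all(item.get(key) == value for key, value in filters.items())


def search_with_fallback(items, filters, relaxable):
    """Return exact matches, or related items found by dropping one filter at a time."""
    exact = [item for item in items if matches(item, filters)]
    if exact:
        return exact, "exact"
    related = []
    for dropped in relaxable:
        relaxed = {k: v for k, v in filters.items() if k != dropped}
        for item in items:
            if matches(item, relaxed) and item not in related:
                related.append(item)
    return related, "related"


query = {"brand": "Ralph Lauren", "season": "winter", "type": "shoes"}
results, kind = search_with_fallback(catalogue, query, relaxable=["season", "brand"])
print(kind, results)
```

The product type is never relaxed in this sketch, so the winter coat is not offered to someone looking for shoes.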
3. Search will become more personalized than ever
With personalization, you can offer relevant results for each user based on their preferences
and prior search behavior. For example, an HR person might have already
downloaded a PDF targeted towards HR managers from the website. Based on this behavior,
they would be assessed as a B2B user and can get more B2B-oriented results in their search.
4. Site search will feel less like search and more intuitive
A good site search is the one you do not even think about as a user. You use it so intuitively
that you don’t need to assess what you are doing – you just do it. In 2022, site search will
look even less like classical search.