
Motivation, Basic Concepts, The Retrieval Process, Information System


Module I

Introduction
CHAPTER 1

Syllabus
Motivation, Basic Concepts, The Retrieval Process, Information System:
Components, parts and types of information system; Definition and
objectives of information retrieval system, Information versus Data
Retrieval. Search Engines and browsers.
Self-learning Topics: Search Engines, Search API

1.1 INTRODUCTION

GQ. Define information retrieval.


Information is data that has been structured for easier understanding or
interpretation. Information retrieval is the process of getting data that has
been processed, is not in its raw form, and satisfies your specific
requirements.
As an example of an information retrieval system, web search engines are
used to find relevant documents or web pages.
Information Retrieval (IR) can be defined as a system that deals with the
organization, storage, retrieval, evaluation and access of information.
An information retrieval (IR) system is an information system, that is, a
system used to store items of information that need to be processed,
searched, retrieved, and distributed to various users.
(1) The organization and representation of the information should be
helpful for the users to provide easy access to information and
satisfy their information need.
(2) The user summarizes his information need in the form of a query using
a set of keywords or index terms.
(3) The IR system processes this query and returns useful or relevant
information to the user.
(4) The main purpose of an IR system is to provide a user easy access to
documents containing the desired information.
(5) An Information Retrieval System is a system capable of storing,
maintaining and retrieving information. This information may be in any
form, i.e. audio, video, text.
(6) An Information Retrieval System mainly focuses on electronic searching
and retrieving of documents.
(7) The first information retrieval systems originated with the need to
organize information in central repositories, e.g. libraries.

1.2 OBJECTIVES OF INFORMATION RETRIEVAL SYSTEM

GQ. Discuss the objectives of information retrieval systems.
GQ. List and explain components of IR block diagram.
(1) The objective of an information retrieval system is to enable users to
find relevant information from an organized collection of documents in
response to the user query.
(2) To provide information to the user in the least time and with the least effort.
(3) To act as a facilitator between information and the user.
(4) To provide non-ambiguous search results through proper indexing.
(5) User friendliness.
Fig. 1.2.1 shows the basics of information retrieval.
[Figure: a Query and a Document Collection are the inputs to the IR System, which returns a Set of relevant documents.]
Fig. 1.2.1 : Basics of Information Retrieval
Information Retrieval System (MU-Sem.7-IT) (Introduction)

(I) Growth of Information Retrieval

GQ. Discuss growth of information retrieval.
GQ. List out reasons behind success of web.
GQ. Explain retrieval from web with the help of diagram.
(1) Initially, the primary goal of information retrieval was indexing and
searching for useful documents in a collection.
(2) Information retrieval has grown at a very large scale in the past 20 years
because of the rapid growth of the World Wide Web (WWW).
(3) The Web is growing at a very fast pace and becoming a universal repository of
human knowledge and culture.
(4) The reasons behind the success of the web are :
(i) A standard user interface hiding all implementation details from the user.
(ii) Any user can create his own Web documents and make them point
to any other Web documents without restrictions, which in turn
makes the web a new publishing medium accessible to everyone.
(iii) Applications like online shopping and internet banking make
users' lives easy and also generate revenue.
(5) Due to its growing size and success, the Web has introduced a new
problem. Searching for useful information on the web is a difficult and time
consuming task.
(6) The absence of a well-defined underlying data model for the Web makes
the navigation task difficult.
(7) These difficulties have attracted new interest in IR and have created a
place for IR at the center of the stage. Fig. 1.2.2 shows information retrieval
from the Web.

[Figure: a Query and Web pages are the inputs to the IR System, which returns a Set of relevant web pages.]
Fig. 1.2.2 : Information Retrieval from Web

(New Syll, w.e.f academic year 22-23) (M7-87) Tech-Neo Publications


(II) Features of an information retrieval system
GQ. List out features of IRs.
An effective information retrieval system must have provisions for :
(1) Prompt dissemination of information
(2) Filtering of information
(3) The right amount of information at the right time
(4) Active switching of information
(5) Receiving information in an economical way
(6) Browsing
(7) Getting information in an economical way
(8) Current literature
(9) Access to other information systems
(10) Interpersonal communications, and
(11) Personalized help.
(III) Basic Concepts
The user task and the logical view of the documents are two important
factors that have a direct impact on the effective retrieval of relevant
information.

The User Task

GQ. Explain user interaction with IR with the help of a diagram.

A retrieval system's user must convert his information requirement into a
query in the system's given language.
When using an information retrieval system, the user provides a set of
keywords that accurately express the semantics of the information
requirement. This task is called retrieval.
Now imagine a user who has a poorly defined or naturally broad area of
interest and is looking for information by using an interactive interface to
simply look around in the collection for documents related to his
requirement.
While doing this, the user might find many documents related to different
topics one after another, and he continues to browse the documents in the
collection rather than searching for information on a specific topic. This
task is called browsing.
[Figure: the User interacts with the Database through two tasks, Retrieval and Browsing.]
Fig. 1.2.3 : User interaction with Retrieval System using different tasks


Logical View of Document

GQ. Explain logical view of a document.


A set of index terms or keywords is typically used to describe the
documents in a collection. Such keywords may be chosen by a human
subject or may be directly extracted from the document's text.
These representative keywords offer a logical picture of the document,
whether they are generated manually or automatically.
It is now possible to represent a document using its entire word-set on
modern computers. We refer to this situation as a full text logical view
(or representation) of the documents by the retrieval system.
Even modern computers, nevertheless, might need to narrow the range of
representative terms in very big collections.
Eliminating stopwords (such as articles and connectives), using stemming
to reduce different words to their grammatical roots, and identifying
noun groups (which eliminates adjectives, adverbs, and verbs) are all
ways to achieve this. Additionally, compression might be used.
These operations are referred to as text operations (or transformations).
Text operations make document representation less complicated and
enable changing the logical view from that of a full text to that of a set of
index terms.
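The text operations described above can be sketched in a few lines of Python. This is only an illustration: the stopword list and the `crude_stem` suffix stripper are hypothetical stand-ins (a real system would use a full stopword list and a proper stemming algorithm), and noun-group detection and compression are left out.

```python
import re

# Hypothetical, deliberately tiny stopword list (articles and connectives).
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "is", "are"}

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Move from the full text logical view to a set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())          # normalise case, drop punctuation
    kept = [t for t in tokens if t not in STOPWORDS]      # eliminate stopwords
    return sorted({crude_stem(t) for t in kept})          # stem and de-duplicate

full_text = "The crawler is indexing the documents and the indexed collection"
terms = index_terms(full_text)
```

Here ten words of full text reduce to four index terms, which is the narrowing of the logical view that the paragraph describes.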
[Figure: a document (text + structure) passes through structure recognition and the text operations (accents and spacing, stopwords, noun groups, stemming, automatic or manual indexing), moving from the full text to a set of index terms.]
Fig. 1.2.4 : Logical view of a document : from full text to a set of index terms

(IV) The Retrieval Process

GQ. Illustrate the concepts of IRS with architecture view.

The following Fig. 1.2.5 shows the Information Retrieval Process.

[Figure: block diagram of the retrieval process; recoverable labels include User, Visual interface, Text operations, Query operations, Searching, Ranking, Indexing and Index.]
Fig. 1.2.5 : Information Retrieval Process

1.3 INFORMATION RETRIEVAL PROCESS


GQ. Illustrate the concepts of IRS with architecture view.
(1) First of all, the text database must be defined before the retrieval process
can even be started. The database manager typically handles this and
specifies the following :
(i) The documents to be used,
(ii) The text operations that will be carried out, and
(iii) The text model (i.e., the text structure and what elements can
be retrieved).
(2) The text operations alter the source documents and produce a logical
view of them. The database manager creates a text index after defining
the logical view of the documents using the DB Manager Module.
(3) The retrieval process can start now that the document database is
indexed. Once the user specifies an information requirement, the same text
operations are used to parse and alter the text.
(4) Then, before generating the actual query that gives a system
representation for the user requirement, query operations are applied.
(5) The query is then processed to obtain the retrieved documents. The
retrieved documents are ranked according to a likelihood of relevance
before sending them to the user.
(6) The user then looks over the collection of ranked documents in an effort
to find relevant information.
(7) At this stage, he might identify a subset of the documents considered to
be unquestionably interesting and start a user feedback cycle.
(8) In such a cycle, the system modifies the query formulation based on the
documents the user has chosen. Hopefully, the modified query
better reflects the actual user requirement.
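The steps above can be sketched as a small Python program. It is a simplified illustration under stated assumptions: the text operation is just lowercasing and splitting, the ranking score is a plain count of query-term occurrences, and the `feedback` function (a hypothetical name) modifies the query by adding the most frequent new terms from the documents the user selected. None of these choices is prescribed by the text.

```python
from collections import Counter

def terms(text):
    """Text operation shared by documents and queries (lowercase + split)."""
    return text.lower().split()

def build_index(docs):
    """Logical view of the collection: one bag of terms per document id."""
    return {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}

def rank(index, query_terms):
    """Score = number of query-term occurrences; highest score first."""
    scores = {d: sum(bag[t] for t in query_terms) for d, bag in index.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))

def feedback(index, query_terms, chosen_docs, extra=2):
    """Feedback cycle: extend the query with frequent terms of the chosen documents."""
    pool = Counter()
    for d in chosen_docs:
        pool.update(index[d])
    new = [t for t, _ in pool.most_common() if t not in query_terms][:extra]
    return list(query_terms) + new

docs = {
    "d1": "jaguar speed car engine",
    "d2": "jaguar cat habitat jungle",
    "d3": "jungle cat habitat",
}
index = build_index(docs)                 # steps (1)-(2): define and index the database
query = terms("jaguar")                   # steps (3)-(4): parse the user requirement
first = rank(index, query)                # step (5): retrieve and rank
query2 = feedback(index, query, ["d2"])   # steps (7)-(8): user marks d2 as interesting
second = rank(index, query2)
```

After feedback the modified query also retrieves d3, which the original one-word query missed, and pushes the car document d1 to the bottom of the ranking.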

1.3.1 Requirements for Information Retrieval

GQ. List out major requirements of IR.


(1) An automated or manually operated indexing system, together with the
techniques and procedures used to index and search.
(2) A collection of documents in any one of the following formats: text,
image or multimedia.
(3) A set of queries that serve as the input to a system, via a human or
machine.
(4) An evaluation metric to measure or evaluate a system's effectiveness.

1.3.2 Three Major Components of Traditional IRS

GQ. Discuss major components of IRS.
(1) Document subsystem
(a) Acquisition
(b) Representation
(c) File organization
(2) User sub system
(a) Problem
(b) Representation
(c) Query
(3) Searching /Retrieval subsystem
(a) Matching
(b) Retrieved objects
Components of Information Retrieval / IR Model
Acquisition
In this step, the selection of documents and other objects from various
web resources that consist of text-based documents takes place.
The required data is collected by web crawlers and stored in the
database.

Representation
It consists of indexing that contains free-text terms, controlled
vocabulary, and manual as well as automatic techniques.
Example: Abstracting, which includes summarizing, and bibliographic
description, which contains author, title, sources, data, and metadata.
File Organization
The basic file organization methods are :
(1) Sequential : It contains documents by document data.
(2) Inverted : It contains term by term, list of records under each term.
(3) Combination of both.
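The difference between the sequential and the inverted organization can be made concrete with a short Python sketch. The three records and the function names are illustrative assumptions, not part of any particular system.

```python
from collections import defaultdict

# Sequential file: records stored document by document.
sequential_file = [
    ("d1", "information retrieval system"),
    ("d2", "database retrieval"),
    ("d3", "information system"),
]

def sequential_search(term):
    """Scan every record in order and keep those containing the term."""
    return [doc_id for doc_id, text in sequential_file if term in text.split()]

def build_inverted_file(records):
    """Inverted file: term by term, the list of records under each term."""
    inverted = defaultdict(list)
    for doc_id, text in records:
        for term in sorted(set(text.split())):
            inverted[term].append(doc_id)
    return dict(inverted)

inverted_file = build_inverted_file(sequential_file)
```

Both organizations give the same answer for a term, but the sequential file must read every record, while the inverted file answers with a single lookup such as `inverted_file["retrieval"]`.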


Query

An IR process starts when a user enters a query into the system.


Queries are formal statements of information needs, for example, search
strings in web search engines.
In information retrieval, a query does not uniquely identify a single
object in the collection.
Instead, several objects may match the query, perhaps with different
degrees of relevancy.
(1) Information System : Components and Types
GQ. What is information system? Discuss its components.
Information systems are collections of multiple information
resources (e.g., software, hardware, computer system connections,
the system housing, system users, and computer system
information) to gather, process, store, and disseminate information.
It is a group of data sets that ensures that the business operates
smoothly, embracing change, and helping companies achieve their
goals.
Tools such as laptops, databases, networks, and smartphones are
examples of information systems.
Information systems consist of members that gather, store, and
process data, with the data being utilized to give information, add
to knowledge and create digital products that aid decision-making.
IT has sets of technological methods and techniques used to store,
organize, manage, and retrieve information digitally.
(2) Components of Information Systems
(i) People resources : (end users and IS specialists, system analysts,
programmers, data administrators etc.).
(ii) Hardware : (physical computer equipment and associated devices,
machines and media).
(iii) Software : (programs and procedures).
(iv) Data : (data and knowledge bases).
(v) Networks : (communications media and network support).
[Figure: People Resources (end users and IS specialists), Software Resources (programs and procedures), Hardware Resources, Network Resources and Data Resources (data and knowledge bases) surround the system activities: input of data resources, processing of data into information, output of information products, storage of data resources, and control of system performance.]
Fig. 1.3.1 : Components of Information System


Types of Information System

GQ. Explain various types of Information Systems.

(1) Operations support system
(2) Transaction Processing System (TPS)
(3) Management Information System (MIS)
(4) Decision Support System (DSS)
(5) Expert System
(6) Office Automation System (OAS)
(1) Operations support system
In an organization, data input is done by the end user and is processed
to generate information products, i.e. reports, which are utilized by
internal and/or external users. Such a system is called an operations
support system.
The purpose of the operations support system is to facilitate business
transactions, control production, support internal as well as external
communication, and update the organization's central database.
The operations support system is further divided into :
(i) Transaction-processing system,
(ii) Processing control system and
(iii) Enterprise collaboration system.
(2) Transaction Processing System (TPS)
In a manufacturing organization, there are several types of transactions
across departments. Typical organizational departments are Sales,
Account, Finance, Plant, Engineering, Human Resource and Marketing,
across which the following transactions may occur: sales order, sales return,
cash receipts, credit sales, credit slips, material accounting, inventory
management, depreciation accounting, etc.
These transactions can be categorized into
(i) Batch transaction processing,
(ii) Single transaction processing and
(iii) Real time transaction processing.
(3) Management Information System (MIS)
Management Information System is designed to take relatively raw data
available through a Transaction Processing System and convert them
into a summarized and aggregated form for the manager, usually in a
report format.
Its reports tend to be used by middle management and operational
supervisors.
Many different types of report are produced in MIS. Some of the reports
are a summary report, on-demand report, ad-hoc report and
exception report.
Example : Sales management systems, Human resource management
system.
(4) Decision Support System (DSS)
Decision Support System is an interactive information system that
provides information, models and data manipulation tools to help in
making decisions in semi-structured and unstructured situations.
Decision Support System comprises tools and techniques to help in
gathering relevant information and analyzing the options and alternatives.
The end user is more involved in creating a DSS than an MIS.
Example : Financial planning systems, Bank loan management systems.
(5) Expert System
Expert systems include expertise in order to aid managers in diagnosing
problems or in problem-solving. These systems are based on the
principles of artificial intelligence research.
An expert system is a knowledge-based information system. It uses its
knowledge about a specific area to act as an expert consultant to users.
The knowledge base and software modules are the components of an expert
system. These modules perform inference on the knowledge and offer
answers to a user's question.
(6) Office Automation System (OAS)
OAS consists of computers, communication-related technology, and the
personnel assigned to perform the official tasks.
The OAS covers office transactions and supports official activity at
every level in the organization.
The official activities are subdivided into managerial and clerical
activities.

1.3.3 Difference between Information Retrieval and Data Retrieval

GQ. Explain Information versus Data Retrieval in detail.
Sr. No. | Information Retrieval | Data Retrieval
(1) | The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Data retrieval deals with obtaining data from a database management system such as ODBMS. It is a process of identifying and retrieving the data from the database, based on the query provided by the user or application.
(2) | Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
(3) | Small errors are likely to go unnoticed. | A single error object means total failure.
(4) | Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
(5) | Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
(6) | The results obtained are approximate matches. | The results obtained are exact matches.
(7) | Results are ordered by relevance. | Results are not ordered by relevance.
(8) | It is a probabilistic model. | It is a deterministic model.
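Rows (6) and (7) of the comparison can be illustrated with a minimal Python sketch. The records, the exact-equality condition standing in for a database query, and the word-overlap score standing in for a relevance estimate are all assumptions chosen for brevity.

```python
records = [
    {"id": 1, "title": "introduction to information retrieval"},
    {"id": 2, "title": "database management systems"},
    {"id": 3, "title": "web information systems"},
]

def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [r["id"] for r in rows if r[field] == value]

def information_retrieval(rows, query):
    """Approximate match: score by share of query words present, rank by score."""
    words = set(query.lower().split())
    scored = []
    for r in rows:
        overlap = len(words & set(r["title"].split()))
        if overlap:
            scored.append((overlap / len(words), r["id"]))
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ordered]

exact = data_retrieval(records, "title", "information systems")
ranked = information_retrieval(records, "information systems")
```

The exact-match lookup returns nothing because no title equals the query string, whereas the ranked search returns all three records with the best partial match first.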

1.4 SEARCH ENGINES AND WEB BROWSERS

GQ. Differentiate between Search Engines and browsers.
GQ. What is search engine?
(1) Searching for information on the Web is, for most people, a daily
activity. Search and communication are by far the most popular uses of
the computer.
(2) A search engine is the practical application of information retrieval
techniques to large-scale text collections.
(3) A web search engine is used to find information on the World Wide
Web and returns web pages that are accessible online, displaying the
results at one place. Users reach a search engine through a web browser,
which retrieves and displays the web pages stored on web servers.
(4) A search engine's primary purpose is to collect and maintain information
about several URLs, whereas web browsers are designed to display the
website at the server's current URL.
(5) A web browser uses a graphical user interface to help users have an
interactive online session on the Internet.

1.4.1 Search Engine Working

GQ. Explain working of search engine.
The search indexer, crawler, and database are the three essential
components of a search engine.
Boolean operators AND, OR, and NOT are used by search engines to
limit and widen the results of a search.
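How the Boolean operators limit or widen a result set can be shown with Python sets. The posting sets below (term mapped to the ids of pages containing it) and the helper names are made up for illustration.

```python
# Hypothetical postings: term -> set of page ids that contain the term.
postings = {
    "search": {1, 2, 4},
    "engine": {2, 3, 4},
    "crawler": {3, 4},
}
all_pages = {1, 2, 3, 4}

def op_and(a, b):
    """AND limits the results to pages containing both terms."""
    return a & b

def op_or(a, b):
    """OR widens the results to pages containing either term."""
    return a | b

def op_not(a):
    """NOT keeps only pages that do not contain the term."""
    return all_pages - a

narrow = op_and(postings["search"], postings["engine"])          # search AND engine
wide = op_or(postings["search"], postings["crawler"])            # search OR crawler
excluded = op_and(postings["search"], op_not(postings["crawler"]))  # search AND NOT crawler
```

AND and NOT shrink the result set, while OR can only grow it, which is what "limit and widen" refers to.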

The steps that the search engine takes are as follows :

(1) Instead of searching for the phrase directly on the web, the search
engine first looks for it in the index for predefined databases.
(2) The information is then found using software to search the
database. This software component is known as web crawler.
(3) Once the web crawler finds the pages, the search engine then shows the
relevant web pages as a result. These retrieved web pages often
include the page title, the amount of material on the page, the first
few phrases, etc.
(4) The user can click on any of the search results to open it.
1.4.2 Building Blocks of Search Engine
The major functions of search engine components are the indexing process
and the query process.
GQ. Discuss indexing process with the help of diagram.
(1) Indexing Process
The indexing process builds the structures that enable searching,
and the query process uses those structures and a person's query to
produce a ranked list of documents. Major components of the indexing
process are shown in Fig. 1.4.1.
The text acquisition component is used to identify and make available
the documents that will be searched. Text acquisition will require
building a collection by crawling or scanning the Web, a corporate
intranet, a desktop, or other sources of information.

[Figure: Email, web pages, news articles, memos, letters -> Text Acquisition -> Document data store; Text Acquisition -> Text Transformation -> Index Creation -> Index.]
Fig. 1.4.1 : Components of Indexing Process


The text acquisition component also creates a document data store,
which contains the text and metadata for all the documents.
The text transformation component transforms documents into index
terms or features.
The index creation component takes the output of the text
transformation component and creates the indexes or data structures
that enable fast searching.
(2) Query Process
GQ. Explain query process with the help of a diagram.
The query process uses structures created by the indexing process and a
person's query to produce a ranked list of documents.
Fig. 1.4.2 shows the building blocks of the query process.

[Figure: blocks labelled User Interaction, Ranking, Evaluation, Document data store, Index, and Log data.]
Fig. 1.4.2 : Components of Query Process

The major components are user interaction, ranking, and evaluation.
The user interaction component provides the interface between the
person doing the searching and the search engine. It accepts the
user's query and transforms the query into index terms. It also takes the
ranked list of documents from the search engine and organizes it
into the results shown to the user.
The document data store is one of the sources of information used in
generating the results.
The ranking component takes the transformed query from the user
interaction component and generates a ranked list of documents
using scores based on a retrieval model.
The efficiency of ranking depends on the indexes, and the
effectiveness depends on the retrieval model.
Evaluation component measures and monitors effectiveness and
efficiency. It also records and analyzes user behavior using log data.
The results of evaluation are used to tune and improve the ranking
component.
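A minimal Python sketch of the three query-process components follows. The small index, the term-frequency scoring used as the retrieval model, and the click-based measure in `mean_click_rank` are assumptions made for illustration; real engines use richer retrieval models and evaluation measures.

```python
from collections import Counter

# Index produced by the indexing process: term -> {doc id: term frequency}.
index = {
    "web": {"d1": 2, "d2": 1},
    "search": {"d1": 1, "d3": 3},
    "engine": {"d3": 1},
}
log = []  # log data: (query terms, ranked list, clicked document)

def user_interaction(raw_query):
    """Accept the user's query and transform it into index terms."""
    return [t for t in raw_query.lower().split() if t in index]

def ranking(query_terms):
    """Retrieval model assumed here: score = sum of term frequencies."""
    scores = Counter()
    for t in query_terms:
        for doc_id, tf in index[t].items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))

def record_click(query_terms, ranked, clicked):
    """Store user behaviour as log data."""
    log.append((tuple(query_terms), tuple(ranked), clicked))

def mean_click_rank():
    """Evaluation: average 1-based position of the clicked results in the log."""
    ranks = [ranked.index(clicked) + 1 for _, ranked, clicked in log]
    return sum(ranks) / len(ranks)

q = user_interaction("Web Search tips")
results = ranking(q)
record_click(q, results, "d1")
```

A mean click rank close to 1 suggests users find what they want at the top of the list; a larger value is the kind of signal that could be used to tune the ranking component.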
Short Questions and Answers
Q. 1 Define information retrieval.
Ans. :

Information Retrieval is finding material of an unstructured nature that
satisfies an information need from within large collections.
Q. 2 List and explain components of IR block diagram.
Ans. :
(1) Input - Store only a representation of the document.
(2) A document representative - Could be a list of extracted words considered
to be significant.
(3) Processor - Involve in performance of actual
(4) Feedback - Improve retrieval function
(5) Output - A set of document numbers.

Q. 3 Explain the type of natural language technology used in information
retrieval.
Ans. :

Two types
(1) Natural language interface makes the task of communicating with the
information source easier, allowing a system to respond to a range of
inputs.
(2) Natural Language text processing allows a system to scan the source
texts, either to retrieve particular information or to derive knowledge
structures that may be used in accessing information from the texts.

Q. 4 What is search engine?

Ans. :
A search engine is a document retrieval system designed to help find
information stored in a computer system, such as on the WWW. The search
engine allows one to ask for content meeting specific criteria and retrieves a
list of items that match those criteria.

Q.5 What are the applications of IR?


Ans.:
(1) Indexing (2) Ranked retrieval
(3) Web search (4) Query processing
Q. 6 How is AI applied in IR systems?
Ans. :
Four main roles investigated
(1) Information characterization
(2) Search formulation in information seeking
(3) System Integration
(4) Support functions
Q. 7 Give the functions of information retrieval system.
Ans. :

(1) To identify the information (sources) relevant to the areas of interest of
the target users community.
(2) To analyze the contents of the sources (documents).
(3) To represent the contents of the analyzed sources in a way that will be
suitable for matching user's queries.
(4) To analyze user's queries and to represent them in a form that will be
suitable for matching with the database.
(5) To match the search statement with the stored database.
(6) To retrieve the information that is relevant.
(7) To make necessary adjustments in the system based on feedback from
the users.

Q. 8 List the issues in information retrieval system.

Ans. :
(1) Assisting the user in clarifying and analyzing the problem and
determining information needs.
(2) Knowing how people use and process information.
(3) Assembling a package of information that enables the user to
come closer to a solution of his problem.
(4) Knowledge representation.
(5) Procedures for processing knowledge/information.
(6) The human-computer interface.
(7) Designing integrated workbench systems
(8) Designing user-enhanced information systems.
(9) System evaluation.

Q.9 Define relevance.


Ans. :
Relevance appears to be a subjective quality, unique between the
individual and a given document supporting the assumption that
relevance can only be judged by the information user.
Subjectivity and fluidity make it difficult to use as measuring tool for
system performance.
Q. 10 Define indexing and document indexing.
Ans. :
Indexing is the association of descriptors (keywords, concepts, metadata) to
documents in view of future retrieval. Document indexing is the process of
associating or tagging documents with different "search" terms.
Assign to each document (respectively query) a descriptor represented
with a set of features, usually weighted keywords, derived from the
document (respectively query) content.
Q. 11 Discuss the impact of IR on the web.
Ans. :

The impact of information retrieval on the web is seen in the
following areas.
(1) Web Document Collection
(2) Search Engine Optimization
(3) Variants of Keyword Stuffing
(4) DNS cloaking: Switch IP address
(5) Size of the Web
(6) Sampling URLs
(7) Random Queries and Searches
Q. 12 Define web search and web search engine.
Ans.:
Web search is often not informational -- it might be navigational (give
me the url of the site I want to reach) or transactional (show me sites
where I can perform a certain transaction, e.g. shop, download a file, or
find a map).
Web search engines crawl the Web, downloading and indexing pages in
order to allow full-text search.
There are many general purpose search engines; unfortunately, none of
them come close to indexing the entire Web.
There are also thousands of specialized search services that index
specific content or specific sites.
Q. 13 What are the components of search engine?
Ans. : Generally, there are three basic components of a search engine as
listed below :
(1) Web Crawler
(2) Database
(3) Search Interfaces

Q. 14 What are search engine processes?

Ans. :

Indexing Process
(1) Text acquisition
(2) Text transformation
(3) Index creation
Query Process
(1) User interaction
(2) Ranking
(3) Evaluation
Q. 15 What are the challenges of web?
Ans. :
(1) Distributed data
(2) Volatile data
(3) Large volume
(4) Unstructured and redundant data
(5) Data quality
(6) Heterogeneous data

Chapter Ends...
MODULE I
IR Models
CHAPTER 2

Syllabus
Modeling: Taxonomy of Information Retrieval Models, Retrieval: Formal
Characteristics of IR models, Classic Information Retrieval, Alternative
Set Theoretic models, Probabilistic Models, Structured text retrieval
Models, models for Browsing;
Self-learning Topics : Terrier

2.1 INTRODUCTION

GQ. What do you mean by information retrieval models?


Information retrieval's (IR) objective is to give users the documents they
need to satisfy their informational needs.
We use the term "document" in a broad sense to refer to both textual and
non-textual information, including multimedia items.
Index terms are typically used by traditional information retrieval
systems to index and retrieve documents. An index term is a keyword (or
combination of related terms) with a distinct meaning, usually a noun.
The semantics of the documents and of the user information need can be
naturally expressed through sets of index terms.
This method is simple to implement, but retrieved documents are often
irrelevant because a lot of semantics are lost when we replace a
document's text with a set of words.
The main problem in information retrieval is judging which documents are
relevant and which are not.
Information retrieval systems use ranking algorithms to determine which
documents are relevant and which are not.
The predictions of what is relevant and what is not are based on the
accepted IR model.

2.2 A TAXONOMY OF INFORMATION RETRIEVAL MODELS

GQ. What are the three classic models in information retrieval system?
GQ. Explain the taxonomy of information retrieval with a classification
diagram.
The three classic models in information retrieval :
(1) Boolean : Documents and queries are represented as sets of index terms
in the Boolean model. As a result, we describe the model as set theoretic.
(2) Vector : Documents and queries are represented as vectors in the vector
model in a t-dimensional space. As a result, we define the model as
algebraic.
(3) Probabilistic : The framework for modeling document and query
representations in the probabilistic model is based on probability theory.
As a result, we refer to the model as probabilistic, as its name suggests.
For each sort of traditional model (i.e., set-theoretic, algebraic, and
probabilistic), alternative modeling paradigms have been put out over the
years.
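The set-theoretic and algebraic representations can be contrasted in a short Python sketch. The four-term vocabulary, the raw term counts used as vector components, and the cosine measure used to compare vectors are assumptions for illustration (cosine similarity is a common choice, but the text above does not fix a particular measure). The probabilistic model is not sketched here.

```python
import math

vocabulary = ["information", "retrieval", "database", "web"]  # t = 4 index terms

def as_set(text):
    """Boolean model view: the set of index terms present in the text."""
    return set(text.lower().split()) & set(vocabulary)

def as_vector(text):
    """Vector model view: one component per index term (raw counts, an assumption)."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def boolean_match(doc, query):
    """Set theoretic: the document matches only if it contains every query term."""
    return as_set(query) <= as_set(doc)

def cosine(u, v):
    """Algebraic: similarity as the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = "web information retrieval"
doc_b = "database information"
query = "information retrieval"
sim_a = cosine(as_vector(doc_a), as_vector(query))
sim_b = cosine(as_vector(doc_b), as_vector(query))
```

The Boolean view gives a yes or no answer (doc_a matches, doc_b does not), while the vector view gives both documents a degree of similarity, so doc_b can still be ranked below doc_a instead of being rejected outright.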

We make a distinction between the fuzzy and extended Boolean models
when it comes to alternative set-theoretic models.
We differentiate the generalized vector, latent semantic indexing, and
neural network models as alternative algebraic models.
We distinguish between the inference network and belief network
models when referring to alternative probabilistic models. A taxonomy
of these information retrieval models is shown in Fig. 2.2.1.
We distinguish between the non-overlapping lists model and the
proximal nodes model for structured text retrieval.

(New Syll. w.e.f academic year 22-23) (M7-87) Tech-Neo Publications
[Figure: classification diagram. The user task "Retrieval" (ad hoc,
filtering) leads to the classic models (Boolean, vector, probabilistic) and
their alternatives: set theoretic (fuzzy, extended Boolean), algebraic
(generalized vector, latent semantic indexing, neural networks), and
probabilistic (inference network, belief network). It also leads to the
structured models (non-overlapping lists, proximal nodes). The user task
"Browsing" leads to flat, structure guided, and hypertext browsing.]
Fig. 2.2.1 : A taxonomy of Information Retrieval Models
As discussed in Chapter 1, the logical view of the documents (whole text,
collection of index words, etc.), the IR model (Boolean, vector,
probabilistic, etc.), and the user tasks (retrieval, browsing) are orthogonal
features of a retrieval system.
Thus, even though some models are better suited for one user task than
another, the same IR model can be utilized with various document
logical views to carry out various user tasks as shown in Fig. 2.2.2.
                        Logical view of documents
USER TASK    Index terms       Full text         Full text + Structure
Retrieval    Classic           Classic           Structured
             Set theoretic     Set theoretic
             Algebraic         Algebraic
             Probabilistic     Probabilistic
Browsing     Flat              Flat              Structure guided
                               Hypertext         Hypertext
Fig. 2.2.2 : Retrieval models most frequently associated with distinct
combinations of a document logical view and a user task
2.3 RETRIEVAL : AD HOC AND FILTERING

GQ. Define Ad hoc retrieval and Filtering.

Ad hoc retrieval : In a traditional information retrieval system, the
collection of documents remains largely static while new queries are
submitted to the system.
Filtering : Queries remain relatively static while new documents enter
(and leave) the system. In filtering, a user profile is created according to
the user's preferences.
The incoming documents are then compared to this profile in an effort to
identify any that might be of interest to this specific user.
This method can be used, for instance, to choose news articles from
among the many that are broadcast each day.
Ranking of the filtered documents is not provided.
A set of keywords is used to create the user profile.

2.4 A FORMAL CHARACTERIZATION OF IR MODELS

GQ. Illustrate the formal characterization of an IR model.

The formal characterization of an IR model is as follows :

Definition : An information retrieval model is a quadruple
$[D, Q, F, R(q_i, d_j)]$ where
D is a set composed of logical views (or representations) for the
documents in the collection.
Q is a set composed of logical views (or representations) for the user
information needs. Such representations are called queries.
F is a framework for modeling document representations, queries, and
their relationships.
$R(q_i, d_j)$ is a ranking function which associates a real number with a
query $q_i \in Q$ and a document representation $d_j \in D$. Such ranking
defines an ordering among the documents with regard to the query $q_i$.
2.5 CLASSIC INFORMATION RETRIEVAL

GQ. Explain Classic Information Retrieval.
In this section, we briefly present the three classic models in information
retrieval, namely the Boolean, the vector, and the probabilistic models.
Basic Concepts
Each document is described by a group of representative keywords
called index terms.
An index term is simply a word from the document whose semantics
helps in remembering the document's main themes.
Index terms are used to index and summarise the contents of the
document. Nouns are preferred as index terms.
When used to describe the contents of a document, various index terms
have differing degrees of importance.
Each index term in a document is given a numerical weight in order to
represent this effect.
Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \ge 0$ be a weight
associated with the pair $(k_i, d_j)$. This weight quantifies the importance of
the index term for describing the document semantic contents.
Definition : Let t be the number of index terms in the system and $k_i$ be a
generic index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight
$w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an
index term which does not appear in the document text, $w_{i,j} = 0$. With the
document $d_j$ is associated an index term vector $\vec{d_j}$ represented by
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function that returns the weight
associated with the index term $k_i$ in any t-dimensional vector
(i.e., $g_i(\vec{d_j}) = w_{i,j}$).
2.5.1 Boolean Model
GQ. What is the basis for the Boolean model?
GQ. What are the advantages and disadvantages of the Boolean model?
The Boolean model is a simple retrieval model based on set theory and
Boolean algebra. Retrieval is based on whether or not the documents
contain the query terms.
The Boolean model is interested only in the presence or absence of a
term in the document.
In exact match retrieval, a query specifies precise criteria. Each document
either matches or fails to match the query. The result of an exact match
is a set of documents (without ranking).
In best match retrieval, a query describes good or best matching documents.
In this case, the result is a ranked list of documents. The Boolean model
discussed here is the most common exact match model.

Basic Assumptions of the Boolean Model

An index term is either present (1) or absent (0) in the document.
All index terms provide equal evidence with respect to information
needs.
Queries are Boolean combinations of index terms.
Each query term specifies the set of documents containing the term.
AND (∧): the intersection of two sets
OR (∨): the union of two sets
NOT (¬): set inverse, or really set difference
X AND Y: represents the documents that contain both X and Y
X OR Y: represents the documents that contain either X or Y
NOT X: represents the documents that do not contain X
GQ. Explain Boolean model with example.

The Boolean Model Example

Consider the terms: K1, K2, K3, ..., K8.
6 documents containing different terms:
D1 = {K1, K2, K3, K4, K5}
D2 = {K1, K2, K3, K4}
D3 = {K2, K4, K6, K8}
D4 = {K1, K3, K5, K7}
D5 = {K4, K5, K6, K7, K8}
D6 = {K1, K2, K3, K4}
Query : K1 ∧ (K2 ∨ ¬K3), i.e., documents containing K1 and (K2 or
(not K3))
Answer:
{D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
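The example can be checked with ordinary set operations. The following is a minimal sketch, not a full Boolean retrieval engine; the helper name `containing` is an assumption made for illustration, and NOT is evaluated as the complement relative to the whole collection.

```python
# Documents from the example, each represented as a set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}


def containing(term):
    """Return the names of all documents whose term set contains `term`."""
    return {name for name, terms in docs.items() if term in terms}


all_docs = set(docs)

# Query: K1 AND (K2 OR (NOT K3))
# AND -> set intersection, OR -> set union, NOT -> complement in the collection.
answer = containing("K1") & (containing("K2") | (all_docs - containing("K3")))

print(sorted(answer))  # ['D1', 'D2', 'D6']
```

The result is an unordered set: every document either matches or does not, which is why the Boolean model cannot rank its answers.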
Definition : For the Boolean model, the index term weight variables are all
binary, i.e., $w_{i,j} \in \{0, 1\}$. A query q is a conventional Boolean expression.
Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query q. Further, let $\vec{q}_{cc}$ be
any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$
to the query q is defined as

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d_j}) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$

If $sim(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is
relevant to the query q (it might not be). Otherwise, the prediction is that the
document is not relevant.
Advantages of the Boolean Model
(1) It is the simplest model and is based on sets.
(2) Easy to understand and implement.
(3) It only retrieves exact matches.
(4) It gives the user a sense of control over the system.
(5) Boolean retrieval was adopted by many commercial bibliographic
systems.
(6) Boolean queries are akin to database queries.
Disadvantages of the Boolean Model
(1) The model's similarity function is Boolean. Hence, there are no
partial matches. This can be annoying for the users.
(2) The information need has to be translated into a Boolean expression,
which most users find awkward.
(3) In this model, the choice of Boolean operators has much more influence
than a critical word.
(4) The Boolean queries formulated by the users are most often too
simplistic.
(5) As a result, the Boolean model frequently returns either too few or too
many documents in response to the user query.
(6) The query language is expressive, but it is complicated.
(7) No ranking for retrieved documents (absence of a grading scale).
(8) It is not possible to assign a degree of relevance.
2.5.2 Vector Model

GQ. Define the Vector Model with relevant mathematical equations.
GQ. What are the assumptions of vector space model?
GQ. What are the parameters in calculating a weight for a document
term or query term?
GQ. How can you calculate tf and idf in the vector model?
The vector model proposes a framework that allows for partial matching
because it acknowledges that using binary weights is too restrictive.
It assigns non-binary weights to index terms in queries and documents.
The degree of similarity between each document stored in the system
and the user query is calculated using these term weights.
The vector model takes into account documents that match the query terms
only partially by sorting the retrieved documents in decreasing order of
this degree of similarity.
In comparison to the Boolean model, the ranked document answer set is
significantly more precise (in the sense that it better satisfies the user's
information need).
Definition : For the vector model, the weight $w_{i,j}$ associated with a pair
$(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are
also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$,
where $w_{i,q} \ge 0$. Then, the query vector $\vec{q}$ is defined as
$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where t is the total number of index terms in
the system. As before, the vector for a document $d_j$ is represented by
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
GQ. What is cosine similarity?
GQ. Define term frequency.
GQ. Define inverse document frequency.

The vector model proposes to evaluate the degree of similarity of the
document $d_j$ with regard to the query q as the correlation between the
vectors $\vec{d_j}$ and $\vec{q}$.
For instance, this correlation can be quantified by the cosine of the angle
between these two vectors as shown in Fig. 2.5.1. That is,

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

where $|\vec{d_j}|$ and $|\vec{q}|$ are the norms of the document and query vectors. The
factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the
documents) because it is the same for all documents. The factor $|\vec{d_j}|$
provides a normalization in the space of the documents.

Fig. 2.5.1 : The cosine of θ is adopted as $sim(d_j, q)$.
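The cosine ranking can be sketched in a few lines. This is an illustrative sketch only: the weight vectors `d1`, `d2`, and `q` are hypothetical values over t = 3 index terms, not data from the text, and returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(d * q for d, q in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(d * d for d in doc_weights))
    norm_q = math.sqrt(sum(q * q for q in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


# Hypothetical term weights over t = 3 index terms.
d1 = [0.5, 0.8, 0.0]
d2 = [0.0, 0.2, 0.9]
q = [1.0, 1.0, 0.0]

# Rank documents in decreasing order of similarity to the query.
ranking = sorted(
    [("d1", cosine_similarity(d1, q)), ("d2", cosine_similarity(d2, q))],
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(name, round(score, 3))
```

Here `d2` shares only one weakly weighted term with the query, so it is still retrieved (partial match) but ranked below `d1`.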


The vector model measures intra-cluster similarity by the raw frequency
of a term $k_i$ within a document $d_j$.
Such term frequency is usually referred to as the tf factor and provides
one measure of how well that term describes the document contents (i.e.,
intra-document characterization).
Inter-cluster dissimilarity is measured by the inverse of the frequency of
a term $k_i$ among the documents in the collection. This factor is
known as the inverse document frequency or the idf factor.
Definition : Let N be the total number of documents in the system and $n_i$ be
the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be
the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times
the term $k_i$ is mentioned in the text of the document $d_j$). Then, the
normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

where the maximum is computed over all terms which
are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear
in the document $d_j$ then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document
frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i}$$

Weights are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$

Such term weighting schemes are called tf-idf schemes.
The Vector Model Example
Let's consider that a collection includes 10,000 documents.
The term A appears 20 times in a particular document.
The maximum number of appearances of any term in this
document is 50.
The term A appears in 2,000 of the collection
documents.
f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4
idf(i) = log2(N/n_i) = log2(10,000/2,000) = log2(5) = 2.32
w(i,j) = f(i,j) * log2(N/n_i) = 0.4 * 2.32 = 0.93
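The arithmetic of this example can be reproduced with a short sketch. The function name `tf_idf` and its parameters are assumptions made for illustration; the logarithm is taken to base 2 because that is the base under which log(5) comes out as 2.32.

```python
import math


def tf_idf(freq, max_freq, n_docs, docs_with_term, log_base=2):
    """Return (normalized tf, idf, tf-idf weight) for one term in one document."""
    f = freq / max_freq                             # normalized term frequency
    idf = math.log(n_docs / docs_with_term, log_base)  # inverse document frequency
    return f, idf, f * idf


# Values from the example: term A occurs 20 times, the most frequent term
# occurs 50 times, and A appears in 2,000 of 10,000 documents.
f, idf, w = tf_idf(freq=20, max_freq=50, n_docs=10_000, docs_with_term=2_000)

print(round(f, 2), round(idf, 2), round(w, 2))  # 0.4 2.32 0.93
```

A term that appears in every document gets idf = log(1) = 0 under this scheme, so it contributes nothing to the weight regardless of its frequency.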
GQ. What are the advantages and disadvantages of the
Vector Model?
Advantages of Vector Space Model
(1) Its term-weighting scheme improves the quality of the answer set and
retrieval performance.
(2) Its partial matching strategy allows retrieval of documents that
approximate the query conditions.
(3) Its cosine ranking formula sorts the documents according to their degree
of similarity to the query.
Disadvantage of Vector Space Model
(1) The assumption of mutual independence between index terms.
2.5.3 Probabilistic Model

GQ. What are the fundamental assumptions for the probabilistic principle?
GQ. Write the advantages and disadvantages of the probabilistic model.
The probabilistic model is an effort to frame the information retrieval
problem within a probabilistic framework.
The probabilistic model tries to estimate the probability that the user will
find the document $d_j$ relevant, using the ratio
P($d_j$ relevant to q) / P($d_j$ non-relevant to q)
It is used to derive the ranking functions with which search engines and
web search engines rank matching documents according to their
relevance to a given search query.
This model is used to calculate the probability that a document $d_j$ will
be relevant to a given query q.
The model makes the assumption that this probability of relevance depends
on the query and document representations.
Given a query q, there exists a subset R of the documents which are
relevant to q, but membership of R is uncertain.
Users start with information needs, which they translate into query
representations. Similarly, there are documents, which are converted into
document representations. Given only a query, an IR system has an
uncertain understanding of the information need.
So IR is an uncertain process, because of:
Information need to query translation
Documents to index terms conversion
Query terms and index terms mismatch
Probability theory provides a principled foundation for such reasoning
under uncertainty. This model estimates how likely a document is to be
relevant to an information need.
Documents can be relevant or non-relevant, and we can estimate the
probability of a term t appearing in a relevant document,
P(t | R = 1).
Probabilistic methods are one of the oldest but also one of the currently
hottest topics in Information Retrieval.
For the Probabilistic model

GQ. How can you find the similarity between doc and query in
probabilistic principle using Bayes' rule?
All index term weights are binary, i.e., $w_{i,j} \in \{0, 1\}$, $w_{i,q} \in \{0, 1\}$.
Let R be the set of documents known to be relevant to query q.
Let $\bar{R}$ be the complement of R.
Let $P(R \mid \vec{d_j})$ be the probability that the document $d_j$ is relevant to the
query q.
Let $P(\bar{R} \mid \vec{d_j})$ be the probability that the document $d_j$ is non-relevant to the
query q.
The similarity $sim(d_j, q)$ of the document $d_j$ to the query q is defined as
the ratio

$$sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})}$$

Using Bayes' rule,

$$sim(d_j, q) = \frac{P(\vec{d_j} \mid R) \times P(R)}{P(\vec{d_j} \mid \bar{R}) \times P(\bar{R})}$$

$P(\vec{d_j} \mid R)$ stands for the probability of randomly selecting the document $d_j$
from the set R of relevant documents.
$P(R)$ stands for the probability that a document randomly selected from
the entire collection is relevant.
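The Bayes step can be written out term by term. The derivation below is a sketch that uses only the quantities defined above, writing $\bar{R}$ for the complement of R; the point to notice is that the probability $P(\vec{d_j})$ of the document itself appears in both numerator and denominator and cancels.

```latex
\begin{aligned}
sim(d_j, q)
  &= \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})} \\
  &= \frac{P(\vec{d_j} \mid R)\, P(R) \,/\, P(\vec{d_j})}
          {P(\vec{d_j} \mid \bar{R})\, P(\bar{R}) \,/\, P(\vec{d_j})} \\
  &= \frac{P(\vec{d_j} \mid R) \times P(R)}
          {P(\vec{d_j} \mid \bar{R}) \times P(\bar{R})}
\end{aligned}
```

Since $P(R)$ and $P(\bar{R})$ are the same for every document in the collection, they do not change the order of the documents; the ranking is determined by the ratio $P(\vec{d_j} \mid R) / P(\vec{d_j} \mid \bar{R})$.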

Advantage of Probabilistic Model
(1) Documents are ranked in decreasing order of probability of
relevance.
Disadvantages of Probabilistic Model
(1) Need to guess initial estimates for $P(k_i \mid R)$.

2.6 ALTERNATIVE SET THEORETIC MODELS

GQ. Discuss alternative set theoretic models.
In this section, we discuss two alternative set theoretic models, namely
the fuzzy set model and the extended Boolean model.
2.6.1 Fuzzy Set Model
GQ. Explain fuzzy set model.
GQ. Write basics of fuzzy set theory.
When documents and queries are represented by sets of keywords,
the resulting descriptions are only loosely related to the actual semantic
contents of the corresponding documents and queries.
As a result, the match between a document and the query terms is only
approximate (or vague).
This can be modeled by assuming that each query term defines a fuzzy
set and that each document has a degree of membership (usually smaller
than 1) in this set.
This interpretation provides the foundation for many models of IR based
on fuzzy theory.
Basics of Fuzzy Set Theory
Fuzzy set theory is an extension of classical set theory.
Elements have a varying degree of membership. A logic based on the two
truth values True and False is sometimes insufficient when describing
human reasoning.
Fuzzy logic uses the whole interval between 0 (false) and 1 (true) to
describe human reasoning.
A fuzzy set is any set that allows its members to have different degrees
of membership, given by a membership function with values in the
interval [0, 1].
Fuzzy logic is derived from fuzzy set theory.
Many degrees of membership (between 0 and 1) are allowed.
Thus a membership function $\mu_A(x)$ is associated with a fuzzy set A
such that the function maps every element of the universe of discourse X
to the interval [0, 1].
The mapping is written as: $\mu_A(x): X \rightarrow [0, 1]$.
Fuzzy logic is capable of handling inherently imprecise (vague or
inexact or rough or inaccurate) concepts.
A fuzzy set is defined as follows: If X is a universe of discourse and x is
a particular element of X, then a fuzzy set A defined on X can be
written as a collection of ordered pairs $A = \{(x, \mu_A(x)),\ x \in X\}$.

GQ. Define membership function.
GQ. Explain fuzzy information retrieval.

Example
Let X = {g1, g2, g3, g4, g5} be the reference set of students.
Let A be the fuzzy set of "smart" students, where "smart" is a fuzzy
term.
A = {(g1, 0.4), (g2, 0.5), (g3, 1), (g4, 0.9), (g5, 0.8)}
Here A indicates that the smartness of g1 is 0.4, and so on.
Membership Function : The membership function fully defines the
fuzzy set. A membership function provides a measure of the degree of
similarity of an element to a fuzzy set.
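A fuzzy set maps naturally onto a dictionary from elements to degrees of membership. The sketch below encodes the "smart" students example; the function name `membership` and the 0.5 threshold used to build a crisp set for comparison are assumptions made for illustration.

```python
# Fuzzy set of "smart" students: element -> degree of membership.
smart = {"g1": 0.4, "g2": 0.5, "g3": 1.0, "g4": 0.9, "g5": 0.8}


def membership(fuzzy_set, element):
    """Membership function: a degree in [0, 1]; 0.0 for elements not listed."""
    return fuzzy_set.get(element, 0.0)


# Every degree of membership must lie in the closed interval [0, 1].
assert all(0.0 <= degree <= 1.0 for degree in smart.values())

# A classical (crisp) set can only say yes or no. A hypothetical threshold
# of 0.5 turns the fuzzy set into a crisp one and discards the degrees.
crisp_smart = {student for student, degree in smart.items() if degree >= 0.5}

print(membership(smart, "g1"))  # 0.4
print(sorted(crisp_smart))      # ['g2', 'g3', 'g4', 'g5']
```

The crisp version treats g2 (0.5) and g3 (1.0) identically, which is exactly the information a fuzzy set is meant to keep.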
Fuzzy Information Retrieval
The main idea is to expand the query's index terms with related
terms (obtained from a thesaurus) so that the user query can retrieve
more relevant documents.
A thesaurus can be constructed by defining a term-term correlation matrix
(also called a keyword connection matrix) whose rows and columns are
associated with the index terms in the document collection. In this
matrix C, a normalized correlation factor $c_{i,l}$ between two terms $k_i$ and $k_l$
can be defined by

$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$

where $n_i$ is the number of documents which contain the term $k_i$, $n_l$ is
the number of documents which contain the term $k_l$, and $n_{i,l}$ is the
number of documents which contain both terms.
In the fuzzy set associated with the index term $k_i$, a document $d_j$ has a
degree of membership $\mu_{i,j}$ computed as

$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

which computes an algebraic sum over all terms in document $d_j$.
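The two steps (term-term correlation, then document membership) can be sketched on a toy collection. This sketch assumes the usual definitions, c = n_il / (n_i + n_l - n_il) for the correlation of two terms and mu = 1 - product of (1 - c) over the terms of the document; the four documents and their terms are hypothetical.

```python
from math import prod

# Hypothetical collection: document name -> set of index terms.
docs = {
    "d1": {"apple", "fruit"},
    "d2": {"apple", "fruit", "juice"},
    "d3": {"juice"},
    "d4": {"fruit"},
}


def correlation(term_a, term_b):
    """Normalized correlation: shared documents over documents with either term."""
    n_a = sum(term_a in terms for terms in docs.values())
    n_b = sum(term_b in terms for terms in docs.values())
    n_ab = sum(term_a in terms and term_b in terms for terms in docs.values())
    denominator = n_a + n_b - n_ab
    return n_ab / denominator if denominator else 0.0


def membership(term, doc_name):
    """Degree of membership of a document in the fuzzy set of `term`."""
    return 1 - prod(1 - correlation(term, other) for other in docs[doc_name])


print(round(correlation("apple", "fruit"), 3))  # 0.667
print(membership("apple", "d1"))                # 1.0 (d1 contains "apple")
print(round(membership("apple", "d4"), 3))      # 0.667
```

Document d4 never mentions "apple", yet it belongs to the fuzzy set of "apple" with degree 0.667 because it contains the strongly correlated term "fruit". This is how the thesaurus lets a query reach related documents.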


2.6.2 Extended Boolean Model

GQ. Discuss extended Boolean model.

In the Boolean model, there is no provision for term weighting and no
ranking of the answer set is generated.
As a result, the size of the output might be too large or too small.
An alternative strategy is to add the capabilities of term
weighting and partial matching to the Boolean model. With this method,
it's possible to combine vector model properties with Boolean query
formulations.
The extended Boolean model was introduced in 1983 by Salton, Fox,
and Wu.

2.7 STRUCTURED TEXT RETRIEVAL MODELS

GQ. Explain Structured text retrieval models.
Think about a user who has a strong visual memory. Such a user
might remember that the particular document in which he is
interested has a page where the phrase "Nuclear Blast" occurs in italics
in the text around a Figure whose label contains the word "earth."
This query may be phrased as ['Nuclear Blast' and 'earth'] in a traditional
information retrieval approach, which would return all documents containing
both strings. But it's clear that this user did not want as many
documents as this answer provides.
In this scenario, the user wants to make his query more precise by using a
richer expression, like
same-page (near ('Nuclear Blast', Figure (label ('earth'))))

[Figure: four levels of text structure, L1 chapter, L2 section,
L3 subsections, L4 subsubsections, represented through different indexing
lists.]
Fig. 2.7.1 : Structure of text documents
Implementation
A single inverted file is built, in which each structural component stands
as an entry in the index.
Each entry has a list of text regions as a list of occurrences.
Such a list could be easily merged with the traditional inverted file.
Example types of queries
Select a region that contains a given word.
Select a region A which does not contain any other region B.
Select a region not contained within any other region.
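The first two query types can be sketched with regions stored as (start, end) word positions. Everything concrete here is an assumption for illustration: the component names, the positions, the word list, and the two helper functions; a real system would merge these lists with the inverted file rather than scan them.

```python
# Hypothetical index: each structural component maps to a list of
# non-overlapping text regions, given as (start, end) word positions.
regions = {
    "chapter": [(0, 499)],
    "section": [(0, 199), (200, 499)],
    "figure": [(150, 160)],
}

# Hypothetical inverted list: word -> positions where it occurs.
word_positions = {"earth": [155, 410]}


def regions_containing_word(component, word):
    """Select the regions of `component` that contain at least one occurrence."""
    positions = word_positions.get(word, [])
    return [
        (start, end)
        for (start, end) in regions[component]
        if any(start <= position <= end for position in positions)
    ]


def regions_without(component_a, component_b):
    """Select the regions of A that do not contain any region of B."""
    return [
        (start, end)
        for (start, end) in regions[component_a]
        if not any(start <= b_start and b_end <= end
                   for (b_start, b_end) in regions[component_b])
    ]


print(regions_containing_word("figure", "earth"))  # [(150, 160)]
print(regions_without("section", "figure"))        # [(200, 499)]
```

Because the regions within one list never overlap, each containment test is a simple comparison of start and end positions.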

2.7.2 Model Based on Proximal Nodes

GQ. Discuss model based on proximal nodes.

This model was proposed by Navarro and Baeza-Yates.
The basic idea is to define a strict hierarchical index over the text. This
enriches the previous model, which uses flat lists.
It allows the definition of independent hierarchical (non-flat) indexing
structures over the same text of the document.
Every indexing structure is made up of nodes, which are chapters,
sections, paragraphs, pages, and lines.
Each node is associated with a text region.
If a user query refers to different hierarchies, the compiled answer is
formed by nodes which all come from only one of them.
This type of model allows us to formulate more complex queries than the
model based on non-overlapping lists.
Only nearby (proximal) nodes are examined, for faster query processing.
Fig. 2.7.2 shows a hierarchical indexing structure of four levels and an
inverted list for the word 'Everest'.

[Figure: hierarchical index with four levels, L1 chapter, L2 section,
L3 subsections, L4 subsubsections, together with an inverted list for the
word 'Everest'.]
Fig. 2.7.2 : Hierarchical indexing structure

Features
One node might be contained within another node.
But two nodes of the same hierarchy cannot overlap.
The inverted list for words complements the hierarchical index.
Query language with regular expressions
(1) Searches for strings
(2) References to structural components by name
(3) Combinations of these
An example query: [(*section) with ("Everest")]
searches for the sections, the subsections, and the sub-subsections that
contain the word "Everest".
The model is a compromise between expressiveness and efficiency.

2.8 MODELS FOR BROWSING

Sometimes the user is interested in spending some time exploring the
documents, looking for interesting references instead of posing a
specific query.
Users have goals to pursue in both cases.
But the goal of a searching task is clearer in the user's mind
than the goal of a browsing task.


Types of Browsing
GQ. What are different types of browsing?

(1) Flat Browsing

Documents are represented as dots in a (two-dimensional) plane or as
elements in a (single-dimension) list.
The user then glances here and there looking for information within the
documents visited.
The user looks for correlations among neighbor documents or for
keywords.
These keywords could be added to the original query for query
expansion; this process is called relevance feedback, and it helps in the
retrieval of more relevant documents.
Users can also explore a single document in a flat manner (like a web
page).

Drawback
On a given page, the user may not have an indication of the context
where he is. For example, if a user opens a book on a random page,
he might not know in which chapter that page is.
(2) Structure Guided Browsing
Documents are organized in a structure such as a directory to help users in
browsing.
Directories are hierarchies of classes that group documents covering
related topics.
These hierarchies of classes have been used to classify document
collections. E.g.: "Yahoo!" provides a hierarchical directory.
The user performs a structure guided type of browsing.
The same idea can be applied to a single document:
chapter level, section level, etc.
The last level is the text itself (flat!).
A good UI is needed for keeping track of the context in a focused
manner, e.g. the "Adobe Acrobat PDF" files.
Additional facilities are provided when searching, such as:
A history map to identify classes recently visited.
Display of occurrences (of terms) by showing the structures in a
global context, in addition to the text positions.
(3) The Hypertext Model
The fundamental concept related to the task of writing is the notion of
sequencing.
A sequenced organizational structure lies underneath most written
text.
The reader should not expect to fully understand the message conveyed
by the writer by randomly reading pieces of text here and there.
Sometimes, we cannot capture the information we need even through
sequential reading of the whole text.
For example, a book about the history of wars may be organized
chronologically, but the user might be interested in the wars fought by a
particular army or country. In such a case, the user will have a tough time
finding the information he is interested in,
because the contents are organized sequentially.
In these situations, one possible solution is to rewrite the book, but
rewriting the book is not practical.
Another solution is to define a new structure to organize the contents,
which can be achieved through the design of hypertext.
Hypertext
A high-level interactive navigational structure that allows users to browse
text non-sequentially.
It consists of nodes (text regions) correlated by directed links in a graph
structure.
A node could be a chapter in a book, a section in an article, or a
web page.
Links are attached to specific strings inside the nodes.


The process of navigating the hypertext can be understood as a traversal
of a directed graph.
Hypertext provides the basis for HTML (Hyper Text Markup Language)
and HTTP (Hypertext Transfer Protocol).
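Navigation as graph traversal can be sketched with an adjacency list. The node names and links below are hypothetical, and breadth-first order is only one possible way a reader might follow links.

```python
from collections import deque

# Hypothetical hypertext: nodes are text regions, edges are directed links.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
    "E": ["A"],
}


def reachable(start):
    """Breadth-first traversal: nodes a reader can reach by following links."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in visited:
                visited.append(target)
                queue.append(target)
    return visited


print(reachable("A"))  # ['A', 'B', 'C', 'D']
```

Node E links to A, but nothing links back to E, so a reader who starts at A can never reach it. Because links are directed, the flow of reading is fixed by whoever designed the links.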
Drawbacks of Hypertext
(1) Lost in hyperspace: the user may lose track of the organizational
structure of the hypertext when it is large.
A hypertext map shows where the user is at all times (graphical user
interface design).
(2) The user is restricted to the intended flow of information previously
conceived by the hypertext designer.
The design should take into account the needs of potential users.
Analyzing the requirements before starting implementation of the hypertext
is required.
(3) During the hypertext navigation, the user might find it difficult to
orient himself. Guiding tools can help in navigation (hypertext map).
Short Questions and Answers

Q. 1 What do you mean by information retrieval models?
Ans. :
A retrieval model can be a description of either the computational
process or the human process of retrieval: the process of choosing
documents for retrieval; the process by which information needs are first
articulated and then refined.

Q. 2 What is cosine similarity?
Ans. :
Cosine similarity is the cosine of the angle between two term-weight
vectors. This metric is frequently used when trying to determine the
similarity between two documents. Since two documents tend to have many
words in common, the other methods of calculating similarity are of
little use.
Q. 3 What are the characteristics of relevance feedback?
Ans. :
(1) It shields the user from the details of the query reformulation process.
(2) It breaks down the whole searching task into a sequence of small steps
which are easier to grasp.
(3) It provides a controlled process designed to emphasize some terms and
de-emphasize others.
Q. 4 What are the assumptions of vector space model?
Ans. :
Assumptions of the vector space model:
(1) The degree of matching can be used to rank-order documents.
(2) This rank-ordering corresponds to how well a document satisfies a
user's information needs.

Q. 5 What are the disadvantages of Boolean model?
Ans. :
(1) It is not simple to translate an information need into a Boolean
expression.
(2) Exact matching may lead to retrieval of too many documents.
(3) The retrieved documents are not ranked.
(4) The model does not use term weights.

Q. 6 Define term frequency.
Ans. :
Term frequency : Frequency of occurrence of a query keyword in a
document.

Q.7 What are the three classic models in information retrieval system?
Ans. :
(1) Boolean model
(2) Vector Space model
(3) Probabilistic model


Q. 8 What is the basis for Boolean model?
Ans. :
Simple model based on set theory and Boolean algebra.
(1) Documents are sets of terms.
(2) Queries are specified as Boolean expressions on terms.

Q. 9 What are the disadvantages of Boolean model?
Ans. :
Exact matching may retrieve too few or too many documents.
(1) Difficult to rank output, although some documents are more important
than others.
(2) Hard to translate a query into a Boolean expression.
(3) All terms are equally weighted.
(4) More like data retrieval than information retrieval.
(5) No notion of partial matching.
Q. 10 What are the fundamental assumptions for the probabilistic principle?
Ans. :
q - user query, $d_j$ - document in the collection.
The model assumes that relevance depends on the query and the document
representation only.
R - ideal answer set, relevant to the query.
$\bar{R}$ - complement of R, non-relevant to the query.
The similarity to the query, i.e. the probabilistic ranking, is computed as
the ratio
Ratio = P($d_j$ relevant to q) / P($d_j$ non-relevant to q)
This ranking minimizes the probability of an erroneous judgment.

Q. 11 Write the advantages and disadvantages of probabilistic model.
Ans. :
Advantages
(1) Documents are ranked in decreasing order of their probability of
relevance.
Disadvantages
(1) Need to guess the initial separation of documents into relevant and
non-relevant sets.
(2) All weights are binary.
(3) The adoption of the independence assumption for index terms.
(4) Need to guess initial estimates for $P(k_i \mid R)$.
(5) The method does not take into account tf and idf factors.

Q. 12 Why might classic IR lead to poor retrieval?
Ans. :
(1) The user information need is more related to concepts and ideas than to
index terms, but classic IR retrieves on index terms.
(2) Unrelated documents might be included in the answer set.
(3) Relevant documents that do not contain at least one index term are not
retrieved.
(4) Reasoning : retrieval based on index terms is vague and noisy.

Chapter Ends...
Module IV
Text Processing
CHAPTER 4

Syllabus
Text and Multimedia languages and properties: Metadata, Markup
Languages, Multimedia; Text Operations: Document Preprocessing,
Document Clustering.
Self-learning Topics: Digital Library : Greenstone

4.1 TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES

GQ. Define document and list all the characteristics of a document with
the help of diagram.
GQ. Define and explain a term document with an example.

Text is the primary form utilized for knowledge exchange.
Text has been produced everywhere, in a wide variety of forms and
languages.
The term document designates a single unit of information.
A document is a piece of text in digital or other form.
A document can be any physical unit (a file, an email) or a complete
logical unit (a book or a research article), an entry in a dictionary, or a
judge's opinion on a case.
The syntax and structure of a document are determined by the
application or the person who generated it.
The author specifies the semantics of the document.
intomation Retrieval System (MU-Sem.7-T) (Text
Processing) Pg. no (42
The presentation style of a document might dictate how it shoul4
presented or printed.
Its syntax and structure, which are tied to a
particular application
determine the presentation style.
[Figure: a document consists of text, structure, and other media, and is characterized by its syntax, presentation style, and semantics.]
Fig. 4.1.1 : Characteristics of a document
The document syntax
  expresses structure, presentation style, semantics, or external actions
  one or more of these elements may be given together or be implicit in the document's content
  a structural element (such as a section) can have a fixed formatting style
  can be implicit in the content, or expressed in a simple declarative language or in a programming language
Fig. 4.1.1 gives all the characteristics of a document.
Syntax languages may be proprietary and specific, but open and generic languages are more flexible.
Text can be written in natural language, which is difficult for computers to process.
The current trend is to use document languages that provide information on structure, format, and semantics and that are readable by both humans and computers.
Document style
  defines how a document is visualized or printed
  can be embedded in the document: TeX and RTF
  can be complemented by macros: LaTeX
Fig. 4.1.2 represents a document with a few styles.


Understanding search engine queries is crucial, since they are short chunks of text that differ from normal text and whose semantics are sometimes ambiguous due to polysemy.
It is also difficult to infer the user intent behind a query.

[Figure: a sample book page (Chapter 6, "Document and Query Properties and Languages", with Gonzalo Navarro, Section 6.1 Introduction) annotated to point out the large-font and bold-font styles.]

Fig. 4.1.2: An example of document style

4.1.1 Metadata
GQ. Define briefly the term 'metadata'.
GQ. Explain the use of metadata in Web documents.
GQ. Write a short note on RDF.
Metadata is information about how the data is organized, the different data domains, and how they relate to one another.

Metadata is "data about the data".
In a database, the names of the relations and attributes, together with their domains, are metadata.
Most documents and text collections also have metadata.
Descriptive metadata is metadata that is external to the meaning of the document and relates more to how it was created.
Common forms of metadata for documents include the author, the date of publication, the source of the publication, the document length and the document genre.
The Dublin Core Metadata Element Set suggests 15 fields for document descriptions.
Marchionini refers to this type of information as descriptive metadata.
Semantic metadata
  characterizes the subject matter within the document's contents
  is associated with a wide number of documents
  its availability is increasing
Specific ontologies can be used to standardize semantic terms.
An important metadata format is the Machine Readable Cataloging Record (MARC).
  it is the most used format for library records
  it includes fields for distinct attributes of a bibliographic entry such as title, author, publication venue
  in the U.S.A., a particular version of MARC is used: USMARC
Metadata is also used in Web documents.
The increase in Web data has led to many initiatives to add metadata information to Web pages for various purposes such as
  cataloging and content rating
  intellectual property rights and digital signatures
  privacy levels for access to a document
  applications to electronic commerce
The Resource Description Framework (RDF) is a standard for Web metadata.

RDF enables interoperability across applications and allows characterizing Web resources to facilitate automated processing.
RDF does not assume any specific application or semantic domain.
It consists of a description of nodes and attached attribute/value pairs.
Nodes can be any Web resource, that is, any Uniform Resource Identifier (URI), which includes Uniform Resource Locators (URLs).
Attributes are properties of nodes, with their values in the form of text strings or other nodes (Web resources or metadata instances).
Metadata can be used to describe non-textual things, such as a collection of keywords for an image.
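As a rough illustration of the node and attribute/value model, the following Python sketch stores a description as (node, attribute, value) triples. It does not use any RDF library or real RDF syntax; the URIs, the attribute names, and the helper `describe` are hypothetical and chosen only for illustration.

```python
# A minimal sketch of RDF-style metadata as (node, attribute, value) triples.
# All URIs and attribute names below are made-up examples.
triples = [
    ("https://example.org/photo1.gif", "creator", "Rugved More"),
    ("https://example.org/photo1.gif", "keywords", "house, sea view"),
    ("https://example.org/photo1.gif", "partOf", "https://example.org/album"),
    ("https://example.org/album", "title", "Pictures of my house"),
]

def describe(node, store):
    """Collect the attribute/value pairs attached to one node."""
    return {attr: value for subj, attr, value in store if subj == node}

photo = describe("https://example.org/photo1.gif", triples)
print(photo["keywords"])  # a text-string value describing a non-textual resource
print(photo["partOf"])    # a value that is itself another node (a URI)
```

The "keywords" entry shows metadata attached to an image, and the "partOf" entry shows a value that points to another node rather than to a text string.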

4.1.2 Markup Languages


GQ. What is Markup? Discuss various markup Languages in detail.
GQ. Define the XML language.
GQ. Differentiate between hypertext and XML data structures.
GQ. Express standard languages for multimedia applications with a proper example.
The term "markup" refers to the additional syntax required to express formatting actions, structure details, text semantics, attributes, etc.
The marks are called tags.
Marked text is enclosed by an initial and an ending tag to prevent ambiguity.
Examples of Markup Languages
SGML: Standard Generalized Markup Language
HTML: HyperText Markup Language
XML: eXtensible Markup Language
SGML

SGML (ISO 8879) stands for Standard Generalized Markup Language, i.e., a meta-language for tagging text.
SGML includes rules for constructing a markup language based on tags and a description of the document structure called a document type definition (DTD).
An SGML document is defined by a document type definition, and the text itself is marked with tags that specify the structure.
The document type definition is used to
  describe and name the pieces that a document is composed of
  define how those pieces relate to each other
Part of the DTD can be defined by an SGML document type declaration.
Other components of the DTD, such as the semantics of elements and attributes or application conventions, cannot be described formally in SGML.
  Comments can be used, however, to express them informally.
  More complete information is usually present in separate documentation.

Tags are denoted by angle brackets
  Tags are used to identify the beginning and ending of an element, such as <tagname> element </tagname>
  Ending tags include a slash before the tag name
  Attributes are specified inside the beginning tag
Fig. 4.1.3 shows an example of an SGML DTD for electronic messages.
<!--SGML DTD for electronic messages -->
<!ELEMENT email - - (prolog, contents) >
<!ELEMENT prolog - - (sender, address+, subject?, Cc*) >
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
<!ELEMENT contents - - (par | image | audio)+ >
<!ELEMENT par - O (ref | #PCDATA)+ >
<!ELEMENT ref - O EMPTY >
<!ELEMENT (image | audio) - - (#NDATA) >
<!ATTLIST email
    id ID #REQUIRED
    date_sent DATE #REQUIRED
    status (secret | public) public >
<!ATTLIST ref
    id IDREF #REQUIRED >
<!ATTLIST (image | audio)
    id ID #REQUIRED >
Fig. 4.1.3 : DTD for structuring electronic mails
Fig. 4.1.4 shows an example of the use of the previous DTD.
<!--Example of use of previous DTD-->
<!DOCTYPE email SYSTEM "email.dtd">
<email id=12345jm date_sent=02102022>
<prolog>
<sender> Rugved More </sender>
<address> Albert Gonsalves </address>
<address> Mumbai </address>
<subject> Pictures of my house in city town
<Cc> Saumil More </Cc>
</prolog>
<contents>
<par>
Kindly check the attached images of my house and the
splendid sea view from my bedroom
(photo <ref idref=F2>).
</par>
<image id=F1> "photo1.gif" </image>
<image id=F2> "photo2.jpg" </image>
<par>
Regards from the South, Rugved.
</contents>
</email>

Fig. 4.1.4 : An example of use of DTD for structuring electronic mails

Document description does not specify how a document is printed
  Output specifications, which give directions on how to format a document, are often added to SGML documents, such as
  (1) DSSSL: Document Style Semantics and Specification Language
  (2) FOSI: Formatted Output Specification Instance
  These standards define mechanisms for associating style information with SGML document instances
  They allow defining that the data identified by a tag should be typeset in some particular font
One important use of SGML is in the Text Encoding Initiative (TEI, started in 1987)
  includes a number of US associations related to the humanities and linguistics
  provides several document formats through SGML DTDs
  its primary objective is to create guidelines for the preparation and interchange of electronic texts for scholarly research, as well as for industry
  the most popular format is TEI Lite

HTML

HTML stands for HyperText Markup Language, which is an instance of SGML.
Since its creation in 1992, HTML has undergone several revisions, including version 4.0.
HTML5 is maintained as a "living standard" that is continually updated with new features.
Most documents on the Web are stored and transmitted in HTML.
HTML is a simple language well suited for
  hypertext
  multimedia
  the display of small and simple documents
Although there is an HTML DTD, most HTML instances do not explicitly refer to the DTD.
The HTML tags follow all the SGML conventions and also include formatting directives.
HTML pages can contain other media embedded in them, such as images or audio.
HTML also provides fields for metadata, which can be utilized for various applications and purposes.
Dynamic HTML (DHTML): a page that combines HTML with a program (for example, written in JavaScript)
Fig. 4.1.5 gives an example of an HTML document.



<html>
<head>
<title>HTML Practice Example</title>
<meta name=JM content="basic example">
</head>
<body>
<h1>HTML Practice Example</h1>
<hr>
<p>
HTML is very <b>simple</b> language:
<ul>
<li> link to <b><a href=https://www.xavier.ac.in>XIE</a></b>
(a from anchor),
<li> paragraphs (p), headings (h1, h2, etc), font types (b, i),
<li> horizontal rules (hr), unordered lists and items (ul, li),
<li> images (img), tables, forms, etc.
</ul>
<p>
<hr>
<img align=left src="flower.jpg" height=10 width=10>
Look at beautiful <b>flower</b>.
</body>
</html>

Fig. 4.1.5 : Example of an HTML document


Fig. 4.1.6 gives the output of the above HTML document in a browser.

HTML Practice Example

HTML is very simple language:
  • link to XIE (a from anchor),
  • paragraphs (p), headings (h1, h2, etc), font types (b, i),
  • horizontal rules (hr), unordered lists and items (ul, li),
  • images (img), tables, forms, etc.

Look at beautiful flower.

Fig. 4.1.6 : Output of an HTML document on a browser
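The metadata fields of an HTML page can be read by a program as well as by a browser. The sketch below uses Python's standard `html.parser` module on a shortened copy of the practice page; the class name `MetaCollector` is a hypothetical helper, and quoting the `name` attribute is a choice made here for clarity.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects the page title and the name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = """<html><head>
<title>HTML Practice Example</title>
<meta name="JM" content="basic example">
</head><body><h1>HTML Practice Example</h1></body></html>"""

collector = MetaCollector()
collector.feed(page)
print(collector.title)
print(collector.meta)
```

An IR system could index such fields separately from the body text, which is one way the metadata can serve "various applications and purposes".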

Cascading Style Sheets (CSS) were established in 1997 since HTML does not fix presentation style.
CSS offer
  a powerful and manageable mechanism for authors to enhance the aesthetics of HTML pages
  a way to separate information about presentation from document content
  support in current browsers which is still modest
The evolution of HTML implies support for both backward and forward compatibility.
HTML 4.0 has been specified in three flavours: strict, transitional, and frameset
  Strict HTML only deals with non-presentational markup, leaving all the display information to CSS
  Transitional HTML takes advantage of all presentational features to create pages that can be read by older browsers that do not understand CSS
  Frameset HTML is used when you want to divide the browser window into two or more frames
Style sheets, internationalization, frames, richer tables and forms, and accessibility features for people with impairments are all supported by HTML 4.0.
Typical HTML applications employ a fixed, limited set of tags
  which makes the language specification much easier to implement in applications
  which significantly restricts HTML in a number of crucial areas
HTML does not
  (1) allow users to declare their own tags
  (2) support the specification of nested structures needed to represent database schemas
  (3) provide language specifications that enable consuming applications to validate imported data for structural accuracy

XML

XML, the eXtensible Markup Language, is a simplified subset of SGML.
It is not a markup language like HTML, but a meta-language, like SGML.
It allows for machine-readable semantic markup that is also readable by humans.
It makes it easier to create and use new specific markup languages.
XML does not have many of the restrictions of HTML.
However, XML imposes a more rigid syntax on the markup
  In XML, ending tags must be present
  XML is case sensitive
  All attribute values must be enclosed in quotes
Parsing XML without a DTD is easier
  The tags can be obtained while the parsing is done
XML tags are not predefined; users can define their own tags.
Fig. 4.1.7 shows an XML document without a DTD, analogous to the previous electronic mail example given for SGML.
<?XML VERSION="1.0" RMD="NONE" ?>
<email id="12345jm" date_sent="02102022">
<prolog>
<sender> Rugved More </sender>
<address> Albert Gonsalves </address>
<address> Mumbai </address>
<subject> Pictures of my house in city town </subject>
<Cc> Saumil More </Cc>
</prolog>
<contents>
<par>
Kindly check the attached images of my house and the
splendid sea view from my bedroom
(photo <ref idref="F2"/>).
</par>
<image id="F1"> "photo1.gif" </image>
<image id="F2"> "photo2.jpg" </image>
<par>
Regards from the South, Rugved.
</par>
</contents>
</email>

Fig. 4.1.7 : An XML document without a DTD
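To illustrate that the tags of an XML document can be obtained during parsing, without any DTD, the sketch below parses a condensed copy of the e-mail example with Python's standard `xml.etree.ElementTree` module. The condensed text and the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

doc = """<email id="12345jm" date_sent="02102022">
  <prolog>
    <sender>Rugved More</sender>
    <address>Albert Gonsalves</address>
    <address>Mumbai</address>
    <subject>Pictures of my house in city town</subject>
    <Cc>Saumil More</Cc>
  </prolog>
  <contents>
    <par>Kindly check the attached images (photo <ref idref="F2"/>).</par>
    <image id="F1">photo1.gif</image>
    <image id="F2">photo2.jpg</image>
    <par>Regards from the South, Rugved.</par>
  </contents>
</email>"""

root = ET.fromstring(doc)                         # no DTD is needed to parse
tags = sorted({el.tag for el in root.iter()})     # tag names discovered while parsing
addresses = [a.text for a in root.findall("./prolog/address")]
images = {img.get("id"): img.text for img in root.iter("image")}

# XML requires ending tags: an unclosed element is rejected by the parser.
try:
    ET.fromstring("<subject>no ending tag")
    well_formed = True
except ET.ParseError:
    well_formed = False

print(tags)
print(addresses)
print(images)
print(well_formed)
```

Because XML is case sensitive, the tag `Cc` is reported exactly as written and would not match a search for `cc`.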

Extensible Stylesheet Language (XSL)
  the XML counterpart of Cascading Style Sheets (CSS)
  syntax defined based on XML
  created to transform and style highly structured, data-rich XML documents
  using XSL, for instance, it would be possible to automatically extract a document's table of contents
Extensible Linking Language (XLL)
  another extension to XML, defined using XML
  defines different types of links (external and internal)
Recent uses of XML include
  Mathematical Markup Language (MathML)
  Synchronized Multimedia Integration Language (SMIL)
  Resource Description Framework (RDF)
The next generation of HTML should be based on a suite of XML tag sets.

4.1.3 Multimedia
GQ. Express various formats for multimedia applications.
Recent advances in computer technology have precipitated a new era in the way people create and store data.
Multimedia usually stands for applications that handle different types of digital data.
Millions of multimedia documents including images, videos, audio, graphics, and texts can now be digitized.
Different types of formats are necessary for storing each medium.
Most formats for multimedia can only be processed by a computer.
Text

With the advent of the computer, it became necessary to represent characters in binary digits through coding schemes
  ASCII (7 bits), EBCDIC (8 bits) and Unicode (16 bits)
All these coding schemes are based on characters.
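A small Python sketch can make the idea of character-based coding schemes concrete. It maps each character of a sample string to its numeric code and then to bytes under two Unicode encodings; the sample string and the variable names are assumptions chosen for illustration.

```python
text = "IR caf\u00e9"          # the last character is an accented "e"

# Every character has a numeric code; ASCII codes fall in the range 0..127.
code_points = [ord(ch) for ch in text]
ascii_only = all(cp < 128 for cp in code_points)

utf8 = text.encode("utf-8")        # variable-length encoding: 1 byte for ASCII characters
utf16 = text.encode("utf-16-be")   # 2 bytes for each of these characters, no byte-order mark

print(code_points)
print(ascii_only)
print(len(text), len(utf8), len(utf16))
```

The accented character has a code above 127, so it cannot be stored in plain ASCII, which is one reason an IR system has to know the encoding of each document it indexes.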

An IR system should be able to retrieve information from many text formats (doc, pdf, html, txt).
IR systems use filters to handle the most popular document formats.
  But good filters might not be possible with proprietary formats.
Other text formats
  Rich Text Format (RTF): for document interchange
  Portable Document Format (PDF): for printing and displaying
  PostScript: for printing, displaying and drawing
  Multipurpose Internet Mail Extensions (MIME): for encoding email
Most common compression software and associated formats
  Compress (Unix), ARJ (PCs): for compressing text
  ZIP (gzip in Unix and WinZip in Windows): for compressing text
Image Formats

Direct representations of a bit-mapped display, such as XBM, BMP, or PCX, are the most basic image formats.
Images in these formats have a lot of redundancy and can be compressed efficiently.
  Example of a format that incorporates compression: CompuServe's Graphics Interchange Format (GIF)
To improve compression ratios, lossy compression was developed.
  Uncompressing a compressed image does not yield exactly the original image.
  This is done by the Joint Photographic Experts Group (JPEG) format
  JPEG attempts to remove portions of the image that are less noticeable to the human eye
  This format is parametric, meaning that the loss may be adjusted
Another common image format is the Tagged Image File Format (TIFF)
  used for the exchange of documents between different applications and computers
  TIFF provides for metadata, compression, and a varying number of colors
Truevision Targa image file (TGA): another format, related to video game boards.
Portable Network Graphics (PNG): a bit-mapped image format for use on the Internet.
Various other image formats, such as fax and fingerprints, are associated with specific applications.
Audio

For proper storage, audio must be converted to digital format.
The most popular audio file formats are AU, MIDI, and WAVE.
MIDI: a common format for exchanging music between computers and electronic devices.
Other formats, such as RealAudio or CD formats, are employed for audio libraries.

Movies

The main format for animations is Moving Picture Experts Group (MPEG)
  operates by coding the changes in successive frames
  exploits the temporal image redundancy that any video contains
  incorporates the audio signal linked with the video
  specific cases for audio (MP3), video (MP4), etc.
Other video formats are AVI, FLI and QuickTime
  AVI may include compression (Cinepak)
  QuickTime, developed by Apple, also includes compression
Graphics and Virtual Reality
Three-dimensional graphics can be represented in a variety of formats.
Computer Graphics Metafile (CGM) allows for the open exchange of structured graphical objects and the attributes associated with them.
The Virtual Reality Modeling Language (VRML) is a 3D graphics and multimedia interchange format that can be utilized in a wide range of application fields, including
  engineering and scientific visualization
  multimedia presentations
  entertainment and educational titles
  web pages and shared virtual worlds
VRML has emerged as the de facto Web modelling language.
HyTime

Hypermedia/Time-based Structuring Language
  a multimedia document markup standard
  an SGML architecture that specifies a document's generic hypermedia structure
HyTime hypermedia concepts include
  complex locating of document objects
  relationships (hyperlinks) between document objects
  numeric, measured associations between document objects
The HyTime architecture has three parts
  the base linking and addressing architecture
  the scheduling architecture (derived from the base architecture)
  the rendition architecture (which is an application of the scheduling architecture)
HyTime does not directly provide graphical interfaces, user navigation or user interaction.
These aspects of document processing are derived from the HyTime structures in a similar way to how style sheets are used in SGML documents.
4.2 TEXT OPERATIONS

4.2.1 Document Preprocessing

GQ. Describe five text operations of document preprocessing.
GQ. Describe lexical analysis of the text.
GQ. Explain the stemming process.
GQ. Describe the process of thesaurus generation.
GQ. Explain the various phases of text preprocessing with the help of the logical view of the document.
Document preprocessing can be divided into five text operations:
(1) Lexical analysis of the text
(2) Elimination of stopwords
(3) Stemming
(4) Selection of index terms
(5) Construction of term categorization structures

(1) Lexical analysis of the text

It is the process of turning a stream of characters into a stream of words.
The basic goal is to identify the words in the text.
It handles spaces (multiple spaces are reduced to one word separator), digits (words containing sequences of digits are removed), hyphens (removed), punctuation marks (removed) and the case of the letters (all the text is converted to either lower or upper case).
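The rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `lexical_analysis` is hypothetical, and replacing each hyphen by a space (rather than deleting it) is one possible reading of "removed" that is assumed here.

```python
import re

def lexical_analysis(text):
    """Turn a stream of characters into a list of lower-case words."""
    text = text.lower()                    # normalize the case of the letters
    text = text.replace("-", " ")          # assumption: hyphens break words apart
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation marks
    tokens = text.split()                  # any run of spaces acts as one separator
    # drop every word that contains a digit
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

words = lexical_analysis("State-of-the-art  IR systems, built in 1999, index B6 vitamins.")
print(words)
```

Each rule loses some information (for example, "B6" disappears together with "1999"), which is why real systems often keep exception lists for such cases.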
(2) Elimination of stopwords
The main objective is to filter out words with very low discrimination values for retrieval purposes.
Stopwords are words
  that are too common among the documents
  that occur in 80% of the documents
  for example, articles, prepositions, conjunctions, etc.
Stopword elimination significantly shrinks the size of the indexing structure but may decrease recall.
Problem: search for "to be or not to be"
  The elimination process might leave only the term "be", which makes it difficult to recognize the documents containing that phrase.
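A short Python sketch shows both the benefit and the problem. The stopword list below is a tiny hypothetical sample, not a standard list, and the function name `remove_stopwords` is chosen only for this illustration.

```python
# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "or", "not", "in", "and", "is"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stopwords]

kept = remove_stopwords(["the", "retrieval", "of", "information"])
phrase = remove_stopwords("to be or not to be".split())
print(kept)
print(phrase)
```

For the ordinary sentence the content words survive, while the famous phrase is reduced to repetitions of "be", so the phrase can no longer be matched.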
(3) Stemming
The major goal is to remove affixes (prefixes and suffixes) so that documents containing syntactic variants of the query terms can be retrieved.
Syntactic variations include plurals, gerund forms and past tense suffixes.
A stem is generated by removing the affixes of a word.
Examples of generated stems are
  connected, connecting, connection, connections --> connect
  effectiveness --> effective --> effect
  picnicking --> picnic
  stresses --> stress
  king --> k
Stems help to enhance retrieval efficiency.
Stemming helps to reduce the size of the indexing structure.
Stemming strategies by Frakes
  affix removal: intuitive, simple, and can be implemented efficiently
  table lookup: simple, but dependent on data on stems
  successor variety: more complex than affix removal
  n-grams: more a term clustering procedure than a stemming one
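The affix removal strategy can be illustrated with a deliberately naive suffix stripper in Python. This is not the Porter algorithm or any published stemmer; the suffix list, the function name `naive_stem`, and the `min_stem` guard are assumptions made for this sketch.

```python
# Suffixes are tried in this order, so longer suffixes come first.
SUFFIXES = ["ions", "ings", "ion", "ing", "ed", "es", "s"]

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, unless the remaining stem gets too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["connected", "connecting", "connection", "connections"]
stems = [naive_stem(w) for w in variants]
print(stems)
print(naive_stem("stresses"))
print(naive_stem("king"))               # protected by the minimum stem length
print(naive_stem("king", min_stem=1))   # without the guard, "ing" is stripped
```

The last two calls show why pure suffix removal needs extra conditions: without a length check, "king" is wrongly reduced to "k". A rule set this small also mishandles many other words, so it should be read only as an illustration of the idea.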
(4) Selection of index terms
A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
Since most of the semantics is carried by the nouns, the motivation for selecting index terms is to use the nouns in the text.
Identification of noun groups is a good approach for selecting index terms.
A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (for example, information technology).
(5) Construction of term categorization structures (such as a thesaurus)
The word thesaurus has Greek and Latin roots and refers to a treasury of words.
A thesaurus consists of
  a precompiled list of important words in a given domain of knowledge
  a set of related words for each word in the above list
  a normalization of the vocabulary
An example from Peter Roget's thesaurus is given below
  cowardly adj.
  Ignobly lacking in courage: cowardly turncoats.
  Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).
The creation of a thesaurus is motivated by the idea of using a controlled vocabulary for indexing and searching.
The purposes of a thesaurus, according to Foskett
  to provide a standard vocabulary for indexing and searching
  to assist users with locating terms for proper query formulation
  to provide classified hierarchies that allow the broadening and narrowing of the current query request
The key constituents of a thesaurus are its index terms, the relationships among the terms, and a layout design for these term relationships.
Thesaurus Term Relationships
  BT: broader
  NT: narrower
  RT: non-hierarchical, but related
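The BT, NT, and RT relationships map naturally onto a small lookup structure. The Python sketch below is only an illustration: the thesaurus entry and the helper `expand` are hypothetical and are not taken from any real thesaurus.

```python
# One hypothetical thesaurus entry with its three relationship types.
thesaurus = {
    "information retrieval": {
        "BT": ["information science"],                  # broader terms
        "NT": ["text retrieval", "image retrieval"],    # narrower terms
        "RT": ["search engines"],                       # related, non-hierarchical
    },
}

def expand(term, relation, store=thesaurus):
    """Return the term followed by its neighbours under one relationship type."""
    return [term] + store.get(term, {}).get(relation, [])

narrowed = expand("information retrieval", "NT")
broadened = expand("information retrieval", "BT")
print(narrowed)
print(broadened)
```

Following NT links narrows a query and following BT links broadens it, which corresponds to the third purpose in Foskett's list.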
Fig. 4.2.1 represents the logical view of the document after each of the above phases is completed.

[Figure: document --> structure recognition --> accents, spacing --> stopwords --> noun groups --> stemming --> automatic or manual indexing; the logical view moves from text + structure, through structure and full text, to index terms.]

Fig. 4.2.1 : Logical view of the document throughout the various phases of text preprocessing

4.2.2 Document Clustering

GQ. What are the two types of clustering?
GQ. Differentiate between document clustering and term clustering.
GQ. Describe any one technique for term clustering with the help of an example.
GQ. Define term clustering. Explain item clustering with a suitable example.
GQ. Describe document clustering.
GQ. Explain document clustering with the help of an example.
GQ. Describe the technique for term clustering.
GQ. Define clustering. What are the general guidelines for clustering?
GQ. List all steps of query expansion through local context analysis.
Clustering: the grouping of documents which satisfy a set of common properties.
Document clustering is an operation on the collection of documents and not on the text.
Two types of operation for clustering documents: global and local.
Global clustering strategy: the documents are grouped according to their occurrence in the whole collection.
Local clustering strategy: the grouping of documents is affected by the context defined by the current query and its local set of retrieved documents.
Attempts to automatically obtain a description for a larger cluster of relevant documents are based on the identification of terms related to the query terms, such as synonyms, stemming variations, and terms within a distance of at most k words from a query term.
In a global strategy, the entire collection of documents is used to create a global thesaurus from which terms for query expansion are chosen.
The local strategy involves looking through the documents that were returned for a specific query q at query time to identify terms for query expansion.
Two basic types of local strategy
(1) Local clustering
(2) Local context analysis
Local strategies suit the environment of intranets, not Web documents.

Query Expansion Through Local Clustering

Local feedback strategies expand the query with terms correlated to the query terms.
Such correlated terms are those present in local clusters built from the local document set.
Definition: Let V(s) be a non-empty subset of words which are grammatical variants of each other. A canonical form s of V(s) is called a stem.
Example
If V(s) = {connect, connecting, connected} then s = connect
For a given query q:
  $D_l$: the local document set, the set of documents retrieved for a given query q
  $V_l$: the local vocabulary, the set of all distinct words in the local document set
  $S_l$: the set of all distinct stems derived from the set $V_l$
Strategies for building local clusters
(1) Association clusters
(2) Metric clusters
(3) Scalar clusters
(1) Association clusters
An association cluster is based on the co-occurrence of stems or words inside the documents, using the idea of synonymity association.
Generation of association clusters
  $f_{s_i,j}$: the frequency of a stem $s_i$ in a document $d_j$, $d_j \in D_l$
  Let $m = (m_{ij})$ be an association matrix with $|S_l|$ rows and $|D_l|$ columns, where $m_{ij} = f_{s_i,j}$
  The matrix $s = m\,m^{T}$ is a local stem-stem association matrix.
  Each element $s_{u,v}$ in $s$ expresses a correlation $c_{u,v}$ between the stems $s_u$ and $s_v$
    $c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$   ...(4.2.1)
  The correlation factor $c_{u,v}$ quantifies the absolute frequencies of co-occurrence.
  The association matrix $s$ is unnormalized if we adopt
    $s_{u,v} = c_{u,v}$   ...(4.2.2)
  Normalize the correlation factor to normalize the association matrix, using
    $s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$   ...(4.2.3)
Build local association clusters
  Consider the u-th row in the association matrix
  Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$
  Then $S_u(n)$ defines a local association cluster around the stem $s_u$.
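Equations (4.2.1) to (4.2.3) can be traced on a small example. The Python sketch below uses plain lists so that it needs no external library; the stems and their frequencies are hypothetical, and the function names `correlation`, `normalize`, and `cluster` are chosen only for this illustration.

```python
# Rows are stems, columns are the documents of the local set.
# The frequencies f_{s_i,j} below are made-up values.
stems = ["connect", "network", "cable", "poem"]
m = [
    [2, 1, 0],   # connect
    [1, 1, 0],   # network
    [1, 0, 0],   # cable
    [0, 0, 3],   # poem
]

def correlation(m):
    """c[u][v] = sum over documents of f_{s_u,j} * f_{s_v,j} (m times its transpose)."""
    n = len(m)
    return [[sum(a * b for a, b in zip(m[u], m[v])) for v in range(n)] for u in range(n)]

def normalize(c):
    """s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v]); assumes no all-zero row."""
    n = len(c)
    return [[c[u][v] / (c[u][u] + c[v][v] - c[u][v]) for v in range(n)] for u in range(n)]

def cluster(s, u, n):
    """Indices of the n stems v != u with the largest values s[u][v]."""
    others = [v for v in range(len(s)) if v != u]
    return sorted(others, key=lambda v: s[u][v], reverse=True)[:n]

c = correlation(m)
s = normalize(c)
neighbours = [stems[v] for v in cluster(s, 0, 2)]
print(c[0])
print(s[0])
print(neighbours)
```

Here the stem "poem" never co-occurs with "connect", so its correlation is zero and it stays outside the cluster of size 2 around "connect".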
(2) Metric Clusters
A metric cluster is based on the idea that two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document.
It might be worthwhile to factor in the distance between two terms in the computation of their correlation factor.
  Let $r(k_i, k_j)$ be the distance between two keywords $k_i$ and $k_j$ in the same document
  If $k_i$ and $k_j$ are in distinct documents we take $r(k_i, k_j) = \infty$
  $V(s_u)$ is the set of keywords with $s_u$ as their stem
  $V(s_v)$ is the set of keywords with $s_v$ as their stem
A local stem-stem metric correlation matrix $s$ is defined such that each element $s_{u,v}$ expresses a metric correlation $c_{u,v}$ between the stems $s_u$ and $s_v$
    $c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$   ...(4.2.4)
  The correlation factor $c_{u,v}$ quantifies the absolute inverse distances.
  The association matrix $s$ is unnormalized if we adopt
    $s_{u,v} = c_{u,v}$   ...(4.2.5)
  Normalize the correlation factor to obtain a normalized association matrix
    $s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}$   ...(4.2.6)
Build local metric clusters
  Given a local metric matrix $s$
  Consider the u-th row in the association matrix
  Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$
  Then $S_u(n)$ defines a local metric cluster around the stem $s_u$.
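The metric correlation of equation (4.2.4) and its normalized form can be worked through on one short document. In the Python sketch below, the sentence, the stem table `stem_of`, and the function names are hypothetical; the distance r is taken as the difference between word positions, which is an assumption of this sketch, and each keyword occurs only once.

```python
# One hypothetical local document; every keyword occurs exactly once.
doc = "connected networks need cables and every network connects people".split()

# Hypothetical stem table: keyword -> stem.
stem_of = {"connected": "connect", "connects": "connect",
           "networks": "network", "network": "network"}

position = {word: i for i, word in enumerate(doc)}

def variants(stem):
    """V(s): the keywords of the document whose stem is `stem`."""
    return [k for k, s in stem_of.items() if s == stem and k in position]

def metric_correlation(su, sv):
    """c_{u,v}: sum of 1 / r(k_i, k_j) over all keyword pairs (intended for u != v)."""
    return sum(1.0 / abs(position[ki] - position[kj])
               for ki in variants(su) for kj in variants(sv))

def normalized_correlation(su, sv):
    """s_{u,v} = c_{u,v} / (|V(s_u)| * |V(s_v)|)."""
    return metric_correlation(su, sv) / (len(variants(su)) * len(variants(sv)))

c_uv = metric_correlation("connect", "network")
s_uv = normalized_correlation("connect", "network")
print(c_uv)
print(s_uv)
```

Adjacent pairs such as "connected networks" contribute 1 each, while the distant pairs contribute only 1/6 each, so closeness dominates the correlation. Pairs from distinct documents would contribute 0, since their distance is infinite.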
(3) Scalar Clusters
Two stems with similar neighbourhoods have some synonymity relationship.
The way to quantify such neighbourhood relationships is to arrange all correlation values $s_{u,i}$ in a vector $\vec{s}_u$, to arrange all correlation values $s_{v,i}$ in another vector $\vec{s}_v$, and to compare these vectors through a scalar measure.
  Let $\vec{s}_u = (s_{u,1}, s_{u,2}, \ldots, s_{u,n})$ and $\vec{s}_v = (s_{v,1}, s_{v,2}, \ldots, s_{v,n})$ be two vectors of correlation values for the stems $s_u$ and $s_v$
  Let $s = (s_{u,v})$ be a scalar association matrix.
  Each $s_{u,v}$ can be defined as
    $s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$   ...(4.2.7)
  Let $S_u(n)$ be a function which returns the set of n largest values $s_{u,v}$, $v \neq u$. Then $S_u(n)$ defines a scalar cluster around the stem $s_u$.
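Equation (4.2.7) is the cosine of the angle between two rows of a correlation matrix. The Python sketch below computes it for three hypothetical stems; the correlation values in `rows` are made up, and the function name `cosine` is chosen for this illustration.

```python
import math

def cosine(su, sv):
    """Scalar measure of (4.2.7): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(su, sv))
    norm = math.sqrt(sum(a * a for a in su)) * math.sqrt(sum(b * b for b in sv))
    return dot / norm if norm else 0.0

# Hypothetical rows of a correlation matrix: one vector of correlation values per stem.
rows = {
    "car":  [1.0, 0.8, 0.1],
    "auto": [0.9, 1.0, 0.0],
    "poem": [0.0, 0.1, 1.0],
}

car_auto = cosine(rows["car"], rows["auto"])
car_poem = cosine(rows["car"], rows["poem"])
print(round(car_auto, 3), round(car_poem, 3))
```

The stems "car" and "auto" have similar neighbourhoods and therefore a high scalar value, even if the two words rarely occur together in the same document.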
Interactive Search Formulation
Stems that belong to clusters associated to the query stems(or terms) can
be used to expand the original query.
A stem s, which belongs to a cluster (of size n) associated to another
stem s, is said to be a neighbour of s, .
Fig. 4.2.2 represents a stem s, as a neighbour of the stem s, within a
neighbourhood S,(n).
X
X

X S,(n) X
S,
,

X
X
X X

Fig. 4.2.2: Stems,, as a neighbour of the stem s,

For each stem , select mneighbour stems from the cluster S,fn) (which
might be of type association, metric, or scalar) and add them to the
query.
Hopefully, the additional neighbour stems will retrieve new relevant
documents.
Sn) may composed of stems obtained using correlation factors
normalized and unnormalized.

normalized cluster tends togroup stems which are more rare


unnormalized cluster tends to group stems due to their large
frequencies
(New Syll. w.e.f academic year 22-23) (M7-87)
Tech-Neo Publications

Using information about correlated stems to improve the search

Let two stems $s_u$ and $s_v$ be correlated with a correlation factor $c_{u,v}$.
If $c_{u,v}$ is larger than a predefined threshold, then a neighbour stem of
$s_u$ can also be interpreted as a neighbour stem of $s_v$, and vice versa.
This provides greater flexibility, particularly with Boolean queries.
Consider the expression $(s_u + s_v)$, where the + symbol stands for
disjunction.
Let $s_u'$ be a neighbour stem of $s_u$.
Then one can try both $(s_u' + s_v)$ and $(s_u + s_u')$ as synonym search
expressions, because of the correlation given by $c_{u,v}$.
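A small sketch may make the Boolean case concrete. It assumes stems are plain strings and that the correlation factor and threshold are given as numbers; the function name `synonym_expressions` and the sample stems and values are hypothetical.

```python
from typing import List


def synonym_expressions(s_u: str, s_v: str, neighbour_of_u: str,
                        c_uv: float, threshold: float) -> List[str]:
    """List the disjunctions worth trying for the query (s_u + s_v)."""
    original = f"({s_u} + {s_v})"
    if c_uv <= threshold:
        # Correlation too weak: the neighbour of s_u is not shared with s_v.
        return [original]
    return [
        original,
        f"({neighbour_of_u} + {s_v})",
        f"({s_u} + {neighbour_of_u})",
    ]


print(synonym_expressions("car", "automobil", "vehicl", c_uv=0.8, threshold=0.5))
# ['(car + automobil)', '(vehicl + automobil)', '(car + vehicl)']
```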
Query Expansion Through Local Context Analysis

The local context analysis procedure operates in three steps :


(1) Retrieve the top n ranked passages using the original query. This is
accomplished by breaking up the documents initially retrieved by the
query into fixed-length passages (for instance, of size 300 words) and
ranking these passages as if they were documents.
(2) For each concept c in the top ranked passages, the similarity sim(q, c)
between the whole query q (not individual query terms) and the concept
c is computed using a variant of tf-idf ranking.
(3) The top m ranked concepts (according to sim(q, c)) are added to the
original query q. Each added concept is assigned a weight given by
$1 - 0.9 \times i/m$, where i is the position of the concept in the final concept
ranking. The terms in the original query q might be stressed by assigning
a weight equal to 2 to each of them.
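Steps (1) and (3) above can be sketched in a few lines. This is only an illustration under stated assumptions: passages are cut on whitespace-separated words, the position i is counted from 1, and a concept that already appears in the query keeps its weight of 2. The similarity sim(q, c) of step (2) is not implemented here; the concept ranking is taken as given. The names `split_into_passages` and `lca_query_weights` are hypothetical.

```python
from typing import Dict, List


def split_into_passages(text: str, size: int = 300) -> List[str]:
    """Break a document into fixed-length passages of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def lca_query_weights(original_terms: List[str],
                      ranked_concepts: List[str],
                      m: int) -> Dict[str, float]:
    """Weight the original terms by 2 and the top m concepts by 1 - 0.9 * i / m."""
    weights = {term: 2.0 for term in original_terms}
    for i, concept in enumerate(ranked_concepts[:m], start=1):
        # setdefault leaves an original query term at its weight of 2.
        weights.setdefault(concept, 1.0 - 0.9 * i / m)
    return weights


weights = lca_query_weights(["jaguar"], ["car", "cat", "speed"], m=3)
for term, weight in weights.items():
    print(term, round(weight, 2))
```

With m = 3 the added concepts receive weights of roughly 0.7, 0.4 and 0.1, so concepts lower in the ranking contribute less to the expanded query.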

4.3 GENERAL QUESTIONS


Q. 1 Explain XML retrieval.
Ans. :
Document-oriented XML retrieval
(1) Document vs. data-centric XML retrieval (recall)
(2) Focused retrieval


(3) Structured documents
(4) Structured document (text) retrieval
(5) XML query languages
(6) XML element retrieval
(7) (A bit about) user aspects
Explain the above in detail.
Q. 2 Define clustering.
Ans. :
Clustering is a process of partitioning a set of data (or objects) into a set
of meaningful sub-classes, called clusters. It helps users understand the natural
grouping or structure in a data set. It is used either as a stand-alone tool to get
insight into data distribution or as a preprocessing step for other algorithms.
Chapter Ends...
