Information Retrieval (IR) Is The Science of

Information retrieval (IR) is the science of searching for documents and information within documents. IR systems are used to reduce "information overload" and are applied in university and public libraries to provide access to books and journals, as well as in web search engines. IR is interdisciplinary and relies on fields like computer science, mathematics, psychology and linguistics. It has evolved from early experiments in the 1950s to large-scale systems used today to search the world wide web.

Uploaded by

Satti Pandu


INFORMATION RETRIEVAL

Information retrieval
From Wikipedia, the free encyclopedia

Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.

Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.

Contents

1 History
  o 1.1 Timeline
2 Overview
3 Performance measures
  o 3.1 Precision
  o 3.2 Recall
  o 3.3 Fall-Out
  o 3.4 F-measure
  o 3.5 Mean Average precision
4 Model types
  o 4.1 First dimension: mathematical basis
  o 4.2 Second dimension: properties of the model

History

"But do you know that, although I have kept the diary [on a phonograph] for months past, it never once struck me how I was going to find any particular part of it in case I wanted to look it up?"
(Dr Seward, Bram Stoker's Dracula, 1897)

The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945. The first automated information retrieval systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.

In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim was to support the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.

The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence, where a digital resource ceases to be readable because the physical media, the reader required to read the media, the hardware, or the software that runs on it, is no longer available. The information is initially easier to retrieve than if it were on paper, but is then effectively lost.

Timeline

• Before the 1900s
1880s: Herman Hollerith invents the recording of data on a machine readable medium.
1890: Hollerith cards, keypunches and tabulators used to process the 1890 US Census data.
• 1940s–1950s
late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans.
1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly.
1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds.
1950s: Growing concern in the US for a "science gap" with the USSR motivated, encouraged funding and provided a backdrop for mechanized literature searching systems (.) and the invention of citation indexing (Eugene Garfield).
1950: The term "information retrieval" appears to have been coined by Calvin Mooers.
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master's thesis at MIT.
1955: Allen Kent joined Case Western Reserve University, and eventually became associate director of the Center for Documentation and Communications Research. That same year, Kent and colleagues published a paper in American Documentation describing the precision and recall measures as well as detailing a proposed "framework" for evaluating an IR system which included statistical sampling methods for determining the number of relevant documents not retrieved.
1958: International Conference on Scientific Information, Washington DC, included consideration of IR systems as a solution to problems identified. See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959).
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
• 1960s
early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell.
1960: Melvin Earl (Bill) Maron and John Lary Kuhns published "On relevance, probabilistic indexing, and information retrieval" in the Journal of the ACM 7(3):216–244, July 1960.
1962:
  • Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation. See: Cyril W. Cleverdon, "Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems". Cranfield College of Aeronautics, Cranfield, England, 1962.
  • Kent published Information Analysis and Retrieval.
1963:
  • Weinberg report "Science, Government and Information" gave a full articulation of the idea of a "crisis of scientific information." The report was named after Dr. Alvin Weinberg.
  • Joseph Becker and Robert M. Hayes published a text on information retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
1964:
  • Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
  • The National Bureau of Standards sponsored a symposium titled "Statistical Association Methods for Mechanized Documentation." Several highly significant papers were presented, including G. Salton's first published reference (we believe) to the SMART system.
mid-1960s:
  • National Library of Medicine developed MEDLARS (Medical Literature Analysis and Retrieval System), the first major machine-readable database and batch-retrieval system.
  • Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.
1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs.
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval.
1968:
  • Gerard Salton published Automatic Information Organization and Retrieval.
  • John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers) was the first proposal for visualization interface to an IR system.
• 1970s
early 1970s:
  • First online systems: NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT.
  • Theodor Nelson promoting concept of hypertext, published Computer Lib/Dream Machines.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." (Information Storage and Retrieval, 7(5), pp. 217–240, December 1971)
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
  • A Theory of Indexing (Society for Industrial and Applied Mathematics)
  • A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
  • A Vector Space Model for Automatic Indexing (CACM 18:11)
1978: The First ACM SIGIR conference.
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on probabilistic models.
• 1980s
1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge.
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing.
1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models.
mid-1980s: Efforts to develop end-user versions of commercial IR systems.
1985–1993: Key papers on and experimental systems for visualization interfaces. Work by Donald B. Crouch, Robert R. Korfhage, Matthew Chalmers, Anselm Spoerri and others.
1989: First World Wide Web proposals by Tim Berners-Lee at CERN.
• 1990s
1992: First TREC conference.
1997: Publication of Korfhage's Information Storage and Retrieval[4] with emphasis on visualization and multi-reference point systems.
late 1990s: Web search engines implemented many features formerly found only in experimental IR systems. Search engines become the most common and maybe best instantiation of IR models, research, and implementation.

Overview

• An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
• An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.
• Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.

Performance measures

Many different measures for evaluating the performance of information retrieval systems have been proposed. The measures require a collection of documents and a query. All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and there may be different shades of relevancy.

Precision
Precision is the fraction of the documents retrieved that are relevant to the user's information need.

\[ \text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \]

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.

Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology.

Recall
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

\[ \text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \]

In binary classification, recall is called sensitivity. So it can be looked at as the probability that a relevant document is retrieved by the query. It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough; one also needs to measure the number of non-relevant documents, for example by computing the precision.

Fall-Out
The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

\[ \text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|} \]

In binary classification, fall-out is closely related to specificity: it equals 1 − specificity. It can be looked at as the probability that a non-relevant document is retrieved by the query.

It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.

F-measure
The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score is:

\[ F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

This is also known as the F1 measure, because recall and precision are evenly weighted. The general formula for non-negative real β is:

\[ F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \]

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure, where P is precision and R is recall:

\[ E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}} \]

Their relationship is \( F_\beta = 1 - E \) where \( \alpha = \frac{1}{\beta^2 + 1} \).
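The performance measures defined above can be computed directly from sets of document identifiers. The following is a minimal sketch, not a reference implementation: the function names and the toy data are hypothetical, and returning 0.0 when a denominator is empty is an assumed convention rather than part of the definitions.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    # Assumption: an empty result list scores 0.0 instead of raising an error.
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def fall_out(retrieved, relevant, collection):
    """Fraction of the non-relevant documents in the collection that were retrieved."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0
    return len(set(retrieved) & non_relevant) / len(non_relevant)


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0


def precision_at_n(ranked, relevant, n):
    """Precision over the n topmost results of a ranked list (n must be positive)."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:n] if doc in relevant) / n


# Hypothetical toy data: ten document ids, four of them relevant.
collection = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d2", "d3", "d4"}
ranked = ["d1", "d5", "d2", "d6", "d7"]  # system output, best match first

p = precision(ranked, relevant)
r = recall(ranked, relevant)
print(p, r, fall_out(ranked, relevant, collection))
print(f_beta(p, r), f_beta(p, r, beta=2.0), precision_at_n(ranked, relevant, 3))
```

In this toy run two of the five retrieved documents are relevant, so precision is 2/5, recall is 2/4, and fall-out is 3/6 (three of the six non-relevant documents were retrieved). Sets are used because these measures ignore ranking; only P@n looks at the order of the results.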

Model types

[Figure: Categorization of IR-models (translated from German entry, original source Dominik Kuropka).]

For information retrieval to be efficient, the documents are typically transformed into a suitable representation. There are several representations. The figure illustrates the relationship of some common models, categorized according to two dimensions: the mathematical basis and the properties of the model.

First dimension: mathematical basis

Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
• Standard Boolean model
• Fuzzy retrieval

Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value.
• Vector space model
• Generalized vector space model
• (Enhanced) Topic-based Vector Space Model
• Extended Boolean model
• Latent semantic indexing aka latent semantic analysis

Probabilistic models treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query.
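As a rough illustration of probabilistic scoring, the sketch below ranks documents with a simplified binary independence model. It rests on several assumptions that are not in the text above: no relevance judgments are available, each query term is weighted by log((N − df + 0.5) / (df + 0.5)) where N is the collection size and df the number of documents containing the term, and a document's score is the sum of the weights of the query terms it contains. The result is a log-odds style score that is used only for ranking, not a calibrated probability. The function name and the five-document collection are hypothetical.

```python
import math


def bim_scores(query_terms, docs):
    """Score each document by summed log-odds term weights (higher ranks first).

    docs maps a document id to an iterable of its terms; only term presence is used.
    """
    doc_terms = {doc_id: set(terms) for doc_id, terms in docs.items()}
    n_docs = len(doc_terms)

    # One weight per query term, estimated from document frequency alone
    # (assumption: no relevance feedback; 0.5 is a smoothing constant).
    weights = {}
    for term in set(query_terms):
        df = sum(1 for terms in doc_terms.values() if term in terms)
        weights[term] = math.log((n_docs - df + 0.5) / (df + 0.5))

    # A document's score sums the weights of the query terms it contains.
    return {
        doc_id: sum(w for term, w in weights.items() if term in terms)
        for doc_id, terms in doc_terms.items()
    }


# Hypothetical five-document collection.
docs = {
    "d1": "information retrieval system".split(),
    "d2": "database system".split(),
    "d3": "web search engine".split(),
    "d4": "information overload".split(),
    "d5": "library catalog".split(),
}
scores = bim_scores(["information", "retrieval"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:2])
```

Here d1 contains both query terms and ranks first, followed by d4, which contains only "information". Note that with this weighting a term occurring in more than half of the documents receives a negative weight.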
Probabilistic theorems like Bayes' theorem are often used in these models.
• Binary Independence Model
• Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
• Uncertain inference
• Language models
• Divergence-from-randomness model
• Latent Dirichlet allocation

Machine-learned ranking models view documents as vectors of ranking features (some of which often incorporate other ranking models mentioned above) and try to find the best way to combine these features into a single relevance score by machine learning methods.

Second dimension: properties of the model

• Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables.
• Models with immanent term interdependencies allow a representation of interdependencies between terms. However the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.
• Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or sophisticated algorithms).

DIGITAL LIBRARY
• Something in between the very structured database and the unstructured Web.
• Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)
• Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.
• Nature of the collection regulates indexing somewhat.
• American Memory: http://memory.loc.gov/ammem/index.html

This is the main process involved

• Between the chaos of the Web and the strict structure of a database, the digital library contains an organized collection.
• We saw the digital collection at the Falvey library session.
• See also:
  • NSDL: www.nsdl.org
  • And the computing component, CITIDEL: citidel.villanova.edu
