INFORMATION RETRIEVAL
From Wikipedia, the free encyclopedia

Information retrieval (IR) is the science of searching for documents, for
information within documents, and for metadata about documents, as well as
that of searching relational databases and the World Wide Web. There is
overlap in the usage of the terms data retrieval, document retrieval,
information retrieval, and text retrieval, but each also has its own body
of literature, theory, praxis, and technologies. IR is interdisciplinary,
based on computer science, mathematics, library science, information
science, information architecture, cognitive psychology, linguistics, and
statistics.

Automated information retrieval systems are used to reduce what has been
called "information overload". Many universities and public libraries use
IR systems to provide access to books, journals and other documents. Web
search engines are the most visible IR applications.

Contents
1 History
   o 1.1 Timeline
2 Overview
3 Performance measures
   o 3.1 Precision
   o 3.2 Recall
   o 3.3 Fall-Out
   o 3.4 F-measure
   o 3.5 Mean Average precision
4 Model types
   o 4.1 First dimension: mathematical basis
   o 4.2 Second dimension: properties of the model

History
"But do you know that, although I have kept the diary [on a phonograph]
for months past, it never once struck me how I was going to find any
particular part of it in case I wanted to look it up?"
- Dr Seward, Bram Stoker's Dracula, 1897

The idea of using computers to search for relevant pieces of information
was popularized in the article As We May Think by Vannevar Bush in 1945.
The first automated information retrieval systems were introduced in the
1950s and 1960s. By 1970 several different techniques had been shown to
perform well on small text corpora such as the Cranfield collection
(several thousand documents). Large-scale retrieval systems, such as the
Lockheed Dialog system, came into use early in the 1970s.

In 1992, the US Department of Defense, along with the National Institute
of Standards and Technology (NIST), cosponsored the Text Retrieval
Conference (TREC) as part of the TIPSTER text program. The aim was to
support the information retrieval community by supplying the
infrastructure needed for evaluating text retrieval methodologies on a
very large text collection. This catalyzed research on methods that scale
to huge corpora. The introduction of web search engines has boosted the
need for very large scale retrieval systems even further.

The use of digital methods for storing and retrieving information has led
to the phenomenon of digital obsolescence, where a digital resource ceases
to be readable because the physical media, the reader required to read the
media, the hardware, or the software that runs on it, is no longer
available. The information is initially easier to retrieve than if it were
on paper, but is then effectively lost.

Timeline

Before the 1900s
1880s: Herman Hollerith invents the recording of data on a machine
  readable medium.
1890: Hollerith cards, key punches and tabulators used to process the
  1890 US Census data.

1940s–1950s
late 1940s: The US military confronted problems of indexing and retrieval
  of wartime scientific research documents captured from Germans.
1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly.
1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on
  a mechanized punch card-based system for searching chemical compounds.
1950s: Growing concern in the US for a "science gap" with the USSR
  motivated, encouraged funding, and provided a backdrop for mechanized
  literature searching systems and the invention of citation indexing
  (Eugene Garfield).
1950: The term "information retrieval" appears to have been coined by
  Calvin Mooers.
1951: Philip Bagley conducted the earliest experiment in computerized
  document retrieval in a master's thesis at MIT.
1955: Allen Kent joined Case Western Reserve University, and eventually
  became associate director of the Center for Documentation and
  Communications Research. That same year, Kent and colleagues published
  a paper in American Documentation describing the precision and recall
  measures as well as detailing a proposed "framework" for evaluating an
  IR system which included statistical sampling methods for determining
  the number of relevant documents not retrieved.
1958: International Conference on Scientific Information, Washington DC,
  included consideration of IR systems as a solution to problems
  identified. See: Proceedings of the International Conference on
  Scientific Information, 1958 (National Academy of Sciences, Washington,
  DC, 1959).
1959: Hans Peter Luhn published "Auto-encoding of documents for
  information retrieval."

1960s
early 1960s: Gerard Salton began work on IR at Harvard, later moved to
  Cornell.
1960: Melvin Earl (Bill) Maron and John Lary Kuhns published "On
  relevance, probabilistic indexing, and information retrieval" in the
  Journal of the ACM 7(3):216–244, July 1960.
1962:
  o Cyril W. Cleverdon published early findings of the Cranfield studies,
    developing a model for IR system evaluation. See: Cyril W. Cleverdon,
    "Report on the Testing and Analysis of an Investigation into the
    Comparative Efficiency of Indexing Systems". Cranfield College of
    Aeronautics, Cranfield, England, 1962.
  o Kent published Information Analysis and Retrieval.
1963:
  o Weinberg report "Science, Government and Information" gave a full
    articulation of the idea of a "crisis of scientific information." The
    report was named after Dr. Alvin Weinberg.
  o Joseph Becker and Robert M. Hayes published text on information
    retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage
    and retrieval: tools, elements, theories. New York, Wiley (1963).
1964:
  o Karen Spärck Jones finished her thesis at Cambridge, Synonymy and
    Semantic Classification, and continued work on computational
    linguistics as it applies to IR.
  o The National Bureau of Standards sponsored a symposium titled
    "Statistical Association Methods for Mechanized Documentation."
    Several highly significant papers, including G. Salton's first
    published reference (we believe) to the SMART system.
mid-1960s:
  o National Library of Medicine developed MEDLARS (Medical Literature
    Analysis and Retrieval System), the first major machine-readable
    database and batch-retrieval system.
  o Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.
1966: Don Swanson was involved in studies at University of Chicago on
  Requirements for Future Catalogs.
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the
  MEDLARS system and published the first edition of his text on
  information retrieval.
1968:
  o Gerard Salton published Automatic Information Organization and
    Retrieval.
  o John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of
    Information Storage and Retrieval..." outlined the vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE
  Transactions on Computers) was the first proposal for visualization
  interface to an IR system.

1970s
Early 1970s:
  o First online systems: NLM's AIM-TWX, MEDLINE; Lockheed's Dialog;
    SDC's ORBIT.
  o Theodor Nelson promoting concept of hypertext, published Computer
    Lib/Dream Machines.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use
  of hierarchic clustering in information retrieval", which articulated
  the "cluster hypothesis." (Information Storage and Retrieval, 7(5), pp.
  217–240, December 1971)
1975: Three highly influential publications by Salton fully articulated
  his vector processing framework and term discrimination model:
  o A Theory of Indexing (Society for Industrial and Applied Mathematics)
  o A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
  o A Vector Space Model for Automatic Indexing (CACM 18:11)
1978: The First ACM SIGIR conference.
1979: C. J. van Rijsbergen published Information Retrieval
  (Butterworths). Heavy emphasis on probabilistic models.

1980s
1980: First international ACM SIGIR conference, joint with British
  Computer Society IR group in Cambridge.
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed
  the ASK (Anomalous State of Knowledge) viewpoint for information
  retrieval. This was an important concept, though their automated
  analysis tool proved ultimately disappointing.
1983: Salton (and Michael J. McGill) published Introduction to Modern
  Information Retrieval (McGraw-Hill), with heavy emphasis on vector
  models.
mid-1980s: Efforts to develop end-user versions of commercial IR systems.
1985–1993: Key papers on and experimental systems for visualization
  interfaces. Work by Donald B. Crouch, Robert R. Korfhage, Matthew
  Chalmers, Anselm Spoerri and others.
1989: First World Wide Web proposals by Tim Berners-Lee at CERN.

1990s
1992: First TREC conference.
1997: Publication of Korfhage's Information Storage and Retrieval with
  emphasis on visualization and multi-reference point systems.
late 1990s: Web search engines implementation of many features formerly
  found only in experimental IR systems. Search engines become the most
  common and maybe best instantiation of IR models, research, and
  implementation.

Overview
An information retrieval process begins when a user enters a query into
the system. Queries are formal statements of information needs, for
example search strings in web search engines. In information retrieval a
query does not uniquely identify a single object in the collection.
Instead, several objects may match the query, perhaps with different
degrees of relevancy.

An object is an entity that is represented by information in a database.
User queries are matched against the database information. Depending on
the application the data objects may be, for example, text documents,
images, audio, mind maps or videos. Often the documents themselves are
not kept or stored directly in the IR system, but are instead represented
in the system by document surrogates or metadata.

Most IR systems compute a numeric score on how well each object in the
database matches the query, and rank the objects according to this value.
The top ranking objects are then shown to the user. The process may then
be iterated if the user wishes to refine the query.

Performance measures
Many different measures for evaluating the performance of information
retrieval systems have been proposed. The measures require a collection
of documents and a query. All common measures described here assume a
ground truth notion of relevancy: every document is known to be either
relevant or non-relevant to a particular query. In practice queries may
be ill-posed and there may be different shades of relevancy.

Precision
Precision is the fraction of the documents retrieved that are relevant to
the user's information need.

In binary classification, precision is analogous to positive predictive
value. Precision takes all retrieved documents into account. It can also
be evaluated at a given cut-off rank, considering only the topmost
results returned by the system. This measure is called precision at n or
P@n.

Note that the meaning and usage of "precision" in the field of
Information Retrieval differs from the definition of accuracy and
precision within other branches of science and technology.

Recall
Recall is the fraction of the documents that are relevant to the query
that are successfully retrieved.

In binary classification, recall is called sensitivity. So it can be
looked at as the probability that a relevant document is retrieved by the
query. It is trivial to achieve recall of 100% by returning all documents
in response to any query. Therefore recall alone is not enough; one also
needs to measure the number of non-relevant documents retrieved, for
example by computing the precision.

Fall-Out
The proportion of non-relevant documents that are retrieved, out of all
non-relevant documents available:

  fall-out = |{non-relevant documents} ∩ {retrieved documents}| / |{non-relevant documents}|

In binary classification, fall-out is closely related to specificity (it
equals 1 − specificity). It can be looked at as the probability that a
non-relevant document is retrieved by the query.

It is trivial to achieve fall-out of 0% by returning zero documents in
response to any query.

F-measure
The weighted harmonic mean of precision and recall, the traditional
F-measure or balanced F-score, is:

  F = 2 · P · R / (P + R)

where P is precision and R is recall. This is also known as the F1
measure, because recall and precision are evenly weighted.

The general formula for non-negative real β is:

  Fβ = (1 + β²) · P · R / (β² · P + R)

Two other commonly used F measures are the F2 measure, which weights
recall twice as much as precision, and the F0.5 measure, which weights
precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures
the effectiveness of retrieval with respect to a user who attaches β
times as much importance to recall as precision". It is based on van
Rijsbergen's effectiveness measure E = 1 − (1 / (α / P + (1 − α) / R)).
Their relationship is Fβ = 1 − E where α = 1 / (β² + 1).
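
The measures in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the retrieved results and the ground-truth relevance judgements are available as sets of document IDs, and the function names, the `collection` set, and the toy data are all made up for the example. Returning 0.0 for empty denominators is a convention chosen here, not something the text prescribes.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # convention assumed here for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # convention assumed here when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


def fall_out(retrieved, relevant, collection):
    """Fraction of the non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    if not non_relevant:
        return 0.0
    return len(retrieved & non_relevant) / len(non_relevant)


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    denominator = beta ** 2 * p + r
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / denominator


def effectiveness_e(p, r, alpha):
    """van Rijsbergen's E = 1 - 1 / (alpha / P + (1 - alpha) / R)."""
    return 1 - 1 / (alpha / p + (1 - alpha) / r)


def precision_at_n(ranked, relevant, n):
    """P@n: precision over only the top n results of a ranked list."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


# Hypothetical toy data: ten documents, four of them relevant.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}
ranked = [1, 5, 2, 6, 7]      # system output, best first
retrieved = set(ranked)

p = precision(retrieved, relevant)              # 2 of 5 retrieved are relevant
r = recall(retrieved, relevant)                 # 2 of 4 relevant were found
fo = fall_out(retrieved, relevant, collection)  # 3 of 6 non-relevant retrieved

print("precision:", p)
print("recall:", r)
print("fall-out:", fo)
print("F1:", f_beta(p, r))
print("F2:", f_beta(p, r, beta=2.0))
print("P@3:", precision_at_n(ranked, relevant, 3))
```

With this toy data, passing the whole collection as `retrieved` pushes recall to 1.0, and passing an empty set pushes fall-out to 0.0, which is why the measures are normally reported together rather than alone.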

Model types
For information retrieval to be efficient, the documents are typically
transformed into a suitable representation. There are several
representations. The picture illustrates the relationship of some common
models. In the picture, the models are categorized according to two
dimensions: the mathematical basis and the properties of the model.

Figure: Categorization of IR-models (translated from German entry,
original source Dominik Kuropka).

First dimension: mathematical basis
Set-theoretic models represent documents as sets of words or phrases.
Similarities are usually derived from set-theoretic operations on those
sets. Common models are:
  o Standard Boolean model
  o Fuzzy retrieval

Algebraic models represent documents and queries usually as vectors,
matrices, or tuples. The similarity of the query vector and document
vector is represented as a scalar value.
  o Vector space model
  o Generalized vector space model
  o (Enhanced) Topic-based Vector Space Model
  o Extended Boolean model
  o Latent semantic indexing aka latent semantic analysis

Probabilistic models treat the process of document retrieval as a
probabilistic inference. Similarities are computed as probabilities that
a document is relevant for a given query.
Probabilistic theorems like Bayes' theorem are often used in these
models.
  o Binary Independence Model
  o Probabilistic relevance model, on which the Okapi (BM25) relevance
    function is based
  o Uncertain inference
  o Language models
  o Divergence-from-randomness model
  o Latent Dirichlet allocation

Machine-learned ranking models view documents as vectors of ranking
features (some of which often incorporate other ranking models mentioned
above) and try to find the best way to combine these features into a
single relevance score by machine learning methods.

Second dimension: properties of the model
Models without term-interdependencies treat different terms/words as
independent. This fact is usually represented in vector space models by
the orthogonality assumption of term vectors or in probabilistic models
by an independency assumption for term variables.

Models with immanent term interdependencies allow a representation of
interdependencies between terms. However the degree of the
interdependency between two terms is defined by the model itself. It is
usually directly or indirectly derived (e.g. by dimensional reduction)
from the co-occurrence of those terms in the whole set of documents.

Models with transcendent term interdependencies allow a representation of
interdependencies between terms, but they do not specify how the
interdependency between two terms is defined. They rely on an external
source for the degree of interdependency between two terms (for example a
human or sophisticated algorithms).

DIGITAL LIBRARY
Something in between the very structured database and the unstructured
Web.
Content is controlled. Someone makes the entries. (Maybe a lot of people
make the entries, but there are rules for admission.)
Searching and browsing are somewhat open, not controlled by fixed keys
and anticipated queries.
Nature of the collection regulates indexing somewhat.
American Memory
http://memory.loc.gov/ammem/index.html
This is the main process involved
Between the chaos of the Web and the
strict structure of a database, the digital
library contains an organized collection.
We saw the digital collection at the
Falvey library session.
See also:
NSDL www.nsdl.org
And the computing component, CITIDEL: citidel.villanova.edu