Models of Modern IR Systems
Chapter Four - Part I
Template for Search Engine Evaluation
Task
Cover page (course title, group members' names, search engines' names)
Introduction (only one page)
How and why you selected the search engines for the comparison,
with a brief description of each search engine
Comparison (only one page)
Use the criteria table given below
Conclusion (only one page)
Discussion of the key findings you want to emphasize
Reference
File naming convention: team leader's name followed by the first letter of his/her
father's name, then -UG-IR-SEE
Example: tibebeb-UG-IR-SEE
Criteria                        SE1    SE2    SE3
Size of index database
Searching options
Stemming technique
Ranking approach
Similarity measure approach
User interface
Relevance feedback mechanism
Others
Objectives
Understanding retrieval process
Be familiar with the basic retrieval models
Get hands-on experience with (simulated) retrieval of
items
Topics
Overview
Boolean/Logical Model
Vector Space Model (VSM)
Modeling of Modern IR Systems
Retrieval based on index terms assumes the
semantics of the documents and the user
information need can be expressed through sets of
index terms.
A central problem in IR is predicting which
documents are relevant and which are not.
Such a decision depends on the ranking algorithm
implemented.
Cont…
Ranking is an ordering of the documents retrieved
that (hopefully) reflects the relevance of the
documents to the user query.
Ranking is based on fundamental premises
regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Cont…
Each distinct set of premises (regarding document
relevance) leads to a distinct IR model.
Model
is an idealization or abstraction of an actual
process (here, retrieval).
It represents something that exists or is
planned in the real world and that is in some way
too complex or too large for us to understand
as it stands.
Cont…
A model is, in some way, simplified or reduced in size,
scope or scale.
It helps us understand the system better.
It is also the best, most scientific way to study reality.
Thus, a model is a simplified representation of a complex
reality, usually built for the purpose of understanding that
reality, that retains all the features of that reality
necessary for the current task or problem.
Cont…
A model may be conceptual, like a mathematical model,
which is full of equations and is used to study the
properties of a process, draw conclusions, and make
predictions.
Statistical models, on the other hand, represent repetitive
processes, make predictions about the frequencies of
interesting events, and use probability as the fundamental
tool.
Retrieval model?
What is a retrieval model?
It is a model that describes the computational
process (e.g. how documents are ranked) and
human process (e.g. the information need,
interaction).
Note that how documents or indexes are stored
is an implementation detail, not part of the model.
In relation to this, the retrieval variables are queries,
documents, terms, relevance judgments, users,
information needs, ...
What an IR Model includes
Two elements
The retrieval mechanism: used to match query with a set
of documents
The ways in which the user’s information need can be
formulated as a query that can be searched by that
mechanism
Thus, a retrieval model specifies the details of:
Document representation
Query representation
Retrieval function
Building a Model
To build a model, we first need to think about the
representations of the documents and of the user
information need.
Given these representations, the next step is to conceive
a framework in which they can be modeled.
This framework should also provide the intuition for
constructing a ranking function.
In the Boolean model, the framework is composed of sets
of documents and the standard operations on sets.
For the vector space model, the framework is composed
of a t-dimensional vector space and standard linear
algebra operations on vectors.
Cont…
The discussion so far provides the background for
the two basic information retrieval models,
namely
Boolean retrieval models and
Vector space models (VSM)
Boolean/Logical model
It is a simple retrieval model that is concerned more with retrieval
than with document representation, and it is based on set
theory and Boolean algebra.
Documents and queries are represented as sets of index
terms
Provides a framework, which is easy to grasp by a
common user of an IR system.
Attracted great attention in past years and was adopted
by many of the early commercial bibliographic systems.
Cont…
It is a basis for the majority of DBMS and conventional
IR systems.
It is the most common exact-match model.
Queries are logical expressions with document
features as operands; that is, query terms are
linked by the logical operators AND, OR and NOT.
A document is an object, or a set, consisting of
terms.
Terms are features of the objects (documents).
The search engine retrieves those documents
that satisfy the logical constraints of the query.
Cont…
Example
Doc 1: Information storage and retrieval
Doc 2: Expert system and information retrieval
systems
Doc 3: Information processing and management
Doc 4: Information retrieval in archives
Cont…
Index term      Documents
Information     1, 2, 3, 4
Storage         1
Retrieval       1, 2, 4
System          2
Processing      3
Management      3
Archives        4
Suppose our query is: information AND retrieval.
Which documents will be retrieved under the Boolean
model?
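One way to check the answer is to store the index table as a small inverted index and intersect the posting sets of the query terms. This is only a minimal sketch: the names `index` and `boolean_and` are hypothetical, and lowercasing the terms is an assumption made for the illustration.

```python
# Inverted index from the table above: term -> set of document numbers.
index = {
    "information": {1, 2, 3, 4},
    "storage": {1},
    "retrieval": {1, 2, 4},
    "system": {2},
    "processing": {3},
    "management": {3},
    "archives": {4},
}


def boolean_and(index, *terms):
    """Return the documents that contain every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


result = boolean_and(index, "information", "retrieval")
print(sorted(result))  # [1, 2, 4]
```

Documents 1, 2 and 4 contain both terms, so only document 3 is left out.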
Cont…
The basic assumption is that there is a domain, and
both the author of a document and its readers
belong to that same domain. The domain is described
by a set of index terms; at any one time there are t of them.
This means that any document in the
domain is written in these terms.
Relevance – matching
Matching as a concept is the degree of similarity
between D and Q.
The degree of similarity determines the degree of
closeness between D and Q.
If there is more sharing between query terms and
document terms, the author and the user are
talking about the same thing.
Thus, by taking the intersection, what we call similarity
can be captured mathematically.
Cont…
Boolean model’s matching considers that index terms
are either present or absent in a document.
Thus, the index term weights are assumed to be all
binary.
That is, w_ij ∈ {0, 1}
A query q is composed of index terms linked by the
three connectives NOT, AND and OR, for example
q = ka ∧ (kb ∨ ¬kc)
Boolean expressions represent a request to determine which
documents contain (or do not contain) a given set of keywords.
A query searches a set of documents to determine
their content.
Boolean model (Document, Term, Weight, Matching)
Document (how a document is viewed in BM)
Is an object, a set consisting of terms
That is, documents are sets of terms
An instance of an object (i.e., a document or query) is created
when we assign values (concepts) to the features
Term (how a term is viewed in BM)
Terms are features of the objects (documents)
The terms come from the vocabulary of the subject
Documents are represented by terms, which together represent the document
The terms are the things we use to describe concepts in a
particular domain
The vocabulary grows when new terms are introduced
Cont…
Weight
Terms are either present or absent in documents
Thus, the index term weight variables are all binary, i.e.,
w_ij ∈ {0, 1}
Matching
Degree of similarity between D and Q. If there is more sharing between
query terms and document terms, the author and the user are talking about the
same thing
By taking the intersection, what we call similarity can be captured
mathematically in the Boolean model
Example
q = (1, 1, 0, 0, 1, 0, 0), d = (1, 0, 0, 1, 1, 0, 0), S(q, d) = 2
Thus, intersection
Takes operands and returns degree of commonness
Is a function that counts the number of matches
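The intersection count in the example can be sketched in a few lines. The function name `match_count` is a hypothetical name chosen for this illustration; it assumes the query and the document are binary vectors over the same t index terms.

```python
def match_count(q, d):
    """Count the positions where both binary vectors hold a 1."""
    if len(q) != len(d):
        raise ValueError("query and document vectors must have the same length")
    return sum(qi & di for qi, di in zip(q, d))


q = (1, 1, 0, 0, 1, 0, 0)
d = (1, 0, 0, 1, 1, 0, 0)
print(match_count(q, d))  # 2
```

The two vectors share a 1 in the first and fifth positions, which gives S(q, d) = 2.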
How do you explain the
essence of relevance in IRS
designed using Boolean model?
Example:
1. Query: Find all documents containing “information”
Boolean expression
Information
Result (means)
A set whose elements are all documents containing
the pattern “information”
Cont…
2. Query: Find all documents that do not contain
“information”
This is a query which attempts to find documents
that do not contain a particular pattern
Boolean expression (representation)
NOT information
Result
A set whose elements are all documents that
do not contain the pattern “information”
Cont…
Most queries search for more than one term
Find all documents containing “information” and
“retrieval”
Find all documents containing “information” or
“retrieval” (or both)
Find all documents containing “information” or
“retrieval”, but not both
Each of the three queries illustrates a particular
concept that may form a Boolean expression, namely
Conjunction, Disjunction, Exclusive disjunction
Cont…
Boolean expressions may be formed from other
Boolean expressions to yield complex structure
Query
Find all documents containing both “information” and
“retrieval”, or not containing both “retrieval”
and “science”
Boolean expression
(information AND retrieval) OR NOT (retrieval
AND science)
Parentheses avoid ambiguity.
Cont…
Each portion of a Boolean expression yields a set of
documents.
These portions are evaluated separately.
Combining the terms of Boolean expressions is simple
and done as follows
Let U represent the set of all docs in the collection
d1 and d2 represent those docs that contain patterns
p1 and p2 respectively.
The following list defines how to evaluate
Boolean operators in terms of
the sets
U − d1 is the set of all docs not containing p1
(NOT)
d1 ∩ d2 is the set of all docs containing both
p1 and p2 (AND)
d1 ∪ d2 is the set of all docs containing
either p1 or p2 (OR)
(d1 ∪ d2) − (d1 ∩ d2) is the set of all docs
containing either p1 or p2, but not both (XOR)
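These four definitions map directly onto set operations. The sketch below uses a made-up five-document collection (the document names and the memberships of `d1` and `d2` are assumptions for illustration only).

```python
# Hypothetical collection; U is the set of all docs.
U = {"doc1", "doc2", "doc3", "doc4", "doc5"}
d1 = {"doc1", "doc2", "doc3"}  # docs containing pattern p1
d2 = {"doc3", "doc4"}          # docs containing pattern p2

not_p1 = U - d1                    # NOT p1
p1_and_p2 = d1 & d2                # p1 AND p2
p1_or_p2 = d1 | d2                 # p1 OR p2
p1_xor_p2 = (d1 | d2) - (d1 & d2)  # p1 XOR p2

print(sorted(not_p1))     # ['doc4', 'doc5']
print(sorted(p1_and_p2))  # ['doc3']
print(sorted(p1_or_p2))   # ['doc1', 'doc2', 'doc3', 'doc4']
print(sorted(p1_xor_p2))  # ['doc1', 'doc2', 'doc4']
```

Python's built-in symmetric difference, `d1 ^ d2`, gives the same set as the XOR line.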
Cont…
Thus, in the Boolean model,
the use of AND requires that both terms that it
connects be present in the retrieved documents;
the use of OR requires that at least one of the terms
be present.
This is an inclusive use of OR, meaning that it is
acceptable for both of the terms to be present.
Cont…
If an exclusive use of OR is desired (one term or the
other, but not both), the construction is more
complex:
(A AND NOT B) OR (B AND NOT A)
or
(A OR B) AND NOT (A AND B)
NOT requires that the specified term be absent from
any retrieved document
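That the two constructions for exclusive OR are equivalent can be checked by running through all four truth assignments. A small sketch follows; the function names are hypothetical.

```python
from itertools import product


def xor_form_1(a, b):
    """(A AND NOT B) OR (B AND NOT A)"""
    return (a and not b) or (b and not a)


def xor_form_2(a, b):
    """(A OR B) AND NOT (A AND B)"""
    return (a or b) and not (a and b)


for a, b in product([False, True], repeat=2):
    # Both forms must agree with each other and with "exactly one is true".
    assert xor_form_1(a, b) == xor_form_2(a, b) == (a != b)
    print(a, b, xor_form_1(a, b))
```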
Exercise: Consider a set of five docs and assume that
they contain the terms shown in the table
Doc. Terms
D1 Algorithm, information, retrieval
D2 Retrieval, science
D3 Algorithm, information, science
D4 Pattern, retrieval, science
D5 Science, algorithm
Find documents retrieved by the following expressions
• Information AND retrieval
• Information OR retrieval
• (Information AND Retrieval) OR NOT (Retrieval AND Science)
Solution
Information AND retrieval
{D1, D3} ∩ {D1, D2, D4} = {D1}
Information OR retrieval
{D1, D3} ∪ {D1, D2, D4} = {D1, D2, D3, D4}
(Information AND Retrieval) OR NOT (Retrieval
AND Science)
{D1} ∪ (U − {D2, D4}) = {D1} ∪ {D1, D3, D5} = {D1, D3, D5}
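The solution can be verified mechanically. The sketch below is one possible check, not part of the exercise itself; the names `docs` and `containing` are hypothetical, and the terms are lowercased for convenience.

```python
docs = {
    "D1": {"algorithm", "information", "retrieval"},
    "D2": {"retrieval", "science"},
    "D3": {"algorithm", "information", "science"},
    "D4": {"pattern", "retrieval", "science"},
    "D5": {"science", "algorithm"},
}
U = set(docs)  # all document names


def containing(term):
    """Return the set of documents whose term set includes `term`."""
    return {name for name, terms in docs.items() if term in terms}


info = containing("information")
retr = containing("retrieval")
sci = containing("science")

q1 = info & retr                       # Information AND retrieval
q2 = info | retr                       # Information OR retrieval
q3 = (info & retr) | (U - (retr & sci))  # (... AND ...) OR NOT (... AND ...)

print(sorted(q1))  # ['D1']
print(sorted(q2))  # ['D1', 'D2', 'D3', 'D4']
print(sorted(q3))  # ['D1', 'D3', 'D5']
```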
Advantages of the Boolean
model
Simplicity
Is still a dominant model in
commercial database systems
Provides a good starting point for those
new to the field
Limitations (Drawbacks) of the
Boolean Model
In pure Boolean, there is no good way to weight
terms for significance (thus it produces only a binary
partition).
A term is either present or absent. Thus, the
user has little control over how important a
given term is to the query.
That is, its retrieval strategy is based on
binary decision criteria.
There is no weighting for document terms and no
weighting for query terms.
Cont…
That is, the concept of significance is totally ignored.
The representation is only binary.
The system is not flexible enough to represent weights, which are
considered very important in IR.
Reconsidering index term weights brings us to the
vector model.
Predicts that each document is either relevant or
non-relevant.
Is a simple partition: those that match the query and
those that do not.
Divides the collection into two subsets only,
retrieved and non-retrieved.
Cont…
There is no notion of partial matching to the query
condition.
For example, let dj be a document whose term
vector over (ka, kb, kc) is
dj = (0, 1, 0)
Document dj includes the index term kb but is
considered non-relevant to the query
q = ka ∧ (kb ∨ ¬kc)
This prevents good retrieval performance.
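The all-or-nothing behaviour is easy to see in code. The sketch assumes the example query reads q = ka ∧ (kb ∨ ¬kc) and that documents are binary vectors over (ka, kb, kc); the function name `q` and the variable `matching` are hypothetical.

```python
from itertools import product


def q(ka, kb, kc):
    """q = ka AND (kb OR NOT kc), evaluated on binary term weights."""
    return bool(ka and (kb or not kc))


dj = (0, 1, 0)
print(q(*dj))  # False: dj contains kb, yet it is not retrieved

# Every binary vector the query accepts; all others are rejected outright.
matching = [v for v in product([0, 1], repeat=3) if q(*v)]
print(matching)  # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]
```

A document either falls in the accepted list or it does not; there is no score that would credit dj for containing kb.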
Cont…
In the Boolean model, no ranking of the documents is
provided (absence of a grading scale).
As all documents are considered equal, there is no
ordering of the retrieved set.
Retrieved documents are generally not ranked.
All retrieved documents are presumed to be equally useful.
There is no mechanism to show the relative importance of
the different components of a query.
Cont…
Query formulation using the Boolean
operators is difficult.
Boolean expressions have precise semantics.
Thus, it is not simple to translate an information
need into a Boolean expression.
An information need has to be translated into a
Boolean expression, which most users find awkward.
Cont…
To answer sophisticated queries, we need to know
more about Boolean logic.
We also need good knowledge of how to represent
queries in Boolean logic, which presumes knowledge
of the documents, the queries (user’s needs) and so on.
As a consequence, there is a
need for a trained intermediary, which creates
another problem: the problem of understanding.
Instead of you, somebody else does the translation
on your behalf.
Cont…
The Boolean model frequently returns either too few or too
many documents in response to a user query,
as it is very difficult to define the user’s need precisely at the
beginning.
That is, exact matching may lead to retrieval of too few or
too many documents (main problem).
This shows that the user has very little control over the size of the output of a
particular query.
That is, the size of the retrieved set can hardly be controlled.
Cont…
“NOT”, for instance, retrieves every document that does
not contain a specific term.
A query such as ‘NOT aardvark’ runs the risk of retrieving virtually
the entire database.
Another point is that the separation between retrieved and
non-retrieved documents is too strict. That means:
For q = t1 ∧ t2 ∧ t3, documents containing two of the
terms will be rejected, as well as those containing none.
Analogously, for q = t1 ∨ t2 ∨ t3, there is no ordering within
the retrieved documents.
Generally, it has poor retrieval quality, and its main problem
is the inability to recognize partial matches, which
frequently leads to poor performance.
Exercise
Given the following four documents with the following
contents:
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
What are the relevant documents retrieved for the
queries:
Q1 = “information retrieval”
Q2 = “information ¬computer”
The Boolean Model: Example
• Given the following, determine the documents retrieved by a Boolean
model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
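The same answer can be reproduced with a small recursive evaluator. This is a sketch under two assumptions: the query is K1 ∧ (K2 ∨ ¬K3), and queries are written as nested tuples, a representation (along with the name `evaluate`) invented for this illustration.

```python
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}


def evaluate(expr, docs):
    """Evaluate a nested-tuple Boolean query; return the matching doc names."""
    if isinstance(expr, str):  # a bare index term
        return {name for name, terms in docs.items() if expr in terms}
    op, *args = expr
    results = [evaluate(arg, docs) for arg in args]
    if op == "NOT":
        return set(docs) - results[0]  # complement within the collection
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    raise ValueError(f"unknown operator: {op}")


query = ("AND", "K1", ("OR", "K2", ("NOT", "K3")))
answer = evaluate(query, docs)
print(sorted(answer))  # ['D1', 'D2', 'D6']
```

Each sub-expression yields a set of documents, and the operators only combine those sets, which mirrors the evaluation rules given earlier.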
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and
find the relevant documents retrieved by the Boolean model for the given
query
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold silver truck”
• Find the relevant documents for the queries (use AND, OR)
(a) gold delivery
(b) ship gold
(c) silver truck
Use the table below for the term-document matrix
        arrive  damage  deliver  fire  gold  silver  ship  truck
D1
D2
D3
query
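A possible way to build the matrix programmatically is sketched below. The stop-word list and the hand-written stem map are assumptions chosen so that the index terms match the table header; a real system would use a proper stemmer. All names (`STOP_WORDS`, `STEMS`, `index_terms`, `match_all`, `match_any`) are hypothetical.

```python
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
STOP_WORDS = {"of", "in", "a"}                      # assumed stop words
STEMS = {"shipment": "ship", "damaged": "damage",   # assumed stem map
         "delivery": "deliver", "arrived": "arrive"}
vocabulary = ["arrive", "damage", "deliver", "fire",
              "gold", "silver", "ship", "truck"]    # order of the table header


def index_terms(text):
    """Lowercase, drop stop words, and map each word to its stem."""
    return {STEMS.get(w, w) for w in text.lower().split() if w not in STOP_WORDS}


doc_terms = {name: index_terms(text) for name, text in docs.items()}
# Binary term-document matrix: one row per document, one column per term.
matrix = {name: [int(t in terms) for t in vocabulary]
          for name, terms in doc_terms.items()}


def match_all(query):
    """AND: documents containing every query term."""
    return {n for n, terms in doc_terms.items() if index_terms(query) <= terms}


def match_any(query):
    """OR: documents containing at least one query term."""
    return {n for n, terms in doc_terms.items() if index_terms(query) & terms}


for name, row in matrix.items():
    print(name, row)
print(sorted(match_all("silver truck")))  # ['D2']
print(sorted(match_any("silver truck")))  # ['D2', 'D3']
```

Queries (a) and (b) can be run the same way by passing their text to `match_all` or `match_any`.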
Next
On Vector Space Model