Lecture 1: Introduction and the Boolean Model
Information Retrieval
Computer Science Tripos Part II
Ronan Cummins
Natural Language and Information Processing (NLIP) Group
[email protected]
2016
Adapted from Simone Teufel’s original slides
1
Overview
1 Motivation
Definition of “Information Retrieval”
IR: beginnings to now
2 First Boolean Example
Term-Document Incidence matrix
The inverted index
Processing Boolean Queries
Practicalities of Boolean Search
What is Information Retrieval?
Manning et al, 2008:
Information retrieval (IR) is finding material . . . of an unstructured
nature . . . that satisfies an information need from within large
collections . . . .
2
Document Collections
IR in the 17th century: Samuel Pepys, the famous English diarist,
subject-indexed his treasured 1000+ books library with key words.
5
What we mean here by document collections
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature . . . that satisfies an information need
from within large collections (usually stored on computers).
Document Collection: text units we have built an IR system
over.
Usually documents
But could be
memos
book chapters
paragraphs
scenes of a movie
turns in a conversation...
Lots of them
7
IR Basics
A query is submitted to an IR system, which searches a document collection and returns a set of relevant documents.
8
IR Basics
On the web: a query is submitted to an IR system, which searches web pages and returns a set of relevant web pages.
9
What is Information Retrieval?
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature . . . that satisfies an information need
from within large collections (usually stored on computers).
10
Structured vs Unstructured Data
Unstructured data means that a formal, semantically overt,
easy-for-computer structure is missing.
In contrast to the rigidly structured data used in DB style
searching (e.g. product inventories, personnel records)
SELECT *
FROM business_catalogue
WHERE category = 'florist'
AND city_zip = 'cb1'
This does not mean that there is no structure in the data
Document structure (headings, paragraphs, lists. . . )
Explicit markup formatting (e.g. in HTML, XML. . . )
Linguistic structure (latent, hidden)
11
Information Needs and Relevance
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
An information need is the topic about which the user desires
to know more.
A query is what the user conveys to the computer in an
attempt to communicate the information need.
A document is relevant if the user perceives that it contains
information of value with respect to their personal information
need.
12
Types of information needs
Manning et al, 2008:
Information retrieval (IR) is finding material . . . of an unstructured
nature . . . that satisfies an information need from within large
collections . . . .
Known-item search
Precise information seeking search
Open-ended search (“topical search”)
13
Information scarcity vs. information abundance
Information scarcity problem (or needle-in-haystack problem):
hard to find rare information
Lord Byron’s first words? At age 3? A long sentence to the nurse,
in perfect English?
. . . when a servant had spilled an urn of hot coffee over his legs, he replied to
the distressed inquiries of the lady of the house, ’Thank you, madam, the
agony is somewhat abated.’ [not Lord Byron, but Lord Macaulay]
Information abundance problem (for more clear-cut
information needs): redundancy of obvious information
What is toxoplasmosis?
14
Relevance
Manning et al, 2008:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
Are the retrieved documents
about the target subject
up-to-date?
from a trusted source?
satisfying the user’s needs?
How should we rank documents in terms of these factors?
More on this in a lecture soon
15
How well has the system performed?
The effectiveness of an IR system (i.e., the quality of its search
results) is determined by two key statistics about the system’s
returned results for a query:
Precision: What fraction of the returned results are relevant to
the information need?
Recall: What fraction of the relevant documents in the
collection were returned by the system?
What is the best balance between the two?
Easy to get perfect recall: just retrieve everything
Easy to get good precision: retrieve only the most relevant
There is much more to say about this – lecture 6
16
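For reference, the standard definitions (following MRS), together with a purely hypothetical worked example:
Precision = #(relevant items retrieved) / #(retrieved items)
Recall    = #(relevant items retrieved) / #(relevant items)
E.g., if a system returned 10 documents of which 4 are relevant, and the collection contains 8 relevant documents in total, then Precision = 4/10 = 0.4 and Recall = 4/8 = 0.5.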
IR today
Web search
Search space: billions of documents on millions of computers
issues: spidering; efficient indexing and search; malicious
manipulation to boost search engine rankings
Link analysis covered in Lecture 8
Enterprise and institutional search
e.g. a company’s documentation, patents, research articles
often domain-specific
Centralised storage; dedicated machines for search.
Most prevalent IR evaluation scenario: US intelligence analyst’s
searches
Personal information retrieval (email, personal documents)
e.g., Mac OS X Spotlight; Windows’ Instant Search
Issues: different file types; maintenance-free, lightweight to run
in background
17
A short history of IR
(Timeline, reconstructed from the figure:)
1945: Bush’s memex
1950s: term “IR” coined by Calvin Mooers; literature-searching systems; evaluation by precision and recall (Alan Kent)
1960s: Cranfield experiments; Boolean IR
1960s–70s: Salton; vector space model (VSM); SMART
1990s: TREC; PageRank
2000s: multimedia and multilingual retrieval (CLEF); recommendation systems
(The slide also shows a sketch of precision and recall plotted against the number of items retrieved.)
18
IR for non-textual media
19
Similarity Searches
20
Areas of IR
“Ad hoc” retrieval and classification (lectures 1-5)
web retrieval (lecture 8)
Support for browsing and filtering document collections:
Evaluation (lecture 6)
Clustering (lecture 7)
Further processing a set of retrieved documents, e.g., by using
natural language processing
Information extraction
Summarisation
Question answering
21
Overview
1 Motivation
Definition of “Information Retrieval”
IR: beginnings to now
2 First Boolean Example
Term-Document Incidence matrix
The inverted index
Processing Boolean Queries
Practicalities of Boolean Search
Boolean Retrieval
In the Boolean retrieval model we can pose any query in the
form of a Boolean expression of terms
i.e., one in which terms are combined with the operators AND,
OR, and NOT.
Shakespeare example
22
Brutus AND Caesar AND NOT Calpurnia
Which plays of Shakespeare contain the words Brutus and
Caesar, but not Calpurnia?
Naive solution: linear scan through all text – “grepping”
In this case, this works OK (Shakespeare’s Collected Works
contains fewer than 1M words).
But in the general case, with much larger text collections, we
need to index.
Indexing is an offline operation that collects data about which
words occur in a text, so that at search time you only have to
access the precompiled index.
23
The term-document incidence matrix
Main idea: record for each document whether it contains each
word out of all the different words Shakespeare used (about 32K).
           Antony and  Julius  The      Hamlet  Othello  Macbeth
           Cleopatra   Caesar  Tempest
Antony     1           1       0        0       0        1
Brutus     1           1       0        1       0        0
Caesar     1           1       0        1       1        1
Calpurnia  0           1       0        0       0        0
Cleopatra  1           0       0        0       0        0
mercy      1           0       1        1       1        1
worser     1           0       1        1       1        0
...
Matrix element (t, d) is 1 if the play in column d contains the
word in row t, 0 otherwise.
24
Query “Brutus AND Caesar AND NOT Calpurnia”
We compute the results for our query as the bitwise AND between
vectors for Brutus, Caesar and complement (Calpurnia):
           Antony and  Julius  The      Hamlet  Othello  Macbeth
           Cleopatra   Caesar  Tempest
Antony     1           1       0        0       0        1
Brutus     1           1       0        1       0        0
Caesar     1           1       0        1       1        1
¬Calpurnia 1           0       1        1       1        1
Cleopatra  1           0       0        0       0        0
mercy      1           0       1        1       1        1
worser     1           0       1        1       1        0
AND        1           0       0        1       0        0
Bitwise AND returns two documents, “Antony and Cleopatra” and
“Hamlet”.
27
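A minimal Python sketch of the same computation; the 0/1 vectors are copied from the table above, and all names are illustrative:

incidence = {                       # one bit per play, in the column order:
    "Brutus":    [1, 1, 0, 1, 0, 0],   # Antony and Cleopatra, Julius Caesar,
    "Caesar":    [1, 1, 0, 1, 1, 1],   # The Tempest, Hamlet, Othello, Macbeth
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Brutus AND Caesar AND NOT Calpurnia, evaluated bitwise per column
answer = [b & c & (1 - p) for b, c, p in
          zip(incidence["Brutus"], incidence["Caesar"], incidence["Calpurnia"])]
print([play for play, hit in zip(plays, answer) if hit])
# -> ['Antony and Cleopatra', 'Hamlet']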
The results: two documents
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring, and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
28
Bigger collections
Consider N = 10^6 documents, each with about 1,000 tokens
⇒ 10^9 tokens at an average of 6 bytes per token ⇒ 6 GB
Assume there are M = 500,000 distinct terms in the collection
Size of the incidence matrix is then 500,000 × 10^6
Half a trillion 0s and 1s
29
Can’t build the Term-Document incidence matrix
Observation: the term-document matrix is very sparse
Contains no more than one billion 1s.
Better representation: only represent the things that do occur
Term-document matrix has other disadvantages, such as lack
of support for more complex query operators (e.g., proximity
search)
We will move towards richer representations, beginning with
the inverted index.
30
The inverted index
The inverted index consists of
a dictionary of terms (also: lexicon, vocabulary)
and a postings list for each term, i.e., a list that records which
documents the term occurs in.
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132 179
Calpurnia 2 31 54 101
31
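A minimal sketch of building such an index in Python (toy documents and a deliberately naive tokeniser, for illustration only):

from collections import defaultdict

def build_index(docs):
    """docs maps docID -> text; returns term -> sorted postings list."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():      # naive tokenisation
            postings[term].add(doc_id)
    # postings lists must be sorted for the intersection algorithm below
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Brutus killed Caesar", 2: "Caesar praised Calpurnia",
        4: "Brutus and Caesar"}
index = build_index(docs)
print(index["brutus"])   # -> [1, 4]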
Processing Boolean Queries: conjunctive queries
Our Boolean Query
Brutus AND Calpurnia
Locate the postings lists of both query terms and intersect them.
Brutus 1 2 4 11 31 45 173 174
Calpurnia 2 31 54 101
Intersection 2 31
Note: this only works if postings lists are sorted
32
Algorithm for intersection of two postings
INTERSECT(p1, p2)
   answer ← ⟨ ⟩
   while p1 ≠ NIL and p2 ≠ NIL
   do if docID(p1) = docID(p2)
        then ADD(answer, docID(p1))
             p1 ← next(p1)
             p2 ← next(p2)
        else if docID(p1) < docID(p2)
               then p1 ← next(p1)
               else p2 ← next(p2)
   return answer
Brutus 1 2 4 11 31 45 173 174
Calpurnia 2 31 54 101
Intersection 2 31
33
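The same merge written as runnable Python; a sketch assuming the postings are sorted lists of docIDs:

def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
# -> [2, 31]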
Complexity of the Intersection Algorithm
Bounded by worst-case length of postings lists
Thus “officially” O(N), with N the number of documents in
the document collection
But in practice much, much better than linear scanning,
which is asymptotically also O(N)
34
Query Optimisation: conjunctive terms
Organise order in which the postings lists are accessed so that least
work needs to be done
Brutus AND Caesar AND Calpurnia
Process terms in increasing document frequency: execute as
(Calpurnia AND Brutus) AND Caesar
Brutus    (df 8): 1, 2, 4, 11, 31, 45, 173, 174
Caesar    (df 9): 1, 2, 4, 5, 6, 16, 57, 132, 179
Calpurnia (df 4): 2, 31, 54, 101
35
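A sketch of this heuristic in Python, reusing the intersect function above (the index is assumed to map each term to its sorted postings list):

def intersect_many(index, terms):
    """AND-query over several terms, shortest postings lists first."""
    postings = sorted((index[t] for t in terms), key=len)
    result = postings[0]
    for p in postings[1:]:
        if not result:            # early exit: intersection already empty
            break
        result = intersect(result, p)
    return result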
Query Optimisation: disjunctive terms
(maddening OR crowd) AND (ignoble OR strife) AND (killed OR slain)
Process the query in increasing order of the size of each
disjunctive term
Estimate this in turn (conservatively) by the sum of
frequencies of its disjuncts
36
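A sketch of the size estimate in Python: the cost of each OR-group is approximated (conservatively) by the sum of the document frequencies of its disjuncts, and the groups are then processed from the smallest estimate to the largest.

def order_or_groups(index, groups):
    """groups is a list of OR-groups, e.g. [["maddening", "crowd"], ...];
    returns them ordered by estimated result size."""
    def estimate(group):
        return sum(len(index.get(term, [])) for term in group)
    return sorted(groups, key=estimate)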
Practical Boolean Search
Provided by large commercial information providers
1960s-1990s
Complex query language; complex and long queries
Extended Boolean retrieval models with additional operators –
proximity operators
Proximity operator: two terms must occur close together in a
document (within a certain number of words, or within the same
sentence or paragraph)
Unordered results...
37
Examples
Westlaw: Largest commercial legal search service – 500K
subscribers
Medical search
Patent search
Useful when expert queries are carefully defined and
incrementally developed
38
Does Google use the Boolean Model?
On Google, the default interpretation of a query [w1 w2 ... wn ] is
w1 AND w2 AND ... AND wn
Cases where you get hits which don’t contain one of the wi:
Page contains variant of wi (morphology, misspelling,
synonym)
long query (n is large)
Boolean expression generates very few hits
wi was in the anchor text
Google also ranks the result set
Simple Boolean Retrieval returns matching documents in no
particular order.
Google (and most well-designed Boolean engines) rank hits
according to some estimator of relevance
39
Reading
Manning, Raghavan, Schütze: Introduction to Information
Retrieval (MRS), chapter 1
40