Boolean Retrieval Explained

The Boolean retrieval model lets users pose queries as Boolean expressions that combine terms with the operators AND, OR, and NOT, treating each document as a set of words. Grepping, a linear scan through the documents, is the most basic retrieval method, but it is too slow for large collections and cannot support flexible matching or ranked retrieval. Indexing the documents in advance, for example with a term-document incidence matrix, avoids the scan, which is essential for searching vast collections of unstructured data.

Lect 2: Boolean Retrieval

Dr. Subrat Kumar Nayak

Associate Professor
Department of CSE, ITER, SOADu
Boolean Retrieval Model
➢ The Boolean retrieval model is a model for information retrieval in
which we can pose any query that is in the form of a Boolean
expression of terms, that is, one in which terms are combined with the
operators AND, OR, and NOT.
➢ The model views each document as just a set of words.
➢ Queries are Boolean expressions, e.g., CAESAR AND BRUTUS.
➢ The search engine returns all documents that satisfy the Boolean
expression.
➢ The simplest way to find them is a linear scan through the documents.
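A minimal sketch of the set-of-words view, assuming a tiny made-up collection (the document ids and texts below are hypothetical, not from the lecture). Each document becomes a set of lowercased words, and AND, OR, and NOT become membership tests on that set:

```python
# Hypothetical toy collection: document id -> text.
docs = {
    "d1": "Caesar was killed by Brutus",
    "d2": "Brutus spoke to the Romans",
    "d3": "Caesar crossed the river",
}

def as_word_set(text):
    """View a document as just a set of (lowercased) words."""
    return set(text.lower().split())

word_sets = {doc_id: as_word_set(text) for doc_id, text in docs.items()}

# CAESAR AND BRUTUS: both terms must be in the document's word set.
hits_and = sorted(d for d, w in word_sets.items()
                  if "caesar" in w and "brutus" in w)

# CAESAR OR BRUTUS: at least one of the terms must be present.
hits_or = sorted(d for d, w in word_sets.items()
                 if "caesar" in w or "brutus" in w)

# CAESAR AND NOT BRUTUS: the first term present, the second absent.
hits_and_not = sorted(d for d, w in word_sets.items()
                      if "caesar" in w and "brutus" not in w)

print(hits_and, hits_or, hits_and_not)
```

Note that every query here still visits every document, which is exactly the linear scan discussed next.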
Boolean Retrieval: Grepping
➢ The simplest form of document retrieval is for a computer to do a linear scan through
the documents. This process is commonly referred to as grepping through text.
➢ It is named after the Unix command grep, which performs this process.
➢ Grepping through text can be a very effective process, especially given the speed of
modern computers.
➢ It often allows useful possibilities for wildcard pattern matching through the use of
regular expressions.
Example:
➢ Suppose you wanted to determine which documents contain the words Information AND
Retrieval AND NOT Boolean.
➢ One way to do that is to start at the beginning and read through all the text, noting for
each document whether it contains Information and Retrieval, and excluding it from
consideration if it contains Boolean.
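The grepping example above can be sketched as a linear scan in Python. This is an illustration only: the file names and texts are hypothetical, and `contains_word` and `grep_query` are helper names introduced here, not part of grep or of the lecture.

```python
import re

def contains_word(text, word):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def grep_query(documents, required, excluded):
    """Scan every document; keep those with all required and no excluded words."""
    matches = []
    for name, text in documents.items():
        has_required = all(contains_word(text, w) for w in required)
        has_excluded = any(contains_word(text, w) for w in excluded)
        if has_required and not has_excluded:
            matches.append(name)
    return matches

# Hypothetical documents for the query: Information AND Retrieval AND NOT Boolean
documents = {
    "a.txt": "Information retrieval finds relevant documents.",
    "b.txt": "Boolean information retrieval uses AND, OR and NOT.",
    "c.txt": "Retrieval of lost luggage.",
}
result = grep_query(documents, ["Information", "Retrieval"], ["Boolean"])
print(result)
```

The cost is proportional to the total size of the text for every single query, which is the weakness the following slides address.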
Boolean Retrieval: Grepping
➢ But for many purposes, you do need more:
❑ To process large document collections quickly. The amount of online data has
grown at least as quickly as the speed of computers, and we would now like to be
able to search collections that total on the order of billions to trillions of words.
❑ To allow more flexible matching operations. For example, it is impractical to
perform the query Romans NEAR countrymen with grep, where NEAR might be
defined as “within 5 words” or “within the same sentence”.
❑ To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
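To make the NEAR operator concrete, here is a small sketch that assumes "within 5 words" means the two token positions differ by at most 5. The function name `near` and the tokenization rule are assumptions made for this illustration:

```python
import re

def near(text, word_a, word_b, k=5):
    """True if word_a and word_b occur within k token positions of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

# Adjacent words: positions 1 and 2.
close_text = "Friends, Romans, countrymen, lend me your ears"
near_hit = near(close_text, "Romans", "countrymen")

# Ten filler tokens in between: positions 0 and 11.
far_text = "Romans " + "filler " * 10 + "countrymen"
near_miss = near(far_text, "Romans", "countrymen")

print(near_hit, near_miss)
```

The check needs word positions, which a plain line-oriented pattern match does not expose directly.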
Unstructured data in 1620
➢ Which plays of Shakespeare contain the words Brutus AND Caesar
but NOT Calpurnia?
➢ One could grep all of Shakespeare’s plays for Brutus and Caesar,
then strip out lines containing Calpurnia?
➢ Why is that not the answer?
❑ Slow (for large corpora)
❑ NOT Calpurnia is non-trivial
❑ Other operations (e.g., find the word Romans near countrymen)
not feasible
❑ Ranked retrieval (best documents to return): later lectures
Boolean Retrieval: Some terminology
➢ Documents: by documents we mean whatever units we have decided to
build a retrieval system over. They might be individual memos or
chapters of a book.
➢ Collection/Corpus: we will refer to the group of documents over
which we perform retrieval as the (document) collection. It is
sometimes also referred to as a corpus (a body of texts).

Let us consider Shakespeare’s Collected Works, and use it to
introduce the basics of the Boolean retrieval model.
Boolean Retrieval: Term-document incidence matrices
➢ The way to avoid linearly scanning the texts for each query is to index the documents in
advance.
➢ The binary term-document incidence matrix is the result of recording, for each document
(here a play of Shakespeare’s), whether it contains each word out of all the words
Shakespeare used (about 32,000 different words).
           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
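As a rough sketch of how such a matrix could be built in advance, the code below records a 0/1 entry for every (term, document) pair. The three short documents are hypothetical stand-ins for the plays, and `build_incidence_matrix` is a name introduced for this example:

```python
def build_incidence_matrix(docs):
    """Return (doc_ids, matrix), where matrix[term] is a 0/1 row over doc_ids."""
    doc_ids = list(docs)
    word_sets = [set(text.lower().split()) for text in docs.values()]
    vocabulary = sorted(set().union(*word_sets))
    matrix = {
        term: [1 if term in words else 0 for words in word_sets]
        for term in vocabulary
    }
    return doc_ids, matrix

# Hypothetical mini-collection standing in for the plays.
docs = {
    "doc1": "brutus killed caesar",
    "doc2": "caesar married calpurnia",
    "doc3": "mercy for brutus",
}
doc_ids, matrix = build_incidence_matrix(docs)

# Print one row per term, one column per document.
for term, row in matrix.items():
    print(f"{term:10s}", row)
```

The matrix is built once; after that, a query only needs the rows of its terms rather than another pass over the text.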
Boolean Retrieval: Incidence vectors
➢ So we have a 0/1 vector for each term.
➢ To answer the query: take the vectors for Brutus, Caesar and Calpurnia
(complemented) ➔ bitwise AND.
110100 AND
110111 AND
101111 =
100100
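The same computation can be checked with Python integers used as bit vectors. This is a sketch under one assumption about layout: the leftmost bit corresponds to the first play in the column order of the matrix. The vectors themselves are taken from the matrix rows.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors from the matrix, leftmost bit = first play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so mask the complement to 6 bits.
mask = (1 << len(plays)) - 1
not_calpurnia = ~calpurnia & mask          # 101111

result = brutus & caesar & not_calpurnia   # 100100

# Map the 1 bits back to play titles.
answers = [play for i, play in enumerate(plays)
           if (result >> (len(plays) - 1 - i)) & 1]

print(format(result, "06b"), answers)
```

The two 1 bits correspond to Antony and Cleopatra and Hamlet.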
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
Bigger collections
➢ Consider N = 1 million documents, each with about 1000 words.
➢ Avg 6 bytes/word including spaces/punctuation
❑ 6GB of data in the documents.
➢ Say there are M = 500K distinct terms among these.
Can’t build the matrix
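➢ 500K x 1M matrix has half-a-trillion 0’s and 1’s.
➢ But it has no more than one billion 1’s. Why?
❑ Each document has about 1000 words, so it contributes at most 1000 1’s; 1M documents give at most one billion.
❑ The matrix is extremely sparse.
➢ What’s a better representation?
❑ We only record the 1 positions.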
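The arithmetic on these slides, and the idea of recording only the 1 positions, can be sketched as follows. The variable names are chosen for this example, and the three sample rows are taken from the Shakespeare matrix earlier in the lecture:

```python
# Collection statistics from the slides.
N = 1_000_000            # documents
words_per_doc = 1_000    # about 1000 words each
bytes_per_word = 6       # average, including spaces/punctuation
M = 500_000              # distinct terms

collection_bytes = N * words_per_doc * bytes_per_word   # 6 GB
matrix_cells = M * N                                    # half a trillion
max_ones = N * words_per_doc                            # at most one billion
density_upper_bound = max_ones / matrix_cells           # fraction of cells that can be 1

# Recording only the 1 positions: for each term, keep the column indexes
# (documents) where the entry is 1 instead of the full 0/1 row.
rows = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}
one_positions = {term: [i for i, bit in enumerate(row) if bit]
                 for term, row in rows.items()}

print(collection_bytes, matrix_cells, max_ones, density_upper_bound)
print(one_positions)
```

At most 0.2% of the cells can be 1, so storing only those positions takes far less space than the full matrix.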
