Biomedical Information Retrieval
Search Engine Architecture [2]
Lecture 3
Dr. Ebtsam AbdelHakam
Minia University
Indexing Process
Two ways to represent an index:
1. Term-Document Matrix
2. Inverted Index
The Shakespeare collection as a Term-Document Matrix
Matrix element (t, d) is:
1 if term t occurs in document d,
0 otherwise.
The Shakespeare collection as a Term-Document Matrix
QUERY:
Brutus AND Caesar AND NOT Calpurnia
Answer: “Antony and Cleopatra” (d=1), “Hamlet” (d=4)
Brutus:        1 1 0 1 0 0
Caesar:        1 1 0 1 1 1
NOT Calpurnia: 1 0 1 1 1 1
AND result:    1 0 0 1 0 0   → documents 1 and 4
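To make the Boolean operations concrete, here is a minimal Python sketch (not from the original slides) that evaluates the query over the bit vectors above; the Calpurnia row is inferred as the complement of the NOT Calpurnia vector shown.

# Term-document incidence matrix; columns are documents d = 1..6.
rows = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

# Brutus AND Caesar AND NOT Calpurnia, evaluated bitwise per document.
result = [b & c & (1 - p)
          for b, c, p in zip(rows["Brutus"], rows["Caesar"], rows["Calpurnia"])]
matches = [d for d, bit in enumerate(result, start=1) if bit]
print(result, matches)   # [1, 0, 0, 1, 0, 0] [1, 4] -> “Antony and Cleopatra”, “Hamlet”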
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored:
• Forward Index: the key is a document; the value is a list of terms and term positions. Easiest for the crawler to build.
• Inverted Index: the key is a term; the value is a list of documents and term positions. Provides faster processing at query time. (Both layouts are sketched in the code below.)
‣ Term weights are calculated and stored with the terms. The
weight estimates the term’s importance to the document.
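A minimal sketch of the two layouts, assuming a toy two-document collection (the docIDs and texts are made up for illustration):

from collections import defaultdict

# Forward index: document -> list of (term, position). Easy to build
# while scanning each document in turn, e.g. as a crawler fetches it.
docs = {1: "brutus killed caesar", 2: "caesar was ambitious"}
forward = {
    doc_id: [(term, pos) for pos, term in enumerate(text.split())]
    for doc_id, text in docs.items()
}

# Inverted index: term -> list of (document, position). Built by
# "inverting" the forward index; fast to look up at query time.
inverted = defaultdict(list)
for doc_id, postings in forward.items():
    for term, pos in postings:
        inverted[term].append((doc_id, pos))

print(inverted["caesar"])   # [(1, 2), (2, 0)]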
Inverted index
• We need variable-size postings lists.
• On disk, a continuous run of postings is normal and best.
• In memory, we can use linked lists or variable-length arrays; there are tradeoffs in size and ease of insertion.

Dictionary → Postings (each posting is a docID; lists are sorted by docID, more later on why):
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Inverted index construction
Documents to be indexed:  Friends, Romans, countrymen.
↓ Tokenizer
Token stream:  Friends  Romans  Countrymen
↓ Linguistic modules
Modified tokens:  friend  roman  countryman
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
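A minimal sketch of this pipeline, with a crude regex tokenizer and a toy plural-stripping rule standing in for the linguistic modules (a real indexer would use proper tokenization and stemming; note the toy rule does not map "countrymen" to "countryman"):

import re
from collections import defaultdict

def tokenize(text):
    # Tokenizer: split raw text into lowercase tokens.
    return re.findall(r"[a-z']+", text.lower())

def normalize(token):
    # Stand-in for the linguistic modules: a toy rule that strips a
    # trailing "s" (friends -> friend); a real stemmer would do far more.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(docs):
    # Indexer: map each normalized token to its sorted list of docIDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {2: "Friends, Romans, countrymen.", 4: "Romans and friends!"}
print(build_index(docs))
# {'friend': [2, 4], 'roman': [2, 4], 'countrymen': [2], 'and': [4]}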
Where do we pay in storage?
• Dictionary: the terms and their counts
• Postings: lists of docIDs
• Pointers from dictionary entries to postings lists
IR system implementation questions:
• How do we index efficiently?
• How much storage do we need?
Indexer steps: Token sequence
Sequence of (modified token, document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort
Sort by terms (at least conceptually), and then by docID.
This is the core indexing step.
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Document frequency information is added. (Why frequency? Will discuss later.)

Term  docID
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
was 1
was 2
with 2
you 2
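A sketch of this step, abbreviated to a few of the pairs above: sorting and grouping the (term, docID) pairs yields the dictionary with document frequencies and the postings lists.

from itertools import groupby

# A few of the (term, docID) pairs from the two documents above.
pairs = [("brutus", 1), ("brutus", 2), ("caesar", 1), ("caesar", 2),
         ("caesar", 2), ("killed", 1), ("killed", 1), ("noble", 2)]

pairs.sort()      # core indexing step: sort by term, then by docID
dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs
for term, group in groupby(pairs, key=lambda p: p[0]):
    docs = sorted({doc_id for _, doc_id in group})  # merge duplicates per document
    dictionary[term] = len(docs)                    # df = number of documents with term
    postings[term] = docs

print(dictionary)  # {'brutus': 2, 'caesar': 2, 'killed': 1, 'noble': 1}
print(postings)    # {'brutus': [1, 2], 'caesar': [1, 2], 'killed': [1], 'noble': [2]}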
Quiz: How would you process these queries?
QUERY:
Brutus AND Caesar AND Calpurnia
QUERY:
Brutus AND (Caesar OR Calpurnia)
QUERY:
Brutus AND Caesar AND NOT Calpurnia
Which terms do you intersect first? How would you handle OR and NOT? (One heuristic is sketched below.)
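One standard heuristic, though not the only valid answer: intersect in increasing order of postings-list size, so the rarest term is processed first and intermediate results stay small; OR maps to a union and AND NOT to a set difference. A sketch using Python sets and the example postings from the earlier slide:

postings = {
    "Brutus":    {1, 2, 4, 11, 31, 45, 173, 174},
    "Caesar":    {1, 2, 4, 5, 6, 16, 57, 132},
    "Calpurnia": {2, 31, 54, 101},
}

# AND: intersect in increasing order of postings-list size (rarest first).
terms = sorted(["Brutus", "Caesar", "Calpurnia"], key=lambda t: len(postings[t]))
result = postings[terms[0]]
for t in terms[1:]:
    result &= postings[t]
print(sorted(result))   # [2]

# OR is a set union; AND NOT is a set difference.
print(sorted(postings["Brutus"] & (postings["Caesar"] | postings["Calpurnia"])))  # [1, 2, 4, 31]
print(sorted((postings["Brutus"] & postings["Caesar"]) - postings["Calpurnia"]))  # [1, 4]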
Query processing: AND
Consider processing the query:
Brutus AND Caesar
• Locate Brutus in the Dictionary; retrieve its postings.
• Locate Caesar in the Dictionary; retrieve its postings.
• “Merge” the two postings (intersect the document sets):
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
Intersecting two postings lists (a “merge” algorithm)
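A Python rendering of the standard two-pointer intersection the slide refers to, linear in the total number of postings:

def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID present in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]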
TF-IDF Scoring Function
• The weights are used by ranking algorithms.
• e.g. TF-IDF ranks documents by the Term Frequency of the query term within the document times the Inverse Document Frequency of the term across all documents.
• A higher score means the document contains more of the query terms, weighted toward terms that are not found in many documents.
• Given query q and document d:

TF-IDF(q, d) = Σ_{t ∈ q} tf(t, d) × idf(t)

where:
‣ tf(t, d) is the term frequency (raw count) of t in d
‣ idf(t) = log(N / df(t)) is the inverse document frequency
‣ N is the total number of documents
‣ df(t) is the number of documents with ≥1 occurrence of t
Let’s take an example to get a clearer understanding.
Sentence 1: The car is driven on the road.
Sentence 2: The truck is driven on the highway.
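Treating each sentence as a document (so N = 2), a minimal sketch of the computation; the unsmoothed idf = log(N / df) follows the formula above, and the natural log is an arbitrary choice:

import math
from collections import Counter

docs = {
    1: "the car is driven on the road",
    2: "the truck is driven on the highway",
}
N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}      # raw term counts
df = Counter(term for counts in tf.values() for term in counts)  # document frequencies

def tf_idf(term, doc_id):
    # tf(t, d) * log(N / df(t)); a term in every document gets idf = log(1) = 0.
    return tf[doc_id][term] * math.log(N / df[term])

print(tf_idf("car", 1))   # 1 * log(2/1) ≈ 0.693: "car" distinguishes document 1
print(tf_idf("the", 1))   # 2 * log(2/2) = 0.0: "the" appears in both documents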
Classical IR Models
Classical IR models are similarity-based models:
1. Boolean Model
2. Vector Space Model
Evaluation: How good/bad is my IR?
• Evaluation is important:
– Compare two IR systems
– Decide whether our IR is ready for deployment
– Identify research challenges
Precision and Recall metrics

                relevant    not relevant
retrieved          A             B
not retrieved      C             D

When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are.

precision = A / (A + B)
recall = A / (A + C)

In the example: A = 20, B = 10, C = 40.
Information Retrieval: IR Evaluation metrics
• Precision: the number of relevant documents retrieved divided by the total number of returned documents.
• Recall: the number of relevant documents retrieved divided by the total number of relevant documents.
• F-score: the harmonic mean of precision and recall.
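A direct computation of the worked example above (A = 20, B = 10, C = 40):

# Contingency counts from the worked example.
A = 20   # relevant and retrieved
B = 10   # retrieved but not relevant
C = 40   # relevant but not retrieved

precision = A / (A + B)                                   # 20/30 ≈ 0.667
recall = A / (A + C)                                      # 20/60 ≈ 0.333
f_score = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.444

print(precision, recall, f_score)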
A list of open-source search engines:
Apache Lucene
Sphinx
Whoosh
Carrot2