
Biomedical Information Retrieval
Search Engine Architecture [2]
Lecture 3

Dr. Ebtsam AbdelHakam
Minia University
Indexing Process
Index Creation: Storing Document Statistics

‣ The counts and positions of document terms are stored.

Two ways to represent the index:

1. Term-Document Matrix
2. Inverted Index
The Shakespeare collection as a Term-Document Matrix

Matrix element (t, d) is:
1 if term t occurs in document d,
0 otherwise
The Shakespeare collection as a Term-Document Matrix

QUERY:
Brutus AND Caesar AND NOT Calpurnia

Take the bit vectors for Brutus and Caesar and the complemented bit vector
for Calpurnia, then AND them together bitwise:

Brutus          1 1 0 1 0 0
Caesar          1 1 0 1 1 1
NOT Calpurnia   1 0 1 1 1 1
Bitwise AND     1 0 0 1 0 0

Answer: “Antony and Cleopatra” (d=1), “Hamlet” (d=4)
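
A minimal Python sketch of this Boolean retrieval over the incidence matrix. The bit vectors are the ones shown above; the document titles other than “Antony and Cleopatra” and “Hamlet” are filled in from the standard Shakespeare example purely for illustration:

# Boolean retrieval over a term-document incidence matrix.
# Each term maps to one bit per document: 1 if the term occurs in that document.
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def vec_and(a, b):
    return [x & y for x, y in zip(a, b)]

def vec_not(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = vec_and(vec_and(incidence["Brutus"], incidence["Caesar"]),
                 vec_not(incidence["Calpurnia"]))
print([title for title, bit in zip(docs, result) if bit])
# ['Antony and Cleopatra', 'Hamlet']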
Index Creation: Storing Document Statistics

‣ The counts and positions of document terms are stored.

• Forward Index: the key is a document; the value is a list of terms and
  term positions. Easiest for the crawler to build.

• Inverted Index: the key is a term; the value is a list of documents and
  term positions. Provides faster processing at query time.

‣ Term weights are calculated and stored with the terms. The weight
  estimates the term’s importance to the document.
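
A minimal Python sketch contrasting the two structures on two toy documents; the toy texts and dictionary shapes are illustrative, not a production index format:

from collections import defaultdict

docs = {
    1: "to be or not to be",
    2: "to sleep perchance to dream",
}

# Forward index: document -> list of (term, position) pairs.
forward_index = {
    doc_id: [(term, pos) for pos, term in enumerate(text.split())]
    for doc_id, text in docs.items()
}

# Inverted index: term -> {document -> list of positions}.
inverted_index = defaultdict(lambda: defaultdict(list))
for doc_id, postings in forward_index.items():
    for term, pos in postings:
        inverted_index[term][doc_id].append(pos)

print(forward_index[1])            # [('to', 0), ('be', 1), ('or', 2), ('not', 3), ('to', 4), ('be', 5)]
print(dict(inverted_index["to"]))  # {1: [0, 4], 2: [0, 3]}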
Sec. 1.2

Inverted index

 We need variable-size postings lists
 On disk, a continuous run of postings is normal and best
 In memory, we can use linked lists or variable-length arrays
 Some tradeoffs in size / ease of insertion

Dictionary      Postings (sorted by docID; more later on why)
Brutus     →    1, 2, 4, 11, 31, 45, 173, 174
Caesar     →    1, 2, 4, 5, 6, 16, 57, 132
Calpurnia  →    2, 31, 54, 101
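
A minimal Python sketch of this dictionary-plus-postings layout, with the dictionary as an ordinary dict and each postings list as a sorted variable-length array. The add_posting helper and the docID 180 are illustrative additions:

import bisect

# Dictionary: term -> postings list (docIDs kept sorted).
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    # Insert doc_id into the term's postings list, keeping the list sorted.
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)   # O(n) shift: the size / ease-of-insertion tradeoff

add_posting(index, "Calpurnia", 180)
print(index["Calpurnia"])  # [2, 31, 54, 101, 180]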
Sec. 1.2

Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
          ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
          ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
          ↓ Indexer
Inverted index:           friend     → 2, 4
                          roman      → 1, 2
                          countryman → 13, 16
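
A minimal Python sketch of this pipeline. The tokenizer and the “linguistic modules” are reduced to a regex split, lowercasing, and a toy plural-stripping rule (which, unlike a real stemmer, leaves “countrymen” unchanged), and the second document is invented for illustration:

import re
from collections import defaultdict

def tokenize(text):
    # Tokenizer: split the character stream into tokens.
    return re.findall(r"\w+", text)

def normalize(token):
    # "Linguistic modules", reduced here to lowercasing plus a toy plural rule.
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def index_documents(docs):
    # Indexer: build term -> sorted list of docIDs.
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            inverted[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

docs = {2: "Friends, Romans, countrymen.", 4: "friends, friends"}
print(index_documents(docs))
# {'friend': [2, 4], 'roman': [2], 'countrymen': [2]}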
Sec. 1.2

Where do we pay in storage?

• Terms and counts (the dictionary)
• Pointers from dictionary entries to postings lists
• Lists of docIDs (the postings)

IR system implementation:
• How do we index efficiently?
• How much storage do we need?
Sec. 1.2

Indexer steps: Token sequence


 Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
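
A minimal Python sketch producing this (modified token, docID) pair stream. Normalization here is just lowercasing, which is an assumption, so the modified tokens may differ slightly from the slide's:

import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Emit the sequence of (modified token, docID) pairs, document by document.
pairs = [
    (token.lower(), doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[\w']+", text)
]

print(pairs[:5])  # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]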
Sec. 1.2

Indexer steps: Sort

 Sort by term (at least conceptually)
 And then by docID

Core indexing step


Sec. 1.2

Indexer steps: Dictionary & Postings


 Multiple term entries in a single document are merged.
 Split into Dictionary and Postings.
 Document frequency information is added.
   (Why frequency? Will discuss later.)

Sorted (term, docID) pairs:

Term        docID
ambitious   2
be          2
brutus      1
brutus      2
caesar      1
caesar      2
caesar      2
capitol     1
did         1
enact       1
hath        2
I           1
I           1
i'          1
it          2
julius      1
killed      1
killed      1
let         2
me          1
noble       2
so          2
the         1
the         2
told        2
was         1
was         2
with        2
you         2
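
A minimal Python sketch of these indexer steps: sort the (term, docID) pairs, merge duplicate entries within a document, and split the result into a dictionary entry (term with its document frequency) plus a postings list. It can consume the pair stream built in the earlier sketch; the names are illustrative:

from itertools import groupby

def build_index(pairs):
    # pairs: iterable of (term, docID). Returns {term: (doc_frequency, postings)}.
    index = {}
    # Core indexing step: sort by term, then by docID.
    for term, group in groupby(sorted(pairs), key=lambda p: p[0]):
        # Merge multiple entries of the same term within a single document.
        postings = sorted({doc_id for _, doc_id in group})
        # Document frequency = number of documents the term occurs in.
        index[term] = (len(postings), postings)
    return index

index = build_index([("caesar", 1), ("caesar", 2), ("caesar", 2),
                     ("brutus", 1), ("brutus", 2), ("killed", 1)])
print(index["caesar"])  # (2, [1, 2])
print(index["killed"])  # (1, [1])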
Quiz: How would you process these queries?
QUERY:
Brutus AND Caesar AND Calpurnia
QUERY:
Brutus AND (Caesar OR Calpurnia)
QUERY:
Brutus AND Caesar AND NOT Calpurnia

Which terms do you intersect first?

Think: What terms to process first? How to handle OR, NOT?


Sec. 1.3

Query processing: AND

 Consider processing the query:  Brutus AND Caesar
 Locate Brutus in the Dictionary; retrieve its postings.
 Locate Caesar in the Dictionary; retrieve its postings.
 “Merge” the two postings (intersect the document sets):

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Sec. 1.3

The merge

 Walk through the two postings lists simultaneously, in time linear in the
total number of postings entries.

Brutus        → 2, 4, 8, 16, 32, 64, 128
Caesar        → 1, 2, 3, 5, 8, 13, 21, 34
Intersection  → 2, 8

 If the list lengths are x and y, the merge takes O(x + y) operations.
 Crucial: the postings are sorted by docID.
Intersecting two postings lists (a “merge” algorithm)
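
A minimal Python sketch of this merge: walk both docID-sorted postings lists with two pointers, advancing whichever pointer sits at the smaller docID:

def intersect(p1, p2):
    # Intersect two postings lists sorted by docID in O(len(p1) + len(p2)) time.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]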
TF-IDF Scoring Function

• The term weights are used by ranking algorithms.

• e.g. TF-IDF ranks documents by the Term Frequency of each query term
  within the document multiplied by the Inverse Document Frequency of that
  term across all documents.

• A higher score means the document contains more of the query terms, and
  those terms are not found in many other documents.

• Given query q and document d:
TF-IDF(q, d) = sum over terms t in q of  tf(t, d) × idf(t)

where
  tf(t, d) = term frequency (raw count) of t in d
  idf(t)   = log( N / df(t) )
  N        = total number of documents
  df(t)    = number of documents with >= 1 occurrence of t
 Let’s take an example to get a clearer understanding.
 Sentence 1 : The car is driven on the road.
 Sentence 2: The truck is driven on the highway.
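
A minimal Python sketch computing this score for the two example sentences, treating each sentence as a document and using the raw-count tf and log(N / df) idf above; tokenization is a plain whitespace split and stopwords are kept:

import math
from collections import Counter

docs = {
    1: "the car is driven on the road",
    2: "the truck is driven on the highway",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
N = len(docs)

def df(term):
    # Number of documents containing the term at least once.
    return sum(1 for tokens in tokenized.values() if term in tokens)

def tf_idf(query, doc_id):
    counts = Counter(tokenized[doc_id])            # raw term counts (tf)
    return sum(counts[t] * math.log(N / df(t))     # tf * idf, summed over query terms
               for t in query.split() if df(t) > 0)

print(tf_idf("car", 1), tf_idf("car", 2))
# 0.693... 0.0  ('car' occurs only in sentence 1, so idf = log(2/1);
#               shared words like 'driven' get idf = log(2/2) = 0)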
Classical IR Models

 Classical IR models are similarity-based models; a small Vector Space
Model similarity sketch follows the list below.

1. Boolean Model
2. Vector Space Model
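
As an illustration of what “similarity-based” means for the Vector Space Model, here is a minimal cosine-similarity sketch over raw term-count vectors (real systems typically weight the vectors with TF-IDF; the example texts are reused from the TF-IDF slide):

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Represent each text as a term-count vector and return the cosine of the angle.
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the car is driven on the road",
                        "the truck is driven on the highway"))
# ~0.78: the two sentences share most of their terms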
Evaluation: How good/bad is my IR?

• Evaluation is important:
– Compare two IR systems
– Decide whether our IR is ready for deployment
– Identify research challenges
Precision and Recall Metrics

                 relevant    not relevant
retrieved           A             B
not retrieved       C             D

precision = A / (A + B)
recall    = A / (A + C)

Example (A = 20, B = 10, C = 40): when a search engine returns 30 pages,
only 20 of which are relevant, while failing to return 40 additional
relevant pages, its precision is 20/30 = 2/3, which tells us how valid the
results are, while its recall is 20/60 = 1/3, which tells us how complete
the results are.
IR Evaluation Metrics

 Precision: the number of relevant documents retrieved divided by the
total number of documents returned.
 Recall: the number of relevant documents retrieved divided by the total
number of relevant documents in the collection.
 F-score: the harmonic mean of precision and recall.
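
A minimal Python sketch computing these metrics for the example above (A = 20 relevant retrieved, B = 10 irrelevant retrieved, C = 40 relevant documents missed):

def precision_recall_f1(a, b, c):
    # a: relevant retrieved, b: irrelevant retrieved, c: relevant not retrieved.
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(20, 10, 40)
print(f"precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
# precision=0.667  recall=0.333  F1=0.444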
List of open-source search engines

 Apache Lucene
 Sphinx
 Whoosh
 Carrot2
