Biomedical Information Retrieval
Search Engine Architecture [2]
Lecture 3
Dr. Ebtsam AbdelHakam
Minia University
Indexing Process
Two ways to represent an index:
1. Term-Document Matrix
2. Inverted Index
The Shakespeare collection as a Term-Document Matrix
Matrix element (t, d) is:
1 if term t occurs in document d,
0 otherwise.
The Shakespeare collection as a Term-Document Matrix
QUERY:
Brutus AND Caesar AND NOT Calpurnia
Answer: “Antony and Cleopatra” (d=1), “Hamlet” (d=4)
Brutus:        1 1 0 1 0 0
Caesar:        1 1 0 1 1 1
NOT Calpurnia: 1 0 1 1 1 1
AND result:    1 0 0 1 0 0   → documents 1 and 4
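To make the Boolean operations concrete, here is a minimal Python sketch (not from the original slides) that evaluates the query over the bit vectors above; the Calpurnia row is inferred as the complement of the NOT Calpurnia vector shown.

# Term-document incidence matrix; columns are documents d = 1..6.
rows = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

# Brutus AND Caesar AND NOT Calpurnia, evaluated bitwise per document.
result = [b & c & (1 - p)
          for b, c, p in zip(rows["Brutus"], rows["Caesar"], rows["Calpurnia"])]
matches = [d for d, bit in enumerate(result, start=1) if bit]
print(result, matches)   # [1, 0, 0, 1, 0, 0] [1, 4] -> “Antony and Cleopatra”, “Hamlet”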
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored:
• Forward Index: the key is a document; the value is a list of terms and term positions. Easiest for the crawler to build.
• Inverted Index: the key is a term; the value is a list of documents and term positions. Provides faster processing at query time. (Both layouts are sketched in the code below.)
‣ Term weights are calculated and stored with the terms. The
weight estimates the term’s importance to the document.
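A minimal sketch of the two layouts, assuming a toy two-document collection (the docIDs and texts are made up for illustration):

from collections import defaultdict

# Forward index: document -> list of (term, position). Easy to build
# while scanning each document in turn, e.g. as a crawler fetches it.
docs = {1: "brutus killed caesar", 2: "caesar was ambitious"}
forward = {
    doc_id: [(term, pos) for pos, term in enumerate(text.split())]
    for doc_id, text in docs.items()
}

# Inverted index: term -> list of (document, position). Built by
# "inverting" the forward index; fast to look up at query time.
inverted = defaultdict(list)
for doc_id, postings in forward.items():
    for term, pos in postings:
        inverted[term].append((doc_id, pos))

print(inverted["caesar"])   # [(1, 2), (2, 0)]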
Inverted index
• We need variable-size postings lists.
• On disk, a continuous run of postings is normal and best.
• In memory, we can use linked lists or variable-length arrays; there are tradeoffs in size and ease of insertion.

Dictionary → Postings (each posting is a docID; lists are sorted by docID, more later on why):
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Inverted index construction
Documents to be indexed:  Friends, Romans, countrymen.
↓ Tokenizer
Token stream:  Friends  Romans  Countrymen
↓ Linguistic modules
Modified tokens:  friend  roman  countryman
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
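A minimal sketch of this pipeline, with a crude regex tokenizer and a toy plural-stripping rule standing in for the linguistic modules (a real indexer would use proper tokenization and stemming; note the toy rule does not map "countrymen" to "countryman"):

import re
from collections import defaultdict

def tokenize(text):
    # Tokenizer: split raw text into lowercase tokens.
    return re.findall(r"[a-z']+", text.lower())

def normalize(token):
    # Stand-in for the linguistic modules: a toy rule that strips a
    # trailing "s" (friends -> friend); a real stemmer would do far more.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(docs):
    # Indexer: map each normalized token to its sorted list of docIDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {2: "Friends, Romans, countrymen.", 4: "Romans and friends!"}
print(build_index(docs))
# {'friend': [2, 4], 'roman': [2, 4], 'countrymen': [2], 'and': [4]}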
Where do we pay in storage?
• Dictionary: the terms and their counts
• Postings: lists of docIDs
• Pointers from dictionary entries to postings lists
IR system implementation questions:
• How do we index efficiently?
• How much storage do we need?
Indexer steps: Token sequence
Sequence of (modified token, document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort
Sort by terms (at least conceptually), and then by docID.
This is the core indexing step.
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Document frequency information is added. (Why frequency? Will discuss later.)

Term  docID
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
was 1
was 2
with 2
you 2
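A sketch of this step, abbreviated to a few of the pairs above: sorting and grouping the (term, docID) pairs yields the dictionary with document frequencies and the postings lists.

from itertools import groupby

# A few of the (term, docID) pairs from the two documents above.
pairs = [("brutus", 1), ("brutus", 2), ("caesar", 1), ("caesar", 2),
         ("caesar", 2), ("killed", 1), ("killed", 1), ("noble", 2)]

pairs.sort()      # core indexing step: sort by term, then by docID
dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs
for term, group in groupby(pairs, key=lambda p: p[0]):
    docs = sorted({doc_id for _, doc_id in group})  # merge duplicates per document
    dictionary[term] = len(docs)                    # df = number of documents with term
    postings[term] = docs

print(dictionary)  # {'brutus': 2, 'caesar': 2, 'killed': 1, 'noble': 1}
print(postings)    # {'brutus': [1, 2], 'caesar': [1, 2], 'killed': [1], 'noble': [2]}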
Quiz: How would you process these queries?
QUERY:
Brutus AND Caesar AND Calpurnia
QUERY:
Brutus AND (Caesar OR Calpurnia)
QUERY:
Brutus AND Caesar AND NOT Calpurnia
Which terms do you intersect first? How would you handle OR and NOT? (One heuristic is sketched below.)
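One standard heuristic, though not the only valid answer: intersect in increasing order of postings-list size, so the rarest term is processed first and intermediate results stay small; OR maps to a union and AND NOT to a set difference. A sketch using Python sets and the example postings from the earlier slide:

postings = {
    "Brutus":    {1, 2, 4, 11, 31, 45, 173, 174},
    "Caesar":    {1, 2, 4, 5, 6, 16, 57, 132},
    "Calpurnia": {2, 31, 54, 101},
}

# AND: intersect in increasing order of postings-list size (rarest first).
terms = sorted(["Brutus", "Caesar", "Calpurnia"], key=lambda t: len(postings[t]))
result = postings[terms[0]]
for t in terms[1:]:
    result &= postings[t]
print(sorted(result))   # [2]

# OR is a set union; AND NOT is a set difference.
print(sorted(postings["Brutus"] & (postings["Caesar"] | postings["Calpurnia"])))  # [1, 2, 4, 31]
print(sorted((postings["Brutus"] & postings["Caesar"]) - postings["Calpurnia"]))  # [1, 4]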
Query processing: AND
Consider processing the query:
Brutus AND Caesar
• Locate Brutus in the Dictionary; retrieve its postings.
• Locate Caesar in the Dictionary; retrieve its postings.
• “Merge” the two postings (intersect the document sets):
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
Intersecting two postings lists (a “merge” algorithm)
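A Python rendering of the standard two-pointer intersection the slide refers to, linear in the total number of postings:

def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID present in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]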
TF-IDF Scoring Function
• The weights are used by ranking algorithms.
• e.g. TF-IDF ranks documents by the Term Frequency of the query term within the document times the Inverse Document Frequency of the term across all documents.
• A higher score means the document contains more of the query terms, weighted toward terms that are not found in many documents.
• Given query q and document d:

TF-IDF(q, d) = Σ_{t ∈ q} tf(t, d) × idf(t)

where:
‣ tf(t, d) is the term frequency (raw count) of t in d
‣ idf(t) = log(N / df(t)) is the inverse document frequency
‣ N is the total number of documents
‣ df(t) is the number of documents with ≥1 occurrence of t
Let’s take an example to get a clearer understanding.
Sentence 1: The car is driven on the road.
Sentence 2: The truck is driven on the highway.
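Treating each sentence as a document (so N = 2), a minimal sketch of the computation; the unsmoothed idf = log(N / df) follows the formula above, and the natural log is an arbitrary choice:

import math
from collections import Counter

docs = {
    1: "the car is driven on the road",
    2: "the truck is driven on the highway",
}
N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}      # raw term counts
df = Counter(term for counts in tf.values() for term in counts)  # document frequencies

def tf_idf(term, doc_id):
    # tf(t, d) * log(N / df(t)); a term in every document gets idf = log(1) = 0.
    return tf[doc_id][term] * math.log(N / df[term])

print(tf_idf("car", 1))   # 1 * log(2/1) ≈ 0.693: "car" distinguishes document 1
print(tf_idf("the", 1))   # 2 * log(2/2) = 0.0: "the" appears in both documents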
Classical IR Models
Classical IR models are similarity-based models:
1. Boolean Model
2. Vector Space Model
Evaluation: How good/bad is my IR?
• Evaluation is important:
– Compare two IR systems
– Decide whether our IR is ready for deployment
– Identify research challenges
Precision and Recall metrics

                relevant    not relevant
retrieved          A             B
not retrieved      C             D

When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are.

precision = A / (A + B)
recall = A / (A + C)

In the example: A = 20, B = 10, C = 40.
Information Retrieval: IR Evaluation metrics
• Precision: the number of relevant documents retrieved divided by the total number of returned documents.
• Recall: the number of relevant documents retrieved divided by the total number of relevant documents.
• F-score: the harmonic mean of precision and recall.
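A direct computation of the worked example above (A = 20, B = 10, C = 40):

# Contingency counts from the worked example.
A = 20   # relevant and retrieved
B = 10   # retrieved but not relevant
C = 40   # relevant but not retrieved

precision = A / (A + B)                                   # 20/30 ≈ 0.667
recall = A / (A + C)                                      # 20/60 ≈ 0.333
f_score = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.444

print(precision, recall, f_score)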
A list of open-source search engines:
Apache Lucene
Sphinx
Whoosh
Carrot2