Biomedical IR
Search Engine Architecture [part 3]
Lecture 4
Dr. Ebtsam AbdelHakam
Minia University
Indexing Process
[Figure: the indexing process, showing text acquisition, text transformation, and index creation around a document data store (© Addison Wesley, 2008; Walid Magdy, TTDS 2017/2018)]
• Text acquisition: documents arrive via web crawling, provider feeds, RSS "feeds", or a desktop/email system, and are kept in a document data store (each document gets a unique ID; what can you store? disk space? rights? compression?)
• Text transformation: format conversion, international text, which part contains "meaning"? word units? stopping? stemming?
• Index creation: what data do we want? a lookup table for quickly finding all documents containing a word
Text Transformation (Pre-processing)
• Standard text pre-processing steps:
1. Tokenisation
2. Stop word removal
3. Normalization
4. Stemming
Getting ready for indexing?
• Pre-processing steps before indexing:
• Tokenisation
• Stopping
• Stemming
• Objective: identify the optimal form of the term to be indexed, to achieve the best retrieval performance
Tokenisation
• Tokenizer: A document is converted to a stream of tokens,
e.g. individual words.
• Sentence → tokenization (splitting) → tokens
• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space); see the sketch after this list
• Each such token is now a candidate for an index entry
(term), after further processing
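A minimal sketch of the "split at non-letter characters" technique, using Python's re module (the function name tokenise is illustrative):

```python
import re

def tokenise(text):
    """Split a document into tokens at every run of non-letter characters."""
    return [tok for tok in re.split(r"[^a-zA-Z]+", text) if tok]

print(tokenise("This is a very exciting lecture on the technologies of text!"))
# ['This', 'is', 'a', 'very', 'exciting', 'lecture', 'on', 'the',
#  'technologies', 'of', 'text']
```

Each resulting token becomes a candidate index term after the further processing steps below.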
Stopping: stop words
• Example: "This is a very exciting lecture on the technologies of text"
• Stop words: the most common words in collection
the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them ≈ 30-40% of text
• New stop words appear in specific domains
• Tweets: RT “RT @realDonalTrump Mexico will …”
• Patents: said, claim “a said method that extracts ….”
• Stop words influence sentence structure, but have less influence on topic (aboutness)
Stopping: stop words
• Common practice in many applications: remove them (a removal sketch follows below)
• You need them for:
• Phrase queries:
“King of Denmark”, “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
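A minimal sketch of stop-word removal over a token stream; the stop list here is illustrative, real systems use lists of a few hundred words:

```python
# Illustrative stop list (a real one would be much longer).
STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on",
              "to", "very", "of", "this"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = ["This", "is", "a", "very", "exciting", "lecture", "on", "the",
          "technologies", "of", "text"]
print(remove_stop_words(tokens))
# ['exciting', 'lecture', 'technologies', 'text']
```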
Normalisation
• Objective: make words with different surface forms look the same
• Document: “this is my CAR!!”
Query: “car”
should “car” match “CAR”?
• Sentence → tokenisation → tokens → normalisation → terms to be indexed
• Same tokenisation/normalisation steps should be
applied to documents & queries
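A minimal sketch of one common normalisation step, case folding, applied identically to document and query tokens:

```python
def normalise(tokens):
    """Case-fold tokens so that different surface forms look the same."""
    return [tok.lower() for tok in tokens]

doc_terms = normalise(["this", "is", "my", "CAR"])
query_terms = normalise(["car"])
print(query_terms[0] in doc_terms)   # True: "car" now matches "CAR"
```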
Stemming
• Search for: "play"
should it match: "played", "playing", "player"?
• Stemmers attempt to reduce morphological variations
of words to a common stem
• usually involves removing suffixes (in English)
• Many morphological variations of words
• inflectional (plurals, tenses)
• derivational (making verbs nouns etc.)
• In most cases, aboutness does not change
• Can be done at indexing time or as part of query
processing (like stopwords)
• Two basic types
• Dictionary-based: uses lists of related words
• Algorithmic: uses program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame
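A minimal sketch of the suffix-s algorithmic stemmer just described, illustrating both the false negatives and false positives mentioned above:

```python
def suffix_s_stem(token):
    """Naive algorithmic stemmer: strip a trailing 's', assuming it marks a plural."""
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token

print(suffix_s_stem("cats"))      # cat
print(suffix_s_stem("lakes"))     # lake
print(suffix_s_stem("supplies"))  # supplie (false negative: does not match "supply")
print(suffix_s_stem("James"))     # Jame    (false positive: not a plural)
```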
Porter Stemmer
• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
• phases applied sequentially
• each phase consists of a set of commands
• sample convention:
of the rules in a compound command, select the one that
applies to the longest suffix.
• Example rules in Porter stemmer (see the sketch after this list)
• sses → ss (processes → process)
• y → i (reply → repli)
• ies → i (replies → repli)
• ement → null (replacement → replac)
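An easy way to try these rules is an off-the-shelf Porter implementation; this sketch assumes the NLTK package is installed (pip install nltk):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["processes", "reply", "replies", "replacement", "playing", "played"]:
    print(word, "->", stemmer.stem(word))
# processes -> process
# reply -> repli
# replies -> repli
# replacement -> replac
# playing -> play
# played -> play
```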
• Irregular verbs (not handled by stemming):
• saw → see
• went → go
• Different spellings
• colour vs. color
• tokenisation vs. tokenization
• Television vs. TV
• Synonyms
• car vs. vehicle
• UK vs. Britain
• Solution: query expansion …
Text pre-processing before IR:
Tokenisation → Stopping → Stemming
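Combining the three steps, a self-contained end-to-end pre-processing sketch (Porter stemming via NLTK, assumed to be installed; the stop list is illustrative):

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on",
              "to", "very", "of", "this"}
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenise -> stop -> stem, producing the terms to be indexed."""
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]  # tokenisation + case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]               # stopping
    return [stemmer.stem(t) for t in tokens]                          # stemming

print(preprocess("He played a very exciting game"))
# ['play', 'excit', 'game']
```

The same preprocessing must be applied to both documents (at indexing time) and queries (at query time) so that their terms match.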
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored.
‣ Common index types:
1. Forward Index: Key is the document, value is a list of terms
and term positions. Easiest for the crawler to build.
2. Inverted Index: Key is a term, value is a list of documents
and term positions. Provides faster processing at query time.
Forward index
- The rationale behind the forward index is that, as documents are parsed, it is convenient to first store the words per document.
- The forward index is then sorted to transform it into an inverted index (see the sketch after this list).
- The forward index is essentially a list of pairs consisting of a document
and a word, collated by the document.
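A minimal sketch of building a forward index during parsing and collating it into an inverted one; the data layout (plain dicts keyed by document ID and by term) is illustrative, not a production format:

```python
from collections import defaultdict

docs = {
    "d1": ["exciting", "lecture", "text"],
    "d2": ["text", "index", "lecture"],
}

# Forward index: document -> list of (term, position) pairs, built as each document is parsed.
forward_index = {
    doc_id: [(term, pos) for pos, term in enumerate(terms)]
    for doc_id, terms in docs.items()
}

# Collate by term to obtain the inverted index: term -> list of (document, position) pairs.
inverted_index = defaultdict(list)
for doc_id, postings in forward_index.items():
    for term, pos in postings:
        inverted_index[term].append((doc_id, pos))

print(forward_index["d1"])        # [('exciting', 0), ('lecture', 1), ('text', 2)]
print(inverted_index["lecture"])  # [('d1', 1), ('d2', 2)]
```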
Inverted index
• In its simplest form, this index only records whether a word occurs in a particular document; it stores no information about the frequency or position of the word and is therefore considered a boolean index.
• Such an index determines which documents match a query but does
not rank matched documents.
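A minimal sketch of boolean retrieval over such a document-level inverted index; an AND query simply intersects the document sets of the query terms (the set-based layout is an assumption for illustration):

```python
# Document-level (boolean) inverted index: term -> set of documents containing it.
boolean_index = {
    "exciting": {"d1"},
    "lecture":  {"d1", "d2"},
    "text":     {"d1", "d2"},
    "index":    {"d2"},
}

def boolean_and(query_terms, index):
    """Return the (unranked) set of documents containing ALL query terms."""
    doc_sets = [index.get(term, set()) for term in query_terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(boolean_and(["lecture", "text"], boolean_index))    # both d1 and d2 match
print(boolean_and(["exciting", "index"], boolean_index))  # set(): no document has both
```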