CSCI 7000
Modern Information Retrieval
Jim Martin
Lecture 3
9/3/2008
Today 9/5
Review
Dictionary contents
Advanced query handling
Phrases
Wildcards
Spelling correction
First programming assignment
Index: The Dictionary file and a Postings file
The sorted (term, doc, freq) pairs are merged into one dictionary entry per term
(document frequency, collection frequency) plus a postings list (doc : freq):

Term        # docs  Coll. freq   Postings (doc : freq)
ambitious      1        1        2:1
be             1        1        2:1
brutus         2        2        1:1, 2:1
capitol        1        1        1:1
caesar         2        3        1:1, 2:2
did            1        1        1:1
enact          1        1        1:1
hath           1        1        2:1
I              1        2        1:2
i'             1        1        1:1
it             1        1        2:1
julius         1        1        1:1
killed         1        2        1:2
let            1        1        2:1
me             1        1        1:1
noble          1        1        2:1
so             1        1        2:1
the            2        2        1:1, 2:1
told           1        1        2:1
you            1        1        2:1
was            2        2        1:1, 2:1
with           1        1        2:1
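A rough Python sketch of how the dictionary and postings above can be built from the two toy documents; the tokenizer is deliberately crude and everything lives in memory, purely for illustration:

from collections import defaultdict
import re

# The two toy documents behind the example above (Julius Caesar fragments).
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# postings[term][doc_id] = frequency of term in that document
postings = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):   # crude tokenization
        postings[token][doc_id] += 1

# Dictionary rows: term, document frequency, collection frequency, then postings.
for term in sorted(postings):
    df = len(postings[term])              # number of docs containing the term
    cf = sum(postings[term].values())     # total occurrences in the collection
    print(f"{term:10s} {df} {cf}  {dict(postings[term])}")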
Review: Dictionary
What goes into creating the dictionary?
Tokenization
Case folding
Stemming
Stop-listing
Dealing with numbers (and number-like entities)
Complex morphology
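As a rough illustration of these steps, a toy normalization pipeline; the stoplist and the suffix-stripping "stemmer" below are stand-ins for real components (e.g., a Porter stemmer), not what any real system would use:

import re

STOPLIST = {"the", "a", "an", "and", "or", "of", "to", "in"}   # toy stoplist

def crude_stem(token):
    # Stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # tokenize + case-fold
    return [crude_stem(t) for t in tokens if t not in STOPLIST]

print(normalize("The Buffaloes were running in Boulder"))
# -> ['buffalo', 'were', 'runn', 'boulder']  (crude, as advertised)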
Phrasal queries
Want to handle queries such as
“Colorado Buffaloes” – as a phrase
This concept is popular with users; about 10% of
web queries are phrasal queries
Postings that consist of document lists alone are
not sufficient to handle phrasal queries
Two general approaches
Biword indexing
Positional indexing
Solution 1: Biword Indexing
Index every consecutive pair of terms in
the text as a phrase
For example the text “Friends, Romans,
Countrymen” would generate the biwords
friends romans
romans countrymen
Each of these biwords is now a dictionary
term
Two-word phrase query-processing is now
free
(Not really.)
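A minimal sketch of biword extraction and lookup (token normalization assumed already done):

from collections import defaultdict

def biwords(tokens):
    # Every consecutive pair of tokens becomes a single dictionary term.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("friends romans countrymen".split()))
# -> ['friends romans', 'romans countrymen']

# Index each biword exactly like an ordinary term.
biword_index = defaultdict(set)
for doc_id, text in {1: "friends romans countrymen"}.items():
    for bw in biwords(text.lower().split()):
        biword_index[bw].add(doc_id)

# A two-word phrase query is then a single dictionary lookup.
print(biword_index.get("friends romans", set()))   # -> {1}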
Longer Phrasal Queries
Longer phrases can be broken into Boolean AND
queries over the component biwords (see the
sketch below)
“Colorado Buffaloes at Arizona”
(Colorado Buffaloes) AND (Buffaloes at)
AND (at Arizona)
Susceptible to Type 1 errors (false positives)
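A sketch of that decomposition, reusing a biword index like the one above; note the AND can succeed even when the full phrase never occurs, which is exactly the false-positive risk:

def phrase_to_biwords(phrase):
    tokens = phrase.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def run_biword_and(phrase, biword_index):
    # AND together the posting sets of the component biwords.
    terms = phrase_to_biwords(phrase)
    if not terms:
        return set()
    result = set(biword_index.get(terms[0], set()))
    for t in terms[1:]:
        result &= biword_index.get(t, set())
    return result   # may contain false positives for phrases of 3+ words

print(phrase_to_biwords("Colorado Buffaloes at Arizona"))
# -> ['colorado buffaloes', 'buffaloes at', 'at arizona']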
Solution 2: Positional Indexing
Change our posting content
Store, for each term, entries of the form:
<number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Positional index example
<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Processing a phrase query
Extract inverted index entries for each distinct
term: to, be, or, not.
Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
to:
2:1,17,74,222,551; 4:8,16,190,429,433;
7:13,23,191; ...
be:
1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Same general method for proximity searches
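A sketch of the positional merge for a two-word phrase, assuming an in-memory index of the form term -> {doc: sorted position list}; proximity queries generalize this by allowing offsets other than exactly +1:

# Toy positional index built from the postings shown above.
pos_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

def phrase_matches(t1, t2, index, offset=1):
    # Docs (and positions of t1) where t2 occurs exactly `offset` positions later.
    hits = {}
    for doc in index.get(t1, {}).keys() & index.get(t2, {}).keys():
        second = set(index[t2][doc])
        positions = [p for p in index[t1][doc] if p + offset in second]
        if positions:
            hits[doc] = positions
    return hits

print(phrase_matches("to", "be", pos_index))   # -> {4: [16, 190, 429, 433]}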
Positional index size
As we’ll see you can compress position
values/offsets
But a positional index still expands the
postings storage substantially
Nevertheless, it is now the standard
approach because of the power and
usefulness of phrase and proximity queries
… whether used explicitly or implicitly in a
ranked retrieval system.
Rules of thumb
Positional index size 35–50% of volume of
original text
Caveat: all of this holds for “English-like”
languages
Combination Techniques
Biwords are faster.
And they cover a large percentage of very
frequent (implied) phrasal queries
Britney Spears
So it can be effective to combine positional
indexes with biword indexes for frequent
bigrams
Web
Cuil
Yahoo! BOSS
Programming Assignment: Part 1
Download and install Lucene
How does Lucene handle (by default)
Case, stemming, and phrasal queries
Download and index a collection that I will
point you at
How big is the resulting index?
Terms and size of index
Return the Top N document IDs (hits) from
a set of queries I’ll provide.
Programming Assignment: Part 2
Make it better
Wild Card Queries
Two flavors
Word-based
Caribb*
Phrasal
“Pirates * Caribbean”
General approach
Generate a set of new queries from the
original
Operation on the dictionary
Run those queries in a not stupid way
Simple Single Wild-card queries: *
Single instance of a *
* means a string of length 0 or more
This is not Kleene *.
mon*: find all docs containing any word
beginning “mon”.
Index your lexicon on prefixes
*mon: find words ending in “mon”
Maintain a backwards index
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent?
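One way to sketch the two lookups, assuming a lexicon small enough to keep sorted in memory (a B-tree or trie plays the same role at scale): binary search over the terms handles trailing-* queries, and the same search over reversed terms (the backwards index) handles leading-* queries:

from bisect import bisect_left

lexicon = sorted(["monday", "money", "monitor", "month", "moon", "salmon", "sermon"])
rev_lexicon = sorted(w[::-1] for w in lexicon)

def prefix_matches(prefix, terms):
    # All terms starting with `prefix`, via binary search on a sorted list.
    i = bisect_left(terms, prefix)
    out = []
    while i < len(terms) and terms[i].startswith(prefix):
        out.append(terms[i])
        i += 1
    return out

def suffix_matches(suffix):
    # All terms ending with `suffix`, via the reversed ("backwards") lexicon.
    return [w[::-1] for w in prefix_matches(suffix[::-1], rev_lexicon)]

print(prefix_matches("mon", lexicon))   # mon* -> monday, money, monitor, month
print(suffix_matches("mon"))            # *mon -> salmon, sermon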
Arbitrary Wildcards
How can we handle multiple *’s in the
middle of a query term?
The solution: transform every wild-card
query so that the *’s occur at the end
This gives rise to the Permuterm Index.
Permuterm Index
For term hello index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.
Example
Query = hel*o
Rotate
Lookup o$hel*
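A sketch of generating the rotations and rotating a single-* query so the * lands at the end; it keeps every rotation of hello$ (including $hello, which a query like hello* rotates onto):

def permuterm_rotations(term):
    # All rotations of term + '$'; each rotation points back to the original term.
    augmented = term + "$"
    return {augmented[i:] + augmented[:i]: term for i in range(len(augmented))}

def rotate_query(query):
    # Rotate a single-* wildcard query so the * is at the end, e.g. hel*o -> o$hel*.
    assert query.count("*") == 1
    before, after = query.split("*")
    return after + "$" + before + "*"

print(sorted(permuterm_rotations("hello")))
# -> ['$hello', 'ello$h', 'hello$', 'llo$he', 'lo$hel', 'o$hell']
print(rotate_query("hel*o"))   # -> o$hel*  (then do an ordinary prefix lookup)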
Permuterm query processing
Rotate query wild-card to the right
Now use indexed lookup as before.
Permuterm problem: ≈ quadruples lexicon
size
Empirical observation for English.
Spelling Correction
Two primary uses
Correcting document(s) being indexed
Retrieve matching documents when query
contains a spelling error
Two main flavors:
Isolated word
Check each word on its own for misspelling
Will not catch typos that result in correctly spelled words
e.g., typing form for from
Context-sensitive
Look at surrounding words, e.g., I flew form
Heathrow to Narita.
Document correction
Primarily for OCR’ed documents
Correction algorithms tuned for this
Goal: the index (dictionary) contains fewer
OCR-induced misspellings
Can use domain-specific knowledge
E.g., OCR tends to confuse O and D (visually
similar) more often than O and I; O and I are
adjacent on the QWERTY keyboard, so they are more
likely to be interchanged in typing.
Query correction
Our principal focus here
E.g., the query Alanis Morisett
We can
Retrieve using that spelling as given,
Retrieve documents indexed by the correct
spelling, OR
Return several suggested alternative
queries with the correct spelling
Did you mean … ?
Isolated word correction
Fundamental premise – there is a lexicon
from which the correct spellings come
Two basic choices for this
A standard lexicon such as
Webster’s English Dictionary
An “industry-specific” lexicon – hand-maintained
The lexicon of the indexed corpus
E.g., all words on the web
All names, acronyms etc.
(Including the mis-spellings)
Isolated word correction
Given a lexicon and a character sequence
Q, return the words in the lexicon closest
to Q
What’s “closest”?
We’ll study several alternatives
Edit distance
Weighted edit distance
Character n-gram overlap
Edit distance
Given two strings S1 and S2, the minimum
number of basic operations to convert one to
the other
Basic operations are typically character-level
Insert
Delete
Replace
E.g., the edit distance from cat to dog is 3.
Generally found by dynamic programming.
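The standard dynamic-programming formulation, sketched in Python:

def edit_distance(s1, s2):
    # Minimum number of insertions, deletions, and replacements to turn s1 into s2.
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j] = distance(s1[:i], s2[:j])
    for i in range(m + 1):
        dp[i][0] = i                             # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                             # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete s1[i-1]
                           dp[i][j - 1] + 1,         # insert s2[j-1]
                           dp[i - 1][j - 1] + sub)   # replace (or match)
    return dp[m][n]

print(edit_distance("cat", "dog"))   # -> 3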
Weighted edit distance
As above, but the weight of an operation
depends on the character(s) involved
Meant to capture keyboard errors, e.g. m
more likely to be mis-typed as n than as q
Therefore, replacing m by n is a smaller
edit distance than by q
(Same ideas usable for OCR, but with
different weights)
Requires a weight matrix as input
Modify dynamic programming to handle
weights (Viterbi)
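The same recurrence handles weights once the substitution cost comes from a supplied weight matrix or function; the cheap m/n cost below is a made-up toy, not a real confusion matrix:

def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
    # Edit distance where replacing character a by b costs sub_cost(a, b).
    m, n = len(s1), len(s2)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
            dp[i][j] = min(dp[i - 1][j] + del_cost,
                           dp[i][j - 1] + ins_cost,
                           dp[i - 1][j - 1] + sub)
    return dp[m][n]

cheap_mn = lambda a, b: 0.5 if {a, b} == {"m", "n"} else 1.0   # m<->n is "closer"
print(weighted_edit_distance("money", "noney", cheap_mn))   # -> 0.5
print(weighted_edit_distance("money", "qoney", cheap_mn))   # -> 1.0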
Using edit distances
Given query, first enumerate all dictionary
terms within a preset (weighted) edit
distance
Then look up enumerated dictionary terms
in the term-document inverted index
Edit distance to all dictionary terms?
Given a (misspelled) query – do we
compute its edit distance to every
dictionary term?
Expensive and slow
How do we cut the set of candidate
dictionary terms?
Character n-gram overlap can be used to prune the candidates
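One common pruning sketch: index dictionary terms by their character n-grams, score each candidate by (say) Jaccard overlap with the query's n-grams, and run edit distance only on the survivors. The lexicon and threshold here are made up for illustration:

from collections import defaultdict

def char_ngrams(term, n=2):
    padded = f"${term}$"                       # boundary markers
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

lexicon = ["border", "boarder", "lord", "morbid", "sordid", "board"]

ngram_index = defaultdict(set)                 # n-gram -> terms containing it
for term in lexicon:
    for g in char_ngrams(term):
        ngram_index[g].add(term)

def candidates(query, threshold=0.4):
    # Dictionary terms whose bigram Jaccard overlap with the query clears the threshold.
    q_grams = char_ngrams(query)
    pool = set().union(*(ngram_index.get(g, set()) for g in q_grams))
    scored = []
    for term in pool:
        t_grams = char_ngrams(term)
        jaccard = len(q_grams & t_grams) / len(q_grams | t_grams)
        if jaccard >= threshold:
            scored.append((round(jaccard, 2), term))
    return sorted(scored, reverse=True)

print(candidates("bord"))   # the shortlist to run (weighted) edit distance on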
Context-sensitive spell correction
Text: I flew from Heathrow to Narita.
Consider the phrase query “flew form
Heathrow”
We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.
Context-sensitive correction
Need surrounding context to catch this.
NLP too heavyweight for this.
First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
Now try all possible resulting phrases with one
word “fixed” at a time
flew from heathrow
fled form heathrow
flea form heathrow
etc.
Suggest the alternative that has lots of hits?
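A sketch of that one-word-at-a-time enumeration; the alternatives dict below stands in for the "dictionary terms close in weighted edit distance" lookup from the previous step:

# Stand-in for "terms within small edit distance of each query word".
alternatives = {
    "flew": ["flew", "fled", "flea"],
    "form": ["form", "from", "fore"],
    "heathrow": ["heathrow"],
}

def one_word_fixes(query_terms):
    # Yield variants of the query with exactly one word swapped for an alternative.
    for i, word in enumerate(query_terms):
        for alt in alternatives.get(word, []):
            if alt != word:
                yield query_terms[:i] + [alt] + query_terms[i + 1:]

for variant in one_word_fixes("flew form heathrow".split()):
    # A real system would run each variant as a phrase query and count its hits,
    # then suggest the variant(s) matching the most documents.
    print(" ".join(variant))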
Exercise
Suppose that for “flew form Heathrow”
we have 7 alternatives for flew, 19 for
form and 3 for heathrow.
How many “corrected” phrases will we
enumerate in this scheme?
General issue in spell correction
Will enumerate multiple alternatives for
“Did you mean”
Need to figure out which one (or small
number) to present to the user
Use heuristics
The alternative hitting most docs
Query log analysis + tweaking
For especially popular, topical queries
Language modeling
Computational cost
Spell-correction is computationally
expensive
Avoid running routinely on every query?
Run only on queries that matched few docs
Next Time
On to Chapter 4
Real indexing