
CSCI 7000

Modern Information Retrieval

Jim Martin

Lecture 3
9/3/2008
Today 9/5

 Review
 Dictionary contents
 Advanced query handling
 Phrases
 Wildcards
 Spelling correction
 First programming assignment



Index: The Dictionary file and a Postings file

Dictionary entries (term, # docs, collection freq) with their postings lists (doc #: within-doc freq):

Term        # docs   coll freq   Postings (doc #: freq)
ambitious      1         1       2:1
be             1         1       2:1
brutus         2         2       1:1, 2:1
capitol        1         1       1:1
caesar         2         3       1:1, 2:2
did            1         1       1:1
enact          1         1       1:1
hath           1         1       2:1
I              1         2       1:2
i'             1         1       1:1
it             1         1       2:1
julius         1         1       1:1
killed         1         2       1:2
let            1         1       2:1
me             1         1       1:1
noble          1         1       2:1
so             1         1       2:1
the            2         2       1:1, 2:1
told           1         1       2:1
you            1         1       2:1
was            2         2       1:1, 2:1
with           1         1       2:1



Review: Dictionary

 What goes into creating the dictionary?


 Tokenization
 Case folding
 Stemming
 Stop-listing
 Dealing with numbers (and number-like entities)
 Complex morphology



Phrasal queries
 Want to handle queries such as
 “Colorado Buffaloes” – as a phrase
 This concept is popular with users; about 10% of
web queries are phrasal queries
 Postings that consist of document lists alone are
not sufficient to handle phrasal queries
 Two general approaches
 Biword indexing
 Positional indexing



Solution 1: Biword Indexing

 Index every consecutive pair of terms in the text as a phrase
 For example the text “Friends, Romans,
Countrymen” would generate the biwords
 friends romans
 romans countrymen
 Each of these biwords is now a dictionary
term
 Two-word phrase query processing is now free
(Not really.)
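
As a concrete illustration, here is a minimal Python sketch of biword index construction; the whitespace tokenizer and dict-of-sets layout are simplifying assumptions for exposition, not how a production system stores postings.

```python
from collections import defaultdict

def biwords(text):
    """Lowercase, strip punctuation, and emit consecutive term pairs."""
    terms = [t.strip(",.;:!?").lower() for t in text.split()]
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def build_biword_index(docs):
    """Map each biword to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

docs = {1: "Friends, Romans, Countrymen"}
index = build_biword_index(docs)
print(sorted(index))   # ['friends romans', 'romans countrymen']
```
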
Longer Phrasal Queries

 Longer phrases can be broken into Boolean AND queries on the component biwords
 “Colorado Buffaloes at Arizona”
 (Colorado Buffaloes) AND (Buffaloes at) AND (at Arizona)

Susceptible to Type 1 errors (false positives)
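
Continuing the sketch above, a longer phrase becomes an intersection of the component biwords' document sets. As the slide notes, this can overgenerate: a document can contain every biword without containing the whole phrase.

```python
def biword_phrase_query(index, phrase):
    """AND together the document sets of each component biword."""
    parts = biwords(phrase)              # biwords() from the sketch above
    result = None
    for bw in parts:
        postings = index.get(bw, set())
        result = set(postings) if result is None else result & postings
    return result if result is not None else set()

# "Colorado Buffaloes at Arizona" =>
# (colorado buffaloes) AND (buffaloes at) AND (at arizona)
# A doc containing all three biwords need not contain the full phrase.
```
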



Solution 2: Positional Indexing

 Change our posting content
 Store, for each term, entries of the form:
<number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
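
A small sketch of one plausible in-memory layout for such postings; the nested-dict representation is an assumption for illustration (real indexes store compressed on-disk lists), with the document frequency available as the number of doc entries.

```python
from collections import defaultdict

def build_positional_index(docs):
    """index[term] = {doc_id: [positions...]}; df is len(index[term])."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

index = build_positional_index({1: "to be or not to be"})
print(dict(index["be"]))   # {1: [1, 5]}
```
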



Positional index example

<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?



Processing a phrase query
 Extract inverted index entries for each distinct
term: to, be, or, not.
 Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
 to:
 2:1,17,74,222,551; 4:8,16,190,429,433;
7:13,23,191; ...
 be:
 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
 Same general method for proximity searches
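
A sketch of the positional merge for a two-term phrase, using the toy index layout above: for each document containing both terms, check whether some position of the second term is exactly one past a position of the first. Chaining this check pairwise handles longer phrases, and relaxing “exactly one” to “within k” gives proximity search.

```python
def phrase_match(index, first, second):
    """Docs where `second` occurs immediately after `first`."""
    hits = []
    p1, p2 = index.get(first, {}), index.get(second, {})
    for doc_id in p1.keys() & p2.keys():      # docs containing both terms
        positions = set(p2[doc_id])
        if any(pos + 1 in positions for pos in p1[doc_id]):
            hits.append(doc_id)
    return hits

# "to be" over the toy index built above
print(phrase_match(index, "to", "be"))   # [1]
```
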



Positional index size

 As we’ll see, you can compress position values/offsets
 But a positional index still expands the postings storage substantially
 Nevertheless, it is now the standard approach because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranked retrieval system.



Rules of thumb

 Positional index size 35–50% of volume of original text
 Caveat: all of this holds for “English-like” languages



Combination Techniques

 Biwords are faster.
 And they cover a large percentage of very frequent (implied) phrasal queries
 Britney Spears
 So it can be effective to combine positional indexes with biword indexes for frequent bigrams



Web

 Cuil
 Yahoo! BOSS



Programming Assignment: Part 1

 Download and install Lucene
 How does Lucene handle (by default)
 Case, stemming, and phrasal queries
 Download and index a collection that I will
point you at
 How big is the resulting index?
 Terms and size of index
 Return the Top N document IDs (hits) from
a set of queries I’ll provide.



Programming Assignment: Part 2

 Make it better



Wild Card Queries

 Two flavors
 Word-based
 Caribb*
 Phrasal
 “Pirates * Caribbean”
 General approach
 Generate a set of new queries from the
original
 Operation on the dictionary
 Run those queries in a not stupid way



Simple Single Wild-card queries: *

 Single instance of a *
 * means a string of length 0 or more
 This is not Kleene *.
 mon*: find all docs containing any word beginning “mon”.
 Index your lexicon on prefixes
 *mon: find words ending in “mon”
 Maintain a backwards index
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent? (A sketch follows.)
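
A hedged sketch of both lookups over a sorted lexicon: mon* by binary search on the prefix, *mon via a second lexicon of reversed terms, and pro*cent by intersecting the two result sets (with a length check so the prefix and suffix cannot overlap inside a short term). The sample lexicon is invented for illustration.

```python
import bisect

lexicon = sorted(["magnificent", "moderate", "monday", "money",
                  "monitor", "percent", "procent", "proficient"])
rev_lexicon = sorted(t[::-1] for t in lexicon)

def prefix_matches(sorted_terms, prefix):
    """All terms in a sorted list that begin with `prefix` (binary search)."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_right(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

def wildcard(prefix, suffix):
    """Terms matching prefix*suffix, e.g. pro*cent."""
    starts = set(prefix_matches(lexicon, prefix))
    ends = {t[::-1] for t in prefix_matches(rev_lexicon, suffix[::-1])}
    # Length check: the prefix and suffix must not overlap inside the term.
    return sorted(t for t in starts & ends
                  if len(t) >= len(prefix) + len(suffix))

print(prefix_matches(lexicon, "mon"))   # ['monday', 'money', 'monitor']
print(wildcard("pro", "cent"))          # ['procent']
```
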
Arbitrary Wildcards

 How can we handle multiple *’s in the middle of a query term?
 The solution: transform every wild-card
query so that the *’s occur at the end
 This gives rise to the Permuterm Index.



Permuterm Index

 For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.
 Example

Query = hel*o

Rotate

Lookup o$hel*
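
A sketch of the two pieces in Python, following the slide's convention of indexing the rotations of term+$ that keep some of the term's characters first; the query rotation moves the single * to the end, leaving a prefix to match against the indexed rotations.

```python
def permuterm_rotations(term):
    """Rotations of term + '$' as listed on the slide."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(term))]

def rotate_query(query):
    """Rotate a single-* wildcard query so the * comes last, then drop it."""
    t = query + "$"
    star = t.index("*")
    rotated = t[star + 1:] + t[:star + 1]   # now ends in '*'
    return rotated[:-1]                     # the prefix to look up

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell']
print(rotate_query("hel*o"))
# 'o$hel' -- prefix-match this against the rotations (the slide's o$hel*)
```
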



Permuterm query processing

 Rotate query wild-card to the right
 Now use indexed lookup as before.
 Permuterm problem: ≈ quadruples lexicon size (empirical observation for English)



Spelling Correction

 Two primary uses
 Correcting document(s) being indexed
 Retrieve matching documents when query
contains a spelling error
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words, e.g., from → form
 Context-sensitive
 Look at surrounding words, e.g., I flew form
Heathrow to Narita.
Document correction

 Primarily for OCR’ed documents
 Correction algorithms tuned for this
 Goal: the index (dictionary) contains fewer OCR-induced misspellings
 Can use domain-specific knowledge
 E.g., OCR can confuse O and D more often
than it would confuse O and I (adjacent on
the QWERTY keyboard, so more likely
interchanged in typing).



Query correction

 Our principal focus here
 E.g., the query Alanis Morisett
 We can either
 Retrieve using that spelling
 Retrieve documents indexed by the correct
spelling, OR
 Return several suggested alternative
queries with the correct spelling
 Did you mean … ?



Isolated word correction

 Fundamental premise – there is a lexicon from which the correct spellings come
 Two basic choices for this
 A standard lexicon such as
 Webster’s English Dictionary
 An “industry-specific” lexicon – hand-maintained
 The lexicon of the indexed corpus
 E.g., all words on the web
 All names, acronyms etc.
 (Including the mis-spellings)



Isolated word correction

 Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
 What’s “closest”?
 We’ll study several alternatives
 Edit distance
 Weighted edit distance
 Character n-gram overlap



Edit distance

 Given two strings S1 and S2, the minimum number of basic operations to convert one to the other
 Basic operations are typically character-level
 Insert
 Delete
 Replace
 E.g., the edit distance from cat to dog is 3.
 Generally found by dynamic programming.
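
A standard dynamic-programming sketch of (Levenshtein) edit distance; the sub_cost parameter defaults to unit costs and anticipates the weighted variant on the next slide.

```python
def edit_distance(s1, s2, sub_cost=lambda a, b: 0 if a == b else 1):
    """Minimum number of insertions, deletions, and replacements."""
    m, n = len(s1), len(s2)
    # d[i][j] = distance between s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # delete
                          d[i][j - 1] + 1,            # insert
                          d[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
    return d[m][n]

print(edit_distance("cat", "dog"))   # 3, as on the slide
```
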



Weighted edit distance

 As above, but the weight of an operation depends on the character(s) involved
 Meant to capture keyboard errors, e.g. m
more likely to be mis-typed as n than as q
 Therefore, replacing m by n is a smaller
edit distance than by q
 (Same ideas usable for OCR, but with
different weights)
 Require weight matrix as input
 Modify dynamic programming to handle
weights (Viterbi)
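
Reusing edit_distance from the sketch above, weighting amounts to swapping in a different substitution-cost function. The adjacency pairs and the 0.5 cost below are made-up illustrative values, not a measured confusion matrix.

```python
# Toy weights: adjacent-key substitutions cost less (hypothetical values).
ADJACENT = {("m", "n"), ("n", "m"), ("e", "r"), ("r", "e")}

def keyboard_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if (a, b) in ADJACENT else 1.0

print(edit_distance("mat", "nat", sub_cost=keyboard_cost))  # 0.5
print(edit_distance("mat", "qat", sub_cost=keyboard_cost))  # 1.0
```
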
Using edit distances

 Given query, first enumerate all dictionary terms within a preset (weighted) edit distance
 Then look up enumerated dictionary terms in the term-document inverted index



Edit distance to all dictionary terms?

 Given a (misspelled) query – do we compute its edit distance to every dictionary term?
 Expensive and slow
 How do we cut the set of candidate dictionary terms?
 Here we can use n-gram overlap for this
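
A sketch of that filter using character bigram overlap scored with the Jaccard coefficient; the 0.3 threshold and the little dictionary are arbitrary choices for illustration. Only the survivors would go on to the expensive edit-distance computation.

```python
def ngrams(term, n=2):
    """Character n-grams, e.g. 'lord' -> {'lo', 'or', 'rd'}."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def candidates(query, dictionary, threshold=0.3):
    """Cheap n-gram filter before running edit distance on survivors."""
    q = ngrams(query)
    return [t for t in dictionary if jaccard(q, ngrams(t)) >= threshold]

dictionary = ["border", "lord", "morbid", "sordid", "cat"]
print(candidates("bord", dictionary))   # ['border', 'lord', 'sordid']
```
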



Context-sensitive spell correction

 Text: I flew from Heathrow to Narita.
 Consider the phrase query “flew form Heathrow”
 We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.



Context-sensitive correction
 Need surrounding context to catch this.
 NLP too heavyweight for this.
 First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
 Now try all possible resulting phrases with one
word “fixed” at a time
 flew from heathrow
 fled form heathrow
 flea form heathrow
 etc.
 Suggest the alternative that has lots of hits?
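
A sketch of this enumeration; alternatives_for stands in for the edit-distance candidate generation above, and the toy table below is invented for the running example. Each alternative phrase keeps all but one word fixed; the survivors would then be scored by hit counts.

```python
def phrase_alternatives(words, alternatives_for):
    """Vary one word at a time, keeping the others fixed."""
    phrases = []
    for i, word in enumerate(words):
        for alt in alternatives_for(word):
            if alt != word:
                phrases.append(words[:i] + [alt] + words[i + 1:])
    return phrases

# Toy candidate function standing in for a real edit-distance lookup.
def toy_alternatives(word):
    table = {"flew": ["flew", "fled", "flea"],
             "form": ["form", "from", "fort"],
             "heathrow": ["heathrow"]}
    return table.get(word, [word])

for p in phrase_alternatives(["flew", "form", "heathrow"], toy_alternatives):
    print(" ".join(p))
# fled/flea and from/fort variants: one substitution per phrase
```
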



Exercise

Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in this scheme?



General issue in spell correction

 Will enumerate multiple alternatives for “Did you mean”
 Need to figure out which one (or small number) to present to the user
 Use heuristics
 The alternative hitting most docs
 Query log analysis + tweaking
 For especially popular, topical queries
 Language modeling



Computational cost

 Spell-correction is computationally expensive
 Avoid running routinely on every query?
 Run only on queries that matched few docs



Next Time

 On to Chapter 4
 Real indexing
