Lect 7 Normalization

The document discusses the concept of normalization in the context of information retrieval, defining key terms such as word, term, token, and type. It explains methods for token normalization, including creating equivalence classes and maintaining relations between unnormalized tokens, while also addressing challenges with diacritics and case-folding in various languages. Additionally, it touches on handling synonyms, homonyms, and spelling mistakes through techniques like thesauri and Soundex.

Dr. Subrat Kumar Nayak


Associate Professor
Dept. of CSE, ITER, SOADU
Definitions
 Word – A delimited string of characters as it appears in the text.
 Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence
class of words.
 Token – An instance of a word or term occurring in a document.
 Type – The same as a term in most cases: an equivalence class of tokens.
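The distinction between these definitions can be illustrated with a small sketch. It assumes a toy tokenizer and a lowercase-only normalizer (both hypothetical choices, not part of the lecture), and it takes "type" to mean a distinct surface string so that the difference from a normalized term is visible.

```python
import re

def tokenize(text):
    """Split text into word tokens (runs of letters, digits, or apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(token):
    """Toy normalization for this sketch: lowercase only."""
    return token.lower()

text = "The cat saw the other cat"

tokens = tokenize(text)                 # every occurrence counts: 6 tokens
types = set(tokens)                     # distinct strings as written: 5 types
terms = {normalize(t) for t in tokens}  # distinct normalized forms: 4 terms

print(len(tokens), len(types), len(terms))  # prints: 6 5 4
```

"The" and "the" are two types here but collapse into one term once case is normalized.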
Normalization
 Token normalization is the process of canonicalizing tokens so that
matches occur despite superficial differences in the character sequences of
the tokens.
 The most standard way to normalize is to implicitly create equivalence
classes, which are normally named after one member of the set.
 Example: if anti-discriminatory and antidiscriminatory are both mapped
onto the term antidiscriminatory, in both the document text and queries,
then a search for either form will retrieve documents that contain either.
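The equivalence-class idea can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a normalizer that only deletes hyphens; the function names and the sample documents are hypothetical. The key point is that the same normalizer runs at indexing time and at query time.

```python
from collections import defaultdict

def normalize(token):
    """Name the equivalence class of a token by deleting its hyphens."""
    return token.replace("-", "")

def build_index(docs):
    """Map each normalized term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query_term):
    """Normalize the query term exactly as the documents were normalized."""
    return sorted(index.get(normalize(query_term), set()))

docs = {
    1: "anti-discriminatory policy",
    2: "antidiscriminatory law",
    3: "tax law",
}
index = build_index(docs)

print(search(index, "anti-discriminatory"))  # [1, 2]
print(search(index, "antidiscriminatory"))   # [1, 2]
```

Only one postings list exists for the class, which is why this approach is cheap compared with the alternatives.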
Normalization
 An alternative to creating equivalence classes is to maintain relations between
unnormalized tokens.
 This method can be extended to hand-constructed lists of synonyms such as car
and automobile.
 These term relationships can be achieved in two ways.
➢ The usual way is to index unnormalized tokens and to maintain a query expansion
list of multiple vocabulary entries to consider for a certain query term. A query
term is then effectively a disjunction of several postings lists.
➢ The alternative is to perform the expansion during index construction. When the
document contains automobile, we index it under car as well.
 Use of either of these methods is considerably less efficient than equivalence
classing, as there are more postings to store and merge.
➢ The first method adds a query expansion dictionary and requires more processing
at query time, while the second method requires more space for storing postings.
 Traditionally, expanding the space required for the postings lists was seen as
more disadvantageous, but with modern storage costs, the increased flexibility
that comes from distinct postings lists is appealing.
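The two ways of maintaining term relationships can be compared in a short sketch. The synonym list, function names, and documents are hypothetical, and tokenization is plain whitespace splitting. Query-time expansion merges several postings lists per query term, while index-time expansion stores extra postings up front.

```python
from collections import defaultdict

# Hypothetical hand-constructed synonym list.
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}

def build_index(docs, expand=False):
    """Index unnormalized tokens; optionally also index each token's synonyms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
            if expand:
                for synonym in SYNONYMS.get(token, []):
                    index[synonym].add(doc_id)
    return index

def search_with_query_expansion(index, term):
    """Treat the query term as a disjunction over itself and its synonyms."""
    hits = set()
    for variant in [term] + SYNONYMS.get(term, []):
        hits |= index.get(variant, set())
    return sorted(hits)

def postings_count(index):
    return sum(len(postings) for postings in index.values())

docs = {1: "red car", 2: "used automobile", 3: "red bicycle"}
plain = build_index(docs)                   # method 1: expand at query time
expanded = build_index(docs, expand=True)   # method 2: expand at index time

print(search_with_query_expansion(plain, "car"))  # [1, 2]
print(sorted(expanded["car"]))                    # [1, 2]
print(postings_count(plain), postings_count(expanded))  # 6 8
```

Both methods return the same documents; the difference is where the cost lands, in query processing or in postings storage.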
Normalization
 We may need to “normalize” words in indexed text as well as query words into
the same form
▪ We want to match U.S.A. and USA
 Result is terms: a term is a (normalized) word type, which is an entry in our IR
system dictionary
 We most commonly implicitly define equivalence classes of terms by, e.g.,
▪ deleting periods to form a term
U.S.A., USA
▪ deleting hyphens to form a term
anti-discriminatory, antidiscriminatory
 Alternatively: do asymmetric expansion
➢ window → window, windows
➢ windows → Windows, windows, window
➢ Windows → Windows (no expansion)
 More powerful, but less efficient
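Asymmetric expansion can be sketched as a lookup table from the term the user types to the index entries that should be searched. The table below copies the three rules listed above; the function names and the tiny index are hypothetical.

```python
# Query term -> index entries to search (the three rules listed above).
ASYMMETRIC_EXPANSION = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query_term(term):
    """Return the variants to look up; unknown terms expand to themselves."""
    return ASYMMETRIC_EXPANSION.get(term, [term])

def search(index, term):
    """Union the postings of every variant of the query term."""
    hits = set()
    for variant in expand_query_term(term):
        hits |= index.get(variant, set())
    return sorted(hits)

# Hypothetical index over unnormalized tokens.
index = {"window": {1}, "windows": {2}, "Windows": {3}}

print(search(index, "window"))   # [1, 2]
print(search(index, "windows"))  # [1, 2, 3]
print(search(index, "Windows"))  # [3]
```

The relation is deliberately not symmetric: a query for Windows never reaches documents about glass windows, but a lowercase query still reaches the operating system.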
Normalization: Other Languages
 Diacritics: Diacritics on characters in English have a fairly marginal status, and
we might well want cliché and cliche to match, or naive and naïve.
➢ This can be done by normalizing tokens to remove diacritics. In many other
languages, diacritics are a regular part of the writing system and distinguish
different sounds.
 Accents: Occasionally words are distinguished only by their accents.
▪ Example 1: In Spanish, peña is ‘a cliff’, while pena is ‘sorrow’.
▪ Example 2: French résumé vs. resume.
 Umlauts: German: Tuebingen vs. Tübingen
▪ Should be equivalent
 Most important criterion:
▪ How are your users likely to write their queries for these words?
 Even in languages that standardly have accents, users often may not type
them
▪ Often best to normalize to a de-accented term, i.e., equate all words to a form
without diacritics.
• Tuebingen, Tübingen, Tubingen
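De-accenting can be sketched with Python's standard `unicodedata` module: decompose each character, drop the combining marks, and recompose. The second function, which also folds the German digraphs ae/oe/ue, is a deliberately crude assumption added only to equate the three spellings of Tübingen; it over-merges (for example, it would also alter "neue") and is not a general rule.

```python
import unicodedata

def strip_diacritics(token):
    """Remove combining marks, e.g. 'cliché' -> 'cliche'."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def fold_german(token):
    """Crude sketch: strip umlauts, then fold lowercase ae/oe/ue digraphs."""
    folded = strip_diacritics(token)
    for digraph, plain in (("ae", "a"), ("oe", "o"), ("ue", "u")):
        folded = folded.replace(digraph, plain)
    return folded

print(strip_diacritics("cliché"))  # cliche
print(strip_diacritics("naïve"))   # naive
print(strip_diacritics("peña"))    # pena (now indistinguishable from 'sorrow')

for spelling in ("Tuebingen", "Tübingen", "Tubingen"):
    print(fold_german(spelling))   # Tubingen, three times
```

The Spanish example shows the trade-off: de-accenting helps users who do not type accents, at the price of merging words that the accent alone distinguishes.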
Normalization: Other Languages
 Normalization of things like date forms
▪ 7月30日 vs. 7/30
▪ Japanese use of kana vs. Chinese characters
 Tokenization and normalization may depend on the language and so are
intertwined with language detection

 Morgen will ich in MIT … (is this the German word “mit”?)
 Crucial: need to “normalize” indexed text as well as query terms identically.
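The date forms mentioned in this section can be mapped onto one canonical string, as in the sketch below. It assumes that the slash form is month/day (a locale assumption that does not hold everywhere) and uses a hypothetical `MM-DD` canonical form; the same function would have to run on both indexed text and queries.

```python
import re

# Each pattern captures (month, day). The slash form is ASSUMED to be month/day.
DATE_PATTERNS = [
    re.compile(r"^(\d{1,2})月(\d{1,2})日$"),
    re.compile(r"^(\d{1,2})/(\d{1,2})$"),
]

def normalize_date(token):
    """Return a canonical MM-DD string, or the token unchanged if not a date."""
    for pattern in DATE_PATTERNS:
        match = pattern.match(token)
        if match:
            month, day = int(match.group(1)), int(match.group(2))
            return f"{month:02d}-{day:02d}"
    return token

print(normalize_date("7月30日"))  # 07-30
print(normalize_date("7/30"))     # 07-30
print(normalize_date("MIT"))      # MIT
```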
Case-folding
 Reduce all letters to lower case
➢ exception: upper case in mid-sentence?
▪ e.g., General Motors
➢ The same task can be done more accurately by a machine learning sequence model
which uses more features to make the decision of when to case-fold. This is known
as truecasing.
▪ Fed vs. fed
▪ SAIL vs. sail
➢ Often best to lower case everything, since users will use lower case
regardless of ‘correct’ capitalization…
 Google example:
▪ Query C.A.T.
▪ #1 result is for “cats” (well, Lolcats) not
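The two case-folding policies discussed in this section can be contrasted in a short sketch. The second function is a simple positional heuristic (lowercase only the sentence-initial token and keep mid-sentence capitals); it is an assumption for illustration and is not truecasing, which would use a trained sequence model.

```python
def lowercase_all(tokens):
    """Policy 1: reduce every token to lower case."""
    return [token.lower() for token in tokens]

def fold_sentence_initial(tokens):
    """Policy 2 (heuristic): lowercase the first token, keep the rest as written."""
    if not tokens:
        return []
    return [tokens[0].lower()] + list(tokens[1:])

sentence = ["The", "Fed", "raised", "rates"]

print(lowercase_all(sentence))          # ['the', 'fed', 'raised', 'rates']
print(fold_sentence_initial(sentence))  # ['the', 'Fed', 'raised', 'rates']
```

Under the first policy Fed and fed become the same term; under the second they stay distinct, but a user who types the query in lower case would then miss documents about the Fed.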
Thesauri and Soundex
 Do we handle synonyms and homonyms?
➢ E.g., by hand-constructed equivalence classes
▪ car = automobile, color = colour
➢ We can rewrite to form equivalence-class terms
▪ When the document contains automobile, index it under car-automobile
(and vice-versa)
➢ Homonyms: Jaguar, BlackBerry or blackberry
➢ Or we can expand a query
▪ When the query contains automobile, look under car as well
 What about spelling mistakes?
➢ One approach is Soundex, which forms equivalence classes of words
based on phonetic heuristics.
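A sketch of one common Soundex variant is given below. It keeps the first letter, maps the remaining consonants to digits, collapses adjacent letters with the same code (h and w do not break such a run), drops vowels, and pads the result to four characters. Published variants differ in small details such as the treatment of h and w, and this version assumes plain English words.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return a four-character phonetic code such as 'R163'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")     # vowels (and y) have no code
        if code and code != prev_code:
            result += code
        prev_code = code             # a vowel resets the run
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Herman"), soundex("Hermann"))  # H655 H655
```

Words that share a code fall into the same equivalence class, so a misspelled query such as Herman can still retrieve documents containing Hermann.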
