Lecture 7: Normalization
Dr. Subrat Kumar Nayak
Associate Professor
Dept. of CSE, ITER, SOADU
Definitions
Word – A delimited string of characters as it appears in the text.
Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence
class of words.
Token – An instance of a word or term occurring in a document.
Type – The same as a term in most cases: an equivalence class of tokens.
Normalization
Token normalization is the process of canonicalizing tokens so that
matches occur despite superficial differences in the character sequences of
the tokens.
The most standard way to normalize is to implicitly create equivalence
classes, which are normally named after one member of the set.
Example: if anti-discriminatory and antidiscriminatory are both mapped
onto the term antidiscriminatory, in both the document text and queries,
then searches for one term will retrieve documents that contain either.
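A minimal sketch of this idea (the document contents below are made up for illustration): the same mapping is applied when building the index and when looking up a query term, so a search for either spelling retrieves both documents.

```python
# Map each token to its equivalence-class representative (here: hyphen-free form).
def normalize(token):
    return token.replace("-", "")   # anti-discriminatory -> antidiscriminatory

# Toy document collection (illustrative).
docs = {1: ["anti-discriminatory", "law"], 2: ["antidiscriminatory", "policy"]}

# Build a tiny inverted index over normalized terms.
index = {}
for doc_id, tokens in docs.items():
    for tok in tokens:
        index.setdefault(normalize(tok), set()).add(doc_id)

# The query is normalized identically, so either spelling finds both documents.
print(index[normalize("anti-discriminatory")])   # {1, 2}
```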
Normalization
An alternative to creating equivalence classes is to maintain relations between
unnormalized tokens.
This method can be extended to hand-constructed lists of synonyms such as car
and automobile.
These term relationships can be achieved in two ways.
➢ The usual way is to index unnormalized tokens and to maintain a query expansion
list of multiple vocabulary entries to consider for a certain query term. A query
term is then effectively a disjunction of several postings lists.
➢ The alternative is to perform the expansion during index construction. When the
document contains automobile, we index it under car as well.
Use of either of these methods is considerably less efficient than equivalence
classing, as there are more postings to store and merge.
➢ The first method adds a query expansion dictionary and requires more processing
at query time, while the second method requires more space for storing postings.
Traditionally, expanding the space required for the postings lists was seen as
more disadvantageous, but with modern storage costs, the increased flexibility
that comes from distinct postings lists is appealing.
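A rough sketch of the first method, query-time expansion, using a toy in-memory index and a hand-made expansion list (all data here is illustrative): the query term car becomes a disjunction of the car and automobile postings lists. The second method would instead add the docID to the car postings list whenever a document contains automobile.

```python
# Toy inverted index over unnormalized tokens: term -> sorted list of docIDs.
index = {
    "car":        [1, 4, 7],
    "automobile": [2, 4],
}

# Query expansion list: vocabulary entries to consider for a given query term.
expansion = {"car": ["car", "automobile"]}

def postings_for(query_term):
    """Union (disjunction) of the postings lists of all expanded terms."""
    docs = set()
    for term in expansion.get(query_term, [query_term]):
        docs.update(index.get(term, []))
    return sorted(docs)

print(postings_for("car"))   # [1, 2, 4, 7]
```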
Normalization
We may need to “normalize” words in indexed text as well as query words into
the same form
▪ We want to match U.S.A. and USA
Result is terms: a term is a (normalized) word type, which is an entry in our IR
system dictionary
We most commonly implicitly define equivalence classes of terms by, e.g.,
▪ deleting periods to form a term
U.S.A., USA
▪ deleting hyphens to form a term
anti-discriminatory, antidiscriminatory
Alternatively: do asymmetric expansion
➢ window → window, windows
➢ windows → Windows, windows, window
➢ Windows (no expansion)
More powerful, but less efficient
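A small sketch of both ideas from this slide (the expansion dictionary below follows the window/Windows example; everything else is illustrative): deleting periods forms one term for U.S.A. and USA, while asymmetric expansion lets the expansion depend on the exact form of the query term.

```python
import re

# Equivalence classing: delete periods to form a term,
# applied identically to document tokens and query tokens.
def delete_periods(token):
    return re.sub(r"\.", "", token)

assert delete_periods("U.S.A.") == delete_periods("USA") == "USA"

# Asymmetric expansion: more-specific query forms expand less.
asymmetric = {
    "window":  ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],          # no expansion
}

def expand(term):
    return asymmetric.get(term, [term])

print(expand("windows"))  # ['Windows', 'windows', 'window']
```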
Normalization: Other Languages
Diacritics: Diacritics on characters in English have a fairly marginal status, and
we might well want cliché and cliche to match, or naïve and naive.
➢ This can be done by normalizing tokens to remove diacritics. In many other
languages, diacritics are a regular part of the writing system and distinguish
different sounds.
Accents: Occasionally words are distinguished only by their accents.
▪ Example 1: in Spanish, peña is ‘a cliff’, while pena is ‘sorrow’.
▪ Example 2: French résumé vs. resume.
Umlauts: German: Tuebingen vs. Tübingen
▪ Should be equivalent
Most important criterion:
▪ How are your users likely to write their queries for these words?
Even in languages that standardly have accents, users often may not type
them
▪ Often best to normalize to a de-accented term, i.e., equate all words to a form
without diacritics.
• Tuebingen, Tübingen, Tubingen
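One way to equate words with and without diacritics (a sketch using Python’s standard library, not a prescription for any particular IR system) is to decompose each character with Unicode NFKD and drop the combining marks:

```python
import unicodedata

def remove_diacritics(token):
    """Strip accents/umlauts: 'Tübingen' -> 'Tubingen', 'cliché' -> 'cliche'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_diacritics("Tübingen"), remove_diacritics("cliché"))
```

Note that this maps Tübingen to Tubingen but leaves the transliterated form Tuebingen untouched; equating that form as well would need an extra rule such as ue → u.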
Normalization: Other Languages
Normalization of things like date forms
▪ 7月30日 vs. 7/30
▪ Japanese use of kana vs. Chinese characters
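A hedged sketch of date-form normalization along these lines (the patterns and the output form are illustrative assumptions, not from the lecture), mapping 7月30日 and 7/30 to a single month–day term:

```python
import re

def normalize_date(token):
    """Map '7月30日' and '7/30' to the same term, e.g. '07-30'."""
    m = re.fullmatch(r"(\d{1,2})月(\d{1,2})日", token) or \
        re.fullmatch(r"(\d{1,2})/(\d{1,2})", token)
    if m:
        month, day = int(m.group(1)), int(m.group(2))
        return f"{month:02d}-{day:02d}"
    return token

print(normalize_date("7月30日"), normalize_date("7/30"))  # 07-30 07-30
```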
Tokenization and normalization may depend on the language and so are
intertwined with language detection
Example: in “Morgen will ich in MIT…”, is MIT the English acronym or the
German word mit (‘with’)?
Crucial: Need to “normalize” indexed text as well as query terms
identically
Case-folding
Reduce all letters to lower case
➢ exception: upper case in mid-sentence?
▪ e.g., General Motors
➢ The same task can be done more accurately by a machine learning sequence model
which uses more features to make the decision of when to case-fold. This is known
as truecasing.
▪ Fed vs. fed
▪ SAIL vs. sail
➢ Often best to lower-case everything, since users will use lower case
regardless of ‘correct’ capitalization…
Google example:
▪ Query C.A.T.
▪ #1 result is for “cats” (well, Lolcats), not for the acronym C.A.T.
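A toy sketch of case-folding with the mid-sentence heuristic mentioned above (a real truecaser would instead use a machine-learned sequence model with more features):

```python
def casefold_tokens(tokens):
    """Lower-case tokens, but keep mid-sentence capitalized words,
    which are likely proper nouns such as 'General Motors'."""
    out = []
    for i, tok in enumerate(tokens):
        sentence_initial = (i == 0 or tokens[i - 1].endswith((".", "!", "?")))
        if tok[:1].isupper() and not sentence_initial:
            out.append(tok)          # keep, e.g. 'Fed', 'Motors'
        else:
            out.append(tok.lower())  # fold, e.g. 'The' -> 'the'
    return out

print(casefold_tokens("We bought General Motors stock .".split()))
# ['we', 'bought', 'General', 'Motors', 'stock', '.']
```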
Thesauri and Soundex
Do we handle synonyms and homonyms?
➢ E.g., by hand-constructed equivalence classes
▪ car = automobile, color = colour
➢ We can rewrite to form equivalence-class terms
▪ When the document contains automobile, index it under car-automobile
(and vice versa)
➢ Homonyms: Jaguar, BlackBerry or blackberry
➢ Or we can expand a query
▪ When the query contains automobile, look under car as well
What about spelling mistakes?
➢ One approach is Soundex, which forms equivalence classes of words
based on phonetic heuristics.
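For reference, a sketch of the classic American Soundex code (first letter plus up to three digits from a consonant-class mapping, zero-padded); variant spellings that sound alike, such as Herman and Hermann, fall into the same class:

```python
def soundex(word):
    """Classic Soundex: first letter + three digits, e.g. 'Herman' -> 'H655'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:    # skip repeated consonant codes
            result += code
        if ch not in "hw":           # h/w do not break a run of equal codes
            prev = code
    return (result + "000")[:4]      # pad with zeros, truncate to 4 characters

print(soundex("Herman"), soundex("Hermann"))   # H655 H655
```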