Speech & Natural Language Processing
Curated by
Dr. Tohida Rehman
Assistant Professor
Department of Information Technology
Jadavpur University
Contents
1. Basics of text processing
2. References
Tokenization
Tokenization is the process of breaking up the sequence of characters in a text by locating
the word boundaries, the points where one word ends and another begins.
For computational linguistics purposes, the words thus identified are frequently referred to
as tokens.
In written languages where no word boundaries are explicitly marked in the writing system,
tokenization is also known as word segmentation, and this term is frequently used
synonymously with tokenization.
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text
document into smaller units, such as individual words or terms.
Each of these smaller units is called a token.
Word tokenization: splitting a sentence into a list of words.
Sentence tokenization: splitting a paragraph into a list of sentences.
Issues in tokenization
Finland’s capital → Finland? Finlands? Finland’s?
Hewlett-Packard → Hewlett and Packard as two tokens?
state-of-the-art: break up the hyphenated sequence?
co-education
lowercase, lower-case, lower case?
It can be effective to get the user to put in possible hyphens.
San Francisco: one token or two? How do you decide it is one token?
Some Tokenization methods
White Space Tokenization
Dictionary-based Tokenization
Rule-based Tokenization
Penn Treebank Tokenization
spaCy Tokenizer
Subword Tokenization
Byte Pair Encoding (BPE)
WordPiece Encoding
SentencePiece Encoding
White Space Tokenization
Input: a sentence or paragraph
Output: a list of tokens split on whitespace
sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."
["It's", 'true,', 'Ms.', 'Rehman!', '#like', 'coding.’]
['I', 'went', 'to', 'New-York', 'for', 'attending', 'a',
'conference.’]
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco.']
Fast and simple tokenization technique.
But do all languages separate words with whitespace? Chinese and Japanese, for example, are written with no spaces between words.
San Francisco: one token or two? How do you decide it is one token?
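A minimal sketch: Python's built-in str.split() splits on runs of whitespace, which is exactly whitespace tokenization:

sentence3 = "Last year we have visited San Francisco."
print(sentence3.split())
# ['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco.']

Note that the trailing period stays attached to 'Francisco.', and 'San' and 'Francisco.' come out as two tokens.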
Punctuation-based tokenization
Input: a sentence or paragraph
Output: a list of tokens split on whitespace and punctuation; the punctuation marks are retained as tokens.
sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."
['It', "'", 's', 'true', ',', 'Ms', '.', 'Rehman', '!', '#',
'like', 'coding', '.’]
['I', 'went', 'to', 'New', '-', 'York', 'for', 'attending', 'a',
'conference', '.’]
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco', ‘.’]
But “Ms.”, “Ms”, “MS”, and “mS” can mean different things in different contexts, and this tokenizer conflates them, doesn’t it?
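The output above can be reproduced with NLTK's wordpunct_tokenize, which splits on whitespace and punctuation and keeps the punctuation as tokens:

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("It's true, Ms. Rehman! #like coding."))
# ['It', "'", 's', 'true', ',', 'Ms', '.', 'Rehman', '!', '#', 'like', 'coding', '.']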
Penn TreeBank Tokenization
Input: a sentence or paragraph
Output: a list of tokens that separates punctuation and clitics (It's → It + 's) but keeps hyphenated words together.
sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco.“
['It', "'s", 'true', ',', 'Ms.', 'Rehman', '!', '#', 'like',
'coding', '.’]
['I', 'went', 'to', 'New-York', 'for', 'attending', 'a',
'conference', '.’]
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco',
'.’]
Here “Ms.” and “New-York” are handled the right way, aren’t they?
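A minimal sketch with NLTK's TreebankWordTokenizer, which implements the Penn Treebank conventions shown above:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("I went to New-York for attending a conference."))
# ['I', 'went', 'to', 'New-York', 'for', 'attending', 'a', 'conference', '.']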
spaCy Tokenization
Input: a sentence or paragraph
Output: tokens from spaCy's rule-based tokenizer, which can be customized with your own rules.
These rules include prefix searches, suffix searches, infix searches, URL matches, and special cases.
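A minimal sketch using a blank English pipeline (no trained model download needed); spaCy applies its default prefix/suffix/infix rules and special cases, and each rule set can be overridden:

import spacy

nlp = spacy.blank("en")   # tokenizer-only pipeline
doc = nlp("It's true, Ms. Rehman! #like coding.")
print([token.text for token in doc])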
Normalization
Need to “normalize” terms
Information Retrieval: indexed text & query terms must have the same form.
We want to match U.S.A. and USA.
We implicitly define equivalence classes of terms, e.g., by deleting periods in a term.
Alternative: asymmetric expansion:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows, window
Enter: Windows → Search: Windows
Potentially more powerful, but less efficient
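A one-line sketch of the equivalence-class approach, deleting periods so that U.S.A. and USA index to the same term:

def equivalence_class(term):
    # delete periods: "U.S.A." and "USA" map to the same class
    return term.replace(".", "")

print(equivalence_class("U.S.A.") == equivalence_class("USA"))   # True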
Case folding
Applications like IR: reduce all letters to lower case
Since users tend to use lower case
Possible exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
For sentiment analysis, MT, and information extraction, case is helpful (US versus us is important).
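A quick illustration of what case folding keeps and what it throws away:

for term in ["General Motors", "Fed", "fed", "SAIL", "sail", "US", "us"]:
    print(term, "->", term.lower())   # "US" and "us" become indistinguishable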
Stop words
With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
They have little semantic content: the, a, and, to, be
There are a lot of them: ~30% of postings for the top 30 words
But the trend is away from doing this:
Good compression techniques mean the space for including stop words in a system is very small.
Good query optimization techniques mean you pay little at query time for including stop words.
You need them for:
Phrase queries: “President of India”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to America”
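A minimal sketch of stop-word removal with NLTK's English stop list (requires nltk.download('stopwords')); note that it would destroy the phrase queries above:

from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = ["flights", "to", "america"]
print([t for t in tokens if t not in stops])   # ['flights', 'america']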
Lemmatization
Reduce inflections or variant forms to base form
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
ছেলেটির গাড়িগুলো ভিন্ন রঙের → ছেলে গাড়ি ভিন্ন রঙ (the same example in Bengali)
Lemmatization: have to find the correct dictionary headword form.
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and
to return the base or dictionary form of a word, which is known as the lemma
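A minimal sketch with NLTK's WordNetLemmatizer (requires nltk.download('wordnet')); the pos argument tells it which part of speech to look up, which is part of "doing things properly":

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # car
print(lemmatizer.lemmatize("are", pos="v"))   # be
print(lemmatizer.lemmatize("is", pos="v"))    # be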
Morphology
Morphemes:
The small meaningful units that make up words
Stems: The core meaning-bearing units
Affixes: Bits and pieces that adhere to stems
Often with grammatical functions
Stemming
Reduce terms to their stems in information retrieval
Stemming is crude chopping of affixes
language dependent
e.g., automate(s), automatic, automation all reduced to automat.
for example compressed and compression are both accepted as equivalent to compress.
→ for exampl compress and compress ar both accept as equival to compress.
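A minimal sketch with NLTK's PorterStemmer reproducing the example above:

from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
for word in ["example", "compressed", "compression", "accepted", "equivalent"]:
    print(word, "->", stem(word))
# exampl, compress, compress, accept, equival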
Porter’s algorithm
The most common English stemmer
Step 1a:
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → ø (cats → cat)
Step 1b:
(*v*)ing → ø (walking → walk, but sing → sing)
(*v*)ed → ø (plastered → plaster)
…
Step 2 (for long stems):
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
…
Step 3 (for longer stems):
al → ø (revival → reviv)
able → ø (adjustable → adjust)
ate → ø (activate → activ)
…
Viewing morphology in a corpus
Why only strip –ing if there is a vowel?
(*v*)ing → ø (walking → walk, but sing → sing)
Dealing with complex morphology is sometimes necessary
Some languages require complex morpheme segmentation
Turkish
Uygarlastiramadiklarimizdanmissinizcasina
`(behaving) as if you are among those whom we could not civilize’
Uygar `civilized’ + las `become’
+ tir `cause’ + ama `not able’
+ dik `past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Comparison of 3 stemming methods
Sample text: Such an analysis can reveal features that are not easily visible
from the variations in the individual genes and can lead to a picture of
expression that is more biologically transparent and accessible to interpretation
Lovins stemmer: such an analys can reve featur that ar not eas vis from th
vari in th individu gen and can lead to a pictur of expres that is mor biolog
transpar and acces to interpres
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from
the variat in the individu gene and can lead to a pictur of express that is more
biolog transpar and access to interpret
Paice stemmer: such an analys can rev feat that are not easy vis from the vary
in the individ gen and can lead to a pict of express that is mor biolog transp
and access to interpret
Sentence Segmentation
!, ? are relatively unambiguous
Period “.” is quite ambiguous. It may mark a sentence boundary, or be part of:
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
Build a binary classifier
Looks at a “.”
Decides EndOfSentence/NotEndOfSentence
Classifiers: hand-written rules, regular expressions, or machine-learning
Determining if a word is end-of-sentence: a Decision Tree
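Since hand-written rules are one option, here is a toy rule-based sketch; the abbreviation list and the helper name are illustrative assumptions, not a standard API:

import re

ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "Inc."}   # toy list, far from complete

def is_end_of_sentence(token):
    # a "." is not a boundary in abbreviations or numbers like .02% or 4.3
    if token in ABBREVIATIONS:
        return False
    if re.fullmatch(r"\.?\d+(\.\d+)?%?", token):
        return False
    return token.endswith((".", "!", "?"))

print(is_end_of_sentence("Dr."))      # False
print(is_end_of_sentence(".02%"))     # False
print(is_end_of_sentence("colors."))  # True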
Applications for spelling correction
Word processing, phones, web search
Spelling Tasks
Spelling Error Detection
Spelling Error Correction:
Autocorrect: hte → the
Suggest a correction
Suggestion lists
Types of spelling errors
Non-word Errors
graffe → giraffe
Real-word Errors
Typographical errors
three → there
Cognitive Errors (homophones)
piece → peace
too → two
Rates of spelling errors
26%: Web queries (Wang et al. 2003)
13%: Retyping, no backspace (Whitelaw et al., English & German)
7%: Words corrected retyping on phone-sized organizer
2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
1-2%: Retyping (Kane and Wobbrock 2007; Grudin 1983)
Non-word spelling errors
Non-word spelling error detection:
Any word not in a dictionary is an error
The larger the dictionary the better
Non-word spelling error correction:
Generate candidates: real words that are similar to error
Choose the one which is best:
Shortest weighted edit distance
Highest noisy channel probability
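A minimal sketch of the (unweighted) edit distance used to rank candidates; a weighted version would replace the unit costs with confusion-matrix costs:

def edit_distance(s, t):
    # classic dynamic-programming Levenshtein distance
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("graffe", "giraffe"))   # 1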
Real word spelling errors
For each word w, generate candidate set:
Find candidate words with similar pronunciations
Find candidate words with similar spelling
Include w in candidate set
Choose best candidate
Noisy Channel
Classifier
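A toy sketch of noisy-channel scoring over a candidate set; all probabilities here are made-up numbers for illustration, not real estimates:

# P(w): toy language-model probabilities of the candidates
P_lm = {"there": 0.020, "three": 0.010}
# P(x | w): toy channel probabilities of typing x when w was intended
P_channel = {("three", "there"): 0.001, ("three", "three"): 0.950}

typed = "three"
best = max(P_lm, key=lambda w: P_channel.get((typed, w), 0.0) * P_lm[w])
print(best)   # "three" wins here; a context-aware LM could flip it to "there"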
Reference Books
1. Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing, 3rd Edition.
2. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. 2020. Practical Natural Language Processing. O'Reilly.
4. NPTEL NLP course.
5. https://www.google.co.in/
6. Coursera course: Natural Language Processing