Speech & Natural Language Processing
Curated by
Dr. Tohida Rehman
Assistant Professor
Department of Information Technology
Jadavpur University
Contents
1. Basics of text processing
2. References
Tokenization
Tokenization is the process of breaking up the sequence of characters in a text by locating
the word boundaries, the points where one word ends and another begins.
For computational linguistics purposes, the words thus identified are frequently referred to
as tokens.
In written languages where no word boundaries are explicitly marked in the writing system,
tokenization is also known as word segmentation, and this term is frequently used
synonymously with tokenization.
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text
document into smaller units, such as individual words or terms.
Each of these smaller units is called a token.
Word tokenization: splitting a sentence into a list of words.
Sentence tokenization: splitting a paragraph into a list of sentences.
Issues in tokenization
Finland’s capital → Finland? Finlands? Finland’s?
Hewlett-Packard → Hewlett and Packard as two tokens?
state-of-the-art: break up the hyphenated sequence?
co-education
lowercase, lower-case, lower case?
It can be effective to get the user to put in possible hyphens.
San Francisco: one token or two? How do you decide it is one token?
Some Tokenization methods
White Space Tokenization
Dictionary-based Tokenization
Rule-based Tokenization
Penn Treebank Tokenization
spaCy Tokenizer
Subword Tokenization
Byte Pair Encoding (BPE)
WordPiece Encoding
SentencePiece Encoding
White Space Tokenization
Input: a sentence or paragraph
Output: a list of tokens split on whitespace
sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."
["It's", 'true,', 'Ms.', 'Rehman!', '#like', 'coding.’]
['I', 'went', 'to', 'New-York', 'for', 'attending', 'a',
'conference.’]
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco.']
Fast and simple tokenization technique.
But do all languages separate words with whitespace? Chinese and Japanese, for example, are written with no spaces between words.
San Francisco: one token or two? How do you decide it is one token?
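A minimal sketch: Python's built-in str.split() splits on runs of whitespace, which is exactly whitespace tokenization:

sentence3 = "Last year we have visited San Francisco."
print(sentence3.split())
# ['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco.']

Note that the trailing period stays attached to 'Francisco.', and 'San' and 'Francisco.' come out as two tokens.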
Punctuation-based tokenization
Input: a sentence or paragraph
Output: a list of tokens split on whitespace and punctuation; the punctuation marks are retained as tokens.
sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."
['It', "'", 's', 'true', ',', 'Ms', '.', 'Rehman', '!', '#',
'like', 'coding', '.’]
['I', 'went', 'to', 'New', '-', 'York', 'for', 'attending', 'a',
'conference', '.’]
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco', ‘.’]
But “Ms.”, “Ms”, “MS”, and “mS” can mean different things in different contexts, and this tokenizer conflates them, doesn’t it?
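The output above can be reproduced with NLTK's wordpunct_tokenize, which splits on whitespace and punctuation and keeps the punctuation as tokens:

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("It's true, Ms. Rehman! #like coding."))
# ['It', "'", 's', 'true', ',', 'Ms', '.', 'Rehman', '!', '#', 'like', 'coding', '.']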
Penn TreeBank Tokenization
Input: a sentence or paragraph
Output: a list of tokens that separates punctuation and clitics (It's → It + 's) but keeps hyphenated words together.
sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco.“
['It', "'s", 'true', ',', 'Ms.', 'Rehman', '!', '#', 'like',
'coding', '.’]
['I', 'went', 'to', 'New-York', 'for', 'attending', 'a',
'conference', '.’]
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco',
'.’]
Here “Ms.” and “New-York” are handled the right way, aren’t they?
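A minimal sketch with NLTK's TreebankWordTokenizer, which implements the Penn Treebank conventions shown above:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("I went to New-York for attending a conference."))
# ['I', 'went', 'to', 'New-York', 'for', 'attending', 'a', 'conference', '.']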
spaCy Tokenization
Input: a sentence or paragraph
Output: tokens from spaCy's rule-based tokenizer, which can be customized with your own rules.
These rules include prefix searches, suffix searches, infix searches, URL matches, and special cases.
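A minimal sketch using a blank English pipeline (no trained model download needed); spaCy applies its default prefix/suffix/infix rules and special cases, and each rule set can be overridden:

import spacy

nlp = spacy.blank("en")   # tokenizer-only pipeline
doc = nlp("It's true, Ms. Rehman! #like coding.")
print([token.text for token in doc])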
Normalization
Need to “normalize” terms
Information Retrieval: indexed text & query terms must have the same form.
We want to match U.S.A. and USA.
We implicitly define equivalence classes of terms, e.g., by deleting periods in a term.
Alternative: asymmetric expansion:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows, window
Enter: Windows → Search: Windows
Potentially more powerful, but less efficient
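A one-line sketch of the equivalence-class approach, deleting periods so that U.S.A. and USA index to the same term:

def equivalence_class(term):
    # delete periods: "U.S.A." and "USA" map to the same class
    return term.replace(".", "")

print(equivalence_class("U.S.A.") == equivalence_class("USA"))   # True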
Case folding
Applications like IR: reduce all letters to lower case
Since users tend to use lower case
Possible exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
For sentiment analysis, MT, and information extraction, case is helpful (US versus us is important).
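A quick illustration of what case folding keeps and what it throws away:

for term in ["General Motors", "Fed", "fed", "SAIL", "sail", "US", "us"]:
    print(term, "->", term.lower())   # "US" and "us" become indistinguishable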
Stop words
With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
They have little semantic content: the, a, and, to, be
There are a lot of them: ~30% of postings for the top 30 words
But the trend is away from doing this:
Good compression techniques mean the space for including stop words in a system is very small.
Good query optimization techniques mean you pay little at query time for including stop words.
You need them for:
Phrase queries: “President of India”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to America”
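A minimal sketch of stop-word removal with NLTK's English stop list (requires nltk.download('stopwords')); note that it would destroy the phrase queries above:

from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = ["flights", "to", "america"]
print([t for t in tokens if t not in stops])   # ['flights', 'america']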
Lemmatization
Reduce inflections or variant forms to base form
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
ছেলেটির গাড়িগুলো ভিন্ন রঙের → ছেলে গাড়ি ভিন্ন রঙ (the same example in Bengali)
Lemmatization: have to find the correct dictionary headword form.
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and
to return the base or dictionary form of a word, which is known as the lemma
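A minimal sketch with NLTK's WordNetLemmatizer (requires nltk.download('wordnet')); the pos argument tells it which part of speech to look up, which is part of "doing things properly":

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # car
print(lemmatizer.lemmatize("are", pos="v"))   # be
print(lemmatizer.lemmatize("is", pos="v"))    # be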
Morphology
Morphemes:
The small meaningful units that make up words
Stems: The core meaning-bearing units
Affixes: Bits and pieces that adhere to stems
Often with grammatical functions
Stemming
Reduce terms to their stems in information retrieval
Stemming is crude chopping of affixes
language dependent
e.g., automate(s), automatic, automation all reduced to automat.
for example compressed and compression are both accepted as equivalent to compress.
→ for exampl compress and compress ar both accept as equival to compress.
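A minimal sketch with NLTK's PorterStemmer reproducing the example above:

from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
for word in ["example", "compressed", "compression", "accepted", "equivalent"]:
    print(word, "->", stem(word))
# exampl, compress, compress, accept, equival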
Porter’s algorithm
The most common English stemmer
Step 1a:
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → ø (cats → cat)
Step 1b:
(*v*)ing → ø (walking → walk, but sing → sing)
(*v*)ed → ø (plastered → plaster)
…
Step 2 (for long stems):
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
…
Step 3 (for longer stems):
al → ø (revival → reviv)
able → ø (adjustable → adjust)
ate → ø (activate → activ)
…
Viewing morphology in a corpus
Why only strip –ing if there is a vowel?
(*v*)ing → ø (walking → walk, but sing → sing)
Dealing with complex morphology is sometimes necessary
Some languages require complex morpheme segmentation
Turkish
Uygarlastiramadiklarimizdanmissinizcasina
`(behaving) as if you are among those whom we could not civilize’
Uygar `civilized’ + las `become’
+ tir `cause’ + ama `not able’
+ dik `past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Comparison of 3 stemming methods
Sample text: Such an analysis can reveal features that are not easily visible
from the variations in the individual genes and can lead to a picture of
expression that is more biologically transparent and accessible to interpretation
Lovins stemmer: such an analys can reve featur that ar not eas vis from th
vari in th individu gen and can lead to a pictur of expres that is mor biolog
transpar and acces to interpres
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from
the variat in the individu gene and can lead to a pictur of express that is more
biolog transpar and access to interpret
Paice stemmer: such an analys can rev feat that are not easy vis from the vary
in the individ gen and can lead to a pict of express that is mor biolog transp
and access to interpret
Sentence Segmentation
!, ? are relatively unambiguous
Period “.” is quite ambiguous. It may mark a sentence boundary, or be part of:
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
Build a binary classifier
Looks at a “.”
Decides EndOfSentence/NotEndOfSentence
Classifiers: hand-written rules, regular expressions, or machine-learning
Determining if a word is end-of-sentence: a Decision Tree
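Since hand-written rules are one option, here is a toy rule-based sketch; the abbreviation list and the helper name are illustrative assumptions, not a standard API:

import re

ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "Inc."}   # toy list, far from complete

def is_end_of_sentence(token):
    # a "." is not a boundary in abbreviations or numbers like .02% or 4.3
    if token in ABBREVIATIONS:
        return False
    if re.fullmatch(r"\.?\d+(\.\d+)?%?", token):
        return False
    return token.endswith((".", "!", "?"))

print(is_end_of_sentence("Dr."))      # False
print(is_end_of_sentence(".02%"))     # False
print(is_end_of_sentence("colors."))  # True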
Applications for spelling correction
Word processing, phones, web search
Spelling Tasks
Spelling Error Detection
Spelling Error Correction:
Autocorrect: hte → the
Suggest a correction
Suggestion lists
Types of spelling errors
Non-word Errors
graffe → giraffe
Real-word Errors
Typographical errors
three → there
Cognitive Errors (homophones)
piece → peace
too → two
Rates of spelling errors
26%: Web queries (Wang et al. 2003)
13%: Retyping, no backspace (Whitelaw et al., English & German)
7%: Words corrected retyping on phone-sized organizer
2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
1-2%: Retyping (Kane and Wobbrock 2007; Grudin 1983)
Non-word spelling errors
Non-word spelling error detection:
Any word not in a dictionary is an error
The larger the dictionary the better
Non-word spelling error correction:
Generate candidates: real words that are similar to error
Choose the one which is best:
Shortest weighted edit distance
Highest noisy channel probability
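A minimal sketch of the (unweighted) edit distance used to rank candidates; a weighted version would replace the unit costs with confusion-matrix costs:

def edit_distance(s, t):
    # classic dynamic-programming Levenshtein distance
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("graffe", "giraffe"))   # 1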
Real word spelling errors
For each word w, generate candidate set:
Find candidate words with similar pronunciations
Find candidate words with similar spelling
Include w in candidate set
Choose best candidate
Noisy Channel
Classifier
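A toy sketch of noisy-channel scoring over a candidate set; all probabilities here are made-up numbers for illustration, not real estimates:

# P(w): toy language-model probabilities of the candidates
P_lm = {"there": 0.020, "three": 0.010}
# P(x | w): toy channel probabilities of typing x when w was intended
P_channel = {("three", "there"): 0.001, ("three", "three"): 0.950}

typed = "three"
best = max(P_lm, key=lambda w: P_channel.get((typed, w), 0.0) * P_lm[w])
print(best)   # "three" wins here; a context-aware LM could flip it to "there"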
Reference Books
1. Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing, 3rd Edition.
2. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. 2020. Practical Natural Language Processing. O'Reilly.
4. NPTEL NLP course.
5. https://www.google.co.in/
6. Coursera course: Natural Language Processing