
Speech & Natural Language Processing

Curated by
Dr. Tohida Rehman
Assistant Professor
Department of Information Technology
Jadavpur University
Contents

1. Basics of text processing


2. References
Tokenization
• Tokenization is the process of breaking up the sequence of characters in a text by locating the word boundaries, the points where one word ends and another begins.
• For computational linguistics purposes, the words thus identified are frequently referred to as tokens.
• In written languages where no word boundaries are explicitly marked in the writing system, tokenization is also known as word segmentation, and this term is frequently used synonymously with tokenization.
• Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms.
• Each of these smaller units is called a token.
• Word tokenization: a sentence is broken into a set of words.
• Sentence tokenization: a paragraph is broken into a set of sentences.
Issues in tokenization
• Finland’s capital → Finland? Finlands? Finland’s?
• Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up the hyphenated sequence?
  • co-education
  • lowercase, lower-case, lower case?
• It’s effective to get the user to put in possible hyphens
• San Francisco: one token or two? How do you decide it is one token?
Some tokenization methods
• White Space Tokenization
• Dictionary-based Tokenization
• Rule-based Tokenization
• Penn Treebank Tokenization
• spaCy Tokenizer
• Subword Tokenization
  • Byte Pair Encoding
  • WordPiece Encoding
  • SentencePiece Encoding
White Space Tokenization
• Input: a sentence or paragraph
• Output: a list of tokens split on whitespace

sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."

["It's", 'true,', 'Ms.', 'Rehman!', '#like', 'coding.']
['I', 'went', 'to', 'New-York', 'for', 'attending', 'a', 'conference.']
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco.']

• A fast and simple tokenization technique.
• Do all languages separate words with whitespace? Chinese and Japanese write no spaces between words.
• San Francisco: one token or two? How do you decide it is one token?
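A minimal sketch of whitespace tokenization in Python, using only the built-in str.split (the sentence is reused from above):

    # Whitespace tokenization with Python's built-in split().
    sentence1 = "It's true, Ms. Rehman! #like coding."
    tokens = sentence1.split()    # splits on runs of whitespace
    print(tokens)
    # ["It's", 'true,', 'Ms.', 'Rehman!', '#like', 'coding.']

Note that punctuation stays glued to the neighbouring word, which is exactly the weakness the next method addresses.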
Punctuation-based tokenization
• Input: a sentence or paragraph
• Output: a list of tokens split on whitespace and punctuation; the punctuation marks are retained as tokens.

sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."

['It', "'", 's', 'true', ',', 'Ms', '.', 'Rehman', '!', '#', 'like', 'coding', '.']
['I', 'went', 'to', 'New', '-', 'York', 'for', 'attending', 'a', 'conference', '.']
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco', '.']

• “Ms.”, “Ms”, “MS”, and “mS” can have different meanings in different contexts, so blindly splitting off the period loses information.
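The outputs above correspond to what NLTK's wordpunct_tokenize produces; a minimal sketch, assuming NLTK is installed:

    from nltk.tokenize import wordpunct_tokenize   # pip install nltk

    tokens = wordpunct_tokenize("It's true, Ms. Rehman! #like coding.")
    print(tokens)
    # ['It', "'", 's', 'true', ',', 'Ms', '.', 'Rehman', '!', '#', 'like', 'coding', '.']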
Penn TreeBank Tokenization
• Input: a sentence or paragraph
• Output: a list of tokens split on punctuation and clitics (It's → It + 's), but with hyphenated words kept together.

sentence1 = "It's true, Ms. Rehman! #like coding."
sentence2 = "I went to New-York for attending a conference."
sentence3 = "Last year we have visited San Francisco."

['It', "'s", 'true', ',', 'Ms.', 'Rehman', '!', '#', 'like', 'coding', '.']
['I', 'went', 'to', 'New-York', 'for', 'attending', 'a', 'conference', '.']
['Last', 'year', 'we', 'have', 'visited', 'San', 'Francisco', '.']

• “Ms.” and “New-York” are now handled the right way.
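A minimal sketch using NLTK's TreebankWordTokenizer, which implements the Penn Treebank conventions shown above (assumes NLTK is installed):

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    print(tokenizer.tokenize("It's true, Ms. Rehman! #like coding."))
    # ['It', "'s", 'true', ',', 'Ms.', 'Rehman', '!', '#', 'like', 'coding', '.']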


spaCy Tokenization
• Input: a sentence or paragraph
• Output: a list of tokens; spaCy also lets us create our own tokenizer with our own customized rules.
• These rules are prefix searches, suffix searches, infix searches, URL searches, and defining special cases.
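A minimal sketch of spaCy's default English tokenization, using a blank pipeline so no trained model download is needed (assumes spacy is installed):

    import spacy                   # pip install spacy

    nlp = spacy.blank("en")        # blank pipeline: tokenizer only
    doc = nlp("It's true, Ms. Rehman! #like coding.")
    print([token.text for token in doc])
    # expected: ['It', "'s", 'true', ',', 'Ms.', 'Rehman', '!', '#', 'like', 'coding', '.']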
Normalization
• Need to “normalize” terms
  • Information Retrieval: indexed text and query terms must have the same form.
  • We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
  • e.g., deleting periods in a term
• Alternative: asymmetric expansion:
  • Enter: window   Search: window, windows
  • Enter: windows  Search: Windows, windows, window
  • Enter: Windows  Search: Windows
• Potentially more powerful, but less efficient
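A minimal sketch of equivalence classing by deleting periods, so that U.S.A. and USA map to the same index term (the function name normalize is ours, for illustration):

    def normalize(term: str) -> str:
        # One possible equivalence class: drop periods and case-fold.
        return term.replace(".", "").lower()

    print(normalize("U.S.A.") == normalize("USA"))   # True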
Case folding
• Applications like IR: reduce all letters to lower case
  • Since users tend to use lower case
  • Possible exception: upper case in mid-sentence?
    e.g., General Motors
    Fed vs. fed
    SAIL vs. sail
• For sentiment analysis, MT, and information extraction, case is helpful (US versus us is important)
Stop words
• With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
  • They have little semantic content: the, a, and, to, be
  • There are a lot of them: ~30% of postings for the top 30 words
• But the trend is away from doing this:
  • Good compression techniques mean the space for including stop words in a system is very small
  • Good query optimization techniques mean you pay little at query time for including stop words.
  • You need them for:
    • Phrase queries: “President of India”
    • Various song titles, etc.: “Let it be”, “To be or not to be”
    • “Relational” queries: “flights to America”
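A minimal sketch of stop-word removal with NLTK's English stop list (assumes NLTK is installed; the download line fetches the list once):

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)     # one-time corpus download
    stop_set = set(stopwords.words("english"))

    tokens = "the boy went to the market".split()
    print([t for t in tokens if t not in stop_set])
    # ['boy', 'went', 'market']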
Lemmatization
• Reduce inflections or variant forms to the base form
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
  • The same example in Bengali: ছেলেটির গাড়িগুলো ভিন্ন রঙের → ছেলের গাড়ির রং ভিন্ন
• Lemmatization: we have to find the correct dictionary headword form.
• Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
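A minimal sketch with NLTK's WordNetLemmatizer; the pos argument tells it which part of speech to use for the dictionary lookup (assumes NLTK and its wordnet data are installed):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)         # one-time corpus download
    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("cars"))              # car
    print(lemmatizer.lemmatize("are", pos="v"))      # be
    print(lemmatizer.lemmatize("is", pos="v"))       # be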
Morphology
• Morphemes: the small meaningful units that make up words
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems, often with grammatical functions
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
  • language dependent
  • e.g., automate(s), automatic, automation all reduced to automat.

  for example compressed and compression are both accepted as equivalent to compress
  → for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm
The most common English stemmer

Step 1a
  sses → ss     caresses  → caress
  ies  → i      ponies    → poni
  ss   → ss     caress    → caress
  s    → ø      cats      → cat

Step 1b
  (*v*)ing → ø  walking   → walk
                sing      → sing
  (*v*)ed  → ø  plastered → plaster

Step 2 (for long stems)
  ational → ate  relational → relate
  izer    → ize  digitizer  → digitize
  ator    → ate  operator   → operate

Step 3 (for longer stems)
  al   → ø      revival    → reviv
  able → ø      adjustable → adjust
  ate  → ø      activate   → activ
  …
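A minimal sketch with NLTK's PorterStemmer, reproducing the Step 1 examples above (assumes NLTK is installed; note that the full algorithm applies every step, so long words may be trimmed further than any single rule suggests):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["caresses", "ponies", "caress", "cats", "walking", "sing", "plastered"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress, ponies -> poni, caress -> caress, cats -> cat,
    # walking -> walk, sing -> sing, plastered -> plaster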

Viewing morphology in a corpus
Why only strip –ing if there is a vowel?

  (*v*)ing → ø  walking → walk
                sing    → sing
Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
• Turkish:

  Uygarlastiramadiklarimizdanmissinizcasina
  ‘(behaving) as if you are among those whom we could not civilize’

  Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’
  + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Comparison of three stemming methods
• Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
• Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
• Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
• Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
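A minimal sketch comparing two of these stemmers in NLTK; Lovins is not shipped with NLTK, but the Paice/Husk stemmer is available as LancasterStemmer (assumes NLTK is installed):

    from nltk.stem import LancasterStemmer, PorterStemmer

    text = ("such an analysis can reveal features that are not easily visible "
            "from the variations in the individual genes")
    porter, paice = PorterStemmer(), LancasterStemmer()

    print(" ".join(porter.stem(w) for w in text.split()))
    print(" ".join(paice.stem(w) for w in text.split()))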
Sentence Segmentation
• ! and ? are relatively unambiguous
• The period “.” is quite ambiguous:
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Build a binary classifier that
  • looks at a “.”
  • decides EndOfSentence/NotEndOfSentence
  • Classifiers: hand-written rules, regular expressions, or machine learning
Determining if a word is end-of-sentence: a Decision Tree
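(The decision tree itself is a figure on the original slide and is not reproduced here.) In practice, a pre-trained sentence tokenizer already encodes such decisions; a minimal sketch with NLTK's Punkt model, which handles abbreviations and decimal numbers (assumes NLTK can fetch its punkt data):

    import nltk

    nltk.download("punkt", quiet=True)    # one-time model download
    text = "Dr. Rehman works at Jadavpur University. Prices rose .02% in 4.3 years!"
    print(nltk.sent_tokenize(text))
    # ['Dr. Rehman works at Jadavpur University.', 'Prices rose .02% in 4.3 years!']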
Applications for spelling correction

• Word processing
• Phones
• Web search
Spelling Tasks
• Spelling Error Detection
• Spelling Error Correction:
  • Autocorrect: hte → the
  • Suggest a correction
  • Suggestion lists
Types of spelling errors
• Non-word Errors
  • graffe → giraffe
• Real-word Errors
  • Typographical errors: three → there
  • Cognitive Errors (homophones): piece → peace, too → two
Rates of spelling errors
• 26%: Web queries (Wang et al. 2003)
• 13%: Retyping, no backspace (Whitelaw et al., English & German)
• 7%: Words corrected while retyping on a phone-sized organizer
• 2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
• 1-2%: Retyping (Kane and Wobbrock 2007, Grudin 1983)
Non-word spelling errors
• Non-word spelling error detection:
  • Any word not in a dictionary is an error
  • The larger the dictionary, the better
• Non-word spelling error correction:
  • Generate candidates: real words that are similar to the error
  • Choose the one which is best:
    • Shortest weighted edit distance
    • Highest noisy channel probability
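A minimal sketch of candidate generation in the style of Peter Norvig's well-known spelling corrector: enumerate every string one edit away from the misspelling, then keep those that appear in the dictionary (edits1 and candidates are our own illustrative names):

    import string

    def edits1(word):
        # All strings one edit away: deletes, transposes, replaces, inserts.
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def candidates(word, dictionary):
        # Keep only the one-edit strings that are real words.
        return edits1(word) & dictionary

    print(candidates("graffe", {"giraffe", "gaffe", "grace", "graft"}))
    # {'giraffe', 'gaffe'} (set order may vary)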
Real word spelling errors
• For each word w, generate a candidate set:
  • Find candidate words with similar pronunciations
  • Find candidate words with similar spelling
  • Include w in the candidate set
• Choose the best candidate with:
  • the Noisy Channel model
  • a Classifier
Reference Books
1. Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing, 3rd Edition.
2. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
3. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. 2020. Practical Natural Language Processing. O'Reilly.
4. NPTEL NLP course.
5. https://www.google.co.in/
6. Coursera course: Natural Language Processing.
