Text Normalization with Regex

NLP - Word Placement

Stemming and lemmatization can be considered a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.

But why do we need word replacement? Tokenization, for example, has trouble with contractions (such as can't and won't). To handle such cases, we can replace contractions with their expanded forms before tokenizing.

Word replacement using regular expression

First, we are going to replace words that match a regular expression. This requires a basic
understanding of regular expressions and of the Python re module. In the example below, we
replace contractions with their expanded forms (e.g. "can't" is replaced with "cannot") using
regular expressions.
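
As a minimal sketch of the mechanism the example relies on, the snippet below uses re.sub with one capture group and the \g<1> backreference. The sample sentence and the variable names are illustrative assumptions, not part of the tutorial's code.

```python
import re

# One capture group: the word that precedes "'ll".
pattern = re.compile(r"(\w+)'ll")

# \g<1> in the replacement stands for whatever group 1 matched.
expanded = pattern.sub(r"\g<1> will", "she'll call and they'll answer")
print(expanded)  # she will call and they will answer
```

The replacement is written as a raw string so that Python passes the backslash through to the re module unchanged.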

Example

First, import the re package to work with regular expressions.

import re

Next, define the replacement patterns of your choice as follows −

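# Specific contractions come first; the generic rules below would otherwise
# split them incorrectly. Replacements are raw strings so that \g<1> reaches
# the re module unchanged.
R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]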
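
The order of the pairs in this list matters, because the rules are applied one after another. The sketch below illustrates this with two of the rules; apply_rules is a hypothetical helper written only for this illustration.

```python
import re

specific = (r"won't", "will not")
generic = (r"(\w+)n't", r"\g<1> not")

def apply_rules(text, rules):
    # Hypothetical helper: apply each (regex, replacement) pair in order.
    for regex, repl in rules:
        text = re.sub(regex, repl, text)
    return text

print(apply_rules("I won't go", [specific, generic]))  # I will not go
print(apply_rules("I won't go", [generic, specific]))  # I wo not go
```

When the generic rule runs first, it captures "wo" as the stem and the specific rule never gets a chance to match.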

Now, create a class that can be used for replacing words −

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile each pattern once so it can be reused on every call.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the substitutions in list order.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s

Save this Python program (say, as repRe.py). You can then import the REReplacer class from the
Python prompt whenever you want to replace words. Let us see how.

from repRe import REReplacer


rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'
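
Text copied from web pages or word processors often contains the typographic apostrophe (U+2019) instead of the ASCII one, and patterns written with the ASCII apostrophe do not match it. A minimal sketch of one way to handle this, assuming a hypothetical helper normalize_apostrophes that runs before the replacement:

```python
import re

def normalize_apostrophes(text):
    # Hypothetical helper: map typographic single quotes to the ASCII apostrophe.
    return text.replace("\u2019", "'").replace("\u2018", "'")

pattern = re.compile(r"can't")
raw = "I can\u2019t do it"  # contains U+2019, not "'"

unchanged = pattern.sub("cannot", raw)
fixed = pattern.sub("cannot", normalize_apostrophes(raw))
print(unchanged == raw)  # True: the pattern did not match
print(fixed)             # I cannot do it
```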

Complete implementation example

repRe.py

import re

R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s

Once you have saved the above program, you can import the class and use it as follows:

Main.py
from repRe import REReplacer
rep_word = REReplacer()
print(rep_word.replace("I won't do it"))
Output

I will not do it

Replacement before text processing

A common practice in natural language processing (NLP) is to clean up the text before
processing it. The REReplacer class created in the previous example can serve as such a
preliminary step before further processing, for instance before tokenization.

Example

from nltk.tokenize import word_tokenize
from repRe import REReplacer

rep_word = REReplacer()
print(word_tokenize("I won't be able to do this now"))
print(word_tokenize(rep_word.replace("I won't be able to do this now")))
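Output

['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']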

Comparing the two outputs of the above Python recipe shows the effect of the regular expression
replacement: without it, the word tokenizer splits the contraction into fragments, while with
it the expanded words are tokenized cleanly.
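
The same effect can be seen without NLTK. The sketch below uses a deliberately simple regex tokenizer, simple_tokenize, as a stand-in for word_tokenize. It is an assumption made for illustration only and does not reproduce NLTK's tokenization rules.

```python
import re

def simple_tokenize(text):
    # Illustrative tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

text = "I won't be able to do this now"

before = simple_tokenize(text)
after = simple_tokenize(re.sub(r"won't", "will not", text))

print(before)  # ['I', 'won', "'", 't', 'be', 'able', 'to', 'do', 'this', 'now']
print(after)   # ['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
```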

Removal of repeating characters

Are we strictly grammatical in our everyday language? No, we are not. For example, we sometimes
write 'Hiiiiiiiiiiii Mohan' to emphasize the word 'Hi'. But a computer system does not know that
'Hiiiiiiiiiiii' is a variation of the word 'Hi'. In the example below, we create a class
named Rep_word_removal which can be used for removing the repeating characters.

Example

First, import the re package to work with regular expressions, and the wordnet corpus reader from NLTK to check whether a word exists.

import re
from nltk.corpus import wordnet

Now, create a class that can be used for removing the repeating characters −

class Rep_word_removal(object):
    def __init__(self):
        # Group 2 followed by \2 matches one doubled character.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep a single copy of the doubled character.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognises the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            # One character was removed; try again on the shorter word.
            return self.replace(repl_word)
        else:
            return repl_word
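
To see the reduction step in isolation, the sketch below runs the same pattern without NLTK. The small KNOWN_WORDS set is a hypothetical stand-in for the WordNet lookup, and remove_repeats is an illustrative function name, not part of the tutorial's class.

```python
import re

# Same pattern as in the class: \2 must repeat the character captured by group 2.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")

# Hypothetical stand-in for the WordNet lookup.
KNOWN_WORDS = {"hi", "hello", "good"}

def remove_repeats(word):
    # Stop as soon as the word is recognised, so legitimate doubles survive.
    while word.lower() not in KNOWN_WORDS:
        shorter = repeat_regexp.sub(r"\1\2\3", word)
        if shorter == word:  # no doubled character left
            break
        word = shorter
    return word

print(remove_repeats("Hiiiiii"))    # Hi
print(remove_repeats("Hellooooo"))  # Hello
print(remove_repeats("goooood"))    # good
```

Each pass removes exactly one repeated character, which is why the vocabulary check has to happen before every pass: it lets "Hello" and "good" keep their legitimate double letters.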

Save this Python program (say, as removalrepeat.py). You can then import the Rep_word_removal
class from the Python prompt whenever you want to remove repeating characters. Let us see
how.

from removalrepeat import Rep_word_removal

rep_word = Rep_word_removal()
print("Before: Hiiiiiiiiiiiiiiiiiiiii")
print("Now:")
print(rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii"))
print("Before: Hellooooooooooooooo")
print("Now:")
print(rep_word.replace("Hellooooooooooooooo"))

Complete implementation example

removalrepeat.py
import re
from nltk.corpus import wordnet

class Rep_word_removal(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Once you have saved the above program, you can import the class and use it as follows:

Main.py
from removalrepeat import Rep_word_removal

rep_word = Rep_word_removal()
print("Before: Hiiiiiiiiiiiiiiiiiiiii")
print("Now:")
print(rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii"))
print("Before: Hellooooooooooooooo")
print("Now:")
print(rep_word.replace("Hellooooooooooooooo"))
Output

Before: Hiiiiiiiiiiiiiiiiiiiii
Now:
Hi
Before: Hellooooooooooooooo
Now:
Hello
