Stemming and lemmatization can be regarded as a kind of linguistic compression.
In the same sense,
word replacement can be thought of as text normalization or error correction.
But why do we need word replacement? Tokenization, for example, has trouble
with contractions (such as can’t, won’t, etc.). To handle such cases we use word replacement: for
example, we can replace contractions with their expanded forms.
Word replacement using regular expression
First, we are going to replace words that match a regular expression. This requires a
basic understanding of regular expressions and of Python's re module. In the example below, we
replace contractions with their expanded forms (e.g. “can’t” is replaced with “cannot”)
using regular expressions.
Example
First, import the necessary package re to work with regular expressions.
import re
Next, define the replacement patterns of your choice as follows:
R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]
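The generic patterns above rely on a capture group and the \g<1> backreference. The following is a minimal sketch of how that works, using only the standard re module; the sample sentence is made up for illustration. It also suggests why the specific rule for won't is listed before the generic n't rule.

```python
import re

# One of the generic patterns: a word (captured as group 1) followed by 'll.
pattern = re.compile(r"(\w+)'ll")

# \g<1> in the replacement inserts whatever group 1 matched.
expanded = pattern.sub(r"\g<1> will", "they'll say we'll go")
print(expanded)

# Order matters: if only the generic n't rule is applied to "won't",
# group 1 captures "wo", which is why the specific rule must come first.
generic = re.compile(r"(\w+)n't")
print(generic.sub(r"\g<1> not", "won't"))
```

Writing the replacement as a raw string (r'...') keeps Python from treating \g as a string escape.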
Now, create a class that can be used for replacing words:
class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile each pattern once so it can be reused on every call.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the rules in the order in which they were defined.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
Save this Python program (say repRe.py) and run it from the Python command prompt. After running it,
import the REReplacer class when you want to replace words. Let us see how.
from repRe import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'
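Text copied from documents or web pages often contains typographic apostrophes (’) rather than the straight ASCII apostrophe (') that the patterns expect, and in that case no replacement takes place. One possible workaround, sketched below, is to normalize apostrophes first. The helper normalize_apostrophes is a hypothetical name introduced here for illustration and is not part of the original recipe.

```python
import re

# Hypothetical helper (an assumption, not part of the recipe above):
# map typographic apostrophes to the ASCII apostrophe used in the patterns.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u2018": "'"})

def normalize_apostrophes(text):
    """Return text with curly apostrophes replaced by straight ones."""
    return text.translate(APOSTROPHES)

cant = re.compile(r"can't")
raw = "I can\u2019t do it"  # contains a typographic apostrophe

# Without normalization the pattern does not match, so the text is unchanged.
print(cant.sub("cannot", raw))

# After normalization the contraction is expanded.
print(cant.sub("cannot", normalize_apostrophes(raw)))
```

A call such as rep_word.replace(normalize_apostrophes(text)) would combine this step with the REReplacer class.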
Complete implementation example
repRe.py
import re
R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]
class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
Now, once you have saved the above program, you can import the class and use it as follows:
Main.py
from repRe import REReplacer
rep_word = REReplacer()
print(rep_word.replace("I won't do it"))
Output
I will not do it
Replacement before text processing
A common practice in natural language processing (NLP) is to clean up the
text before processing it. To that end, we can use the REReplacer class created in the
previous example as a preliminary step before text processing, i.e. before tokenization.
Example
from nltk.tokenize import word_tokenize
from repRe import REReplacer
rep_word = REReplacer()
print(word_tokenize("I won't be able to do this now"))
print(word_tokenize(rep_word.replace("I won't be able to do this now")))
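Output
['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']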
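In the above Python recipe, we can see the difference between the output of the word
tokenizer without and with the regular expression replacement: without it, won't is split into
the fragments wo and n't, whereas after replacement the tokens are will and not.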
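The idea of replacement as a preliminary step can be sketched as a small pipeline that does not depend on NLTK. This is only an illustration under stated assumptions: the preprocess function and its tokenize parameter are hypothetical names, and str.split is a deliberately naive whitespace stand-in for a real tokenizer such as word_tokenize.

```python
import re

# Two of the contraction rules from this chapter, compiled once.
RULES = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
]

def expand(text):
    """Apply each replacement rule in order."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

def preprocess(text, tokenize=str.split):
    """Expand contractions first, then tokenize.

    tokenize is a hypothetical hook; str.split is only a naive
    whitespace tokenizer used here as a placeholder.
    """
    return tokenize(expand(text))

sentence = "I won't be able to do this now"
print(sentence.split())      # contraction kept as a single token
print(preprocess(sentence))  # contraction expanded before tokenizing
```

Passing tokenize=word_tokenize would plug the NLTK tokenizer into the same pipeline.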
Removal of repeating characters
Are we strictly grammatical in our everyday language? No, we are not. For example, we sometimes
write ‘Hiiiiiiiiiiii Mohan’ in order to emphasize the word ‘Hi’. But a computer system does not know that
‘Hiiiiiiiiiiii’ is a variation of the word ‘Hi’. In the example below, we create a class
named Rep_word_removal which can be used for removing the repeating characters.
Example
First, import the re package to work with regular expressions, and WordNet from NLTK to check whether a word exists.
import re
from nltk.corpus import wordnet
Now, create a class that can be used for removing the repeating characters:
class Rep_word_removal(object):
    def __init__(self):
        # Group 2 is a character that is immediately followed by itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep only one copy of the repeated character.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
Save this Python program (say removalrepeat.py) and run it from the Python command prompt. After
running it, import the Rep_word_removal class when you want to remove the repeating characters. Let us see
how.
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
print("Before: Hiiiiiiiiiiiiiiiiiiiii")
print("Now:")
print(rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii"))
print("Before: Hellooooooooooooooo")
print("Now:")
print(rep_word.replace("Hellooooooooooooooo"))
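To see why the class checks a dictionary before each pass, the logic can be sketched without WordNet. The sketch below assumes a tiny hand-written set of known words (KNOWN_WORDS, a hypothetical stand-in for the wordnet.synsets lookup) and a helper named collapse introduced only for illustration.

```python
import re

# Same pattern as in the class: group 2 is a character followed by itself.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")

# A single pass removes just one character from one doubled pair.
one_pass = repeat_regexp.sub(r"\1\2\3", "Hiii")
print(one_pass)

# Hypothetical stand-in for the WordNet lookup: a tiny set of known words.
KNOWN_WORDS = {"hi", "hello"}

def collapse(word, known=KNOWN_WORDS):
    """Remove repeated characters until the word is known or stops changing."""
    if word.lower() in known:
        return word
    shorter = repeat_regexp.sub(r"\1\2\3", word)
    return collapse(shorter, known) if shorter != word else shorter

# With the vocabulary check, the legitimate double "l" survives.
print(collapse("Hellooooo"))

# With an empty vocabulary, every doubled letter is collapsed.
print(collapse("Hellooooo", known=set()))
```

The dictionary check is what prevents valid words containing double letters from being over-corrected.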
Complete implementation example
removalrepeat.py
import re
from nltk.corpus import wordnet
class Rep_word_removal(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
Now, once you have saved the above program, you can import the class and use it as follows:
Main.py
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
print("Before: Hiiiiiiiiiiiiiiiiiiiii")
print("Now:")
print(rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii"))
print("Before: Hellooooooooooooooo")
print("Now:")
print(rep_word.replace("Hellooooooooooooooo"))
Output
Before: Hiiiiiiiiiiiiiiiiiiiii
Now:
Hi
Before: Hellooooooooooooooo
Now:
Hello