Text Normalization with Regex

NLP - Word Placement

Uploaded by

cyrelljoyvertudes


Stemming and lemmatization can be thought of as a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.

Why do we need word replacement? Tokenization has trouble with contractions (such as can't and won't). Word replacement handles this: for example, we can replace contractions with their expanded forms before tokenizing.
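As a rough illustration of the problem, the sketch below uses a deliberately naive regex tokenizer (a hypothetical helper named naive_tokenize, not an NLTK function) to show how a contraction gets split into fragments, while its expanded form stays intact.

```python
import re

def naive_tokenize(text):
    """Hypothetical helper: split text into runs of word characters."""
    return re.findall(r"\w+", text)

# The apostrophe is not a word character, so "can't" breaks into two pieces.
split_tokens = naive_tokenize("I can't do it")
print(split_tokens)   # ['I', 'can', 't', 'do', 'it']

# The expanded form survives as a single token.
clean_tokens = naive_tokenize("I cannot do it")
print(clean_tokens)   # ['I', 'cannot', 'do', 'it']
```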

Word replacement using regular expressions

First, we will replace words that match a regular expression. This requires a basic understanding of regular expressions and of Python's re module. In the example below, we replace contractions with their expanded forms (e.g. "can't" becomes "cannot") using regular expressions.

Example

First, import the re package to work with regular expressions.

import re

Next, define the replacement patterns of your choice as follows −

# Specific contractions come first; the general patterns follow.
R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]
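Two details of the pattern list are worth seeing in isolation: \g<1> in a replacement string inserts whatever the first group captured, and the order of the patterns matters. The short sketch below uses only re.sub; the sample sentences are made up for illustration.

```python
import re

# \g<1> stands for the text captured by the first group, here "they".
expanded = re.sub(r"(\w+)'ll", r"\g<1> will", "they'll come")
print(expanded)        # they will come

# If the general n't pattern ran before the specific won't pattern,
# "won't" would be split into "wo" + "not".
generic_first = re.sub(r"(\w+)n't", r"\g<1> not", "I won't go")
print(generic_first)   # I wo not go
```

This is why the list handles won't and can't before the general n't rule.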

Now, create a class that can be used for replacing words −

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile each regex once so that replace() can reuse it.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply every (pattern, replacement) pair in list order.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s

Save this Python program (say, as repRe.py). You can then import the REReplacer class whenever you want to replace words. Let us see how.

from repRe import REReplacer

rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'
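Text pasted from word processors or web pages often contains the typographic apostrophe (’) rather than the ASCII one ('), and the patterns above only match the ASCII form. A minimal sketch of one way to handle this, assuming a hypothetical helper named normalize_apostrophes that runs before replacement:

```python
import re

def normalize_apostrophes(text):
    """Hypothetical helper: map typographic quotes to the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")

cant = re.compile(r"can't")
raw = "I can\u2019t do it"          # contains the typographic apostrophe

untouched = cant.sub("cannot", raw)
fixed = cant.sub("cannot", normalize_apostrophes(raw))

print(untouched == raw)   # True: the pattern did not match
print(fixed)              # I cannot do it
```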

Complete implementation example

repRe.py

import re

R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile each regex once so that replace() can reuse it.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply every (pattern, replacement) pair in list order.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s

Once you have saved the above program, you can import the class and use it as follows −

Main.py

from repRe import REReplacer

rep_word = REReplacer()
print(rep_word.replace("I won't do it"))

Output

I will not do it
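Because the constructor accepts a patterns argument, the same class can be reused with a different list. The sketch below is a self-contained variant, not the class as saved in repRe.py: it adds re.IGNORECASE (an assumption made here so that "I'm" matches as well as "i'm"), and the custom pattern list is invented for illustration.

```python
import re

class REReplacer:
    def __init__(self, patterns):
        # Variation on the tutorial class: compile case-insensitively.
        self.patterns = [(re.compile(regex, re.IGNORECASE), repl)
                         for regex, repl in patterns]

    def replace(self, text):
        for pattern, repl in self.patterns:
            text = pattern.sub(repl, text)
        return text

# Hypothetical custom patterns; \b keeps matches on word boundaries.
custom_patterns = [
    (r"\bi'm\b", "I am"),
    (r"\bgonna\b", "going to"),
]

result = REReplacer(custom_patterns).replace("I'm gonna try")
print(result)   # I am going to try
```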

Replacement before text processing

A common practice in natural language processing (NLP) is to clean up the text before processing it. The REReplacer class created in the previous example can serve as such a preliminary step before tokenization.

Example

from nltk.tokenize import word_tokenize
from repRe import REReplacer

rep_word = REReplacer()
print(word_tokenize("I won't be able to do this now"))
print(word_tokenize(rep_word.replace("I won't be able to do this now")))

Output

['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']

The output shows the difference between tokenizing without and with regular expression replacement: without it, the tokenizer splits "won't" into "wo" and "n't"; with it, the tokens are the ordinary words "will" and "not".

Removal of repeating characters

Are we strictly grammatical in our everyday language? No, we are not. For example, we sometimes write 'Hiiiiiiiiiiii Mohan' to emphasize the word 'Hi'. But a computer system does not know that 'Hiiiiiiiiiiii' is a variation of the word 'Hi'. In the example below, we create a class named Rep_word_removal that can be used to remove the repeating characters.

Example

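First, import the necessary packages: re to work with regular expressions, and WordNet from NLTK to check whether a word exists. The WordNet corpus must be available locally; it can be fetched once with nltk.download('wordnet').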

import re
from nltk.corpus import wordnet

Now, create a class that can be used for removing the repeating characters −

class Rep_word_removal(object):
    def __init__(self):
        # (\w)\2 matches a character followed by a second copy of itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep everything except the second copy.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            # One duplicate was removed; check the shorter word again.
            return self.replace(repl_word)
        else:
            return repl_word
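To see why replace() has to call itself, it helps to look at what a single substitution does. The sketch below leaves out the WordNet check and uses a hypothetical helper named squeeze_once, so it runs with the standard library alone: each pass drops just one duplicated character.

```python
import re

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

def squeeze_once(word):
    """Hypothetical helper: apply the substitution a single time."""
    return repeat_regexp.sub(r'\1\2\3', word)

# Repeat until the word stops changing, recording every step.
steps = ["Hiiii"]
while True:
    shorter = squeeze_once(steps[-1])
    if shorter == steps[-1]:
        break
    steps.append(shorter)

print(steps)   # ['Hiiii', 'Hiii', 'Hii', 'Hi']
```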

Save this Python program (say, as removalrepeat.py). You can then import the Rep_word_removal class whenever you want to remove repeating characters. Let us see how.

from removalrepeat import Rep_word_removal

rep_word = Rep_word_removal()
print("Before: Hiiiiiiiiiiiiiiiiiiiii")
print("Now:")
print(rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii"))
print("Before: Hellooooooooooooooo")
print("Now:")
print(rep_word.replace("Hellooooooooooooooo"))
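The dictionary lookup is what keeps legitimate double letters intact. The sketch below makes that visible without NLTK by swapping WordNet for a small hand-written word set; KNOWN_WORDS and remove_repeats are stand-ins invented for this illustration, not part of the class above.

```python
import re

KNOWN_WORDS = {"hi", "hello"}   # hypothetical stand-in for wordnet.synsets()
REPEAT = re.compile(r'(\w*)(\w)\2(\w*)')

def remove_repeats(word, known=KNOWN_WORDS):
    # Stop as soon as the word is recognized.
    if word.lower() in known:
        return word
    shorter = REPEAT.sub(r'\1\2\3', word)
    # Recurse while the substitution still changes the word.
    return remove_repeats(shorter, known) if shorter != word else shorter

with_lookup = remove_repeats("Hellooooo")
without_lookup = remove_repeats("Hellooooo", known=set())

print(with_lookup)      # Hello
print(without_lookup)   # Helo
```

Without the lookup, the recursion keeps going past "Hello" and collapses the double "l" as well.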
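
Complete implementation example

removalrepeat.py

import re
from nltk.corpus import wordnet

class Rep_word_removal(object):
    def __init__(self):
        # (\w)\2 matches a character followed by a second copy of itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep everything except the second copy.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            # One duplicate was removed; check the shorter word again.
            return self.replace(repl_word)
        else:
            return repl_word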

Once you have saved the above program, you can import the class and use it as follows −

Main.py

from removalrepeat import Rep_word_removal

rep_word = Rep_word_removal()
print("Before: Hiiiiiiiiiiiiiiiiiiiii")
print("Now:")
print(rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii"))
print("Before: Hellooooooooooooooo")
print("Now:")
print(rep_word.replace("Hellooooooooooooooo"))

Output

Before: Hiiiiiiiiiiiiiiiiiiiii
Now:
Hi
Before: Hellooooooooooooooo
Now:
Hello
