Experiment-01 Date :
Demonstrate noise removal for any textual data and remove regular expression patterns,
such as hashtags, from the textual data.
Description: NLP is a branch of data science that consists of systematic processes for
analyzing, understanding, and deriving information from text data in a smart and efficient
manner.
By utilizing NLP and its components, one can organize massive chunks of text data,
perform numerous automated tasks, and solve a wide range of problems such as
automatic summarization, machine translation, named entity recognition, relationship
extraction, sentiment analysis, speech recognition, and topic segmentation.
Before moving further, I would like to explain some terms that are used:
• Tokenization – process of converting a text into tokens
• Tokens – words or entities present in the text
• Text object – a sentence or a phrase or a word or an article
• Regular expression (RegEx) – a sequence of characters mainly used to find or replace
patterns present in text; in simple words, a set of characters or a pattern used to find
substrings in a given string (illustrated in the short sketch below)
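To make these terms concrete, here is a minimal sketch; the sample text and the pattern
are illustrative assumptions, not part of the experiment itself.
import re
# A text object (here, a sentence); the sample text is an assumption for illustration
text = "Tokenization splits text into tokens such as words"
# Tokenization: convert the text into tokens (a simple whitespace split)
tokens = text.split()
print(tokens)  # ['Tokenization', 'splits', 'text', 'into', 'tokens', 'such', 'as', 'words']
# A regular expression that finds substrings matching a pattern:
# every word beginning with 't' or 'T'
print(re.findall(r'\b[tT]\w+', text))  # ['Tokenization', 'text', 'tokens']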
Noise Removal
• Any piece of text that is not relevant to the context of the data and the end output can
be treated as noise.
For example: language stopwords (commonly used words of a language such as is, am,
the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and
industry-specific words. This step deals with the removal of all types of noisy entities
present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities and
iterate over the text object token by token (or word by word), eliminating the tokens that
are present in the noise dictionary.
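A minimal sketch of this dictionary-based approach follows; the noise list and the sample
sentence are illustrative assumptions.
# A small, illustrative dictionary (set) of noisy entities such as stopwords
noise_list = {"is", "am", "the", "of", "in", "this", "a"}
def remove_noise(text, noise_list):
    # Iterate the text object by tokens, dropping tokens found in the noise dictionary
    tokens = text.split()
    clean_tokens = [t for t in tokens if t.lower() not in noise_list]
    return " ".join(clean_tokens)
print(remove_noise("this is a sample text", noise_list))  # sample text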
The experiment itself removes hashtags with a regular expression instead; following is
the Python code for that purpose.
Input Text:
"We had an amazing time at the #beach! The #sunset was mesmerizing. Can't wait to go
back! #vacation #fun"
Python Code:
import re
# Sample text
text = "We had an amazing time at the #beach! The #sunset was mesmerizing. Can't
wait to go back! #vacation #fun"
# Regular expression to remove hashtags
cleaned_text = re.sub(r'#\w+', '', text)
# Remove extra whitespace left behind
cleaned_text = ' '.join(cleaned_text.split())
print("Original Text:", text)
print("Cleaned Text:", cleaned_text)
Output:
Original Text: We had an amazing time at the #beach! The #sunset was mesmerizing.
Can't wait to go back! #vacation #fun
Cleaned Text: We had an amazing time at the ! The was mesmerizing. Can't wait to go
back!
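The same re.sub call generalizes to the other social media entities mentioned above,
such as mentions and URLs. A short sketch, where the sample text is an illustrative
assumption:
import re
text = "Loved it! Thanks @alice, more at https://example.com #travel"
# Remove mentions (@user), hashtags (#tag), and URLs in a single pass
cleaned = re.sub(r'@\w+|#\w+|https?://\S+', '', text)
cleaned = ' '.join(cleaned.split())  # normalize the leftover whitespace
print(cleaned)  # Loved it! Thanks , more at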
Experiment-02 Date :
Perform lemmatization and stemming using the Python library NLTK.
Explanation:
• Stemming reduces words to their root form but may not result in meaningful
words.
• Lemmatization uses a dictionary to reduce words to their base or canonical
form and ensures the result is a valid word.
Lexicon Normalization
Another type of textual noise arises from the multiple representations exhibited by a
single word.
For example, "play", "player", "played", "plays" and "playing" are different variations of
the word "play". Though their forms differ, contextually they are all similar. This step
converts all such variants of a word into their normalized form (also known as a lemma).
Normalization is a pivotal step for feature engineering with text, as it converts high
dimensional features (N different features) into a low dimensional space (1 feature),
which is ideal for any ML model.
The most common lexicon normalization practices are:
• Stemming: a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es",
"s", etc.) from a word.
• Lemmatization: an organized, step-by-step procedure for obtaining the root form of a
word; it makes use of vocabulary (dictionary importance of words) and morphological
analysis (word structure and grammatical relations).
Example Code:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
# Download required resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Sample text
text = "The geese are flying in the sky. He studies and is studying diligently."
# Tokenize the text
words = word_tokenize(text)
# Initialize Stemmer and Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Perform stemming and lemmatization
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
# Print results
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)
Output:
Original Words: ['The', 'geese', 'are', 'flying', 'in', 'the', 'sky', '.', 'He', 'studies', 'and', 'is',
'studying', 'diligently', '.']
Stemmed Words: ['the', 'gees', 'are', 'fli', 'in', 'the', 'sky', '.', 'he', 'studi', 'and', 'is', 'studi',
'dilig', '.']
Lemmatized Words: ['The', 'goose', 'are', 'flying', 'in', 'the', 'sky', '.', 'He', 'study', 'and', 'is',
'studying', 'diligently', '.']
Observations:
• Stemming: Reduces "flying" to "fli" and "studies" to "studi," which aren't actual
words.
• Lemmatization: Converts "geese" to "goose" and "studies" to "study," retaining
meaningful words.
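Lemmatization leaves "flying" and "studying" unchanged above because
WordNetLemmatizer treats every token as a noun by default; passing a part-of-speech
tag changes the result. A small sketch of this behaviour:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# With the default noun POS, verb forms pass through unchanged
print(lemmatizer.lemmatize("studying"))           # studying
# With pos='v' (verb), the same words reduce to their base verb
print(lemmatizer.lemmatize("studying", pos="v"))  # study
print(lemmatizer.lemmatize("flying", pos="v"))    # fly
print(lemmatizer.lemmatize("are", pos="v"))       # be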
Experiment-03 Date :
Demonstrate object standardization, such as replacing social media slang in a text.
Example:
We will standardize the text by replacing common social media slang terms like "lol,"
"brb," or "omg" with their proper meanings.
Python Code:
import re
# Define a dictionary of slang terms and their standard equivalents
slang_dict = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "omg": "oh my god",
    "idk": "I don't know",
    "btw": "by the way",
    "ttyl": "talk to you later",
    "smh": "shaking my head"
}
# Sample text containing slang
text = "OMG, I can't believe this! LOL. BRB, let me check. IDK what to say. BTW, talk to you later. SMH."
# Replace slang terms with their standard equivalents
def replace_slang(text, slang_dict):
    # Build a regular expression that matches any slang term, case-insensitively
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in slang_dict.keys()) + r')\b',
                         re.IGNORECASE)
    # Replace each matched slang term with the corresponding dictionary value
    standardized_text = pattern.sub(lambda x: slang_dict[x.group().lower()], text)
    return standardized_text
# Perform slang replacement
standardized_text = replace_slang(text, slang_dict)
print("Original Text:", text)
print("Standardized Text:", standardized_text)
Output:
Original Text: OMG, I can't believe this! LOL. BRB, let me check. IDK what to say. BTW,
talk to you later. SMH.
Standardized Text: oh my god, I can't believe this! laughing out loud. be right back, let
me check. I don't know what to say. by the way, talk to you later. shaking my head.
Explanation:
• The slang_dict holds common slang terms as keys and their meanings as values.
• A regular expression finds slang terms in the text, ensuring case-insensitive matching.
• Matched slang terms are replaced with their meanings from the dictionary.
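One refinement suggested by the output above: the expansion loses sentence-initial
capitalization ("OMG," becomes "oh my god,"). A small sketch, reusing the same
slang_dict, that uppercases the expansion when the matched slang term was capitalized:
def replace_slang_preserving_case(text, slang_dict):
    # Same pattern as above: match any slang term, case-insensitively
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in slang_dict) + r')\b',
                         re.IGNORECASE)
    def expand(match):
        replacement = slang_dict[match.group().lower()]
        # Capitalize the expansion when the slang term itself was capitalized
        if match.group()[0].isupper():
            return replacement[0].upper() + replacement[1:]
        return replacement
    return pattern.sub(expand, text)
print(replace_slang_preserving_case(text, slang_dict))
# Oh my god, I can't believe this! Laughing out loud. Be right back, ...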