NLP Exp-123
Experiment-01 Date :

Demonstrate noise removal for textual data and remove a regular-expression pattern, such as hashtags, from the text.

Description: NLP is a branch of data science that consists of systematic processes for
analyzing, understanding, and deriving information from text data in a smart and efficient
manner.

By utilizing NLP and its components, one can organize massive chunks of text data and perform numerous automated tasks to solve a wide range of problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Before moving further, I would like to explain some terms that are used:

• Tokenization – the process of splitting a text into tokens (illustrated in the sketch after this list)

• Tokens – words or entities present in the text

• Text object – a sentence, a phrase, a word, or an article

• Regular expressions (RegEx) – a sequence of characters mainly used to find or replace patterns present in text. In simple words, a regular expression is a set of characters, or a pattern, that is used to find substrings in a given string.
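To make these terms concrete, here is a minimal sketch (assuming NLTK is installed, since it is used in the later experiments) that tokenizes a small text object and applies a simple regular expression:

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

text = "NLP helps us analyze #text data."  # a small text object

# Tokenization: the text object becomes a list of tokens
print(word_tokenize(text))  # ['NLP', 'helps', 'us', 'analyze', '#', 'text', 'data', '.']

# RegEx: find every hashtag-like pattern in the raw string
print(re.findall(r'#\w+', text))  # ['#text']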

Noise Removal

• Any piece of text that is not relevant to the context of the data and the end output can be treated as noise.

For example: language stopwords (commonly used words of a language – is, am, the, of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step deals with removing all such noisy entities from the text.

A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating the tokens that appear in the noise dictionary.
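As a hedged illustration of this dictionary-based approach (the stopword set below is a small assumed sample, not an exhaustive list), the following sketch removes noisy tokens by lookup:

noise_set = {"is", "am", "the", "of", "in", "this"}  # assumed sample of noisy entities

def remove_noise(text):
    words = text.split()
    # Keep only the tokens that do not appear in the noise set
    clean_words = [word for word in words if word.lower() not in noise_set]
    return ' '.join(clean_words)

print(remove_noise("this is a sample text"))  # a sample text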
The experiment itself targets a specific regular-expression pattern, the hashtag; following is the Python code for that purpose.

Input Text:

"We had an amazing time at the #beach! The #sunset was mesmerizing. Can't wait to go
back! #vacation #fun"

Python Code:

import re

# Sample text
text = "We had an amazing time at the #beach! The #sunset was mesmerizing. Can't wait to go back! #vacation #fun"

# Regular expression to remove hashtags
cleaned_text = re.sub(r'#\w+', '', text)

# Remove extra whitespace left behind
cleaned_text = ' '.join(cleaned_text.split())

print("Original Text:", text)
print("Cleaned Text:", cleaned_text)

Output:

Original Text: We had an amazing time at the #beach! The #sunset was mesmerizing. Can't wait to go back! #vacation #fun

Cleaned Text: We had an amazing time at the ! The was mesmerizing. Can't wait to go back!
Experiment-02 Date :

Perform lemmatization and stemming using the Python library NLTK.

Explanation:

• Stemming reduces words to their root form but may not produce meaningful words.

• Lemmatization uses a dictionary to reduce words to their base or canonical form and ensures the result is a valid word.

Lexicon Normalization

Another type of textual noise is the multiple representations exhibited by a single word.

For example, "play", "player", "played", "plays", and "playing" are different variations of the word "play". Though they differ in form, contextually they are all similar. This step converts all such variants of a word into their normalized form (also known as the lemma).

Normalization is a pivotal step for feature engineering with text, as it collapses high-dimensional features (N different surface forms) into a low-dimensional space (a single feature), which is ideal for any ML model.
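As a quick, hedged illustration of this collapse (using NLTK's PorterStemmer, introduced formally below), several surface forms map to a single stem:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
variants = ["play", "played", "plays", "playing"]
# All four surface forms collapse into the single feature 'play'
print({word: stemmer.stem(word) for word in variants})
# {'play': 'play', 'played': 'play', 'plays': 'play', 'playing': 'play'}
# Note: rule-based stemming is approximate; a form like "player" is not reduced.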

The most common lexicon normalization practices are:

• Stemming: a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.

• Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).

Example Code:

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download required resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Sample text
text = "The geese are flying in the sky. He studies and is studying diligently."

# Tokenize the text
words = word_tokenize(text)

# Initialize Stemmer and Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print results
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)

Output:

Original Words: ['The', 'geese', 'are', 'flying', 'in', 'the', 'sky', '.', 'He', 'studies', 'and', 'is',
'studying', 'diligently', '.']

Stemmed Words: ['the', 'gees', 'are', 'fli', 'in', 'the', 'sky', '.', 'he', 'studi', 'and', 'is', 'studi',
'dilig', '.']

Lemmatized Words: ['The', 'goose', 'are', 'flying', 'in', 'the', 'sky', '.', 'He', 'study', 'and', 'is',
'studying', 'diligently', '.']

Observations:

• Stemming: reduces "flying" to "fli" and "studies" to "studi", which are not actual words.

• Lemmatization: converts "geese" to "goose" and "studies" to "study", retaining meaningful words. "flying" and "studying" are left unchanged because the lemmatizer treats every token as a noun by default (see the sketch below).
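A minimal, hedged sketch of how supplying a part-of-speech tag changes the lemmatizer's behavior (pos="v" marks the token as a verb; the outputs shown in comments are the expected WordNet results):

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

# With the default noun POS, verb forms pass through unchanged
print(lemmatizer.lemmatize("studying"))           # studying
# Declaring the token a verb yields the true base form
print(lemmatizer.lemmatize("studying", pos="v"))  # study
print(lemmatizer.lemmatize("flying", pos="v"))    # fly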
Experiment-03 Date :

Demonstrate object standardization, such as replacing social media slang in a text.

Example:

We will standardize the text by replacing common social media slang terms like "lol,"
"brb," or "omg" with their proper meanings.

Python Code:

import re

# Define a dictionary of slang terms and their standard equivalents
slang_dict = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "omg": "oh my god",
    "idk": "I don't know",
    "btw": "by the way",
    "ttyl": "talk to you later",
    "smh": "shaking my head"
}

# Sample text with slang
text = "OMG, I can't believe this! LOL. BRB, let me check. IDK what to say. BTW, talk to you later. SMH."

# Replace slang terms with their standard equivalents
def replace_slang(text, slang_dict):
    # Create a regular expression that matches any slang key as a whole word, case-insensitively
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in slang_dict.keys()) + r')\b', re.IGNORECASE)
    # Replace each matched slang term with the corresponding value from the dictionary
    standardized_text = pattern.sub(lambda x: slang_dict[x.group().lower()], text)
    return standardized_text

# Perform slang replacement
standardized_text = replace_slang(text, slang_dict)

print("Original Text:", text)
print("Standardized Text:", standardized_text)

Output :

Original Text: OMG, I can't believe this! LOL. BRB, let me check. IDK what to say. BTW,
talk to you later. SMH.

Standardized Text: oh my god, I can't believe this! laughing out loud. be right back, let
me check. I don't know what to say. by the way, talk to you later. shaking my head.

Explanation:

• The slang_dict holds common slang terms as keys and their meanings as values.

• A regular expression with \b word boundaries finds slang terms in the text, matching case-insensitively.

• Each matched slang term is replaced with its meaning from the dictionary; a refinement sketch follows below.
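One limitation visible in the output above is that sentence-initial capitalization is lost ("OMG," becomes "oh my god,"). Below is a minimal, hedged sketch of one way to preserve a leading capital; this refinement is an assumption layered on top of the experiment, not part of it:

import re

slang_dict = {"omg": "oh my god", "lol": "laughing out loud"}  # small assumed sample
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, slang_dict)) + r')\b', re.IGNORECASE)

def replace_match(match):
    replacement = slang_dict[match.group().lower()]
    # If the matched slang term started with a capital letter, capitalize the expansion
    if match.group()[0].isupper():
        replacement = replacement[0].upper() + replacement[1:]
    return replacement

print(pattern.sub(replace_match, "OMG, that was close. lol."))
# Oh my god, that was close. laughing out loud.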
