Experiment-01 Date :
Demonstrate noise removal for any textual data and remove regular expression patterns,
such as hashtags, from the textual data.
Description: NLP is a branch of data science that consists of systematic processes for
analyzing, understanding, and deriving information from text data in a smart and efficient
manner.
By utilizing NLP and its components, one can organize massive chunks of text data,
perform numerous automated tasks, and solve a wide range of problems such as
automatic summarization, machine translation, named entity recognition, relationship
extraction, sentiment analysis, speech recognition, and topic segmentation.
Before moving further, I would like to explain some terms that are used:
• Tokenization – process of converting a text into tokens
• Tokens – words or entities present in the text
• Text object – a sentence or a phrase or a word or an article
• Regular expression (RegEx) – a sequence of characters mainly used to find or replace
patterns present in text; in simple words, a set of characters or a pattern used to find
substrings in a given string (illustrated in the short sketch below)
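To make these terms concrete, here is a minimal sketch; the sample text and the pattern
are illustrative assumptions, not part of the experiment itself.
import re
# A text object (here, a sentence); the sample text is an assumption for illustration
text = "Tokenization splits text into tokens such as words"
# Tokenization: convert the text into tokens (a simple whitespace split)
tokens = text.split()
print(tokens)  # ['Tokenization', 'splits', 'text', 'into', 'tokens', 'such', 'as', 'words']
# A regular expression that finds substrings matching a pattern:
# every word beginning with 't' or 'T'
print(re.findall(r'\b[tT]\w+', text))  # ['Tokenization', 'text', 'tokens']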
Noise Removal
• Any piece of text that is not relevant to the context of the data and the end output can
be treated as noise.
For example: language stopwords (commonly used words of a language such as is, am,
the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and
industry-specific words. This step deals with the removal of all types of noisy entities
present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities and
iterate over the text object token by token (or word by word), eliminating the tokens that
are present in the noise dictionary.
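A minimal sketch of this dictionary-based approach follows; the noise list and the sample
sentence are illustrative assumptions.
# A small, illustrative dictionary (set) of noisy entities such as stopwords
noise_list = {"is", "am", "the", "of", "in", "this", "a"}
def remove_noise(text, noise_list):
    # Iterate the text object by tokens, dropping tokens found in the noise dictionary
    tokens = text.split()
    clean_tokens = [t for t in tokens if t.lower() not in noise_list]
    return " ".join(clean_tokens)
print(remove_noise("this is a sample text", noise_list))  # sample text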
The experiment itself removes hashtags with a regular expression instead; following is
the Python code for that purpose.
Input Text:
"We had an amazing time at the #beach! The #sunset was mesmerizing. Can't wait to go
back! #vacation #fun"
Python Code:
import re
# Sample text
text = "We had an amazing time at the #beach! The #sunset was mesmerizing. Can't
wait to go back! #vacation #fun"
# Regular expression to remove hashtags
cleaned_text = re.sub(r'#\w+', '', text)
# Remove extra whitespace left behind
cleaned_text = ' '.join(cleaned_text.split())
print("Original Text:", text)
print("Cleaned Text:", cleaned_text)
Output:
Original Text: We had an amazing time at the #beach! The #sunset was mesmerizing.
Can't wait to go back! #vacation #fun
Cleaned Text: We had an amazing time at the ! The was mesmerizing. Can't wait to go
back!
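The same re.sub call generalizes to the other social media entities mentioned above,
such as mentions and URLs. A short sketch, where the sample text is an illustrative
assumption:
import re
text = "Loved it! Thanks @alice, more at https://example.com #travel"
# Remove mentions (@user), hashtags (#tag), and URLs in a single pass
cleaned = re.sub(r'@\w+|#\w+|https?://\S+', '', text)
cleaned = ' '.join(cleaned.split())  # normalize the leftover whitespace
print(cleaned)  # Loved it! Thanks , more at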
Experiment-02 Date :
Perform lemmatization and stemming using the Python library NLTK.
Explanation:
• Stemming reduces words to their root form but may not result in meaningful
words.
• Lemmatization uses a dictionary to reduce words to their base or canonical
form and ensures the result is a valid word.
Lexicon Normalization
Another type of textual noise arises from the multiple representations exhibited by a
single word.
For example, "play", "player", "played", "plays" and "playing" are different variations of
the word "play". Though their forms differ, contextually they are all similar. This step
converts all such variants of a word into their normalized form (also known as a lemma).
Normalization is a pivotal step for feature engineering with text, as it converts high
dimensional features (N different features) into a low dimensional space (1 feature),
which is ideal for any ML model.
The most common lexicon normalization practices are:
• Stemming: a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es",
"s", etc.) from a word.
• Lemmatization: an organized, step-by-step procedure for obtaining the root form of a
word; it makes use of vocabulary (dictionary importance of words) and morphological
analysis (word structure and grammatical relations).
Example Code:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
# Download required resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Sample text
text = "The geese are flying in the sky. He studies and is studying diligently."
# Tokenize the text
words = word_tokenize(text)
# Initialize Stemmer and Lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Perform stemming and lemmatization
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
# Print results
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)
Output:
Original Words: ['The', 'geese', 'are', 'flying', 'in', 'the', 'sky', '.', 'He', 'studies', 'and', 'is',
'studying', 'diligently', '.']
Stemmed Words: ['the', 'gees', 'are', 'fli', 'in', 'the', 'sky', '.', 'he', 'studi', 'and', 'is', 'studi',
'dilig', '.']
Lemmatized Words: ['The', 'goose', 'are', 'flying', 'in', 'the', 'sky', '.', 'He', 'study', 'and', 'is',
'studying', 'diligently', '.']
Observations:
• Stemming: Reduces "flying" to "fli" and "studies" to "studi," which aren't actual
words.
• Lemmatization: Converts "geese" to "goose" and "studies" to "study," retaining
meaningful words.
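Lemmatization leaves "flying" and "studying" unchanged above because
WordNetLemmatizer treats every token as a noun by default; passing a part-of-speech
tag changes the result. A small sketch of this behaviour:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# With the default noun POS, verb forms pass through unchanged
print(lemmatizer.lemmatize("studying"))           # studying
# With pos='v' (verb), the same words reduce to their base verb
print(lemmatizer.lemmatize("studying", pos="v"))  # study
print(lemmatizer.lemmatize("flying", pos="v"))    # fly
print(lemmatizer.lemmatize("are", pos="v"))       # be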
Experiment-03 Date :
Demonstrate object standardization, such as replacing social media slang in a text.
Example:
We will standardize the text by replacing common social media slang terms like "lol,"
"brb," or "omg" with their proper meanings.
Python Code:
import re
# Define a dictionary of slang terms and their standard equivalents
slang_dict = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "omg": "oh my god",
    "idk": "I don't know",
    "btw": "by the way",
    "ttyl": "talk to you later",
    "smh": "shaking my head"
}
# Sample text containing slang
text = "OMG, I can't believe this! LOL. BRB, let me check. IDK what to say. BTW, talk to you later. SMH."
# Replace slang terms with their standard equivalents
def replace_slang(text, slang_dict):
    # Build a regular expression that matches any slang term, case-insensitively
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in slang_dict.keys()) + r')\b',
                         re.IGNORECASE)
    # Replace each matched slang term with the corresponding dictionary value
    standardized_text = pattern.sub(lambda x: slang_dict[x.group().lower()], text)
    return standardized_text
# Perform slang replacement
standardized_text = replace_slang(text, slang_dict)
print("Original Text:", text)
print("Standardized Text:", standardized_text)
Output:
Original Text: OMG, I can't believe this! LOL. BRB, let me check. IDK what to say. BTW,
talk to you later. SMH.
Standardized Text: oh my god, I can't believe this! laughing out loud. be right back, let
me check. I don't know what to say. by the way, talk to you later. shaking my head.
Explanation:
• The slang_dict holds common slang terms as keys and their meanings as values.
• A regular expression finds slang terms in the text, ensuring case-insensitive matching.
• Matched slang terms are replaced with their meanings from the dictionary.
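One refinement suggested by the output above: the expansion loses sentence-initial
capitalization ("OMG," becomes "oh my god,"). A small sketch, reusing the same
slang_dict, that uppercases the expansion when the matched slang term was capitalized:
def replace_slang_preserving_case(text, slang_dict):
    # Same pattern as above: match any slang term, case-insensitively
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in slang_dict) + r')\b',
                         re.IGNORECASE)
    def expand(match):
        replacement = slang_dict[match.group().lower()]
        # Capitalize the expansion when the slang term itself was capitalized
        if match.group()[0].isupper():
            return replacement[0].upper() + replacement[1:]
        return replacement
    return pattern.sub(expand, text)
print(replace_slang_preserving_case(text, slang_dict))
# Oh my god, I can't believe this! Laughing out loud. Be right back, ...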