EXPERIMENT NO: 3
Aim: Apply various other text preprocessing techniques for any given text: Stop
Word Removal, Lemmatization / Stemming.
I. Abstract
This experiment demonstrates fundamental text preprocessing techniques in Natural Language
Processing (NLP), specifically focusing on stop word removal, lemmatization, and stemming.
These techniques are vital for cleaning and simplifying textual data before applying advanced
analysis or machine learning models. Stop word removal filters out common, insignificant
words, thereby reducing noise in the text. Lemmatization converts words to their dictionary base
form by considering grammatical context, while stemming reduces words to their root form
using heuristic rules. By applying these methods to a sample sentence, the experiment highlights
how each technique impacts tokenized text, aiding in better understanding and representation of
data. These preprocessing steps are essential in enhancing the accuracy and efficiency of NLP
systems.
II. Introduction
1. Stop Word Removal: Stop words are commonly used words in a language that carry little
semantic meaning on their own, such as "the," "is," "in," "and," etc. Removing stop words from a
text can reduce the size of the data and focus on the more significant words that carry the
essential meaning. This step is crucial in text preprocessing, especially in natural language
processing (NLP), as it helps in simplifying the data and improving the efficiency of machine
learning models.
2. Lemmatization: Lemmatization is a text normalization technique that reduces words to their
base or root form, known as the lemma. For instance, the words "running," "ran," and "runs"
would all be reduced to the lemma "run." Lemmatization considers the context of the word and
uses a dictionary to find the correct base form. This process helps in reducing the complexity of
the text data and ensuring that different forms of a word are treated as a single entity, thus
improving the accuracy of NLP tasks.
3. Stemming: Stemming is another text normalization technique that reduces words to their root
form by removing suffixes. Unlike lemmatization, stemming does not consider the context or use
a dictionary, and therefore, it may produce non-existent words. For example, "running,"
"runner," and "ran" might all be reduced to "run" or "runn." While stemming is faster and
simpler than lemmatization, it is often less accurate.
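A quick sketch with NLTK's PorterStemmer shows this behavior. The stemmer needs no downloaded dictionary, since it only applies suffix-stripping rules, and it can produce non-words such as "studi":

```python
from nltk.stem import PorterStemmer

# Rule-based suffix stripping; no corpus or dictionary required
stemmer = PorterStemmer()
for word in ["running", "runner", "studies"]:
    print(word, "->", stemmer.stem(word))
```

Note that "studies" becomes "studi", which is not an English word, illustrating why stemming trades accuracy for speed compared with lemmatization.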
III. Implementation
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Download necessary NLTK data (run once)
# nltk.download('punkt')
# nltk.download('punkt_tab')
# nltk.download('stopwords')
# nltk.download('wordnet')

def preprocess_text(text):
    print(f"Sentence: \"{text}\"")

    # Tokenization with explicit language
    tokens = word_tokenize(text, language='english')
    print(f"Tokens: {tokens}")

    # Stop word removal: drop stop words and non-alphabetic tokens
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens
                       if word.lower() not in stop_words and word.isalpha()]
    print(f"After stop word removal: {filtered_tokens}")

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    print(f"Lemmatization: {lemmatized_tokens}")

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    print(f"Stemming: {stemmed_tokens}")

# Example text: a tongue twister containing several stop words
text = "How can a clam cram in a clean cream can?"
preprocess_text(text)
Output:
IV. Conclusion
Stop word removal, lemmatization, and stemming are essential preprocessing techniques in NLP that help simplify and normalize text data. Stop word removal eliminates low-information words, while lemmatization and stemming reduce words to their base forms, making the text easier to analyze and improving the performance of machine learning models. These techniques are foundational for preparing text data for further analysis or model training.