Detailed Explanation of the Code
Step 1: Import Necessary Libraries
import nltk
import string
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
nltk: Natural Language Toolkit, used for text processing tasks.
string: For string operations like removing punctuation.
re: Regular expressions for pattern matching, used for removing numbers.
collections.Counter: For counting word frequencies.
nltk.corpus.stopwords: To remove common words that don’t add much meaning (e.g.,
“and”, “the”).
nltk.tokenize.word_tokenize: To split text into words.
nltk.stem.porter.PorterStemmer: For stemming, which reduces words to their root form.
nltk.stem.WordNetLemmatizer: For lemmatization, which reduces words to their base form.
wordcloud.WordCloud: For generating word clouds.
matplotlib.pyplot: For plotting the word cloud.
PIL.Image and numpy: For handling images and arrays, although not used in this specific
task.
Step 2: Ensure Required NLTK Data is Downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
Downloads the NLTK resources required for tokenization ('punkt'), stop-word filtering ('stopwords'), and lemmatization ('wordnet').
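As an optional refinement (not part of the original code), each resource can be fetched only when it is actually missing, so repeated runs do not re-download anything. Note that some newer NLTK releases also expect a 'punkt_tab' resource for word_tokenize and will raise a LookupError naming it if it is absent.
# Optional: download each NLTK resource only if it is not already installed.
# (If a LookupError mentions 'punkt_tab', add ("tokenizers/punkt_tab", "punkt_tab") to the list.)
for path, resource in [("tokenizers/punkt", "punkt"),
                       ("corpora/stopwords", "stopwords"),
                       ("corpora/wordnet", "wordnet")]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)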
Example Essay
input_str = """
In recent years, the movement toward organic food has gained significant momentum...
"""
An example text on organic food and sustainability to demonstrate text processing and word cloud
generation.
Step 3: Text Lowercase
def text_lowercase(text):
    return text.lower()

input_str = text_lowercase(input_str)
Converts all text to lowercase so that differently cased forms of the same word (e.g., "Organic" and "organic") are counted together.
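A quick, hypothetical check of text_lowercase() on a made-up sentence (not taken from the essay):
sample = "Organic Food is 100% GREAT!"
print(text_lowercase(sample))   # organic food is 100% great!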
Step 4: Remove Numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

input_str = remove_numbers(input_str)
Removes any numbers from the text using regular expressions.
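A hypothetical check of remove_numbers() on a made-up sentence; the digits vanish while the surrounding spaces remain (whitespace is cleaned up in a later step):
sample = "organic food is 100% great in 2024!"
print(remove_numbers(sample))   # organic food is % great in !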
Step 5: Remove Punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = remove_punctuation(input_str)
Removes punctuation from the text using str.translate.
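A hypothetical check of remove_punctuation(); every character in string.punctuation (such as ':', ',' and '!') is stripped out:
sample = "organic food: healthy, sustainable, and popular!"
print(remove_punctuation(sample))   # organic food healthy sustainable and popular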
Step 6: Remove Whitespace from Text
def remove_whitespace(text):
    return " ".join(text.split())

input_str = remove_whitespace(input_str)
Removes extra whitespace by splitting and rejoining the text.
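A hypothetical check of remove_whitespace(); split() breaks the text on any run of spaces, tabs, or newlines, and join() rebuilds it with single spaces:
sample = "organic   food \n has   gained \t momentum"
print(remove_whitespace(sample))   # organic food has gained momentum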
Step 7: Remove Stop Words
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return ' '.join(filtered_text)

input_str = remove_stopwords(input_str)
Removes common stop words to focus on meaningful words.
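A hypothetical check of remove_stopwords(); "is", "a", and "of" appear in NLTK's English stop-word list and are dropped, while the content words survive:
sample = "organic food is a movement of recent years"
print(remove_stopwords(sample))   # organic food movement recent years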
Step 8: Stemming
stemmer = PorterStemmer()

def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return ' '.join(stems)

stemmed_str = stem_words(input_str)
Reduces each word to its root form using the Porter stemmer; the resulting stems are not always valid English words.
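A hypothetical check of stem_words() on a few made-up tokens; note that a stem such as "studi" is not a real word, which is the main drawback of stemming:
print(stem_words("studies running cats"))   # studi run cat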
Step 9: Lemmatization
lemmatizer = WordNetLemmatizer()

def lemma_words(text):
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return ' '.join(lemmas)

lemmatized_str = lemma_words(input_str)
Reduces each word to its base dictionary form (lemma) using the WordNet lemmatizer.
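A hypothetical check of lemma_words(); the lemmatizer returns real dictionary words, but because no part-of-speech tag is passed it treats every token as a noun, so verbs like "running" are left unchanged unless pos='v' is given explicitly:
print(lemma_words("studies running cats"))        # study running cat
print(lemmatizer.lemmatize("running", pos="v"))   # run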
Step 10: Create a Word Cloud of the Top 40 Words
def create_wordcloud(text):
    # Tokenize words and count frequency
    word_tokens = word_tokenize(text)
    word_freq = Counter(word_tokens)

    # Get the top 40 words
    top_40_words = dict(word_freq.most_common(40))

    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400,
                          background_color='white').generate_from_frequencies(top_40_words)

    # Plot the word cloud
    plt.figure(figsize=(10, 5), facecolor=None)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()

# Create the word cloud
create_wordcloud(lemmatized_str)
Tokenize words and count frequency: Counts the frequency of each word.
Get the top 40 words: Selects the 40 most common words.
Generate the word cloud: Uses WordCloud to generate a visual representation.
Plot the word cloud: Displays the word cloud using matplotlib.
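If the image should also be saved to disk rather than only displayed, the WordCloud object exposes a to_file() method; the snippet below is an optional sketch and the filename is only an example:
# Build the same top-40 word cloud and write it to an image file.
word_freq = Counter(word_tokenize(lemmatized_str))
top_40_words = dict(word_freq.most_common(40))
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(top_40_words)
wc.to_file("organic_food_wordcloud.png")   # example filename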
Conclusion
This code processes an example essay through several steps: converting to lowercase, removing numbers and punctuation, eliminating extra whitespace, removing stop words, stemming, and lemmatization. Finally, it generates a word cloud of the 40 most frequent words in the lemmatized text, visualizing the most common terms after preprocessing.