EXP 1:
PROGRAM:
import re
text = "The quick brown fox jumps over the lazy dog"
pattern = r'\b[a-zA-Z]*[qQ][a-zA-Z]*\b' # Words containing the letter 'q' or 'Q'
matches = re.findall(pattern, text)
print(matches) # Output: ['quick']
OUTPUT:
['quick']
PROGRAM:
import re
text = "The quick brown fox"
tokens = re.split(r'\s+', text)
print(tokens) # Output: ['The', 'quick', 'brown', 'fox']
OUTPUT:
['The', 'quick', 'brown', 'fox']
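Note: the [qQ] character class can equivalently be replaced by a case-insensitive flag. A minimal sketch using the standard re API (not part of the original experiment):

import re
text = "The quick brown fox jumps over the lazy dog"
# re.IGNORECASE lets a lowercase-only pattern match both 'q' and 'Q'
pattern = r'\b[a-z]*q[a-z]*\b'
print(re.findall(pattern, text, flags=re.IGNORECASE))  # ['quick']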
EXP 2:
PROGRAM:
import re
from collections import Counter
text = """
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
It helps computers understand, interpret, and generate human language.
With NLP, we can build chatbots, translators, sentiment analyzers, and more.
"""
# 1. Tokenization (simple split using regex)
# --------------------------
tokens = re.findall(r'\b\w+\b', text.lower())
# --------------------------
# 2. Search in Text
# --------------------------
search_word = "nlp"
occurrences = [i for i, token in enumerate(tokens) if token == search_word]
print(f"Positions of '{search_word}':", occurrences)
# --------------------------
# 3. Vocabulary Size
# --------------------------
vocab = set(tokens)
print("Vocabulary Size:", len(vocab))
print("Vocabulary:", sorted(vocab))
# --------------------------
# 4. Frequency Distribution
# --------------------------
fdist = Counter(tokens)
print("\nTop 5 Frequent Words:")
for word, freq in fdist.most_common(5):
    print(word, "->", freq)
# --------------------------
# 5. Bigrams
# --------------------------
bigrams_list = [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]
print("\nBigrams List:", bigrams_list)
OUTPUT:
Positions of 'nlp': [3, 21]
Vocabulary Size: 28
Vocabulary: ['a', 'analyzers', 'and', 'artificial', 'build', 'can', 'chatbots', 'computers', 'fascinating',
'field', 'generate', 'helps', 'human', 'intelligence', 'interpret', 'is', 'it', 'language', 'more', 'natural',
'nlp', 'of', 'processing', 'sentiment', 'translators', 'understand', 'we', 'with']
Top 5 Frequent Words:
language -> 2
nlp -> 2
and -> 2
natural -> 1
processing -> 1
Bigrams List: [('natural', 'language'), ('language', 'processing'), ('processing', 'nlp'), ('nlp', 'is'), ('is',
'a'), ('a', 'fascinating'), ('fascinating', 'field'), ('field', 'of'), ('of', 'artificial'), ('artificial',
'intelligence'), ('intelligence', 'it'), ('it', 'helps'), ('helps', 'computers'), ('computers', 'understand'),
('understand', 'interpret'), ('interpret', 'and'), ('and', 'generate'), ('generate', 'human'), ('human',
'language'), ('language', 'with'), ('with', 'nlp'), ('nlp', 'we'), ('we', 'can'), ('can', 'build'), ('build',
'chatbots'), ('chatbots', 'translators'), ('translators', 'sentiment'), ('sentiment', 'analyzers'),
('analyzers', 'and'), ('and', 'more')]
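Note: if NLTK is installed, the manual list comprehension for bigrams can be replaced by nltk.bigrams, and bigram frequencies counted with the same Counter pattern used for words. A minimal sketch under that assumption:

import nltk
from collections import Counter
tokens = ['natural', 'language', 'processing', 'nlp', 'is', 'a']  # sample tokens
bigrams_list = list(nltk.bigrams(tokens))   # same pairs as the list comprehension above
bigram_freq = Counter(bigrams_list)         # frequency of each adjacent pair
print(bigram_freq.most_common(3))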
EXP 3:
PROGRAM:
import nltk
from nltk.corpus import gutenberg, brown, wordnet
# Download corpora if not already downloaded
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('wordnet')
# Accessing the Gutenberg Corpus
print("=== Gutenberg Corpus ===")
print("Available files:", gutenberg.fileids())
emma_text = gutenberg.raw('austen-emma.txt')[:500] # Extracting first 500 characters
print("Sample text from 'Emma' by Jane Austen:")
print(emma_text)
# Accessing the Brown Corpus
print("\n=== Brown Corpus ===")
print("Available categories (genres):", brown.categories())
news_text = brown.raw(categories='news')[:500] # Extracting first 500 characters
print("Sample text from the 'news' category:")
print(news_text)
# Accessing the WordNet Corpus
print("\n=== WordNet Corpus ===")
car_synsets = wordnet.synsets('car') # Synsets for the word 'car'
print("Synsets for the word 'car':", car_synsets)
print("Definitions of the synsets:")
for synset in car_synsets:
    print("-", synset.definition())
OUTPUT:
=== Gutenberg Corpus ===
Available files: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Sample text from 'Emma' by Jane Austen:
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period. Her mother
had died t
=== Brown Corpus ===
Available categories (genres): ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
Sample text from the 'news' category:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at
investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at
evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at
City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at
election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/c
=== WordNet Corpus ===
Synsets for the word 'car': [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'),
Synset('car.n.04'), Synset('cable_car.n.01')]
Definitions of the synsets:
- a motor vehicle with four wheels; usually propelled by an internal combustion engine
- a wheeled vehicle adapted to the rails of railroad
- the compartment that is suspended from an airship and that carries personnel and the cargo and
the power plant
- where passengers ride up and down
- a conveyance for passengers or freight on a cable railway
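Note: each WordNet synset also exposes lemmas (synonyms) and hypernyms (more general concepts), a natural follow-up to listing definitions. A small sketch using the same corpus reader:

from nltk.corpus import wordnet
car = wordnet.synsets('car')[0]   # Synset('car.n.01'), the motor-vehicle sense
print(car.lemma_names())          # synonym lemmas for this sense, e.g. 'car', 'auto'
print(car.hypernyms())            # parent concepts, e.g. Synset('motor_vehicle.n.01')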
EXP 4:
PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
# Download required resources
nltk.download('punkt') # Tokenizer
nltk.download('punkt_tab') # Tokenizer tables required by newer NLTK versions
nltk.download('stopwords') # Stop words
def most_frequent_words(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Filter out stop words and non-alphabetic tokens
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens
                       if word.lower() not in stop_words and word.isalpha()]
    # Calculate frequency distribution
    fdist = FreqDist(filtered_tokens)
    # Get the 50 most frequent words
    most_frequent = fdist.most_common(50)
    return most_frequent
# Example usage
text = """This is a sample text. It contains some words that will be counted.
The words in this text will be analyzed to find the most frequent ones.
This text is here to demonstrate frequency counting in NLP using NLTK."""
result = most_frequent_words(text)
print("50 most frequently occurring words (excluding stop words):")
print(result)
OUTPUT:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
50 most frequently occurring words (excluding stop words):
[('text', 3), ('words', 2), ('sample', 1), ('contains', 1), ('counted', 1), ('analyzed', 1), ('find', 1),
('frequent', 1), ('ones', 1), ('demonstrate', 1), ('frequency', 1), ('counting', 1), ('NLP', 1), ('using',
1), ('NLTK', 1)]
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
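Note: the sample text is too short for the 50-word limit to matter; the same function can be pointed at a larger corpus such as a Gutenberg text from EXP 3. A sketch assuming the gutenberg corpus has been downloaded:

import nltk
from nltk.corpus import gutenberg
nltk.download('gutenberg')                 # as in EXP 3
emma = gutenberg.raw('austen-emma.txt')
top_words = most_frequent_words(emma)      # reuses the function defined above
print(top_words[:10])                      # first 10 of the 50 most frequent content words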
EXP 5:
PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim.models import Word2Vec
# Download required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab') # Tokenizer tables required by newer NLTK versions
nltk.download('stopwords')
# -------------------------
# 1. Sample corpus
# -------------------------
corpus = """
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI).
It helps computers understand, interpret, and generate human language.
Word2Vec is a popular algorithm for word embeddings.
It represents words in vector space, capturing semantic meaning.
"""
# -------------------------
# 2. Preprocessing
# -------------------------
stop_words = set(stopwords.words('english'))
# Tokenize and clean text
tokens = word_tokenize(corpus.lower())
tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
# Word2Vec expects a list of sentences (list of list of tokens)
sentences = [tokens]
# -------------------------
# 3. Train Word2Vec model
# -------------------------
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
# vector_size: dimension of vectors
# window: context size
# min_count: ignores words with total frequency below this value (1 keeps all words)
# sg=1: skip-gram model (sg=0 for CBOW)
# -------------------------
# 4. Explore the model
# -------------------------
print("\nVector for 'language':\n", model.wv['language'])
print("\nMost similar words to 'language':\n", model.wv.most_similar('language'))
# Save model (optional)
model.save("word2vec_model.model")
OUTPUT:
Most similar words to 'language':
[('embeddings', 0.2707250714302063), ('human', 0.21214450895786285), ('capturing',
0.18699145317077637), ('understand', 0.16841934621334076), ('space', 0.16121222078800201),
('semantic', 0.15025976300239563), ('intelligence', 0.1321220099925995), ('interpret',
0.12795543670654297), ('helps', 0.10077444463968277), ('words', 0.07131274044513702)]
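Note: besides most_similar, the trained model can score the similarity of any two in-vocabulary words, and the saved file can be reloaded with Word2Vec.load. A minimal sketch using the same model object:

# Cosine similarity between two word vectors (roughly -1 to 1)
print(model.wv.similarity('language', 'human'))
# Reload the model saved above
from gensim.models import Word2Vec
loaded = Word2Vec.load("word2vec_model.model")
print(loaded.wv['language'][:5])   # first 5 of the 50 vector dimensions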
EXP 6:
PROGRAM:
# Simple Rule-Based Chatbot
def chatbot_response(user_input):
    user_input = user_input.lower()
    # Rule-based responses
    if "hello" in user_input or "hi" in user_input:
        return "Hello! How can I help you today?"
    elif "how are you" in user_input:
        return "I'm just a bot, but I'm doing great! How about you?"
    elif "your name" in user_input:
        return "I am ChatBot 1.0, your friendly assistant."
    elif "weather" in user_input:
        return "I can't check the weather right now, but it looks sunny in my world."
    elif "bye" in user_input or "goodbye" in user_input:
        return "Goodbye! Have a nice day."
    else:
        return "I'm not sure how to respond to that. Could you rephrase?"
# Main loop
print("ChatBot 1.0: Hello! Type 'bye' to end the chat.")
while True:
    user_text = input("You: ")
    response = chatbot_response(user_text)
    print("Bot:", response)
    if "bye" in user_text.lower():
        break
OUTPUT:
ChatBot 1.0: Hello! Type 'bye' to end the chat.
You: HI HOW CAN I HELP YOU
Bot: Hello! How can I help you today?
You: HELO
Bot: I'm not sure how to respond to that. Could you rephrase?
You: I DON'T UNDERSAND THIS
Bot: Hello! How can I help you today?
You: THANK U
Bot: I'm not sure how to respond to that. Could you rephrase?
You: GOOD BYE
Bot: Goodbye! Have a nice day.
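Note: the transcript shows a pitfall of substring matching: "I DON'T UNDERSAND THIS" triggers the greeting because "hi" occurs inside "this". Matching on word boundaries avoids this; a minimal sketch of the adjusted rule (not part of the original program):

import re
def is_greeting(user_input):
    # \b makes "hi" match only as a whole word, not inside "this"
    return re.search(r'\b(hello|hi)\b', user_input.lower()) is not None
print(is_greeting("HI HOW CAN I HELP YOU"))    # True
print(is_greeting("I DON'T UNDERSTAND THIS"))  # False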
EXP 7:
PROGRAM:
# Install required libraries
!pip install gTTS speechrecognition pydub jiwer
from gtts import gTTS
import speech_recognition as sr
from jiwer import wer
import os
# --------------------
# Step 1: Text to Speech
# --------------------
text = "Hello, welcome to the NLP Lab. We are testing text to speech
accuracy."
tts = gTTS(text=text, lang='en')
tts.save("output.mp3")
print("✅ Audio file saved as 'output.mp3'")
# --------------------
# Step 2: Speech to Text (recognition)
# --------------------
recognizer = sr.Recognizer()
# Convert MP3 to WAV for recognition
from pydub import AudioSegment
sound = AudioSegment.from_mp3("output.mp3")
sound.export("output.wav", format="wav")
# Load audio file
with sr.AudioFile("output.wav") as source:
    audio_data = recognizer.record(source)
try:
    recognized_text = recognizer.recognize_google(audio_data)
    print("🔹 Recognized Text:", recognized_text)
except sr.UnknownValueError:
    print("Speech Recognition could not understand the audio.")
    recognized_text = ""
except sr.RequestError as e:
    print(f"Could not request results; {e}")
    recognized_text = ""
# --------------------
# Step 3: Accuracy Calculation
# --------------------
error_rate = wer(text.lower(), recognized_text.lower())
accuracy = (1 - error_rate) * 100
print(f"📊 Word Error Rate (WER): {error_rate:.2f}")
print(f"✅ Accuracy: {accuracy:.2f}%")
OUTPUT:
Collecting gTTS
  Downloading gTTS-2.5.4-py3-none-any.whl.metadata (4.1 kB)
Collecting speechrecognition
  Downloading speechrecognition-3.14.3-py3-none-any.whl.metadata (30 kB)
Requirement already satisfied: pydub in /usr/local/lib/python3.11/dist-packages (0.25.1)
Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Requirement already satisfied: requests<3,>=2.27 in /usr/local/lib/python3.11/dist-packages (from gTTS) (2.32.3)
Collecting click<8.2,>=7.1 (from gTTS)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.11/dist-packages (from speechrecognition) (4.14.1)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.27->gTTS) (3.4.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.27->gTTS) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.27->gTTS) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.27->gTTS) (2025.8.3)
Downloading gTTS-2.5.4-py3-none-any.whl (29 kB)
Downloading speechrecognition-3.14.3-py3-none-any.whl (32.9 MB)
  32.9/32.9 MB 27.8 MB/s eta 0:00:00
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading click-8.1.8-py3-none-any.whl (98 kB)
  98.2/98.2 kB 6.4 MB/s eta 0:00:00
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
  3.1/3.1 MB 83.2 MB/s eta 0:00:00
Installing collected packages: speechrecognition, rapidfuzz, click, jiwer, gTTS
  Attempting uninstall: click
    Found existing installation: click 8.2.1
    Uninstalling click-8.2.1:
      Successfully uninstalled click-8.2.1
Successfully installed click-8.1.8 gTTS-2.5.4 jiwer-4.0.0 rapidfuzz-3.13.0 speechrecognition-3.14.3
✅ Audio file saved as 'output.mp3'
🔹 Recognized Text: hello welcome to the NLP lab we are testing text to speech accuracy
📊 Word Error Rate (WER): 0.23
✅ Accuracy: 76.92%
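Note: jiwer computes the word error rate as WER = (S + D + I) / N, where S, D, and I are word substitutions, deletions, and insertions against the N reference words. Here the reference has 13 words and the recognized text differs only in punctuation ("Hello," vs "hello", "Lab." vs "lab", "accuracy." vs "accuracy"), which jiwer's default word comparison counts as 3 substitutions: 3 / 13 ≈ 0.23, matching the printed accuracy of 76.92%.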