Natural Language Processing Lab Manual (AIP-101)
L:T:P: 0:0:2 Credits: 1
Course Outcomes
At the end of the course, the student will be able to:
1. Use the NLTK and spaCy toolkits for NLP programming.
2. Analyze various corpora for developing programs.
3. Develop various pre-processing techniques for a given corpus.
4. Develop programming logic using NLTK functions.
5. Build applications using various NLP techniques for a given corpus.
List of Programs
Experiment 1: Installation and exploration of the features of the NLTK and spaCy tools.
Download the WordCloud package and a few corpora.
Objective
To install and configure the NLTK and spaCy libraries.
To explore and utilize basic features such as downloading corpora and generating word
clouds.
Outcomes
Understand the process of installing and setting up NLP libraries in Python.
Learn how to download and use corpora for further NLP tasks.
Explore the concept of word clouds and how to generate them for corpus analysis.
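Sample Program
A minimal sketch, assuming the packages are installed as shown in the comments and using the Gutenberg corpus and the austen-emma.txt file as illustrative choices:
```python
# Assumed one-time setup (shell):
#   pip install nltk spacy wordcloud matplotlib
#   python -m spacy download en_core_web_sm
import nltk
import spacy
import matplotlib.pyplot as plt
from wordcloud import WordCloud

nltk.download("gutenberg")                    # fetch a sample corpus
from nltk.corpus import gutenberg

nlp = spacy.load("en_core_web_sm")            # confirms the spaCy model loads
print([t.text for t in nlp("spaCy is ready.")])

text = gutenberg.raw("austen-emma.txt")       # raw text of one corpus file
cloud = WordCloud(width=800, height=400).generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```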
Experiment 2: (i) Write a program to implement word, sentence, and paragraph
tokenizers.
(ii) Count how many words there are in a given corpus, and how many distinct
words it contains.
Objective
To implement word, sentence, and paragraph tokenization using NLTK and spaCy.
To analyze the number of total and distinct words in a given corpus.
Outcomes
Learn how tokenization works in NLP and how to break text into smaller units.
Gain knowledge about counting word frequencies and understanding the diversity of
vocabulary in a corpus.
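Sample Program
A minimal sketch using NLTK's punkt tokenizer models; since NLTK has no built-in paragraph tokenizer for raw strings, the paragraph tokenizer here is a simple blank-line split:
```python
import nltk
nltk.download("punkt")   # tokenizer models (newer NLTK may also need "punkt_tab")
from nltk.tokenize import word_tokenize, sent_tokenize

text = ("Natural language processing is fun. It has many uses.\n\n"
        "A blank line separates this second paragraph from the first.")

words = word_tokenize(text)
sentences = sent_tokenize(text)
paragraphs = [p for p in text.split("\n\n") if p.strip()]  # blank-line split

print("Words:", len(words))
print("Distinct words:", len(set(w.lower() for w in words)))
print("Sentences:", len(sentences))
print("Paragraphs:", len(paragraphs))
```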
Experiment 3: (i) Write a program to implement both user-defined and
predefined functions to generate (a) unigrams, (b) bigrams, (c) trigrams, and
(d) N-grams.
(ii) Write a program to find the word (w2) with the highest probability of
occurring after a given word (w1).
Objective
To implement various N-gram models to generate unigrams, bigrams, trigrams, and general
N-grams.
To calculate the conditional probability of one word occurring after another.
Outcomes
Understand N-gram models and their application in NLP.
Learn how to calculate and work with word probabilities for language modeling.
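Sample Program
A minimal sketch contrasting a user-defined N-gram generator with NLTK's predefined nltk.util.ngrams, and estimating the most probable successor of w1 from bigram counts on a toy sentence:
```python
from nltk.util import ngrams
from nltk import ConditionalFreqDist, bigrams

tokens = "the cat sat on the mat and the cat slept".split()

# user-defined N-gram generator
def my_ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

print(my_ngrams(tokens, 1))          # unigrams
print(list(ngrams(tokens, 2)))       # bigrams via NLTK's predefined function
print(list(ngrams(tokens, 3)))       # trigrams

# most probable successor of w1, estimated from bigram counts
cfd = ConditionalFreqDist(bigrams(tokens))
w1 = "the"
w2, count = cfd[w1].most_common(1)[0]
print(w1, "->", w2, "P =", count / cfd[w1].N())
```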
Experiment 4: (i) Write a program to identify word collocations.
(ii) Write a program to print all words beginning with a given sequence of
letters.
(iii) Write a program to print all words longer than four characters.
Objective
To identify word collocations and patterns in a corpus.
To extract words based on specific criteria (prefix and word length).
Outcomes
Learn how to find common word pairs or collocations.
Develop skills to filter words based on given patterns or word length constraints.
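Sample Program
A minimal sketch using NLTK's collocation finder on the Gutenberg corpus; the corpus, the frequency cutoff, and the PMI scoring measure are illustrative assumptions:
```python
import nltk
nltk.download("gutenberg")
from nltk.corpus import gutenberg
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]

# (i) collocations: frequent bigrams scored by pointwise mutual information
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                      # ignore rare pairs
print(finder.nbest(BigramAssocMeasures.pmi, 10))

# (ii) words beginning with a given sequence of letters
prefix = "un"
print(sorted({w for w in words if w.startswith(prefix)})[:20])

# (iii) words longer than four characters
print(sorted({w for w in words if len(w) > 4})[:20])
```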
Experiment 5: (i) Write a program to identify mathematical expressions in a
given sentence.
(ii) Write a program to identify different components of an email address.
Objective
To write regular expressions to identify mathematical expressions and components of email
addresses.
Outcomes
Learn to use regular expressions for pattern matching.
Identify structured data within unstructured text, such as email components and
mathematical expressions.
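Sample Program
A minimal sketch using Python's re module; the patterns below are simplified illustrations, not complete grammars for arithmetic or RFC-compliant email addresses:
```python
import re

sentence = "The model computes 3 + 4 * 2 = 11 before reporting results."
# simple pattern: digits joined by + - * / = operators
math_exprs = re.findall(r"\d+(?:\s*[-+*/=]\s*\d+)+", sentence)
print(math_exprs)

email = "first.last@dept.example.co.in"
m = re.match(r"(?P<user>[\w.+-]+)@(?P<domain>[\w-]+(?:\.[\w-]+)*)\.(?P<tld>\w+)$", email)
if m:
    print("username:", m.group("user"))
    print("domain:", m.group("domain"))
    print("top-level domain:", m.group("tld"))
```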
Experiment 6: (i) Write a program to identify all antonyms and synonyms of a
word.
(ii) Write a program to find hyponymy, homonymy, polysemy for a given word.
Objective
To identify synonyms and antonyms using lexical databases like WordNet.
To explore word relationships such as hyponymy, homonymy, and polysemy.
Outcomes
Understand how to find relationships between words using NLP libraries.
Explore word semantics and develop an understanding of lexical ambiguity.
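Sample Program
A minimal sketch using WordNet via NLTK; the example words are arbitrary choices:
```python
import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# (i) synonyms and antonyms of a word
word = "good"
synonyms, antonyms = set(), set()
for syn in wn.synsets(word):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())
print("Synonyms:", synonyms)
print("Antonyms:", antonyms)

# (ii) hyponymy: more specific concepts under a synset
dog = wn.synset("dog.n.01")
print("Hyponyms:", [h.name() for h in dog.hyponyms()[:5]])

# polysemy / homonymy: multiple senses of the same surface form
print("Senses of 'bank':", len(wn.synsets("bank")))
for s in wn.synsets("bank")[:3]:
    print(s.name(), "-", s.definition())
```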
Experiment 7: (i) Write a program to find all the stop words in any given text.
(ii) Write a function that finds the 50 most frequently occurring words of a text
that are not stopwords.
Objective
To identify and remove stop words from a given text.
To analyze the frequency distribution of non-stop words.
Outcomes
Learn how to filter stopwords from text data.
Understand word frequency analysis and its importance in text mining.
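Sample Program
A minimal sketch, assuming the stopwords and gutenberg resources have been downloaded; the corpus file is an illustrative choice:
```python
import nltk
nltk.download(["stopwords", "gutenberg"])
from nltk.corpus import stopwords, gutenberg
from nltk import FreqDist

stops = set(stopwords.words("english"))
words = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]

# (i) stop words that actually occur in the text
print(sorted(set(words) & stops))

# (ii) the 50 most frequent words that are not stopwords
def top_content_words(tokens, n=50):
    fd = FreqDist(w for w in tokens if w not in stops)
    return fd.most_common(n)

print(top_content_words(words))
```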
Experiment 8: Write a program to implement various stemming techniques and
prepare a chart with the performance of each method.
Objective
To implement and compare different stemming techniques (e.g., Porter Stemmer, Lancaster
Stemmer).
Outcomes
Understand stemming and its role in text preprocessing.
Evaluate the performance of various stemming algorithms based on accuracy and efficiency.
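Sample Program
A minimal sketch comparing three NLTK stemmers on a small illustrative word list; a fuller experiment would tabulate agreement with gold-standard stems over a whole corpus:
```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["running", "flies", "happily", "studies", "maximum", "crying"]
stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english"),
}

# print a comparison chart, one row per word
print("{:<12}".format("word") + "".join("{:<12}".format(n) for n in stemmers))
for w in words:
    row = "{:<12}".format(w)
    row += "".join("{:<12}".format(s.stem(w)) for s in stemmers.values())
    print(row)
```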
Experiment 9: Write a program to implement various lemmatization
techniques and prepare a chart with the performance of each method.
Objective
To implement lemmatization techniques and compare their effectiveness.
Outcomes
Understand lemmatization and its importance in reducing words to their base form.
Compare lemmatization techniques based on performance.
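Sample Program
A minimal sketch comparing NLTK's WordNetLemmatizer (here applied with a verb part of speech) against spaCy's lemmatizer; it assumes the en_core_web_sm model is installed and that each example word tokenizes to a single spaCy token:
```python
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer
import spacy

wnl = WordNetLemmatizer()
nlp = spacy.load("en_core_web_sm")   # assumes the model is installed

words = ["running", "better", "geese", "studies", "was"]
doc = nlp(" ".join(words))

print("{:<10}{:<15}{:<15}".format("word", "WordNet(v)", "spaCy"))
for w, tok in zip(words, doc):
    print("{:<10}{:<15}{:<15}".format(w, wnl.lemmatize(w, pos="v"), tok.lemma_))
```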
Experiment 10: (i) Write a program to implement Conditional Frequency
Distributions (CFD) for any corpus.
(ii) Find all the four-letter words in any corpus. With the help of a frequency
distribution (FreqDist), show these words in decreasing order of frequency.
(iii) Define a conditional frequency distribution over the names corpus that
allows you to see which initial letters are more frequent for males versus
females.
Objective
To implement conditional frequency distributions and analyze text corpus.
To explore frequency distribution and patterns in text data.
Outcomes
Understand the concept of conditional frequency distributions and their applications.
Analyze the frequency of words and patterns in specific corpora.
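Sample Program
A minimal sketch using the Brown and Names corpora (both illustrative choices):
```python
import nltk
nltk.download(["brown", "names"])
from nltk.corpus import brown, names
from nltk import ConditionalFreqDist, FreqDist

# (i) CFD over the Brown corpus: word frequency conditioned on genre
cfd = ConditionalFreqDist(
    (genre, word.lower())
    for genre in ["news", "romance"]
    for word in brown.words(categories=genre))
print(cfd["news"].most_common(5))

# (ii) four-letter words in decreasing order of frequency
fd = FreqDist(w.lower() for w in brown.words() if len(w) == 4 and w.isalpha())
print(fd.most_common(10))

# (iii) initial letters of male vs. female names
cfd2 = ConditionalFreqDist(
    (fileid, name[0])
    for fileid in ["male.txt", "female.txt"]
    for name in names.words(fileid))
cfd2.tabulate(samples="ABCDEFGH")   # compare first-letter counts
```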
Experiment 11: (i) Write a program to implement Part-of-Speech (PoS) tagging
for any corpus.
(ii) Write a program to identify which word has the greatest number of distinct
tags. What are they, and what do they represent?
(iii) Write a program to list tags in order of decreasing frequency. What do
the 20 most frequent tags represent?
(iv) Write a program to identify which tags nouns are most commonly found
after. What do these tags represent?
Objective
To implement part-of-speech tagging and analyze word classifications.
To explore the distribution of PoS tags and understand their significance.
Outcomes
Understand part-of-speech tagging and its application in NLP.
Analyze the distribution and frequency of different PoS tags in a corpus.
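Sample Program
A minimal sketch over the Brown corpus with the universal tagset (an illustrative choice); the tagger resource name may vary with the NLTK version:
```python
import nltk
nltk.download(["brown", "universal_tagset", "punkt", "averaged_perceptron_tagger"])
from nltk.corpus import brown
from nltk import ConditionalFreqDist, FreqDist

# (i) tag a fresh sentence with NLTK's default tagger
sent = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(sent))

tagged = brown.tagged_words(tagset="universal")

# (ii) the word with the greatest number of distinct tags
cfd = ConditionalFreqDist((w.lower(), t) for w, t in tagged)
word = max(cfd.conditions(), key=lambda w: len(cfd[w]))
print(word, "->", sorted(cfd[word]))

# (iii) tags in decreasing order of frequency
print(FreqDist(t for _, t in tagged).most_common(20))

# (iv) tags most commonly found immediately before a noun
before_noun = FreqDist(a[1] for a, b in nltk.bigrams(tagged) if b[1] == "NOUN")
print(before_noun.most_common())
```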
Experiment 12: Write a program to implement TF-IDF for any corpus.
Objective
To implement Term Frequency-Inverse Document Frequency (TF-IDF) and use it for text
vectorization.
Outcomes
Understand how TF-IDF is used for text representation.
Implement the TF-IDF algorithm for text classification or retrieval.
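Sample Program
A minimal sketch computing TF-IDF from first principles on a toy document set, with tf(t, d) = count(t in d) / len(d) and idf(t) = log(N / df(t)); a library such as scikit-learn could be substituted for a production pipeline:
```python
import math

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    return doc.count(term) / len(doc)          # term frequency

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df)                    # inverse document frequency

# top-scoring terms per document
for i, doc in enumerate(tokenized):
    scores = {t: tf(t, doc) * idf(t) for t in set(doc)}
    print("doc", i, sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```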
Experiment 13: Write a program to implement chunking and chinking for any
corpus.
Objective
To implement chunking and chinking for identifying specific structures in text, such as noun
phrases or verb phrases.
Outcomes
Understand the concepts of chunking and chinking in text analysis.
Learn how to extract syntactic structures from text data.
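Sample Program
A minimal sketch using NLTK's RegexpParser, where { } rules chunk and } { rules chink; the noun-phrase grammar is an illustrative assumption:
```python
import nltk
nltk.download(["punkt", "averaged_perceptron_tagger"])

sentence = "The little yellow dog barked at the cat in the garden."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# chunking: group determiner + adjectives + noun(s) into NP chunks;
# chinking: remove verbs and prepositions from inside chunks
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}    # chunk noun phrases
      }<VB.*|IN>{            # chink verbs and prepositions
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)
```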
Experiment 14: (i) Write a program to find all the misspelled words in a
paragraph.
(ii) Write a program to prepare a table with the frequency of misspelled words
in any given text.
Objective
To identify and correct spelling errors in text.
To analyze misspelled words and their frequency distribution.
Outcomes
Learn to detect and handle spelling mistakes in NLP tasks.
Develop skills in error analysis and frequency distribution.
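Sample Program
A minimal sketch that flags a token as misspelled if it is absent from NLTK's words word list; this dictionary-lookup approach is one simple assumption, not the only possible method:
```python
import nltk
nltk.download(["words", "punkt"])
from nltk.corpus import words as word_list
from nltk import FreqDist, word_tokenize

vocab = set(w.lower() for w in word_list.words())

paragraph = "Ths paragrph has sevral mispelled words in it."
tokens = [w.lower() for w in word_tokenize(paragraph) if w.isalpha()]

# (i) words not found in the reference vocabulary
misspelled = [w for w in tokens if w not in vocab]
print("Misspelled:", misspelled)

# (ii) frequency table of misspelled words
FreqDist(misspelled).tabulate()
```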
Experiment 15: Write a program to implement all the NLP Pre-Processing
Techniques required to perform further NLP tasks.
Objective
To implement common pre-processing techniques such as tokenization, stemming,
lemmatization, stop-word removal, etc.
Outcomes
Gain a comprehensive understanding of the essential pre-processing steps in NLP.
Develop a pipeline for preparing text for downstream NLP tasks.
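Sample Program
A minimal end-to-end sketch combining techniques from the earlier experiments (case folding, punctuation removal, tokenization, stop-word removal, lemmatization); the exact steps and their order are illustrative assumptions:
```python
import re
import nltk
nltk.download(["punkt", "stopwords", "wordnet"])
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stops = set(stopwords.words("english"))
wnl = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                              # case folding
    text = re.sub(r"[^a-z\s]", " ", text)            # strip punctuation/digits
    tokens = word_tokenize(text)                     # tokenization
    tokens = [t for t in tokens if t not in stops]   # stop-word removal
    return [wnl.lemmatize(t) for t in tokens]        # lemmatization

print(preprocess("The striped bats were hanging on their feet, eating best fruits!"))
```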