Assignment 1:
Title:
Tokenization and Stemming Techniques using NLTK
Objectives:
- To perform tokenization on sample sentences using various techniques available in
the NLTK library, including whitespace, punctuation-based, Treebank, Tweet, and MWE
tokenization.
- To compare the effectiveness of different tokenization techniques in terms of
accuracy and speed.
- To apply the Porter Stemmer and Snowball Stemmer to the tokenized sentences to
reduce the tokens to their root forms.
- To apply lemmatization techniques on the same set of tokenized sentences for
comparison.
Pre-requisites:
- Basic knowledge of Natural Language Processing (NLP) concepts
- Familiarity with Python programming language and NLTK library
Sample Sentence:
"I am trying to learn Natural Language Processing using the NLTK library. NLTK is a
powerful tool for working with human language data."
Theory:
Tokenization is the process of breaking a text into individual words or phrases, also
known as tokens. There are several tokenization techniques available in the NLTK
library, including whitespace, punctuation-based, Treebank, Tweet, and MWE
tokenization. Each technique has its own advantages and disadvantages, and the choice
of technique depends on the specific requirements of the NLP task.
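A minimal sketch of how these tokenizers might be compared on the sample sentence is
given below (assuming NLTK is installed; speed can be compared by wrapping each call
with time.perf_counter):

    from nltk.tokenize import (WhitespaceTokenizer, wordpunct_tokenize,
                               TreebankWordTokenizer, TweetTokenizer, MWETokenizer)

    text = ("I am trying to learn Natural Language Processing using the NLTK library. "
            "NLTK is a powerful tool for working with human language data.")

    print(WhitespaceTokenizer().tokenize(text))    # split on whitespace only
    print(wordpunct_tokenize(text))                # punctuation-based splitting
    print(TreebankWordTokenizer().tokenize(text))  # Penn Treebank conventions
    print(TweetTokenizer().tokenize(text))         # tuned for social-media text

    mwe = MWETokenizer([("Natural", "Language", "Processing")], separator="_")
    print(mwe.tokenize(text.split()))              # keeps the multi-word expression together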
Stemming is the process of reducing a word to its root form. The Porter Stemmer and
Snowball Stemmer are two widely used stemming algorithms in the NLTK library. The
Porter Stemmer is based on a set of suffix-stripping rules and heuristics, while the
Snowball Stemmer (also known as Porter2) refines those rules and adds support for
several languages, generally producing cleaner stems.
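A small illustrative sketch applying both stemmers to a few tokens (the token list here
is only an example):

    from nltk.stem import PorterStemmer, SnowballStemmer

    tokens = ["trying", "learning", "powerful", "libraries", "processing"]
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")

    for tok in tokens:
        print(tok, porter.stem(tok), snowball.stem(tok))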
Lemmatization is the process of reducing a word to its base or dictionary form, known
as a lemma. It uses a vocabulary lookup (for example, WordNet) together with the word's
part of speech to map words to their base forms, which generally makes it more accurate
than stemming, at the cost of extra processing time.
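A minimal lemmatization sketch using NLTK's WordNetLemmatizer (assuming the WordNet
data has been downloaded):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")                         # one-time download of the WordNet data
    lemmatizer = WordNetLemmatizer()

    # A part-of-speech hint improves accuracy; "v" marks a verb.
    print(lemmatizer.lemmatize("trying", pos="v"))   # -> "try"
    print(lemmatizer.lemmatize("libraries"))         # -> "library" (default POS is noun)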
Conclusion:
We have explored different tokenization techniques available in the NLTK library and
compared their effectiveness in terms of accuracy and speed. We have also applied
the Porter Stemmer and Snowball Stemmer to the tokenized sentences to reduce the
tokens to their root forms. Finally, we have compared the results of stemming and lemmatization
techniques on the same set of tokenized sentences.
Assignment 2:
Title:
Bag-of-Words, TF-IDF, and Word2Vec Embeddings on the Car Dataset
Objectives:
- To apply the bag-of-words approach to the Car Dataset by computing raw word counts
and normalized word counts (term frequencies) for the words in the dataset.
- To calculate TF-IDF score for the words in the dataset.
- To create word embeddings using Word2Vec model and analyze the results.
Pre-requisites:
- Basic knowledge of Natural Language Processing (NLP) concepts.
- Familiarity with the Python programming language and its libraries such as NLTK,
Pandas, and Gensim.
Dataset:
The dataset to be used for this assignment is the Car Dataset from Kaggle, which
contains information about cars, including their make, model, year, mileage, fuel type,
and more.
Theory:
The Bag-of-Words approach is a common NLP technique that represents a document as
an unordered collection (a "bag") of its words, ignoring word order and context. We will
compute both raw word counts and normalized word counts (term frequencies) for the
dataset.
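A minimal sketch of the counting step, using placeholder strings in place of actual text
fields from the Car Dataset (the real column names depend on the specific Kaggle file
used):

    from collections import Counter
    import pandas as pd

    # Placeholder documents standing in for text fields from the Car Dataset.
    docs = ["swift vxi petrol manual", "swift dzire diesel manual", "city petrol automatic"]

    counts = [Counter(doc.split()) for doc in docs]
    bow = pd.DataFrame(counts).fillna(0)                # raw count occurrence per document
    bow_normalized = bow.div(bow.sum(axis=1), axis=0)   # normalized counts (term frequencies)
    print(bow)
    print(bow_normalized)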
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to
evaluate how important a word is to a document in a collection. It weights a word's
frequency within a document by the inverse of how many documents in the collection
contain that word, so words that appear in almost every document receive low scores.
We will calculate TF-IDF scores for the words in the Car Dataset.
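A small sketch of the TF-IDF calculation using Gensim's Dictionary and TfidfModel, again
on placeholder documents standing in for the Car Dataset text (scikit-learn's
TfidfVectorizer would work equally well):

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    # Placeholder documents standing in for text fields from the Car Dataset.
    docs = ["swift vxi petrol manual", "swift dzire diesel manual", "city petrol automatic"]
    tokenized_docs = [doc.split() for doc in docs]

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    tfidf = TfidfModel(corpus)

    for doc in tfidf[corpus]:
        print([(dictionary[term_id], round(score, 3)) for term_id, score in doc])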
Word2Vec is a neural network-based approach for creating word embeddings, which are
dense vector representations of words in a continuous vector space; words used in
similar contexts end up with similar vectors. We will create Word2Vec embeddings for
the Car Dataset and analyze the results.
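A minimal Word2Vec sketch with Gensim; the sentences and hyperparameters below are
illustrative and would be replaced by tokenized rows from the Car Dataset:

    from gensim.models import Word2Vec

    # Placeholder tokenized sentences standing in for rows of the Car Dataset.
    sentences = [["maruti", "swift", "petrol", "manual"],
                 ["maruti", "swift", "dzire", "diesel"],
                 ["honda", "city", "petrol", "automatic"]]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
    print(model.wv["swift"][:5])                    # first few dimensions of one embedding
    print(model.wv.most_similar("petrol", topn=3))  # nearest neighbours in the vector space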
Conclusion:
We have explored different techniques for analyzing text data in the Car Dataset. We
have applied the bag-of-words approach to compute raw and normalized word counts for
the dataset, as well as calculated TF-IDF scores for the words.
Finally, we have created Word2Vec embeddings for the dataset and analyzed the
results.
Assignment 3:
Title:
Text Cleaning, Lemmatization, Stop Word Removal, Label Encoding, and TF-IDF
Representation on News Dataset
Objectives:
- To perform text cleaning on the News Dataset.
- To perform lemmatization on the cleaned text using any method.
- To remove stop words from the text using any method.
- To perform label encoding on the target variable of the dataset.
- To create a TF-IDF representation of the preprocessed text.
- To save the outputs of the preprocessing steps.
Pre-requisites:
- Basic knowledge of Natural Language Processing (NLP) concepts.
- Familiarity with the Python programming language and its libraries such as NLTK,
Pandas, and Scikit-learn.
Dataset:
The dataset to be used for this assignment is the News Dataset available on the
following GitHub repository:
https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle.
This dataset contains news articles labeled with their respective categories.
Theory:
Text Cleaning involves removing noise, unwanted characters, and unnecessary words
from the text data. We will perform text cleaning on the News Dataset.
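A small illustrative cleaning function; the exact rules (lowercasing, stripping HTML tags,
removing non-letters and extra whitespace) should be adapted to the noise actually
present in the News Dataset:

    import re

    def clean_text(text):
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
        text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
        text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
        return text

    print(clean_text("Breaking NEWS: <b>Stocks</b> rose 3% today!!!"))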
Lemmatization is the process of reducing words to their base or dictionary form. We
will perform lemmatization on the cleaned text using any method.
Stop Word Removal involves removing common words that do not carry much meaning
from the text data. We will remove stop words from the text using any method.
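A minimal sketch combining stop word removal and lemmatization with NLTK (assuming the
required NLTK data packages have been downloaded); other libraries such as spaCy would
work equally well:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    nltk.download("stopwords")
    nltk.download("wordnet")
    nltk.download("punkt")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        tokens = word_tokenize(text)
        return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

    print(preprocess("stocks rose sharply after the earnings were announced"))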
Label Encoding is a process of converting categorical variables into numerical format.
We will perform label encoding on the target variable of the dataset.
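A short label-encoding sketch with scikit-learn; the category values below are
hypothetical stand-ins for the dataset's target column:

    from sklearn.preprocessing import LabelEncoder

    categories = ["sports", "politics", "tech", "sports", "tech"]

    encoder = LabelEncoder()
    encoded = encoder.fit_transform(categories)
    print(encoded)                  # [1 0 2 1 2]
    print(list(encoder.classes_))   # mapping from integer back to category name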
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to
evaluate how important a word is to a document in a collection. We will create a TF-IDF
representation of the preprocessed text.
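A minimal sketch of the TF-IDF step, including saving the fitted vectorizer and matrix
with pickle; the documents and the output filename are illustrative:

    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder documents standing in for the cleaned, lemmatized news articles.
    cleaned_articles = ["stock rose sharply earnings announced",
                        "election result declared late evening",
                        "new phone launched flagship chip"]

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(cleaned_articles)

    # Save the outputs of the preprocessing steps for later use.
    with open("tfidf_outputs.pkl", "wb") as f:
        pickle.dump((vectorizer, tfidf_matrix), f)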
Conclusion:
We have performed various preprocessing steps on the News Dataset, including text
cleaning, lemmatization, stop word removal, and label encoding. We have also created a
TF-IDF representation of the preprocessed text. These steps are essential in
preparing text data for various NLP applications. Finally, we have saved the outputs of
the preprocessing steps for future use.
Assignment 4:
Title:
Building a Transformer from Scratch Using the PyTorch Library
Objectives:
- To understand the architecture of a Transformer.
- To implement the key components of a Transformer, including the Multi-Head
Attention, Position-wise Feedforward Network, and Layer Normalization.
- To train and evaluate the Transformer model on a text classification task.
- To analyze the performance of the model and interpret the results.
Pre-requisites:
- Knowledge of deep learning concepts, including neural networks and optimization
algorithms.
- Familiarity with the PyTorch library and its modules, such as nn, optim, and DataLoader.
- Understanding of NLP concepts, such as tokenization, padding, and embedding.
Dataset:
We can use any text classification dataset, such as the IMDB movie review dataset or
the AG News dataset.
Theory:
The Transformer is a type of neural network architecture that was introduced in the
paper "Attention Is All You Need" by Vaswani et al. (2017). Rather than using recurrence
or convolution, it relies entirely on self-attention mechanisms to process sequential
data, such as text or speech.
The key components of a Transformer are Multi-Head Attention, the Position-wise
Feedforward Network, and Layer Normalization. Multi-Head Attention computes attention
between every position in the input sequence and every other position, using several
attention heads in parallel. The Position-wise Feedforward Network then transforms each
position's attention output independently, and Layer Normalization, applied around each
sub-layer together with a residual connection, stabilizes the outputs of each layer.
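A compact sketch of these three components in PyTorch, written from scratch rather than
using nn.TransformerEncoderLayer; the dimensions and hyperparameters are illustrative:

    import torch
    import torch.nn as nn

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super().__init__()
            assert d_model % num_heads == 0
            self.d_k = d_model // num_heads
            self.num_heads = num_heads
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x):
            batch, seq_len, d_model = x.shape
            # Project and split into heads: (batch, heads, seq_len, d_k)
            def split(t):
                return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
            q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
            # Scaled dot-product self-attention of the sequence with itself
            scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
            attn = torch.softmax(scores, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
            return self.out_proj(out)

    class EncoderBlock(nn.Module):
        def __init__(self, d_model=128, num_heads=4, d_ff=512, dropout=0.1):
            super().__init__()
            self.attn = MultiHeadAttention(d_model, num_heads)
            self.ffn = nn.Sequential(                  # position-wise feedforward network
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x):
            x = self.norm1(x + self.drop(self.attn(x)))    # residual + layer normalization
            x = self.norm2(x + self.drop(self.ffn(x)))
            return x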
To implement the Transformer from scratch using PyTorch, we will define each of these
components and combine them into a complete model. We will then train and evaluate the
model on a text classification task.
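A short sketch of how the EncoderBlock above might be wrapped into a classifier and
trained for one step; the vocabulary size, sequence length, and dummy batch are
placeholders for a real tokenized dataset such as IMDB or AG News:

    class TransformerClassifier(nn.Module):
        def __init__(self, vocab_size, num_classes, d_model=128, max_len=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)      # learned positional embeddings
            self.encoder = EncoderBlock(d_model)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, token_ids):
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            x = self.embed(token_ids) + self.pos(positions)
            x = self.encoder(x)
            return self.classifier(x.mean(dim=1))          # mean-pool over the sequence

    model = TransformerClassifier(vocab_size=20000, num_classes=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    tokens = torch.randint(0, 20000, (8, 64))              # dummy batch of padded token ids
    labels = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), labels)
    loss.backward()
    optimizer.step()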
Conclusion:
We have explored the architecture of a Transformer and its key components, including
Multi-Head Attention, Position-wise Feedforward Network, and Layer Normalization.
We have implemented these components from scratch using PyTorch and trained the
model on a text classification task. We have also analyzed the performance of the
model and interpreted the results. Building a Transformer from scratch is a challenging
but rewarding task that can enhance our understanding of deep learning and NLP.
Assignment 5:
Title:
Understanding Morphology Using Add-Delete Tables
Objectives:
- To understand the concept of morphology and how words are built up from smaller
meaning-bearing units.
- To learn about the different types of morphemes, including free and bound
morphemes.
- To use add-delete tables as a tool for analyzing the morphological structure of words.
Pre-requisites:
- Basic knowledge of linguistics and grammar.
- Familiarity with the concept of words and their structures.
- Understanding of the difference between morphemes and phonemes.
Theory:
Morphology is the study of the structure and form of words, including how they are
built up from smaller meaning-bearing units called morphemes. There are two types of
morphemes: free morphemes, which can stand alone as words, and bound morphemes,
which must be attached to other morphemes to create words.
Add-delete tables are a tool used in morphology to analyze the morphological structure
of words. These tables show how words can be built up from smaller morphemes by
adding or deleting affixes. The table is divided into three columns: the stem, the affix,
and the resulting word.
To use add-delete tables, we start with a stem, which is the base form of a word. We
then add prefixes or suffixes to the stem to create new words. We can also delete
affixes to derive new words or analyze the morphological structure of existing words.
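For example, an illustrative add-delete table for the stem "play" might look like the
following (the entries are examples, not an exhaustive analysis):

    Stem       Affix    Operation   Resulting Word
    play       -ed      add         played
    play       -ing     add         playing
    playing    -ing     delete      play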
Conclusion:
We have explored the concept of morphology and how words are built up from smaller
meaning-bearing units called morphemes. We have learned about the different types of
morphemes, including free and bound morphemes, and how they are used to create
words. We have also used add-delete tables as a tool for analyzing the morphological
structure of words. By studying morphology, we can gain a deeper understanding of the
structure and meaning of language.