Assignment 1:
Title:
Tokenization and Stemming Techniques using NLTK
Objectives:
- To perform tokenization on sample sentences using various techniques available in
the NLTK library, including whitespace, punctuation-based, Treebank, Tweet, and MWE
tokenization.
- To compare the effectiveness of different tokenization techniques in terms of
accuracy and speed.
- To apply the Porter Stemmer and Snowball Stemmer to the tokenized sentences to
reduce the tokens to their root forms.
- To apply lemmatization techniques on the same set of tokenized sentences for
comparison.
Pre-requisites:
- Basic knowledge of Natural Language Processing (NLP) concepts
- Familiarity with Python programming language and NLTK library
Sample Sentence:
"I am trying to learn Natural Language Processing using the NLTK library. NLTK is a
powerful tool for working with human language data."
Theory:
Tokenization is the process of breaking a text into individual words or phrases, also
known as tokens. There are several tokenization techniques available in the NLTK
library, including whitespace, punctuation-based, Treebank, Tweet, and MWE
tokenization. Each technique has its own advantages and disadvantages, and the choice
of technique depends on the specific requirements of the NLP task.
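A minimal sketch of how these tokenizers might be compared on the sample sentence is
given below (assuming NLTK is installed; speed can be compared by wrapping each call
with time.perf_counter):

    from nltk.tokenize import (WhitespaceTokenizer, wordpunct_tokenize,
                               TreebankWordTokenizer, TweetTokenizer, MWETokenizer)

    text = ("I am trying to learn Natural Language Processing using the NLTK library. "
            "NLTK is a powerful tool for working with human language data.")

    print(WhitespaceTokenizer().tokenize(text))    # split on whitespace only
    print(wordpunct_tokenize(text))                # punctuation-based splitting
    print(TreebankWordTokenizer().tokenize(text))  # Penn Treebank conventions
    print(TweetTokenizer().tokenize(text))         # tuned for social-media text

    mwe = MWETokenizer([("Natural", "Language", "Processing")], separator="_")
    print(mwe.tokenize(text.split()))              # keeps the multi-word expression together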
Stemming is the process of reducing a word to its root form. The Porter Stemmer and
Snowball Stemmer are two widely used stemming algorithms in the NLTK library. The
Porter Stemmer is based on a set of suffix-stripping rules and heuristics, while the
Snowball Stemmer (also known as Porter2) refines those rules and adds support for
several languages, generally producing cleaner stems.
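A small illustrative sketch applying both stemmers to a few tokens (the token list here
is only an example):

    from nltk.stem import PorterStemmer, SnowballStemmer

    tokens = ["trying", "learning", "powerful", "libraries", "processing"]
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")

    for tok in tokens:
        print(tok, porter.stem(tok), snowball.stem(tok))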
Lemmatization is the process of reducing a word to its base or dictionary form, known
as a lemma. It uses a vocabulary lookup (for example, WordNet) together with the word's
part of speech to map words to their base forms, which generally makes it more accurate
than stemming, at the cost of extra processing time.
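A minimal lemmatization sketch using NLTK's WordNetLemmatizer (assuming the WordNet
data has been downloaded):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")                         # one-time download of the WordNet data
    lemmatizer = WordNetLemmatizer()

    # A part-of-speech hint improves accuracy; "v" marks a verb.
    print(lemmatizer.lemmatize("trying", pos="v"))   # -> "try"
    print(lemmatizer.lemmatize("libraries"))         # -> "library" (default POS is noun)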
Conclusion:
We have explored different tokenization techniques available in the NLTK library and
compared their effectiveness in terms of accuracy and speed. We have also applied
the Porter Stemmer and Snowball Stemmer to the tokenized sentences to reduce the
tokens to their root forms. Finally, we have compared the results of stemming and lemmatization
techniques on the same set of tokenized sentences.
Assignment 2:
Title:
Bag-of-Words, TF-IDF, and Word2Vec Embeddings on the Car Dataset
Objectives:
- To apply the bag-of-words approach to the Car Dataset by computing raw word counts
and normalized word counts (term frequencies) for the words in the dataset.
- To calculate TF-IDF score for the words in the dataset.
- To create word embeddings using Word2Vec model and analyze the results.
Pre-requisites:
- Basic knowledge of Natural Language Processing (NLP) concepts.
- Familiarity with the Python programming language and its libraries such as NLTK,
Pandas, and Gensim.
Dataset:
The dataset to be used for this assignment is the Car Dataset from Kaggle, which
contains information about cars, including their make, model, year, mileage, fuel type,
and more.
Theory:
The Bag-of-Words approach is a common NLP technique that represents a document as
an unordered collection (a "bag") of its words, ignoring word order and context. We will
compute both raw word counts and normalized word counts (term frequencies) for the
dataset.
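A minimal sketch of the counting step, using placeholder strings in place of actual text
fields from the Car Dataset (the real column names depend on the specific Kaggle file
used):

    from collections import Counter
    import pandas as pd

    # Placeholder documents standing in for text fields from the Car Dataset.
    docs = ["swift vxi petrol manual", "swift dzire diesel manual", "city petrol automatic"]

    counts = [Counter(doc.split()) for doc in docs]
    bow = pd.DataFrame(counts).fillna(0)                # raw count occurrence per document
    bow_normalized = bow.div(bow.sum(axis=1), axis=0)   # normalized counts (term frequencies)
    print(bow)
    print(bow_normalized)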
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to
evaluate how important a word is to a document in a collection. It weights a word's
frequency within a document by the inverse of how many documents in the collection
contain that word, so words that appear in almost every document receive low scores.
We will calculate TF-IDF scores for the words in the Car Dataset.
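A small sketch of the TF-IDF calculation using Gensim's Dictionary and TfidfModel, again
on placeholder documents standing in for the Car Dataset text (scikit-learn's
TfidfVectorizer would work equally well):

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    # Placeholder documents standing in for text fields from the Car Dataset.
    docs = ["swift vxi petrol manual", "swift dzire diesel manual", "city petrol automatic"]
    tokenized_docs = [doc.split() for doc in docs]

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    tfidf = TfidfModel(corpus)

    for doc in tfidf[corpus]:
        print([(dictionary[term_id], round(score, 3)) for term_id, score in doc])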
Word2Vec is a neural network-based approach for creating word embeddings, which are
dense vector representations of words in a continuous vector space; words used in
similar contexts end up with similar vectors. We will create Word2Vec embeddings for
the Car Dataset and analyze the results.
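A minimal Word2Vec sketch with Gensim; the sentences and hyperparameters below are
illustrative and would be replaced by tokenized rows from the Car Dataset:

    from gensim.models import Word2Vec

    # Placeholder tokenized sentences standing in for rows of the Car Dataset.
    sentences = [["maruti", "swift", "petrol", "manual"],
                 ["maruti", "swift", "dzire", "diesel"],
                 ["honda", "city", "petrol", "automatic"]]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
    print(model.wv["swift"][:5])                    # first few dimensions of one embedding
    print(model.wv.most_similar("petrol", topn=3))  # nearest neighbours in the vector space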
Conclusion:
We have explored different techniques for analyzing text data in the Car Dataset. We
have applied the bag-of-words approach to compute raw and normalized word counts for
the dataset, as well as calculated TF-IDF scores for the words.
Finally, we have created Word2Vec embeddings for the dataset and analyzed the
results.
Assignment 3:
Title:
Text Cleaning, Lemmatization, Stop Word Removal, Label Encoding, and TF-IDF
Representation on News Dataset
Objectives:
- To perform text cleaning on the News Dataset.
- To perform lemmatization on the cleaned text using any method.
- To remove stop words from the text using any method.
- To perform label encoding on the target variable of the dataset.
- To create a TF-IDF representation of the preprocessed text.
- To save the outputs of the preprocessing steps.
Pre-requisites:
- Basic knowledge of Natural Language Processing (NLP) concepts.
- Familiarity with the Python programming language and its libraries such as NLTK,
Pandas, and Scikit-learn.
Dataset:
The dataset to be used for this assignment is the News Dataset available on the
following GitHub repository:
https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle.
This dataset contains news articles labeled with their respective categories.
Theory:
Text Cleaning involves removing noise, unwanted characters, and unnecessary words
from the text data. We will perform text cleaning on the News Dataset.
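A small illustrative cleaning function; the exact rules (lowercasing, stripping HTML tags,
removing non-letters and extra whitespace) should be adapted to the noise actually
present in the News Dataset:

    import re

    def clean_text(text):
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
        text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
        text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
        return text

    print(clean_text("Breaking NEWS: <b>Stocks</b> rose 3% today!!!"))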
Lemmatization is the process of reducing words to their base or dictionary form. We
will perform lemmatization on the cleaned text using any method.
Stop Word Removal involves removing common words that do not carry much meaning
from the text data. We will remove stop words from the text using any method.
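A minimal sketch combining stop word removal and lemmatization with NLTK (assuming the
required NLTK data packages have been downloaded); other libraries such as spaCy would
work equally well:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    nltk.download("stopwords")
    nltk.download("wordnet")
    nltk.download("punkt")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        tokens = word_tokenize(text)
        return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

    print(preprocess("stocks rose sharply after the earnings were announced"))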
Label Encoding is a process of converting categorical variables into numerical format.
We will perform label encoding on the target variable of the dataset.
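A short label-encoding sketch with scikit-learn; the category values below are
hypothetical stand-ins for the dataset's target column:

    from sklearn.preprocessing import LabelEncoder

    categories = ["sports", "politics", "tech", "sports", "tech"]

    encoder = LabelEncoder()
    encoded = encoder.fit_transform(categories)
    print(encoded)                  # [1 0 2 1 2]
    print(list(encoder.classes_))   # mapping from integer back to category name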
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to
evaluate how important a word is to a document in a collection. We will create a TF-IDF
representation of the preprocessed text.
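A minimal sketch of the TF-IDF step, including saving the fitted vectorizer and matrix
with pickle; the documents and the output filename are illustrative:

    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder documents standing in for the cleaned, lemmatized news articles.
    cleaned_articles = ["stock rose sharply earnings announced",
                        "election result declared late evening",
                        "new phone launched flagship chip"]

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(cleaned_articles)

    # Save the outputs of the preprocessing steps for later use.
    with open("tfidf_outputs.pkl", "wb") as f:
        pickle.dump((vectorizer, tfidf_matrix), f)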
Conclusion:
We have performed various preprocessing steps on the News Dataset, including text
cleaning, lemmatization, stop word removal, and label encoding. We have also created a
TF-IDF representation of the preprocessed text. These steps are essential in
preparing text data for various NLP applications. Finally, we have saved the outputs of
the preprocessing steps for future use.
Assignment 4:
Title:
Building a Transformer from Scratch Using the PyTorch Library
Objectives:
- To understand the architecture of a Transformer.
- To implement the key components of a Transformer, including the Multi-Head
Attention, Position-wise Feedforward Network, and Layer Normalization.
- To train and evaluate the Transformer model on a text classification task.
- To analyze the performance of the model and interpret the results.
Pre-requisites:
- Knowledge of deep learning concepts, including neural networks and optimization
algorithms.
- Familiarity with the PyTorch library and its modules, such as nn, optim, and DataLoader.
- Understanding of NLP concepts, such as tokenization, padding, and embedding.
Dataset:
We can use any text classification dataset, such as the IMDB movie review dataset or
the AG News dataset.
Theory:
The Transformer is a type of neural network architecture that was introduced in the
paper "Attention Is All You Need" by Vaswani et al. (2017). Rather than using recurrence
or convolution, it relies entirely on self-attention mechanisms to process sequential
data, such as text or speech.
The key components of a Transformer are Multi-Head Attention, the Position-wise
Feedforward Network, and Layer Normalization. Multi-Head Attention computes attention
between every position in the input sequence and every other position, using several
attention heads in parallel. The Position-wise Feedforward Network then transforms each
position's attention output independently, and Layer Normalization, applied around each
sub-layer together with a residual connection, stabilizes the outputs of each layer.
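A compact sketch of these three components in PyTorch, written from scratch rather than
using nn.TransformerEncoderLayer; the dimensions and hyperparameters are illustrative:

    import torch
    import torch.nn as nn

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super().__init__()
            assert d_model % num_heads == 0
            self.d_k = d_model // num_heads
            self.num_heads = num_heads
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x):
            batch, seq_len, d_model = x.shape
            # Project and split into heads: (batch, heads, seq_len, d_k)
            def split(t):
                return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
            q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
            # Scaled dot-product self-attention of the sequence with itself
            scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
            attn = torch.softmax(scores, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
            return self.out_proj(out)

    class EncoderBlock(nn.Module):
        def __init__(self, d_model=128, num_heads=4, d_ff=512, dropout=0.1):
            super().__init__()
            self.attn = MultiHeadAttention(d_model, num_heads)
            self.ffn = nn.Sequential(                  # position-wise feedforward network
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x):
            x = self.norm1(x + self.drop(self.attn(x)))    # residual + layer normalization
            x = self.norm2(x + self.drop(self.ffn(x)))
            return x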
To implement the Transformer from scratch using PyTorch, we will define each of these
components and combine them into a complete model. We will then train and evaluate the
model on a text classification task.
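A short sketch of how the EncoderBlock above might be wrapped into a classifier and
trained for one step; the vocabulary size, sequence length, and dummy batch are
placeholders for a real tokenized dataset such as IMDB or AG News:

    class TransformerClassifier(nn.Module):
        def __init__(self, vocab_size, num_classes, d_model=128, max_len=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)      # learned positional embeddings
            self.encoder = EncoderBlock(d_model)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, token_ids):
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            x = self.embed(token_ids) + self.pos(positions)
            x = self.encoder(x)
            return self.classifier(x.mean(dim=1))          # mean-pool over the sequence

    model = TransformerClassifier(vocab_size=20000, num_classes=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    tokens = torch.randint(0, 20000, (8, 64))              # dummy batch of padded token ids
    labels = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), labels)
    loss.backward()
    optimizer.step()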
Conclusion:
We have explored the architecture of a Transformer and its key components, including
Multi-Head Attention, Position-wise Feedforward Network, and Layer Normalization.
We have implemented these components from scratch using PyTorch and trained the
model on a text classification task. We have also analyzed the performance of the
model and interpreted the results. Building a Transformer from scratch is a challenging
but rewarding task that can enhance our understanding of deep learning and NLP.
Assignment 5:
Title:
Understanding Morphology Using Add-Delete Tables
Objectives:
- To understand the concept of morphology and how words are built up from smaller
meaning-bearing units.
- To learn about the different types of morphemes, including free and bound
morphemes.
- To use add-delete tables as a tool for analyzing the morphological structure of words.
Pre-requisites:
- Basic knowledge of linguistics and grammar.
- Familiarity with the concept of words and their structures.
- Understanding of the difference between morphemes and phonemes.
Theory:
Morphology is the study of the structure and form of words, including how they are
built up from smaller meaning-bearing units called morphemes. There are two types of
morphemes: free morphemes, which can stand alone as words, and bound morphemes,
which must be attached to other morphemes to create words.
Add-delete tables are a tool used in morphology to analyze the morphological structure
of words. These tables show how words can be built up from smaller morphemes by
adding or deleting affixes. The table is divided into three columns: the stem, the affix,
and the resulting word.
To use add-delete tables, we start with a stem, which is the base form of a word. We
then add prefixes or suffixes to the stem to create new words. We can also delete
affixes to derive new words or analyze the morphological structure of existing words.
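For example, an illustrative add-delete table for the stem "play" might look like the
following (the entries are examples, not an exhaustive analysis):

    Stem       Affix    Operation   Resulting Word
    play       -ed      add         played
    play       -ing     add         playing
    playing    -ing     delete      play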
Conclusion:
We have explored the concept of morphology and how words are built up from smaller
meaning-bearing units called morphemes. We have learned about the different types of
morphemes, including free and bound morphemes, and how they are used to create
words. We have also used add-delete tables as a tool for analyzing the morphological
structure of words. By studying morphology, we can gain a deeper understanding of the
structure and meaning of language.