NLP Worksheet: Text Processing, Bag of Words, Tf-Idf Activity

The document provides instructions for performing various natural language processing (NLP) tasks on a corpus of 3 health-related documents, including sentence segmentation, tokenization, removing stopwords/special characters, stemming, lemmatization, bag-of-words modeling, and calculating term frequency-inverse document frequency (tf-idf). The tasks are broken down into clear steps but no examples or results are provided.

Uploaded by

Gayathri . M
Copyright
© All Rights Reserved

NLP Worksheet

Text processing, bag of words, tf-idf activity


Suppose you have obtained this information and would like to analyse it. Let’s start by making it ready for the computer!

Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y

Step 1: Sentence Segmentation

No. Sentence
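
One way to check your answer for this step is a minimal Python sketch like the one below. The splitting rule (a sentence ends at ".", "!" or "?" followed by a space) is an assumption made for illustration and is not part of the worksheet. Real sentence segmenters are more careful.

```python
import re

# The three documents from the Corpus section, copied as written.
corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y",
]

def split_sentences(text):
    # Assumed naive rule: split after '.', '!' or '?' when whitespace follows.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]

sentences = [s for doc in corpus for s in split_sentences(doc)]
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

With this naive rule the noisy tail of Document 3 is also cut into fragments, which is one reason the later cleaning steps are needed.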

Step 2: Tokenization
Separate your sentences into tokens. How many tokens do you have?

Tokens

Number of tokens: ________
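
As a rough check on your count, here is a small Python sketch that tokenizes Document 2. It assumes that a token is either a run of letters and digits or a single punctuation mark. Other tokenizers may count differently.

```python
import re

sentence = "We can use NLP to create chatbots and we will be making health chatbots now!"

# Assumed rule: a token is a run of word characters or one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
print("Number of tokens:", len(tokens))
```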


Step 3: Remove stopwords, special characters, numbers
List out the stopwords, special characters, and numbers that you want to remove!

Stopwords, special characters, and numbers
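
The removal step can be sketched in Python as follows. The stopword list here is a short hypothetical list written for this example. Standard lists are much longer, and your own choice of stopwords may differ.

```python
import re

# Hypothetical stopword list for illustration only.
STOPWORDS = {"we", "can", "for", "to", "and", "will", "be", "now"}

text = "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y"
tokens = re.findall(r"\w+|[^\w\s]", text)

# Keep a token only if it is purely alphabetic and not a stopword.
# Tokens containing digits or symbols (such as "1nteLA") are dropped.
kept = [t for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]
removed = [t for t in tokens if not (t.isalpha() and t.lower() not in STOPWORDS)]

print("Kept:   ", kept)
print("Removed:", removed)
```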

Step 4: Converting text to a common case


Which text do you need to modify? What is the modified form?

Modified form
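
A minimal Python sketch of this step, assuming lower case is chosen as the common case. The token list is a hand-picked sample from the corpus.

```python
# Sample tokens taken from the corpus; some contain capital letters.
tokens = ["Health", "Chatbots", "cannot", "replace", "human", "counsellors", "Yay", "NLP"]

# Record only the tokens that actually change when lowercased.
modified = {t: t.lower() for t in tokens if t != t.lower()}
lowered = [t.lower() for t in tokens]

for original, new_form in modified.items():
    print(original, "->", new_form)
```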

Step 5: Stemming
List out the stem words.

Stem words
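
To see what stemming does, here is a toy Python sketch. The two suffix rules are an assumption made for this example and are far cruder than a real stemmer such as the Porter stemmer, so real tools may give different stems.

```python
def toy_stem(word):
    # Toy rules (assumed): strip "ing" or a plural "s",
    # but always keep at least three letters and never touch "ss".
    if word.endswith("ing") and len(word) - 3 >= 3:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) - 1 >= 3:
        return word[:-1]
    return word

words = ["treating", "making", "chatbots", "counsellors", "stress", "health"]
stems = {w: toy_stem(w) for w in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Note that a stem such as "mak" is not a real word. Stemming only chops off endings.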

Step 6: Lemmatization
List out the root words (lemmas).

Lemma
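
Lemmatization maps a word to a dictionary form. The sketch below uses a tiny hand-written lookup table that covers only a few words from this corpus. The table is hypothetical, and a real lemmatizer relies on a full dictionary.

```python
# Hypothetical lemma table, written by hand for this worksheet's corpus only.
LEMMAS = {
    "treating": "treat",
    "making": "make",
    "chatbots": "chatbot",
    "counsellors": "counsellor",
}

def toy_lemma(word):
    # Words missing from the table are returned unchanged.
    return LEMMAS.get(word, word)

words = ["treating", "making", "chatbots", "counsellors", "stress", "health"]
lemmas = {w: toy_lemma(w) for w in words}

for word, lemma in lemmas.items():
    print(word, "->", lemma)
```

Unlike a stem, a lemma is always a real word: "making" becomes "make".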
Final data
List out the final, processed data.

Processed data

Congratulations, you’ve managed to process the data!

Bag of words
Step 1: Collect data and process it
For this exercise, we can use the sentences without processing them so that they are easier to read.

No. Sentence

1 We can use health chatbots for treating stress

2 We can use NLP to create chatbots and we will be making health chatbots now

3 Health chatbots cannot replace human counsellors now

Step 2: Create dictionary


Make a list of all the different words in the text.

Dictionary
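
The dictionary can be built in Python as sketched below. One assumption is made here: words are lowercased first, so that "We" and "we" (or "Health" and "health") count as the same word.

```python
# The three sentences from Step 1 of the bag-of-words exercise.
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# Collect each distinct word once, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)
print("Number of distinct words:", len(vocabulary))
```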

Step 3: Create document vectors


Use the next page to create your document vector!
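
A document vector holds, for every word in the dictionary, how many times that word occurs in the sentence. A minimal Python sketch, assuming the words are lowercased before counting:

```python
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# Dictionary of distinct words, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

# One vector per sentence: the count of each dictionary word in that sentence.
vectors = []
for sentence in sentences:
    counts = Counter(sentence.lower().split())
    vectors.append([counts[word] for word in vocabulary])

for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```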
Tf-idf
You’ve obtained your bag of words. Now let’s continue with the tf-idf!

Step 1 - 3: Count the number of documents where the word appears at least once & write that
number down next to the word in your vocabulary to get your document frequency. Draw your
own table for this!

Example of a document frequency:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
2     1    2     1    1         2     2   2  1          1         1       1
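
For this worksheet's own corpus, the same counting can be sketched in Python. Lowercasing the words first is an assumption of this sketch.

```python
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# Count each word at most once per sentence by turning the words into a set.
document_frequency = Counter()
for sentence in sentences:
    document_frequency.update(set(sentence.lower().split()))

for word, count in document_frequency.items():
    print(word, count)
```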

Your document frequency:

Step 4: Get your inverse document frequency.


Example of an inverse document frequency:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1
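
Following the pattern of the example (total number of documents divided by document frequency), the inverse document frequency for this worksheet's corpus can be sketched in Python. Lowercasing is again an assumption.

```python
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]
n_docs = len(sentences)

# Document frequency: in how many sentences each word appears.
document_frequency = Counter()
for sentence in sentences:
    document_frequency.update(set(sentence.lower().split()))

# Inverse document frequency before any log is taken: n_docs / df.
idf = {word: n_docs / df for word, df in document_frequency.items()}

for word, df in document_frequency.items():
    print(f"{word}: {n_docs}/{df}")
```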

Your inverse document frequency:


Step 5: Get your tf-idf
Example of a tf-idf:

After log operation:
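
A common way to combine the pieces is tf-idf = term frequency × log(number of documents / document frequency). The Python sketch below assumes that formula with a base-10 logarithm. The worksheet does not state which base to use, so treat the numbers as illustrative.

```python
import math
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]
n_docs = len(sentences)
tokenized = [sentence.lower().split() for sentence in sentences]

# Document frequency: in how many sentences each word appears.
document_frequency = Counter()
for words in tokenized:
    document_frequency.update(set(words))

# tf-idf per sentence: count in the sentence times log10(n_docs / df).
tfidf = []
for words in tokenized:
    term_frequency = Counter(words)
    tfidf.append({
        word: tf * math.log10(n_docs / document_frequency[word])
        for word, tf in term_frequency.items()
    })

for number, scores in enumerate(tfidf, start=1):
    print(number, {word: round(score, 3) for word, score in scores.items()})
```

Words that occur in every document, such as "health", get a score of 0 because log(3/3) = 0. Rarer words score higher.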

Your tf-idf:
