NLP Worksheet
Text processing, bag of words, tf-idf activity
Suppose you have obtained this information and would like to analyse it. Let’s start by making it ready for the computer!
Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y
Step 1: Sentence Segmentation
No. Sentence
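If you would like to check your table against a program, here is a minimal sketch of sentence segmentation. It assumes a simple rule (split after '.', '!' or '?' when followed by a space), which is only one possible rule and not the only correct answer. The names corpus and split_sentences are made up for this example.

```python
import re

# The three documents from the corpus above.
corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y",
]

def split_sentences(text):
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]

sentences = [s for doc in corpus for s in split_sentences(doc)]
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

Notice that this simple rule treats the noisy text at the end of Document 3 as separate pieces. You can decide for yourself whether those pieces count as sentences.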
Step 2: Tokenization
Separate your sentences into tokens. How many tokens do you have?
Tokens
Number of tokens: ________
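As a rough illustration, tokenization can be sketched with a regular expression. This sketch assumes that every word and every punctuation mark is its own token; other tokenizers make different choices, so your count may differ.

```python
import re

sentence = "We can use health chatbots for treating stress."

# \w+ matches a run of letters, digits or underscores (a word);
# [^\w\s] matches a single character that is neither a word character nor a space.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
print("Number of tokens:", len(tokens))
```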
Step 3: Remove stopwords, special characters, numbers
List out the stopwords, special characters, and numbers that you want to remove!
Stopwords, special characters, and numbers
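The removal step could look something like the sketch below. The stopword list here is a small, made-up list for this exercise only (real stopword lists are much longer), and the helper name keep is also just for illustration.

```python
# Hypothetical, deliberately small stopword list for this worksheet.
STOPWORDS = {"we", "can", "for", "to", "and", "will", "be", "now"}

tokens = ["We", "can", "use", "health", "chatbots", "for", "treating", "stress", "."]

def keep(token):
    # Keep a token only if it is made of letters (no digits or symbols)
    # and is not in the stopword list (compared in lowercase).
    return token.isalpha() and token.lower() not in STOPWORDS

filtered = [t for t in tokens if keep(t)]
removed = [t for t in tokens if not keep(t)]

print("Kept:", filtered)
print("Removed:", removed)
```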
Step 4: Converting text to a common case
Which text do you need to modify? What is the modified form?
Modified form
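A short sketch of this step, assuming lowercase is chosen as the common case (the token list is a small sample picked for illustration):

```python
tokens = ["We", "can", "use", "NLP", "Health", "Chatbots"]

# Convert every token to lowercase.
lowered = [t.lower() for t in tokens]

# Pair each token that actually changed with its modified form.
changed = [(t, t.lower()) for t in tokens if t != t.lower()]

print(lowered)
print(changed)
```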
Step 5: Stemming
List out the stem words.
Stem words
Step 6: Lemmatization
List out the root words (lemmas).
Lemma
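To see the difference between Step 5 and Step 6, here is a toy sketch. It is not a real stemming algorithm or a real lemmatizer: toy_stem just chops off two assumed suffixes, and toy_lemma looks words up in a tiny hand-made table. Both function names and the table are invented for this example.

```python
# Assumed suffixes for the toy stemmer (real stemmers use many more rules).
SUFFIXES = ("ing", "s")

def toy_stem(word):
    # Chop the first matching suffix, as long as at least 3 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-made lookup table standing in for a real dictionary of lemmas.
LEMMAS = {
    "chatbots": "chatbot",
    "making": "make",
    "treating": "treat",
    "counsellors": "counsellor",
}

def toy_lemma(word):
    # Return the dictionary form if we know it, otherwise the word itself.
    return LEMMAS.get(word, word)

for word in ["chatbots", "making", "treating", "stress"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemma(word))
```

The stem of "making" comes out as "mak", which is not a real word, while the lemma is "make". Stemming is fast but rough; lemmatization gives meaningful words but needs a dictionary.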
Final data
List out the final, processed data.
Processed data
Congratulations, you’ve managed to process the data!
Bag of words
Step 1: Collect data and process it
For this exercise, we can use the sentences without processing them, so that they are easier for us to read.
No. Sentence
1 We can use health chatbots for treating stress
2 We can use NLP to create chatbots and we will be making health chatbots now
3 Health chatbots cannot replace human counsellors now
Step 2: Create dictionary
Make a list of all the different words in the text.
Dictionary
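A minimal sketch of building the dictionary is shown here. It assumes words are separated by spaces and that "We" and "we" count as the same word (so everything is lowercased first); if you treat them as different words, your list will be longer.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# Collect each different word once, in the order it first appears.
vocabulary = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)
print("Number of different words:", len(vocabulary))
```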
Step 3: Create document vectors
Use the next page to create your document vector!
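One way to check your document vectors is the sketch below. It makes the same assumptions as a simple bag of words: split on spaces, lowercase everything, and count how many times each dictionary word appears in each sentence.

```python
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

tokenized = [s.lower().split() for s in sentences]

# Dictionary of different words, in order of first appearance.
vocabulary = list(dict.fromkeys(w for doc in tokenized for w in doc))

# One vector per sentence: the count of every dictionary word in that sentence.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[w] for w in vocabulary])

print(vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```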
Tf-idf
You’ve obtained your bag of words. Now let’s continue with the tf-idf!
Steps 1 - 3: Count the number of documents in which each word appears at least once, and write that
number down next to the word in your vocabulary. This number is the word's document frequency. Draw your
own table for this!
Example of a document frequency:
aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
2     1    2     1    1         2     2    2    1          1         1       1
Your document frequency:
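If you want to check your table, document frequency can be sketched as follows for the three sentences of this worksheet. As before, the sketch assumes lowercase words separated by spaces.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

tokenized = [s.lower().split() for s in sentences]
vocabulary = list(dict.fromkeys(w for doc in tokenized for w in doc))

# For each word, count the documents that contain it at least once.
document_frequency = {
    w: sum(1 for doc in tokenized if w in doc) for w in vocabulary
}

for word, df in document_frequency.items():
    print(word, df)
```

A word that appears twice in the same document (such as "chatbots" in sentence 2) still adds only 1 to the document frequency.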
Step 4: Get your inverse document frequency.
Example of an inverse document frequency:
aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1
Your inverse document frequency:
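The example table follows the pattern (total number of documents) / (document frequency). Here is a small sketch that reproduces it, using the document frequencies from the example above and assuming a total of 3 documents, as the numerators in the table suggest.

```python
from fractions import Fraction

# Document frequencies copied from the worksheet's example table.
example_df = {
    "aman": 2, "and": 1, "anil": 2, "are": 1, "stressed": 1, "went": 2,
    "to": 2, "a": 2, "therapist": 1, "download": 1, "health": 1, "chatbot": 1,
}
total_documents = 3  # assumed from the "3/..." entries in the example

# Inverse document frequency as a ratio: total documents / document frequency.
idf = {word: Fraction(total_documents, df) for word, df in example_df.items()}

for word, value in idf.items():
    print(word, value)
```

Python prints 3/1 simply as 3, which is the same value.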
Step 5: Get your tf-idf
Example of a tf-idf:
After log operation:
Your tf-idf:
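To check your answers, the sketch below computes tf-idf for the three sentences of this worksheet as (count of the word in the document) x log(total documents / document frequency). The worksheet does not say which logarithm to use, so base 10 is an assumption here; a different base changes the numbers but not which words score highest.

```python
import math
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

tokenized = [s.lower().split() for s in sentences]
vocabulary = list(dict.fromkeys(w for doc in tokenized for w in doc))
total_documents = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocabulary}

# tf-idf = term count x log10(total documents / document frequency).
# Base 10 is an assumption; the worksheet only says "log operation".
tfidf = []
for doc in tokenized:
    tf = Counter(doc)
    tfidf.append({w: tf[w] * math.log10(total_documents / df[w]) for w in vocabulary})

for number, scores in enumerate(tfidf, start=1):
    print(number, {w: round(v, 3) for w, v in scores.items() if v > 0})
```

Words that appear in every document, such as "health" and "chatbots", get a score of 0 because log(3/3) = log(1) = 0. Words that appear in only one document get the highest scores.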