
3. "His face turns red after consuming the medicine"
Possibilities: Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

2. Perfect Syntax, no Meaning
Example: "Chickens feed extravagantly while the moon drinks tea."
Meaning: This statement is grammatically correct but makes no sense. In human language, a perfect balance of syntax and semantics is important for better understanding.

1. Data Processing
Since the language of computers is numerical, the very first step that comes to mind is to convert our language to numbers. This conversion takes a few steps, the first of which is Text Normalisation.

Text Normalisation
In Text Normalisation, we undergo several steps to normalise the text to a lower level. That is, we will be working on text from multiple documents, and the term used for the whole textual data from all the documents together is the "Corpus".

1. Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken as a separate piece of data, so the whole corpus gets reduced to sentences.

Example:
Before Sentence Segmentation:
"You want to see the dreams with close eyes and achieve them? They'll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen."

After Sentence Segmentation:
You want to see the dreams with close eyes and achieve them?
They'll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.
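As a minimal sketch, sentence segmentation can be tried in Python with a regular expression that splits on sentence-ending punctuation (this is one simple approach; libraries such as NLTK offer more robust segmenters). The corpus string is the example from above:

```python
import re

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to "
          "stay open for a change to be seen.")

# Split wherever a sentence-ending mark (., !, ?) is followed by a space.
sentences = re.split(r"(?<=[.!?])\s+", corpus)
for sentence in sentences:
    print(sentence)
```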
2. Tokenisation
A "token" is the term used for any word, number, or special character occurring in a sentence. Under tokenisation, every word, number, and special character is considered separately, and each of them becomes a separate token.
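A simple regex-based tokeniser in the same spirit (again a sketch; dedicated tokenisers handle more edge cases):

```python
import re

sentence = "You want to see the dreams with close eyes and achieve them?"

# \w+ matches each word or number; [^\w\s] matches each special character.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close',
#  'eyes', 'and', 'achieve', 'them', '?']
```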

Corpus: A corpus can be defined as a collection of text documents.

4. Removal of Stopwords
In this step, the tokens which are not necessary are removed from the token list, making it easier for the computer to focus on meaningful terms. Along with these words, our corpus might often contain special characters and/or numbers. However, if you are working on a document containing email IDs, then you might not want to remove the special characters and numbers.

Stopwords: Stopwords are words that occur very frequently in the corpus but do not add any value to it. Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.

Example: You want to see the dreams with close eyes and achieve them?
The removed words would be: to, the, and, ?
The outcome would be: You want see dreams with close eyes achieve them
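A minimal sketch of this filtering step, using the stopword list from the examples above (real projects often use a larger list, such as the one shipped with NLTK):

```python
stopwords = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]

# Drop stopwords and stray special characters. (Skip the isalnum() check
# if the text contains email IDs you need to keep intact.)
filtered = [t for t in tokens if t.lower() not in stopwords and t.isalnum()]
print(filtered)
# ['You', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```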

5. Converting Text to a Common Case
We convert the whole text into a similar case, preferably lowercase. This ensures that the machine's case sensitivity does not treat the same words as different just because they appear in different cases.
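In Python this is a one-liner over the token list:

```python
tokens = ["You", "want", "see", "dreams", "with", "close",
          "eyes", "achieve", "them"]

# Lowercasing ensures "You" and "you" are counted as the same word.
tokens = [t.lower() for t in tokens]
print(tokens)
# ['you', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```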
6. Stemming
Stemming is a technique used to extract the base form of a word by removing affixes from it. It is just like cutting down the branches of a tree to its stem. The stemmed word might not be meaningful.

Example:

Words     Affixes   Stem
healing   -ing      heal
dreams    -s        dream
studies   -es       studi
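A quick way to try stemming is NLTK's Porter stemmer (an assumption here, since the notes name no tool; its rules are more nuanced than plain affix chopping, so other words may stem differently than a hand-worked table suggests):

```python
# Requires NLTK: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healing", "dreams", "studies"]:
    print(word, "->", stemmer.stem(word))
# healing -> heal
# dreams -> dream
# studies -> studi
```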
7. Lemmatization
In lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one, although it takes longer to execute than stemming. Lemmatization makes sure that a lemma is a word with meaning.

Example:

Words     Affixes   Lemma
healing   -ing      heal
dreams    -s        dream
studies   -es       study
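NLTK's WordNet lemmatizer (again an assumption, not named in the notes) reproduces the table; it defaults to treating words as nouns, so a part-of-speech hint is needed for verb forms:

```python
# Requires NLTK and its WordNet data:
#   pip install nltk
#   python -m nltk.downloader wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # study
print(lemmatizer.lemmatize("healing", pos="v"))  # heal (treated as a verb)
print(lemmatizer.lemmatize("caring", pos="v"))   # care, not "car"
```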

Difference between Stemming and Lemmatization
Stemming: the stemmed word might not be meaningful. Caring ➔ Car
Lemmatization: the lemma is a meaningful word. Caring ➔ Care
Bag of Words Algorithm
Calling this algorithm a "bag" of words symbolizes that the sequence of sentences or tokens does not matter; all we need are the unique words and their frequencies. Bag of Words simply creates a set of vectors containing the count of word occurrences in each document (e.g. reviews), and these vectors are easy to interpret.

The Bag of Words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each has occurred in the whole corpus)

Steps of the Bag of Words Algorithm
1. Text Normalisation: collect the data and pre-process it.
2. Create Dictionary: make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: for each document in the corpus, find out how many times each word from the unique list of words has occurred.
4. Create document vectors for all the documents.

Example:

Step 1: Collect the data and pre-process it.

Raw data:
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot

Processed data:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]

Step 2: Create Dictionary
A dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are written just once while creating the dictionary.

Dictionary:
aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot

Step 3: Create document vectors
The document vector contains the frequency of each word of the vocabulary in a particular document. For each word in the document: if it matches the vocabulary, put a 1 under it; if the same word appears again, increment the previous value by 1; and if the word does not occur in that document, put a 0 under it. In the document vector table, the vocabulary is written in the top row.

For Document 1:
aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
1     1    1     1    1         0     0   0  0          0         0       0

Step 4: Create a document vector table for all documents.

            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three documents. This gives us the document vector table for our corpus. But these raw counts do not yet weigh the words by importance, which leads us to the final step of our algorithm: TFIDF.
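A minimal end-to-end sketch of these four steps in Python, using only plain lists and string methods (the documents are the three from the example above):

```python
documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Step 1: text normalisation - lowercase and tokenise each document.
processed = [doc.lower().split() for doc in documents]

# Step 2: create the dictionary of unique words, keeping first-seen order.
vocabulary = []
for doc in processed:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: one vector per document - the count of every
# vocabulary word in that document.
vectors = [[doc.count(word) for word in vocabulary] for doc in processed]

print(vocabulary)
for vector in vectors:
    print(vector)
# ['aman', 'and', 'anil', 'are', 'stressed', 'went', 'to', 'a',
#  'therapist', 'download', 'health', 'chatbot']
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```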
TFIDF
TFIDF stands for Term Frequency - Inverse Document Frequency.

1. Term Frequency
Term frequency is the frequency of a word in one document, and it can easily be read off the document vector table.

Example:

Term frequencies (identical to the document vector table):

            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1

Here, the frequency of each word in each document has been recorded in the table. These numbers are nothing but the term frequencies!

2. Document Frequency
Document frequency is the number of documents in which a word occurs, irrespective of how many times it occurs in those documents.

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
2     1    2     1    1         2     2   2  1          1         1       1

We can observe from the table that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they occur in two documents. The rest occur in just one document, so their document frequency is 1.
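Continuing the sketch above, the document frequencies fall out of the document vectors directly:

```python
# Document frequency: in how many documents does each word occur,
# regardless of how often it occurs within a document?
df = {word: sum(1 for vector in vectors if vector[i] > 0)
      for i, word in enumerate(vocabulary)}
print(df)
# {'aman': 2, 'and': 1, 'anil': 2, 'are': 1, 'stressed': 1, 'went': 2,
#  'to': 2, 'a': 2, 'therapist': 1, 'download': 1, 'health': 1, 'chatbot': 1}
```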

3. Inverse Document Frequency
For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator: IDF(W) = N / DF(W), where N is the total number of documents. For our corpus of 3 documents:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2 3/2 3/1       3/1       3/1     3/1

Formula of TFIDF
The formula of TFIDF for any word W is:

TFIDF(W) = TF(W) * log( IDF(W) ) = TF(W) * log( N / DF(W) )

We don't need to calculate the log values by ourselves; we can simply use the log function on a calculator. Multiplying each term frequency by the log of its inverse document frequency gives:

Document 1: 1*log(3/2), 1*log(3), 1*log(3/2), 1*log(3), 1*log(3), 0*log(3/2), 0*log(3/2), 0*log(3/2), 0*log(3), 0*log(3), 0*log(3), 0*log(3)
Document 2: 1*log(3/2), 0*log(3), 0*log(3/2), 0*log(3), 0*log(3), 1*log(3/2), 1*log(3/2), 1*log(3/2), 1*log(3), 0*log(3), 0*log(3), 0*log(3)
Document 3: 0*log(3/2), 0*log(3), 1*log(3/2), 0*log(3), 0*log(3), 1*log(3/2), 1*log(3/2), 1*log(3/2), 0*log(3), 1*log(3), 1*log(3), 1*log(3)

(values in the vocabulary order aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot)
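Continuing the sketch, the whole table can be computed in a few lines. The base-10 logarithm is what reproduces the 0.176 and 0.477 values in the worked tables:

```python
import math

N = len(documents)  # total number of documents, 3 here

# TFIDF(W) = TF(W) * log10(N / DF(W)), computed per document.
tfidf = [[tf * math.log10(N / df[word])
          for word, tf in zip(vocabulary, vector)]
         for vector in vectors]

for row in tfidf:
    print([round(value, 3) for value in row])
# [0.176, 0.477, 0.176, 0.477, 0.477, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.176, 0.0, 0.0, 0.0, 0.0, 0.176, 0.176, 0.176, 0.477, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.176, 0.0, 0.0, 0.176, 0.176, 0.176, 0.0, 0.477, 0.477, 0.477]
```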

After calculating all the values, we get:

Document 1: 0.176, 0.477, 0.176, 0.477, 0.477, 0, 0, 0, 0, 0, 0, 0
Document 2: 0.176, 0, 0, 0, 0, 0.176, 0.176, 0.176, 0.477, 0, 0, 0
Document 3: 0, 0, 0.176, 0, 0, 0.176, 0.176, 0.176, 0, 0.477, 0.477, 0.477

(again in the vocabulary order aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot)

Finally, the words have been converted to numbers. These numbers are the values of each word in each document. Here we can see that, since we have a small amount of data, even words like 'are' and 'and' get a high value. But the more documents a word occurs in, the lower its value becomes. For example:

Total number of documents: 10
Number of documents in which 'and' occurs: 10
Therefore, IDF(and) = 10/10 = 1, and log(1) = 0. Hence, the value of 'and' becomes 0.

On the other hand, suppose the number of documents in which 'pollution' occurs is 3.
IDF(pollution) = 10/3 = 3.3333...
log(3.3333) ≈ 0.52, which shows that the word 'pollution' has considerable value in the corpus.
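The same two-line check in Python (base-10 log, as throughout; Python rounds the 'pollution' value to 0.523 where the hand calculation above truncates to 0.52):

```python
import math

# A word occurring in all 10 of 10 documents carries no weight:
print(math.log10(10 / 10))           # 0.0

# A word like 'pollution' occurring in only 3 of 10 documents does:
print(round(math.log10(10 / 3), 3))  # 0.523
```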
