
3. "His face turns red after consuming the medicine"
Possibilities: Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

2. Perfect Syntax, no Meaning
Example: "Chickens feed extravagantly while the moon drinks tea."
Meaning: This statement is grammatically correct but makes no sense. In human language, a perfect balance of syntax and semantics is important for better understanding.

1. Data Processing
Since the language of computers is numerical, the very first step that comes to mind is to convert our language to numbers. This conversion takes a few steps, the first of which is Text Normalisation.

Text Normalisation
In Text Normalisation, we undergo several steps to normalise the text to a lower level. That is, we will be working on text from multiple documents, and the term used for the whole textual data from all the documents together is the "Corpus".

1. Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken as a separate piece of data, so the whole corpus gets reduced to sentences.

Example:
Before Sentence Segmentation:
"You want to see the dreams with close eyes and achieve them? They'll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen."

After Sentence Segmentation:
You want to see the dreams with close eyes and achieve them?
They'll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.
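As a minimal sketch, sentence segmentation can be tried in Python with a regular expression that splits on sentence-ending punctuation (this is one simple approach; libraries such as NLTK offer more robust segmenters). The corpus string is the example from above:

```python
import re

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to "
          "stay open for a change to be seen.")

# Split wherever a sentence-ending mark (., !, ?) is followed by a space.
sentences = re.split(r"(?<=[.!?])\s+", corpus)
for sentence in sentences:
    print(sentence)
```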
2. Tokenisation
A "token" is the term used for any word, number, or special character occurring in a sentence. Under tokenisation, every word, number, and special character is considered separately, and each of them becomes a separate token.
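A simple regex-based tokeniser in the same spirit (again a sketch; dedicated tokenisers handle more edge cases):

```python
import re

sentence = "You want to see the dreams with close eyes and achieve them?"

# \w+ matches each word or number; [^\w\s] matches each special character.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close',
#  'eyes', 'and', 'achieve', 'them', '?']
```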

Corpus: A corpus can be defined as a collection of text documents.

4. Removal of Stopwords
In this step, the tokens which are not necessary are removed from the token list, making it easier for the computer to focus on meaningful terms. Along with these words, our corpus might often contain special characters and/or numbers. However, if you are working on a document containing email IDs, then you might not want to remove the special characters and numbers.

Stopwords: Stopwords are words that occur very frequently in the corpus but do not add any value to it. Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.

Example: You want to see the dreams with close eyes and achieve them?
The removed words would be: to, the, and, ?
The outcome would be: You want see dreams with close eyes achieve them
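A minimal sketch of this filtering step, using the stopword list from the examples above (real projects often use a larger list, such as the one shipped with NLTK):

```python
stopwords = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]

# Drop stopwords and stray special characters. (Skip the isalnum() check
# if the text contains email IDs you need to keep intact.)
filtered = [t for t in tokens if t.lower() not in stopwords and t.isalnum()]
print(filtered)
# ['You', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```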

5. Converting Text to a Common Case
We convert the whole text into a similar case, preferably lowercase. This ensures that the machine's case sensitivity does not treat the same words as different just because they appear in different cases.
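In Python this is a one-liner over the token list:

```python
tokens = ["You", "want", "see", "dreams", "with", "close",
          "eyes", "achieve", "them"]

# Lowercasing ensures "You" and "you" are counted as the same word.
tokens = [t.lower() for t in tokens]
print(tokens)
# ['you', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```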
6. Stemming
Stemming is a technique used to extract the base form of a word by removing affixes from it. It is just like cutting down the branches of a tree to its stem. The stemmed word might not be meaningful.

Example:

Words     Affixes   Stem
healing   -ing      heal
dreams    -s        dream
studies   -es       studi
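A quick way to try stemming is NLTK's Porter stemmer (an assumption here, since the notes name no tool; its rules are more nuanced than plain affix chopping, so other words may stem differently than a hand-worked table suggests):

```python
# Requires NLTK: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healing", "dreams", "studies"]:
    print(word, "->", stemmer.stem(word))
# healing -> heal
# dreams -> dream
# studies -> studi
```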
7. Lemmatization
In lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one, although it takes longer to execute than stemming. Lemmatization makes sure that a lemma is a word with meaning.

Example:

Words     Affixes   Lemma
healing   -ing      heal
dreams    -s        dream
studies   -es       study
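NLTK's WordNet lemmatizer (again an assumption, not named in the notes) reproduces the table; it defaults to treating words as nouns, so a part-of-speech hint is needed for verb forms:

```python
# Requires NLTK and its WordNet data:
#   pip install nltk
#   python -m nltk.downloader wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # study
print(lemmatizer.lemmatize("healing", pos="v"))  # heal (treated as a verb)
print(lemmatizer.lemmatize("caring", pos="v"))   # care, not "car"
```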

Difference between Stemming and Lemmatization
Stemming: the stemmed word might not be meaningful. Caring ➔ Car
Lemmatization: the lemma is a meaningful word. Caring ➔ Care
Bag of Words Algorithm
Calling this algorithm a "bag" of words symbolizes that the sequence of sentences or tokens does not matter; all we need are the unique words and their frequencies. Bag of Words simply creates a set of vectors containing the count of word occurrences in each document (e.g. reviews), and these vectors are easy to interpret.

The Bag of Words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each has occurred in the whole corpus)

Steps of the Bag of Words Algorithm
1. Text Normalisation: collect the data and pre-process it.
2. Create Dictionary: make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: for each document in the corpus, find out how many times each word from the unique list of words has occurred.
4. Create document vectors for all the documents.

Example:

Step 1: Collect the data and pre-process it.

Raw data:
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot

Processed data:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]

Step 2: Create Dictionary
A dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are written just once while creating the dictionary.

Dictionary:
aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot

Step 3: Create document vectors
The document vector contains the frequency of each word of the vocabulary in a particular document. For each word in the document: if it matches the vocabulary, put a 1 under it; if the same word appears again, increment the previous value by 1; and if the word does not occur in that document, put a 0 under it. In the document vector table, the vocabulary is written in the top row.

For Document 1:
aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
1     1    1     1    1         0     0   0  0          0         0       0

Step 4: Create a document vector table for all documents.

            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three documents. This gives us the document vector table for our corpus. But these raw counts do not yet weigh the words by importance, which leads us to the final step of our algorithm: TFIDF.
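A minimal end-to-end sketch of these four steps in Python, using only plain lists and string methods (the documents are the three from the example above):

```python
documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Step 1: text normalisation - lowercase and tokenise each document.
processed = [doc.lower().split() for doc in documents]

# Step 2: create the dictionary of unique words, keeping first-seen order.
vocabulary = []
for doc in processed:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: one vector per document - the count of every
# vocabulary word in that document.
vectors = [[doc.count(word) for word in vocabulary] for doc in processed]

print(vocabulary)
for vector in vectors:
    print(vector)
# ['aman', 'and', 'anil', 'are', 'stressed', 'went', 'to', 'a',
#  'therapist', 'download', 'health', 'chatbot']
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```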
TFIDF
TFIDF stands for Term Frequency - Inverse Document Frequency.

1. Term Frequency
Term frequency is the frequency of a word in one document, and it can easily be read off the document vector table.

Example:

Term frequencies (identical to the document vector table):

            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1

Here, the frequency of each word in each document has been recorded in the table. These numbers are nothing but the term frequencies!

2. Document Frequency
Document frequency is the number of documents in which a word occurs, irrespective of how many times it occurs in those documents.

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
2     1    2     1    1         2     2   2  1          1         1       1

We can observe from the table that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they occur in two documents. The rest occur in just one document, so their document frequency is 1.
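Continuing the sketch above, the document frequencies fall out of the document vectors directly:

```python
# Document frequency: in how many documents does each word occur,
# regardless of how often it occurs within a document?
df = {word: sum(1 for vector in vectors if vector[i] > 0)
      for i, word in enumerate(vocabulary)}
print(df)
# {'aman': 2, 'and': 1, 'anil': 2, 'are': 1, 'stressed': 1, 'went': 2,
#  'to': 2, 'a': 2, 'therapist': 1, 'download': 1, 'health': 1, 'chatbot': 1}
```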

3. Inverse Document Frequency
For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator: IDF(W) = N / DF(W), where N is the total number of documents. For our corpus of 3 documents:

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2 3/2 3/1       3/1       3/1     3/1

Formula of TFIDF
The formula of TFIDF for any word W is:

TFIDF(W) = TF(W) * log( IDF(W) ) = TF(W) * log( N / DF(W) )

We don't need to calculate the log values by ourselves; we can simply use the log function on a calculator. Multiplying each term frequency by the log of its inverse document frequency gives:

Document 1: 1*log(3/2), 1*log(3), 1*log(3/2), 1*log(3), 1*log(3), 0*log(3/2), 0*log(3/2), 0*log(3/2), 0*log(3), 0*log(3), 0*log(3), 0*log(3)
Document 2: 1*log(3/2), 0*log(3), 0*log(3/2), 0*log(3), 0*log(3), 1*log(3/2), 1*log(3/2), 1*log(3/2), 1*log(3), 0*log(3), 0*log(3), 0*log(3)
Document 3: 0*log(3/2), 0*log(3), 1*log(3/2), 0*log(3), 0*log(3), 1*log(3/2), 1*log(3/2), 1*log(3/2), 0*log(3), 1*log(3), 1*log(3), 1*log(3)

(values in the vocabulary order aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot)
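Continuing the sketch, the whole table can be computed in a few lines. The base-10 logarithm is what reproduces the 0.176 and 0.477 values in the worked tables:

```python
import math

N = len(documents)  # total number of documents, 3 here

# TFIDF(W) = TF(W) * log10(N / DF(W)), computed per document.
tfidf = [[tf * math.log10(N / df[word])
          for word, tf in zip(vocabulary, vector)]
         for vector in vectors]

for row in tfidf:
    print([round(value, 3) for value in row])
# [0.176, 0.477, 0.176, 0.477, 0.477, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.176, 0.0, 0.0, 0.0, 0.0, 0.176, 0.176, 0.176, 0.477, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.176, 0.0, 0.0, 0.176, 0.176, 0.176, 0.0, 0.477, 0.477, 0.477]
```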

After calculating all the values, we get:

Document 1: 0.176, 0.477, 0.176, 0.477, 0.477, 0, 0, 0, 0, 0, 0, 0
Document 2: 0.176, 0, 0, 0, 0, 0.176, 0.176, 0.176, 0.477, 0, 0, 0
Document 3: 0, 0, 0.176, 0, 0, 0.176, 0.176, 0.176, 0, 0.477, 0.477, 0.477

(again in the vocabulary order aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot)

Finally, the words have been converted to numbers. These numbers are the values of each word in each document. Here we can see that, since we have a small amount of data, even words like 'are' and 'and' get a high value. But the more documents a word occurs in, the lower its value becomes. For example:

Total number of documents: 10
Number of documents in which 'and' occurs: 10
Therefore, IDF(and) = 10/10 = 1, and log(1) = 0. Hence, the value of 'and' becomes 0.

On the other hand, suppose the number of documents in which 'pollution' occurs is 3.
IDF(pollution) = 10/3 = 3.3333...
log(3.3333) ≈ 0.52, which shows that the word 'pollution' has considerable value in the corpus.
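The same two-line check in Python (base-10 log, as throughout; Python rounds the 'pollution' value to 0.523 where the hand calculation above truncates to 0.52):

```python
import math

# A word occurring in all 10 of 10 documents carries no weight:
print(math.log10(10 / 10))           # 0.0

# A word like 'pollution' occurring in only 3 of 10 documents does:
print(round(math.log10(10 / 3), 3))  # 0.523
```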
