Extracting, Cleaning and Pre-processing Text
• Need for it
Need for Tokenization
• In order to get our computer to understand any text, we need to
break that text down in a way that our machine can understand.
That is where the concept of tokenization in Natural Language
Processing (NLP) comes in.
• Tokenization is a way of separating a piece of text into smaller units
called tokens. Here, tokens can be either words, characters, subwords
or sentences. Hence, tokenization can be broadly classified into 3
types – word, character, and subword (n-gram characters)
tokenization.
How is Tokenization done
• The most common way of forming tokens is to split on whitespace.
Assuming space as the delimiter, tokenizing the phrase "Natural
Language Processing" results in 3 tokens – Natural, Language and
Processing. As each token is a word, this is an example of word
tokenization.
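• As a minimal illustration (assuming Python with the NLTK library, which the
slides do not name), word tokenization can be done either by splitting on
spaces or with NLTK's tokenizer:

    # Word tokenization sketch: split on whitespace, or use NLTK's tokenizer.
    # Assumes nltk is installed and the 'punkt' tokenizer models are downloaded.
    import nltk

    text = "Natural Language Processing"

    tokens_by_space = text.split()          # ['Natural', 'Language', 'Processing']
    tokens_nltk = nltk.word_tokenize(text)  # also handles punctuation

    print(tokens_by_space)
    print(tokens_nltk)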
Bigrams, Trigrams & Ngrams
• N-grams are simply all combinations of adjacent words or letters of
length n that you can find in your source text.
• The basic point of n-grams is that they capture language structure
from a statistical point of view, such as which letter or word is likely
to follow a given one. The longer the n-gram (the higher the n), the
more context you have to work with. The optimum length really depends
on the application: if your n-grams are too short, you may fail to
capture important differences; if they are too long, you may fail to
capture the "general knowledge" and only stick to particular cases.
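• A small sketch of extracting bigrams, trigrams and general n-grams
(assuming Python with NLTK; the sentence is illustrative):

    # nltk.ngrams slides a window of length n over the token list.
    from nltk import ngrams, word_tokenize

    sentence = "Natural Language Processing helps machines understand text"
    tokens = word_tokenize(sentence)

    bigrams = list(ngrams(tokens, 2))   # pairs of adjacent words
    trigrams = list(ngrams(tokens, 3))  # triples of adjacent words

    print(bigrams[:3])   # [('Natural', 'Language'), ('Language', 'Processing'), ...]
    print(trigrams[:2])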
Frequency Distribution
• A frequency distribution records the number of times each outcome
of an experiment has occurred. For example, a frequency distribution
could be used to record the frequency of each word type in a
document
• To fully understand what is happening within our data, we can use a
frequency distribution function that will display the most frequent
tokens and words.
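• A minimal sketch with NLTK's FreqDist (an illustrative choice; the sample
text is made up):

    from nltk import FreqDist, word_tokenize

    text = "the cat sat on the mat and the cat slept"
    tokens = word_tokenize(text)

    fdist = FreqDist(tokens)     # counts how often each token occurs
    print(fdist.most_common(3))  # e.g. [('the', 3), ('cat', 2), ('sat', 1)]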
Cleaning up your data
• While working with frequency distributions, we noticed that many
words were identified as "different" tokens because some instances
were capitalized while others were not. In order to combine these
groups, we must convert all the characters into their lowercase forms.
• Next, we must remove meaningless punctuation. This allows our model
to focus entirely on the important words and phrases in the text
instead of on punctuation marks such as periods.
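• A minimal cleaning sketch (plain Python standard library; the sample string
is illustrative):

    import string

    text = "Natural Language Processing. natural language processing!"

    lowered = text.lower()  # merge case variants into one token form
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))  # drop punctuation

    print(cleaned)  # natural language processing natural language processing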
Stemming
• Stemming is the process of reducing a word to its word stem by
chopping off affixes (suffixes and prefixes), approximating the root
form of the word, sometimes referred to as the lemma.
• The input to the stemmer is tokenized words
• Stemming is used in information retrieval systems like search engines.
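• A minimal sketch with NLTK's Porter stemmer (one of several stemming
algorithms; the token list is illustrative). As noted above, the input is a
list of already-tokenized words:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    tokens = ["connects", "connected", "connecting", "connection"]

    stems = [stemmer.stem(t) for t in tokens]
    print(stems)  # ['connect', 'connect', 'connect', 'connect']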
Lemmatization
• Lemmatization is the process of grouping together the different
inflected forms of a word so they can be analyzed as a single item.
Lemmatization is similar to stemming but it brings context to the
words. So it links words with similar meanings to one word.
• To make lemmatization work, we have to supply a dictionary so that
words can be linked to their root lemma.
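• A minimal sketch with NLTK's WordNet lemmatizer, which uses the WordNet
dictionary as its lookup (an illustrative choice; requires the 'wordnet' data
to be downloaded):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("mice"))              # mouse
    print(lemmatizer.lemmatize("running", pos="v"))  # run (needs the verb POS hint)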
Difference between Stemming and Lemmatization
• In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of
the word. There are different algorithms used to find out how many characters
have to be chopped off, but the algorithms don't actually know the meaning of the word
in the language it belongs to. In lemmatization, on the other hand, the algorithms have
this knowledge. In fact, you can even say that these algorithms refer to a dictionary to
understand the meaning of the word before reducing it to its root word, or lemma.
• So, a lemmatization algorithm would know that the word better is derived from the
word good, and hence, the lemma is good. But a stemming algorithm wouldn't be able to
do the same. There could be over-stemming or under-stemming, and the
word better could be reduced to either bet, or bett, or just retained as better. But there is
no way in stemming that it could be reduced to its root word good. This, basically, is the
difference between stemming and lemmatization.
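• A quick comparison sketch (assuming NLTK's Porter stemmer and WordNet
lemmatizer, which the slides do not prescribe): the lemmatizer maps "better"
to "good" when told it is an adjective, while the stemmer cannot:

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("better"))                   # better (no route to 'good')
    print(lemmatizer.lemmatize("better", pos="a"))  # good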
Stopwords
• Stop words are a set of commonly used words in a language. Examples of stop
words in English are "a", "the", "is" and "are". Stop word lists are commonly
used in Text Mining and Natural Language Processing (NLP) to eliminate words
that occur so frequently that they carry very little useful information.
• This can be done by maintaining a list of stop words (which can be manually or
automatically curated) and preventing all words from your stop word list from
being analyzed.
• On removing stopwords, dataset size decreases and the time to train the model
also decreases
• Removing stopwords can potentially help improve the performance as there are
fewer and only meaningful tokens left. Thus, it could increase classification
accuracy
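• A minimal sketch of stopword removal using NLTK's built-in English stopword
list (an illustrative choice; requires the 'stopwords' data to be downloaded):

    from nltk.corpus import stopwords
    from nltk import word_tokenize

    stop_words = set(stopwords.words("english"))

    text = "This is a simple example of removing the stop words from a sentence"
    tokens = word_tokenize(text.lower())

    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['simple', 'example', 'removing', 'stop', 'words', 'sentence']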
Part Of Speech Tagging
• It is the process of converting a sentence into a list of words and then
into a list of tuples, where each tuple has the form (word, tag). The tag
here is a part-of-speech tag, and it signifies whether the word is a noun,
adjective, verb, and so on.
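• A minimal sketch with NLTK's pos_tag (an illustrative choice; requires the
'averaged_perceptron_tagger' data to be downloaded):

    from nltk import pos_tag, word_tokenize

    sentence = "The quick brown fox jumps over the lazy dog"
    tagged = pos_tag(word_tokenize(sentence))

    print(tagged)  # [('The', 'DT'), ('quick', 'JJ'), ...]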
Named Entity Recognition