Text Data Preprocessing 2025

The document outlines the essential steps and techniques involved in text preprocessing, including data cleaning, tokenization, stop-word removal, stemming, lemmatization, and named entity recognition. It highlights challenges in sentence segmentation and word tokenization, as well as methods for text normalization and vectorization. Additionally, it discusses various approaches to word embeddings, such as Word2Vec and GloVe, for representing words in a numerical format.


Basics of Text Pre-processing

1
Need for Text Preprocessing
1. Data Cleaning
2. Tokenization
3. Lowercasing
4. Stop-word Removal
5. Stemming and Lemmatization
6. Text Normalization
7. Part-of-Speech Tagging
8. Named Entity Recognition (NER)
9. Text Vectorization
2
Instances of Text Data from Twitter
For every retweet this gets, Pedigree will donate one bowl of dog food to dogs in need! 😊
#tweetforbowls
We got kicked out of a @Delta airplane because I spoke Arabic to my mom on the phone
and with my friend slim... WTHHHHHH please spread
Hello hiiiii! well... It is already quite late.... :( @@yash......
Thank you for everything. My last ask is the same as my first. I'm asking you to believe—not
in my ability to create change, but in yours.@POTUSredo your work , will you ok? criss-
cross !!!
I can't can you? you should've books' Yash's pens and papers are lying on the sofa! aren't
they? USA U.S.A. :-(:)<><http://www.google.com>.....????.?!.#@@2@a1@abc
@leighabelle maybe. I doubt it. Lol. Sooooooo...

3
Data Cleaning
• Removing unnecessary whitespace, special
characters, punctuation, etc.
INPUT Text:
For every retweet this gets, Pedigree will donate one bowl of dog food to dogs in need! 😊
#tweetforbowls
We got kicked out of a @Delta airplane because I spoke Arabic to my mom on the phone
and with my friend slim... WTHHHHHH please spread
Hello hiiiii! well... It is already quite late.... :( @@yash......
Thank you for everything. My last ask is the same as my first. I'm asking you to believe—not
in my ability to create change, but in yours.@POTUSredo your work , will you ok? criss-
cross !!!
OUTPUT Text:
For every retweet this gets Pedigree will donate one bowl of dog food to dogs in need 😊
tweetforbowls We got kicked out of a Delta airplane because I spoke Arabic to my mom
on the phone and with my friend slim WTHHHHHH please spread Hello hiiiii well It is
already quite late yash Thank you for everything My last ask is the same as my first Im
asking you to believe—not in my ability to create change but in yoursPOTUSredo your
work will you ok crisscross
4
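A minimal Python sketch of this kind of regex-based cleaning (the patterns below are illustrative assumptions, not the original author's code, and will not reproduce the slide's output character-for-character, e.g. for emojis):

import re

def clean_text(text):
    text = re.sub(r"http\S+", " ", text)      # drop URLs
    text = re.sub(r"[@#]", " ", text)         # drop @ and # symbols, keep the words
    text = re.sub(r"[^\w\s]", " ", text)      # drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text("We got kicked out of a @Delta airplane... please spread"))
# -> 'We got kicked out of a Delta airplane please spread'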
Tokenization
• Splitting text into individual units, such as words or
sentences, called tokens. Tokenization is a fundamental
step in text processing.
INPUT Text:
Data is the new oil. AI is the last invention

OUTPUT Text:
['Data', 'is', 'the', 'new', 'oil', '.', 'AI', 'is', 'the', 'last', 'invention']

INPUT Text:
Data is the new oil. AI is the last invention
OUTPUT Text:
['Data is the new oil.', 'AI is the last invention']

5
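A minimal sketch of the word and sentence tokenization shown above, assuming NLTK (the slides do not name a library; requires nltk.download('punkt') on first use):

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Data is the new oil. AI is the last invention"
print(word_tokenize(text))   # word tokens, as in the first output above
print(sent_tokenize(text))   # sentence tokens, as in the second output above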
Challenges involved in Sentence
Segmentation
• Deciding how to mark the beginning and
end of a sentence.
• Decision criteria:
1. Is every period the end of a sentence?
2. Do ?, !, : indicate the end of a sentence?
3. Does a blank line after a period mark a boundary?
• ! and ? are relatively unambiguous
• The period “.” is quite ambiguous:
– Sentence boundary
– Abbreviations like Ph.D. or Dr. or Inc.
– Numbers like 0.02% or 4.3
6
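A small sketch of the period-ambiguity problem, assuming NLTK's Punkt sentence tokenizer (an assumption; behaviour on specific abbreviations can vary):

from nltk.tokenize import sent_tokenize   # requires nltk.download('punkt')

text = "Dr. Rao earned her Ph.D. in 2016. Growth was 0.02% in Q1."
print(sent_tokenize(text))
# A naive rule "split at every '.'" would break after 'Dr.', 'Ph.D.' and '0.02';
# a trained segmenter typically keeps those periods inside the sentence.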
Challenges involved in Word
Tokenization
Issue: Treating white spaces alone as separators.
Challenge: handling multi-word expressions,
e.g., New Delhi, Hong Kong, Tamil Nadu
Solution:
• Detecting names, dates, times, organizations, emails, etc. as
entities.

9
Challenges involved in Word
Tokenization
Issue: Treating white spaces alone as separators.

Solution:
• Treat punctuation, in addition to white space, as a
word boundary.

Challenge:
• Punctuation often occurs word-internally,
e.g., m.p.h., Ph.D., 29/1/2016, google.com,
87.8, what’re, we’re
11
Issues in Word Tokenization
• Finland’s capital → Finland Finlands Finland’s?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case lowercase lower
case?
• San Francisco → one token or two?
• m.p.h., Ph.D. → ??

12
Lowercasing
• Converting text data into lower case.

INPUT Text:
Data is the new oil. A.I is the last invention

OUTPUT Text:
['data is the new oil. a.i is the last invention']

13
Challenge of Case folding
• Applications like IR: reduce all letters to lower
case
– Words with upper or sentence case in mid-
sentence?
• General Motors
• US versus us

14
Stopword Removal
• Common words like "the," "and," "is," which
don't carry significant meaning, are removed to
reduce the dimensionality of the data.

INPUT Text:
Data is the new oil. A.I is the last invention

OUTPUT Text:
['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']

15
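A minimal stop-word removal sketch, assuming NLTK's English stop-word list (requires nltk.download('stopwords') and nltk.download('punkt')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("Data is the new oil. A.I is the last invention")
print([t for t in tokens if t.lower() not in stop_words])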
Stemming and Lemmatization
• Stemming is the process of reducing a word to
its root or stem form. Word affixes are
removed, leaving behind only the root.

INPUT Text:
Products, Product, Production, Producing
OUTPUT Text:
['product', 'product', 'product', 'produc']

INPUT Text:
Studies Studying Study
OUTPUT Text:
['studi', 'studi', 'studi']
16
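A minimal stemming sketch; the slides do not name a stemmer, so NLTK's PorterStemmer is used here as an assumption:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Products", "Product", "Production", "Producing",
         "Studies", "Studying", "Study"]
print([stemmer.stem(w) for w in words])
# Rule-based stemming can produce non-words such as 'produc' and 'studi'.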
Stemming and Lemmatization
• Lemmatization is the process of reducing a
word to its root or lemma form by referring to
a knowledge base or dictionary.

INPUT Text:
Studies is going on. Keep Studying. Need to Study
for exams
OUTPUT Text:
['Studies', 'be', 'go', 'on', '.', 'Keep', 'Studying', '.',
'Need', 'to', 'Study', 'for', 'exams']

17
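A minimal lemmatization sketch, assuming spaCy with the en_core_web_sm model (install via: python -m spacy download en_core_web_sm); lemma output may differ slightly from the slide's:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Studies is going on. Keep Studying. Need to Study for exams")
print([token.lemma_ for token in doc])   # dictionary-based lemmas, e.g. 'be', 'go'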
Text Normalization
• Normalizing text means converting text into
standard form (canonical form).
• Normalizing text, numbers, dates, and other
entities helps ensure consistency and
comparability across documents.
For Example,
– "ok" and "k" can be transformed to "okay", its
canonical form.
– near-identical words such as "preprocessing",
"pre-processing" and "pre processing" are
mapped to just "preprocessing".
18
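A minimal dictionary-based normalization sketch; the lookup table below is an illustrative assumption, not part of the original slides:

import re

CANONICAL = {
    "k": "okay",
    "ok": "okay",
    "pre-processing": "preprocessing",
    "pre processing": "preprocessing",
}

def normalize(text):
    text = text.lower()
    for variant, canonical in CANONICAL.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    return text

print(normalize("Ok, pre-processing is done"))   # -> 'okay, preprocessing is done'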
Named Entity Recognition (NER)
• What is Named Entity (NE) ?
– Named entities are proper nouns.
• Named entity tasks often include:
– expressions for date and time,
– names of persons, organizations, locations,
sports/adventure activities, etc.,
– terms for biological species and substances

19
Named Entity Recognition (NER)
Categories and subcategories of Named Entities:
1) Entity (ENAMEX): person, organization,
location
2) Time expression (TIMEX): date, time
3) Numeric expression (NUMEX): money,
percent

20
Named Entity Recognition (NER)
• Recognition of information units like names
(person, organization and location names) and
numeric expressions (time, date, money and
percent expressions) is required for various
Information Extraction and NLP tasks.
• Identifying references to these entities in text
is called Named Entity Recognition and
Classification (NER).

21
Named Entity Recognition (NER)
• Most NER systems are structured to take an
unannotated block of text and produce an annotated one.
Example: “The delegation, which included the
commander of the U.N. troops in Bosnia, Lt. Gen.
Sir Michael Rose, reached Sarajevo on 13th October.”
Annotated block of text: “The delegation, which
included the commander of the <ORG> U.N. </ORG>
troops in <LOC> Bosnia </LOC>, <PERS> Lt. Gen. Sir
Michael Rose </PERS> reached <LOC> Sarajevo </LOC>
on <TIME> 13th October </TIME>”.
Note: Both the boundaries of an expression and its label must be marked.
22
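A minimal NER sketch, assuming spaCy's en_core_web_sm model; note that spaCy's label set (PERSON, ORG, GPE, DATE, ...) differs from the ENAMEX-style tags used above:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The delegation, which included the commander of the U.N. troops "
          "in Bosnia, Lt. Gen. Sir Michael Rose, reached Sarajevo on 13th October.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # both the span and its label are produced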
Text Vectorization
• Process of converting textual data into
numerical vectors
• Techniques include:
1. Bag of Words (BoW),
2. Term Frequency - Inverse Document
Frequency (TF-IDF), and
3. Word Vectors
4. Word Embeddings

23
Bag of Words
• The unordered set of words of a document is
called a bag of words.
• Sentence 1: ”Welcome to Great Learning, Now
start learning”
• Sentence 2: “Learning is a good practice”
BOW = [Welcome, to, Great, Learning, Now, start,
learning, is, a, good, practice]

After pre-processing:
- Stop-word removal - Lowercasing

BOW = [welcome, great, learning, start, good, practice]


24
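A minimal bag-of-words sketch with scikit-learn's CountVectorizer (an assumed library choice; sklearn's English stop-word list may keep or drop slightly different words than the slide):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Welcome to Great Learning, Now start learning",
        "Learning is a good practice"]
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # bag-of-words vocabulary
print(X.toarray())                          # term counts per document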
Term Weights: Term Frequency
• Term Frequency: count of a term in a document.
fij = frequency of term i in document j
• Normalize term frequency (tf) across the
entire corpus:
tfij = fij / max{fij}

25
Term Weight:
Inverse Document Frequency
• Terms that appear in many different
documents are less indicative of overall topic.

idfi = inverse document frequency of term i
     = log2(N / dfi)
where,
N : total number of documents
dfi : document frequency of term i
    = number of documents containing term i
26
TF-IDF Weighting
• TF-IDF weighting:
wij = tfij × idfi = tfij × log2(N / dfi)
• A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.

27
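A short sketch that applies the tf, idf and wij formulas above directly; the toy documents are illustrative:

import math
from collections import Counter

docs = [["data", "new", "oil"],
        ["ai", "new", "invention"],
        ["data", "data", "science"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequencies

for doc in docs:
    f = Counter(doc)                                      # raw term frequencies
    max_f = max(f.values())
    weights = {t: (f[t] / max_f) * math.log2(N / df[t]) for t in f}
    print(weights)
# A term frequent in its own document but rare in the collection
# (e.g. 'oil') receives the highest weight.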
Data Representation
• Each document of the dataset is converted
into an object in an abstract space, where we
can measure distance between objects.
• The most obvious abstract space is the
Euclidean space, Rt.
• A document d ∈ Rt can be thought of as a
t-dimensional vector,
d = (d1, d2, ..., dt)

28
Vector-Space Model
• t distinct terms remain after preprocessing
– Unique terms that form the VOCABULARY
• These terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document j, is given a real-
valued weight, wij.
• Documents are expressed as t-dimensional
vectors:
dj = (w1j, w2j, ..., wtj)

29
Graphic Representation
Example (three terms T1, T2, T3 as axes):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
D3 = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and D3 plotted as vectors in the three-dimensional term space spanned by T1, T2 and T3]

30
Document Collection Representation
• A collection of n documents can be
represented in the vector space model by a
term-document matrix.
• An entry in the matrix corresponds to the
“weight” of a term in the document; zero
means the term has no significance in the
document or it simply doesn’t exist in the
document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
31
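A brief sketch of building such a term-document weight matrix with scikit-learn's TfidfVectorizer (the toy documents are illustrative, and sklearn's tf-idf formula differs slightly from the one on the earlier slides):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data is the new oil",
        "ai is the last invention",
        "data science uses data"]
vectorizer = TfidfVectorizer()
W = vectorizer.fit_transform(docs)           # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(W.toarray().round(2))                  # weight of each term in each document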
Word Embedding
• Word embeddings are a way of representing
words as numerical vectors in a continuous
space, capturing semantic relationships
between words.

32
Word Vector
• It is simply a vector of weights.
• In a 1-of-N (or ‘one-hot’) encoding, every
element of an N-dimensional vector is
associated with a word in the vocabulary.
• The encoding of a given word is simply the vector
in which the corresponding element is set to one,
and all other elements are zero.
• One-hot representation:
Motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
Hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
33
Word Vectors - One-hot Encoding
• Suppose our vocabulary has only five words:
King, Queen, Man, Woman, and Child.

• We could encode the word ‘Queen’ as:

King Queen Woman Man Child


0 1 0 0 0

1-of-N Encoding

34
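A minimal one-hot encoding sketch for the five-word vocabulary above:

import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("Queen"))                          # [0 1 0 0 0]
print(int(one_hot("Queen") @ one_hot("King")))   # 0: no similarity information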
Limitations of One-hot encoding
Word vectors are not comparable
• Using such an encoding, there is no
meaningful comparison we can make between
word vectors other than equality testing.

35
Word Embedding
• Distributional representation
Any word wi in the corpus is given a distributional
representation by an embedding
wi ∈ Rd
i.e., a d-dimensional vector, which is mostly learnt!

36
Word Embedding: Illustration
• If we label the dimensions in a hypothetical
word vector (there are no such pre-assigned
labels in the algorithm, of course), it might
look a bit like this:

            King    Green   Queen   Princess
Royalty     0.99    0.02    0.99    0.98
Masculine   0.99    0.05    0.01    0.02
Feminine    0.05    0.88    0.99    0.94
Age         0.7     0.6     0.5     0.1
...

Such a vector comes to represent in some abstract way the ‘meaning’
of a word.
37
Word2Vec
• One of the most popular methods for creating
word embeddings.
• It has 2 main architectures:
– Continuous bag of words (CBOW)
– Skip gram

38
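A minimal Word2Vec sketch with gensim 4.x (an assumed library choice; the tiny corpus is illustrative and real embeddings need far more text):

from gensim.models import Word2Vec

sentences = [["data", "is", "the", "new", "oil"],
             ["ai", "is", "the", "last", "invention"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0)                    # sg=0 -> CBOW, sg=1 -> skip-gram
print(model.wv["data"][:5])               # first few dimensions of the learnt vector
print(model.wv.most_similar("data"))      # nearest words (not meaningful on a toy corpus)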
GloVe
• Global Vectors for word representation
• Another word embedding technique
• It constructs a co-occurrence matrix of words
and optimizes the embeddings to capture
global word co-occurrence statistics.

39
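One way to use pre-trained GloVe vectors, assuming gensim's downloader and an internet connection (the model name below is one of the packaged 50-dimensional Wikipedia+Gigaword GloVe releases):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # pre-trained GloVe vectors
print(glove.most_similar("king", topn=3))    # semantically similar words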
