Text Data Preprocessing 2025

The document outlines the essential steps and techniques involved in text preprocessing, including data cleaning, tokenization, stop-word removal, stemming, lemmatization, and named entity recognition. It highlights challenges in sentence segmentation and word tokenization, as well as methods for text normalization and vectorization. Additionally, it discusses various approaches to word embeddings, such as Word2Vec and GloVe, for representing words in a numerical format.


Basics of Text Pre-processing

1
Need for Text Preprocessing
1. Data Cleaning
2. Tokenization
3. Lowercasing
4. Stop-word Removal
5. Stemming and Lemmatization
6. Text Normalization
7. Part-of-Speech Tagging
8. Named Entity Recognition (NER)
9. Text Vectorization
2
Instances of Text Data from Twitter
For every retweet this gets, Pedigree will donate one bowl of dog food to dogs in need! 😊
#tweetforbowls
We got kicked out of a @Delta airplane because I spoke Arabic to my mom on the phone
and with my friend slim... WTHHHHHH please spread
Hello hiiiii! well... It is already quite late.... :( @@yash......
Thank you for everything. My last ask is the same as my first. I'm asking you to believe—not
in my ability to create change, but in yours.@POTUSredo your work , will you ok? criss-
cross !!!
I can't can you? you should've books' Yash's pens and papers are lying on the sofa! aren't
they? USA U.S.A. :-(:)<><http://www.google.com>.....????.?!.#@@2@a1@abc
@leighabelle maybe. I doubt it. Lol. Sooooooo...

3
Data Cleaning
• Removing unnecessary whitespace, special
characters, punctuation, etc.
INPUT Text:
For every retweet this gets, Pedigree will donate one bowl of dog food to dogs in need! 😊
#tweetforbowls
We got kicked out of a @Delta airplane because I spoke Arabic to my mom on the phone
and with my friend slim... WTHHHHHH please spread
Hello hiiiii! well... It is already quite late.... :( @@yash......
Thank you for everything. My last ask is the same as my first. I'm asking you to believe—not
in my ability to create change, but in yours.@POTUSredo your work , will you ok? criss-
cross !!!
OUTPUT Text:
For every retweet this gets Pedigree will donate one bowl of dog food to dogs in need 😊
tweetforbowls We got kicked out of a Delta airplane because I spoke Arabic to my mom
on the phone and with my friend slim WTHHHHHH please spread Hello hiiiii well It is
already quite late yash Thank you for everything My last ask is the same as my first Im
asking you to believe—not in my ability to create change but in yoursPOTUSredo your
work will you ok crisscross
4
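A minimal Python sketch of this kind of regex-based cleaning (the patterns below are illustrative assumptions, not the original author's code, and will not reproduce the slide's output character-for-character, e.g. for emojis):

import re

def clean_text(text):
    text = re.sub(r"http\S+", " ", text)      # drop URLs
    text = re.sub(r"[@#]", " ", text)         # drop @ and # symbols, keep the words
    text = re.sub(r"[^\w\s]", " ", text)      # drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text("We got kicked out of a @Delta airplane... please spread"))
# -> 'We got kicked out of a Delta airplane please spread'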
Tokenization
• Splitting text into individual units, such as words or
sentences, called tokens. Tokenization is a fundamental
step in text processing.
INPUT Text:
Data is the new oil. AI is the last invention

OUTPUT Text:
['Data', 'is', 'the', 'new', 'oil', '.', 'AI', 'is', 'the', 'last', 'invention']

INPUT Text:
Data is the new oil. AI is the last invention
OUTPUT Text:
['Data is the new oil.', 'AI is the last invention']

5
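A minimal sketch of the word and sentence tokenization shown above, assuming NLTK (the slides do not name a library; requires nltk.download('punkt') on first use):

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Data is the new oil. AI is the last invention"
print(word_tokenize(text))   # word tokens, as in the first output above
print(sent_tokenize(text))   # sentence tokens, as in the second output above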
Challenges involved in Sentence
Segmentation
• Deciding how to mark the beginning and
end of a sentence.
• Decision criteria:
1. Is every period the end of a sentence?
2. Do ?, !, : indicate the end of a sentence?
3. Does a blank line after a period mark a boundary?
• ! and ? are relatively unambiguous
• The period “.” is quite ambiguous:
– Sentence boundary
– Abbreviations like Ph.D. or Dr. or Inc.
– Numbers like 0.02% or 4.3
6
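A small sketch of the period-ambiguity problem, assuming NLTK's Punkt sentence tokenizer (an assumption; behaviour on specific abbreviations can vary):

from nltk.tokenize import sent_tokenize   # requires nltk.download('punkt')

text = "Dr. Rao earned her Ph.D. in 2016. Growth was 0.02% in Q1."
print(sent_tokenize(text))
# A naive rule "split at every '.'" would break after 'Dr.', 'Ph.D.' and '0.02';
# a trained segmenter typically keeps those periods inside the sentence.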
Challenges involved in Word
Tokenization
Issue: Treating white spaces alone as separators.
Challenge: handling multi-word expressions,
e.g., New Delhi, Hong Kong, Tamil Nadu
Solution:
• Detecting names, dates, times, organizations, emails, etc. as
entities.

9
Challenges involved in Word
Tokenization
Issue: Treating white spaces alone as separators.

Solution:
• Treat punctuation, in addition to white space, as a
word boundary.

Challenge:
• Punctuation often occurs word-internally,
e.g., m.p.h., Ph.D., 29/1/2016, google.com,
87.8, what’re, we’re
11
Issues in Word Tokenization
• Finland’s capital → Finland Finlands Finland’s?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case lowercase lower
case?
• San Francisco → one token or two?
• m.p.h., Ph.D. → ??

12
Lowercasing
• Converting text data into lower case.

INPUT Text:
Data is the new oil. A.I is the last invention

OUTPUT Text:
['data is the new oil. a.i is the last invention']

13
Challenge of Case folding
• Applications like IR: reduce all letters to lower
case
– Words with upper or sentence case in mid-
sentence?
• General Motors
• US versus us

14
Stopword Removal
• Common words like "the," "and," "is," which
don't carry significant meaning, are removed to
reduce the dimensionality of the data.

INPUT Text:
Data is the new oil. A.I is the last invention

OUTPUT Text:
['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']

15
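A minimal stop-word removal sketch, assuming NLTK's English stop-word list (requires nltk.download('stopwords') and nltk.download('punkt')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("Data is the new oil. A.I is the last invention")
print([t for t in tokens if t.lower() not in stop_words])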
Stemming and Lemmatization
• Stemming is the process of reducing a word to
its root or stem form. Word affixes are
removed, leaving behind only the root.

INPUT Text:
Products, Product, Production, Producing
OUTPUT Text:
['product', 'product', 'product', 'produc']

INPUT Text:
Studies Studying Study
OUTPUT Text:
['studi', 'studi', 'studi']
16
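A minimal stemming sketch; the slides do not name a stemmer, so NLTK's PorterStemmer is used here as an assumption:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Products", "Product", "Production", "Producing",
         "Studies", "Studying", "Study"]
print([stemmer.stem(w) for w in words])
# Rule-based stemming can produce non-words such as 'produc' and 'studi'.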
Stemming and Lemmatization
• Lemmatization is the process of reducing a
word to its root or lemma form by referring to
a knowledge base or dictionary.

INPUT Text:
Studies is going on. Keep Studying. Need to Study
for exams
OUTPUT Text:
['Studies', 'be', 'go', 'on', '.', 'Keep', 'Studying', '.',
'Need', 'to', 'Study', 'for', 'exams']

17
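A minimal lemmatization sketch, assuming spaCy with the en_core_web_sm model (install via: python -m spacy download en_core_web_sm); lemma output may differ slightly from the slide's:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Studies is going on. Keep Studying. Need to Study for exams")
print([token.lemma_ for token in doc])   # dictionary-based lemmas, e.g. 'be', 'go'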
Text Normalization
• Normalizing text means converting text into
standard form (canonical form).
• Normalizing text, numbers, dates, and other
entities helps ensure consistency and
comparability across documents.
For Example,
– "ok" and "k" can be transformed to "okay", its
canonical form.
– near-identical words such as "preprocessing",
"pre-processing" and "pre processing" are
mapped to just "preprocessing".
18
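A minimal dictionary-based normalization sketch; the lookup table below is an illustrative assumption, not part of the original slides:

import re

CANONICAL = {
    "k": "okay",
    "ok": "okay",
    "pre-processing": "preprocessing",
    "pre processing": "preprocessing",
}

def normalize(text):
    text = text.lower()
    for variant, canonical in CANONICAL.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    return text

print(normalize("Ok, pre-processing is done"))   # -> 'okay, preprocessing is done'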
Named Entity Recognition (NER)
• What is Named Entity (NE) ?
– Named entities are proper nouns.
• Named entity tasks often include:
– expressions for date and time,
– names of persons, organizations, locations,
sports/adventure activities, etc.,
– terms for biological species and substances

19
Named Entity Recognition (NER)
Categories and subcategories of Named Entities:
1) Entity (ENAMEX): person, organization,
location
2) Time expression (TIMEX): date, time
3) Numeric expression (NUMEX): money,
percent

20
Named Entity Recognition (NER)
• Recognition of information units like names
(person, organization and location names) and
numeric expressions (time, date, money and
percent expressions) is required for various
Information Extraction and NLP tasks.
• Identifying references to these entities in text
is called Named Entity Recognition and
Classification (NER).

21
Named Entity Recognition (NER)
• Most NER systems are structured to take an
unannotated block of text and produce an annotated one.
Example: “The delegation, which included the
commander of the U.N. troops in Bosnia, Lt. Gen.
Sir Michael Rose, reached Sarajevo on 13th October.”
Annotated block of text: “The delegation, which
included the commander of the <ORG> U.N. </ORG>
troops in <LOC> Bosnia </LOC>, <PERS> Lt. Gen. Sir
Michael Rose </PERS> reached <LOC> Sarajevo </LOC>
on <TIME> 13th October </TIME>”.
Note: Both the boundaries of an expression and its label must be marked.
22
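A minimal NER sketch, assuming spaCy's en_core_web_sm model; note that spaCy's label set (PERSON, ORG, GPE, DATE, ...) differs from the ENAMEX-style tags used above:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The delegation, which included the commander of the U.N. troops "
          "in Bosnia, Lt. Gen. Sir Michael Rose, reached Sarajevo on 13th October.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # both the span and its label are produced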
Text Vectorization
• Process of converting textual data into
numerical vectors
• Techniques include:
1. Bag of Words (BoW),
2. Term Frequency - Inverse Document
Frequency (TF-IDF), and
3. Word Vectors
4. Word Embeddings

23
Bag of Words
• The unordered set of words of a document is
called a bag of words.
• Sentence 1: ”Welcome to Great Learning, Now
start learning”
• Sentence 2: “Learning is a good practice”
BOW = [Welcome, to, Great, Learning, Now, start,
learning, is, a, good, practice]

After pre-processing:
- Stop-word removal - Lowercasing

BOW = [welcome, great, learning, start, good, practice]


24
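A minimal bag-of-words sketch with scikit-learn's CountVectorizer (an assumed library choice; sklearn's English stop-word list may keep or drop slightly different words than the slide):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Welcome to Great Learning, Now start learning",
        "Learning is a good practice"]
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # bag-of-words vocabulary
print(X.toarray())                          # term counts per document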
Term Weights: Term Frequency
• Term Frequency: count of a term in a document.
fij = frequency of term i in document j
• Normalize term frequency (tf) across the
entire corpus:
tfij = fij / max{fij}

25
Term Weight:
Inverse Document Frequency
• Terms that appear in many different
documents are less indicative of overall topic.

idfi = inverse document frequency of term i
     = log2(N / dfi)
where,
N : total number of documents
dfi : document frequency of term i
    = number of documents containing term i
26
TF-IDF Weighting
• TF-IDF weighting:
wij = tfij × idfi = tfij × log2(N / dfi)
• A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.

27
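A short sketch that applies the tf, idf and wij formulas above directly; the toy documents are illustrative:

import math
from collections import Counter

docs = [["data", "new", "oil"],
        ["ai", "new", "invention"],
        ["data", "data", "science"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequencies

for doc in docs:
    f = Counter(doc)                                      # raw term frequencies
    max_f = max(f.values())
    weights = {t: (f[t] / max_f) * math.log2(N / df[t]) for t in f}
    print(weights)
# A term frequent in its own document but rare in the collection
# (e.g. 'oil') receives the highest weight.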
Data Representation
• Each document of the dataset is converted
into an object in an abstract space, where we
can measure distance between objects.
• The most obvious abstract space is the
Euclidean space, Rt.
• A document d ∈ Rt can be thought of as a
t-dimensional vector,
d = (d1, d2, ..., dt)

28
Vector-Space Model
• t distinct terms remain after preprocessing
– Unique terms that form the VOCABULARY
• These terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document j, is given a real-
valued weight, wij.
• Documents are expressed as t-dimensional
vectors:
dj = (w1j, w2j, ..., wtj)

29
Graphic Representation
Example (three terms T1, T2, T3 as axes):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
D3 = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and D3 plotted as vectors in the three-dimensional term space spanned by T1, T2 and T3]

30
Document Collection Representation
• A collection of n documents can be
represented in the vector space model by a
term-document matrix.
• An entry in the matrix corresponds to the
“weight” of a term in the document; zero
means the term has no significance in the
document or it simply doesn’t exist in the
document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
31
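A brief sketch of building such a term-document weight matrix with scikit-learn's TfidfVectorizer (the toy documents are illustrative, and sklearn's tf-idf formula differs slightly from the one on the earlier slides):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data is the new oil",
        "ai is the last invention",
        "data science uses data"]
vectorizer = TfidfVectorizer()
W = vectorizer.fit_transform(docs)           # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(W.toarray().round(2))                  # weight of each term in each document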
Word Embedding
• Word embeddings are a way of representing
words as numerical vectors in a continuous
space, capturing semantic relationships
between words.

32
Word Vector
• It is simply a vector of weights.
• In a 1-of-N (or ‘one-hot’) encoding, every
element of an N-dimensional vector is
associated with a word in the vocabulary.
• The encoding of a given word is simply the vector
in which the corresponding element is set to one,
and all other elements are zero.
• One-hot representation:
Motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
Hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
33
Word Vectors - One-hot Encoding
• Suppose our vocabulary has only five words:
King, Queen, Man, Woman, and Child.

• We could encode the word ‘Queen’ as:

King Queen Woman Man Child


0 1 0 0 0

1-of-N Encoding

34
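A minimal one-hot encoding sketch for the five-word vocabulary above:

import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("Queen"))                          # [0 1 0 0 0]
print(int(one_hot("Queen") @ one_hot("King")))   # 0: no similarity information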
Limitations of One-hot encoding
Word vectors are not comparable
• Using such an encoding, there is no
meaningful comparison we can make between
word vectors other than equality testing.

35
Word Embedding
• Distributional representation
Any word wi in the corpus is given a distributional
representation by an embedding
wi ∈ Rd
i.e., a d-dimensional vector, which is mostly learnt!

36
Word Embedding: Illustration
• If we label the dimensions in a hypothetical
word vector (there are no such pre-assigned
labels in the algorithm, of course), it might
look a bit like this:

            King    Green   Queen   Princess
Royalty     0.99    0.02    0.99    0.98
Masculine   0.99    0.05    0.01    0.02
Feminine    0.05    0.88    0.99    0.94
Age         0.7     0.6     0.5     0.1
...

Such a vector comes to represent in some abstract way the ‘meaning’
of a word.
37
Word2Vec
• One of the most popular methods for creating
word embeddings.
• It has 2 main architectures:
– Continuous bag of words (CBOW)
– Skip gram

38
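A minimal Word2Vec sketch with gensim 4.x (an assumed library choice; the tiny corpus is illustrative and real embeddings need far more text):

from gensim.models import Word2Vec

sentences = [["data", "is", "the", "new", "oil"],
             ["ai", "is", "the", "last", "invention"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0)                    # sg=0 -> CBOW, sg=1 -> skip-gram
print(model.wv["data"][:5])               # first few dimensions of the learnt vector
print(model.wv.most_similar("data"))      # nearest words (not meaningful on a toy corpus)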
GloVe
• Global Vectors for word representation
• Another word embedding technique
• It constructs a co-occurrence matrix of words
and optimizes the embeddings to capture
global word co-occurrence statistics.

39
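One way to use pre-trained GloVe vectors, assuming gensim's downloader and an internet connection (the model name below is one of the packaged 50-dimensional Wikipedia+Gigaword GloVe releases):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # pre-trained GloVe vectors
print(glove.most_similar("king", topn=3))    # semantically similar words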
