Text Vectorization

The document provides an overview of text vectorization, explaining its importance in machine learning for transforming text into numeric representations. It covers various encoding techniques such as Frequency based encoding, One-Hot encoding, and TF-IDF, as well as advanced models like Word2Vec and GloVe. Additionally, it discusses the concept of feature extraction and the vector space model, emphasizing the role of distributed representations in capturing semantic relationships between words.


Text Vectorization
Supunmali Ahangama
MDAN 54233

Outline

• What is vectorization?
• Types
  – Frequency based encoding
  – One-Hot encoding
  – Term frequency-inverse document frequency (TF-IDF)
  – Distributed representation

Vectorization

• ML algorithms work on a numeric feature space.
• The input should be a 2D array where rows are instances and columns are features.
• To apply ML to text documents, a vector representation is needed.
• Map words or phrases from the vocabulary to a corresponding vector of real numbers.
• This process of converting words into numbers is called vectorization or feature extraction.

Feature Extraction

• The process of extracting and selecting features.
• Also known as feature engineering.
• Features are unique, measurable attributes or properties for each observation or data point in a dataset.
Vector Space Model

• Vector space model (or Term Vector Model).
• It is a mathematical and algebraic model for transforming and representing text documents as numeric vectors of specific terms that form the vector dimensions.
• Vector Space (VS)
  – The number of dimensions or columns for each document will be the total number of distinct terms or words for all documents in the vector space.
  – VS = {W1, W2, ..., Wn}
  – where there are n distinct words across all documents.

Vector Space Model

How to represent a document D in the VS?
• D = {wD1, wD2, ..., wDn}
• where wDn denotes the weight for word n in document D.
• This weight is a numeric value and can represent anything, ranging from the frequency of that word in the document, to the average frequency of occurrence, or even to the TF-IDF weight.
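A minimal plain-Python sketch of this representation, using raw term counts as the weights wDn (the toy corpus and variable names are illustrative, not from the slides):

    from collections import Counter

    corpus = ["the sky is blue", "the sun is bright"]
    # VS = {W1, W2, ..., Wn}: all distinct words across all documents
    vocabulary = sorted({word for doc in corpus for word in doc.split()})

    # D = {wD1, wD2, ..., wDn}: one weight per vocabulary word, here a raw count
    def to_vector(doc):
        counts = Counter(doc.split())
        return [counts[word] for word in vocabulary]

    for doc in corpus:
        print(to_vector(doc))
    # vocabulary: ['blue', 'bright', 'is', 'sky', 'sun', 'the']
    # [1, 0, 1, 1, 0, 1]
    # [0, 1, 1, 0, 1, 1]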

Bag of Words (BoW) Model

• A commonly used approach to representing text data as numerical feature vectors.
• The BoW model represents a text document as a "bag" (multiset) of its words.
  – A vector of word frequencies or binary values indicating whether a word occurs in the document or not.
• It disregards grammar and word order but retains information on the frequency of occurrence of each word.
[Figure: https://postimg.cc/rKTq580h]
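A small sketch of the BoW representation using scikit-learn's CountVectorizer; the toy corpus is illustrative, not from the slides:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # Each document becomes a vector of word counts over the corpus vocabulary;
    # grammar and word order are discarded.
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(bow.toarray())
    # [[1 0 0 1 1 1 2]
    #  [0 1 1 0 1 1 2]]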
Vectorize a Corpus with the BoW Approach

• Every document from the corpus is represented as a vector whose length is equal to the vocabulary of the corpus.
• The same model can also be used for occurrences of n-grams, which would make it an n-gram Bag of Words model, such that the frequency of distinct n-grams in each document is also considered.
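To preview the unigram and tri-gram outputs shown on the next slides, a sketch using CountVectorizer's ngram_range parameter; the sentence and settings are illustrative, not taken from the slide output:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["the quick brown fox jumps over the lazy dog"]

    # Unigram BoW (default ngram_range=(1, 1))
    unigrams = CountVectorizer(ngram_range=(1, 1))
    unigrams.fit(corpus)
    print(unigrams.get_feature_names_out())
    # ['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the']

    # Tri-gram BoW: features are sequences of three consecutive words
    trigrams = CountVectorizer(ngram_range=(3, 3))
    trigrams.fit(corpus)
    print(trigrams.get_feature_names_out())
    # ['brown fox jumps' 'fox jumps over' 'jumps over the' 'over the lazy'
    #  'quick brown fox' 'the lazy dog' 'the quick brown']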

N-gram
[Figure: https://devopedia.org/images/article/219/7356.1569499094.png]

Output – Unigram
[Example output slide]
Output – Tri-gram
[Example output slide]

Vector Encoding Models

• Frequency based encoding
• One-Hot encoding
• Term frequency-inverse document frequency (TF-IDF)
• Distributed representation

Libraries/Frameworks

• NLTK
• Scikit-Learn
• Gensim

Frequency Vectors (Frequency-based Encoding)

• Fill in the vector with the frequency of each word as it appears in the document.
• Each document is represented as the multiset of the tokens that compose it, and the value for each word position in the vector is its count.
• The value in the vector can be:
  – a straight count (integer) encoding, or
  – a normalized encoding where each word is weighted by the total number of words in the document.
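A brief sketch of the two variants (straight counts versus counts normalized by document length); the corpus is illustrative:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(corpus).toarray()   # straight integer counts

    # Normalized encoding: each count divided by the total number of words
    # counted in that document.
    normalized = counts / counts.sum(axis=1, keepdims=True)

    print(counts[0])                 # [1 0 0 1 1 1 2]
    print(normalized[0].round(3))    # [0.167 0.    0.    0.167 0.167 0.167 0.333]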
Token Frequency as Vector Encoding
[Example slide]

One-Hot Encoding

• It is a Boolean vector encoding method that marks a particular vector index with a value of true (1) if the token exists in the document and false (0) if it does not.
• That is, it reflects either the presence or absence of the token in the described text.
• Each vector has a dimensionality equal to the number of unique words in the corpus.

One-Hot Encoding
[Figure: https://towardsdatascience.com/word-embedding-in-nlp-one-hot-encoding-and-skip-gram-neural-network-81b424da58f2]
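A small sketch of this document-level presence/absence encoding, assuming CountVectorizer with binary=True as the tooling; the corpus is illustrative:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # binary=True records presence/absence (1/0) instead of counts, which matches
    # the one-hot style document encoding described above.
    onehot = CountVectorizer(binary=True)
    X = onehot.fit_transform(corpus)

    print(onehot.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[1 0 0 1 1 1 1]
    #  [0 1 1 0 1 1 1]]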
One-Hot Encoding

• An alternative to frequency-based encoding, as frequency-based encoding suffers from the long tail, or Zipfian, distribution.
  – It reduces the imbalance issue in the distribution of tokens.
• Tokens that occur very frequently are orders of magnitude more "significant" than other, less frequent ones.
• Frequency-based encoding will not work well with models that expect a normal distribution.

Long tail

• A large number of words have relatively low frequency, while a small number of items have very high frequency.
• This distribution is called a "long tail" because it is characterized by a long, thin tail on the distribution curve.

Zipf's law

• The frequency of a word is inversely proportional to its rank in the frequency table.
• The most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
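A tiny sketch of building the rank-frequency table with collections.Counter; the toy text is illustrative and far too small to show a real Zipfian curve, but the mechanics are the same on a full corpus:

    from collections import Counter

    text = ("the quick brown fox jumps over the lazy dog "
            "the dog barks and the fox runs")
    counts = Counter(text.split())

    # Rank words by frequency; under Zipf's law the frequency of the word at
    # rank r is roughly proportional to 1/r on a large corpus.
    for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
        print(rank, word, freq)
    # 1 the 4
    # 2 fox 2
    # 3 dog 2
    # ...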
TF-IDF

• Term Frequency-Inverse Document Frequency.
• It is a combination of two metrics: term frequency (tf) and inverse document frequency (idf).
• tfidf = tf × idf
• tf is the raw frequency value of a word (= term) in a particular document.
  – If required, one can use log values, averaged frequencies, or binary values (1 means the term has occurred in the document and 0 means it has not).

TF-IDF

• idf is the inverse of the document frequency for each term.
• idf(t) = 1 + log(C / (1 + df(t)))  (a common smoothed form)
  – where idf(t) represents the idf for the term t, C represents the count of the total number of documents in our corpus, and df(t) represents the number of documents in which the term t is present.
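A sketch of computing tfidf by hand. The exact idf formula on the original slide image is not reproduced in this text, so the smoothed form idf(t) = 1 + log(C / (1 + df(t))) is assumed here; the corpus is illustrative:

    import math
    from collections import Counter

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the log".split(),
    ]
    C = len(corpus)                                   # total number of documents
    vocabulary = sorted({w for doc in corpus for w in doc})
    # df(t): number of documents containing term t
    df = {t: sum(t in doc for doc in corpus) for t in vocabulary}

    def tfidf(doc):
        tf = Counter(doc)                             # raw term frequency
        # assumed smoothed idf: idf(t) = 1 + log(C / (1 + df(t)))
        return [tf[t] * (1 + math.log(C / (1 + df[t]))) for t in vocabulary]

    print(vocabulary)
    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
    print([round(w, 3) for w in tfidf(corpus[0])])
    # [1.0, 0.0, 0.0, 1.0, 0.595, 0.595, 1.189]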

TF-IDF - Variant

• The final TF-IDF metric uses a normalized version of the tfidf matrix (the product of tf and idf).
• The tfidf matrix is normalized by dividing it by the L2 norm of the matrix (the Euclidean norm).
  – The L2 norm is the square root of the sum of the squares of each term's tfidf weight.
  – tfidf = tfidf / ||tfidf||
  – where ||tfidf|| represents the Euclidean L2 norm for the tfidf matrix.

Advanced Word Vectorization Models

Distributed representations:
• Google's Word2Vec model
• GloVe (Global Vectors for Word Representation)
• Facebook's fastText (extends the Word2Vec model)
• ELMo (Embeddings from Language Models)
• BERT (Bidirectional Encoder Representations from Transformers)
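Returning to the TF-IDF variant above, a brief sketch of the L2-normalized version using scikit-learn's TfidfVectorizer; its internal smoothing constants differ slightly from the formula given earlier, so treat the numbers as illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # norm='l2' (the default) divides each document's tfidf vector by its
    # Euclidean (L2) norm, giving the normalized variant described above.
    vectorizer = TfidfVectorizer(norm="l2")
    tfidf = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    print(tfidf.toarray().round(3))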
Distributed Representation

• Distributed representation in word embedding refers to the idea of representing words as high-dimensional vectors, where each dimension of the vector captures a different aspect of the word's meaning.
• It is based on the co-occurrence patterns of words in large text datasets.
• By training the model to predict which words are likely to occur in a given context, the model learns to generate embeddings that capture the underlying semantic and syntactic relationships between words.

Distributed Representation

• "The meaning of a word can be inferred by the company it keeps."
• If you have two words that have very similar neighbors, then these words are probably quite similar in meaning or are at least related.
• E.g., "powerful" should be closer to "strong" than "Paris".
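A small sketch (not from the slides) that checks the "powerful"/"strong"/"Paris" intuition with pretrained GloVe vectors loaded through Gensim's downloader; the dataset name, the download step, and the lower-cased tokens are assumptions about the environment:

    import gensim.downloader as api

    # Downloads ~66 MB of pretrained 50-dimensional GloVe vectors on first use.
    wv = api.load("glove-wiki-gigaword-50")

    print(wv.similarity("powerful", "strong"))   # relatively high
    print(wv.similarity("powerful", "paris"))    # relatively low
    print(wv.most_similar("powerful", topn=3))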

Word2Vec

• Word2Vec is a neural network-based model (single hidden layer).
• It captures the meaning and semantic relationships between words.
• E.g., if we train Word2Vec on a large corpus of text that includes the words "cat", "dog", and "animal", the model may represent the word "cat" as a vector that is close to the vector for "animal", but far from the vector for "dog".
  – This reflects the fact that "cat" and "animal" are semantically similar, but "cat" and "dog" are not.

Word2Vec

• It includes the skip-gram and CBOW models.
• CBOW (Continuous Bag of Words)
  – predicts the target word based on the context words surrounding it.
• Skip-gram model
  – predicts the context words based on the target word.
Example – CBOW and Skip-gram

• The sentence is "The quick brown fox jumps over the lazy dog".
• CBOW predicts "jumps" based on the surrounding context words "The quick brown fox over the lazy dog".
• The skip-gram model predicts the context words "The", "quick", "brown", "fox", "over", and "the" based on the target word "jumps".
[Figure: https://israelg99.github.io/2017-03-23-Word2Vec-Explained/]
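A plain-Python sketch of the training pairs the two architectures are built around, assuming a context window of 2 words on each side (the slide's example effectively uses the whole sentence as context):

    sentence = "the quick brown fox jumps over the lazy dog".split()
    window = 2

    cbow_pairs = []       # (context words, target word)
    skipgram_pairs = []   # (target word, one context word)
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window): i] + sentence[i + 1: i + 1 + window]
        cbow_pairs.append((context, target))
        skipgram_pairs.extend((target, c) for c in context)

    print(cbow_pairs[4])
    # (['brown', 'fox', 'over', 'the'], 'jumps')
    print(skipgram_pairs[:4])
    # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]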

How to get the embeddings?

• Adjust the network weights to reduce a loss function.
• Use the hidden-layer weights as the word embeddings.
[Figure: https://towardsmachinelearning.org/inner-working-of-word2vec-model/]

Word2Vec

• https://radimrehurek.com/gensim/models/word2vec.html
• Word2Vec(documents, vector_size=100, window=5, min_count=5, workers=4)
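A runnable version of the call on the slide, assuming a tiny tokenized toy corpus; min_count is lowered to 1 so the toy vocabulary is not pruned away:

    from gensim.models import Word2Vec

    # documents must be tokenized: a list of lists of words.
    documents = [
        "the quick brown fox jumps over the lazy dog".split(),
        "the dog barks at the fox".split(),
    ]

    # Same call as on the slide, except min_count=1 (min_count=5 would leave
    # the vocabulary of this toy corpus empty).
    model = Word2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

    print(model.wv["fox"].shape)              # (100,) -- the learned embedding
    print(model.wv.most_similar("fox", topn=2))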
Word2Vec – Parameters in Gensim

• sentences: the input corpus as a list of sentences, where each sentence is a list of words. For example: sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]].
• vector_size: the size of the vector space in which words will be embedded, i.e., the number of dimensions. Typical values are in the range of 100-300.
• window: the maximum distance between the current and predicted word within a sentence. For example, if window=5, the algorithm will consider words within a range of 5 words on either side of the target word.
• min_count: the minimum frequency a word must have to be included in the vocabulary. Words that occur less frequently than min_count will be ignored.
• workers: the number of worker threads to use when training the model. A value equal to the number of cores on your machine is recommended.
• sg: the training algorithm to use. A value of 0 indicates CBOW (Continuous Bag of Words) and a value of 1 indicates skip-gram.
• epochs: the number of iterations (epochs) to run during training.
• sample: used to down-sample the effect of frequent words. Values between 0.01 and 0.0001 are usually ideal.
(Source: https://towardsmachinelearning.org/inner-working-of-word2vec-model/)

Window size = 2
[Example slide]
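A sketch showing several of these parameters together, with sg=1 (skip-gram) and window=2 to echo the illustration above; the corpus and parameter values are illustrative only:

    from gensim.models import Word2Vec

    sentences = [
        "the quick brown fox jumps over the lazy dog".split(),
        "the dog barks at the fox".split(),
    ]

    model = Word2Vec(
        sentences,
        vector_size=50,
        window=2,        # 2 words of context on either side of the target
        min_count=1,     # keep every word in this toy corpus
        workers=4,
        sg=1,            # 1 = skip-gram, 0 = CBOW
        epochs=50,
        sample=0.001,    # down-sample very frequent words
    )

    print(model.wv.most_similar("dog", topn=3))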

Summary

