Text Vectorization

The document provides an overview of text vectorization, explaining its importance in machine learning for transforming text into numeric representations. It covers various encoding techniques such as Frequency based encoding, One-Hot encoding, and TF-IDF, as well as advanced models like Word2Vec and GloVe. Additionally, it discusses the concept of feature extraction and the vector space model, emphasizing the role of distributed representations in capturing semantic relationships between words.


Text Vectorization
Supunmali Ahangama
MDAN 54233

Outline

• What is vectorization?
• Types
  – Frequency based encoding
  – One-Hot encoding
  – Term frequency-inverse document frequency (TF-IDF)
  – Distributed representation

Vectorization

• ML algorithms work on a numeric feature space.
• The input should be a 2D array where rows are instances and columns are features.
• To apply ML to text documents, a vector representation is needed.
• Map words or phrases from the vocabulary to a corresponding vector of real numbers.
• This process of converting words into numbers is called vectorization or feature extraction.

Feature Extraction

• The process of extracting and selecting features.
• Also known as feature engineering.
• Features are unique, measurable attributes or properties for each observation or data point in a dataset.
Vector Space Model

• Vector space model (or Term Vector Model).
• It is a mathematical and algebraic model for transforming and representing text documents as numeric vectors of specific terms that form the vector dimensions.
• Vector Space (VS)
  – The number of dimensions or columns for each document will be the total number of distinct terms or words for all documents in the vector space.
  – VS = {W1, W2, ..., Wn}
  – where there are n distinct words across all documents.

Vector Space Model

How to represent a document D in the VS?
• D = {wD1, wD2, ..., wDn}
• where wDn denotes the weight for word n in document D.
• This weight is a numeric value and can represent anything, ranging from the frequency of that word in the document, to the average frequency of occurrence, or even to the TF-IDF weight.
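A minimal plain-Python sketch of this representation, using raw term counts as the weights wDn (the toy corpus and variable names are illustrative, not from the slides):

    from collections import Counter

    corpus = ["the sky is blue", "the sun is bright"]
    # VS = {W1, W2, ..., Wn}: all distinct words across all documents
    vocabulary = sorted({word for doc in corpus for word in doc.split()})

    # D = {wD1, wD2, ..., wDn}: one weight per vocabulary word, here a raw count
    def to_vector(doc):
        counts = Counter(doc.split())
        return [counts[word] for word in vocabulary]

    for doc in corpus:
        print(to_vector(doc))
    # vocabulary: ['blue', 'bright', 'is', 'sky', 'sun', 'the']
    # [1, 0, 1, 1, 0, 1]
    # [0, 1, 1, 0, 1, 1]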

Bag of Words (BoW) Model

• A commonly used approach to representing text data as numerical feature vectors.
• The BoW model represents a text document as a "bag" (multiset) of its words.
  – A vector of word frequencies or binary values indicating whether a word occurs in the document or not.
• It disregards grammar and word order but retains information on the frequency of occurrence of each word.
[Figure: https://postimg.cc/rKTq580h]
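A small sketch of the BoW representation using scikit-learn's CountVectorizer; the toy corpus is illustrative, not from the slides:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # Each document becomes a vector of word counts over the corpus vocabulary;
    # grammar and word order are discarded.
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(bow.toarray())
    # [[1 0 0 1 1 1 2]
    #  [0 1 1 0 1 1 2]]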
Vectorize a Corpus with the BoW Approach

• Every document from the corpus is represented as a vector whose length is equal to the vocabulary of the corpus.
• The same model can also be used for occurrences of n-grams, which would make it an n-gram Bag of Words model, such that the frequency of distinct n-grams in each document is also considered.
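To preview the unigram and tri-gram outputs shown on the next slides, a sketch using CountVectorizer's ngram_range parameter; the sentence and settings are illustrative, not taken from the slide output:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["the quick brown fox jumps over the lazy dog"]

    # Unigram BoW (default ngram_range=(1, 1))
    unigrams = CountVectorizer(ngram_range=(1, 1))
    unigrams.fit(corpus)
    print(unigrams.get_feature_names_out())
    # ['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the']

    # Tri-gram BoW: features are sequences of three consecutive words
    trigrams = CountVectorizer(ngram_range=(3, 3))
    trigrams.fit(corpus)
    print(trigrams.get_feature_names_out())
    # ['brown fox jumps' 'fox jumps over' 'jumps over the' 'over the lazy'
    #  'quick brown fox' 'the lazy dog' 'the quick brown']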

N-gram
[Figure: https://devopedia.org/images/article/219/7356.1569499094.png]

Output – Unigram
[Example output slide]
Output – Tri-gram
[Example output slide]

Vector Encoding Models

• Frequency based encoding
• One-Hot encoding
• Term frequency-inverse document frequency (TF-IDF)
• Distributed representation

Libraries/Frameworks

• NLTK
• Scikit-Learn
• Gensim

Frequency Vectors (Frequency-based Encoding)

• Fill in the vector with the frequency of each word as it appears in the document.
• Each document is represented as the multiset of the tokens that compose it, and the value for each word position in the vector is its count.
• The value in the vector can be:
  – a straight count (integer) encoding, or
  – a normalized encoding where each word is weighted by the total number of words in the document.
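A brief sketch of the two variants (straight counts versus counts normalized by document length); the corpus is illustrative:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(corpus).toarray()   # straight integer counts

    # Normalized encoding: each count divided by the total number of words
    # counted in that document.
    normalized = counts / counts.sum(axis=1, keepdims=True)

    print(counts[0])                 # [1 0 0 1 1 1 2]
    print(normalized[0].round(3))    # [0.167 0.    0.    0.167 0.167 0.167 0.333]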
Token Frequency as Vector Encoding
[Example slide]

One-Hot Encoding

• It is a Boolean vector encoding method that marks a particular vector index with a value of true (1) if the token exists in the document and false (0) if it does not.
• That is, it reflects either the presence or absence of the token in the described text.
• Each vector has a dimensionality equal to the number of unique words in the corpus.

One-Hot Encoding
[Figure: https://towardsdatascience.com/word-embedding-in-nlp-one-hot-encoding-and-skip-gram-neural-network-81b424da58f2]
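A small sketch of this document-level presence/absence encoding, assuming CountVectorizer with binary=True as the tooling; the corpus is illustrative:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # binary=True records presence/absence (1/0) instead of counts, which matches
    # the one-hot style document encoding described above.
    onehot = CountVectorizer(binary=True)
    X = onehot.fit_transform(corpus)

    print(onehot.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[1 0 0 1 1 1 1]
    #  [0 1 1 0 1 1 1]]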
One-Hot Encoding

• An alternative to frequency-based encoding, as frequency-based encoding suffers from the long tail, or Zipfian, distribution.
  – It reduces the imbalance issue in the distribution of tokens.
• Tokens that occur very frequently are orders of magnitude more "significant" than other, less frequent ones.
• Frequency-based encoding will not work well with models that expect a normal distribution.

Long tail

• A large number of words have relatively low frequency, while a small number of items have very high frequency.
• This distribution is called a "long tail" because it is characterized by a long, thin tail on the distribution curve.

Zipf's law

• The frequency of a word is inversely proportional to its rank in the frequency table.
• The most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
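A tiny sketch of building the rank-frequency table with collections.Counter; the toy text is illustrative and far too small to show a real Zipfian curve, but the mechanics are the same on a full corpus:

    from collections import Counter

    text = ("the quick brown fox jumps over the lazy dog "
            "the dog barks and the fox runs")
    counts = Counter(text.split())

    # Rank words by frequency; under Zipf's law the frequency of the word at
    # rank r is roughly proportional to 1/r on a large corpus.
    for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
        print(rank, word, freq)
    # 1 the 4
    # 2 fox 2
    # 3 dog 2
    # ...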
TF-IDF

• Term Frequency-Inverse Document Frequency.
• It is a combination of two metrics: term frequency (tf) and inverse document frequency (idf).
• tfidf = tf × idf
• tf is the raw frequency value of a word (= term) in a particular document.
  – If required, one can use log values, averaged frequencies, or binary values (1 means the term has occurred in the document and 0 means it has not).

TF-IDF

• idf is the inverse of the document frequency for each term.
• idf(t) = 1 + log(C / (1 + df(t)))  (a common smoothed form)
  – where idf(t) represents the idf for the term t, C represents the count of the total number of documents in our corpus, and df(t) represents the number of documents in which the term t is present.
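A sketch of computing tfidf by hand. The exact idf formula on the original slide image is not reproduced in this text, so the smoothed form idf(t) = 1 + log(C / (1 + df(t))) is assumed here; the corpus is illustrative:

    import math
    from collections import Counter

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the log".split(),
    ]
    C = len(corpus)                                   # total number of documents
    vocabulary = sorted({w for doc in corpus for w in doc})
    # df(t): number of documents containing term t
    df = {t: sum(t in doc for doc in corpus) for t in vocabulary}

    def tfidf(doc):
        tf = Counter(doc)                             # raw term frequency
        # assumed smoothed idf: idf(t) = 1 + log(C / (1 + df(t)))
        return [tf[t] * (1 + math.log(C / (1 + df[t]))) for t in vocabulary]

    print(vocabulary)
    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
    print([round(w, 3) for w in tfidf(corpus[0])])
    # [1.0, 0.0, 0.0, 1.0, 0.595, 0.595, 1.189]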

TF-IDF - Variant

• The final TF-IDF metric uses a normalized version of the tfidf matrix (the product of tf and idf).
• The tfidf matrix is normalized by dividing it by the L2 norm of the matrix (the Euclidean norm).
  – The L2 norm is the square root of the sum of the squares of each term's tfidf weight.
  – tfidf = tfidf / ||tfidf||
  – where ||tfidf|| represents the Euclidean L2 norm for the tfidf matrix.

Advanced Word Vectorization Models

Distributed representations:
• Google's Word2Vec model
• GloVe (Global Vectors for Word Representation)
• Facebook's fastText (extends the Word2Vec model)
• ELMo (Embeddings from Language Models)
• BERT (Bidirectional Encoder Representations from Transformers)
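Returning to the TF-IDF variant above, a brief sketch of the L2-normalized version using scikit-learn's TfidfVectorizer; its internal smoothing constants differ slightly from the formula given earlier, so treat the numbers as illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # norm='l2' (the default) divides each document's tfidf vector by its
    # Euclidean (L2) norm, giving the normalized variant described above.
    vectorizer = TfidfVectorizer(norm="l2")
    tfidf = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    print(tfidf.toarray().round(3))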
Distributed Representation

• Distributed representation in word embedding refers to the idea of representing words as high-dimensional vectors, where each dimension of the vector captures a different aspect of the word's meaning.
• It is based on the co-occurrence patterns of words in large text datasets.
• By training the model to predict which words are likely to occur in a given context, the model learns to generate embeddings that capture the underlying semantic and syntactic relationships between words.

Distributed Representation

• "The meaning of a word can be inferred by the company it keeps."
• If you have two words that have very similar neighbors, then these words are probably quite similar in meaning or are at least related.
• E.g., "powerful" should be closer to "strong" than "Paris".
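A small sketch (not from the slides) that checks the "powerful"/"strong"/"Paris" intuition with pretrained GloVe vectors loaded through Gensim's downloader; the dataset name, the download step, and the lower-cased tokens are assumptions about the environment:

    import gensim.downloader as api

    # Downloads ~66 MB of pretrained 50-dimensional GloVe vectors on first use.
    wv = api.load("glove-wiki-gigaword-50")

    print(wv.similarity("powerful", "strong"))   # relatively high
    print(wv.similarity("powerful", "paris"))    # relatively low
    print(wv.most_similar("powerful", topn=3))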

Word2Vec

• Word2Vec is a neural network-based model (single hidden layer).
• It captures the meaning and semantic relationships between words.
• E.g., if we train Word2Vec on a large corpus of text that includes the words "cat", "dog", and "animal", the model may represent the word "cat" as a vector that is close to the vector for "animal", but far from the vector for "dog".
  – This reflects the fact that "cat" and "animal" are semantically similar, but "cat" and "dog" are not.

Word2Vec

• It includes the skip-gram and CBOW models.
• CBOW (Continuous Bag of Words)
  – predicts the target word based on the context words surrounding it.
• Skip-gram model
  – predicts the context words based on the target word.
Example – CBOW and Skip-gram

• The sentence is "The quick brown fox jumps over the lazy dog".
• CBOW predicts "jumps" based on the surrounding context words "The quick brown fox over the lazy dog".
• The skip-gram model predicts the context words "The", "quick", "brown", "fox", "over", and "the" based on the target word "jumps".
[Figure: https://israelg99.github.io/2017-03-23-Word2Vec-Explained/]
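A plain-Python sketch of the training pairs the two architectures are built around, assuming a context window of 2 words on each side (the slide's example effectively uses the whole sentence as context):

    sentence = "the quick brown fox jumps over the lazy dog".split()
    window = 2

    cbow_pairs = []       # (context words, target word)
    skipgram_pairs = []   # (target word, one context word)
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window): i] + sentence[i + 1: i + 1 + window]
        cbow_pairs.append((context, target))
        skipgram_pairs.extend((target, c) for c in context)

    print(cbow_pairs[4])
    # (['brown', 'fox', 'over', 'the'], 'jumps')
    print(skipgram_pairs[:4])
    # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]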

How to get the embeddings?

• Adjust the network weights to reduce a loss function.
• Use the hidden-layer weights as the word embeddings.
[Figure: https://towardsmachinelearning.org/inner-working-of-word2vec-model/]

Word2Vec

• https://radimrehurek.com/gensim/models/word2vec.html
• Word2Vec(documents, vector_size=100, window=5, min_count=5, workers=4)
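A runnable version of the call on the slide, assuming a tiny tokenized toy corpus; min_count is lowered to 1 so the toy vocabulary is not pruned away:

    from gensim.models import Word2Vec

    # documents must be tokenized: a list of lists of words.
    documents = [
        "the quick brown fox jumps over the lazy dog".split(),
        "the dog barks at the fox".split(),
    ]

    # Same call as on the slide, except min_count=1 (min_count=5 would leave
    # the vocabulary of this toy corpus empty).
    model = Word2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

    print(model.wv["fox"].shape)              # (100,) -- the learned embedding
    print(model.wv.most_similar("fox", topn=2))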
Word2Vec – Parameters in Gensim

• sentences: the input corpus as a list of sentences, where each sentence is a list of words. For example: sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]].
• vector_size: the size of the vector space in which words will be embedded, i.e., the number of dimensions. Typical values are in the range of 100-300.
• window: the maximum distance between the current and predicted word within a sentence. For example, if window=5, the algorithm will consider words within a range of 5 words on either side of the target word.
• min_count: the minimum frequency a word must have to be included in the vocabulary. Words that occur less frequently than min_count will be ignored.
• workers: the number of worker threads to use when training the model. A value equal to the number of cores on your machine is recommended.
• sg: the training algorithm to use. A value of 0 indicates CBOW (Continuous Bag of Words) and a value of 1 indicates skip-gram.
• epochs: the number of iterations (epochs) to run during training.
• sample: used to down-sample the effect of frequent words. Values between 0.01 and 0.0001 are usually ideal.
(Source: https://towardsmachinelearning.org/inner-working-of-word2vec-model/)

Window size = 2
[Example slide]
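A sketch showing several of these parameters together, with sg=1 (skip-gram) and window=2 to echo the illustration above; the corpus and parameter values are illustrative only:

    from gensim.models import Word2Vec

    sentences = [
        "the quick brown fox jumps over the lazy dog".split(),
        "the dog barks at the fox".split(),
    ]

    model = Word2Vec(
        sentences,
        vector_size=50,
        window=2,        # 2 words of context on either side of the target
        min_count=1,     # keep every word in this toy corpus
        workers=4,
        sg=1,            # 1 = skip-gram, 0 = CBOW
        epochs=50,
        sample=0.001,    # down-sample very frequent words
    )

    print(model.wv.most_similar("dog", topn=3))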

Summary

