
M.S.N. Murty, Asst. Professor, M.Tech, (Ph.D - NITR)
Text Representations
• Feature extraction is an important step in machine learning.
• It is the transformation of a given text into numerical form so that it can be fed into NLP and ML algorithms.
• In NLP, this conversion of raw text into a suitable numerical form is called text representation.
Image Representation in a computer
• Feature representation is a common step in any ML project, whether
the data is text, images, videos, or speech.
• Suppose we want to build a classifier that can distinguish images of
cats from images of dogs.
• Now, in order to train an ML model to accomplish this task, we need
to feed it (labeled) images.
• An image is stored in a computer as a matrix of pixels, where each cell[i,j] in the matrix corresponds to pixel i,j of the image.
• The real value stored at cell[i,j] represents the intensity of the corresponding pixel in the image.
Sampling a speech wave
• To represent a speech wave mathematically, we sample the wave and record its amplitude (height).
• This gives us a numerical array representing the amplitude of the sound wave at fixed time intervals, as shown below:
Common Terms Used While Representing Text in NLP
• Corpus (C): All the text data or records of the dataset together are known as a corpus.
• Vocabulary (V): All the unique words present in the corpus.
• Document (D): One single text record of the dataset is a document.
• Word (W): A word present in the vocabulary.


Approaches for Text Representation
Text representation approaches fall into four broad categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features
In order to correctly extract the meaning of a sentence, the most crucial steps are:

1. Break the sentence into lexical units such as lexemes, words, and
phrases
2. Derive the meaning for each of the lexical units
3. Understand the syntactic (grammatical) structure of the sentence
4. Understand the context in which the sentence appears
The semantics (meaning) of the sentence arises from the combination
of the above points.
Vector Space Models
• It should be clear from the introduction that, in order for ML algorithms to work
with text data, the text data must be converted into some mathematical form.
• VSM is fundamental to many information-retrieval operations, from scoring
documents on a query to document classification and document clustering.
• It’s a mathematical model that represents text units as vectors.
• In the simplest form, these are vectors of identifiers, such as index numbers in a
corpus vocabulary.
• The most common way to calculate similarity between two text blobs is cosine similarity: the cosine of the angle between their vector representations.
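As a rough illustration of the last point, the sketch below (an addition, assuming scikit-learn is available; the example documents are made up) computes cosine similarity between count vectors of short documents:

# A minimal sketch of cosine similarity between document vectors, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["dog bites man", "man bites dog", "man eats food"]

# Turn each document into a vector of word counts
vectors = CountVectorizer().fit_transform(docs)

# Cosine similarity is 1 for identical word usage and lower for fewer shared words
print(cosine_similarity(vectors[0], vectors[1]))  # [[1.0]] -- same words, order ignored
print(cosine_similarity(vectors[0], vectors[2]))  # [[0.33...]] -- only "man" is shared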
Basic Vectorization Approaches
• Map each word in the vocabulary (V) of the text corpus to a unique ID
(integer value), then represent each sentence or document in the
corpus as a V-dimensional vector.
Table 3-1. Our toy corpus
D1: Dog bites man.
D2: Man bites dog.
D3: Dog eats meat.
D4: Man eats food.
• Lowercasing text and ignoring punctuation, the vocabulary of this corpus comprises six words:
[dog, bites, man, eats, meat, food].
One Hot Encoding
• In one-hot encoding, each word w in the corpus vocabulary is given a unique
integer ID wid that is between 1 and |V|, where V is the set of the corpus
vocabulary.
• Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.
• This vector is filled with all 0s except at the position where index = wid; at this index, we simply put a 1.
• We first map each of the six words to unique IDs: dog = 1, bites = 2, man = 3,
meat = 4 , food = 5, eats = 6.
• Let’s consider the document D1: “dog bites man”. As per the scheme, each word
is a six-dimensional vector.
• Dog is represented as [1 0 0 0 0 0], as the word “dog” is mapped to ID 1. Bites is
represented as [0 1 0 0 0 0], and so on and so forth.
• Thus, D1 is represented as [ [1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]].
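A minimal sketch of this scheme in plain Python follows (an addition, not from the slides; the word-to-ID mapping mirrors the one above):

# A minimal sketch of one-hot encoding for the toy corpus, in plain Python.
# The word-to-ID mapping mirrors the one used in the slides.
vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's ID position."""
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1          # IDs start at 1, list indices at 0
    return vec

def encode(document):
    """Represent a document as a list of one-hot vectors, one per word."""
    return [one_hot(w) for w in document.lower().split()]

print(encode("dog bites man"))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]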
One Hot Encoding
Advantages
• Easy to understand and implement
Disadvantages
• We get a highly sparse representation: for each sentence, the majority of the values are zero.
• If the documents are of different sizes, we get different-sized vectors.
• This representation does not capture the semantic meaning of the
words.
Bag of Words
• This representation, unlike one hot encoding, converts a whole piece of text into
fixed-length vectors.
• This is done by counting the number of times a word has appeared in the
document.
• The frequency count of words helps us to compare and contrast documents.
Bag of Words
# Import libraries
from sklearn.feature_extraction.text import CountVectorizer

# Create sample documents
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one."]

# Create the Bag-of-Words model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Print the feature names and the document-term matrix
print("Feature Names:", vectorizer.get_feature_names_out())
print("Document-Term Matrix:\n", X.toarray())

Output:
Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Document-Term Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]]
Bag of Words
• Let us consider 3 sentences:
• The cat in the hat
• The dog in the house
• The Bird in the Sky
Bag of Words
• The key idea behind it is as follows: represent the text under consideration as a
bag (collection) of words while ignoring the order and context.
• If two text pieces have nearly the same words, then they belong to the same bag (class). Thus, by analyzing the words present in a piece of text, one can identify the class (bag) it belongs to.
• Thus, for our toy corpus (Table 3-1), where the word IDs are dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6, D1 becomes [1 1 1 0 0 0]. This is because the first three words in the vocabulary appear exactly once in D1, and the last three do not appear at all. D4 becomes [0 0 1 0 1 1]. The sketch below reproduces these vectors with scikit-learn.
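A minimal sketch with scikit-learn (an addition; note that CountVectorizer orders its columns alphabetically rather than by the word IDs used above, so the vectors are permutations of those shown):

# A minimal sketch of bag of words on the toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X[0])                                # D1 -> [1 1 0 0 1 0]
print(euclidean(X[0], X[1]))               # D1 vs D2 -> 0.0 (same words, order ignored)
print(euclidean(X[0], X[3]))               # D1 vs D4 -> 2.0 (only one shared word)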
Advantages of BoW
• Like one-hot encoding, BoW is fairly simple to understand and
implement.
• With this representation, documents having the same words will have
their vector representations closer to each other in Euclidean space as
compared to documents with completely different words.
• The Euclidean distance between D1 and D2 is 0, compared to the distance between D1 and D4, which is 2. Thus, the vector space resulting from the BoW scheme captures the semantic similarity of documents.
• So if two documents have similar vocabulary, they’ll be closer to each
other in the vector space and vice versa.
• We have a fixed-length encoding for any sentence of arbitrary length.
Disadvantages of BoW
• The size of the vector increases with the size of the vocabulary. Thus,
sparsity continues to be a problem.
• One way to control it is by limiting the vocabulary to the n most frequent words.
• It does not capture the similarity between different words that mean the
same thing. Say we have three documents: “I run”, “I ran”, and “I ate”.
• BoW vectors of all three documents will be equally apart.
• This representation does not have any way to handle out of vocabulary
words(i.e., new words that were not seen in the corpus that was used to
build the vectorizer).
• As the name indicates, it is a “bag” of words—word order information is
lost in this representation. Both D1 and D2 will have the same
representation in this scheme.
Bag of N-Grams
• An N-gram is a traditional text representation technique that involves breaking the text into contiguous sequences of n words.
• A uni-gram gives all the words in a sentence, a bi-gram gives sets of two consecutive words, a tri-gram gives sets of three consecutive words, and so on.
Example: The dog in the house
• Uni-gram: “The”, “dog”, “in”, “the”, “house”
• Bi-gram: “The dog”, “dog in”, “in the”, “the house”
• Tri-gram: “The dog in”, “dog in the”, “in the house”
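A minimal sketch of extracting these n-grams with scikit-learn's ngram_range option (an addition; the single example sentence is the one above):

# A minimal sketch of bag of n-grams with scikit-learn's ngram_range option.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The dog in the house"]

# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams together
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# ['dog' 'dog in' 'dog in the' 'house' 'in' 'in the' 'in the house'
#  'the' 'the dog' 'the dog in' 'the house']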
Bag of N-Grams: example
Using N-grams to predict the next word in a sentence
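As a rough illustration of next-word prediction with n-grams (an addition; the tiny corpus is made up, and a real model would need far more data), the sketch below counts bigrams and returns the most frequent follower of a given word:

# A minimal sketch of next-word prediction with bigram counts on a made-up corpus.
from collections import Counter, defaultdict

corpus = ["the dog barks", "the dog bites man", "the cat sleeps"]

# Count how often each word follows each preceding word
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # 'dog' -- it follows 'the' more often than 'cat' does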
Bag of N-Grams: Pros & Cons
• It captures some context and word-order information in the form of
n-grams.
• Thus, resulting vector space is able to capture some semantic
similarity.
• Documents having the same n-grams will have their vectors closer to
each other in Euclidean space as compared to documents with
completely different n-grams.
• As n increases, dimensionality (and therefore sparsity) increases rapidly.
• It still provides no way to address the OOV problem.
TF-IDF
• TF-IDF stands for term frequency–inverse document frequency.
• It aims to quantify the importance of a given word relative to other words in the document and in the corpus.
• It’s a commonly used representation scheme for information-retrieval systems,
for extracting relevant documents from a corpus for a given text query.
Working:
• If a word w appears many times in a document di but does not occur much in the rest of the documents dj in the corpus, then the word w must be of great importance to the document di.
• The importance of w should increase in proportion to its frequency in di, but at the same time, it should decrease in proportion to the word's frequency in the other documents dj in the corpus.
TF-IDF
• Mathematically, this is captured using two quantities: TF and IDF. The two are
then combined to arrive at the TF-IDF score.
• Term frequency (TF): The number of times a word appears in a document, commonly normalized by the document length: TF(t, d) = (occurrences of t in d) / (total terms in d).
• Inverse document frequency (IDF): A measure of how common or rare a word is in the entire corpus of documents, commonly computed as IDF(t, D) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The goal is to penalize words that are common across all documents.
TF-IDF
The TF-IDF score for a term in a document is obtained by multiplying its
TF and IDF scores.
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
TF-IDF: example
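A minimal sketch of TF-IDF on the toy corpus with scikit-learn (an addition; note that TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its scores differ slightly from the plain TF × IDF formula above):

# A minimal sketch of TF-IDF on the toy corpus with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# Words that occur in few documents (e.g., 'meat', 'food') receive higher weights
# than words spread across many documents (e.g., 'dog', 'man').
print(X.toarray().round(2))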
Distributed Representations
• Distributed representations are a fundamental concept in the field of machine
learning and natural language processing (NLP). They refer to a way of
representing data, typically words or phrases, as continuous vectors in a
high-dimensional space.
• In distributed representations, also known as embeddings, the idea is that the
"meaning" or "semantic content" of a data point is distributed across multiple
dimensions.
• For example, in NLP, words with similar meanings are mapped to points in the
vector space that are close to each other.
Applications of Distributed Representations
• Word Similarity: Measuring the semantic similarity between words.
• Text Classification: Categorizing documents into predefined classes.
• Machine Translation: Translating text from one language to another.
• Information Retrieval: Finding relevant documents in response to a
query.
• Sentiment Analysis: Determining the sentiment expressed in a piece
of text.
Key terms to understand word and text
• Distributional similarity:
This is the idea that the meaning of a word can be understood from the context
in which the word appears.
Example: “NLP rocks.” The literal meaning of the word “rocks” is “stones,” but from the context, it is used to refer to something good and fashionable.
• Distributional hypothesis
• This hypothesizes that words that occur in similar contexts have similar
meanings.
• Thus, if two words often occur in similar context, then their corresponding
representation vectors must also be close to each other.
• For example, the English words “dog” and “cat” occur in similar contexts.
Key terms to understand word and text
Distributional representation:
• This refers to representation schemes that are obtained based on distribution
of words from the context in which the words appear.
• These schemes are based on distributional hypotheses. The distributional
property is induced from context (textual vicinity).
• Mathematically, distributional representation schemes use high-dimensional
vectors to represent words.
• These vectors are obtained from a co-occurrence matrix that captures
co-occurrence of word and context.
• The four schemes that we’ve seen so far—one-hot, bag of words, bag of
n-grams, and TF-IDF—all fall under the umbrella of distributional
representation.
Key terms to understand word and text
Distributed representation
• It, too, is based on the distributional hypothesis.
• The vectors in distributional representation are very high dimensional and sparse.
• This makes them computationally inefficient and hampers learning.
• To alleviate this, distributed representation schemes significantly compress the
dimensionality.
• This results in vectors that are compact (i.e., low dimensional) and dense (i.e.,
hardly any zeros).
• The resulting vector space is known as distributed representation.
Embedding
• For the set of words in a corpus, an embedding is a mapping between the vector space coming from the distributional representation and the vector space coming from the distributed representation.
Key terms to understand word and text
Vector semantics
• This refers to the set of NLP methods that aim to learn word representations based on the distributional properties of words in a large corpus.
Word Embeddings
• Word Embedding or Word Vector is a numeric vector input that represents a
word in a lower-dimensional space.
• Word Embeddings are numeric representations of words in a
lower-dimensional space, capturing semantic and syntactic information.
Need for Word Embedding?
• To reduce dimensionality
• To use a word to predict the words around it.
• Inter-word semantics must be captured.
How are Word Embeddings used?
Take the words → give their numeric representation → use in training or inference.
Word vector representation from the emojis
Pre-trained word embeddings
• Pre-trained word embeddings are trained on large datasets and
capture the syntactic as well as semantic meaning of the words.
• This technique is known as transfer learning in which you take a
model which is trained on large datasets and use that model on your
own similar tasks.
• There are two broad classifications of pre-trained word embeddings: word-level and character-level.
• E.g., Word2Vec by Google and GloVe by Stanford.
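A minimal sketch of loading pre-trained GloVe vectors through gensim's downloader (an addition, assuming gensim and an internet connection are available; "glove-wiki-gigaword-50" is one of the models distributed via gensim-data and is downloaded on first use):

# A minimal sketch of using pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

print(glove["dog"].shape)                  # (50,) -- one dense vector per word
print(glove.most_similar("dog", topn=3))   # nearest neighbours in the embedding space
print(glove.similarity("dog", "cat"))      # cosine similarity between the two vectors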
Neural Approach: Word2Vec
• Word2Vec is a neural approach for generating word embeddings; it represents words as points in a continuous vector space.
• Given a text corpus, the aim is to learn embeddings for every word in the corpus such that the word vector in the embedding space best captures the meaning of the word.
• Developed by a team at Google, Word2Vec aims to capture the semantic relationships between words by mapping them to high-dimensional vectors.
• The underlying idea is that words with similar meanings should have similar vector representations.
• In Word2Vec, every word is assigned a vector.
Training Our Own Embeddings
The two variants, illustrated in the sketch below, are:
• Continuous bag of words (CBOW)
• SkipGram
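The sketch below shows how the two variants are typically selected when training with the gensim library (an addition, assuming gensim 4.x; the toy corpus is far too small for useful embeddings and only illustrates the API). The following slides explain what each variant does.

# A minimal sketch of training both Word2Vec variants with gensim (assuming gensim 4.x).
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"],
             ["man", "bites", "dog"],
             ["dog", "eats", "meat"],
             ["man", "eats", "food"]]

# sg=0 -> CBOW: predict the center word from its context
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the context words from the center word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["dog"].shape)              # (50,) -- the learned embedding for 'dog'
print(skipgram.wv.most_similar("dog"))   # nearest words in the tiny embedding space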
Continuous Bag Of Words(CBOW)
• In CBOW, the primary task is to build a model that correctly predicts the center
word given the context words in which the center word appears.
Language Model:
• A language model is a statistical model that gives a probability distribution over a sequence of words.
• Given a sentence of m words, it assigns a probability Pr(w1, w2, w3, ..., wm) to the whole sentence.
• The objective of a language model is to assign probabilities in such a way that it gives high probability to "good" sentences and low probability to "bad" sentences.
• By good, we mean sentences that are semantically and syntactically correct. By bad, we mean sentences that are incorrect, semantically or syntactically or both.
Continuous Bag Of Words(CBOW)
• CBOW tries to learn a language model that predicts the “center” word from the words in its context.
• If we take the word “jumps” as the center word, then its context is formed by the words in its vicinity.
• Here, the context window size is 2.
• CBOW does this for every word in the corpus; i.e., it takes every word in the corpus as the target word and tries to predict the target word from its corresponding context words.
Fig: CBOW: given the context words, predict the center word
Continuous Bag Of Words(CBOW)
• CBOW is a feedforward neural network with a single hidden layer.
• The input layer represents the context words, and the output layer
represents the target word.
• The hidden layer contains the learned continuous vector representations
(word embeddings) of the input words.
• The architecture is useful for learning distributed representations of words
in a continuous vector space.
Continuous Bag Of Words(CBOW)
• The hidden layer contains the continuous vector representations
(word embeddings) of the input words.
• The weights between the input layer and the hidden layer are learned
during training.
• The dimensionality of the hidden layer represents the size of the word
embeddings (the continuous vector space).
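A minimal NumPy sketch of the CBOW forward pass described above (an addition; the vocabulary size, embedding dimensionality, and context IDs are chosen arbitrarily for illustration):

# A minimal NumPy sketch of a CBOW forward pass:
# average the context-word embeddings, then score every vocabulary word.
import numpy as np

V, d = 6, 4                      # vocabulary size and embedding dimensionality (illustrative)
W_in = np.random.rand(V, d)      # input->hidden weights: one d-dim embedding per word
W_out = np.random.rand(d, V)     # hidden->output weights

context_ids = [0, 2]             # IDs of the context words around the center word

h = W_in[context_ids].mean(axis=0)              # hidden layer: average of context embeddings
scores = h @ W_out                              # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

print(probs.round(3))            # predicted distribution over the center word
print(probs.argmax())            # ID of the most probable center word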
SkipGram
• The Skip-Gram model learns distributed representations of words in a
continuous vector space.
• The main objective of Skip-Gram is to predict context words (words
surrounding a target word) given a target word.
• This is the opposite of the Continuous Bag of Words (CBOW) model,
where the objective is to predict the target word based on its
context.
• In practice, skip-gram is often found to produce more meaningful embeddings, especially for rare words.