
M.S.N. Murty, Asst. Professor, M.Tech, (Ph.D - NITR)
Text Representations
• Feature extraction is an important step in machine learning.
• It is the transformation of a given text into numerical form so that it can be fed into NLP and ML algorithms.
• In NLP, this conversion of raw text into a suitable numerical form is called text representation.
Image Representation in a computer
• Feature representation is a common step in any ML project, whether
the data is text, images, videos, or speech.
• Suppose we want to build a classifier that can distinguish images of
cats from images of dogs.
• Now, in order to train an ML model to accomplish this task, we need
to feed it (labeled) images.
• An image is stored in a computer as a matrix of pixels, where each cell[i,j] in the matrix corresponds to pixel i,j of the image.
• The real value stored at cell[i,j] represents the intensity of the corresponding pixel in the image.
Sampling a speech wave
• To represent a speech wave mathematically, we sample the wave and record its amplitude (height).
• This gives us a numerical array representing the amplitude of the sound wave at fixed time intervals, as shown below:
Common Terms Used While Representing Text in NLP
• Corpus (C): All the text data or records of the dataset together are known as a corpus.
• Vocabulary (V): All the unique words present in the corpus.
• Document (D): One single text record of the dataset is a document.
• Word (W): A word present in the vocabulary.


Approaches for Text Representation
Text representation approaches fall into four broad categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features
In order to correctly extract the meaning of a sentence, the most crucial steps are:

1. Break the sentence into lexical units such as lexemes, words, and
phrases
2. Derive the meaning for each of the lexical units
3. Understand the syntactic (grammatical) structure of the sentence
4. Understand the context in which the sentence appears
The semantics (meaning) of the sentence arises from the combination
of the above points.
Vector Space Models
• It should be clear from the introduction that, in order for ML algorithms to work
with text data, the text data must be converted into some mathematical form.
• VSM is fundamental to many information-retrieval operations, from scoring
documents on a query to document classification and document clustering.
• It’s a mathematical model that represents text units as vectors.
• In the simplest form, these are vectors of identifiers, such as index numbers in a
corpus vocabulary.
• The most common way to calculate similarity between two text blobs is cosine similarity: the cosine of the angle between their vector representations.
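As a rough illustration of the last point, the sketch below (an addition, assuming scikit-learn is available; the example documents are made up) computes cosine similarity between count vectors of short documents:

# A minimal sketch of cosine similarity between document vectors, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["dog bites man", "man bites dog", "man eats food"]

# Turn each document into a vector of word counts
vectors = CountVectorizer().fit_transform(docs)

# Cosine similarity is 1 for identical word usage and lower for fewer shared words
print(cosine_similarity(vectors[0], vectors[1]))  # [[1.0]] -- same words, order ignored
print(cosine_similarity(vectors[0], vectors[2]))  # [[0.33...]] -- only "man" is shared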
Basic Vectorization Approaches
• Map each word in the vocabulary (V) of the text corpus to a unique ID
(integer value), then represent each sentence or document in the
corpus as a V-dimensional vector.
Table 3-1. Our toy corpus
D1: Dog bites man.
D2: Man bites dog.
D3: Dog eats meat.
D4: Man eats food.
• Lowercasing text and ignoring punctuation, the vocabulary of this corpus comprises six words:
[dog, bites, man, eats, meat, food].
One Hot Encoding
• In one-hot encoding, each word w in the corpus vocabulary is given a unique
integer ID wid that is between 1 and |V|, where V is the set of the corpus
vocabulary.
• Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.
• This vector is filled with all 0s except at the position where index = wid; at this index, we simply put a 1.
• We first map each of the six words to unique IDs: dog = 1, bites = 2, man = 3,
meat = 4 , food = 5, eats = 6.
• Let’s consider the document D1: “dog bites man”. As per the scheme, each word
is a six-dimensional vector.
• Dog is represented as [1 0 0 0 0 0], as the word “dog” is mapped to ID 1. Bites is
represented as [0 1 0 0 0 0], and so on and so forth.
• Thus, D1 is represented as [ [1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]].
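A minimal sketch of this scheme in plain Python follows (an addition, not from the slides; the word-to-ID mapping mirrors the one above):

# A minimal sketch of one-hot encoding for the toy corpus, in plain Python.
# The word-to-ID mapping mirrors the one used in the slides.
vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's ID position."""
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1          # IDs start at 1, list indices at 0
    return vec

def encode(document):
    """Represent a document as a list of one-hot vectors, one per word."""
    return [one_hot(w) for w in document.lower().split()]

print(encode("dog bites man"))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]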
One Hot Encoding
Advantages
• Easy to understand and implement
Disadvantages
• We get a highly sparse representation: for each sentence, the majority of the values are zero.
• If the documents are of different sizes, we get different-sized vectors.
• This representation does not capture the semantic meaning of the
words.
Bag of Words
• This representation, unlike one hot encoding, converts a whole piece of text into
fixed-length vectors.
• This is done by counting the number of times a word has appeared in the
document.
• The frequency count of words helps us to compare and contrast documents.
Bag of Words
# Import libraries
from sklearn.feature_extraction.text import CountVectorizer

# Create sample documents
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one."]

# Create the Bag-of-Words model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Print the feature names and the document-term matrix
print("Feature Names:", vectorizer.get_feature_names_out())
print("Document-Term Matrix:\n", X.toarray())

Output:
Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Document-Term Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]]
Bag of Words
• Let us consider 3 sentences:
• The cat in the hat
• The dog in the house
• The Bird in the Sky
Bag of Words
• The key idea behind it is as follows: represent the text under consideration as a
bag (collection) of words while ignoring the order and context.
• If two text pieces have nearly the same words, then they belong to the same bag (class). Thus, by analyzing the words present in a piece of text, one can identify the class (bag) it belongs to.
• Thus, for our toy corpus (Table 3-1), where the word IDs are dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6, D1 becomes [1 1 1 0 0 0]. This is because the first three words in the vocabulary appear exactly once in D1, and the last three do not appear at all. D4 becomes [0 0 1 0 1 1]. The sketch below reproduces these vectors with scikit-learn.
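A minimal sketch with scikit-learn (an addition; note that CountVectorizer orders its columns alphabetically rather than by the word IDs used above, so the vectors are permutations of those shown):

# A minimal sketch of bag of words on the toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X[0])                                # D1 -> [1 1 0 0 1 0]
print(euclidean(X[0], X[1]))               # D1 vs D2 -> 0.0 (same words, order ignored)
print(euclidean(X[0], X[3]))               # D1 vs D4 -> 2.0 (only one shared word)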
Advantages of BoW
• Like one-hot encoding, BoW is fairly simple to understand and
implement.
• With this representation, documents having the same words will have
their vector representations closer to each other in Euclidean space as
compared to documents with completely different words.
• The Euclidean distance between D1 and D2 is 0, compared to the distance between D1 and D4, which is 2. Thus, the vector space resulting from the BoW scheme captures the semantic similarity of documents.
• So if two documents have similar vocabulary, they’ll be closer to each
other in the vector space and vice versa.
• We have a fixed-length encoding for any sentence of arbitrary length.
Disadvantages of BoW
• The size of the vector increases with the size of the vocabulary. Thus,
sparsity continues to be a problem.
• One way to control it is by limiting the vocabulary to the n most frequent words.
• It does not capture the similarity between different words that mean the
same thing. Say we have three documents: “I run”, “I ran”, and “I ate”.
• BoW vectors of all three documents will be equally apart.
• This representation does not have any way to handle out of vocabulary
words(i.e., new words that were not seen in the corpus that was used to
build the vectorizer).
• As the name indicates, it is a “bag” of words—word order information is
lost in this representation. Both D1 and D2 will have the same
representation in this scheme.
Bag of N-Grams
• An N-gram is a traditional text representation technique that involves breaking the text into contiguous sequences of n words.
• A uni-gram gives all the words in a sentence, a bi-gram gives sets of two consecutive words, a tri-gram gives sets of three consecutive words, and so on.
Example: The dog in the house
• Uni-gram: “The”, “dog”, “in”, “the”, “house”
• Bi-gram: “The dog”, “dog in”, “in the”, “the house”
• Tri-gram: “The dog in”, “dog in the”, “in the house”
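A minimal sketch of extracting these n-grams with scikit-learn's ngram_range option (an addition; the single example sentence is the one above):

# A minimal sketch of bag of n-grams with scikit-learn's ngram_range option.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The dog in the house"]

# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams together
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# ['dog' 'dog in' 'dog in the' 'house' 'in' 'in the' 'in the house'
#  'the' 'the dog' 'the dog in' 'the house']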
Bag of N-Grams: example
Using N-grams to predict the next word in a sentence
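As a rough illustration of next-word prediction with n-grams (an addition; the tiny corpus is made up, and a real model would need far more data), the sketch below counts bigrams and returns the most frequent follower of a given word:

# A minimal sketch of next-word prediction with bigram counts on a made-up corpus.
from collections import Counter, defaultdict

corpus = ["the dog barks", "the dog bites man", "the cat sleeps"]

# Count how often each word follows each preceding word
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # 'dog' -- it follows 'the' more often than 'cat' does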
Bag of N-Grams: Pros & Cons
• It captures some context and word-order information in the form of
n-grams.
• Thus, resulting vector space is able to capture some semantic
similarity.
• Documents having the same n-grams will have their vectors closer to
each other in Euclidean space as compared to documents with
completely different n-grams.
• As n increases, dimensionality (and therefore sparsity) increases rapidly.
• It still provides no way to address the OOV problem.
TF-IDF
• TF-IDF stands for term frequency–inverse document frequency.
• It aims to quantify the importance of a given word relative to other words in the document and in the corpus.
• It’s a commonly used representation scheme for information-retrieval systems,
for extracting relevant documents from a corpus for a given text query.
Working:
• If a word w appears many times in a document di but does not occur much in the rest of the documents dj in the corpus, then the word w must be of great importance to the document di.
• The importance of w should increase in proportion to its frequency in di, but at the same time, it should decrease in proportion to the word's frequency in the other documents dj in the corpus.
TF-IDF
• Mathematically, this is captured using two quantities: TF and IDF. The two are
then combined to arrive at the TF-IDF score.
• Term frequency (TF): The number of times a word appears in a document, commonly normalized by the document length: TF(t, d) = (occurrences of t in d) / (total terms in d).
• Inverse document frequency (IDF): A measure of how common or rare a word is in the entire corpus of documents, commonly computed as IDF(t, D) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The goal is to penalize words that are common across all documents.
TF-IDF
The TF-IDF score for a term in a document is obtained by multiplying its
TF and IDF scores.
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
TF-IDF: example
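A minimal sketch of TF-IDF on the toy corpus with scikit-learn (an addition; note that TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its scores differ slightly from the plain TF × IDF formula above):

# A minimal sketch of TF-IDF on the toy corpus with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# Words that occur in few documents (e.g., 'meat', 'food') receive higher weights
# than words spread across many documents (e.g., 'dog', 'man').
print(X.toarray().round(2))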
Distributed Representations
• Distributed representations are a fundamental concept in the field of machine
learning and natural language processing (NLP). They refer to a way of
representing data, typically words or phrases, as continuous vectors in a
high-dimensional space.
• In distributed representations, also known as embeddings, the idea is that the
"meaning" or "semantic content" of a data point is distributed across multiple
dimensions.
• For example, in NLP, words with similar meanings are mapped to points in the
vector space that are close to each other.
Applications of Distributed Representations
• Word Similarity: Measuring the semantic similarity between words.
• Text Classification: Categorizing documents into predefined classes.
• Machine Translation: Translating text from one language to another.
• Information Retrieval: Finding relevant documents in response to a
query.
• Sentiment Analysis: Determining the sentiment expressed in a piece
of text.
Key terms to understand word and text
• Distributional similarity:
This is the idea that the meaning of a word can be understood from the context
in which the word appears.
Example: “NLP rocks.” The literal meaning of the word “rocks” is “stones,” but from the context, it is used to refer to something good and fashionable.
• Distributional hypothesis
• This hypothesizes that words that occur in similar contexts have similar
meanings.
• Thus, if two words often occur in similar context, then their corresponding
representation vectors must also be close to each other.
• For example, the English words “dog” and “cat” occur in similar contexts.
Key terms to understand word and text
Distributional representation:
• This refers to representation schemes that are obtained based on distribution
of words from the context in which the words appear.
• These schemes are based on distributional hypotheses. The distributional
property is induced from context (textual vicinity).
• Mathematically, distributional representation schemes use high-dimensional
vectors to represent words.
• These vectors are obtained from a co-occurrence matrix that captures
co-occurrence of word and context.
• The four schemes that we’ve seen so far—one-hot, bag of words, bag of
n-grams, and TF-IDF—all fall under the umbrella of distributional
representation.
Key terms to understand word and text
Distributed representation
• It, too, is based on the distributional hypothesis.
• The vectors in distributional representation are very high dimensional and sparse.
• This makes them computationally inefficient and hampers learning.
• To alleviate this, distributed representation schemes significantly compress the
dimensionality.
• This results in vectors that are compact (i.e., low dimensional) and dense (i.e.,
hardly any zeros).
• The resulting vector space is known as distributed representation.
Embedding
• For the set of words in a corpus, an embedding is a mapping between the vector space coming from the distributional representation and the vector space coming from the distributed representation.
Key terms to understand word and text
Vector semantics
• This refers to the set of NLP methods that aim to learn word representations based on the distributional properties of words in a large corpus.
Word Embeddings
• Word Embedding or Word Vector is a numeric vector input that represents a
word in a lower-dimensional space.
• Word Embeddings are numeric representations of words in a
lower-dimensional space, capturing semantic and syntactic information.
Need for Word Embedding?
• To reduce dimensionality
• To use a word to predict the words around it.
• Inter-word semantics must be captured.
How are Word Embeddings used?
Take the words → give their numeric representation → use in training or inference.
Word vector representation from the emojis
Pre-trained word embeddings
• Pre-trained word embeddings are trained on large datasets and
capture the syntactic as well as semantic meaning of the words.
• This technique is known as transfer learning in which you take a
model which is trained on large datasets and use that model on your
own similar tasks.
• There are two broad classifications of pre-trained word embeddings: word-level and character-level.
• E.g., Word2Vec by Google and GloVe by Stanford.
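A minimal sketch of loading pre-trained GloVe vectors through gensim's downloader (an addition, assuming gensim and an internet connection are available; "glove-wiki-gigaword-50" is one of the models distributed via gensim-data and is downloaded on first use):

# A minimal sketch of using pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

print(glove["dog"].shape)                  # (50,) -- one dense vector per word
print(glove.most_similar("dog", topn=3))   # nearest neighbours in the embedding space
print(glove.similarity("dog", "cat"))      # cosine similarity between the two vectors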
Neural Approach: Word2Vec
• Word2Vec is a neural approach for generating word embeddings; it represents words as points in a continuous vector space.
• Given a text corpus, the aim is to learn embeddings for every word in the corpus such that the word vector in the embedding space best captures the meaning of the word.
• Developed by a team at Google, Word2Vec aims to capture the semantic relationships between words by mapping them to high-dimensional vectors.
• The underlying idea is that words with similar meanings should have similar vector representations.
• In Word2Vec, every word is assigned a vector.
Training Our Own Embeddings
The two variants, illustrated in the sketch below, are:
• Continuous bag of words (CBOW)
• SkipGram
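The sketch below shows how the two variants are typically selected when training with the gensim library (an addition, assuming gensim 4.x; the toy corpus is far too small for useful embeddings and only illustrates the API). The following slides explain what each variant does.

# A minimal sketch of training both Word2Vec variants with gensim (assuming gensim 4.x).
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"],
             ["man", "bites", "dog"],
             ["dog", "eats", "meat"],
             ["man", "eats", "food"]]

# sg=0 -> CBOW: predict the center word from its context
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the context words from the center word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["dog"].shape)              # (50,) -- the learned embedding for 'dog'
print(skipgram.wv.most_similar("dog"))   # nearest words in the tiny embedding space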
Continuous Bag Of Words(CBOW)
• In CBOW, the primary task is to build a model that correctly predicts the center
word given the context words in which the center word appears.
Language Model:
• A language model is a statistical model that gives a probability distribution over a sequence of words.
• Given a sentence of m words, it assigns a probability Pr(w1, w2, w3, ..., wm) to the whole sentence.
• The objective of a language model is to assign probabilities in such a way that it gives high probability to "good" sentences and low probability to "bad" sentences.
• By good, we mean sentences that are semantically and syntactically correct. By bad, we mean sentences that are incorrect, semantically or syntactically or both.
Continuous Bag Of Words(CBOW)
• CBOW tries to learn a language model that predicts the “center” word from the words in its context.
• If we take the word “jumps” as the center word, then its context is formed by the words in its vicinity.
• Here, the context window size is 2.
• CBOW does this for every word in the corpus; i.e., it takes every word in the corpus as the target word and tries to predict the target word from its corresponding context words.
Fig: CBOW: given the context words, predict the center word
Continuous Bag Of Words(CBOW)
• CBOW is a feedforward neural network with a single hidden layer.
• The input layer represents the context words, and the output layer
represents the target word.
• The hidden layer contains the learned continuous vector representations
(word embeddings) of the input words.
• The architecture is useful for learning distributed representations of words
in a continuous vector space.
Continuous Bag Of Words(CBOW)
• The hidden layer contains the continuous vector representations
(word embeddings) of the input words.
• The weights between the input layer and the hidden layer are learned
during training.
• The dimensionality of the hidden layer represents the size of the word
embeddings (the continuous vector space).
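A minimal NumPy sketch of the CBOW forward pass described above (an addition; the vocabulary size, embedding dimensionality, and context IDs are chosen arbitrarily for illustration):

# A minimal NumPy sketch of a CBOW forward pass:
# average the context-word embeddings, then score every vocabulary word.
import numpy as np

V, d = 6, 4                      # vocabulary size and embedding dimensionality (illustrative)
W_in = np.random.rand(V, d)      # input->hidden weights: one d-dim embedding per word
W_out = np.random.rand(d, V)     # hidden->output weights

context_ids = [0, 2]             # IDs of the context words around the center word

h = W_in[context_ids].mean(axis=0)              # hidden layer: average of context embeddings
scores = h @ W_out                              # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

print(probs.round(3))            # predicted distribution over the center word
print(probs.argmax())            # ID of the most probable center word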
SkipGram
• The Skip-Gram model learns distributed representations of words in a
continuous vector space.
• The main objective of Skip-Gram is to predict context words (words
surrounding a target word) given a target word.
• This is the opposite of the Continuous Bag of Words (CBOW) model,
where the objective is to predict the target word based on its
context.
• In practice, skip-gram is often found to produce more meaningful embeddings, especially for rare words.