NLP Unit-2 Notes
Word embeddings are one of the most important concepts in modern NLP. They are a type of
representation for words in a continuous vector space, where the positioning of words captures
semantic relationships between them.
In simpler terms, word embeddings are numerical representations of words that allow
computers to understand word meanings and the relationships between words, based on how
the words are used in text.
Word embeddings are typically learned from large text corpora using neural network models,
such as Word2Vec, GloVe, or FastText. These models map words to high-dimensional vectors
in such a way that words with similar meanings are represented by vectors that are close to
each other in the vector space. Word embeddings are useful in NLP tasks such as language
modeling, sentiment analysis, and machine translation, as they allow models to capture the
meaning of words and the relationships between them.
One way to deal with discrete words programmatically is to assign indices to individual words,
as follows (here we simply assume that these indices are assigned alphabetically):
index("cat") = 1
index("dog") = 2
index("pizza") = 3
The entire, finite set of words that one NLP application or task deals with is called the vocabulary.
Just because words are now represented by numbers doesn’t mean you can do arithmetic
operations on them and conclude that “cat” is equally similar to “dog” (difference between 1
and 2), as “dog” is to “pizza” (difference between 2 and 3). Those indices are still discrete and
arbitrary.
Conceptually, the numerical scale would look like the one shown in the figure below (a one-dimensional scale).
This is a step forward. Now we can represent the fact that “cat” and “dog” are more similar to
each other than “pizza” is to those words.
Much better! Because computers are really good at dealing with multidimensional spaces, you
can simply keep doing this until you have a sufficient number of dimensions.
In this 3-D space, you can represent those three words as follows:
The x-axis (the first element) here represents some concept of “animal-ness,” and the z-axis
(the third element) corresponds to “food-ness.”
This is essentially what word embeddings are.
Think of a multidimensional space that has as many dimensions as there are words.
Then, give each word a vector that is filled with zeros except for a single 1, as shown:
vec("cat") = [1, 0, 0]
vec("dog") = [0, 1, 0]
vec("pizza") = [0, 0, 1]
Notice that each vector has only one 1 at the position corresponding to the word’s index.
These special vectors are called one-hot vectors.
A word embedding is a real-valued vector representation of a word. If you find the concept of
vectors intimidating, think of them as one-dimensional arrays of floating-point numbers, like the
following:
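(The numeric values below are purely illustrative, not taken from any trained model; they just show the general shape of an embedding.)
vec("cat")   = [0.7, 0.5, 0.1]
vec("dog")   = [0.8, 0.3, 0.1]
vec("pizza") = [0.1, 0.2, 0.8]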
Because each array contains three elements, you can plot them as points in a 3-D space as in
figure below.
Notice that semantically related words (“cat” and “dog”) are placed close to each other.
• Word embeddings are not just important but essential for using neural networks to solve
NLP tasks.
• Neural networks are pure mathematical computation models that can deal only with
numbers. They can’t do symbolic operations, such as concatenating two strings or
conjugating a verb to past tense, unless these items are all represented by numbers and
arithmetic operations.
• On the other hand, almost everything in NLP, such as words and labels, is symbolic and
discrete.
This is why you need to bridge these two worlds, and using embeddings is a way to do it. See
figure below for an overview on how to use word embeddings for an NLP application.
Word embeddings, just like any other neural network model, can be trained, because they are
simply a collection of parameters.
Embeddings are used with your NLP model in the following three scenarios:
Scenario 1: Train word embeddings and your model at the same time using the train set for
your task.
Scenario 2: First, train word embeddings independently using a larger text dataset.
Alternatively, obtain pretrained word embeddings from somewhere else. Then initialize your
model using the pretrained word embeddings, and train them and your model at the same time
using the train set for your task.
Scenario 3: Same as scenario 2, except you fix word embeddings while you train your model.
Word embeddings are crucial in natural language processing (NLP) for several reasons:
1. Improved Performance: NLP models that use word embeddings often achieve better
performance compared to models that use traditional sparse representations of words,
such as one-hot encoding. Word embeddings capture more nuanced relationships
between words, leading to improved performance on tasks like text classification,
sentiment analysis, and machine translation.
2. Transfer Learning: Pre-trained word embeddings can be used as a starting point for
training NLP models on specific tasks. This allows models to leverage knowledge
learned from large text corpora and achieve better performance with less training data.
a. Characters
A character (also called a grapheme in linguistics) is the smallest unit of a writing system.
Characters do not necessarily carry meaning by themselves or represent any fixed sound when
spoken, although in some languages (e.g., Chinese), most do.
A typical character in many languages can be represented by a single Unicode codepoint (by
string literals such as "\uXXXX" in Python), but this is not always the case. Many languages
use a combination of more than one Unicode codepoint (e.g., accent marks) to represent a
single character. Punctuation marks, such as “.” (period), “,” (comma), and “?” (question mark),
are also characters.
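As a small illustration of the codepoint-versus-character distinction, here is a minimal Python sketch (the example character is chosen only for illustration): the single visible character "é" can be written either as one codepoint or as "e" plus a combining accent.
# One visible character, two different codepoint sequences
import unicodedata

single = "\u00e9"          # 'é' as a single codepoint
combined = "e\u0301"       # 'e' followed by a combining acute accent
print(single, combined)            # both display as 'é'
print(len(single), len(combined))  # 1 2
print(unicodedata.normalize("NFC", combined) == single)  # True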
b. Words
A word is the smallest unit in a language that can be uttered independently and that usually
carries some meaning.
In most written languages that use alphabetic scripts, words are usually separated by spaces or
punctuation marks. In some languages, like Chinese, Japanese, and Thai, however, words are
not explicitly delimited by spaces and require a preprocessing step called word segmentation
to identify words in a sentence.
c. Tokens
A token is a string of contiguous characters that plays a certain role in a written language. Most
words (“apple,” “banana,” “zebra”) are also tokens when written.
Punctuation marks such as the exclamation mark (“!”) are tokens but not words, because you
can’t utter them in isolation.
Word and token are often used interchangeably in NLP. In fact, when you see “word” in NLP
text (including these notes), it often means “token,” because most NLP tasks deal only with
written text that is processed in an automatic way. Tokens are the output of a process called
tokenization.
d. Morphemes
A morpheme is the smallest unit of language that carries meaning. A typical word consists of
one or more morphemes. For example, “apple” is a word and also a morpheme. “Apples” is a
word composed of two morphemes, “apple” and “-s,” which signifies that the noun is plural.
English contains many other morphemes, including “-ing,” “-ly,” “-ness,” and “un-.”
e. Phrases
A phrase is a group of words that functions as a single unit in a sentence. For example, “the
quick brown fox” is a noun phrase (a group of words that behaves like a noun), whereas “jumps
over the lazy dog” is a verb phrase.
The concept of phrase may be used somewhat liberally in NLP to simply mean any group of
words.
f. N-grams
An n-gram is a contiguous sequence of n units (typically words or characters) taken from text.
An n-gram of size 1 (when n = 1) is called a unigram. N-grams of size 2 and 3 are called a
bigram and a trigram, respectively.
1. Tokenization
Tokenization is a process where the input text is split into smaller units.
• Word tokenization splits a sentence into tokens (rough equivalent to words and
punctuation), which I mentioned earlier.
• Sentence tokenization, on the other hand, splits a piece of text that may include more
than one sentence into individual sentences. If you say tokenization, it usually means
word tokenization in NLP.
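A minimal sketch with NLTK (this assumes NLTK is installed and the Punkt tokenizer models have been downloaded; newer NLTK versions may name the resource slightly differently):
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)   # tokenizer models, downloaded once
text = "Natural Language Processing is amazing! It has many applications."
print(sent_tokenize(text))           # splits the text into two sentences
print(word_tokenize("Natural Language Processing is amazing!"))
# ['Natural', 'Language', 'Processing', 'is', 'amazing', '!']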
2. Stemming
Stemming is a process for identifying word stems.
• A word stem is the main part of a word after stripping off its affixes (prefixes and
suffixes).
• For example, the word stem of “apples” (plural) is “apple.” The stem of “meets” (with
a third-person singular s) is “meet.”
• In search, for example, you can improve the chance of retrieving relevant documents if
you index documents using word stems instead of words.
• The most popular algorithm used for stemming English words is called the Porter
stemming algorithm, originally written by Martin Porter. It consists of a number of rules
for rewriting affixes (e.g., if a word ends with “-ization,” change it to “-ize”).
• ["dancing","dance",danced"]-----[Danc,danc,danc]
3. Lemmatization
• A lemma is the original form of a word that you often find as a head word in a
dictionary.
• For example, the lemma of “meetings” (as a plural noun) is “meeting.” The lemma of
“met” (a verb past form) is “meet.”
• Notice that it differs from stemming, which simply strips off affixes from a word and
cannot deal with such irregular verbs and nouns.
• Example: ["bats", "are", "feet", "best"] → ["bat", "be", "foot", "good"]
Each word is encoded with a distinct vector in this manner. The size of the vectors equals the
number of words, so if there are 1,000 words in the vocabulary, each vector is 1 × 1,000 in size.
Except for the single value of one that distinguishes each word, all values in the vectors are
zeros. Each word gets its own vector, but this representation has its problems.
To begin with, if the vocabulary is extensive, the vectors become enormous, and employing a
model with this encoding would suffer from the curse of dimensionality. Adding or removing
terms from the vocabulary also changes the representation of every word.
One of the classic methods is one-hot encoding, which represents each word in a
vocabulary as a binary vector. The dimensionality of the embedding is equal to the size of
the vocabulary, and each element of the vector corresponds to a word in the vocabulary.
For example, in the sentence “Word embedding represents a word as numerical data.”, there
are 7 unique words. Thus, the dimensionality of the word embedding is 7:
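A minimal one-hot sketch for that sentence (lowercasing and stripping the final period so that the two occurrences of "word" count as one type):
sentence = "Word embedding represents a word as numerical data."
tokens = sentence.lower().replace(".", "").split()
vocab = sorted(set(tokens))                    # 7 unique words
one_hot = {w: [1 if i == vocab.index(w) else 0 for i in range(len(vocab))] for w in vocab}
print(len(vocab))                              # 7
print(one_hot["word"])                         # [0, 0, 0, 0, 0, 0, 1]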
With the development of neural networks, researchers introduced more advanced methods that
generate distributed word representations, such as Word2Vec and GloVe (in contrast to count-based
weighting schemes such as TF-IDF).
Word2Vec
• Through the produced vectors, word embeddings eventually assist in forming the
relationship of a term with another word with a similar meaning. Context is used in
these models. This means that it looks at neighboring words to learn the embedding; if
a set of words is always found close to the exact words, their embeddings will be similar.
• The Skip-gram and Continuous Bag of Words (CBOW) models are two distinct architectures
that Word2Vec uses to build word embeddings.
• Word2Vec calculates word embeddings from large textual data using these two popular
algorithms: Skip-gram and CBOW.
In the CBOW model, the distributed representations of context (or surrounding words) are
combined to predict the word in the middle. While in the Skip-gram model, the distributed
representation of the input word is used to predict the context.
CBOW
The Continuous Bag of Words (CBOW) is a Word2Vec model that predicts a target word
based on the surrounding context words. It takes a fixed-sized context window of words and
tries to predict the target word in the middle of the window. The model learns by maximizing
the probability of predicting the target word correctly given the context words.
Let's say we have the sentence "I eat pizza on Friday". First, we will tokenize the sentence:
["I", "eat", "pizza", "on", "Friday"]. Now, let's create the training examples for this sentence for
the CBOW model, considering a window size of 2, as sketched below.
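A small sketch (plain Python, no external libraries) that generates the (context, target) training examples:
tokens = ["I", "eat", "pizza", "on", "Friday"]
window = 2
for i, target in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    print(context, "->", target)
# ['eat', 'pizza'] -> I
# ['I', 'pizza', 'on'] -> eat
# ['I', 'eat', 'on', 'Friday'] -> pizza   ... and so on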
In CBOW, there are typically three main layers involved: the input layer, the hidden layer, and
the output layer.
Skip-gram is another neural network architecture used in Word2Vec that predicts the context
of a word, given a target word.
The input to the skip-gram model is a target word, while the output is a set of context words.
The goal of the skip-gram model is to learn the probability distribution of the context words,
given the target word.
During training, the skip-gram model is fed with a set of target words and their corresponding
context words.
The model learns to adjust the weights of the hidden layer to maximize the probability of
predicting the correct context words, given the target word.
Example
• The sentence is again "I eat pizza on Friday", tokenized as ["I", "eat", "pizza", "on", "Friday"].
• Now, let's create the training examples for this sentence for the skip-gram model,
considering a window size of 2, as sketched below.
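A small sketch (plain Python, no external libraries) that generates the (target, context) pairs:
tokens = ["I", "eat", "pizza", "on", "Friday"]
window = 2
pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))
print(pairs[:5])
# [('I', 'eat'), ('I', 'pizza'), ('eat', 'I'), ('eat', 'pizza'), ('eat', 'on')]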
Both CBOW and skip-gram provide different approaches to word embeddings, offering a trade-
off between training efficiency, semantic capture, and the ability to handle different dataset
characteristics. Therefore, choosing the appropriate algorithm among them requires careful
consideration of the specific requirements of your task.
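A minimal gensim sketch showing both variants (this assumes gensim 4.x is installed; the toy corpus and hyperparameters are illustrative only):
from gensim.models import Word2Vec

corpus = [["i", "eat", "pizza", "on", "friday"],
          ["the", "dog", "likes", "to", "bark"]]                        # toy corpus for illustration
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)     # sg=0 -> CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1) # sg=1 -> skip-gram
print(skipgram.wv.most_similar("pizza", topn=2))                         # nearest neighbours of "pizza"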
Skip-gram uses a prediction task where a context word (“bark”) is predicted from the target
word (“dog”).
This process is repeated as many times as there are word tokens in the dataset. It basically scans
the entire dataset and asks the question, “Can this word be predicted from this other word?” for
every single occurrence of words in the dataset.
What if there were two or more identical sentences in the dataset? Or very similar sentences?
In that case, Skip-gram would repeat the exact same set of updates multiple times. “Can ‘bark’
be predicted from ‘dog’?” you might ask, but chances are you already asked that exact same
question a couple of hundred sentences ago. If you know that the words “dog” and “bark”
appear together in the context N times in the entire dataset, why repeat this N times? It’s as if
you were adding “1” N times to something else (x + 1 + 1 + 1 + ... + 1) when you could simply
add N to it (x + N).
Could we somehow use the global information directly? The design of GloVe is motivated by
this insight. Instead of using local word co-occurrences, it uses aggregated word co-occurrence
statistics from the entire dataset. Let's say “dog” and “bark” co-occur N times in the dataset. I'm
not going into the details of the model, but roughly speaking, the GloVe model tries to predict
this number N from the embeddings of both words. It still makes predictions about word
relations, but notice that it makes one prediction per combination of word types, whereas Skip-gram
does so for every combination of word tokens!
A token is an occurrence of a word in text. There may be multiple occurrences of the same
word in a corpus. A type, on the other hand, is a distinctive, unique word.
For example, in the sentence “A rose is a rose is a rose,” there are eight tokens but only three
types (“a,” “rose,” and “is”).
• By using pretrained word embeddings, you can “stand on the shoulders of giants” and
quickly leverage high-quality linguistic knowledge distilled from large text corpora.
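A hedged sketch using gensim's downloader API (the model name below is one of gensim's bundled pretrained options; the first call downloads the vectors, which takes time and disk space):
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors
print(wv.most_similar("dog", topn=3))      # nearest neighbours of "dog"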
fastText
• Let's see how to train word embeddings on your own text data using fastText, a
popular word-embedding toolkit. This is handy when your textual data is not in a
general domain (e.g., medical, financial, or legal) and/or is not in English.
• All the word-embedding methods we’ve seen so far in this chapter assign a distinct
word vector for each word.
• For example, word vectors for “dog” and “cat” are treated distinctly and are
independently trained at the training time.
• For example, consider the pair “dog” and “doggy.” Because “-y” is an English suffix that
denotes some familiarity and affection (other examples include “grandma” and “granny,”
and “kitten” and “kitty”), these pairs of words have some semantic connection. However,
word-embedding algorithms that treat words as distinct cannot make this connection.
• In most languages, there’s a strong connection between word orthography (how you
write words) and word semantics (what they mean).
• For example, words that share the same stems (e.g., “study” and “studied,” “repeat”
and “repeatedly,” and “legal” and “illegal”) are often related. By treating them as
separate words, word-embedding algorithms are losing a lot of information.
• How can they leverage word structures and reflect the similarities in the learned word
embeddings?
fastText uses subword information, that is, information about linguistic units that are smaller
than words, to train higher-quality word embeddings.
Specifically, fastText breaks words down to character n-grams and learns embeddings for them.
• Another benefit of leveraging subword information is that it can alleviate the out-of-vocabulary
(OOV) problem.
• Many NLP applications and models assume a fixed vocabulary. For example, a typical
word-embedding algorithm such as Skip-gram learns word vectors only for the words
that were encountered in the train set. However, if a test set contains words that did not
appear in the train set (which are called OOV words), the model would be unable to
assign any vectors to them.
• For example, if you train Skip-gram word embeddings from books published in the
1980s and apply them to modern social media text, how would the model know what
vector to assign to “Instagram”? It won't.
• On the other hand, because fastText uses subword information (character n-grams),
it can assign word vectors to any OOV words, as long as they contain known
character n-grams.
• Facebook provides fastText as an open source toolkit, a library for training
word-embedding models.
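A minimal gensim fastText sketch (the toy corpus and parameters are illustrative). Because the model learns vectors for character n-grams, it can also produce a vector for a word it never saw during training:
from gensim.models import FastText

corpus = [["i", "eat", "pizza", "on", "friday"],
          ["the", "dog", "likes", "to", "bark"]]   # toy corpus for illustration
model = FastText(corpus, vector_size=50, window=2, min_count=1, min_n=3, max_n=6)
print(model.wv["pizzeria"][:5])   # OOV word: its vector is built from shared character n-grams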
• One way to obtain a single vector for an entire sentence is to simply use the average of all
word vectors in the sentence. You average vectors by taking the mean of the first elements,
the mean of the second elements, and so on, and combine these averaged numbers into a
new vector.
• You can use this new vector as an input to traditional machine learning models.
Although this method is simple and can be effective, it is also very limiting.
• The biggest issue is that it cannot take word order into consideration.
• For example, both sentences “Mary loves John.” and “John loves Mary.” would have
exactly the same vectors if you simply averaged word vectors for each word in the
sentence.
• NLP researchers have proposed models and algorithms that can specifically address this
issue.
• One of the most popular is Doc2Vec, originally proposed by Le and Mikolov in 2014
(https://cs.stanford.edu/~quocle/paragraph_vector.pdf).
• This model, as its name suggests, learns vector representations for documents. In fact,
“document” here simply means any variable-length piece of text that contains multiple
words. Similar models also appear under many names, such as Sentence2Vec, Paragraph2Vec,
and paragraph vectors (the term used by the authors of the original paper), but in essence they
all refer to variations of the same model.
• Retrieving similar words given a query word is a great way to quickly check if word
embeddings are trained correctly.
• But it gets tiring and time-consuming if you need to check a number of words to see if
the word embeddings are capturing semantic relationships between words as a whole.
• Word embeddings are simply N-dimensional vectors, which are also “points” in an N-
dimensional space. We were able to see those points visually in a 3-D space because N
was 3. But N is typically a couple of hundred in most word embeddings, and we cannot
simply plot them on an N-dimensional space.
• A solution is to reduce the dimension down to something that we can see (two or three
dimensions) while preserving relative distances between points.
• This technique is called dimensionality reduction. We have a number of ways to reduce
dimensionality, including PCA (principal component analysis) and ICA (independent
component analysis), but by far the most widely used visualization technique for word
embeddings is called t-SNE (t-distributed Stochastic Neighbor Embedding, pronounced
“tee-snee”).
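A minimal t-SNE sketch with scikit-learn and matplotlib (it assumes `wv` is a trained gensim KeyedVectors object, for example `model.wv` from one of the sketches above, and the query words are illustrative):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["cat", "dog", "pizza", "pasta", "car", "truck"]   # illustrative query words
vectors = np.array([wv[w] for w in words])                 # N-dimensional embeddings
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)
for (x, y), word in zip(points, words):                    # plot each word at its 2-D position
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()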
To understand how to process text, it is important to understand the general workflow for NLP.
The following diagram illustrates the basic steps:
The first two steps of the process in the preceding diagram involve collecting labeled data. A
supervised model or even a semi-supervised model needs data to operate. The next step is
usually normalizing and featurizing the data. Models have a hard time processing text data as is.
There is a lot of hidden structure in a given text that needs to be processed and exposed. These
two steps focus on that. The last step is building a model with the processed inputs.
a. Data collection
The first step of any Machine Learning (ML) project is to obtain a dataset. Fortunately, in the
text domain, there is plenty of data to be found. A common approach is to use libraries such as
scrapy or Beautiful Soup to scrape data from the web. However, data is usually unlabeled, and
as such can't be used in supervised models directly. This data is quite useful though. Through
the use of transfer learning, a language model can be trained using unsupervised or semi-
supervised methods and can be further used with a small training dataset specific to the task at
hand.
The goal is to gather raw text data from various sources relevant to the problem domain.
Sources of Text Data:
• Public Datasets – e.g., IMDB reviews (sentiment analysis), Wikipedia (language
modeling), news articles, etc.
• Web Scraping – Extracting text from websites, blogs, forums, or social media.
• APIs & Databases – Twitter API, Reddit API, news aggregators, and company
databases.
• Customer Feedback – Emails, reviews, support tickets, chat logs, or surveys.
• Documents & Reports – PDFs, Word files, OCR-extracted text.
Challenges in Data Collection:
• Data inconsistency (noisy, unstructured text).
• Privacy and ethical concerns (sensitive user data).
• Domain-specific jargon (technical, medical, or legal text).
b. Data labelling
In the labeling step, textual data sourced in the data collection step is labeled with the right
classes. Let's take some examples. If the task is to build a spam classifier for emails, then the
previous step would involve collecting lots of emails. This labeling step would be to attach a
spam or not spam label to each email. Another example could be sentiment detection on tweets.
The data collection step would involve gathering a number of tweets. This step would label
each tweet with a label that acts as a ground truth. A more involved example would involve
collecting news articles, where the labels would be summaries of the articles. Yet another
example of such a case would be an email auto-reply functionality. Like the spam case, a
number of emails with their replies would need to be collected. The labels in this case would
be short pieces of text that would approximate replies.
Once text data is collected, it needs to be labeled (annotated) for supervised learning tasks.
Types of Text Labeling:
• Text Classification – Assigning categories (e.g., spam vs. non-spam, sentiment labels).
• Named Entity Recognition (NER) – Identifying entities (e.g., names, locations,
dates).
• Part-of-Speech (POS) Tagging – Labeling words as nouns, verbs, adjectives, etc.
• Sentiment Analysis – Labeling text as positive, negative, or neutral.
• Question Answering – Mapping questions to correct answers in a dataset.
• Text Summarization – Generating summaries and verifying correctness.
Labeling Methods:
• Manual Labeling: Human annotators label text using annotation tools.
• Semi-Supervised Labeling: A mix of human labeling and machine-assisted
annotation.
• Crowdsourcing: Using platforms like Amazon Mechanical Turk for large-scale
labeling.
• Automated Labeling: Using weak supervision, heuristics, or pre-trained models.
Challenges in Labeling:
• Time-consuming and expensive.
• Subjectivity in annotations (e.g., sentiment can vary by person).
• Class imbalance (some labels may appear less frequently).
c. Text normalization
Text normalization is a pre-processing step aimed at improving the quality of the text and
making it suitable for machines to process. Four main steps in text normalization are case
normalization, tokenization and stop word removal, Parts-of-Speech (POS) tagging, and
stemming. Case normalization applies to languages that use uppercase and lowercase letters.
In case normalization, all letters are converted to the same case. It is quite helpful in semantic
use cases. However, in other cases, this may hinder performance.
Another common normalization step removes punctuation in the text. Again, this may or may
not be useful given the problem at hand. In most cases, this should give good results. However,
in some cases, such as spam or grammar models, it may hinder performance. It is more likely
for spam messages to use more exclamation marks or other punctuation for emphasis.
i. Tokenization
Definition: Splitting text into smaller units (tokens), usually words or subwords.
Example:
Input: "Natural Language Processing is amazing!"
Tokenized: ["Natural", "Language", "Processing", "is", "amazing", "!"]
Types: word tokenization and sentence tokenization (described earlier).
ii. Case Normalization (Lowercasing)
Definition: Converting all letters to the same case, usually lowercase.
Example:
Input: "Machine Learning"
Lowercased: "machine learning"
Why?
• It reduces the vocabulary size by treating "Machine" and "machine" as the same token.
• Some tasks (e.g., NER) may skip this step so that proper-noun distinctions are retained.
iii. Stop Word Removal
Definition: Removes common words that don’t carry much meaning (e.g., "is", "the", "and").
Example:
Input: "The cat is sitting on the mat"
After stopword removal: "cat sitting mat"
Caution:
• Not always recommended for deep learning models, where context matters.
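A minimal stop-word removal sketch with NLTK (this assumes the 'stopwords' and 'punkt' resources have been downloaded):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
stops = set(stopwords.words("english"))
tokens = word_tokenize("The cat is sitting on the mat")
print([t for t in tokens if t.lower() not in stops])   # ['cat', 'sitting', 'mat']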
iv. Stemming
Definition: Reduces a word to its stem by stripping affixes (described earlier).
Example:
Input: ["running", "flies", "better"]
Stemmed: ["run", "fli", "better"]
Algorithm: The Porter stemmer is the most widely used stemming algorithm for English (see above).
Limitations: The output is not always a valid word (e.g., "flies" → "fli"), and irregular forms such as "better" are not reduced to "good".
v. Lemmatization
Definition: Reduces words to their dictionary base form (lemma) while preserving meaning.
Example:
Input: ["running", "flies", "better"]
Lemmatized: ["run", "fly", "good"]
• More accurate (e.g., "went" → "go", while stemming wouldn't recognize this).
vi. Part-of-Speech (POS) Tagging
Definition: Labeling each word with its part of speech (noun, verb, adjective, etc.), as introduced earlier.
Example:
Input: "The quick brown fox jumps over the lazy dog"
POS Tags: [("The", DT), ("quick", JJ), ("brown", JJ), ("fox", NN), ("jumps", VBZ), ...]
Why? POS tags expose each word's syntactic role, which helps later steps such as lemmatization choose the correct base form.
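A minimal POS-tagging sketch with NLTK (this assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded; newer NLTK versions may name the tagger resource slightly differently):
import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
tags = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
print(tags)   # a list of (word, tag) pairs resembling the POS tags shown above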
2.7 Vectorization
Vectorization in NLP is the process of converting text data into numerical vectors that can be
processed by machine learning algorithms.
Vectorization is the process of converting text data into numerical vectors. In the context of
Natural Language Processing (NLP), vectorization transforms words, phrases, or entire
documents into a format that can be understood and processed by machine learning models.
These numerical representations capture the semantic meaning and contextual relationships of
the text, allowing algorithms to perform tasks such as classification, clustering, and prediction.
1. Vectorization converts text into a format that these models can process, enabling the
application of statistical and machine learning techniques to textual data.
2. Techniques like TF-IDF and word embeddings reduce the dimensionality of the data
compared to one-hot encoding. This not only makes computation more efficient but
also helps in capturing the most relevant features of the text.
3. Vectorization helps manage large vocabularies by creating fixed-size vectors for words
or documents. This is essential for handling the vast amount of text data available in
applications like search engines and social media analysis.
4. Pre-trained models like BERT and GPT use vectorization to create embeddings that
can be fine-tuned for various NLP tasks. This transfer-learning approach saves time and
resources by leveraging existing knowledge.
Here, we explore three traditional vectorization techniques: Bag of Words (BoW), Term
Frequency-Inverse Document Frequency (TF-IDF), and Count Vectorizer.
The Bag of Words model represents text by converting it into a collection of words (or tokens)
and their frequencies, disregarding grammar, word order, and context. Each document is
represented as a vector of word counts, with each element in the vector corresponding to the
frequency of a specific word in the document.
# Sample documents
documents = [
    "The cat sat on the mat.",
]
Term Frequency (TF) measures how often a term appears in a single document, usually normalized by the length of that document.
Limitations of TF Alone:
• TF does not account for the global importance of a term across the entire corpus.
• Common words like “the” or “and” may have high TF scores but are not meaningful in
distinguishing documents.
Inverse Document Frequency (IDF) measures how rare a term is across the whole corpus; it is typically computed as log(N / df), where N is the total number of documents and df is the number of documents that contain the term.
➢ The logarithm is used to dampen the effect of very large or very small values, ensuring
the IDF score scales appropriately.
➢ It also helps balance the impact of terms that appear in extremely few or extremely
many documents.
Limitations of IDF Alone:
➢ IDF does not consider how often a term appears within a specific document.
➢ A term might be rare across the corpus (high IDF) but irrelevant in a specific document
(low TF).
Imagine we have a corpus (a collection of documents) with three documents, and we want to
calculate the TF-IDF score for the term “cat” in these documents. Assume the data is only
stemmed, with no lowercasing and no stop-word removal.
# assign documents (an illustrative three-document corpus, assumed for the "cat" example above)
d1 = 'The cat sat on the mat'
d2 = 'The dog chased the cat'
d3 = 'Dogs play in the park'
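A minimal worked sketch using the classic tf × log(N/df) formulation over the illustrative corpus above (note that libraries such as scikit-learn use a smoothed variant, so their scores differ slightly):
import math

docs = [d1, d2, d3]
term = 'cat'
N = len(docs)                                       # 3 documents in the corpus
df = sum(1 for d in docs if term in d.split())      # 'cat' appears in d1 and d2, so df = 2
idf = math.log(N / df)                              # log(3/2) ≈ 0.405
for d in docs:
    tokens = d.split()
    tf = tokens.count(term) / len(tokens)           # term frequency within this document
    print(round(tf * idf, 4))                       # TF-IDF of 'cat' for each document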
Summary
• Word embeddings are numeric representations of words, and they help convert discrete
units (words and sentences) to continuous mathematical objects (vectors).
• The Skip-gram model uses a neural network with a linear layer and SoftMax to learn
word embeddings as a by-product of the “fake” word-association task.
• GloVe makes use of global statistics of word co-occurrence to train word embeddings
efficiently.
• Doc2Vec and fastText learn document-level embeddings and word embeddings with
subword information, respectively.