NLP Unit 2 Notes

KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY

(AN AUTONOMOUS INSTITUTE)

Accredited by NBA & NAAC, Approved by AICTE, Affiliated to JNTUH, Hyderabad


2.1 What are word embeddings?

Word embeddings are one of the most important concepts in modern NLP. Word embeddings
are a type of representation for words in a continuous vector space where the positioning of
words captures semantic relationships between them.

In simpler terms, word embeddings are numerical representations of words that allow computers to understand their meanings and their relationships with other words, based on how those words are used in text.

Word embeddings are typically learned from large text corpora using neural network models,
such as Word2Vec, GloVe, or FastText. These models map words to high-dimensional vectors
in such a way that words with similar meanings are represented by vectors that are close to
each other in the vector space. Word embeddings are useful in NLP tasks such as language
modeling, sentiment analysis, and machine translation, as they allow models to capture the
meaning of words and the relationships between them.

In the eyes of computers, “cat” is no closer to “dog” than it is to “pizza.”

One way to deal with discrete words programmatically is to assign indices to individual words as follows (here we simply assume that these indices are assigned alphabetically):
index("cat") = 1

index("dog") = 2

index("pizza") = 3

The entire, finite set of words that one NLP application or task deals with is called the vocabulary.
Just because words are now represented by numbers doesn’t mean you can do arithmetic
operations on them and conclude that “cat” is equally similar to “dog” (difference between 1
and 2), as “dog” is to “pizza” (difference between 2 and 3). Those indices are still discrete and
arbitrary.

“What if we can represent them on a numerical scale?”

Conceptually, the numerical scale would look like the one shown in the figure below (in 1-D).

This is a step forward. Now we can represent the fact that “cat” and “dog” are more similar to
each other than “pizza” is to those words.

But still, “pizza” is slightly closer to “dog” than it is to “cat.”


• What if we wanted to place it somewhere that is equally far from “cat” and “dog?”

• Maybe only one dimension is too limiting.

• How about adding another dimension to this, as shown in the figure represented in 2-D?

Much better! Because computers are really good at dealing with multidimensional spaces you
can simply keep doing this until you have a sufficient number of dimensions.

Let’s have three dimensions.

In this 3-D space, you can represent those three words as follows:

¡ vec("cat") = [0.7, 0.5, 0.1]

¡ vec("dog") = [0.8, 0.3, 0.1]

¡ vec("pizza") = [0.1, 0.2, 0.8]

The figure below illustrates this three-dimensional space.

The x-axis (the first element) here represents some concept of "animal-ness," and the z-axis (the third dimension) corresponds to "food-ness."
This is essentially what word embeddings are.

Think of a multidimensional space that has as many dimensions as there are words.

Then, give each word a vector that is filled with zeros except for a single 1, as shown:

¡ vec("cat") = [1, 0, 0]

¡ vec("dog") = [0, 1, 0]

¡ vec("pizza") = [0, 0, 1]

Notice that each vector has only one 1 at the position corresponding to the word’s index.
These special vectors are called one-hot vectors.

2.2 What are embeddings?

A word embedding is a real-valued vector representation of a word. If you find the concept of
vectors intimidating, think of them as single-dimensional arrays of float numbers, like the
following:

vec("cat") = [0.7, 0.5, 0.1]

vec("dog") = [0.8, 0.3, 0.1]

vec("pizza") = [0.1, 0.2, 0.8]

Because each array contains three elements, you can plot them as points in a 3-D space as in
figure below.

Notice that semantically related words ("cat" and "dog") are placed close to each other.
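
To make "closeness" concrete, here is a minimal NumPy sketch (not from the notes) that measures cosine similarity between the toy vectors above; the exact numbers are only illustrative.

import numpy as np

def cosine(u, v):
    # cosine similarity: close to 1.0 for vectors pointing in similar directions
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat = np.array([0.7, 0.5, 0.1])
dog = np.array([0.8, 0.3, 0.1])
pizza = np.array([0.1, 0.2, 0.8])

print(cosine(cat, dog))    # high, roughly 0.97: "cat" and "dog" are close
print(cosine(cat, pizza))  # much lower, roughly 0.35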

2.2b Why are embeddings important?

They are used as input to machine learning models:

Take the words → obtain their numeric representation → use it in training or inference.

They can also be used to represent or visualize underlying patterns of usage in the corpus that was used to train them. There are multiple ways to generate word representations.

• Word embeddings are not just important but essential for using neural networks to solve
NLP tasks.
• Neural networks are pure mathematical computation models that can deal only with
numbers. They can’t do symbolic operations, such as concatenating two strings or
conjugating a verb to past tense, unless these items are all represented by numbers and
arithmetic operations.

• On the other hand, almost everything in NLP, such as words and labels, is symbolic and
discrete.

This is why you need to bridge these two worlds, and using embeddings is one way to do it. See the figure below for an overview of how to use word embeddings in an NLP application.

Word embeddings, just like any other neural network model, can be trained, because they are simply a collection of parameters.

Embeddings are used with your NLP model in the following three scenarios:

Scenario 1: Train word embeddings and your model at the same time using the train set for
your task.

Scenario 2: First, train word embeddings independently using a larger text dataset.
Alternatively, obtain pretrained word embeddings from somewhere else. Then initialize your
model using the pretrained word embeddings, and train them and your model at the same time
using the train set for your task.

Scenario 3: Same as scenario 2, except you fix word embeddings while you train your model.
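
As a sketch of scenarios 2 and 3 (assuming PyTorch, which the notes do not prescribe; the pretrained matrix below is random placeholder data standing in for real pretrained vectors), the freeze flag decides whether the embeddings keep training with the rest of the model:

import torch
import torch.nn as nn

# Placeholder pretrained matrix: one row per vocabulary word (e.g., loaded from GloVe)
pretrained = torch.rand(10000, 100)

# Scenario 2: initialize from pretrained vectors and continue training them
emb_finetuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

# Scenario 3: initialize from pretrained vectors and keep them fixed
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

word_ids = torch.tensor([1, 5, 42])   # indices into the vocabulary
vectors = emb_finetuned(word_ids)     # shape: (3, 100)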

2.2c Need for Word Embedding?

Word embeddings are crucial in natural language processing (NLP) for several reasons:

1. Semantic Representation: Word embeddings provide a way to represent words in a continuous vector space, where words with similar meanings are closer together. This allows NLP models to capture semantic relationships between words and understand their meanings in context.

2. Dimensionality Reduction: Word embeddings reduce the dimensionality of the input space, making it easier for NLP models to process and learn from text data. This can lead to more efficient and effective models.
3. Generalization: Word embeddings can generalize to unseen words based on their
similarity to words in the training data. This is particularly useful in NLP tasks where
the vocabulary is large and constantly evolving.

4. Improved Performance: NLP models that use word embeddings often achieve better
performance compared to models that use traditional sparse representations of words,
such as one-hot encoding. Word embeddings capture more nuanced relationships
between words, leading to improved performance on tasks like text classification,
sentiment analysis, and machine translation.

5. Transfer Learning: Pre-trained word embeddings can be used as a starting point for
training NLP models on specific tasks. This allows models to leverage knowledge
learned from large text corpora and achieve better performance with less training data.

Here are some examples of word embeddings and their applications:

1. Word Similarity: Word embeddings can capture semantic similarity between words. For instance, in a word embedding space, the vectors for "king" and "queen" are expected to be close together because they are both related to royalty.
2. Word Analogies: Word embeddings can solve analogies like "king is to queen as
man is to woman." If you subtract the vector for "man" from "king" and add the
vector for "woman," you should get a vector close to "queen" in the embedding
space.
3. Sentiment Analysis: Word embeddings can be used in sentiment analysis tasks. By
representing words in a continuous vector space, models can capture the sentiment
of words and phrases more effectively.
4. Named Entity Recognition: In natural language processing tasks like named entity
recognition, word embeddings can help identify entities more accurately by
capturing contextual information about words.
5. Machine Translation: Word embeddings are useful in machine translation tasks.
By representing words in a continuous space, it becomes easier for models to learn
mappings between words in different languages.
6. Clustering and Classification: Word embeddings can be used in clustering tasks
to group similar words together or in classification tasks to classify text documents
into different categories based on the semantic meaning of words.
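
A sketch of points 1 and 2 using gensim's pretrained vectors (the model name "glove-wiki-gigaword-50" is one of the bundles provided by gensim's downloader; it is downloaded on first use):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Word similarity
print(vectors.similarity("king", "queen"))

# Word analogy: king - man + woman ≈ ?  (the top result is typically "queen")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))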

2.3 Building blocks of language

2.3a Characters, words, and phrases

a. Characters

A character (also called a grapheme in linguistics) is the smallest unit of a writing system.

In written English, “a,” “b,” and “z” are characters.

Characters do not necessarily carry meaning by themselves or represent any fixed sound when
spoken, although in some languages (e.g., Chinese), most do.

A typical character in many languages can be represented by a single Unicode codepoint (by
string literals such as "\uXXXX" in Python), but this is not always the case. Many languages
use a combination of more than one Unicode codepoint (e.g., accent marks) to represent a
single character. Punctuation marks, such as “.” (period), “,” (comma), and “?” (question mark),
are also characters.
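
A quick Python illustration of the codepoint point above: the same visible character "é" can be one codepoint or two (a base letter plus a combining accent):

single = "\u00e9"      # "é" as one codepoint (LATIN SMALL LETTER E WITH ACUTE)
combined = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(single, combined)            # both render as é
print(len(single), len(combined))  # 1 vs. 2 codepoints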

b. Words

A word is the smallest unit in a language that can be uttered independently and that usually
carries some meaning.

In English, “apple,” “banana,” and “zebra” are words.

In most written languages that use alphabetic scripts, words are usually separated by spaces or
punctuation marks. In some languages, like Chinese, Japanese, and Thai, however, words are
not explicitly delimited by spaces and require a preprocessing step called word segmentation
to identify words in a sentence.

c. Tokens A closely related concept to a word in NLP is a token.

A token is a string of contiguous characters that play a certain role in a written language. Most
words (“apple,” “banana,” “zebra”) are also tokens when written.

Punctuation marks such as the exclamation mark (“!”) are tokens but not words, because you
can’t utter them in isolation.

Word and token are often used interchangeably in NLP. In fact, when you see “word” in NLP
text (including this book), it often means “token,” because most NLP tasks deal only with
written text that is processed in an automatic way. Tokens are the output of a process called
tokenization.

d. Morphemes Another closely related concept is morpheme.

A morpheme is the smallest unit of meaning in a language.

A typical word consists of one or more morphemes. For example, “apple” is a word and also a morpheme. “Apples” is a word composed of two morphemes, “apple” and “-s,” which signifies that the noun is plural.

English contains many other morphemes, including “-ing,” “-ly,” “-ness,” and “un-.”

The process for identifying morphemes in a word or a sentence is called morphological analysis, and it has a wide range of NLP/linguistics applications.

e. Phrase A phrase is a group of words that play a certain grammatical role.

For example, “the quick brown fox” is a noun phrase (a group of words that behaves like a
noun), whereas “jumps over the lazy dog” is a verb phrase.
The concept of phrase may be used somewhat liberally in NLP to simply mean any group of
words.

f. N-grams

An n-gram is a contiguous sequence of one or more occurrences of linguistic units, such as characters and words.

For example, a word n-gram is a contiguous sequence of words, such as

• “the” (one word) ,

• “quick brown” (two words),

• “brown fox jumps” (three words).

Similarly, a character n-gram is composed of characters, such as

• “b” (one character),

• “br” (two characters),

• “row” (three characters), and so on,

which are all character n-grams made from “brown.”

An n-gram of size 1 (when n = 1) is called a unigram. N-grams of size 2 and 3 are called a
bigram and a trigram, respectively.
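
A minimal sketch for generating word and character n-grams from the example phrase (plain Python, no external libraries):

def ngrams(units, n):
    # all contiguous subsequences of length n
    return [units[i:i + n] for i in range(len(units) - n + 1)]

words = "the quick brown fox jumps".split()
print(ngrams(words, 2))    # word bigrams: ['the', 'quick'], ['quick', 'brown'], ...
print(ngrams("brown", 3))  # character trigrams: 'bro', 'row', 'own'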

2.3b Tokenization, stemming, and lemmatization

These are the steps in a typical NLP pipeline where linguistic units are processed.

1. Tokenization Tokenization is a process where the input text is split into smaller units.

There are two types of tokenization: word and sentence tokenization.

• Word tokenization splits a sentence into tokens (roughly equivalent to words and punctuation), as mentioned earlier.

• Sentence tokenization, on the other hand, splits a piece of text that may include more
than one sentence into individual sentences. If you say tokenization, it usually means
word tokenization in NLP.
2. Stemming Stemming is a process for identifying word stems.

• A word stem is the main part of a word after stripping off its affixes (prefixes and
suffixes).

• For example, the word stem of “apples” (plural) is “apple.” The stem of “meets” (with
a third-person singular s) is “meet.”

• It is often a part that remains unchanged after inflection.

• Stemming—that is, normalizing words to something closer to their original forms—has great benefits in many NLP applications.

• In search, for example, you can improve the chance of retrieving relevant documents if
you index documents using word stems instead of words.

• The most popular algorithm used for stemming English words is called the Porter
stemming algorithm, originally written by Martin Porter. It consists of a number of rules
for rewriting affixes (e.g., if a word ends with “-ization,” change it to “-ize”).

• Examples:

• [play, playing, played] → [play, play, play]

• ["improvement", "improving", "improve"] → ['improv', 'improv', 'improv']

• ["dancing", "dance", "danced"] → ['danc', 'danc', 'danc']

• ['running', 'flies', 'better', 'drove'] → ['run', 'fli', 'better', 'drove']

• ["bats", "are", "feet", "best"] → ['bat', 'are', 'feet', 'best']

3. Lemmatization

• A lemma is the original form of a word that you often find as a head word in a
dictionary.

• It is also the base form of the word before inflection.

• For example, the lemma of “meetings” (as a plural noun) is “meeting.” The lemma of
“met” (a verb past form) is “meet.”
• Notice that it differs from stemming, which simply strips off affixes from a word and
cannot deal with such irregular verbs and nouns.

• Examples:

• ["dancing", "dance", "danced"] → ["dance", "dance", "dance"]

• ['improvement', 'improving', 'improve'] → ['improve', 'improve', 'improve']

• ['running', 'flies', 'better', 'drove'] → ['run', 'fly', 'good', 'drive']

• ["bats", "are", "feet", "best"] → ["bat", "be", "foot", "good"]

Comparison of Stemming and Lemmatization

Stemming applies rule-based suffix stripping and may produce non-words (e.g., "fli"), while lemmatization returns dictionary base forms and handles irregular forms (e.g., "met" → "meet"), usually at the cost of needing POS information.

2.3c Representation Of Word Vectors:


One-hot encoding is a typical representation.

Each word is encoded using a distinct vector in this manner. The size of the vectors is equal to the number of words in the vocabulary. As a result, if there are 1,000 words, each vector is 1 × 1,000 in size. Except for a single value of one that distinguishes each word, all values in the vectors are zeros. Each word gets its own vector, but this representation has its problems.

To begin with, if the vocabulary is extensive, the vectors will be enormous: employing a model with this encoding would suffer from the curse of dimensionality. Adding or removing terms from the vocabulary will also change the representation of every word.

One of the classic methods is one-hot encoding, which represents each word in a
vocabulary as a binary vector. The dimensionality of the embedding is equal to the size of
the vocabulary, and each element of the vector corresponds to a word in the vocabulary.

For example, the sentence “Word embedding represents a word as numerical data.” contains 7 unique words (counting “Word” and “word” as one type), so the dimensionality of each one-hot vector is 7:

With the development of neural networks, scientists introduced more advanced methods that generate distributed word representations, such as Word2Vec, GloVe, and fastText, alongside weighting schemes such as TF-IDF.

2.3d Word2Vec, Skip-gram, and Continuous Bag of Words

Word2Vec

• Word2Vec is a neural network model for word embeddings.

• Word2Vec generates word vectors, which are distributed numerical representations of word features; these word features could be words that indicate the context of particular words in our vocabulary.

• Through the produced vectors, word embeddings capture the relationship between a term and other words with similar meanings. These models use context: they look at neighboring words to learn an embedding, so if two words consistently appear near the same neighboring words, their embeddings end up similar.

Skip-gram and continuous bag of words (CBOW)

• The Skip-gram and Continuous Bag of Words models are two distinct architectures that Word2Vec can use to build word embeddings.

• Both are popular algorithms for calculating word embeddings from large textual data.
In the CBOW model, the distributed representations of context (or surrounding words) are
combined to predict the word in the middle. While in the Skip-gram model, the distributed
representation of the input word is used to predict the context.

CBOW

The Continuous Bag of Words (CBOW) is a Word2Vec model that predicts a target word
based on the surrounding context words. It takes a fixed-sized context window of words and
tries to predict the target word in the middle of the window. The model learns by maximizing
the probability of predicting the target word correctly given the context words.

Let’s take a look at a simple example.

Let's say we have the sentence, "I eat Pizza on Friday". First, we will tokenize the sentence:
["I", "eat", "pizza","on", "Friday"]. Now, let's create the training examples for this sentence for
the CBOW model, considering a window size of 2.

• Training example 1: Input: ["I", "pizza"], Target: "eat".

• Training example 2: Input: ["eat", "on"], Target: "pizza".

• Training example 3: Input: ["pizza", "Friday"], Target:"on".

In CBOW, there are typically three main layers involved: the input layer, the hidden layer, and
the output layer.
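
A sketch that produces exactly the (context, target) pairs listed above from the tokenized sentence, taking one word on each side of the target:

tokens = ["I", "eat", "pizza", "on", "Friday"]
side = 1   # one word on each side of the target reproduces the examples above

pairs = []
for i, target in enumerate(tokens):
    context = tokens[max(0, i - side):i] + tokens[i + 1:i + 1 + side]
    if len(context) == 2 * side:   # keep only full windows, as in the examples
        pairs.append((context, target))

print(pairs)
# [(['I', 'pizza'], 'eat'), (['eat', 'on'], 'pizza'), (['pizza', 'Friday'], 'on')]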
Skip-gram

Skip-gram is another neural network architecture used in Word2Vec that predicts the context of a word, given a target word.

The input to the skip-gram model is a target word, while the output is a set of context words.
The goal of the skip-gram model is to learn the probability distribution of the context words,
given the target word.

During training, the skip-gram model is fed with a set of target words and their corresponding
context words.

The model learns to adjust the weights of the hidden layer to maximize the probability of
predicting the correct context words, given the target word.

Let’s take a look at the same example discussed above.

Example

• The sentence was, "I eat Pizza on Friday". First, we will tokenize the sentence:

• ["I", "eat", "pizza", "on", "Friday"].

• Now, let's create the training examples for this sentence for the skip-gram model,
considering a window size of 2.

• Training example 1: Input: "eat", Target: ["I", "pizza"].

• Training example 2: Input: "pizza", Target: ["eat", "on"].

• Training example 3: Input: "on", Target: ["pizza", "Friday"].

• Training example 4: Input: "Friday", Target: ["on"].


Both CBOW and skip-gram provide different approaches to word embeddings, offering a trade-
off between training efficiency, semantic capture, and the ability to handle different dataset
characteristics. Therefore, choosing the appropriate algorithm among them requires careful
consideration of the specific requirements of your task.
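
Both architectures are available in gensim's Word2Vec implementation; a minimal sketch with illustrative hyperparameters (sg=0 selects CBOW, sg=1 selects skip-gram):

from gensim.models import Word2Vec

sentences = [["i", "eat", "pizza", "on", "friday"],
             ["i", "eat", "pasta", "on", "monday"]]

cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["pizza"][:5])               # first few dimensions of the learned vector
print(skipgram_model.wv.most_similar("pizza"))  # nearest words in this tiny corpus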


2.3e GloVe and fastText (Pretrained Word Embeddings)


Another popular word-embedding model is GloVe, named after Global Vectors. Pretrained word embeddings generated by GloVe are probably the most widely used embeddings in NLP applications today.

Skip-gram uses a prediction task where a context word (“bark”) is predicted from the target
word (“dog”).

CBOW basically does the opposite of this.

This process is repeated as many times as there are word tokens in the dataset. It basically scans
the entire dataset and asks the question, “Can this word be predicted from this other word?” for
every single occurrence of words in the dataset.

What if there were two or more identical sentences in the dataset? Or very similar sentences?

In that case, Skip-gram would repeat the exact same set of updates multiple times. "Can 'bark' be predicted from 'dog'?" you might ask. But chances are you already asked that exact same question a couple of hundred sentences ago. If you know that the words "dog" and "bark" appear together in the same context N times in the entire dataset, why repeat this N times? It's as if you were adding "1" N times to something else (x + 1 + 1 + 1 + ... + 1) when you could simply add N to it (x + N).

How GloVe learns word embeddings

Could we somehow use this global information directly? The design of GloVe is motivated by this insight. Instead of using local word co-occurrences, it uses aggregated word co-occurrence statistics over the entire dataset. Let's say "dog" and "bark" co-occur N times in the dataset. I'm not going into the details of the model, but roughly speaking, the GloVe model tries to predict this number N from the embeddings of both words. It still makes predictions about word relations, but notice that it makes one prediction per combination of word types, whereas Skip-gram does so for every combination of word tokens!

A token is an occurrence of a word in text. There may be multiple occurrences of the same
word in a corpus. A type, on the other hand, is a distinctive, unique word.

For example, in the sentence “A rose is a rose is a rose,” there are eight tokens but only three
types (“a,” “rose,” and “is”).

Using pretrained GloVe vectors


• More often, we download and use word embeddings, which are pretrained using large
text corpora. This is not only quick but usually beneficial in making your NLP
applications more accurate, because those pretrained word embeddings are usually
trained using larger datasets and more computational power than most of us can afford.

• By using pretrained word embeddings, you can “stand on the shoulders of giants” and
quickly leverage high-quality linguistic knowledge distilled from large text corpora.

• The official GloVe website (https://nlp.stanford.edu/projects/glove/) provides multiple word-embedding files trained using different datasets and vector sizes.
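
A sketch of loading pretrained GloVe vectors through gensim's downloader instead of the raw files from the website (the model name below is one of the bundles gensim provides):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # 100-dimensional vectors

print(glove.most_similar("language", topn=5))
print(glove.similarity("cat", "dog"))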

fastText

• Let's see how to train word embeddings on your own text data using fastText, a popular word-embedding toolkit. This is handy when your textual data is not in a general domain (e.g., medical, financial, or legal) and/or is not in English.

Making use of subword information

• All the word-embedding methods we’ve seen so far in this chapter assign a distinct
word vector for each word.

• For example, word vectors for “dog” and “cat” are treated distinctly and are
independently trained at the training time.

• But what if the words were, say, “dog” and “doggy?”

• Because “-y” is an English suffix that denotes some familiarity and affection (other
examples include “grandma” and “granny” and “kitten” and “kitty”), these pairs of
words have some semantic connection. However, word-embedding algorithms that treat
words as distinct cannot make this connection.

• This is obviously limiting.

• In most languages, there’s a strong connection between word orthography (how you
write words) and word semantics (what they mean).

• For example, words that share the same stems (e.g., “study” and “studied,” “repeat”
and “repeatedly,” and “legal” and “illegal”) are often related. By treating them as
separate words, word-embedding algorithms are losing a lot of information.

• How can they leverage word structures and reflect the similarities in the learned word
embeddings?

fastText, an algorithm and a word-embedding library developed by Facebook, is one such model.

It uses subword information, which means information about linguistic units that are smaller
than words, to train higher-quality word embeddings.

Specifically, fastText breaks words down into character n-grams and learns embeddings for them.

• Another benefit of leveraging subword information is that it can alleviate the out-of-vocabulary (OOV) problem.

• Many NLP applications and models assume a fixed vocabulary. For example, a typical
word-embedding algorithm such as Skip-gram learns word vectors only for the words
that were encountered in the train set. However, if a test set contains words that did not
appear in the train set (which are called OOV words), the model would be unable to
assign any vectors to them.

• For example, if you train Skip-gram word embeddings from books published in the 1980s and apply them to modern social media text, how would the model know what vector to assign to "Instagram"? It won't.

• On the other hand, because fastText uses subword information (character n-grams), it can assign word vectors to any OOV words, as long as they contain known character n-grams.

Using the fastText toolkit

• Facebook provides the fastText toolkit as open source, a library for training the word-embedding model.

• Visit http://realworldnlpbook.com/ch3.html#fasttext and follow the instructions to download and compile the library.
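
A minimal sketch of training fastText-style embeddings with gensim's FastText class (the official fastText package can be used instead); the out-of-vocabulary lookup works because the vector is assembled from character n-grams:

from gensim.models import FastText

sentences = [["the", "dog", "barks"],
             ["the", "doggy", "barks", "loudly"],
             ["cats", "and", "dogs", "are", "pets"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# "doggo" never appears in the corpus, but its character n-grams still give it a vector
print(model.wv["doggo"][:5])
print(model.wv.similarity("dog", "doggy"))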

2.4 Document-level embeddings


• All the models described so far learn embeddings for individual words.
• However, if you wish to solve NLP tasks that are concerned with larger linguistic
structures such as sentences and documents using word embeddings and traditional
machine learning tools such as logistic regression and support vector machines (SVMs),
word-level embedding methods are still limited.
How can you represent larger linguistic units such as sentences using vector
representations?
How can you use word embeddings for sentiment analysis, for example?

• One way to achieve this is to simply use the average of all word vectors in a sentence.
You can average vectors by taking the average of first elements, second elements, and
so on and make a new vector by combining these averaged numbers.
• You can use this new vector as an input to traditional machine learning models.
Although this method is simple and can be effective, it is also very limiting.
• The biggest issue is that it cannot take word order into consideration.
• For example, both sentences “Mary loves John.” and “John loves Mary.” would have
exactly the same vectors if you simply averaged word vectors for each word in the
sentence.
• NLP researchers have proposed models and algorithms that can specifically address this
issue.
• One of the most popular is Doc2Vec, originally proposed by Le and Mikolov in 2014
(https://cs.stanford.edu/~quocle/paragraph_vector.pdf).
• This model, as its name suggests, learns vector representations for documents. In fact, "document" here simply means any variable-length piece of text that contains multiple words. Similar models appear under many names, such as Sentence2Vec, Paragraph2Vec, and paragraph vectors (the term the authors of the original paper used), but in essence they are all variations of the same model.
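
Two sketches of the ideas above: the simple word-vector average (which loses word order) and gensim's Doc2Vec (hyperparameters are illustrative):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# 1) Averaging word vectors: "mary loves john" and "john loves mary" get the same vector
toy_vectors = {"mary": np.array([1.0, 0.0]),
               "loves": np.array([0.0, 1.0]),
               "john": np.array([1.0, 1.0])}
sentence_vec = np.mean([toy_vectors[w] for w in ["mary", "loves", "john"]], axis=0)
print(sentence_vec)

# 2) Doc2Vec learns one vector per document directly
docs = [TaggedDocument(words=["mary", "loves", "john"], tags=[0]),
        TaggedDocument(words=["john", "loves", "mary"], tags=[1])]
model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=50)
print(model.dv[0][:5])   # learned vector for the first document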

2.5 Visualizing embeddings

• Retrieving similar words given a query word is a great way to quickly check if word
embeddings are trained correctly.
• But it gets tiring and time-consuming if you need to check a number of words to see if
the word embeddings are capturing semantic relationships between words as a whole.
• Word embeddings are simply N-dimensional vectors, which are also "points" in an N-dimensional space. We were able to see those points visually in a 3-D space because N was 3. But N is typically a couple of hundred in most word embeddings, and we cannot simply plot points in an N-dimensional space.
• A solution is to reduce the dimension down to something that we can see (two or three
dimensions) while preserving relative distances between points.
• This technique is called dimensionality reduction. We have a number of ways to reduce dimensionality, including PCA (principal component analysis) and ICA (independent component analysis), but by far the most widely used visualization technique for word embeddings is t-SNE (t-distributed Stochastic Neighbor Embedding, pronounced "tee-snee").
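
A sketch of visualizing embeddings with scikit-learn's t-SNE and matplotlib; the random matrix below is a placeholder for real embedding vectors:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["cat", "dog", "pizza", "pasta", "car", "bus"]
vectors = np.random.rand(len(words), 100)   # placeholder for trained embeddings

# Reduce 100 dimensions to 2 while trying to preserve relative distances
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()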

2.6 A typical text processing workflow

To understand how to process text, it is important to understand the general workflow for NLP.
The following diagram illustrates the basic steps:
The first two steps of the process in the preceding diagram involve collecting labeled data. A supervised model, or even a semi-supervised model, needs data to operate. The next step is usually normalizing and featurizing the data. Models have a hard time processing text data as is: there is a lot of hidden structure in a given text that needs to be processed and exposed, and these two steps focus on that. The last step is building a model with the processed inputs.

a. Data collection

The first step of any Machine Learning (ML) project is to obtain a dataset. Fortunately, in the
text domain, there is plenty of data to be found. A common approach is to use libraries such as
scrapy or Beautiful Soup to scrape data from the web. However, this data is usually unlabeled, and as such can't be used in supervised models directly. It is still quite useful, though: through the use of transfer learning, a language model can be trained using unsupervised or semi-supervised methods and can be further used with a small training dataset specific to the task at hand.
The goal is to gather raw text data from various sources relevant to the problem domain.
Sources of Text Data:
• Public Datasets – e.g., IMDB reviews (sentiment analysis), Wikipedia (language
modeling), news articles, etc.
• Web Scraping – Extracting text from websites, blogs, forums, or social media.
• APIs & Databases – Twitter API, Reddit API, news aggregators, and company
databases.
• Customer Feedback – Emails, reviews, support tickets, chat logs, or surveys.
• Documents & Reports – PDFs, Word files, OCR-extracted text.
Challenges in Data Collection:
• Data inconsistency (noisy, unstructured text).
• Privacy and ethical concerns (sensitive user data).
• Domain-specific jargon (technical, medical, or legal text).
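
A minimal sketch of the scraping approach mentioned above, using requests and Beautiful Soup (the URL is a placeholder; real scraping should respect a site's terms of service and robots.txt):

import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"   # placeholder URL
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
# Collect the visible text of every paragraph as raw, unlabeled data
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:3])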

b. Data labelling

In the labeling step, textual data sourced in the data collection step is labeled with the right
classes. Let's take some examples. If the task is to build a spam classifier for emails, then the
previous step would involve collecting lots of emails. This labeling step would be to attach a
spam or not spam label to each email. Another example could be sentiment detection on tweets.
The data collection step would involve gathering a number of tweets. This step would label
each tweet with a label that acts as a ground truth. A more involved example would involve
collecting news articles, where the labels would be summaries of the articles. Yet another
example of such a case would be an email auto-reply functionality. Like the spam case, a
number of emails with their replies would need to be collected. The labels in this case would
be short pieces of text that would approximate replies.

Once text data is collected, it needs to be labeled (annotated) for supervised learning tasks.
Types of Text Labeling:
• Text Classification – Assigning categories (e.g., spam vs. non-spam, sentiment labels).
• Named Entity Recognition (NER) – Identifying entities (e.g., names, locations,
dates).
• Part-of-Speech (POS) Tagging – Labeling words as nouns, verbs, adjectives, etc.
• Sentiment Analysis – Labeling text as positive, negative, or neutral.
• Question Answering – Mapping questions to correct answers in a dataset.
• Text Summarization – Generating summaries and verifying correctness.
Labeling Methods:
• Manual Labeling: Human annotators label text using annotation tools.
• Semi-Supervised Labeling: A mix of human labeling and machine-assisted
annotation.
• Crowdsourcing: Using platforms like Amazon Mechanical Turk for large-scale
labeling.
• Automated Labeling: Using weak supervision, heuristics, or pre-trained models.
Challenges in Labeling:
• Time-consuming and expensive.
• Subjectivity in annotations (e.g., sentiment can vary by person).
• Class imbalance (some labels may appear less frequently).

c. Text normalization

Text normalization is a pre-processing step aimed at improving the quality of the text and
making it suitable for machines to process. Four main steps in text normalization are case
normalization, tokenization and stop word removal, Parts-of-Speech (POS) tagging, and
stemming. Case normalization applies to languages that use uppercase and lowercase letters.
In case normalization, all letters are converted to the same case. It is quite helpful in semantic
use cases. However, in other cases, this may hinder performance.

Another common normalization step removes punctuation in the text. Again, this may or may
not be useful given the problem at hand. In most cases, this should give good results. However,
in some cases, such as spam or grammar models, it may hinder performance. It is more likely
for spam messages to use more exclamation marks or other punctuation for emphasis.

i. Tokenization

Definition: Splitting text into smaller units (tokens), usually words or subwords.
Example:
Input: "Natural Language Processing is amazing!"
Tokenized: ["Natural", "Language", "Processing", "is", "amazing", "!"]

Types:

• Word Tokenization – Splits text into words.

• Sentence Tokenization – Splits text into sentences.

• Subword Tokenization – Used in BPE and WordPiece models (e.g., "unhappiness" → ["un", "happiness"]).

Tools: NLTK, spaCy, Tokenizers from Hugging Face
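
The example above, reproduced with NLTK (a sketch; the "punkt" tokenizer models must be downloaded first):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural Language Processing is amazing! It has many applications."
print(sent_tokenize(text))   # two sentences
print(word_tokenize(text))   # ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', ...]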

ii. Lower Case Conversion

Definition: Converts text to lowercase to avoid case sensitivity issues.

Example:
Input: "Machine Learning"
Lowercased: "machine learning"

Why?

• Reduces vocabulary size.

• Avoids treating "Apple" and "apple" as different words.

• Some tasks (e.g., NER) may not need this step to retain proper noun distinctions.

iii. Stop Words Removal

Definition: Removes common words that don’t carry much meaning (e.g., "is", "the", "and").

Example:
Input: "The cat is sitting on the mat"
After stopword removal: "cat sitting mat"

Caution:

• Useful for search engines and topic modeling.

• Not always recommended for deep learning models, where context matters.

Tools: NLTK, spaCy, Scikit-learn
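
A sketch of stop-word removal with NLTK's built-in English stop-word list (the "stopwords" corpus must be downloaded first):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The cat is sitting on the mat")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # roughly ['cat', 'sitting', 'mat']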

iv. Stemming

Definition: Reduces words to their root form by chopping off suffixes.

Example:
Input: ["running", "flies", "better"]
Stemmed: ["run", "fli", "better"]
Algorithms:

• Porter Stemmer (aggressive)

• Lancaster Stemmer (more aggressive)

• Snowball Stemmer (improved version of Porter)

Limitations:

• May produce non-real words (e.g., "flies" → "fli").

• Can cause loss of meaning.

Tools: NLTK, spaCy
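
A sketch comparing the three stemmers listed above on the same words (outputs can vary slightly by NLTK version):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["running", "flies", "better", "organization"]
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])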

v. Lemmatization

Definition: Reduces words to their dictionary base form (lemma) while preserving meaning.

Example:
Input: ["running", "flies", "better"]
Lemmatized: ["run", "fly", "good"]

Why Use Lemmatization Over Stemming?

• More accurate (e.g., "went" → "go", while stemming wouldn't recognize this).

• Requires POS tagging for better results.

Tools: spaCy, WordNetLemmatizer from NLTK

vi. POS (Part-of-Speech) Tagging

Definition: Assigns grammatical categories (noun, verb, adjective, etc.) to words.

Example:
Input: "The quick brown fox jumps over the lazy dog"
POS Tags: [("The", DT), ("quick", JJ), ("brown", JJ), ("fox", NN), ("jumps", VBZ), ...]

Why?

• Improves lemmatization accuracy.

• Helps in syntactic parsing and entity recognition.

• Essential for Named Entity Recognition (NER).

Tools: spaCy, NLTK, StanfordNLP
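
A sketch using spaCy, which produces POS tags and POS-aware lemmas in the same pass; it assumes the small English model "en_core_web_sm" is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

for token in doc:
    print(token.text, token.pos_, token.tag_, token.lemma_)
# e.g., "jumps" is tagged VERB / VBZ with lemma "jump"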

2.7 Vectorization
Vectorization in NLP is the process of converting text data into numerical vectors that can be processed by machine learning algorithms. It transforms words, phrases, or entire documents into a format that models can understand and process. These numerical representations capture the semantic meaning and contextual relationships of the text, allowing algorithms to perform tasks such as classification, clustering, and prediction.

Why is Vectorization Important in NLP?

Vectorization is crucial in NLP for several reasons:

1. Vectorization converts text into a format that these models can process, enabling the
application of statistical and machine learning techniques to textual data.

2. Effective vectorization methods, like word embeddings, capture the semantic relationships between words. This allows models to understand context and perform better on tasks like sentiment analysis, translation, and summarization.

3. Techniques like TF-IDF and word embeddings reduce the dimensionality of the data
compared to one-hot encoding. This not only makes computation more efficient but
also helps in capturing the most relevant features of the text.

4. Vectorization helps manage large vocabularies by creating fixed-size vectors for words
or documents. This is essential for handling the vast amount of text data available in
applications like search engines and social media analysis.

5. Advanced vectorization techniques, such as contextualized embeddings, significantly enhance model performance by providing rich, context-aware representations of words. This leads to better generalization and accuracy in NLP tasks.

6. Pre-trained models like BERT and GPT use vectorization to create embeddings that can be fine-tuned for various NLP tasks. This transfer learning approach saves time and resources by leveraging existing knowledge.

Traditional Vectorization Techniques in NLP

Here, we explore three traditional vectorization techniques: Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Count Vectorizer.

1. Bag of Words (BoW)

The Bag of Words model represents text by converting it into a collection of words (or tokens)
and their frequencies, disregarding grammar, word order, and context. Each document is
represented as a vector of word counts, with each element in the vector corresponding to the
frequency of a specific word in the document.

# Sample documents

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]
Advantages of Bag of Words (BoW)

• Simple and easy to implement.

• Provides a clear and interpretable representation of text.

Disadvantages of Bag of Words (BoW)

• Ignores the order and context of words.

• Results in high-dimensional and sparse matrices.

• Fails to capture semantic meaning and relationships between words.

2. Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency (TF):

➢ Measures how often a word appears in a document.

➢ A higher frequency suggests greater importance.

➢ If a term appears frequently in a document, it is likely relevant to the document's content.

Limitations of TF Alone:

• TF does not account for the global importance of a term across the entire corpus.

• Common words like “the” or “and” may have high TF scores but are not meaningful in
distinguishing documents.

Inverse Document Frequency (IDF):


Reduces the weight of common words across multiple documents while increasing the weight
of rare words. If a term appears in fewer documents, it is more likely to be meaningful and
specific.

➢ The logarithm is used to dampen the effect of very large or very small values, ensuring
the IDF score scales appropriately.

➢ It also helps balance the impact of terms that appear in extremely few or extremely
many documents.

Limitations of IDF Alone:

➢ IDF does not consider how often a term appears within a specific document.

➢ A term might be rare across the corpus (high IDF) but irrelevant in a specific document
(low TF).
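
Putting the two together, a common (unsmoothed) formulation is TF-IDF(t, d) = TF(t, d) × IDF(t), with IDF(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. A minimal sketch follows; note that libraries such as scikit-learn use a smoothed IDF, so their scores differ slightly:

import math

def tf(term, doc_tokens):
    # term frequency: occurrences of the term in this document, normalized by length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    # inverse document frequency: rarer terms across the corpus get larger weights
    df = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / df)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "played", "in", "the", "park"]]
print(tf_idf("cat", docs[0], docs))   # 1/6 * log(2/1) ≈ 0.116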

Problem: Converting text into vectors with TF-IDF

Imagine we have a corpus (a collection of documents) with three documents. Calculate the TF-IDF score for the term "cat" in these documents. Assume the data is only stemmed, with no lowercasing and no stop-word removal.

1. Document 1: “The cat sat on the mat.”

2. Document 2: “The dog played in the park.”

3. Document 3: “Cats and dogs are great pets.”


Problem 2

# assign documents

d0 = 'Geeks for geeks'

d1 = 'Geeks'

d2 = 'r2j'
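
A possible completion of the snippet above using scikit-learn's TfidfVectorizer (the intended solution is not shown in the notes, so this is only one way to finish it); it reuses d0, d1, and d2 from the assignments above:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
result = tfidf.fit_transform([d0, d1, d2])

print(tfidf.get_feature_names_out())   # vocabulary learned from the three documents
print(result.toarray())                # TF-IDF matrix, one row per document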
Summary

• Word embeddings are numeric representations of words, and they help convert discrete
units (words and sentences) to continuous mathematical objects (vectors).

• The Skip-gram model uses a neural network with a linear layer and SoftMax to learn
word embeddings as a by-product of the “fake” word-association task.

• GloVe makes use of global statistics of word co-occurrence to train word embeddings
efficiently.

• Doc2Vec and fastText learn document-level embeddings and word embeddings with
subword information, respectively.

• You can use t-SNE to visualize word embeddings.
