NLP (TB)—Unit-1
Difference between Natural Language and Computer Language
Natural languages are inherently ambiguous and uncertain, whereas computer languages are designed to be precise and unambiguous. This ambiguity and uncertainty are what make NLP difficult.
Ambiguity
There are the following three types of ambiguity:
○ Lexical Ambiguity
Lexical Ambiguity exists when a single word has two or more possible meanings.
Example:
Manya is looking for a match.
In the above example, the word match could mean either that Manya is looking for a partner or that Manya is looking for a match in the sporting sense (a cricket match or some other game).
○ Syntactic Ambiguity
Syntactic Ambiguity exists when a sentence can be parsed in two or more ways, so the sentence as a whole has more than one possible meaning.
Example:
I saw the girl with the binoculars.
In the above example, did I have the binoculars? Or did the girl have the binoculars?
○ Referential Ambiguity
Referential Ambiguity exists when a pronoun could refer to more than one possible antecedent.
Example: Kiran went to Sunita and she said "I am hungry."
In the above sentence, we do not know who is hungry: Kiran or Sunita.
NLP stands for Natural Language Processing, a field at the intersection of Computer Science, Linguistics (human language), and Artificial Intelligence. It is the technology used by machines to understand, analyse, manipulate, and interpret human languages.
Components of NLP
1. Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine to understand and analyse human language by
extracting the metadata from content such as concepts, entities, keywords, emotion, relations, and semantic
roles.
2. Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that converts the computerized data into natural
language representation. It mainly involves Text planning, Sentence planning, and Text Realization.
Difference between NLU and NLG
In short, NLU is the reading side: it maps natural language input into a structured, machine-understandable representation. NLG is the writing side: it maps structured, computerized data back into natural language.
Phases of NLP
There are five phases of NLP:
1. Lexical Analysis
The first phase of NLP is Lexical Analysis. This phase scans the source text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.
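A minimal tokenization sketch using NLTK (the library choice is an assumption; any tokenizer would do the same job):

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt")  # tokenizer models required by NLTK

    text = "NLP is fun. It divides raw text into sentences and words!"
    sentences = sent_tokenize(text)                 # split the text into sentences
    words = [word_tokenize(s) for s in sentences]   # split each sentence into word tokens
    print(sentences)
    print(words)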
2. Syntactic Analysis (Parsing)
Syntactic Analysis is used to check grammar and word arrangement, and it shows the relationships among the words.
Example: Agra goes to the Poonam
In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is rejected by the syntactic analyzer.
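A minimal parsing sketch using spaCy (an assumed library choice; the small English model must be downloaded first):

    import spacy  # pip install spacy; python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I saw the girl with the binoculars")
    for token in doc:
        # each word, its grammatical relation, and the word it attaches to
        print(token.text, token.dep_, token.head.text)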
3. Semantic Analysis
Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal meaning
of words, phrases, and sentences.
4. Discourse Integration
Discourse Integration means that the meaning of any sentence depends upon the sentences that precede it and may also invoke the meaning of the sentences that follow it.
5. Pragmatic Analysis
Pragmatic Analysis is the fifth and last phase of NLP. It helps you discover the intended effect of an utterance by applying a set of rules that characterize cooperative dialogues.
For Example: "Open the door" is interpreted as a request instead of an order.
Applications of NLP:
Natural Language Processing (NLP) has a wide range of applications across various fields.
● Sentiment Analysis: NLP can be used to analyze the sentiment of text data, which is valuable for
understanding public opinion, customer feedback, and social media sentiment.
● Machine Translation: NLP powers machine translation systems that translate text from one
language to another, facilitating communication across language barriers.
● Chatbots and Virtual Assistants: NLP enables the development of chatbots and virtual assistants
that can understand and respond to user queries, providing customer support, information retrieval,
and task automation.
● Information Extraction: NLP techniques can extract structured information from unstructured
text data, such as named entity recognition, relation extraction, and event extraction.
● Text Summarization: NLP algorithms can summarize large volumes of text into concise
summaries, which is useful for quickly understanding the key points of lengthy documents or
articles.
● Question Answering Systems: NLP powers question answering systems that can understand
natural language questions and provide relevant answers by extracting information from structured
or unstructured data sources.
● Text Classification: NLP techniques are used for text classification tasks such as spam detection,
sentiment analysis, topic classification, and document categorization.
● Named Entity Recognition (NER): NLP models can identify and classify named entities
mentioned in text, such as names of people, organizations, locations, dates, and numerical
expressions.
● Language Generation: NLP can generate human-like text, including generating creative content,
writing articles, composing poetry, and generating code snippets.
● Information Retrieval: NLP is used in search engines to understand user queries and retrieve
relevant documents or web pages from large collections of text data.
● Speech Recognition and Speech-to-Text: NLP techniques are employed in speech recognition
systems to convert spoken language into text, enabling applications such as virtual assistants, voice-
controlled devices, and dictation software.
● Text Mining: NLP facilitates the mining of valuable insights and patterns from large volumes of
text data, including social media data, customer reviews, scientific literature, and financial reports.
These applications demonstrate the versatility and importance of NLP in various domains, ranging from
customer service and healthcare to finance and entertainment.
Approaches to NLP
DL Approaches
Artificial Neural Networks (ANNs) are computational networks that are able to solve complex, nonlinear mathematical problems.
The field of ANN has been inspired by the ambition to model biological neural systems.
Neural networks are modelled as collections of layers of neurons that are connected in an acyclic graph.
The output of such an ANN could be a predicted numerical value, but in many cases it is taken to represent class scores (e.g., in text classification).
In most cases of NLP, we are interested in multinomial classification, such as part-of-speech tagging. Hence, the
output layer yields a probability distribution across the output nodes.
DNNs stack up several hidden layers, with each layer acting as the input to the next layer.
DL allows a computer to build complex concepts out of simpler concepts. Another perspective on deep learning
is that the depth allows the computer to learn a multi-step program.
Each layer can be interpreted as the state of the computer’s memory after executing a set of instructions.
Overview of different DL techniques.
• Multilayer Perceptron (MLP) is a feed-forward neural network with multiple (one or more) hidden layers
between the input layer and output layer.
• Autoencoder (AE) is an unsupervised model attempting to reconstruct its input data in the output layer.
• Convolutional Neural Network (CNN) is a special kind of feed-forward neural network with convolution
layers and pooling operations.
• Recurrent Neural Networks (RNNs) use loops and memory to remember former computations.
• Deep Reinforcement Learning (DRL) operates on a trial-and-error paradigm. The whole framework mainly consists of the following components: agents, environments, states, actions, and rewards.
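As a small illustration of the first technique above, here is a sketch of an MLP whose softmax output layer yields a probability distribution across classes (built with Keras; the feature size and number of classes are hypothetical):

    import tensorflow as tf

    # Hypothetical setup: 100 input features per document, 3 output classes
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(100,)),
        tf.keras.layers.Dense(64, activation="relu"),    # hidden layer
        tf.keras.layers.Dense(3, activation="softmax"),  # probability distribution over the classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()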
RNN (For Further Reading in later units)
● A sentence in any language flows from one direction to another.
● Thus, a model that can progressively read an input text from one end to another can be very useful for
language understanding.
● Recurrent neural networks (RNNs) are specially designed to keep such sequential processing and learning
in mind.
● RNNs have neural units that are capable of remembering what they have processed so far.
● This memory is temporal, and the information is stored and updated with every time step as the RNN reads
the next word in the input.
LSTM
● RNNs suffer from the problem of forgetful memory: they cannot remember longer contexts and therefore do not perform well when the input text is long, which is typically the case with text inputs.
● Long short-term memory networks (LSTMs), a type of RNN, were invented to mitigate this shortcoming of RNNs.
● LSTMs circumvent this problem by letting go of the irrelevant context and only remembering the part of
the context that is needed to solve the task at hand.
● Gated recurrent units (GRUs) are another variant of RNNs that are used mostly in language generation.
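A minimal LSTM sentiment-classifier sketch in Keras (the vocabulary size, dimensions, and the binary label are assumptions for illustration):

    import tensorflow as tf

    # Hypothetical setup: vocabulary of 10,000 words, binary sentiment label
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # word IDs -> word vectors
        tf.keras.layers.LSTM(64),                                   # keeps only the relevant context
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])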
CNN
● A word in a sentence can be replaced with its corresponding word vector, and all vectors are of the same
size (d).
● Thus, they can be stacked one over another to form a matrix or 2D array of dimension n × d, where n is the number of words in the sentence and d is the size of the word vectors.
● This matrix can now be treated like an image and can be modeled by a CNN.
● The main advantage CNNs have is their ability to look at a group of words together using a context window.
● For example, suppose we are doing sentiment classification and we get a sentence like, "I like this movie very much!" To make sense of this sentence, it is better to look at individual words and at different sets of contiguous words.
● A CNN uses a collection of convolution and pooling layers to achieve this condensed representation of the text, which is then fed as input to a fully connected layer to learn an NLP task such as text classification.
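A minimal CNN-for-text sketch in Keras (the vocabulary size, vector size d, and filter settings are hypothetical):

    import tensorflow as tf

    # Hypothetical setup: vocabulary of 10,000 words, word vectors of size d = 100
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=100),             # builds the n x d matrix
        tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu"),  # context window of 3 words
        tf.keras.layers.GlobalMaxPooling1D(),                                   # condensed representation
        tf.keras.layers.Dense(1, activation="sigmoid"),                         # fully connected classifier
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])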
Transformers
● Transformer models are the most recent advancement, having emerged over the past few years.
● They model the textual context but not in a sequential manner.
● Given a word in the input, it prefers to look at all the words around it (known as self-attention) and represent each word with respect to its context.
● For example, the word “bank” can have different meanings depending on the context.
● With transformers, a very large transformer model is trained in an unsupervised manner (known as pre-training) to predict a part of a sentence given the rest of the content.
● These models are trained on more than 40 GB of textual data, scraped from the whole internet.
● An example of a large transformer is BERT (Bidirectional Encoder Representations from Transformers),
which is pre-trained on massive data and open sourced by Google.
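A minimal sketch of using a pre-trained BERT through the Hugging Face transformers library; the fill-mask task mirrors the pre-training objective of predicting a masked part of a sentence (the example sentence is made up):

    from transformers import pipeline  # pip install transformers

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill_mask("I deposited the cheque at the [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))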
NLP Pipeline
The key stages in the pipeline are as follows:
1. Data acquisition
2. Text cleaning
3. Pre-processing
4. Feature engineering
5. Modeling
6. Evaluation
7. Deployment
8. Monitoring and model updating
1. Data acquisition
● Use a public dataset
We could see if there are any public datasets available that we can leverage.
● Scrape data
We could find a source of relevant data on the internet—for example, a consumer or discussion forum
where people have posted queries (sales or support). Scrape the data from there and get it labeled by
human annotators.
● Product intervention
The AI team should work with the product team to collect more and richer data by developing better
instrumentation in the product.
● Data augmentation
We can take a small dataset and use some tricks to create more data. These tricks are collectively called data augmentation, and they try to exploit language properties to create text that is syntactically similar to the source text data. Some common tricks are listed below (a small sketch of two of them appears after this list):
⮚ Synonym replacement
⮚ Back translation
Say we have a sentence, S1, in English. We use a machine-translation library like Google Translate to translate it into some other language, say, German. Let the corresponding sentence in German be S2. Now, we use the machine-translation library again to translate S2 back to English. Let the output sentence be S3. S3 will usually differ from S1 in wording but carry the same meaning, so it can be added to the dataset as a new example.
⮚ TF-IDF–based word replacement
⮚ Bigram flipping
Divide the sentence into bigrams. Take one bigram at random and flip it.
⮚ Replacing entities
Replace entities like person name, location, organization, etc., with other entities
in the same category.
⮚ Adding noise to data
In many NLP applications, the incoming data contains spelling mistakes (for example, tweets). In such cases, we can add a bit of noise to the data to train robust models. For example, randomly choose a word in a sentence and replace it with another word that is closer in spelling to the first word, simulating "fat finger" typos on mobile (QWERTY) keyboards.
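A minimal sketch of two of these tricks, bigram flipping and adding typo noise (the example sentence and the single-character typo scheme are illustrative assumptions):

    import random

    def bigram_flip(sentence):
        # Bigram flipping: swap one randomly chosen pair of adjacent words
        words = sentence.split()
        if len(words) < 2:
            return sentence
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)

    def add_typo_noise(sentence):
        # Adding noise: replace one character of a random word to simulate a "fat finger" typo
        words = sentence.split()
        i = random.randrange(len(words))
        w = words[i]
        if len(w) > 1:
            j = random.randrange(len(w))
            words[i] = w[:j] + random.choice("abcdefghijklmnopqrstuvwxyz") + w[j + 1:]
        return " ".join(words)

    print(bigram_flip("I like this movie very much"))
    print(add_typo_noise("I like this movie very much"))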
There are other advanced techniques and systems that can augment text data.
● Snorkel
● Easy Data Augmentation (EDA)
● Active learning
2. Text Extraction and Cleanup
Text extraction and cleanup refers to the process of extracting raw text from the input data by removing
all the other non-textual information, such as markup, metadata, etc., and converting the text to the
required encoding format.
Typical inputs from which raw text must be extracted include (a) a PDF invoice, (b) HTML text, and (c) text embedded in an image.
● HTML Parsing and Cleanup
⮚ Say we’re working on a project where we’re building a forum search engine for programming questions, using Stack Overflow as a source, and we decide to extract question and best-answer pairs from the website. We notice that questions and answers have special tags associated with them, and we can utilize this information while extracting text from the HTML page.
⮚ It’s more feasible to utilize existing libraries such as Beautiful Soup and Scrapy,
which provide a range of utilities to parse web pages.
Extracting a question and its best-answer pair from a Stack Overflow web page:
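A minimal sketch using Beautiful Soup (the question URL and the CSS class names "question" and "answercell" are assumptions and may differ on the live site):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    url = "https://stackoverflow.com/questions/415511/how-do-i-get-the-current-time-in-python"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    question = soup.find("div", {"class": "question"})
    answer = soup.find("div", {"class": "answercell"})
    if question and answer:
        print("Question:\n", question.get_text().strip()[:300])
        print("Best answer:\n", answer.get_text().strip()[:300])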
⮚ Unicode Normalization
As we develop code for cleaning up HTML tags, we may also encounter various Unicode characters,
including symbols, emojis, and other graphic characters.
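A minimal normalization sketch using Python's built-in unicodedata module (the example string is made up):

    import unicodedata

    text = "I love 🍕! Ｌｅｔ'ｓ grab a café ☕ later."
    normalized = unicodedata.normalize("NFKC", text)   # fold compatibility forms (e.g., full-width letters)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")  # optionally drop emojis and symbols
    print(normalized)
    print(ascii_only)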
⮚ Spelling Correction
Shorthand text messages in social microblogs often hinder language processing and context
understanding. Two such examples follow:
Shorthand typing: Hllo world! I am back!
Fat finger problem: I pronise that I will not bresk the silence again!
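A minimal spelling-correction sketch; the pyspellchecker package is an assumed choice here (any dictionary-based corrector would do):

    from spellchecker import SpellChecker  # pip install pyspellchecker

    spell = SpellChecker()
    sentence = "I pronise that I will not bresk the silence again"
    corrected = " ".join(spell.correction(word) or word for word in sentence.split())
    print(corrected)  # expected: "I promise that I will not break the silence again"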
⮚ System-Specific Error Correction
The pipeline in this case starts with extraction of plain text from PDF documents.
However, different PDF documents are encoded differently, and sometimes, we may not be able to extract
the full text, or the structure of the text may get messed up.
While there are several libraries, such as PyPDF, PDFMiner etc., to extract text from PDF documents, they
are far from perfect, and it’s common to encounter PDF documents that can’t be processed by such libraries.
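A minimal extraction sketch using the pypdf library (the file name is hypothetical; as noted above, extraction may fail or come out garbled for some documents):

    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader("invoice.pdf")   # hypothetical file name
    full_text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:                   # extraction can fail for scanned or oddly encoded PDFs
            full_text += page_text + "\n"
    print(full_text[:500])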
Pre-processing
Other Pre-Processing Steps
⮚ Text normalization
⮚ A word can be spelled in different ways, including in shortened forms, a phone number can be
written in different formats (e.g., with and without hyphens), names are sometimes in
lowercase, and so on.
⮚ Bringing all such variations to a single canonical form is known as text normalization. Some common steps for text normalization are to convert all text to lowercase or uppercase, convert digits to text (e.g., 9 to nine), expand abbreviations, and so on.
⮚ Language detection
⮚ A lot of web content is in non-English languages. For example, say we’re asked to collect
all reviews about our product on the web.
⮚ In such cases, language detection is performed as the first step in an NLP pipeline. We can use libraries like Polyglot [36] for language detection (see the sketch after this list).
⮚ Code mixing and transliteration
⮚ Many people across the world speak more than one language in their day-to-day lives.
Thus, it’s common to see them using multiple languages in their social media posts, and a
single post may contain many languages.
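A minimal language-detection sketch using Polyglot, as mentioned above (the review strings are made up, and Polyglot's extra system dependencies must be installed; treat the exact API usage as an assumption):

    from polyglot.detect import Detector  # pip install polyglot (plus pyicu/pycld2 dependencies)

    reviews = [
        "This product is great, totally worth the price!",
        "Dieses Produkt ist großartig, den Preis absolut wert!",
    ]
    for review in reviews:
        language = Detector(review).language   # most probable language of the text
        print(language.code, language.name, "->", review)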
Advanced Processing
⮚ POS tagging
⮚ Coreference resolution, etc.
Feature Engineering
⮚ The goal of feature engineering is to capture the characteristics of the text into a numeric vector
that can be understood by the ML algorithms. We refer to this step as “text representation”.
⮚ We’ll briefly touch on two different approaches taken in practice for feature engineering in (1) a
classical NLP and traditional ML pipeline and (2) a DL pipeline.
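A minimal feature-engineering sketch for the classical pipeline, using scikit-learn's TF-IDF vectorizer (the two documents are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["I like this movie very much", "I did not like this movie"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)            # each document becomes a numeric vector
    print(vectorizer.get_feature_names_out())     # the vocabulary behind each vector dimension
    print(X.toarray())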
Modeling
The next step is about how to build a useful solution out of this. At the start, when we have limited data,
we can use simpler methods and rules.
⮚ Start with Simple Heuristics
At this early stage, ML may not play a major role. Part of that could be due to a lack of data, but human-built heuristics can also provide a great start in some ways.
Heuristics may already be part of your system, either implicitly or explicitly.
For instance, in email spam-classification tasks, we may have a blacklist of domains that are used
exclusively to send spam. This information can be used to filter emails from those domains. Similarly, a
blacklist of words in an email that denote a high chance of spam could also be used for this classification.
Building Your Model
⮚ Create a feature from the heuristic for your ML model
When there are many heuristics where the behavior of a single heuristic is deterministic but their combined
behavior is fuzzy in terms of how they predict, it’s best to use these heuristics as features to train your ML
model.
E.g., in the email spam-classification example, we can add features, such as the number of words from the blacklist in a given email or the email bounce rate, to the ML model.
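A minimal sketch of turning the blacklist heuristic into a numeric feature column (the blacklist and example emails are made up):

    import re

    BLACKLIST = {"lottery", "winner", "prize", "viagra"}   # hypothetical blacklist of spam words

    def blacklist_count(email_text):
        # Heuristic turned into a feature: how many blacklisted words the email contains
        words = re.findall(r"[a-z]+", email_text.lower())
        return sum(1 for word in words if word in BLACKLIST)

    emails = [
        "Congratulations, you are a lottery winner! Claim your prize now.",
        "Meeting moved to 3pm, see you there.",
    ]
    features = [[blacklist_count(e)] for e in emails]   # one feature column to feed into the ML model
    print(features)   # [[3], [0]]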
Building the Model
● Ensemble and stacking
o Model stacking
We can feed one model’s output as input for another model, thus sequentially going from one model to
another and obtaining a final output. This is called model stacking.
o Model ensembling
We can also pool predictions from multiple models and make a final prediction. This is called model
ensembling.
Stacking (Stacked Generalization) is an ensemble learning technique that aims to combine multiple models
to improve predictive performance. Steps are:
1. Base Models: Training multiple models (level-0 models) on the same dataset.
2. Meta-Model: Training a new model (level-1 or meta-model) to combine the predictions of the
base models. Using the predictions of the base models as input features for the meta-model.
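A minimal stacking sketch using scikit-learn's StackingClassifier (the toy data and the particular base models are assumptions):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy dataset

    base_models = [                      # level-0 models trained on the same dataset
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
    meta_model = LogisticRegression()    # level-1 model combines the base models' predictions
    stack = StackingClassifier(estimators=base_models, final_estimator=meta_model)
    stack.fit(X, y)
    print(stack.predict(X[:5]))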
Evaluation
Evaluations are of two types: intrinsic and extrinsic.
Intrinsic focuses on intermediary objectives, while extrinsic focuses on evaluating performance on the final
objective.
● Intrinsic Evaluation
For most metrics in this category, we assume a test set where we have the ground truth or labels (human
annotated, correct answers).
Labels could be binary (e.g., 0/1 for text classification), one or two words (e.g., names for named entity recognition), or larger text itself (e.g., text translated by machine translation).
The output of the NLP model on a data point is compared against the corresponding label for that data
point, and metrics are calculated based on the match (or mismatch) between the output and label.
For most NLP tasks, the comparison can be automated, hence intrinsic evaluation can be automated.
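A minimal sketch of automated intrinsic evaluation, comparing model outputs against ground-truth labels with scikit-learn metrics (the label lists are made up):

    from sklearn.metrics import accuracy_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]   # ground-truth labels (human-annotated)
    y_pred = [1, 0, 0, 1, 0, 1]   # model outputs on the same test set
    print("Accuracy:", accuracy_score(y_true, y_pred))   # fraction of exact matches
    print("F1 score:", f1_score(y_true, y_pred))         # balance of precision and recall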