Language Model
A language model in natural language processing (NLP) is a statistical or
machine learning model that is used to predict the next word in a sequence
given the previous words. Language models play a crucial role in various NLP
tasks such as machine translation, speech recognition, text generation, and
sentiment analysis. They analyze and understand the structure and use of
human language, enabling machines to process and generate text that is
contextually appropriate and coherent.
Grammar Based LM
Grammar-based language models are a type of statistical language model
that uses formal grammars to represent the underlying structure of language.
Unlike n-gram models, which focus on the probability of sequences of words,
grammar-based models explicitly model the grammatical relationships
between words in a sentence.
Steps
-> Training
-> Parsing
-> Probability calculation
-> Best parse selection
Advantages
-> Explicit modelling of structure
-> Robustness
Disadvantages
-> Complexity
-> Limited coverage
Applications
-> NLP, speech recognition, computational linguistics
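The steps above can be illustrated with NLTK, which provides a probabilistic CFG and a Viterbi parser for probability calculation and best-parse selection. The toy grammar and sentence below are illustrative assumptions, not part of the notes:

# A minimal sketch of a grammar-based (probabilistic) language model using NLTK.
import nltk

toy_pcfg = nltk.PCFG.fromstring("""
    S  -> NP VP      [1.0]
    NP -> Det N      [1.0]
    VP -> V NP       [0.7]
    VP -> V          [0.3]
    Det -> 'the'     [1.0]
    N  -> 'dog'      [0.5]
    N  -> 'cat'      [0.5]
    V  -> 'chased'   [1.0]
""")

parser = nltk.ViterbiParser(toy_pcfg)        # selects the most probable parse
tokens = "the dog chased the cat".split()

for tree in parser.parse(tokens):            # probability calculation + best-parse selection
    print(tree)
    print("Parse probability:", tree.prob())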
Statistical based LM
Statistical language models (SLMs) are a cornerstone of natural language
processing (NLP), aiming to predict the likelihood of a sequence of words in a
given language. They do this by analyzing vast amounts of text data and
identifying statistical patterns in word usage.
Common Types of Statistical Language Models
1. N-gram Models:
○ Unigrams: Predict the probability of a single word.
○ Bigrams: Predict the probability of a word given the previous
word.
○ Trigrams: Predict the probability of a word given the two
previous words.
○ Higher-order n-grams: Consider longer sequences of words.
2. Maximum Likelihood Estimation (MLE): A common method for
estimating the probabilities of n-grams based on their frequency in the
training data.
Advantages
-> Simplicity, efficiency, widely used
Disadvantages
-> Data sparsity (as many word combinations may not appear frequently in the training data), limited context, lack of generalization
Applications
-> Speech recognition, machine translation, text generation, information retrieval
Regular Expression
A regular expression (regex) is a sequence of characters that define a search
pattern. Here’s how to write regular expressions:
1. Start by understanding the special characters used in regex, such as
“.”, “*”, “+”, “?”, and more.
2. Choose a programming language or tool that supports regex, such
as Python, Perl, or grep.
3. Write your pattern using the special characters and literal
characters.
4. Use the appropriate function or method to search for the pattern in a
string.
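As a brief illustration of these steps, the following sketch uses Python's built-in re module; the e-mail pattern and sample text are arbitrary examples:

import re

pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"   # a simple e-mail pattern
text = "Contact us at support@example.com or sales@example.org."

matches = re.findall(pattern, text)   # search for every occurrence of the pattern
print(matches)                        # ['support@example.com', 'sales@example.org']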
Finite State Automata
Finite automata are abstract machines used to recognize patterns in input
sequences, forming the basis for understanding regular languages in
computer science. They consist of states, transitions, and input symbols,
processing each symbol step-by-step. If the machine ends in an accepting
state after processing the input, the input is accepted; otherwise, it is rejected. Finite
automata come in two varieties, deterministic (DFA) and non-deterministic (NFA), both of
which can recognize the same set of regular languages. They are widely used
in text processing, compilers, and network protocols.
Features of Finite Automata
● Input: Set of symbols or characters provided to the machine.
● Output: Accept or reject based on the input pattern.
● States of Automata: The conditions or configurations of the
machine.
● State Relation: The transitions between states.
● Output Relation: Based on the final state, the output decision is
made.
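A minimal sketch of a DFA in Python is shown below; the states, transitions, and accepting state (a machine accepting binary strings with an even number of 1s) are hypothetical choices for illustration:

# DFA sketch: accept binary strings containing an even number of 1s.
transitions = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
start_state, accepting_states = "even", {"even"}

def accepts(string):
    state = start_state
    for symbol in string:                     # process the input symbol by symbol
        state = transitions[(state, symbol)]  # follow the state relation
    return state in accepting_states          # accept iff we end in an accepting state

print(accepts("1011"))   # False (three 1s)
print(accepts("1001"))   # True  (two 1s)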
English Morphology
In Natural Language Processing (NLP), morphology plays a crucial role in
understanding the structure and meaning of words. It involves analyzing
words into their constituent parts (morphemes) and understanding how
these parts contribute to the overall meaning.
Example
Consider the word "unhappiness."
● Morphemes: un- (prefix), happy (root), -ness (suffix)
● Analysis: The prefix "un-" negates the meaning of the root "happy,"
and the suffix "-ness" converts it into a state or quality.
Tokenization
Tokenization is a fundamental process in Natural Language Processing (NLP)
that involves breaking down a stream of text into smaller units called tokens.
These tokens can range from individual characters to full words or phrases,
depending on the level of granularity required. By converting text into these
manageable chunks, machines can more effectively analyze and understand
human language.
Types
1. Word Tokenization
This is the most common method where text is divided into individual words.
It works well for languages with clear word boundaries, like English. For
example, "Machine learning is fascinating" becomes:
["Machine", "learning", "is", "fascinating"]
2. Character Tokenization
In this method, text is split into individual characters. This is particularly
useful for languages without clear word boundaries or for tasks that require
a detailed analysis, such as spelling correction. For instance, "NLP" would be
tokenized as:
["N", "L", "P"]
3. Subword Tokenization
This strikes a balance between word and character tokenization by breaking
down text into units that are larger than a single character but smaller than a
full word. For example, "Chatbots" might be tokenized into:
["Chat", "bots"]
Detecting and correcting spelling errors
Spelling error detection and correction is a crucial aspect of
Natural Language Processing (NLP). It involves identifying
misspelled words in text and suggesting accurate replacements.
This is essential for improving the quality of written
communication and enhancing the user experience in various
applications.
Example
Input: "Teh cat sat on teh mat."
Output: "The cat sat on the mat."
________________________________________________________________
Unsmoothed N-grams in NLP
In Natural Language Processing (NLP), unsmoothed n-grams are a
basic approach to language modeling. They estimate the
probability of a word sequence by directly counting the
occurrences of that sequence in a given training corpus.
Example
Let's consider the following sentence: "The quick brown fox
jumps over the lazy dog."
● Bigram (2-gram) Probabilities:
○ P("quick" | "The") = Count("The quick") / Count("The")
○ P("brown" | "quick") = Count("quick brown") /
Count("quick")
○ ...
● Trigram (3-gram) Probabilities:
○ P("brown" | "The quick") = Count("The quick brown") /
Count("The quick")
○ P("jumps" | "quick brown") = Count("quick brown
jumps") / Count("quick brown")
Smoothing in NLP
Smoothing is a crucial technique in Natural Language Processing
(NLP), particularly when dealing with n-gram models. It
addresses the issue of data sparsity, where many possible word
sequences have zero probability due to their infrequent or
non-existent occurrence in the training data.
Why Smoothing is Necessary
● Zero Probability Problem: Unsmoothed n-gram models assign
zero probability to unseen n-grams, even if they are
grammatically correct and likely to occur. This leads to:
○ Underestimation of probabilities
○ Inability to handle unseen data
● Overfitting: Unsmoothed models tend to overfit the training
data, meaning they perform poorly on unseen text.
Common Smoothing Techniques
1. Laplace Smoothing (Add-One Smoothing)
○ Adds a small constant (usually 1) to all n-gram
counts.
○ Ensures that no n-gram has zero probability.
○ Simple but can be overly aggressive, especially for
higher-order n-grams.
2. Good-Turing Smoothing
○ Redistributes probability mass from frequent n-grams
to infrequent or unseen n-grams.
○ More sophisticated than Laplace smoothing, often
providing better results.
3. Back-off Smoothing
○ If the count of an n-gram is zero, back off to the
(n-1)-gram, and so on, until a non-zero count is
found.
○ Combines information from different n-gram orders.
4. Katz Back-off
○ A refinement of back-off smoothing that uses
Good-Turing estimates to adjust the probabilities at
each back-off level.
Example
Let's consider a bigram model with the following counts:
● Count("the cat") = 10
● Count("the dog") = 5
● Count("the bird") = 0
Laplace Smoothing:
● Adjusted Count("the cat") = 10 + 1 = 11
● Adjusted Count("the dog") = 5 + 1 = 6
● Adjusted Count("the bird") = 0 + 1 = 1
Impact of Smoothing
1. Improves model robustness
2. Increases accuracy
3. Improves generalization
What is POS(Parts-Of-Speech) Tagging?
Parts of Speech tagging is a linguistic activity in Natural Language
Processing (NLP) wherein each word in a document is given a particular part
of speech (adverb, adjective, verb, etc.) or grammatical category. Through the
addition of a layer of syntactic and semantic information to the words, this
procedure makes it easier to comprehend the sentence’s structure and
meaning.
In many NLP applications, including machine translation, sentiment analysis,
and information retrieval, PoS tagging is essential. PoS tagging serves as a
link between language and machine understanding, enabling the creation of
complex language processing systems and serving as the foundation for
advanced linguistic analysis.
Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog.”
After performing POS Tagging:
● “The” is tagged as determiner (DT)
● “quick” is tagged as adjective (JJ)
● “brown” is tagged as adjective (JJ)
● “fox” is tagged as noun (NN)
● “jumps” is tagged as verb (VBZ)
● “over” is tagged as preposition (IN)
● “the” is tagged as determiner (DT)
● “lazy” is tagged as adjective (JJ)
● “dog” is tagged as noun (NN)
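For reference, an off-the-shelf tagger such as NLTK's can produce tags like these automatically; this sketch assumes the punkt and averaged_perceptron_tagger resources have been downloaded:

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
# exact tags depend on the tagger model used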
Types of POS Tagging in NLP
1. Rule-Based Tagging
Rule-based part-of-speech (POS) tagging involves assigning words their
respective parts of speech using predetermined rules, contrasting with
machine learning-based POS tagging that requires training on annotated text
corpora. In a rule-based system, POS tags are assigned based on specific
word characteristics and contextual cues.
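A minimal sketch of such a rule-based tagger is given below; the tiny lexicon and suffix rules are hypothetical examples, and real systems use far richer rule sets:

# Rule-based tagging sketch: lexicon lookup followed by simple suffix rules.
lexicon = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "sat": "VERB"}

def rule_based_tag(word):
    w = word.lower()
    if w in lexicon:                              # rule 1: look the word up in the lexicon
        return lexicon[w]
    if w.endswith("ing") or w.endswith("ed"):
        return "VERB"                             # rule 2: common verb suffixes
    if w.endswith("ly"):
        return "ADV"                              # rule 3: adverb suffix
    return "NOUN"                                 # default rule

sentence = "The dog barked loudly".split()
print([(w, rule_based_tag(w)) for w in sentence])
# [('The', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]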
2. Transformation-Based Tagging
Transformation-based tagging (TBT) is a part-of-speech (POS) tagging
method that uses a set of rules to change the tags that are applied to words
inside a text. In contrast, statistical POS tagging uses trained algorithms to
predict tags probabilistically, while rule-based POS tagging assigns tags
directly based on predefined rules.
When compared to rule-based tagging, TBT can provide higher accuracy,
especially when dealing with complex grammatical structures. To attain ideal
performance, however, it may require a large rule set and additional
computational power.
Text: “The cat chased the mouse”.
Initial Tags:
● “The” – Determiner (DET)
● “cat” – Noun (N)
● “chased” – Verb (V)
● “the” – Determiner (DET)
● “mouse” – Noun (N)
Transformation rule applied:
Change the tag of “chased” from Verb (V) to Noun (N) because it follows the
determiner “the.”
Updated tags:
● “The” – Determiner (DET)
● “cat” – Noun (N)
● “chased” – Noun (N)
● “the” – Determiner (DET)
● “mouse” – Noun (N)
Advantages of POS Tagging
There are several advantages of Parts-Of-Speech (POS) Tagging including:
● Text Simplification: Breaking complex sentences down into their
constituent parts makes the material easier to understand and
easier to simplify.
● Information Retrieval: Information retrieval systems are enhanced
by part-of-speech (POS) tagging, which allows for more precise
indexing and search based on grammatical categories.
● Named Entity Recognition: POS tagging helps to identify entities
such as names, locations, and organizations inside text and is a
prerequisite for named entity recognition.
● Syntactic Parsing: It facilitates syntactic parsing, which helps with
phrase structure analysis and word link identification.
Disadvantages of POS Tagging
Some common disadvantages in part-of-speech (POS) tagging include:
● Ambiguity: The inherent ambiguity of language makes POS tagging
difficult since words can signify different things depending on the
context, which can result in misunderstandings.
● Idiomatic Expressions: Slang, colloquialisms, and idiomatic phrases
can be problematic for POS tagging systems since they don’t always
follow formal grammar standards.
● Out-of-Vocabulary Words: Out-of-vocabulary words (words not
included in the training corpus) can be difficult to handle since the
model might have trouble assigning the correct POS tags.
● Domain Dependence: POS tagging models trained on a single domain
may not generalize well to other domains; for best results they
require a large amount of domain-specific training data.
_____________________________________________________________________
Context-Free Grammar
A context-free grammar (CFG) is a type of formal grammar that can be used to
describe the syntax or structure of a formal language. The grammar is defined
by a 4-tuple (V, T, P, S):
V - the collection of variables or non-terminal symbols.
T - the set of terminal symbols.
P - the set of production rules, which may contain both terminals and non-terminals.
S - the start symbol.
A grammar is said to be context-free if every production is of the form:
A -> (V∪T)*, where A ∊ V
● The left-hand side of a production can only be a single variable
(non-terminal); it cannot be a terminal.
● The right-hand side can be any string of variables, terminals, or a
combination of both.
In other words, a grammar in which every production has a single variable on
the left and any combination of variables from ‘V’ and terminals from ‘T’ on
the right is a context-free grammar.
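As an illustration, a small CFG of this form can be written and parsed with NLTK; the toy grammar below is an assumed example, not a full grammar of English:

import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V PP | V
    PP -> P NP
    Det -> 'the'
    N  -> 'cat' | 'mat'
    V  -> 'sat'
    P  -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    tree.pretty_print()   # prints the parse tree for the sentence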
Grammar in NLP
Grammar in NLP is a set of rules for constructing sentences in a language
used to understand and analyze the structure of sentences in text data.
This includes identifying parts of speech such as nouns, verbs, and adjectives,
determining the subject and predicate of a sentence, and identifying the
relationships between words and phrases.
Grammar is defined as the rules for forming well-structured sentences. Just as
grammar denotes the syntactic rules used for conversation in natural languages,
it also plays an essential role in describing the syntactic structure of
well-formed programs.
● In the theory of formal languages, grammar is also applicable in Computer
Science, mainly in programming languages and data structures. Example - In the
C programming language, the precise grammar rules state how functions are
made with the help of lists and statements.
Treebanks
In Natural Language Processing (NLP), treebanks are collections of text that have been
manually annotated with syntactic or semantic structures. These annotations represent
the grammatical relationships between words in a sentence, often visualized as
tree-like diagrams.
In formal language theory, particularly in the context of context-free grammars (CFGs),
a normal form is a restricted form of the grammar that maintains the same language
generated by the original grammar. These restricted forms simplify the analysis and
processing of the language. Two common normal forms are Chomsky Normal Form (CNF)
and Greibach Normal Form (GNF).
Dependency Grammar
In Natural Language Processing (NLP), dependency grammar is a framework for
analyzing the grammatical structure of sentences by focusing on the relationships
between words rather than their grouping into phrases.
Key Concepts
● Dependency: A directed link between two words, indicating that one word (the
head) governs or modifies the other (the dependent).
● Head: The central word in a dependency relation.
● Dependent: The word that is modified or governed by the head.
● Dependency Tree: A graphical representation of the dependency relations in a
sentence, where words are nodes and the directed links represent
dependencies.
How it Works
Dependency grammar aims to capture the underlying syntactic structure of a sentence
by identifying the head-dependent relationships between words. The head word
typically determines the grammatical function of the dependent word.
Example:
Consider the sentence: "The cat sat on the mat."
A possible dependency tree for this sentence would look like this:
● "sat" is the root of the sentence, as it is the main verb.
● The determiner "the" modifies "cat" and "mat".
● "on" is a preposition that governs the noun phrase "the mat".
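These head-dependent relations can be extracted automatically, for example with spaCy; the sketch below assumes the small English model en_core_web_sm has been installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # each token points to its head, together with a dependency label
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")
# e.g. "sat" is labelled ROOT, "cat" depends on "sat" as nsubj, "The" on "cat" as det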
Advantages of Dependency Grammar
● Focus on Meaning: By emphasizing the relationships between words,
dependency grammar can better capture the underlying meaning of a sentence.
● Flexibility: Suitable for analyzing languages with free word order, where the
position of words in a sentence is less fixed.
● Simplicity: Dependency trees can be more concise and easier to interpret than
phrase structure trees.
Applications
● Parsing: Dependency parsing is a widely used technique for analyzing the
syntactic structure of sentences.
● Machine Translation: Dependency-based approaches can be effective for
capturing the underlying meaning of sentences and translating them accurately.
● Information Extraction: Dependency relations can help identify key
relationships between entities in text, such as who did what to whom.
_____________________________________________________________________
Application of NLP
Intelligent Work Processors
1. Machine translation - Machine translation (MT) is a prominent application
of Natural Language Processing (NLP) within intelligent work processors. It
automates the translation of text from one human language to another,
significantly enhancing efficiency and accessibility in a globalized world.
How Machine Translation Works:
1. Text Analysis: The input text is broken down into smaller units like words,
phrases, or sentences.
2. Language Identification: The source language is identified.
3. Translation: The system uses various techniques like statistical,
rule-based, or neural machine translation to find the most appropriate
equivalent in the target language.
4. Post-Editing: While modern MT systems are quite accurate, human
post-editing is often necessary to refine the translation, especially for
nuanced or complex texts.
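As a hedged example of step 3, a pretrained neural MT model can be called through the Hugging Face transformers pipeline; the Helsinki-NLP/opus-mt-en-de checkpoint used here is one publicly available English-to-German model, and any supported language pair works similarly:

from transformers import pipeline

# load a pretrained neural machine translation model (English -> German)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation breaks down language barriers.")
print(result[0]["translation_text"])   # German translation of the input sentence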
Benefits of Machine Translation in Intelligent Work Processors:
● Efficiency: Translating large volumes of text becomes significantly faster,
saving time and resources.
● Accessibility: Content can be made accessible to a wider global
audience, breaking down language barriers.
● Cost-Effectiveness: Automating translation reduces the need for human
translators, lowering costs.
● 24/7 Availability: MT systems can translate text anytime, anywhere,
providing on-demand access to information.
Challenges and Limitations:
● Accuracy: While accuracy has improved significantly, MT systems may
still struggle with complex sentences, idioms, and cultural nuances.
● Nuance: Capturing the full meaning and tone of the original text can be
challenging, potentially leading to misinterpretations.
● Data Dependence: The quality of MT output relies heavily on the
availability of large amounts of high-quality training data.
Examples of Machine Translation in Action:
● Google Translate: A widely used online translation service that supports
numerous languages.
● Microsoft Translator: Integrated into various Microsoft products, offering
real-time translation for text and speech.
● DeepL: A commercial translation service known for its high-quality output,
particularly for European languages.
2. User interfaces - A User Interface (UI) is the point of interaction between
humans and machines. It's the way we interact with and control a device,
software, or system. Think of it as the face of technology – how it looks and feels
to use.
Key Components of a UI:
● Visual Design: This encompasses the overall aesthetic appeal, including
colors, typography, imagery, and layout. It aims to create a visually
engaging and pleasing experience.
● Interaction Design: This focuses on how users interact with the interface.
It involves elements like buttons, menus, sliders, and touch gestures,
ensuring they are intuitive and easy to use.
● Usability: This emphasizes how easy it is for users to achieve their goals
with the interface. It considers factors like clarity, efficiency, and error
prevention.
● Accessibility: This ensures the UI is usable by people with disabilities,
such as those with visual, auditory, motor, or cognitive impairments.
Types of User Interfaces:
● Command-Line Interface (CLI): Users interact by typing commands.
● Graphical User Interface (GUI): Uses visual elements like icons and
windows for interaction.
● Touchscreen Interface: Allows users to interact directly with the screen
using touch gestures.
● Voice User Interface (VUI): Enables interaction through voice commands.
● Gesture-Based Interface: Relies on body movements and gestures for
control.
Importance of a Good UI:
● Improved User Experience: A well-designed UI makes the product more
enjoyable and engaging to use.
● Increased Efficiency: Users can accomplish tasks more quickly and
easily.
● Reduced Errors: Intuitive interfaces minimize the risk of user errors.
● Enhanced Brand Image: A visually appealing and user-friendly UI can
enhance the perception of a brand.
● Greater Accessibility: A well-designed UI ensures the product is usable
by a wider range of users.
3. Man-machine interfaces - Man-Machine Interfaces (MMI), also known as
Human-Machine Interfaces (HMI), are systems that enable humans to interact
with machines or automated systems. They act as a bridge between the human
operator and the machine, facilitating control, monitoring, and data feedback in
real-time.
Key Components of an MMI:
● Hardware: This includes physical devices like touchscreens, keyboards,
mice, buttons, knobs, and sensors.
● Software: This encompasses the programs and applications that interpret
user input and control the machine's behavior.
● Visual Display: This presents information to the operator, often through
screens, gauges, or indicators.
Types of MMIs:
● Command-Line Interface (CLI): Users interact by typing commands.
● Graphical User Interface (GUI): Uses visual elements like icons and
windows for interaction.
● Touchscreen Interface: Allows users to interact directly with the screen
using touch gestures.
● Voice User Interface (VUI): Enables interaction through voice commands.
● Gesture-Based Interface: Relies on body movements and gestures for
control.
Importance of a Good MMI:
● Improved Efficiency: A well-designed MMI can significantly enhance the
operator's ability to control and monitor the machine, leading to increased
productivity and reduced downtime.
● Enhanced Safety: MMIs can help prevent accidents by providing clear
and concise information, alarms, and safety warnings.
● Reduced Errors: Intuitive interfaces minimize the risk of operator errors.
● Better Decision-Making: MMIs can provide real-time data and
visualizations, enabling operators to make informed decisions.
Applications of MMI:
● Industrial Automation: MMIs are widely used in factories and plants to
control machinery, monitor processes, and manage production lines.
● Transportation: Vehicle dashboards, flight control systems, and traffic
management systems rely on MMIs.
● Healthcare: Medical devices like MRI machines and patient monitoring
systems use MMIs for control and data visualization.
● Consumer Electronics: Remote controls, smartphone interfaces, and
gaming consoles are examples of MMIs in everyday life.
4. Natural Language querying - Natural Language Querying (NLQ) is a powerful
application of Natural Language Processing (NLP) that allows users to interact
with databases and other data sources using everyday language instead of
specialized query languages like SQL.
How it works:
1. User Input: The user poses a question in natural language, such as "What
were the sales figures for California in Q3 2023?"
2. Natural Language Understanding (NLU): The system analyzes the
user's question to:
○ Identify the intent (e.g., "find sales figures")
○ Extract key entities (e.g., "California," "Q3 2023")
○ Determine the relationships between entities (e.g., "sales figures for
California")
3. Query Translation: The system translates the natural language question
into a formal query language (like SQL) that the database can understand.
4. Data Retrieval: The database executes the generated query and retrieves
the relevant data.
5. Answer Generation: The system processes the retrieved data and
presents it to the user in a clear and concise format, such as a table, chart,
or summary.
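The pipeline can be caricatured with a deliberately simplified template matcher; the table and column names below are invented for illustration, and real NLQ systems use full NLU models rather than a single regular expression:

import re

def nl_to_sql(question):
    # very naive "NLU": extract the region, quarter and year entities with a regex
    m = re.search(r"sales figures for (\w+) in (Q\d) (\d{4})", question, re.IGNORECASE)
    if not m:
        raise ValueError("Question shape not supported by this toy example")
    region, quarter, year = m.groups()
    # "query translation": fill a SQL template with the extracted entities
    return (f"SELECT SUM(amount) FROM sales "
            f"WHERE region = '{region}' AND quarter = '{quarter}' AND year = {year};")

print(nl_to_sql("What were the sales figures for California in Q3 2023?"))
# SELECT SUM(amount) FROM sales WHERE region = 'California' AND quarter = 'Q3' AND year = 2023;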
Benefits of NLQ:
● Accessibility: Makes data analysis accessible to a wider audience,
including those without technical expertise in SQL or other query
languages.
● Efficiency: Significantly speeds up the data analysis process by
eliminating the need to write complex queries.
● Improved Productivity: Enables users to focus on insights and
decision-making rather than on the technical aspects of data retrieval.
● Enhanced User Experience: Provides a more intuitive and user-friendly
way to interact with data.
Applications of NLQ:
● Business Intelligence: Analyzing sales trends, customer behavior, and
financial performance.
● Customer Service: Answering customer questions about products,
services, and orders.
● Research: Exploring scientific data, conducting literature reviews, and
answering research questions.
● Data Exploration: Discovering hidden patterns and insights within large
datasets.
5. Speech recognition - Speech recognition, also known as automatic speech
recognition (ASR), computer speech recognition, or speech-to-text, is the
capability that enables a program to process human speech into a written format.
How it works:
● Speech Input: The user speaks into a microphone or other input device.
● Acoustic Analysis: The speech signal is converted into a digital
representation and analyzed to extract features like pitch, intensity, and
frequency.
● Feature Extraction: Key features of the speech signal are extracted and
represented in a numerical format.
● Pattern Recognition: The extracted features are compared to a database
of known speech patterns to identify the most likely words or phrases.
● Language Modeling: The system uses language models to predict the
most probable sequence of words based on the context and grammatical
rules.
● Output: The recognized text is displayed to the user.
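A short sketch using the SpeechRecognition Python package is shown below; it assumes a WAV file named speech.wav exists and that Google's free web recognizer is acceptable for a quick test:

import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("speech.wav") as source:      # speech input from an audio file
    audio = recognizer.record(source)           # acoustic capture / digitization

try:
    text = recognizer.recognize_google(audio)   # pattern recognition + language modelling
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")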
Applications:
● Virtual Assistants: Siri, Alexa, Google Assistant
● Dictation Software: Dragon NaturallySpeaking, Google Docs voice typing
● Transcription Services: For meetings, interviews, and legal proceedings
● Accessibility: For individuals with disabilities who have difficulty typing
● Smart Home Devices: Controlling devices with voice commands
Challenges:
● Accuracy: Handling accents, dialects, background noise, and different
speaking styles.
● Vocabulary: Recognizing and understanding rare words or technical
jargon.
● Real-time Processing: Ensuring fast and accurate recognition for
real-time applications.
Natural Language Processing (NLP) has a wide range of commercial
applications across various industries. Here are some key areas:
1. Customer Service & Support:
● Chatbots & Virtual Assistants: Powering interactive conversations with
customers, answering FAQs, resolving simple issues, and providing 24/7
support.
● Sentiment Analysis: Analyzing customer feedback (reviews, social media
posts, surveys) to understand customer sentiment, identify areas for
improvement, and proactively address concerns.
● Text Summarization: Summarizing customer interactions (emails, chat
logs) to quickly identify key issues and expedite resolution.
2. Marketing & Sales:
● Social Media Monitoring: Tracking brand mentions, identifying
influencers, and analyzing customer sentiment towards competitors.
● Targeted Advertising: Personalizing ad campaigns based on customer
interests and preferences extracted from text data (e.g., website content,
social media profiles).
● Market Research: Analyzing market trends, competitor analysis, and
customer behavior to inform business decisions.
3. Finance:
● Fraud Detection: Identifying and preventing fraudulent activities by
analyzing patterns in financial documents (e.g., transactions, reports) and
detecting anomalies.
● Risk Assessment: Assessing credit risk by analyzing loan applications,
financial statements, and news articles.
● Investment Analysis: Analyzing news articles, financial reports, and
social media sentiment to inform investment decisions.
4. Healthcare:
● Clinical Documentation: Automating the process of documenting patient
information, such as medical histories and discharge summaries.
● Drug Discovery: Analyzing research papers and clinical trials to identify
potential new drug candidates.
● Patient Monitoring: Analyzing patient records and medical conversations
to identify potential health risks and improve patient care.
5. E-commerce:
● Product Recommendations: Personalizing product recommendations
based on customer search history, browsing behavior, and purchase
history.
● Product Categorization: Automating the process of categorizing products
based on their descriptions.
● Sentiment Analysis: Analyzing customer reviews to identify product
strengths and weaknesses.
6. Legal:
● E-discovery: Analyzing large volumes of legal documents (emails,
contracts, legal briefs) to identify relevant information for legal proceedings.
● Contract Analysis: Automating the process of reviewing and analyzing
contracts to identify potential risks and inconsistencies.