NLP Insem
CHAPTER 1 : Natural Language Processing
Syllabus

Introduction to Natural Language Processing, Why NLP is hard?, Programming languages vs Natural Languages, Are natural languages regular?, Finite automata for NLP, Stages of NLP, Challenges and Issues (Open Problems) in NLP.
Basics of text processing : Tokenization, Stemming, Lemmatization, Part of Speech Tagging.
NLP has become part of our lives at home and at work. We can send voice commands to our home assistants, our smartphones, etc. Voice-enabled applications such as Alexa, Siri, and Google Assistant use NLP to answer our questions. They can add activities to our calendars and call the contacts that we mention in our voice commands. NLP has made our lives easier. More than that, it has revolutionised the way we work, live and play.
Communication is an act that an agent can perform so as to exchange important information with the environment. Communication is carried out by producing and perceiving certain signs drawn from a shared system of conventional signs.
The advantage of natural language processing can be seen when considering the following two statements.
(PB-86) Tech-Neo Publications...A SACHIN SHAH Venture
NLP (SPPU-Sem8-Comp.) (Introduction to NLP)...Page no. (1-3)
If you use natural language processing for search, the program will recognize that "cloud computing" is an entity, that "cloud" is an abbreviated form of cloud computing, and that "SLA" is an industry acronym for service level agreement.
The ultimate goal of NLP is to do away with computer programming languages altogether. Instead of specialized languages such as Java or Ruby or C, there would only be "human".
History of NLP

GQ. Give a brief history of NLP systems. OR Give a brief history of NLP.

The work related to NLP started with machine translation (MT) in the 1950s. It was Alan Turing who proposed what today is called the Turing test in 1950. It tests the ability of a machine program to hold a written conversation with a human.
This program should be written so well that one would find it difficult to distinguish its replies from those of a human. ELIZA, an early NLP program, was the simulation of a psychotherapist. At a much later stage, case grammars came up. Now, there has been a complete revolution in NLP approaches. Many NLP systems based on machine learning have been developed till today, and a lot of competitions are being organized around them.
GQ. What is pragmatic analysis in natural language processing?

Pragmatics has not been the central concern of most NLP systems. Only after ambiguities arise at the syntactic or semantic level are the context and purpose of the utterance considered for analysis. Consider a problem in which pragmatics has been used in this kind of "support" capacity : ambiguous noun phrases.
1. Pattern matching : Here, interpretation is done by matching the input against patterns built from semantic primitives instead of words.
2. Syntactically driven parsing : Syntax means the ways that words can fit together to form higher level units such as phrases, clauses and sentences. In a way this is the opposite of pattern matching, as here the interpretation of the input is done as a whole. Syntactic analyses are obtained by application of the grammar involved.
Natural language processing is also considered a difficult problem in computer science. It is the nature of the human language that makes NLP hard. The rules that dictate the passing of information using natural languages are not easy for computers to understand.
There are several factors that make this process hard. For example, there are hundreds of natural languages, each of which has different syntax rules and words. Everyone agrees that "Not to be invited is sad" is a sentence of English, but people disagree on the grammaticality of "To be not invited is sad".
Another factor is ambiguity. For example, "He saw her duck" can mean that he saw a waterfowl belonging to her, or that he saw that she is evading something. This implies that we cannot speak of a single meaning for a sentence, but rather of a probability distribution over possible meanings.
Text can be encoded using schemes such as UTF-16 or Latin-1. Another factor is that text must be split into a sequence of words. These words are called tokens, and the process is called tokenization. With a language like Chinese, this can be quite difficult, since it uses unique symbols for words. Words and morphemes are assigned a part-of-speech label identifying what type of unit each is.
A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. Stemming is the process of finding the word stem of a word. For example, words such as "running", "runs" and "run" have the word stem "run".
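The suffix-stripping idea can be sketched in a few lines of Python. This is a toy illustration only, not a real stemmer such as Porter's; the suffix list and the consonant-doubling rule are assumptions made for this example:

```python
def naive_stem(word):
    # Toy stemmer: strip a few common suffixes, then undo consonant
    # doubling ('runn' -> 'run'). Real stemmers use many more rules.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

for w in ('running', 'runs', 'run'):
    print(w, '->', naive_stem(w))
```

All three words reduce to the stem "run", matching the example above.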
Lemmatization is a more refined process than stemming and uses vocabulary and morphological techniques to find a lemma. This process determines the base form of a word, called its lemma. For example, the stem of the word "operating" is "oper", but its lemma is "operate". Lemmatization results in more precise analysis in some situations.
Words are combined into phrases and sentences; sentence detection finds their boundaries. We are also concerned with the relationships between words. For example, coreference resolution determines the relationship between certain words in one or more sentences.
When a word has multiple meanings, we perform word sense disambiguation to determine the actual meaning. Sometimes this is difficult to do. For example, "Arjun went back home." Does the home refer to a house, a city, or some other unit? Its meaning can be inferred from the context in which it is used. For example, "Arjun went back home. It was situated at the end of Rasta Peth." Here, the word "it" corefers with "home".
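A simplified, Lesk-style sketch of word sense disambiguation: pick the sense whose dictionary definition shares the most words with the sentence's context. The two glosses for "home" below are invented for illustration:

```python
# Hypothetical sense inventory for the word 'home' (illustrative only).
SENSES = {
    'home': {
        'house': 'a building where a person or family lives',
        'city': 'the town or city area where a person lives or grew up',
    }
}

def disambiguate(word, sentence):
    # Lesk-style heuristic: choose the sense whose gloss shares the
    # most words with the sentence context.
    context = set(sentence.lower().split())
    glosses = SENSES[word]
    return max(glosses, key=lambda s: len(context & set(glosses[s].split())))

print(disambiguate('home', 'the home was situated at the end of a street in the city'))
```

With the sentence above, the "city" gloss overlaps the context more than the "house" gloss, so that sense is chosen.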
A programming language is an artificial language, typically designed to provide programmers with commands whose meaning is clear and unambiguous.
We mention below the features that a programming language must possess.
(i) Simplicity : The language must offer clear and simple concepts that facilitate its learning and application.
(ii) Naturalness : It implies that its application must feel natural in the area for which it was designed, providing operators and structures that allow the programmer to work efficiently.
(iii) Abstraction : It is the ability to define and to use complicated structures or operations while ignoring some details.

Natural-language programming, by contrast, expresses programs through the meaning of sentences. For example, NLP can be used to represent all the knowledge of an autonomous robot. After that, its tasks can be scripted by its users so that the robot can execute them autonomously while keeping to prescribed rules of behavior for robots.
Some methods for program synthesis are based on natural-language programming.

1.5.4 Interpretation

1.5.5 Software Paradigm

Natural-language programming is a top-down method of writing software. Each concept and all of its attributes are defined in natural language words.
(iv) Definition of an ontology. This ontology will define the data structures which the NLP can use in sentences.
(v) Definition of one or more top-level sentences in terms of concepts from the ontology. These sentences are later used to invoke the most important activities in the topic.
(vi) Using testing objects, to test the meaning of each sentence by executing its code.
(vii) Providing a library of procedure calls (in the underlying high-level language) which are needed in the code definition of some low-level sentence meanings.
(viii) Providing a title, author data and compiling the sentences into an HTML or LaTeX file.
(ix) Publishing the natural language program as a webpage on the internet, or as a PDF file compiled from the LaTeX document.
In other words, the system, like a kid, matches what it can with the pictures (types) and skills it already knows.
In theoretical computer science and formal language theory, a regular language is also called a Rational Language (RL). It is defined by a regular expression in the strict sense used in theoretical computer science. A regular expression is a formula in a special language that specifies simple classes of strings, where a string is a sequence of symbols.

Properties of regular expressions :
(i) A regular expression requires two things : one is the pattern that is to be searched for, and the other is a corpus of text from which we need to search.
(ii) If X and Y are regular expressions, then the following are also regular expressions :
1. X, Y
2. X.Y (or XY - concatenation of X and Y)
3. X+Y (or X∪Y - union of X and Y)
4. X*, Y* (closure of X and Y; also called Kleene closure)
(iii) If a string is derived from the above rules, then that is also a regular expression.
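In Python's re module, for instance, the "pattern plus corpus" pair looks like this (the sentence is just a made-up corpus for illustration):

```python
import re

# The two ingredients of a regular-expression search:
pattern = re.compile(r'[wW]oodchuck')          # 1. the pattern
corpus = ('How much wood would a woodchuck chuck '
          'if a Woodchuck could chuck wood?')  # 2. the text corpus

print(pattern.findall(corpus))
```

findall returns every substring of the corpus that the pattern matches, here both capitalisations of "woodchuck".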
(a+b)* : It would be the set of all strings of a's and b's of any length, which also includes the null string, i.e., {ε, a, b, aa, ab, ba, bb, aaa, ...}.

(a+b)*abb : It would be the set of strings of a's and b's ending with the string abb, i.e., {abb, aabb, babb, aaabb, ababb, ...}.
(11)* : It would be the set of strings consisting of an even number of 1's, which also includes the empty string, i.e., {ε, 11, 1111, 111111, ...}.
(aa)*(bb)*b : It would be the set of strings consisting of an even number of a's followed by an odd number of b's, i.e., {b, aab, aabbb, aabbbbb, aaaab, ...}.
(aa+ab+ba+bb)* : It would be the set of strings of a's and b's of even length that can be obtained by concatenating any combination of the strings aa, ab, ba and bb, including the null string, i.e., {ε, aa, ab, ba, bb, aaab, aaba, aaaabbbb, ...}.

Note that a finite automaton cannot remember an arbitrary exact number; it can only keep track of a finite amount of information in its state.
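These sets can be checked mechanically with Python's re.fullmatch, writing the textbook union "+" as "|" (a quick sketch):

```python
import re

# Membership tests for the example languages above.
assert re.fullmatch(r'(a|b)*abb', 'ababb')         # ends with abb
assert not re.fullmatch(r'(a|b)*abb', 'abba')
assert re.fullmatch(r'(11)*', '1111')              # even number of 1s
assert re.fullmatch(r'(11)*', '')                  # the empty string
assert not re.fullmatch(r'(11)*', '111')
assert re.fullmatch(r'(aa)*(bb)*b', 'aabbb')       # even a's, odd b's
assert not re.fullmatch(r'(aa|ab|ba|bb)*', 'aab')  # odd length rejected
print('all membership checks passed')
```

fullmatch (rather than search) is the right call here, since language membership is about the entire string.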
(ix) It is the language accepted by a Deterministic Finite Automaton (DFA).
(x) It is the language accepted by a Nondeterministic Finite Automaton (NFA).
(xi) It can be generated by a regular grammar.
(xii) It can be generated by a prefix grammar.
(xiii) It can be accepted by a read-only Turing machine.
(xiv) Its generating function S_L(z) = Σ_{n≥0} S_L(n) z^n is rational, where S_L(n) is the number of words of length n in L; equivalently, S_L(n) satisfies a linear recurrence. This implies that there exist an integer constant n0, complex constants λ1, ..., λk and complex polynomials p1(x), p2(x), ..., pk(x) such that for every n ≥ n0 the number S_L(n) of words of length n in L is

S_L(n) = p1(n) λ1^n + p2(n) λ2^n + ... + pk(n) λk^n

Thus, non-regularity of certain languages L' can be proved by counting the words of a given length in L'.
The term automata means 'self-acting'. It is the plural of automaton.

An automaton is defined as a self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State Automaton (FSA).

1.7.2 Types of FA

There are two types of finite state automata.
Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite set of states.
(ii) Σ is a finite set of symbols, called the alphabet of the automaton.
(iii) δ is the transition function.
(iv) q0 is the initial state from where any input is processed (q0 ∈ Q).
(v) F is a set of final state/states of Q (F ⊆ Q).

1.7.2.1(A) Deterministic Finite Automaton (DFA)

In a DFA, for each input symbol, one can determine the state to which the machine will move. Hence the machine is called Deterministic. Mathematically, a DFA can be defined as a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite set of states.
(ii) Σ is a finite set of symbols.
(iii) δ is the transition function, where δ : Q × Σ → Q.
(iv) q0 is the initial state from where any input is processed (q0 ∈ Q).
(v) F is a set of final state/states of Q (F ⊆ Q).

Graphically, a DFA can be represented by digraphs called state diagrams, where :
(i) The states are represented by vertices.
(ii) The transitions are shown by labelled arcs.
(iii) The initial state is represented by an arc with no source vertex.
(iv) The final state is represented by a double circle.

Regular Grammars and Regular Expressions

We mention below the points which will give us a clear idea about the relationship between finite automata, regular grammars and regular expressions.
(i) Finite state automata are the theoretical foundation of computational work, and regular expressions are one way of describing them.
(ii) Any regular expression can be implemented as an FSA, and any FSA can be described with a regular expression.
(iii) Since a regular expression is a way to characterise a kind of language called a regular language, we can say that a regular language can be described with the help of both FSA and regular expressions.
(iv) Regular grammar, a formal grammar that can be right-regular or left-regular, is another way to characterise a regular language.

We mention below the diagram to show that finite automata, regular expressions and regular grammars are equivalent ways of describing regular languages.

Fig. 1.7.1 : Regular grammars, regular expressions and finite automata are equivalent

1.7.2.2 Example of DFA

Suppose a DFA has states Q = {l, m, n} over the alphabet {0, 1}. Its transition function δ is given by a transition table listing the next state for each current state and input symbol, and its graphical representation is shown in Fig. 1.7.2.

Fig. 1.7.2 : State diagram of the example DFA
In a Non-deterministic Finite Automaton (NDFA), for a given input symbol the machine can move to any combination of states; since it still has a finite number of states, it is called a Non-deterministic Finite Automaton (NDFA) machine.

Again, as usual, an NDFA can be represented mathematically by a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite set of states.
(ii) Σ is a finite set of symbols.
(iii) δ is the transition function, where δ : Q × Σ → 2^Q.
(iv) q0 is the initial state from where any input is processed (q0 ∈ Q).
(v) F is a set of final state/states of Q (F ⊆ Q).

Graphically, an NDFA can be represented by digraphs (same as a DFA) called state diagrams, where :
(i) The states are represented by vertices.
(ii) The transitions are shown by labelled arcs.

Fig. 1.7.3 : Graphical representation of an NDFA
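The 5-tuple definition translates directly into code. Below is a minimal DFA simulator; the particular machine, which accepts binary strings containing an even number of 1's, is an illustrative choice rather than the book's example:

```python
# DFA as a 5-tuple: states {'even', 'odd'}, alphabet {'0', '1'},
# transition function DELTA, start state START, accepting states ACCEPT.
DELTA = {
    ('even', '0'): 'even', ('even', '1'): 'odd',
    ('odd',  '0'): 'odd',  ('odd',  '1'): 'even',
}
START, ACCEPT = 'even', {'even'}

def accepts(string):
    # Follow the transition function one input symbol at a time.
    state = START
    for symbol in string:
        state = DELTA[(state, symbol)]
    return state in ACCEPT

for s in ('11', '0110', '111'):
    print(s, accepts(s))
```

Because the next state is fully determined by the current state and symbol, the whole run is a single pass over the input, which is exactly the "read-only, state-only memory" limitation discussed below.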
Limitations of finite automata :
(i) An FA can only count a finite input. There is no finite automaton that can find and recognise the set of binary strings with an equal number of 0s and 1s.
(ii) The set of strings over '(' and ')' that have balanced parentheses cannot be recognised.
(iii) The input tape is read-only, and the only memory the machine has is its state.
(iv) It can match only string patterns.
GQ. Briefly explain the NLP tasks and write the different levels of NLP. OR Explain the syntactic and semantic analysis in NLP.

The NLP problem can be divided into two tasks : processing written text, using lexical, syntactic and semantic knowledge of the language as well as the required real-world information; and processing spoken language, using all the information above plus additional knowledge about phonology.
Levels of NLP

1. Morphology

It is the analysis of individual words, which consist of morphemes, the smallest grammatical units. Generally, suffixes such as 'ing' and 'ed' change the meaning of words.
2. Syntax

Syntax is concerned with the rules of sentence structure. It includes checking the legal formulation of the sentence. (Some aspects are covered in the compiler's phase of syntax analysis that you must have studied.) For example, in "Hari is good not to", the sentence structure is totally invalid.
3. Semantics

During this phase, a meaning check is carried out : the way in which the meaning is conveyed is analyzed. The previous example is syntactically as well as semantically wrong. Now, consider one more example, i.e., "The table is on the ceiling". This is syntactically correct, but semantically wrong.
4. Discourse integration

In communication, or even in text formats, the meaning of the current sentence is often dependent on the one that is prior to it. Discourse analysis deals with the identification of discourse structure.
5. Pragmatics

In this phase, the response from the user is analysed with reference to what the language is actually meant to convey. So, it deals with the mapping between what the user has interpreted from the conveyed part and what was actually expected. For a question like "Do you know how long it will take to complete the job?", the expected answer is the number of hours rather than a yes or no.
6. Prosody

It is an analysis phase that handles rhythm. This is the most difficult analysis.

7. Phonology

This involves analysis of the different kinds of sounds that are combined. It is concerned with speech recognition.

The analysis levels discussed above can be considered as forming a fuzzy structure. They can work in stages, where the second level makes use of the analysis or the outcomes of the first level. We now study them in detail.
1. Lexical and morphological analysis
2. Syntactic analysis
3. Semantic analysis
4. Pragmatic analysis

Pragmatics is the last phase of NLP. It helps one to discover the intended effect of meaning. It is mainly concerned with how the sentences are used and what their inner relationships are.
6. World knowledge

(ii) Procedural knowledge : This knowledge refers to the ability to perform a specific skill or task.

Information retrieval (IR) is the task of finding documents that are relevant to a user's query.
Internet search engines are a form of IR. However, one change from classical IR is that Internet search now uses techniques that rank documents according to how many links there are to them (e.g., Google's PageRank) as well as the presence of search terms. Information extraction involves trying to discover specific information from a set of documents. The information required can be described as a template. For instance, for company joint ventures, the template might have slots for the companies, the dates, the products, and the amount of money involved. The slot fillers are generally strings. Question answering attempts to find a specific answer to a specific question from a set of documents, or at least a short piece of text that contains the answer. What is the capital of France? Paris has been the French capital for many centuries. There are some question-answering systems on the Web, but most use very basic techniques. For instance, Ask Jeeves relies on a fairly large staff of people who search the web to find pages which are answers to potential questions. The system performs very limited manipulation on the input to map to a known question. The same basic technique is used in many online help systems.
A sentence can often be interpreted in different ways. For example, the sentence "The man saw the girl with the telescope" can be parsed in different ways.
(2) The dream of an all-inclusive Digital India cannot be realized without bringing NLP research and application in India at par with that of languages like English. When engaging with smartphones, the language barrier can be a huge obstacle to many.
(3) Take the case of farmers and agriculture, which has long been considered the backbone of the Indian economy. It is hard for farmers to share and learn about new farming practices, since most of the information is in English.
(4) Can you imagine a mobile application like Google Assistant but tailor-made for Indian farmers? It'd allow them to ask their questions in their native tongue.
(5) Do you think this is possible to do without NLP for Indian regional languages? And this is just one possible use-case. From making information more accessible to understanding farmer suicides [4], NLP has a huge role to play. Thus, there is a clear need to bolster NLP research for Indian languages, so that people who don't know English can get "online" in the true sense of the word.
(6) There are many more such use-cases. The ideal scenario would be to have corpora and tools available in as good quality as they are for English to support work in these areas.
(2) Training Data : NLP is all about analysing language to better understand it. One must spend years immersed in a language to become fluent in it : a significant amount of time reading, listening to, and utilising the language. Likewise, the abilities of an NLP system depend on the training data provided to it. If questionable data is fed to the system, it is going to learn the wrong things, or learn in an inefficient way.
(3) Development Time : One also must think about the development time for an NLP system. With a distributed deep learning model and multiple GPUs working in coordination, one can trim the training time to a few hours.
(4) Phrasing Ambiguities : Sometimes it is hard even for another human being to parse out what someone means.
(5) Misspellings : Misspellings are a simple problem for a human, but common misspellings can be harder for a machine to identify. One should use an NLP tool with capabilities to recognise misspellings of words and move beyond them.
(6) Innate Biases : In some cases, NLP tools can carry the biases of their programmers, as well as biases within the data sets used to train them.
(10) Keeping a conversation moving : Many modern NLP applications are built on dialogue between a human and a machine.
Applications of NLP include sentiment analysis, question/answer systems, chatbots, automatic text summarisation, market intelligence, automatic text classification, and automatic grammar checking.
Fig. 1.12.1 : AI = Artificial intelligence; ML = Machine learning; DL = Deep learning; NLP = Natural language processing
(1) Translation

Translating languages is a more complex task than a simple word-to-word replacement method. Since each language has its own grammar rules, the challenge of translating a text is to do so without changing its meaning and style.

Speech Recognition

Google Now, Alexa, and Siri are some of the most popular examples of speech recognition. Simply by saying 'call Ravi', a mobile recognises what the command means and makes a call to the contact saved as 'Ravi'.
(4) Chatbots

Chatbots are programs used to provide automated answers to common customer queries. They have pattern recognition systems with heuristic responses, which are used to hold conversations with humans. They can answer questions such as "What is the price of …?" or "How do I go to the …?", and some systems also handle audio, images and data.
Question-answer systems can be found in social media chats and in tools such as Siri and IBM's Watson.

In 2011, IBM's Watson computer competed on Jeopardy, a game show during which answers are given first and the contestants supply the questions. The computer competed against the show's two biggest all-time champions and astounded the tech industry as it won first place.
(6) Automatic Text Summarisation

Automatic text summarisation is the task of condensing a piece of text into a shorter version. It extracts the main ideas while preserving the meaning of the content. This application of NLP is used in news headlines, bulletins of market reports, and result snippets in web search.
Fig. 1.12.2 : Spam detection - an NLP + machine learning model classifies a message as spam or not spam
(12) Natural Language Understanding (NLU)

It converts a large set of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate than raw natural language.

With the help of complex algorithms and intelligent analysis, NLP tools pave the way for digital assistants, chatbots, voice search, and dozens of other applications.
Even then, there are some important issues that we have to resolve. They are as follows.

1. Language differences

In the USA, people speak English, but if we are thinking of reaching a multicultural audience, we shall need to provide support for multiple languages. Different languages have not only vastly different sets of vocabulary, but also different types of phrasing, different modes of inflection, and different cultural expectations.
2. Training data

The aim of NLP is to analyse language to better understand it. To be fluent in a language, one must immerse in it constantly for a period of years. Similarly, even the best AI must spend a significant amount of time reading, listening to and utilising language. The ability of an NLP system depends on the training data provided to it. If bad or questionable data is fed to the system, it is going to learn the wrong things, or learn in an inefficient way.

3. Development time

One must also consider the time needed to develop and train an NLP system.
5. Misspellings

For human beings, misspellings are not a very big problem. One can easily associate a misspelled word with its properly spelled counterpart and understand the rest of the sentence in which it is used. But for a machine, common misspellings can be harder to identify. Hence we need to use an NLP tool with capabilities to recognise common misspellings of words, and to move beyond them.
6. Innate biases

In some cases, NLP tools can carry the biases of their programmers, as well as biases within the data sets that are used to train them. Depending on the application, an NLP system could exploit certain biases. It is challenging to make a system that works equally well in all situations, with all people.
For example, a user who asks "How are you?" has a totally different goal than a user who asks something like "How do I add a new debit card?"

Some questions and phrases have multiple intentions. In such a case, an NLP system cannot oversimplify the situation by interpreting only one of those intentions. For example, a user may prompt your chatbot with something like, "I need to cancel my previous order and update my card on file."
9. False positives and uncertainty

Many of the modern NLP applications are built on dialogue between a human and a machine. Hence, our NLP AI needs to be able to keep the conversation moving, providing additional questions to collect more information and always pointing towards a solution.

Here, we have discussed the major challenges of using NLP.
1.14 TOKENIZATION

Tokenization is a common task in Natural Language Processing (NLP). It is a fundamental step in both traditional NLP methods like Count Vectorizer and advanced Deep Learning-based architectures like Transformers. Tokens are the building blocks of natural language.

Fig. 1.14.1 : The sentence "What time is it ?" split into tokens
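The split in Fig. 1.14.1 can be reproduced with a one-line regular-expression tokenizer (a sketch; real NLP tokenizers handle many more cases, such as contractions and abbreviations):

```python
import re

def tokenize(text):
    # Runs of word characters become tokens, and each punctuation
    # mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("What time is it?"))
```

On the example sentence, this yields the five tokens shown in the figure: the four words plus the question mark.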
In data security, tokenization is the process of replacing sensitive data with symbols that preserve all the essential information about the data without compromising its security.
Tokenisation tries to minimise the amount of data a business needs to keep on hand. It has become popular for small and midsize businesses to bolster the security of credit card and e-commerce transactions while minimising the cost and complexity of compliance with industry standards and government regulations.

Tokenization technology can be used with sensitive data of all kinds, including bank transactions, medical records, criminal records, vehicle driver information, loan applications, stock trading and voter registration.

Tokenisation is often used to protect credit card data, bank account information and other sensitive data handled by a payment processor. Payment processing use cases that tokenize sensitive credit card information include :
(i) mobile wallets like Android Pay and Apple Pay;
(ii) e-commerce sites; and
(iii) businesses that keep a customer's card on file.
1.14.2 Token : What is it? How is it created?

GQ. What is a token? How is it created?

The nonsensitive, replacement information is called a token. Tokens can be created in various ways :
(i) using a mathematically reversible cryptographic function with a key;
(ii) using a non-reversible function, such as a hash function;
(iii) using an index function or a randomly generated number.

In short, the token becomes the exposed information, and the sensitive information that the token stands for is stored safely in a centralised server known as a token vault. The original information can only be traced back from the token via the token vault.
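A vault-based tokeniser can be sketched with a dictionary standing in for the centralised token vault. This is purely illustrative; real systems use hardened storage, access control and key management:

```python
import secrets

vault = {}  # token -> sensitive value (stands in for the token vault)

def tokenize_value(sensitive):
    # A random token carries no information about the original data,
    # so it cannot be reversed mathematically.
    token = secrets.token_hex(8)
    vault[token] = sensitive
    return token

def detokenize(token):
    # Mapping back to the original value is only possible via the vault.
    return vault[token]

card = '4111 1111 1111 1111'
token = tokenize_value(card)
print(token != card, detokenize(token) == card)
```

Because the token is random rather than derived from the card number, a breach of the systems that only see tokens exposes nothing about the underlying data.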
Some tokenisation works vaultless; instead of storing the sensitive information in a vault :
(i) The data are stored with a randomly generated token, which is generated in most cases by the merchant's payment gateway.
(ii) The tokenized information is then sent to a payment processor.
(iii) The original sensitive payment information is stored in a token vault in the merchant's payment gateway.
(iv) The tokenised information is sent again by the payment processor for final verification.
1.14.3 Word Tokenization

Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter.

Drawbacks of word tokenization : One of the major issues with word tokens is dealing with Out-Of-Vocabulary (OOV) words, which are new words encountered at test time.

Generally, pre-trained models are trained on a large volume of text corpus. So, just imagine building the vocabulary with all the unique words in such a large corpus. This explodes the vocabulary!
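The OOV problem can be seen in a few lines: any word absent from the training vocabulary collapses to an unknown marker. The tiny corpus and sentences here are made up for illustration:

```python
# Build a word vocabulary from a tiny 'training corpus'.
corpus = ['the cat sat on the mat', 'the dog sat']
vocab = {w for line in corpus for w in line.split()}

def encode(sentence):
    # Out-of-vocabulary words are replaced by the '<UNK>' token.
    return [w if w in vocab else '<UNK>' for w in sentence.split()]

print(encode('the cat ran'))
```

Here "ran" never appeared in the corpus, so it is lost as '<UNK>'; subword tokenization schemes were introduced largely to soften this failure mode.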
In older systems, credit card numbers were stored in databases and exchanged freely over networks.
(ii) It is more compatible with legacy systems than encryption.
(iii) It is a less resource-intensive process than encryption.
(iv) The risk of fallout in a data breach is reduced.
(v) The payment industry is made more convenient by allowing new technologies like mobile wallets, one-click payment and cryptocurrency. This improves customer trust, because it improves both the security and the convenience of a merchant's service.
(vi) It reduces the steps involved in complying with regulations for merchants.
From a word such as 'building', the model can derive a 'build' token and an 'ing' token. From there it can make valuable inferences about how the word functions in the sentence. The suffix can form a noun from a verb, like the verb 'build' turned into the noun 'building'. It can also form the present participle of a verb, like the verb 'run' becoming 'running'. If an NLP model is given this information about the 'ing' suffix, it can make several valuable inferences about any word that uses the subword 'ing'. Such subword modelling has enabled major technological leaps.
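A toy version of this subword idea: split off the 'ing' suffix whenever the remainder is a known base, undoing consonant doubling on the way. The base vocabulary here is an assumption made for this sketch:

```python
BASES = {'build', 'run'}  # hypothetical base vocabulary

def split_ing(word):
    # Split 'building' -> ['build', 'ing']; undo doubling for 'running'.
    if word.endswith('ing'):
        base = word[:-3]
        if base not in BASES and len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        if base in BASES:
            return [base, 'ing']
    return [word]

print(split_ing('building'), split_ing('running'), split_ing('king'))
```

Note that 'king' stays whole: its remainder 'k' is not a known base, so the suffix is not split off. Real subword tokenizers (e.g. BPE) learn such splits from data rather than from hand-written rules.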
(i) Asset/security token : These are tokens that promise a positive return on an investment. These are analogous to bonds and equities.
(ii) Utility token : These are tokens that serve a particular use on a platform, such as paying a commission.
1.15 STEMMING

We employ stemming to reduce a word to its stem, which may not be a legitimate word in the language. For example, 'connect' is the stem of these three words : connections, connected, connects. On the other hand, the root of trouble, troubled and troubles is 'troubl', which is not a recognised word.

The presence of several variants of a single term in a text corpus results in data redundancy, and NLP or machine learning models built on such data may become ineffective. To build a robust model, it is essential to normalise text by removing repetition and transforming words to their base form through stemming.

Stemming is a rule-based approach that reduces the variants of a root/base word to that base word. In simple words, it reduces a word to its stem. This heuristic process involves the indiscriminate cutting of the ends of words. Stemming helps to shorten the look-up and to normalise sentences for a better understanding.
Overstemming : The inflected word is cut off so much that the resultant stem is nonsensical, or words with quite different meanings end up having the same stem. For example, "universal", "university" and "universe" are all reduced to "univers". Here, even though these three words are etymologically related, their modern meanings are widely different. Treating them as synonyms in a search engine will lead to inferior search results.
Understemming : Here, various inflected words that are actually forms of one another fail to reduce to the same stem. An example of understemming in the Porter stemmer is "alumnus" → "alumnu", "alumni" → "alumni", "alumnae" → "alumna". The English word has Latin morphology, and so these near-synonyms are not combined.
1.15.3 Types of Stemmer in NLTK

GQ. Discuss different types of stemmer in NLTK.

There are different kinds of stemming algorithms, and all of them are included in Python NLTK. We discuss them below.
1.15.3.1 Porter Stemmer

Five steps of word reduction are used in this method, each with its own mapping rules. The resultant stem is a shorter word with the same root meaning. It is frequently the stemmer of choice for its ease of use and rapidity.

PorterStemmer() is a module in NLTK that implements the Porter stemming technique.

We consider an example : we construct an instance of PorterStemmer() and use the Porter algorithm to stem a list of words.

from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ['connecting', 'connect', 'connection', 'connections', 'connected', 'connects']
for word in words:
    print(word, porter.stem(word))

Output :

connecting connect
connect connect
connection connect
connections connect
connected connect
connects connect
1.15.3.2 Snowball Stemmer - SnowballStemmer()

The method used in this instance is more precise and is referred to as the "English stemmer" or "Porter2 stemmer". It is rather faster and more logical than the original Porter stemmer. SnowballStemmer() is a module in NLTK that implements the Snowball stemming technique.
Example of SnowballStemmer()

We first construct an instance of SnowballStemmer() to use the Snowball algorithm to stem a list of words.

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')
words = ['generous', 'generate', 'generously', 'generation']
for word in words:
    print(word, snowball.stem(word))

[Out] :

generous generous
generate generat
generously generous
generation generat
1.15.3.3 Lancaster Stemmer
LancasterStemmer() is a module in NLTK that implements the Lancaster stemming technique.
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
words = ['eating', 'eats', 'eaten', 'puts', 'putting']
for word in words:
    print(word, '>', lancaster.stem(word))
[Out]
eating > eat
eats > eat
eaten > eat
puts > put
putting > put
1.15.3.4 Regexp Stemmer - RegexpStemmer()
The Regexp stemmer identifies morphological affixes using regular expressions; substrings matching the regular expressions are discarded. RegexpStemmer() is a module in NLTK that implements the regex stemming technique.
Example of RegexpStemmer()
Here, we first construct an object of RegexpStemmer() and then use the regex stemming method to stem the list of words.
[Out]
mass mas
was was
bee bee
computer computer
advisable advis
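The output above can be reproduced with a short sketch; the particular regular expression and minimum length below are assumptions (the widely used NLTK book example), not fixed by the text:

```python
from nltk.stem import RegexpStemmer

# Strip the suffixes 'ing', 's', 'e', 'able'; words shorter than 4
# characters (min=4) are left untouched, so 'was' and 'bee' survive.
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

for word in ['mass', 'was', 'bee', 'computer', 'advisable']:
    print(word, regexp.stem(word))
```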
Words that are derived from one another can be mapped to a base word or
symbol, especially if
they have the same meaning.
CONSULTANT
CONSULTING
CONSULTATIVE
CONSULTANTS
→ CONSULT
GQ. What are the most common types of error associated with stemming in text mining or NLP?
We cannot be sure that stemming will give us a 100% correct result, so we have two types of error in stemming : overstemming and understemming.
GQ. What is the overstemming error?
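Both error types can be observed directly with NLTK's Porter stemmer; the word lists below are standard illustrations (the first triple is not from this text, the second repeats the alumnus example):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Overstemming: three unrelated words collapse to the same stem.
print([porter.stem(w) for w in ['universal', 'university', 'universe']])

# Understemming: three related forms keep three different stems.
print([porter.stem(w) for w in ['alumnus', 'alumni', 'alumnae']])
```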
1.16 LEMMATIZATION
Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item.
Lemmatization is similar to stemming, but it brings context to the words, so it links words with similar meaning to one word. Some people treat these as the same; but lemmatization is preferred over stemming because lemmatization does a morphological analysis of the words.
Lemmatization is responsible for grouping different inflected forms of words into the root form, having the same meaning.
Tagging systems, indexing,
SEOs,information retrieval, and web search all use
lemmatization to a vast extent.
Lemmatization involves using a vocabulary and morphological analysis of words, removing inflectional endings, and returning the dictionary form of a word (the lemma). The process of lemmatization seeks to get rid of inflectional suffixes and prefixes for the purpose of bringing out the word's dictionary form.
leafs, leaves → leaf
Fig. 1.16.1
Fig. 1.16.2 : Lemmatization
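The leafs/leaves → leaf mapping of the figure can be sketched as a minimal dictionary-lookup lemmatizer (a toy illustration with an assumed lookup table, not NLTK's WordNetLemmatizer):

```python
# Toy lemmatizer: a lemmatizer consults a vocabulary, so irregular
# forms like 'leaves' map to the correct dictionary form.
LEMMA_TABLE = {
    'leafs': 'leaf',
    'leaves': 'leaf',
    'ate': 'eat',
    'eaten': 'eat',
}

def lemmatize(word):
    """Return the dictionary form (lemma), falling back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ['Leafs', 'leaves', 'eaten', 'tree']:
    print(w, '->', lemmatize(w))
```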
GQ. State the difference between stemming and lemmatization.
Sr. No. | Stemming | Lemmatization
1. | Stemming attempts to reduce the inflectional forms of each word to a common base or root. | Lemmatization also attempts to reduce the inflectional forms of each word to a common base or root.
2. | Stemming tends to be a faster process because it chops words without knowing the context of the word in the sentence. | Lemmatization is a slow process because it considers the context of the word before processing, which gives it higher accuracy.
Stemming vs Lemmatization
Change, Changing, Changed, Changer → stem : 'Chang' ; lemma : 'Change'
Fig. 1.16.3
Lemmatization is a vital step in natural language processing:
(i) It plays critical roles both in Artificial Intelligence and in big data analysis.
(ii) Lemmatization is extremely important because it is far more accurate than stemming. This brings great value when working with a chatbot, where it is important to understand the exact meaning of a user's message.
The major disadvantage of lemmatization algorithms is that they are much slower than stemming algorithms.
Some of the other areas where lemmatization can be used are as follows:
1. Sentiment analysis : Sentiment analysis refers to an analysis of people's messages, reviews or comments to understand how they feel about something. Before the text is analysed, it is lemmatized.
2. Information retrieval environments
3. Biomedicine
4. Search engines
Advantages
(i) Lemmatization is more accurate than stemming.
(ii) It is useful to get root words from the dictionary, unlike stemming, which just cuts the word.
(iii) Lemmatization gives more context to chatbot conversations as it recognises words based on their exact and contextual meaning.
Disadvantages
(i) Lemmatization is a time-consuming and slow process.
(ii) As it extracts the root words and the meaning of the words from the dictionary, most lemmatization algorithms are slower compared to their stemming counterparts.
1.17 SYNTAX ANALYSIS
Syntax analysis or parsing is the second phase of a compiler.
A lexical analyser can identify tokens with the help of regular expressions and pattern rules. But a lexical analyser cannot check the syntax of a given sentence due to the limitations of regular expressions; regular expressions cannot check balancing of tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG). Thus, CFG is a superset of regular grammar.
Context-free Grammar ⊇ Regular Grammar
Fig. 1.17.1
The diagram implies that every regular grammar is also context-free. A CFG defines a set of strings (a language).
(i) A set of non-terminal symbols (V) : Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
(ii) A set of tokens, known as terminal symbols (Σ) : Terminals are the basic symbols from which strings are formed.
Verb
Adjective
Adverb
We can also say that POS tagging is a kind of classification that may be defined as the automatic assignment of description to the tokens.
We mention below the properties of rule-based POS taggers:
(i) These taggers are knowledge-driven taggers.
(ii) The rules in rule-based POS tagging are built manually.
(iii) There are around 1000 rules.
Stochastic POS tagging
The best tag for a given word is determined by the probability at which it occurs with the previous tags; hence it is also called the n-gram approach. We mention below its properties:
(i) This POS tagging is based on the probability of a given sequence of tags occurring.
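The stochastic idea can be reduced to its simplest (unigram, n = 1) form: assign each word the tag it carried most often in a training corpus. A minimal sketch, where the tiny tagged corpus is purely illustrative:

```python
from collections import Counter, defaultdict

# Tiny tagged training corpus of (word, tag) pairs -- illustrative only.
training = [('the', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB'),
            ('the', 'DET'), ('run', 'NOUN'), ('was', 'VERB'),
            ('a', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')]

# Count how often each word occurs with each tag.
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def unigram_tag(word):
    """Most frequent tag for `word`; fall back to 'NOUN' for unseen words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return 'NOUN'

print([(w, unigram_tag(w)) for w in ['the', 'dog', 'runs', 'cat']])
```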
(ii) As in TBL there is an interlacing of machine-learned and human-generated rules, its complexity is reduced.
We now mention the advantages and disadvantages of NLP.
Advantages
(i) An NLP-based service costs less than employing a person.
(ii) NLP can also help businesses : it offers faster customer service response times.
(iii) Pre-trained learning models are available for developers to facilitate different NLP tasks.
(iv) Natural Language Processing is the practice of teaching machines to understand and interpret human language.
(v) NLP can be used to establish communication channels between humans and machines.
(vi) The different implementations of NLP can help businesses and individuals save time.
Disadvantages
(i) If the model needs to be developed without the use of a pre-trained model, it can take weeks before achieving a high level of performance.
(ii) There is always a possibility of errors in predictions and results that need to be accounted for.
The ways to type in Indian regional languages are:
(i) Using phonetic fonts : Download the phonetic keyboard in the Indian regional language.
(ii) Padma Plugin : Padma is a technology for transforming Indic text between public and proprietary formats. The technology currently supports Telugu, Tamil, Devanagari (including Marathi), Malayalam, Gujarati, Bengali and Gurmukhi.
Padma's goal is to bridge the gap between closed and open standards until the day Unicode support is widely available on all platforms. Padma automatically transforms Indic text encoded in proprietary formats into Unicode.
It is comparatively easier for computers to process data represented in English through standard ASCII codes than in other natural languages. But building the machine capability of understanding other natural languages is arduous and is carried out using various techniques.
Nowadays the internet is no more monolingual; the contents of the other regional languages are growing rapidly. According to the 2001 census there are ...
Named Entity Recognition (NER).
Machine translation is inter-lingual communication, where the machine translates text from the source language into the target language.
Pre-processing (tokenisation, morphological analysis, machine transliteration) → Feature extraction → ML algorithm (e.g. a neural network) → Post-processing → Evaluation
Fig. 1.19.1
Fig. 1.19.2
Tokenization
In natural language processing applications, the raw text initially undergoes a process called tokenization. In this process, the given text is tokenized into lexical units, and these are the basic units for further processing.
(a) Sentence-level tokenization deals with challenges like sentence-ending detection and sentence-boundary ambiguity.
(b) In word-level tokenization, words are the lexical units; hence the whole text is split into words, which are used in most processing applications.
(c) The n-gram tokenization produces tokens of n words, where 'n' indicates the number of words per token. If n = 1, then the lexical unit is called a unigram; similarly, if n = 2 the lexical unit is a bigram, and if n = 3, it is a trigram. For n-gram tokenization (n ≥ 2), to satisfy the n words in the tokens, there will be overlapping of terms in the tokens.
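The word-level and n-gram tokenization in (b)-(c) can be sketched in plain Python (a toy whitespace tokenizer; real systems also handle punctuation and sentence boundaries):

```python
def word_tokenize(text):
    """Word-level tokenization: split the raw text on whitespace."""
    return text.split()

def ngram_tokenize(text, n):
    """n-gram tokenization: overlapping windows of n consecutive words."""
    words = word_tokenize(text)
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = 'a very heavy orange book'
print(ngram_tokenize(text, 1))   # unigrams
print(ngram_tokenize(text, 2))   # bigrams: note the overlapping terms
```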
Machine transliteration
In natural language processing, machine transliteration plays a vital role in applications like cross-language machine translation, named entity recognition, information retrieval etc.
Transliteration is a process of converting a word or character from the source language's alphabetical system to the target language's alphabetical system, without losing the phonetics of the source language's word or character.
Before transliteration, words are divided into syllabic units using Unicode and character encoding standards. Then each of the syllabic units of a word gets converted to the target language.
For example (Hindi → English), each syllabic unit of a Hindi word is mapped to its English (Latin-script) equivalent.
Chapter Ends...
UNIT II
Language Syntax & Semantics
CHAPTER 2
Syllabus
Morphological Analysis : What is Morphology?, Types of Morphemes, Inflectional morphology & Derivational morphology, Morphological parsing with Finite State Transducers (FST)
Syntactic Analysis: Syntactic Representations of Natural Language,
Parsing Algorithms, Probabilistic context-freegrammars, and Statistical
parsing
Semantic Analysis : Lexical Semantics, Relations among lexemes & their senses - Homonymy, Polysemy, Synonymy, Hyponymy, WordNet, Word Sense Disambiguation (WSD), Dictionary based approach, Latent Semantic Analysis.
In English, morphemes are often, but not necessarily, words. Morphemes that can stand alone are considered 'roots' (such as the morpheme cat); other morphemes, called affixes, must attach to a root.
A derivational suffix makes a verb out of a noun, or an adjective out of a verb, etc. It is not always very regular and not very productive, but useful in various ways, especially in the formation of abstract nouns, esp. in the development of scientific registers.
Suffixes that form kinds of N's : -ness, -hood, -dom, -ling, e.g. likeness, likelihood (but not *likehood); verb- and noun-forming suffixes include -ize and -ing.
treat + -ing → treating ; treat + -ed → treated
In the above example,
(i) The inflectional morpheme -s is combined with the noun stem 'word' to create the plural 'words'.
(ii) The inflectional morpheme -ing is combined with the verb stem 'treat' to create the gerund 'treating'.
Inflectional morphemes do not change the essential meaning or the grammatical category of a word.
They signal grammatical information such as number, tense, comparison, or possession. Here are some examples of inflectional morphemes.
o Plural: Bikes, Cars, Trucks, Lions, Monkeys, Buses, Matches,
Classes
Possessive : Boy's, Girl's, Man's, Mark's, Robert's,
Samantha's,
Teacher's, Officer's
o Tense : cooked, played, marked, waited, watched, roasted, grilled; sang, drank, drove
o Comparison : Faster, Slower, Quicker, Weaker, Stronger, Sharper, Bigger, Taller, Higher, Shorter, Smaller
o Past participle (-en) : She has eaten.
word.
One of the most common ways to derive new words is to combine derivational affixes with root words (stems). The new words formed through derivational morphology may themselves serve as a stem for another affix.
(a) New words are derived from the root words in this type of morphology.
(b) It is less productive. That is, a morpheme added to a set of verbs to make new meaningful words cannot always be added to all verbs. For example, the morpheme that can be added to the base word 'summarise' cannot be added to all verbs to obtain a similar effect.
Example
Category | Stem | Affixes | Derived word | Target category
black + bird combine to form 'blackbird'; dis + connect combine to form 'disconnect'.
(2) Some more examples of English derivational patterns and their suffixes:
Derivation and Inflection
(i) Derivation may also be effected by formal means like reduplication, etc.
(ii) Derivational prefixes in English tend not to change the category, but they do add substantial meaning, for example creating negatives (unhappy, inconsequential).
Remark
Some words are difficult to decompose or seem to fall outside the regular rules. One must be able to distinguish between orthographic rules and morphological rules. For example, the word 'foxes' can be decomposed into 'fox' (the stem) and 'es' (a suffix indicating plurality).
FST is of a lot of use for morphological analysis. For languages for which there is plenty of training data available, FST is of less use.
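The foxes → fox + es decomposition can be sketched as a tiny rule-based plural parser, a toy stand-in for a finite state transducer (the rule list below is an assumption):

```python
# Toy morphological parser: undo common English plural spelling rules.
RULES = [
    ('ies', 'y', '+Pl'),   # ponies -> pony +Pl  (orthographic y -> ies)
    ('xes', 'x', '+Pl'),   # foxes  -> fox  +Pl  (e-insertion after x)
    ('s',   '',  '+Pl'),   # cats   -> cat  +Pl
]

def parse(word):
    """Return (stem, feature) for a plural noun, or (word, '+Sg') otherwise."""
    for suffix, replacement, feature in RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement, feature
    return word, '+Sg'

for w in ['foxes', 'ponies', 'cats', 'cat']:
    print(w, '->', parse(w))
```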
2.3.1 Orthographic Rules
Orthographic rules are general rules. They are used when breaking a word into its stem and modifiers. Consider an example : a singular English word ending with -y, when it is pluralised, ends with -ies.
2.3.2 Morphological Rules
Morphological rules are the exceptions to the orthographic rules used when breaking a word into its stem and modifiers; knowing these exceptions is important.
Example of lexicon
An example of a lexicon is YourDictionary.com. Another example of a lexicon is a set of medical terms.
Lexicon in NLTK
A lexicon is a vocabulary, a list of words, a dictionary. In NLTK, any lexicon is a lexical resource : a collection of words with associated information.
Importance of lexicon
Different language research suggests that the lexicon is representationally rich, and that it is the source of much productive behaviour. Its lexically specific information plays a critical and early role in the interpretation of grammatical structure.
Lexicon in Communication
A lexicon is the collection of words, or the internalised dictionary, that every speaker of a language has. It is also called lexis. Lexicon also refers to a stock of terms used in a particular subject or profession.
Lexicon in AI
In its simplest form, a lexicon is the vocabulary of a person, language, or branch of knowledge. It is the catalog of words, often used in conjunction with a grammar, the set of rules for the use of these words.
(i) Porter stemmer
The Porter stemmer has two major achievements : it is simple and fast, and the rules associated with suffix removal are much less complex in the case of the Porter stemmer.
Implementation of Porter stemmer
We follow the following steps:
Step (1) : Import NLTK, and from nltk.stem import PorterStemmer :
import nltk
from nltk.stem import PorterStemmer
Step (2) : Create a variable and store PorterStemmer in it : ps = PorterStemmer()
Step (3) : Use the Porter stemmer : print(ps.stem('bat')), print(ps.stem('batting'))
2.4.2 Difference between Stemming and Lemmatization
Stemming | Lemmatization
(i) Stemming is a process that stems or removes the last few characters from a word, often leading to incorrect meanings and spellings. | Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
(ii) For example, stemming the word 'Caring' would return 'Car'. | Lemmatizing the word 'Caring' would return 'Care'.
(iii) Stemming is used in the case of large datasets, where performance is an issue. | Lemmatization is computationally expensive, since it involves look-up tables.
Symbol table
Fig. 2.5.1
(ii) To recover from a frequently recurring error so that the rest of the program may be processed.
(iii) To make a parse tree.
(iv) To make a symbol table.
(v) To create intermediate representations (IR).
Parsing is defined as follows : it is the process of analyzing the strings of symbols in a natural language conforming to the rules of a formal grammar.
Input → Lexical analyser → (token / get next token) → Parser → Symbol table
Fig. 2.6.1
We can understand the relevance of parsing in NLP with the help of following
points
(iv) Parser is used to create symbol table, and that plays an important role in
NLP
(v) Parser is also used to produce intermediate representations (IR).
Remark : The word 'parsing' originates from the Latin word 'pars', which means 'part'.
A shift-reduce parser parses the input by shifting symbols onto a stack and reducing them when a handle appears on top of the stack.
2.6.3 Constituency Parsing
Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents. These sub-phrases belong to a specific category of grammar, like NP (noun phrase) and VP (verb phrase). A terminal is tagged with a part-of-speech tag.
As an example, 'a cat' and 'a box beneath the bed' are noun phrases, while 'write a letter' and 'drive a car' are verb phrases.
We consider an example sentence : "I shot an elephant in my pajamas."
We mention the constituency parse tree graphically as:
S → NP (Pronoun 'I') + VP ; VP → Verb 'shot' + NP ('an' + Nominal : Noun 'elephant' + PP 'in my pajamas')    (1)
Fig. 2.6.2 (Contd...)
The entire sentence is broken down into sub-phrases till we have only terminal phrases remaining.
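The PP-attachment ambiguity of this sentence can be reproduced with NLTK's chart parser and a small hand-written CFG; the grammar below is the well-known example from the NLTK book (no corpus downloads are needed):

```python
import nltk

# CFG for "I shot an elephant in my pajamas": the PP can attach to the
# noun phrase (the elephant wears the pajamas) or to the verb phrase.
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
sentence = 'I shot an elephant in my pajamas'.split()
trees = list(parser.parse(sentence))
print(len(trees))      # two distinct parse trees
for tree in trees:
    print(tree)
```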
S → NP (Pronoun 'I') + VP ; VP → (Verb 'shot' + NP 'an elephant') + PP 'in my pajamas'    (2)
Fig. 2.6.2
Here NP stands for Noun-Phrase and VP stands for Verb-Phrase.
2.6.4 Dependency Parsing (DP)
First we make the concept of dependency parsing clear, so that we can compare it with constituency parsing.
Dependency parsing (DP) refers to the process of examining the dependencies between the phrases of a sentence in order to determine its grammatical structure. The process is based on the assumption that there is a direct relationship between each linguistic unit in a sentence : a sentence is divided into many sections, and these links are called dependencies.
Consider the sentence : "I prefer the morning flight through Denver."
prefer → nsubj → I
prefer → dobj → flight
flight → det → the
flight → nmod → morning
flight → nmod → Denver ; Denver → case → through
Fig. 2.6.3
The relationships between each linguistic unit, or phrase, in the sentence are expressed by directed arcs. The root of the tree, 'prefer', forms the head of the whole sentence.
A dependency tag indicates the relationship between two phrases. For example, the word 'flight' changes the meaning of the noun 'Denver' : here 'flight' is the pinnacle (head) and 'Denver' is the kid (dependent) of 'flight'. This is represented by nmod, which stands for the nominal modifier. This distinguishes the scenario for dependency between the two phrases, where one serves as the pinnacle and the other as the dependent.
2.6.5 Dependency Parsing V/s Constituency Parsing
If the main objective is to break a sentence into sub-phrases, it is ideal to implement constituency parsing. If the objective is discovering the dependencies between the phrases in a sentence, dependency parsing is the best.
Let us consider an example to note the difference between the two parse trees.
A constituency parse tree denotes the subdivision of a text into sub-phrases; the tree's non-terminals are different sorts of phrases, while the terminals are the words of the sentence.
S → Noun phrase 'John' + Verb phrase (Verb 'sees' + Noun phrase 'Bill')
Fig. 2.6.4
sees → Subject → John ; sees → Object → Bill
Fig. 2.6.5
One should choose the parser type that is closely related to the objective.
generate strings. The set of all strings that can be generated from a grammar is called the language of the grammar.
A grammar is in Chomsky Normal Form (CNF) if each rule is of the form
A → B C [with at most two non-terminal symbols on the R.H.S.]
or A → a [one terminal symbol on the R.H.S.]
The process involves filling the table with the solutions to the subproblems encountered in the bottom-up parsing process; cells are filled from left to right and from bottom to top.
      1      2      3      4      5
1   [1,1]  [1,2]  [1,3]  [1,4]  [1,5]
2          [2,2]  [2,3]  [2,4]  [2,5]
3                 [3,3]  [3,4]  [3,5]
4                        [4,4]  [4,5]
5                               [5,5]
In [i, j], i denotes the start index and j denotes the end index of the phrase.
Let us consider the phrase 'a very heavy orange book' : a (1) very (2) heavy (3) orange (4) book (5).
We fill up the table from left to right and bottom to top, according to the rules above.
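The table-filling procedure can be sketched as a compact CYK recognizer for a grammar in CNF. The toy grammar below for 'a very heavy orange book' is an assumption, not the book's own rules:

```python
from itertools import product

# Toy CNF grammar: each RHS (one terminal, or two non-terminals) maps
# to the set of LHS symbols that can produce it.
RULES = {
    ('a',): {'Det'}, ('very',): {'Adv'}, ('heavy',): {'Adj'},
    ('orange',): {'Adj'}, ('book',): {'Noun'},
    ('Adv', 'Adj'): {'Adj'},     # very heavy  -> Adj
    ('Adj', 'Noun'): {'Noun'},   # heavy book  -> Noun
    ('Det', 'Noun'): {'NP'},     # a ... book  -> NP
}

def cyk(words, start='NP'):
    """Return True iff `start` derives `words` under RULES (CYK algorithm)."""
    n = len(words)
    # table[i][j] holds the symbols deriving words[i..j] (0-based, inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(RULES.get((w,), set()))
    for span in range(2, n + 1):          # bottom-up: shorter spans first
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):         # every split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= RULES.get((b, c), set())
    return start in table[0][n - 1]

print(cyk('a very heavy orange book'.split()))
```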
G = (M, T, R, S, P)
where
(i) M is the set of non-terminal symbols,
(ii) T is the set of terminal symbols,
(iii) R is the set of production rules, S is the start symbol, and P is the set of probabilities on the production rules.
The PCFG model computes the total probability of all derivations that are consistent with a given sequence, based on some PCFG. This is equivalent to the probability of the PCFG generating the sequence, and it is a measure of the consistency of the sequence with the given grammar. Dynamic programming variants of the CYK algorithm find the Viterbi parse of a sequence for a given PCFG.
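Since a PCFG attaches a probability to every rule, the probability of a single derivation is just the product of the probabilities of the rules it uses. A minimal sketch with assumed toy probabilities:

```python
# Toy PCFG: probability of each rule, keyed by (LHS, RHS).
P = {
    ('S', ('NP', 'VP')): 1.0,
    ('NP', ('John',)): 0.5,
    ('NP', ('Bill',)): 0.5,
    ('VP', ('V', 'NP')): 1.0,
    ('V', ('sees',)): 1.0,
}

def derivation_probability(rules):
    """P(derivation) = product of the probabilities of the rules used."""
    prob = 1.0
    for rule in rules:
        prob *= P[rule]
    return prob

# One derivation of "John sees Bill":
deriv = [('S', ('NP', 'VP')), ('NP', ('John',)),
         ('VP', ('V', 'NP')), ('V', ('sees',)), ('NP', ('Bill',))]
print(derivation_probability(deriv))   # 1.0 * 0.5 * 1.0 * 1.0 * 0.5 = 0.25
```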
2.9.1 Basic Operations
(i) Shift : This involves moving symbols from the input buffer onto the stack.
(ii) Reduce : If the handle appears on top of the stack, then its reduction by using the appropriate production rule is done. It means that the RHS of a production rule is popped out of the stack and the LHS of the production rule is pushed onto the stack.
(iii) Accept : If only the start symbol is present in the stack and the input buffer is
empty, then the parsing action is called accept.
When accepted action is obtained, it implies that successful parsing is done.
(iv) Error : This is the situation where the parser can neither perform a shift action, nor a reduce action, nor even an accept action.
Conflicts in Shift-Reduce Parsing
In shift-reduce parsing, there are two types of conflicts:
(i) Shift-reduce (SR) conflict, and
(ii) Reduce-reduce (RR) conflict.
For example, if a programming language contains a reserved word 'while', only one entry for 'while' can exist in the state. A shift-reduce conflict is caused when the system does not know whether to 'shift' or to 'reduce' for a given token.
2.9.5 Example
Ex. 2.9.1 : Consider the grammar E → 2 E 2, E → 3 E 3, E → 4. Perform shift-reduce parsing for the input string "32423".
Soln. :
Stack | Input Buffer | Parsing Action
$ | 32423 $ | Shift
$ 3 | 2423 $ | Shift
$ 3 2 | 423 $ | Shift
$ 3 2 4 | 23 $ | Reduce by E → 4
$ 3 2 E | 23 $ | Shift
$ 3 2 E 2 | 3 $ | Reduce by E → 2 E 2
$ 3 E | 3 $ | Shift
$ 3 E 3 | $ | Reduce by E → 3 E 3
$ E | $ | Accept
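The trace above can be automated with a small shift-reduce recognizer written specifically for this grammar (a sketch; a real parser would consult a parsing table rather than match handles greedily):

```python
def shift_reduce(tokens):
    """Greedy shift-reduce recognizer for the grammar E -> 2E2 | 3E3 | 4."""
    stack = []
    for tok in tokens:                # Shift: move a symbol onto the stack
        stack.append(tok)
        reduced = True
        while reduced:                # Reduce while a handle is on top
            reduced = False
            if stack[-1:] == ['4']:
                stack[-1:] = ['E']
                reduced = True
            elif stack[-3:] in (['2', 'E', '2'], ['3', 'E', '3']):
                stack[-3:] = ['E']
                reduced = True
    return stack == ['E']             # Accept iff only the start symbol remains

print(shift_reduce('32423'))   # accepted, as in the trace
print(shift_reduce('32424'))   # error
```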
Worst-case performance : O(n³); O(n²) for unambiguous grammars
Best-case performance : O(n) for all deterministic context-free grammars
Input position 0 is the position prior to input. Input position n is the position after accepting the nth token.
For every input position, the parser generates a state set. Each state is a tuple (X → α • β, i), consisting of
(i) the production currently being matched (X → α β),
(ii) the current position in that production (represented by the dot),
(iii) the position i in the input at which the matching of this production began : the origin position.
The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then repeatedly executes three operations : prediction, scanning, and completion.
(i) Prediction : For every state in S(k) of the form (X → α • Y β, j), where j is the origin position as above, add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side.
(ii) Scanning : If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k + 1).
(iii) Completion : For every state in S(k) of the form (Y → γ •, j), find all states in S(j) of the form (X → α • Y β, i) and add (X → α Y • β, i) to S(k).
The algorithm accepts if a state (X → γ •, 0) ends up in S(n), where (X → γ) is the top-level rule and n is the input length; otherwise it rejects.
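The three operations can be sketched as a compact Earley recognizer (the toy grammar is an assumption; states are (lhs, rhs, dot, origin) tuples as described above):

```python
# Minimal Earley recognizer; the toy grammar below is illustrative.
GRAMMAR = {
    'S': [['NP', 'VP']],
    'NP': [['john'], ['bill']],
    'VP': [['sees', 'NP']],
}

def earley(words, start='S'):
    n = len(words)
    # A state is (lhs, rhs, dot, origin): rule lhs -> rhs, dot position, origin.
    S = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        S[0].add((start, tuple(rhs), 0, 0))
    for k in range(n + 1):
        changed = True
        while changed:                    # prediction/completion to a fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(S[k]):
                if dot < len(rhs) and rhs[dot] in GRAMMAR:      # Prediction
                    for prod in GRAMMAR[rhs[dot]]:
                        new = (rhs[dot], tuple(prod), 0, k)
                        if new not in S[k]:
                            S[k].add(new); changed = True
                elif dot == len(rhs):                           # Completion
                    for l2, r2, d2, o2 in list(S[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in S[k]:
                                S[k].add(new); changed = True
        if k < n:                                               # Scanning
            for lhs, rhs, dot, origin in list(S[k]):
                if dot < len(rhs) and rhs[dot] == words[k]:
                    S[k + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in S[n])

print(earley(['john', 'sees', 'bill']))
print(earley(['john', 'sees']))
```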
2.11 PREDICTIVE PARSER
Input string : a + b $. Here a, +, b are terminal symbols and the input string is terminated by $.
The predictive parser consists of a stack, a parsing program, a parsing table M and an output.
Fig. 2.11.1 : Predictive parser
(i) Action : The parsing program takes various actions depending on the symbol on the top of the stack and the current input symbol.
(ii) Output : The predictive parsing table M.
Method : For each production A → α of grammar G :
(i) For each terminal a in FIRST(α), add A → α to M[A, a].
(ii) If ε is in FIRST(α), and b is in FOLLOW(A), then add A → α to M[A, b].
Non-terminal symbols index the rows of the table, e.g. M(B, b), M(C, +).
Fig. 2.11.2
Each cell is indexed by a non-terminal and a terminal symbol.
Disadvantages
(i) It is inherently a recursive parser, so it consumes a lot of memory as the stack grows.
(ii) Doing optimisation may not be as easy as with table-driven parsers.
2.12 INTRODUCTION TO SEMANTIC ANALYSIS
Semantic analysis is the process of finding the meaning from text. It can direct computers to understand and interpret sentences, paragraphs, or whole documents by analysing their grammatical structure and identifying the relationships between the individual words of the sentence in a particular context. Thus the aim of semantic analysis is to draw the exact meaning from the text.
The purpose of a semantic analyser is to check the text for meaningfulness, i.e. its meaning or dictionary meaning. The most important task of semantic analysis is to get the proper meaning of the sentence. For example, consider the sentence "Govind is great". In this sentence, the speaker is talking either about Lord Govind or about a person whose name is Govind.
2.12.4 Steps to be Carried out in Syntactic Analysis
(1) Segmentation I : Identify clause boundaries and word boundaries.
(2) Classification I : Determine parts of speech.
(3) Segmentation II : Identify constituents.
(4) Classification II : Determine the syntactic categories for the constituents.
(5) Determine the grammatical relations.
In the representation of the meaning of words, the following building blocks are used : entities (for example, Haryana, Kejriwal), concepts, relations and predicates. The following approaches are used for meaning representation:
(i) Frames
(ii) Conceptual dependency (CD)
(iii) Rule-based architecture
(iv) Case Grammar
(v) Conceptual Graphs
2.13.3 Need of Meaning Representations
We mention below the reasons to show the need of meaning representation:
(i) Linking of linguistic elements to non-linguistic elements can be done using meaning representation.
All the words, sub-words etc. are collectively called lexical items. The steps involved in lexical semantics are as follows:
(i) Classification of lexical items like words, sub-words, affixes etc. is performed in lexical semantics.
(ii) Decomposition of lexical items like words, sub-words, affixes etc. is performed in lexical semantics.
(iii) Analyse the differences and similarities between various lexical semantic structures.
The results can be used to find the relationships between the subject language and other languages which have undergone a similar analysis.
Corpora have not only been used for linguistics research; they have also been used to compile dictionaries (e.g. The American Heritage Dictionary of the English Language, 1969) and grammar guides, such as A Comprehensive Grammar of the English Language, published in 1985.
(i) Annotation : Annotation consists of the application of a scheme to texts. Annotation may include structural mark-up, part-of-speech tagging, parsing, and numerous other representations.
(ii) Abstraction
(iii) Analysis
Texts may be part-of-speech tagged (POS tagged). But even corpus linguists who work with 'unannotated plain text' apply some method to isolate salient terms.
Linguists with differing perspectives and other interests than the originator's can reuse a corpus.
GQ. What is a corpus approach?
The corpus approach utilizes a large and principled collection of naturally occurring texts as the basis for analysis; 'naturally occurring' is the defining characteristic of the corpus itself. One may work with a written corpus, a spoken corpus, an academic corpus, etc.
2.16.3 Corpus Linguistic Techniques
Specialized dictionaries include words in specialized fields, rather than the general vocabulary of a language.
They first identify concepts and then establish the terms used to designate them. In practice, the two approaches are used for both types.
There are other types of dictionaries that do not fit into the above distinction, for example bilingual (translation) dictionaries, dictionaries of synonyms (thesauri) and rhyming dictionaries. The word dictionary is usually meant to refer to a general-purpose monolingual dictionary.
There is also a difference between prescriptive and descriptive dictionaries. The prescriptive dictionary reflects what is seen as correct use of the language; the descriptive one reflects recorded actual use. Stylistic indications (e.g. 'informal' or 'vulgar') in many modern dictionaries are also considered by some to be less than objectively descriptive.
(i) A single-field dictionary covers one particular subject field (e.g. law); a multi-field dictionary covers several subject fields, e.g. a business dictionary.
(ii) A sub-field dictionary covers a more specialised field (e.g. constitutional law).
The 23-language Inter-Active Terminology for Europe is a multi-field dictionary, The American National Biography is a single-field dictionary, and The African American National Biography Project is a sub-field dictionary.
Another variant is the glossary, an alphabetical list of defined terms in a specialised field, such as medicine (a medical dictionary).
2.18.2 Statistics of BabelNet
BabelNet (version 5.0) covers 500 languages. It contains almost 20 million synsets and around 1.4 billion word senses. Version 5.0 also associates around 51 million images with Babel synsets and provides a Lemon RDF encoding of the resource, available via a SPARQL endpoint; 2.67 million synsets are assigned domain labels.
2.18.3 Applications
(i) Semantic relatedness
(1) In the first part of semantic analysis, the study of the meaning of individual words is performed. This part is called lexical semantics.
(2) In the second part, the individual words are combined to provide meaning in sentences.
(1) Hyponymy
It is defined as the relationship between a generic term and instances of that
generic term.
(3) Polysemy
'Polysemy' means 'many signs'; it is a Greek word. A polysemous word is used with different meanings. For example, the word 'bank' is a polysemous word with the following different meanings:
o A financial institution
(4) Homonymy : the word 'bank' (a river bank / a financial bank) is also the standard example of homonymy.
(5) Synonymy
It is a relation between two lexicalitems having different forms but
expressing
the same or a close meaning.
Examplesare 'author / writer', 'fate/ destiny'
(6) Antonymy
It is a relation between two lexical items possessing symmetry between their semantic components relative to an axis. Example : 'certitude / incertitude'.
(i) Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, the word 'walk' can be used as a noun or a verb.
For example, the sentence "The bike hit the pole when it was moving" has semantic ambiguity. The interpretation can be done as "The bike, while moving, hit the pole" and "The bike hit the pole while the pole was moving".
(iv) Anaphoric Ambiguity
This type of ambiguity arises due to the use of anaphora entities in discourse. For example : "The horse ran up the hill. It was very steep. It soon got tired."
For example, the sentence "I like you too" can have multiple interpretations, depending on the situation.
We mention below the various applications of WSD:
(i) WSD can be used in text mining and NLP fields.
(ii) WSD can also be used in Lexicography. Much of modern lexicography is corpus-based, and WSD provides rough sense indicators for text-processing tasks.
(iii) WSD can be used for Information Retrieval purposes.
Challenges in WSD:
(i) Different algorithms are to be formed for different applications, and that becomes a big challenge for WSD.
(ii) Words cannot be divided into discrete meanings; they have related meanings, and this causes a lot of problems.
The required information is stored in an empty knowledge-based system. This is known as the declarative approach.
In the procedural method, the required behaviour is converted directly into program code in an empty knowledge-based system. Compared to the declarative approach, it is a contrasting approach : here, coding is done while the system is designed.
2.21.1
Lesk Algorithm
The Lesk algorithm is based on the assumption that words in a given 'neighbourhood' will tend to share a common topic. In a simplified manner, the Lesk algorithm compares the dictionary definition of an ambiguous word with the terms contained in its neighbourhood.
Versions have been adapted to use wordnet. An implementation appears like
this:
1. For every sense of the word being disambiguated, one should count the number of words that are in both the neighbourhood of that word and in the dictionary definition of that sense.
2. The sense that is chosen is the sense with the largest count.
PINE
1. Kinds of evergreen tree with needle-shaped leaves.
2. Waste away through sorrow or illness.
cONE
1. Solid body which narrows to a point.
2.
Somethingof this shape whether solid or hollow.
3. Fruit of certain
evergreen trees.
We note that, the best intersection is
pine #lncone#3 =2.
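The two steps above can be sketched in plain Python. The glosses below are the toy PINE/CONE entries from the example; the dictionary layout and the function name are illustrative, not from any particular library.

```python
# A minimal sketch of the simplified Lesk algorithm using the toy
# PINE/CONE glosses from the example above.
GLOSSES = {
    "pine": {
        1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness",
    },
    "cone": {
        1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees",
    },
}

def simplified_lesk(word, context_gloss):
    """Pick the sense of `word` whose gloss overlaps most with the context."""
    context = set(context_gloss.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

# Disambiguate "cone" given the gloss of pine sense 1 as context:
sense, overlap = simplified_lesk("cone", GLOSSES["pine"][1])
print(sense, overlap)  # sense 3 wins with an overlap of 2
```

Here the winning overlap of 2 ("of" and "evergreen") reproduces the Pine#1 ∩ Cone#3 = 2 intersection noted above.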
A comparative evaluation has shown that the simplified Lesk algorithm can outperform the original definition of the algorithm, both in terms of precision and efficiency. Evaluating the disambiguation algorithms on the Senseval-2 English all-words data, they measure 58% precision using the simplified Lesk algorithm compared to only 42% under the original algorithm.
Drawbacks of the Lesk algorithm:
(i) The algorithm is very sensitive to the exact wording of definitions; the absence of a certain word can change the results considerably.
(ii) The algorithm determines overlaps only among the glosses of the senses being considered.
(iii) Dictionary glosses are fairly short and do not provide sufficient vocabulary to relate sense distinctions.
One remedy is to augment the glosses with words from the definitions of words appearing in those definitions.
2.22 DICTIONARIES FOR REGIONAL LANGUAGES
(1) Hindi is the official language of India, while English is the second official language. But there is no national language as per the constitution.
(2) The Oxford dictionary is one of the most famous English language dictionaries in the world. It has many extra features that augment it as a tool for language learners, like the ability to make notes on definitions, a flashcard learning system, spelling help and a great thesaurus.
(3) Hindi is the official language. In addition, the constitution recognises 22 official regional languages as scheduled languages, which do not include English.
(4) The Sanskrit language is the oldest language in India; it has been spoken since 5000 years before Christ. Sanskrit is still one of the official languages of India. But, in the present time, Sanskrit has become a language of worship and ritual instead of a language of speech.
(5) There are 22 official regional languages in India. They are: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santhali, Sindhi, Tamil, Telugu and Urdu.
(6) The youngest language in India is Malayalam. It belongs to the Dravidian language group and is considered the smallest and the youngest language of the Dravidian language group. The Government of India declared this language 'the classical language of India' in 2013.
(7) Currently six languages enjoy the 'classical status': Tamil (declared 2004), Sanskrit (2005), Kannada (2008), Telugu (2008), Malayalam (2013) and Odia (2014).
(v) The underlying idea that 'a word is characterised by the company it keeps' was popularised here also.
Distributional structure
The distribution of an element will be understood as the sum of all its environments. An environment of an element A is an existing array of its co-occurrents, i.e. the other elements, each in a particular position, with which A occurs to yield an utterance.
(vi) Distributional properties
There are three basic properties of a distribution: location, spread and shape.
The location refers to the typical value of the distribution, such as the mean. The spread of the distribution is the amount by which the smaller values differ from the larger ones.
Topic modelling is recognising the words from the topics present in the document or the corpus of data. This is simpler than processing the entire document, and this is how topic modelling has come up to solve the problem and also to visualise things better.
Assumptions of LSA:
(i) Words which are used in the same context are analogous to each other.
(ii) The hidden semantic structure of the data is unclear due to the ambiguity of the words chosen.
Singular Value Decomposition (SVD)
SVD is the statistical method that is used to find the latent (hidden) semantic structure of words spread across the documents.
Let C be the collection of documents and d the number of documents.
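The use of SVD to expose latent semantic structure can be sketched with NumPy on a tiny term-document matrix. The matrix C below is invented for illustration: rows are terms, columns are documents, and the counts are chosen so that two "topics" are visible.

```python
import numpy as np

# Hypothetical term-document count matrix: rows = terms, cols = documents.
C = np.array([
    [2, 1, 0, 0],   # "bike"
    [1, 2, 0, 0],   # "ride"
    [0, 0, 2, 1],   # "stock"
    [0, 0, 1, 2],   # "market"
], dtype=float)

# SVD factors C = U @ diag(s) @ Vt, with singular values s sorted in
# decreasing order.  Truncating to the top k singular values gives a
# rank-k "latent semantic" approximation of C.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2  # keep the two strongest latent dimensions
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s, 3))    # singular values, largest first
print(np.round(C_k, 2))  # rank-2 reconstruction of C
```

Terms that occur in the same documents ("bike"/"ride", "stock"/"market") end up close together in the truncated space, which is the analogy assumption of LSA stated above.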
2.26 SELF-LEARNING TOPICS
Dictionaries can be implemented, for instance, as simple word lists, usually stored in binary trees, tries, hash tables and so on, which make lookup and search quick. Because the set of associations between word forms and their descriptions is declared by plain enumeration, the coverage of such a dictionary is finite and its generative potential is limited.
Annotation 1:
Type: Countries
Base form: Sri Lanka
id: 1
begin: 0
end: 9
covered text: Sri Lanka

Annotation 2:
Type: Countries
Base form: Sri Lanka
id: 1
begin: 20
end: 26
covered text: Ceylon

Annotations include the following features:
id : the identifier of the annotation.
begin : the offset that indicates the beginning of the covered text.
end : the offset that indicates the end of the covered text.
covered text : the variant of the base form that is found in the text.
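Such stand-off annotations can be sketched in Python: each annotation stores character offsets into the text rather than the text itself. The sentence below is invented so that its offsets line up with the two example annotations.

```python
# A hypothetical sentence whose offsets match the annotations above:
# "Sri Lanka" occupies [0:9] and "Ceylon" occupies [20:26].
text = "Sri Lanka was named Ceylon"

annotations = [
    {"type": "Countries", "base_form": "Sri Lanka", "id": 1, "begin": 0,  "end": 9},
    {"type": "Countries", "base_form": "Sri Lanka", "id": 1, "begin": 20, "end": 26},
]

for ann in annotations:
    covered = text[ann["begin"]:ann["end"]]  # the variant found in the text
    print(ann["base_form"], "->", covered)
```

Both annotations map back to the same base form, "Sri Lanka", even though the covered text differs.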
Traversing the network from the set of initial states to the set of final states yields the strings of the language.
2. Morphological Alternations
The shape of a morpheme often depends on the environment: pity is realised as piti in the context of less (pitiless), die as dy in dying.
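The two alternations just mentioned can be sketched as ordinary rewrite rules in Python; this is only an illustrative stand-in for a real transducer, and the function name is hypothetical.

```python
# A toy sketch of the two spelling alternations described above,
# written as plain rewrite rules rather than a finite-state transducer.
def attach_suffix(stem, suffix):
    if stem.endswith("ie") and suffix == "ing":
        return stem[:-2] + "y" + suffix   # die + ing -> dying
    if stem.endswith("y") and suffix == "less":
        return stem[:-1] + "i" + suffix   # pity + less -> pitiless
    return stem + suffix                  # default: plain concatenation

print(attach_suffix("pity", "less"))  # pitiless
print(attach_suffix("die", "ing"))    # dying
```

A real lexical transducer encodes such context-dependent realisations as paths through a network, which the next section describes.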
In a transducer, each path from an initial state to a final state represents a pair of a lexical string and a surface string.
Fig. 2.26.1 : A path in a transducer for English
In the standard notation, the symbol pairs along the path in Fig. 2.26.1 are written out in full where the two symbols are distinct; identical pairs such as b:b are reduced to a single symbol.
Lexical transducers may contain hundreds of thousands, even millions, of states, and an infinite number of paths in the case of languages such as German, where the arcs allow noun compounds of any length.
The regular expressions from which such complex networks are compiled include high-level operators. These are developed to make it possible to describe the alternations and constraints that are commonly found in natural languages, and to present these in a convenient and perspicuous way.
Basic Expression Calculus
The notation used here comes from the Xerox finite-state calculus.
We use uppercase letters A, B, etc. as variables over regular expressions. Lowercase letters a, b, etc. stand for symbols. There are two special symbols: 0, the epsilon symbol, that stands for the empty string, and ?, the any symbol.
Complex regular expressions can be built up from simpler ones by means of regular expression operators. Square brackets, [ ], are used for grouping expressions.
(i) Because both regular languages and regular relations are closed under concatenation and union, the following basic operators can be combined with any kind of regular expression:
A | B   Union
A B     Concatenation
(A)     Optionality; equivalent to [A | 0]
A+      Iteration; one or more concatenations of A
A*      Kleene star; equivalent to (A+)
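These closure-safe operators have direct counterparts in Python's `re` syntax, which can serve as a quick sketch (Python uses `?` for optionality where the calculus above writes parentheses):

```python
import re

# The basic operators mapped onto Python regular-expression syntax:
# A|B (union), AB (concatenation), A? (optionality), A+ (iteration),
# A* (Kleene star).
patterns = {
    "union":         r"cat|dog",  # A | B
    "concatenation": r"ab",       # A B
    "optionality":   r"ab?",      # (A), i.e. [A | 0]
    "iteration":     r"ab+",      # A+, one or more
    "kleene_star":   r"ab*",      # A*, zero or more
}

assert re.fullmatch(patterns["union"], "dog")
assert re.fullmatch(patterns["concatenation"], "ab")
assert re.fullmatch(patterns["optionality"], "a")    # the b is optional
assert re.fullmatch(patterns["iteration"], "abbb")   # one or more b
assert re.fullmatch(patterns["kleene_star"], "a")    # zero or more b
print("all basic operators matched")
```

Note that Python's `re` handles only regular languages, so it has no counterpart for the relation operators (cross product, composition) discussed below.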
(ii) Regular languages are closed under complementation, subtraction and intersection, but regular relations are not. Hence, the following operators can be combined only with a regular language:
~A      Complement
\A      Term complement; all single-symbol strings not in A
A & B   Intersection
A - B   Subtraction (minus)
(iii) Regular relations can be constructed by means of two basic operators:
A × B   Cross product
A ∘ B   Composition
The cross-product operator, ×, is used only with expressions that denote a regular language: it constructs a relation between them. [A × B] designates the relation that maps every string of A to every string of B. The notation a:b is a convenient shorthand for [a × b].
Remarks
These regular-expression operators constitute a kind of high-level programming formalism for strings, languages and relations, which is clean and declarative, and which can be used to increase application speed.
2.26.4 Noisy Channel Models
(1) The noisy channel model is a framework used in natural language processing (NLP) to identify the correct word in situations where it is unclear. The framework helps detect the intended words for spell checkers, virtual assistants, question answering systems, speech-to-text software and machine translation programs.
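For spelling correction, the noisy channel model picks the candidate w that maximises P(w) · P(x | w), where x is the observed (possibly misspelled) word, P(w) is a language model and P(x | w) an error model. A minimal sketch, with both probability tables invented purely for illustration:

```python
# Toy noisy-channel spelling correction.  The prior (language model)
# and channel (error model) probabilities below are invented numbers.
prior = {"the": 0.07, "then": 0.01, "than": 0.008}   # P(w)
channel = {                                          # P(x | w)
    ("teh", "the"): 0.6,
    ("teh", "then"): 0.05,
    ("teh", "than"): 0.01,
}

def correct(observed, candidates):
    """Return the candidate maximising P(w) * P(observed | w)."""
    return max(candidates,
               key=lambda w: prior[w] * channel.get((observed, w), 0.0))

print(correct("teh", ["the", "then", "than"]))  # -> the
```

Real systems estimate P(w) from a large corpus and P(x | w) from counts of observed edit errors, but the argmax decision rule is the same.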
(2) Difference Between Noisy Channel and Noiseless Channel
The capacity of a noiseless channel is numerically equal to the rate at which it communicates binary digits. The capacity of a noisy channel is less than this, because it is limited by the amount of noise in the channel.
SNR(dB) = 10 log10(SNR)
For SNR = 1000:
SNR(dB) = 10 log10(1000) = 10 [3 log10(10)] = 10 (3 × 1) = 30 dB
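The decibel conversion above is a one-liner to check in Python:

```python
import math

# Converting a signal-to-noise power ratio of 1000 to decibels,
# as in the worked calculation above.
snr = 1000
snr_db = 10 * math.log10(snr)
print(round(snr_db, 1))  # 30.0
```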
With these channel characteristics, the channel can never transmit much more than 13 Mbps, no matter how many or how few signal levels are used and no matter how often or how infrequently samples are taken.
Example: a telephone line normally has a bandwidth of 3000 Hz (300 to 3300 Hz) assigned for data communication.
Noiseless Channel: An idealistic channel in which no frames are lost, corrupted or duplicated. The protocols in this category do not implement error control.
(3) The minimum edit distance between two strings is defined as the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another.
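This definition leads to the standard dynamic-programming computation, sketched here with unit costs for all three operations:

```python
# Minimum edit (Levenshtein) distance by dynamic programming.
# d[i][j] = distance between the prefixes s[:i] and t[:j].
def min_edit_distance(s, t):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(min_edit_distance("intention", "execution"))  # 5
```

With unit costs, transforming "intention" into "execution" takes 5 operations, a classic worked example for this algorithm.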