
UNIT I

CHAPTER 1: Introduction to Natural Language Processing

Syllabus

Introduction to Natural Language Processing, Why NLP is hard?, Programming languages vs Natural languages, Are natural languages regular?, Finite automata for NLP, Stages of NLP, Challenges and Issues (Open Problems) in NLP.

Basics of text processing: Tokenization, Stemming, Lemmatization, Part of Speech Tagging.

1.1 ORIGIN AND HISTORY OF NLP


Natural language processing (NLP) is part of everyday life, and it is essential to our lives at home and at work. We can send voice commands to our home assistants, our smartphones, etc.

Voice-enabled applications such as Alexa, Siri, and Google Assistant use NLP to answer our questions. They can add activities to our calendars and call the contacts that we mention in our voice commands.

NLP has made our lives easier. But more than that, it has revolutionised the way we work, live and play.
Communication is an important act that an agent can perform to exchange information with the environment. Communication can be carried out by producing and perceiving certain signs drawn from a shared system of conventional signs.

In a partially observable world, communication can help agents learn information that is observed or inferred by others. This information can make an agent more successful.


Language is meant for communicating about the world. By studying language, we can come to understand more about the world. We can test our theories about the world by how well they support our attempt to understand language. And, if we can succeed at building a computational model of language, we will have a powerful tool for communicating about the world. In this chapter, we look at how we can exploit knowledge about the world, in combination with linguistic facts, to build computational natural language systems.

Natural Language Processing (NLP) refers to an AI method of communicating with intelligent systems using a natural language such as English.

Processing of natural language is required when you want an intelligent system, like a robot, to perform as per your instructions, when you want to hear a decision from a dialogue-based clinical expert system, etc.


The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be speech or written text.

Natural language understanding is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension.

The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.

1.2 OVERVIEW OF NLP TASKS


GQ. Give general approaches to natural language processing. OR Write a short note on NLP.

Natural language processing (NLP) is the ability of a computer program to understand human speech as it is spoken. NLP is a component of artificial intelligence (AI).
The development of NLP applications is challenging because computers traditionally require humans to "speak" to them in a programming language that is precise, unambiguous and highly structured, or perhaps through a limited number of clearly enunciated voice commands. Human speech, however, is not always precise; it is often ambiguous, and the linguistic structure can depend on many complex variables, including slang, regional dialects and social context.


Current approaches to NLP are based on machine learning, a type of artificial intelligence that examines and uses patterns in data to improve a program's own understanding. Most of the research being done on natural language processing revolves around search, especially enterprise search.

Common NLP tasks in software programs today include:
(1) Sentence segmentation, part-of-speech tagging and parsing.
(2) Deep analytics.
(3) Named entity extraction.
(4) Co-reference resolution.

The advantage of natural language processing can be seen when considering the following two statements: "Cloud computing insurance should be part of every service level agreement" and "A good SLA ensures an easier night's sleep, even in the cloud."

If you use natural language processing for search, the program will recognize that cloud computing is an entity, that cloud is an abbreviated form of cloud computing, and that SLA is an industry acronym for service level agreement.

The ultimate goal of NLP is to do away with computer programming languages altogether. Instead of specialized languages such as Java or Ruby or C, there would only be "human".

1.3 EVOLUTION OF NLP SYSTEMS


GQ. Discuss the evolution of NLP systems. OR Give a brief history of NLP.

History of NLP: The work related to NLP started with machine translation (MT) in the 1950s. It was Alan Turing who proposed what today is called the Turing test in 1950. It tests the ability of a machine program to hold a written conversation with a human.

This program should be written so well that one would find it difficult to determine whether the conversation is with a machine or actually with another person. During the same period, work on cryptography and language translation took place. Later on, syntactic structures came up along with linguistics. Further, sentences were considered with knowledge augmentation and semantics. In the 1960s, ELIZA (the most famous early NLP system) was developed and gained popularity.

It was the simulation of a psychotherapist. At a much later stage, case grammars came up. Now, there has been a complete revolution in the NLP approaches coming up. Many NLP systems with machine learning have been developed till today, and many competitions based on the Turing test are being organized.

GQ. What is pragmatic analysis in natural language processing?

Pragmatics has not been the central concern of most NLP systems. Only after ambiguities arise at the syntactic or semantic level are the context and purpose of the utterance considered for analysis. Consider a problem in which pragmatics has been used in this kind of "support" capacity: resolving ambiguous noun phrases.

1.3.1 Components of NLP

There are two components of NLP:

(1) Natural language understanding: Mapping the given input in the natural language into a useful representation. Different levels of analysis are required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis.

(2) Natural language generation: Producing output in the natural language from some internal representation. Different levels of synthesis are required: deep planning (what to say) and syntactic generation.

NL understanding is much harder than NL generation. But still, both of them are hard.

Planning: Planning problems are hard problems; they are certainly nontrivial. Methods which focus on ways of decomposing the original problem into appropriate subparts, and on ways of handling interactions among the subparts during the problem-solving process, are often called planning. Planning refers to the process of computing several steps of a problem-solving procedure before executing any of them.

1.3.2 Major Methods of NLP Analysis


There are several main techniques used in analyzing natural language processing. Some of them can be briefly described as follows (a minimal Python sketch of the first technique appears after this list).

1. Pattern matching: The idea of this approach to natural language processing is to interpret input utterances as a whole, rather than building up their interpretation by combining the structure and meaning of words or other lower-level constituents. That means the interpretations are obtained by matching patterns of words against the input utterance. For a deep level of analysis in pattern matching, a large number of patterns are required even for a restricted domain. This problem can be ameliorated by hierarchical pattern matching, in which the input is gradually canonicalized through pattern matching against sub-phrases. Another way to reduce the number of patterns is by matching with semantic primitives instead of words.
2. Syntactically driven parsing: Syntax means the ways that words can fit together to form higher-level units such as phrases, clauses and sentences. Syntactically driven parsing therefore means that interpretations of larger groups of words are built up out of the interpretations of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is done as a whole. Syntactic analyses are obtained by applying a grammar that determines which sentences are legal.

3. Semantic grammars: Natural language analysis based on a semantic grammar is a bit similar to syntactically driven parsing, except that in a semantic grammar the categories used are defined semantically as well as syntactically; semantics is thus directly involved in the grammar.

4. Case frame instantiation: Case frame instantiation is one of the major parsing techniques under active research today. It has some very useful computational properties, such as its recursive nature and its ability to combine bottom-up recognition of key constituents with top-down instantiation of less structured constituents.
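The pattern-matching approach from item 1 can be illustrated in a few lines of Python. This is only a minimal sketch: the two patterns, their interpretation templates and the example utterances are illustrative assumptions, not taken from any particular system.

    # Whole-utterance pattern matching: each pattern maps directly to an
    # interpretation, with no syntactic analysis of the constituents.
    import re

    PATTERNS = [
        (r"what is the capital of (\w+)\??", "QUERY capital-of {0}"),
        (r"call (\w+)", "ACTION dial-contact {0}"),
    ]

    def interpret(utterance):
        for pattern, template in PATTERNS:
            m = re.fullmatch(pattern, utterance.lower())
            if m:                                   # matched the utterance as a whole
                return template.format(*m.groups())
        return None                                 # no pattern covers this utterance

    print(interpret("What is the capital of France?"))  # QUERY capital-of france
    print(interpret("Call Ravi"))                       # ACTION dial-contact ravi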

1.4 NLP IS HARD

Natural language processing is considered a difficult problem in computer science. It is the nature of human language that makes NLP hard. The rules that dictate the passing of information using natural languages are not easy for computers to understand.


There are several factors that make this process hard. For example, there are hundreds of natural languages, each of which has different syntax rules, and words can be ambiguous, with their meaning dependent on their context.

Natural languages, such as English or Spanish, cannot be characterized as a definite set of sentences. Everyone agrees that "Not to be invited is sad" is a sentence of English, but people disagree on the grammaticality of "To be not invited is sad".

Therefore it is more fruitful to define a natural language model as a probability distribution over sentences rather than as a definitive set.

Hence, instead of asking whether a string of words is or is not a member of the set defining the language, we had better ask for P(S = words), i.e., what is the probability that a random sentence would be this string of words.

Natural languages are ambiguous: "He saw her duck" can mean either that he saw a waterfowl belonging to her, or that he saw her lowering herself to evade something. This implies that we cannot speak of a single meaning for a sentence, but rather of a probability distribution over possible meanings.

The English language is also not phonetically sound. Once George Bernard Shaw asked an audience, "What is the meaning of the word 'Ghoti'?" The answer was, "There is no such word in English". Shaw said, "Yes, there is; Ghoti means fish." The whole audience was shocked. But then he explained: "'Enough' is a word in English, but we pronounce 'gh' as 'f'. 'Women' is a word in English, but we pronounce 'o' as 'i'. 'Nation' is a word in English, but we pronounce 'ti' as 'sh'. Therefore Ghoti means Fish."

Words can be ambiguous, where their meaning is dependent on their context. Here, we study a few more of the significant problems.
At the character level, there are several factors that need to be considered. For example, the encoding scheme used for a document must be considered: text can be encoded using schemes such as UTF-16 or Latin-1. Another factor to be considered is whether the text should be treated as case-sensitive or not. Special processing is required for punctuation and numbers.

Sometimes we have to consider the use of emoticons (character combinations and special character images), hyperlinks, repeated punctuation, file extensions, and usernames with embedded periods.


When we tokenize text, it means that we are breaking up the text into a sequence of words. These words are called tokens, and the process is called tokenization. A minimal sketch follows.
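The sketch below uses a simple regular-expression tokenizer; real systems use more elaborate rules (for abbreviations, numbers, emoticons, etc.), so treat this as a minimal illustration.

    # Break text into word and punctuation tokens with a regular expression.
    import re

    def tokenize(text):
        # \w+ grabs runs of word characters; [^\w\s] keeps punctuation marks
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("Arjun went back home."))
    # ['Arjun', 'went', 'back', 'home', '.']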

With a language like Chinese, tokenization can be quite difficult, since it uses unique symbols for words. Words and morphemes are assigned a part-of-speech label identifying what type of unit each one is.

A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. We also consider synonyms, acronyms, abbreviations and spellings when we work with words.



We apply another task, called stemming. Stemming is the process of finding the word stem of a word. For example, words such as "running", "runs" and "run" have the word stem "run".

Lemmatization is a more refined process than stemming and uses vocabulary and morphological techniques to find a lemma. This process determines the base form of a word, called its lemma. For example, for the word "operating", the word stem is "oper" but the lemma is "operate". Lemmatization results in more precise analysis in some situations; a small sketch follows below.

Words are combined into phrases and sentences. Sentence detection can be problematic and is not as simple as looking for the periods at the end of a sentence.
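The stemming/lemmatization contrast above can be sketched with the NLTK library, one common choice (the download call fetches the WordNet data the lemmatizer needs; exact outputs depend on the library version):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)   # data for the WordNet lemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("running"))                      # 'run'
    print(stemmer.stem("operating"))                    # 'oper'    (crude stem)
    print(lemmatizer.lemmatize("operating", pos="v"))   # 'operate' (the lemma)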

We need to understand which words in a sentence are nouns and which are verbs; this is part-of-speech tagging, sketched below.
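A minimal part-of-speech tagging sketch using NLTK, one common library choice (the download fetches its default English tagger model; the resource name may vary with the NLTK version):

    import nltk
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = ["The", "city", "is", "large", "but", "beautiful"]
    print(nltk.pos_tag(tokens))   # tag each token with a part of speech
    # e.g. [('The', 'DT'), ('city', 'NN'), ('is', 'VBZ'), ('large', 'JJ'), ...]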
We are also concerned with the relationships between words. For example, coreference resolution determines the relationship between certain words in one or more sentences.

Consider the sentence "The city is large but beautiful; it fills the entire valley". The word "it" is a coreference to "city". When a word has multiple meanings, we perform word sense disambiguation to determine the actual meaning.

Sometimes this is difficult to do. For example, "Arjun went back home." Does home refer to a house, a city, or some other unit? Its meaning can be inferred from the context in which it is used, for example, "Arjun went back home. It was situated at the end of Rasta Peth."

1.4.1 Performance of NLP

In spite of these difficulties, NLP performs these tasks reasonably well in most situations and also adds value in many problem domains. For example, sentiment analysis can be performed on customer tweets, resulting in possible free product offers for dissatisfied customers.

Medical documents can be summarised to readily highlight the relevant topics. Summarisation is the process of producing a short description of different units: multiple sentences, paragraphs, a document, or multiple documents. The content of the text is important in accomplishing this task.
Finally, natural languages are difficult to deal with because they are very large and constantly changing. Thus, our language models are an approximation. We begin with the simplest possible approximations and move up from there.


1.5 PROGRAMMING LANGUAGES VS NATURAL LANGUAGES

Natural language is the language spoken by people, while programming language is intended for machines. There are important similarities between both languages, such as the differentiation they make between syntax and semantics, their purpose of communication, and the existence of a basic composition. Both types were created to communicate ideas, expressions and instructions.

1.5.1 Difference between Programming Languages and Natural Languages

Sr. No. | Programming language | Natural language
(1) | Programming language is stricter and less tolerant. Programming languages have practically no redundancy; otherwise it would be very easy to fall into ambiguity and not indicate the correct command. | Natural language is not that strict and is somewhat tolerant. This is because human languages have a built-in redundancy that allows some ambiguity to be resolved using context.
(2) | Programming languages are stricter because computers are very precise; in addition, machines do not have the ability to clarify the meaning of an expression as a human being would. | In computing, natural language refers to a human language such as English, Russian, German or Japanese, as distinct from the typically artificial command language.

1.5.2 Main Features of Programming Languages

The popularity of a programming language depends on the features and utilities it provides to programmers. We mention below the features that a programming language must possess.

(i) Simplicity: The language must offer clear and simple concepts that facilitate learning and application, in a way that is simple to understand and maintain. Simplicity is a difficult balance to strike without compromising the overall capability.


(ii) Naturalness: It implies that its application must be done naturally, providing operators, structures and syntax for the area in which it is designed to work efficiently.

(iii) Abstraction: It is the ability to define and use complicated structures or operations while ignoring certain low-level details.
(iv) Efficiency: Programming languages must be translated and executed efficiently, so as not to consume too much memory or require too much time.

(v) Structuring: In order to avoid creating errors, the language allows programmers to write their code according to structured programming concepts.

(vi) Compactness: Using this characteristic, it is possible to express operations concisely, without having to write too many details.

(vii) Locality: It refers to the code concentrating on the part of the program with which one is working at a given time. In a programming language one usually talks to a computer; the term usually refers to a written language, but might also apply to spoken language.

(viii) Natural language is not to be used for programming: Natural language programming is not to be mixed up with natural language interfacing. In NLP the functionality of a program is organised only for the definition of the meaning of sentences.


(ix) Expertise: Programming languages need a high degree of expertise, because the computer cannot think outside the statement; completeness and precision are required, while in speech some minor errors are neglected.

(x) Replacement: Programming languages cannot be replaced by natural languages. Programming languages are like natural languages only in the sense of the "words we have in English". A key feature of a programming language is that, when one writes a program and executes it, it has a well-defined meaning, which is its behaviour.

1.5.3 Ontology-assisted NLP


NLP here is an ontology-assisted way of programming in terms of natural language sentences, e.g. English. A structured document with content, sections and subsections for explanations of sentences forms an NLP document, which is really a computer program.

A natural language program is first written and then communicated through natural language using an interface added on. In NLP the functionality of a program is organized only for the definition of the meaning of sentences.

For example, NLP can be used to represent all the knowledge of an autonomous robot. After that, its tasks can be scripted by its users so that the robot can execute them autonomously while keeping to prescribed rules of behaviour determined by the robot's users. Such robots are called transparent robots, because their reasoning is transparent to users, and this develops trust in robots.
Some methods for program synthesis are based on natural language programming.

1.5.4 Interpretation

The smallest unit of statement in NLP is a sentence. Each sentence is stated in terms of concepts from the underlying ontology; attributes in that ontology are named objects in capital letters. In an NLP text, every sentence unambiguously compiles into a procedure call in the underlying high-level programming language such as MATLAB, Octave, SciLab, Python, etc.

1.5.5 Software Paradigm

Natural-language programming is a top-down method of writing software. We mention below the stages.

(i) Definition of an ontology: a taxonomy of concepts needed to describe tasks in the topic addressed. Each concept and all its attributes are defined in natural language words. This ontology will define the data structures which the NLP can use in sentences.

(ii) Definition of one or more top-level sentences in terms of concepts from the ontology. These sentences are later used to invoke the most important activities in the topic.

(iii) Defining each of the top-level sentences in terms of a sequence of sentences.

(iv) Defining each of the lower-level sentences in terms of other sentences, or by a simple sentence of the form Execute code "...", where ... stands for code in terms of the associated high-level programming language.

(v) Repeating the previous step till no sentence is left undefined. During this process each of the sentences can be classified as belonging to a section of the document, to be produced in HTML or LaTeX format to form the final natural-language program.

(vi) Using testing objects to test the meaning of each sentence by executing its code.

(vii) Providing a library of procedure calls (in the underlying high-level language) which are needed in the code definitions of some low-level sentence meanings.

(viii) Providing a title and author data, and compiling the sentences into an HTML or LaTeX file.

(ix) Publishing the natural language program as a webpage on the internet, or as a PDF file compiled from the LaTeX document.

1.5.6 Publication Value of Natural-Language Programs and Documents
A natural-language program is a precise description of some procedure, created by its author. It is human readable and it can also be read by a suitable software agent. For example, a web page in an NLP format can be read by a personal assistant agent, and a person can ask the agent to execute some sentences, i.e., carry out some task or answer a question.

There is a reader agent available for English interpretation of HTML-based NLP documents that a person can run on a personal computer.

1.5.7 Contribution to Machine Knowledge


An ontology class in a natural-language program is not a concept in the sense in which we use concepts in our regular discourse. Concepts in an NLP are examples or samples of generic human concepts.

Each sentence in a natural language program either:
(i) states a relationship in a world model, or
(ii) carries out an action in the environment, or
(iii) carries out a computational procedure, or
(iv) invokes an answering mechanism in response to a question.

A set of NLP sentences, with their associated ontology, can also be used as pseudo-code that does not provide the details in any underlying high-level programming language. In such an application the sentences used become high-level abstractions of computing procedures that are computer-language and machine-independent.
1.5.8 The Theory


Consider, for example, a father saying to his baby son, "Want to suck on this bottle, dear boy?" and the kid hearing "blah, suck, blah, blah, BOTTLE, blah, blah", but he properly responds, because he has got a 'picture' of a bottle in the right side of his head connected to the word "bottle" on the left side, and an existing "skill" near the back of his neck connected to the term "suck".

In other words, the kid matches what he can with the pictures (types) and skills (routines) he has accumulated, and he simply disregards the rest. Our compiler does very much the same thing, with new pictures (types) and skills (routines) being defined not by us but by the programmer, as he writes new application code.

1.5.9 Web Programming Languages


Web development can be done through different programming languages that allow one to build a site or design an application. The options include:

(i) Java: a multipurpose language that adjusts efficiently to web development.
(ii) Go: a flexible language that facilitates the creation of applications.
(iii) Ruby on Rails: it allows one to design web applications quickly.
(iv) Python: it works on a wide variety of content and has technical advantages on the web.
(v) JavaScript: it runs on the client's side and can be extended to the server for different functions.

1.6 ARE NATURAL LANGUAGES REGULAR?

In theoretical computer science and formal language theory, a regular language is also called a rational language (RL). It is defined by a regular expression in the strict sense of theoretical computer science.

A regular expression (RE) is a language for specifying text search strings. An RE helps us to match other strings or sets of strings. It uses a specialised syntax held in a pattern. Regular expressions are used to search for matching strings in a uniform way.

1.6.1 Properties of Regular Expressions

We mention below some of the important properties of RE.

(i) An RE is a formula in a special language and is an algebraic notation for characterising a set of strings, i.e. sequences of symbols. It can be used for specifying simple classes of strings.

(ii) A regular expression requires two things: one is the pattern that is to be searched for, and the other is a corpus of text in which we need to search, as the sketch below shows.
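A minimal sketch of these two ingredients in Python; the pattern and the small corpus are illustrative assumptions:

    # Search a corpus of text for every occurrence of a pattern.
    import re

    corpus = "The mouse the cat chased escaped."
    pattern = r"\bcat\b"                  # the search string (word boundaries)

    for match in re.finditer(pattern, corpus):
        print(match.group(), match.start())   # -> cat 14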

1.6.2 Mathematical Definition of RE

Mathematically, a regular expression can be defined as follows:

(i) ε is a regular expression, which indicates that the language contains the empty string.

(ii) φ is a regular expression, which denotes the empty language.

(iii) If X and Y are regular expressions, then
1. X, Y
2. X.Y (the concatenation of X and Y)
3. X + Y (the union of X and Y)
4. X*, Y* (the closure of X and Y, also called the Kleene closure)
are also regular expressions.

(iv) If a string is derived from the above rules, then it is also a regular expression.

1.6.3 Examples of Regular Expressions

We mention below a few examples of regular expressions in the following table.

Regular expression | Regular set
(0 + 10*) | {0, 1, 10, 100, 1000, 10000, ...}
(0*10*) | {1, 01, 10, 010, 0010, ...}
(0 + ε)(1 + ε) | {ε, 0, 1, 01}
(a + b)* | The set of all strings of a's and b's, which also includes the null string, i.e. {ε, a, b, aa, ab, bb, ba, aaa, ...}
(a + b)*abb | The set of strings of a's and b's ending with the string abb, i.e. {abb, aabb, babb, aaabb, ababb, ...}
(11)* | The set consisting of an even number of 1's, which also includes the empty string, i.e. {ε, 11, 1111, 111111, ...}
(aa)*(bb)*b | The set of strings consisting of an even number of a's followed by an odd number of b's, i.e. {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, ...}
(aa + ab + ba + bb)* | The set of strings of a's and b's of even length that can be obtained by concatenating any combination of the strings aa, ab, ba and bb, including the null string, i.e. {ε, aa, ab, ba, bb, aaab, aaba, ...}
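One table entry can be checked mechanically. The sketch below encodes (a + b)*abb in Python's regex syntax; the example strings are assumptions:

    import re

    pattern = re.compile(r"^[ab]*abb$")   # (a+b)*abb: strings ending in abb

    for s in ["abb", "aabb", "babb", "ab", "abba"]:
        print(s, bool(pattern.match(s)))
    # abb True, aabb True, babb True, ab False, abba False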

1.6.4 Properties of Regular Sets

(i) If we take the union of two regular sets, the resulting set is also regular.

(ii) If we take the intersection of two regular sets, the resulting set is also regular.

(iii) If we take the complement of a regular set, the resulting set is also regular.

(iv) If we take the difference of two regular sets, the resulting set is also regular.

(v) If we take the reversal of a regular set, the resulting set is also regular.

(vi) If we take the closure of a regular set, the resulting set is also regular.

(vii) If we take the concatenation of two regular sets, the resulting set is also regular.

(viii) A simple example of a language that is not regular is the set of strings {a^n b^n | n ≥ 0}. It cannot be recognised by a finite automaton, since a finite automaton has finite memory and cannot remember the exact number of a's.

A regular language also has the following equivalent characterisations:

(ix) It is the language accepted by a deterministic finite automaton (DFA).

(x) It is the language accepted by a nondeterministic finite automaton (NFA).

(xi) It can be generated by a regular grammar.

(xii) It can be generated by a prefix grammar.

(xiii) It can be accepted by a read-only Turing machine.

1.6.5 The Number of Words in a Regular Language


Let S_L(n) denote the number of words of length n in L. The ordinary generating function for L is the formal power series

    S_L(z) = Σ_{n ≥ 0} S_L(n) z^n

The generating function of a language L is a rational function if L is regular. Hence, for every regular language L, the sequence S_L(n), n ≥ 0, is constant-recursive. This implies that there exist an integer constant n_0, complex constants λ_1, ..., λ_k, and complex polynomials p_1(x), p_2(x), ..., p_k(x) such that for every n ≥ n_0 the number S_L(n) of words of length n in L is

    S_L(n) = p_1(n) λ_1^n + p_2(n) λ_2^n + ... + p_k(n) λ_k^n

Thus, the non-regularity of certain languages L can be proved by counting the words of a given length in L.

Consider, for example, the Dyck language of strings of balanced parentheses. The number of words of length 2n in the Dyck language is equal to the Catalan number C_n ~ 4^n / (n^{3/2} √π), and this is not of the form p(n) λ^n. Hence the Dyck language is not regular.

1.6.6 English is Not a Regular Language

The English language is regular if one considers it as a set of single words. But English is more than a set of words in a dictionary. English grammar is the non-regular part. Given a paragraph, there is no DFA which will decide whether it is a well-written paragraph in the English language. Of course, a DFA can say whether each word is an English word or not, but it cannot judge whole paragraphs.

In particular, the standard example is that one can build sentences of the form "The mouse escaped.", "The mouse the cat chased escaped.", "The mouse the cat the man owned chased escaped." that are grammatical and can be arbitrarily long, but are not regular.
1.7 FINITE AUTOMATA FOR NLP

The term automata means 'self-acting'. It is the plural of automaton. An automaton is defined as a self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State Automaton (FSA).

Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite set of states,
(ii) Σ is a finite set of symbols, called the alphabet of the automaton,
(iii) δ is the transition function,
(iv) q0 is the initial state from where any input is processed (q0 ∈ Q),
(v) F is a set of final state/states of Q (F ⊆ Q).

1.7.1 Relation between Finite Automata, Regular Grammars and Regular Expressions

We mention below the points which will give us a clear idea about the relationship between finite automata, regular grammars and regular expressions.

(i) Finite state automata are the theoretical foundation of computational work, and regular expressions are one way of describing them.

(ii) Any regular expression can be implemented as an FSA, and any FSA can be described with a regular expression.

(iii) Since a regular expression is a way to characterise a kind of language called a regular language, we can say that a regular language can be described with the help of both FSA and regular expressions.

(iv) A regular grammar, a formal grammar that can be right-regular or left-regular, is another way to characterise a regular language.

The diagram below shows that finite automata, regular expressions and regular grammars are equivalent ways of describing regular languages.

Fig. 1.7.1: Regular expressions, finite automata and regular grammars all describe the class of regular languages.


1.7.2 Types of Finite State Automata (FSA)


There are two types of finite state automata:

1.7.2.1 Deterministic Finite Automaton (DFA)


defined as the type of
symbol we
It is
finite automation where, for every input
can deternmine the state to which the machinewill move. It has finite number of states.
Hence the machine is called Deterministic Finite Automation
(DFA)
Mathematically, a DFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite set of states,
(ii) Σ is a finite set of symbols, called the alphabet of the automaton,
(iii) δ is the transition function, where δ: Q × Σ → Q,
(iv) q0 is the initial state from where any input is processed (q0 ∈ Q),
(v) F is a set of final state/states of Q (F ⊆ Q).
Graphically, a DFA can be represented by digraphs, called state diagrams, where:
(i) The states are represented by vertices.
(ii) The transitions are shown by labelled arcs.
(iii) The initial state is represented by an empty incoming arc.
(iv) The final state is represented by a double circle.

1.7.2.2 Example of DFA

Suppose a DFA is given by Q = {M, N}, Σ = {0, 1}, q0 = M, F = {N}.

The transition function δ is as shown in the table below:

Current state | Next state for input 0 | Next state for input 1
M | M | N
N | N | N

Fig. 1.7.2: State diagram of the DFA

A simulation of this DFA in code follows.
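A minimal Python simulation of this DFA, written directly from the transition table above (the helper names are assumptions):

    # Transition table: (current state, input symbol) -> next state.
    DELTA = {
        ("M", "0"): "M", ("M", "1"): "N",
        ("N", "0"): "N", ("N", "1"): "N",
    }

    def accepts(s, start="M", final=("N",)):
        state = start
        for ch in s:
            state = DELTA[(state, ch)]   # deterministic: exactly one next state
        return state in final

    print(accepts("001"))   # True  (a 1 is read, so the machine reaches N)
    print(accepts("000"))   # False (the machine never leaves M)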



1.7.2.3 Non-Deterministic Finite Automaton (NDFA)
It is defined as the type of finite automaton where, for every input symbol, we cannot determine the state to which the machine will move; the machine can move to any combination of states. Since it still has a finite number of states, it is called a Non-deterministic Finite Automaton (NDFA).

Again, as usual, an NDFA can be represented mathematically by a 5-tuple (Q, Σ, δ, q0, F), where
(i) Q is a finite set of states,
(ii) Σ is a finite set of symbols, called the alphabet of the automaton,
(iii) δ is the transition function, where δ: Q × Σ → 2^Q,
(iv) q0 is the initial state from where any input is processed (q0 ∈ Q),
(v) F is a set of final state/states of Q (F ⊆ Q).

Graphically, an NDFA can be represented by digraphs (same as a DFA), called state diagrams, where:
(i) The states are represented by vertices.
(ii) The transitions are shown by labelled arcs.
(iii) The initial state is represented by an empty incoming arc.
(iv) The final state is represented by a double circle.

1.7.2.4 Example of NDFA

Let an NDFA be: Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c}.

We exhibit the transition function δ in the table below:

Current state | Next state for input 0 | Next state for input 1
a | {a, b} | {b}
b | {c} | {a, c}
c | {b, c} | {c}

Fig. 1.7.3: State diagram of the NDFA

A simulation of this NDFA in code follows.
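A minimal Python simulation of this NDFA. Unlike the DFA, it must track the set of states the machine could be in (the helper names are assumptions):

    # Transition table: (current state, input symbol) -> set of next states.
    DELTA = {
        ("a", "0"): {"a", "b"}, ("a", "1"): {"b"},
        ("b", "0"): {"c"},      ("b", "1"): {"a", "c"},
        ("c", "0"): {"b", "c"}, ("c", "1"): {"c"},
    }

    def accepts(s, start="a", final={"c"}):
        states = {start}
        for ch in s:
            # union of all states reachable from any current state
            states = set().union(*(DELTA[(q, ch)] for q in states))
        return bool(states & final)

    print(accepts("0"))    # False: {a} -0-> {a, b}, which has no final state
    print(accepts("00"))   # True:  {a, b} -0-> {a, b, c}, which contains c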




1.7.3 Regular Expressions and Automata


(I) Role of regular expressions and automata in NLP systems

A key point in natural language processing is that every NLP expert should be proficient in regular expressions. They are used in various tasks such as data preprocessing, rule-based information mining systems, pattern matching in text, feature engineering, web scraping, data extraction, etc.

(II) Bag of words in NLP

A bag of words is a representation of text that describes the occurrence of words within a document. We keep track of the word counts and disregard the grammatical details and the word order. It is called a "bag" of words because any information about the order or structure of words in the document is discarded, as the sketch below illustrates.
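A minimal bag-of-words sketch in Python; the two example documents are assumptions:

    # Count word occurrences per document, ignoring grammar and word order.
    from collections import Counter

    docs = ["the cat chased the mouse", "the mouse escaped"]
    for doc in docs:
        print(Counter(doc.split()))
    # Counter({'the': 2, 'cat': 1, 'chased': 1, 'mouse': 1})
    # Counter({'the': 1, 'mouse': 1, 'escaped': 1})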

1.7.4 Applications and Limitations of Finite Automata

(I) Applications are as follows:
(i) For designing the lexical analyser of a compiler.
(ii) For recognising patterns using regular expressions.
(iii) For designing combinational and sequential circuits using Mealy and Moore machines.
(iv) Used in text editors.
(v) For the implementation of spell checkers.

(II) Limitations of finite automata
(i) An FA can only count a finite input.
(ii) There is no finite automaton that can recognise the set of binary strings with equal numbers of 0s and 1s.
(iii) The set of strings over '(' and ')' that have balanced parentheses cannot be recognised.
(iv) The input tape is read-only, and the only memory an FA has is its current state.
(v) It can match only fixed, regular string patterns.

1.8 LEVELS AND TASKS OF NLP

GQ. Briefly explain the NLP tasks and write the different levels of NLP. OR Explain the syntactic and semantic analysis in NLP.

The NLP problem can be divided into two tasks:

(1) Processing written text, using lexical, syntactic and semantic knowledge of the language as well as the required real-world information.


(2) Processing spoken language, using all the information needed above, plus additional knowledge about phonology, as well as enough added information to handle the further ambiguities that arise in speech.

Levels of NLP

1. Morphology

It is the analysis of individual words, which consist of morphemes, the smallest grammatical units. Generally, suffixes such as 'ing' and 'ed' change the meaning of the word. This analysis becomes necessary in the determination of the tense as well.
2. Syntax

Syntax is concerned with the rules of sentence formation. It includes checking the legal formulation of the sentence structure. (Some aspects are covered in a compiler's syntax analysis phase that you must have studied.) For example, "Hari is good not to": the sentence structure is totally invalid here.
3. Semantics

During this phase, a meaning check is carried out: the way in which the meaning is conveyed is analyzed. The previous example is syntactically as well as semantically wrong. Now consider one more example, "The table is on the ceiling". This is syntactically correct, but semantically wrong.

4. Discourse integration

In communication, or even in text, the meaning of the current sentence often depends on the one that is prior to it. Discourse analysis deals with the identification of the discourse structure.

5. Pragmatics

In this phase, analysis of the response from the user, with reference to what the language is actually meant to convey, is handled. So it deals with the mapping between what the user has interpreted from the conveyed part and what was actually expected. For a question like "Do you know how long it will take to complete the job?", the expected answer is the number of hours rather than a yes or no.

6. Prosody

It is an analysis phase that handles rhythm. This is the most difficult analysis; it plays an important role in poetry or shlokas (chants involving the name of God) that follow a rhythm.

7. Phonology

This involves analysis of the different kinds of sounds that are combined. It is concerned with speech recognition. Can the analysis levels discussed be overlapped or interrelated? Yes: it is very much possible to have the analyses forming a fuzzy structure. They can work in stages, where the second level makes use of the analysis or the outcomes of the first level. We now study them in detail.


1.9 STAGES IN NLP

There are five phases of NLP (refer Fig. 1.9.1): lexical analysis, syntactic analysis, semantic analysis, discourse integration and pragmatic analysis.

Fig. 1.9.1: Basic steps of NLP (Lexical Analysis → Syntactic Analysis → Semantic Analysis → Discourse Integration → Pragmatic Analysis)

1. Lexical and morphological analysis

Lexical analysis is the first NLP phase. This phase scans the source text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences and words. It studies the patterns of formation of words and combines sounds into minimal distinctive units of meaning.


2. Syntactic analysis (parsing)

Syntactic analysis is used to check grammar and word arrangement, and shows the relationships among the words. Words are collected to form phrases, phrases get converted into clauses, and clauses form sentences.

Example: "Pune goes to Gopal" does not make any sense, so this sentence is rejected by the syntactic analyser.

3. Semantic analysis

Semantic analysis is concerned with meaning representation. It focuses on the literal meaning of words, phrases and sentences. It studies the meaning of words independent of the context of the sentence; hence it may involve ambiguities to some extent.

4. Discourse integration

Discourse integration depends upon the sentences that precede a given sentence, and also invokes the meaning of the sentences that follow it; it connects sentences. Discourse integration mainly studies inter-sentential connections, i.e. how the preceding sentence can change the interpretation of the next sentence.


5. Pragmatic analysis

Pragmatic analysis is the last phase of NLP. It helps one to discover the intended effect by applying a set of rules that characterise cooperative dialogues. It is mainly concerned with how the sentences are used and what the inner meaning of a sentence is. For example, "Open the door" is interpreted as a request instead of an order.

6. World knowledge

In language studies, world knowledge is the non-linguistic information that helps a reader or listener to interpret the meanings of words and sentences. With knowledge, we are able to recognise things and people around us in the world. The more knowledge we gain, the more things and people we should be able to recognise in the world.

Generally we experience four types of knowledge:

(i) Factual knowledge: These are the terminologies, glossaries, details and necessary building blocks of any professional domain.

(ii) Conceptual knowledge: This knowledge is the understanding of the principles and relationships that underlie a domain.

(iii) Procedural knowledge: This knowledge refers to the knowledge of how to perform a specific skill or task, and is considered knowledge related to methods, procedures, or the operation of equipment. Procedural knowledge is also referred to as implicit knowledge or know-how.

(iv) Metacognitive knowledge: This knowledge refers to what learners know about learning. It includes the learner's knowledge of their own cognitive abilities (e.g. 'I have trouble remembering dates in history') and the learner's knowledge of particular tasks (e.g. 'the ideas in this chapter that I am going to read are complex').

1.9.1 Phonetic and Phonological Knowledge

Phonetic knowledge is the knowledge of sound-symbol relations and sound patterns represented in a language. When a child is learning to talk and communicate, they develop phonemic awareness, which is an awareness of distinctive speech sounds, and they use phonemes (the smallest units of sound) to create words.

The primary difference between phonological and phonemic awareness is that phonological awareness is the ability to recognise that words are made up of different sounds. In contrast, phonemic awareness is the ability to understand how each sound functions in words.


Examples of phonological knowledge: counting the number of syllables in a name, recognising alliteration, segmenting a sentence into words, and identifying the syllables in a word.

Example of phonemic knowledge: counting the number of sounds in a word would be a phonemic awareness activity.

Information retrieval, information extraction and question answering

Information retrieval (IR) involves returning a set of documents in response to a user query. Internet search engines are a form of IR. However, one change from classical IR is that Internet search now uses techniques that rank documents according to how many links there are to them (e.g., Google's PageRank) as well as the presence of search terms.

Information extraction involves trying to discover specific information from a set of documents. The information required can be described as a template. For instance, for company joint ventures, the template might have slots for the companies, the dates, the products, and the amount of money involved. The slot fillers are generally strings.

Question answering attempts to find a specific answer to a specific question from a set of documents, or at least a short piece of text that contains the answer, e.g. "What is the capital of France?" "Paris has been the French capital for many centuries." There are some question-answering systems on the Web, but most use very basic techniques. For instance, Ask Jeeves relied on a fairly large staff of people who searched the web to find pages which are answers to potential questions. Such a system performs very limited manipulation on the input to map it to a known question. The same basic technique is used in many online help systems.

1.10 AMBIGUITY AND UNCERTAINTY IN LANGUAGE

Ambiguity, as the term is generally used in natural language processing, refers to the capability of being understood in more than one way. Natural language is very ambiguous. NLP has the following types of ambiguities:

(1) Lexical ambiguity: The ambiguity of a single word is called lexical ambiguity. For example, the word silver can be treated as a noun, an adjective, or a verb.
(2) Syntactic ambiguity: This kind of ambiguity occurs when a sentence can be parsed in different ways. For example, consider the sentence "The man saw the girl with the telescope". It is ambiguous whether the man saw the girl carrying a telescope, or he saw her through his telescope.

(3) Semantic ambiguity: This kind of ambiguity occurs when the meaning of the words themselves can be misinterpreted; in other words, semantic ambiguity happens when a sentence contains an ambiguous word or phrase. For example, the sentence "The car hit the pole while it was moving" has semantic ambiguity, because it can be interpreted as "The car, while moving, hit the pole" or "The car hit the pole while the pole was moving".


(4) Anaphoric ambiguity: This kind of ambiguity arises due to the use of anaphoric entities in discourse. For example, "The horse ran up the hill. It was very steep. It soon got tired." Here, the anaphoric reference of "it" in the two situations causes ambiguity.

(5) Pragmatic ambiguity: This kind of ambiguity refers to the situation where the context of a phrase gives it multiple interpretations. In simple words, we can say that pragmatic ambiguity arises when the statement is not specific. For example, the sentence "I like you too" can have multiple interpretations: I like you (just like you like me), or I like you (just like someone else does).

1.10.1 NLP for Indian Regional Languages


(1) One might think that people who are acquainted with computers are already familiar with the English interface. However, it is worth noting that the majority of the Indian population is still based in rural areas, where teaching and learning would be in local languages, and where communities are literate but still not familiar with English. So, yes, it is a worthwhile effort to upscale NLP research in India.

(2) The dream of an all-inclusive Digital India cannot be realized without bringing NLP research and application in India on par with that of languages like English. When engaging with smartphones, the language barrier can be a huge obstacle to many.
(3) Take the case of farmers and agriculture, which has long been considered the backbone of India. Farmers play an obviously important role in feeding the country. Helping such farmers improve their methods (through precision agriculture, farmer helplines, chatbots, etc.) has been an aim of development projects and an important part of the fight against global hunger. But many small farmers are not knowledgeable in English, meaning it is difficult for them to share and learn about new farming practices, since most of the information is in English.

(4) Can you imagine a mobile application like Google Assistant but tailor-made for Indian farmers? It would allow them to ask their questions in their native tongue; the system would understand their query and suggest relevant information from around the globe.

(5) Do you think this is possible to do without NLP for Indian regional languages? And this is just one possible use-case. From making information more accessible to understanding farmer suicides [4], NLP has a huge role to play.

Thus, there is a clear need to bolster NLP research for Indian languages, so that people who don't know English can get "online" in the true sense of the word, ask questions in their mother tongue and get answers. The need also becomes clear when we look at some of the applications of NLP in India.


1.10.2 Applications of NLP in India


They are:

(1) Smartphone users in India crossed 500 million in 2019. Businesses feel a need to increase user engagement at the local level. NLP can go a long way in achieving that, by improving search accuracy (Google Assistant now supports multiple Indian languages), chatbots and virtual agents, etc.

(2) NLP has huge application in helping people with disabilities: interpretation of sign languages, text to speech, speech to text, etc.

(3) Digitisation of Indian manuscripts to preserve the knowledge contained in them.

(4) Signboard translation from vernacular languages to make travel more accessible.

(5) Fonts for Indian scripts for improving the impact/readability of advertisements, signboards, presentations, reports, etc.

(6) There are many more. The ideal scenario would be to have corpora and tools available in as good a quality as they are for English, to support work in these areas.

1.11 CHALLENGES OF NLP


If we have to progress in NLP, in terms of both potential applications and overall capabilities, these are the important issues we need to resolve:

(1) Language differences: If we speak English and are thinking of reaching an international and/or multicultural audience, we shall need to provide support for multiple languages. Different languages have not only vastly different sets of vocabulary, but also different types of phrasing, different modes of inflection and different cultural expectations. We shall need to spend time retraining our NLP system for each new language.

(2) Training data: NLP is all about analysing language to better understand it. A person must spend years of constant practice to become fluent in a language, spending a significant amount of time reading, listening to, and utilising it. Similarly, the abilities of an NLP system depend on the training data provided to it. If questionable data is fed to the system, it is going to learn the wrong things, or learn in an inefficient way.
(3) Development time: One also must think about the development time for an NLP system. With a distributed deep learning model and multiple GPUs working in coordination, one can trim the training time down to just a few hours.

(4) Phrasing ambiguities: Sometimes it is hard, even for another human being, to parse out what someone means when they say something ambiguous. There may not be a clear, concise meaning to be found in a strict analysis of their words. In order to resolve this, an NLP system must be able to seek context that can help it understand the phrasing. It may also need to ask the user for clarity.

(5) Misspellings: Misspellings are a simple problem for a human being, but for a machine, misspellings can be harder to identify. One should use an NLP tool with capabilities to recognise common misspellings of words and move beyond them.
(6) Innate biases: In some cases, NLP tools can carry the biases of their programmers, as well as biases within the data sets used to train them. Depending on the application, an NLP system could provide a better experience to certain types of users over others. It is challenging to make a system that works equally well in all situations, with all people.
(7) Words with multiple meanings: Most languages have words that can have multiple meanings depending on the context. For example, a user who asks "how are you" as a greeting has a totally different goal from a user who literally asks for information. Good NLP tools should be able to differentiate between such phrases with the help of context.
(8) Phrases with multiple intentions: Some phrases and questions actually have multiple intentions, so the NLP system cannot oversimplify the situation by interpreting only one of those intentions. For example, a user may prompt the chatbot with something like, "I need to cancel my previous order and update my card on file." The AI needs to be able to distinguish these intentions separately.


(9) False positives and uncertainty: A false positive occurs when an NLP system notices a phrase that should be understandable but cannot be sufficiently answered. The solution here is to develop an NLP system that can recognise its own limitations, and use questions to clear up the ambiguity.

(10) Keeping a conversation moving: Many modern NLP applications are built on dialogue between a human and a machine. Accordingly, your NLP AI needs to be able to keep the conversation moving, providing additional questions to collect more information and always pointing towards a solution.

1.12 GENERAL APPLICATIONS OF NLP


Natural language processing, machine learning and artificial intelligence are often used interchangeably. AI is regarded as an umbrella term for machines that simulate human intelligence; NLP and ML are regarded as subsets of AI.

Natural language processing is a form of AI that gives machines the ability not just to read, but to understand and interpret human language. With NLP, machines can make sense of written or spoken text and perform tasks including speech recognition, sentiment analysis, and automatic summarisation.

Thus, we can note that NLP and ML are parts of AI, and the two subsets share techniques, algorithms and knowledge.


Some NLP-based solutions include translation, speech recognition, sentiment analysis, question/answer systems, chatbots, automatic text summarisation, market intelligence, automatic text classification, and automatic grammar checking.

Fig. 1.12.1: AI = Artificial intelligence; ML = Machine learning; DL = Deep learning; NLP = Natural language processing

These technologies help organisations to analyse data, discover insights, automate time-consuming processes, and/or gain competitive advantages.

(1) Translation

Translating languages is a more complex task than a simple word-to-word replacement method. Since each language has its own grammar rules, the challenge of translating a text is to do so without changing its meaning and style.

Since computers do not understand grammar, they need a process in which they can deconstruct a sentence, then reconstruct it in another language in a way that makes sense.

Google Translate is one of the most well-known online translation tools. Google Translate once used phrase-based machine translation (PBMT), which looks for similar phrases between different languages.

At present Google uses Google neural machine translation (GNMT), which uses ML with NLP to look for patterns in languages.

(2) Speech Recognition

Speech recognition is a machine's ability to identify and interpret phrases and


words from spoken language and convert them into a machine- readable format.
It uses NLP to allow computers to collect human interaction, and ML to respond
in a way that copies human responses.

Google Now, Alexa, and Siri are some of the most popular examples of speech recognition. Simply by saying 'call Ravi', a mobile recognises what the command means and makes a call to the contact saved as 'Ravi'.

(3) Sentiment Analysis


Sentiment analysis uses NLP to interpret and analyse emotions in subjective
data like news articles and tweets.

can be identified to d
neutral opinions
and
Positive, negative or service. terminea
towards a brand, product,
Customer's sentiment
monitorbrand ra
to measure public opinion,
Sentiment analysis is used
ation
and better understand customer experiences.
field thatcan be heavily influenced hu
The stock market a hu
emotion. Negative
is sensitive

sentiment con lead stock prices


to drop, while a
to buy more
of the company's stock, causinve
ng
sentiment may trigger people
stock
princes to increase.
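As an illustrative sketch (the example sentence and setup below are assumptions, not from the text), a quick sentiment score can be computed with NLTK's VADER analyzer:

import nltk
nltk.download('vader_lexicon', quiet=True)   # one-time download of the VADER lexicon

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# 'compound' > 0 suggests positive sentiment, < 0 negative, near 0 neutral
print(sia.polarity_scores("The product is great and I love it"))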

(4) Chatbots

Chatbots are programs used to provide automated answers to common customer queries.

They have pattern recognition systems with heuristic responses, which are used to hold conversations with humans.

Initially, chatbots were used to answer basic questions to alleviate the heavy volume of queries in call centres and offer quick customer support services. AI-powered chatbots are designed to handle more complicated requests, making conversational experiences increasingly natural.

Chatbots in health-care can collect intake data, help patients to assess their symptoms, and determine next steps. These chatbots can set up appointments with the right doctor and even recommend treatments.
(5) Question-Answer Systems

Question-answer systems are intelligent systems that can provide answers to customer queries.

Unlike chatbots, question-answer systems have a huge array of knowledge and good language understanding rather than canned questions and answers. They can answer questions like "When was Indira Gandhi assassinated?", or "How do I go to the Airport?", and they can be created to deal with textual data, audio, images and videos.

Question-answer systems can be found in social media chats and tools such as Siri and IBM's Watson.

In 2011, IBM's Watson computer competed on Jeopardy, a game show during which answers are given first, and the contestants supply the questions. The computer competed against the show's two biggest all-time champions and astounded the tech industry as it won first place.
(6) Automatic Text Summarisation

Automatic text summarisation is the task of condensing a piece of text to a shorter version. It extracts its main ideas while preserving the meaning of the content.

This application of NLP is used in news headlines, result snippets in web search, and bulletins of market reports.
search,


(7) Market Intelligence

Market intelligence is the gathering of valuable insights surrounding trends, consumers, products and competitors. It extracts actionable information that can be used for strategic decision-making.

Market intelligence can analyse topics, sentiment, keywords, and intent in unstructured data, and is less time consuming than traditional desk research.

Using market intelligence, organizations can pick up on search queries and add relevant synonyms to search results.

It can also help organisations to decide which products or services to discontinue or what to target to customers.

(8) Automatic Text Classification

Automatic text classification is another fundamental solution of NLP. It is the process of assigning tags to text according to its content and semantics. It allows for rapid, easy collection of information in the search phase.

This NLP application can differentiate spam from non-spam based on its content.
(9) Automatic Grammar Checking

Automatic grammar checking, the task of detecting and correcting grammatical errors and spelling mistakes in text depending on context, is another major part of NLP.

Automatic grammar checking will make one alert to a possible error by underlining the word in red.

(10) Spam Detection

Spam detection is used to detect unwanted e-mails getting to a user's inbox. Refer Fig. 1.12.2.

Fig. 1.12.2: An NLP-based machine learning model classifies incoming e-mail as spam or not spam
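As a minimal sketch of the pipeline in Fig. 1.12.2 (the toy e-mails and labels below are assumptions for illustration), a bag-of-words spam classifier can be built with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data
emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "free money claim now", "project report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()           # turn each e-mail into word counts
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)   # learn word-frequency patterns per class

print(model.predict(vectorizer.transform(["claim your free prize"])))  # ['spam']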

(11) Information extraction


Information extraction is one of the most important applications of NLP.
It is used for extracting structured information from unstructured or semi-structured machine-readable documents.

(12) Natural Language Understanding (NLU)

It converts a large set of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate than the raw notations of natural language.

1.13 ISSUES IN NLP

With the help of complex algorithms and intelligent analysis, NLP tools pave the way for digital assistants, chatbots, voice search, and dozens of other applications. Even then, there are some important issues that we have to resolve. They are as follows:
1. Language differences

In the USA, people speak English, but if we are thinking of reaching a multicultural audience, we shall need to provide support for multiple languages.

Different languages have not only vastly different sets of vocabulary, but also different types of phrasing, different modes of inflection, and different cultural expectations.

This issue can be resolved with the help of "universal" models that can transfer at least some learning to other languages. But we shall need some time to train an NLP system for each new language.

2. Training data

The main aim of NLP is to analyse language to better understand it. To be fluent in a language, one must immerse in the language constantly for a period of years. Similarly, even the best AI must also spend a significant amount of time reading, listening to and utilising language.
The ability of an NLP system depends on the training data provided to it. If bad or questionable data is fed to the system, then it is going to learn the wrong things, or learn in an inefficient way.

3. Development time

One must also think about the development time for an NLP system. To train an AI sufficiently, it must review millions of data points; processing all that data may take a lifetime if an insufficiently powered PC is used. But with a distributed deep learning model and multiple GPUs working in coordination, one can trim down that training time to just a few hours.

4. Phrasing ambiguities
If someone speaks ambiguously, then it is difficult even for another person to parse out what they mean. There may not be a clear, concise meaning to be found in a strict analysis of their words. In order to resolve this, an NLP system must be able to seek context that can help it understand the phrasing. One may also need to ask the user for clarity.

5. Misspellings

For human beings, misspellings are not a very big problem. One can easily associate a misspelled word with its properly spelled counterpart, and understand the rest of the sentence in which it is used. But misspellings can be harder for a machine to identify.

Hence we need to use an NLP tool with capabilities to recognise common misspellings of words, and move beyond them.

6. Innate biases

In some cases, NLP tools can carry the biases of their programmers, as well as biases within the data sets that are used to train them.

Depending on the application, an NLP system could exploit certain biases. It is challenging to make a system that works equally well in all situations, with all people.

7. Words with multiple meanings

Many languages have words that have multiple meanings, depending upon the context.

For example, a user who asks, "how are you" has a totally different goal than a user who asks something like "how do I add a new debit card?"

Good NLP tools should be able to differentiate between these phrases with the help of context.

8. Phrases with multiple intentions

Some questions and phrases have multiple intentions. In such a case the NLP system cannot oversimplify the situation by interpreting only one of those intentions.

For example, a user may prompt your chatbot with something like, "I need to cancel my previous order and update my card on file."

Here the AI needs to be able to distinguish these intentions separately.

9. False positives and uncertainty

A false positive occurs when an NLP system notices a phrase that should be understandable and addressable, but cannot be sufficiently answered.

Here an NLP system is to be developed that can recognise its own limitations, and use questions or prompts to clear up the ambiguity.

10. Keeping a conversation moving

Many of the modern NLP applications are built on dialogue between a human and a machine. Hence, our NLP AI needs to be able to keep the conversation moving, providing additional questions to collect more information and always pointing towards a solution.

Here, we have discussed the major challenges of using NLP.

1.14 TOKENIZATION

Tokenization is a common task in Natural Language Processing (NLP). It is a fundamental step in both traditional NLP methods like Count Vectorizer and advanced deep learning-based architectures like Transformers.

Tokens are the building blocks of natural language.

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types: word, character, and subword (n-gram characters) tokenization.

For example, consider the sentence: "Never give up".

The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens: Never-give-up. As each token is a word, it becomes an example of word tokenization.

Similarly, tokens can be either characters or subwords. For example, let us consider "smarter":

1. Character tokens: s-m-a-r-t-e-r
2. Subword tokens: smart-er

But then, is this necessary? Do we really need tokenization to do all of this?

1.14.1 Reasons behind Tokenization


As tokens are the building blocks of natural language, the most common way of processing the raw text happens at the token level.

For example, Transformer-based models, the State of The Art (SOTA) deep learning architectures in NLP, process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.

Fig. 1.14.1: Tokenizing the sentence "What time is it ?" into five word-level tokens

Examples

Tokenization (in the data-security sense) is the process of replacing sensitive data with symbols that preserve all the essential information about the data without compromising its security.

Tokenisation tries to minimise the amount of data a business needs to keep on hand. It has become popular for small and midsize businesses to bolster the security of credit card and e-commerce transactions while minimising the cost and complexity of compliance with industry standards and government regulations.
Tokenization technology can be used with sensitive data of all kinds, including bank transactions, medical records, criminal records, vehicle driver information, loan applications, stock trading and voter registration.

Tokenisation is often used to protect credit card data, bank account information and other sensitive data handled by payment processors.
Payment processing use cases that tokenize sensitive credit card information include:

(i) mobile wallets like Android Pay and Apple Pay;
(ii) e-commerce sites; and
(iii) businesses that keep a customer's card on file.

1.14.2 Token

GQ. What is a token? How is it created?

Tokenization substitutes sensitive information with non-sensitive equivalent information. The non-sensitive, replacement information is called a token.

Tokens can be created in various ways:

(i) Using a mathematically reversible cryptographic function with a key,
(ii) Using a non-reversible function such as a hash function, or
(iii) Using an index function or randomly generated number.

In short, the token becomes the exposed information, and the sensitive information that the token stands for is stored safely in a centralised server known as a token vault. The original information can only be traced back to its corresponding token from the token vault.

Some tokenisation is vaultless. Instead of storing the sensitive information in a secure database, vaultless tokens are stored using an algorithm.

If the token is reversible, then the original sensitive information is generally not stored in a vault.
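To make the idea concrete, here is a toy sketch of a token vault (illustrative only; the helper names and card number are hypothetical, and a real system would use a hardened, access-controlled vault):

import secrets

token_vault = {}  # token -> original sensitive value, kept on a secure server

def tokenize(card_number):
    token = secrets.token_hex(8)        # randomly generated replacement value
    token_vault[token] = card_number    # the original stays safely in the vault
    return token

def detokenize(token):
    return token_vault[token]           # only the vault can trace the token back

t = tokenize("4111 1111 1111 1111")
print(t)              # e.g. '9f3ac2...', safe to store and transmit
print(detokenize(t))  # the original card number, recoverable only via the vault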

We mention a real-world example of how tokenization with a token vault works:

(i) A customer provides their payment details at a point-of-sale (POS) system or online checkout form.
(ii) The data are stored with a randomly generated token, which is generated in most cases by the merchant's payment gateway.
(iii) The tokenized information is then sent to a payment processor. The original sensitive payment information is stored in a token vault in the merchant's payment gateway.

(iv) The tokenised information is sent again by the payment processor before being sent for final verification.
1.14.3 Word Tokenization

Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter. Depending upon the delimiters, different word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe come under word tokenization.

But there are a few drawbacks to this.

Drawbacks of Word Tokenization

One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words refer to the new words which are encountered at testing. These new words do not exist in the vocabulary. Hence, these methods fail in handling OOV words.
But wait, don't jump to any conclusions yet! A small trick can rescue word tokenizers from OOV words. The trick is to form the vocabulary with the Top K frequent words and replace the rare words in the training data with unknown tokens (UNK). This helps the model to learn the representation of OOV words in terms of UNK tokens.

So, during test time, any word that is not present in the vocabulary will be mapped to a UNK token. This is how we can tackle the problem of OOV in word tokenizers.
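A toy sketch of this Top-K/UNK trick (the corpus and K value are assumptions for illustration):

from collections import Counter

training_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
K = 3  # keep only the K most frequent words in the vocabulary

vocab = {w for w, _ in Counter(training_tokens).most_common(K)}

def map_tokens(tokens):
    # replace any word outside the vocabulary with the UNK token
    return [t if t in vocab else "<UNK>" for t in tokens]

print(map_tokens(["the", "dog", "sat"]))  # 'dog' is OOV, so it becomes '<UNK>'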
The problem with this approach is that the entire information of the word is lost, as we are mapping OOV words to UNK tokens. The structure of the word might be helpful in representing the word accurately. And another issue is that every OOV word gets the same representation.

OOV word gets the same representation

Another issue with word tokens is connected to thesize of the vocabulary.

Generally, pre-trained
models are trained on a large volume of the text corpus.
So, just imagine building the vocabulary with all the unique words in such a
large corpus. This explodes the vocabulary!

This opens the door to Character Tokenization.

1.14.4 Character Tokenization


Character tokenization splits a piece of text into a set of characters. It overcomes the drawbacks we saw above about word tokenization.

Character tokenizers handle OOV words coherently by preserving the information of the word. It breaks down the OOV word into characters and represents the word in terms of these characters.

It also limits the size of the vocabulary. Want to take a guess at the size of the vocabulary? It is 26, since the vocabulary contains the unique set of characters of the language.
Drawbacks of Character Tokenization

Character tokens solve the OOV problem, but the length of the input and output sentences increases rapidly as we are representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between the characters to form meaningful words.

This brings us to another tokenization known as Subword Tokenization, which is in between Word and Character tokenization.

1.14.5 Need of Tokenization

GQ. Why do we need tokenization?

Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements. The token occurrences in a document can be used directly as a vector representing that document.

This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning. The tokens can also be used directly by a computer to trigger useful actions and responses. Or they might be used in a machine learning pipeline as features that trigger more complex decisions or behavior.

Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we call it word tokenization.

Example of sentence tokenization

sent_tokenize("Life is a matter of choices, and every choice you make makes you.")
['Life is a matter of choices, and every choice you make makes you.']

Example of word tokenization

word_tokenize("The sole meaning of life is to serve humanity")
['The', 'sole', 'meaning', 'of', 'life', 'is', 'to', 'serve', 'humanity']
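A runnable version of the above, assuming NLTK is installed and the 'punkt' tokenizer models have been downloaded:

import nltk
nltk.download('punkt', quiet=True)  # one-time download of tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Life is a matter of choices. The sole meaning of life is to serve humanity."
print(sent_tokenize(text))  # two sentence-level tokens
print(word_tokenize(text))  # word-level tokens, punctuation included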

1.14.6 Benefits of Tokenization

GQ. What are the benefits of tokenization?

(i) Tokenization makes it more difficult for hackers to gain access to cardholder data. In older systems, credit card numbers were stored in databases and exchanged freely over networks.
(ii) It is more compatible with legacy systems than encryption.
(iii) It is a less resource-intensive process than encryption.

(iv) The risk of fallout in a data breach is reduced.
(v) The payment industry is made more convenient by allowing new technologies like mobile wallets, one-click payment and cryptocurrency. This improves customer trust because it improves both the security and the convenience of a merchant's service.
(vi) It reduces the steps involved in complying with regulations for merchants.

1.14.7 Tokenization Challenges in NLP


While breaking down sentences into words seems simple (after all, we build sentences from words all the time), it can be a bit more complex for machines.

A large challenge is being able to segment words when spaces or punctuation marks don't define the boundaries of the word. This is especially common for symbol-based languages like Chinese, Japanese, Korean, and Thai.
Another
challenge is symbols that
change the meaningof the
We intuitively understand that a 'S
word
sign with a number significandl.
means something different attached to it
than the number (S10
itself
less common (100). Punction,
situations, can cause an issue
for machines especially i
meaning as a part of a data trying to isolate
string. theie
Contractions such as
'you're' and T'm' also need to
into their be
respective parts. properly broken
Failing to properly tokenize down
sentence can lead to every part of
misunderstandings later in the NLP the
Tokenization is the start process.
of the NLP
understandable bits of process, converting
data that a sentences into
foundation built program can work with.
through tokenization,the NLP
Without a
strong
a process can quickly
messy telephone
game. devolve into

Sub Word Tokenization

Subword tokenization is similar to word tokenization, but it breaks individual words down a little bit further using specific linguistic rules. One of the main tools it utilizes is breaking off affixes. Because prefixes, suffixes, and infixes change the inherent meaning of words, they can also help programs understand a word's function. This can be especially valuable for out-of-vocabulary words, as identifying an affix can give a program additional insight into how unknown words function.
The subword model will search for these subwords and break down words that include them into distinct parts. For example, the query "What is the tallest building" would be broken down into 'what' 'is' 'the' 'tall' 'est' 'build' 'ing'.

How does this method help the issue of OOV words? Let's look at an example: perhaps a machine receives a more complicated word, like "machinating" (the present tense of the verb 'machinate', which means to scheme or engage in plots). It's unlikely that machinating is a word included in many basic vocabularies.

If the NLP model was using word tokenization, this word would just be converted into an unknown token. However, if the NLP model was using subword tokenization, it would be able to separate the word into an 'unknown' token and an 'ing' token. From there it can make valuable inferences about how the word functions in the sentence.

But what information can a machine gather from a single suffix? The common 'ing' suffix, for example, functions in a few easily defined ways. It can form a verb into a noun, like the verb 'build' turned into the noun 'building'. It can also form a verb into its present participle, like the verb 'run' becoming 'running'.
If an NLP model is given this information about the 'ing' suffix, it can make several valuable inferences about any word that uses the subword 'ing'. If 'ing' is being used in a word, it knows that it is either functioning as a verb turned into a noun, or as a present participle verb. This dramatically narrows down how the unknown word, "machinating", may be used in a sentence.
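A toy sketch of affix-based subword splitting (the suffix list and '##' marker are assumptions; real systems such as BPE or WordPiece learn subword vocabularies from data rather than from hand-written rules):

KNOWN_SUFFIXES = ["ing", "est", "ed", "ly"]

def subword_split(word, vocabulary):
    # split an out-of-vocabulary word into a stem token and a suffix token
    if word in vocabulary:
        return [word]                                    # known word: keep whole
    for suffix in KNOWN_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], "##" + suffix]  # '##' marks a subword
    return ["[UNK]"]                                     # no rule applies

vocab = {"what", "is", "the", "tall", "build"}
print(subword_split("machinating", vocab))  # ['machinat', '##ing']
print(subword_split("tallest", vocab))      # ['tall', '##est']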


There are multiple ways that text or speech can be tokenized, although each method's success relies heavily on the strength of the programming integrated in other parts of the NLP process. Tokenization serves as the first step, taking a complicated data input and transforming it into useful building blocks for the natural language processing program to work with.
As natural language processing continues to evolve using deep learning models, humans and machines are able to communicate more efficiently. This is just one of many ways that tokenization is providing a foundation for revolutionary technological leaps.

1.14.8 Types of Tokens

There are three main types of tokens as defined by the Securities and Exchange Commission:

(i) Asset/security token: These are tokens that promise a positive return on an investment. These are analogous to bonds and equities.
(ii) Utility token: These act as something other than a means of payment. For example, a utility token may give direct access to a product, or act as a discount on future goods and services. It adds value to the functioning of a product.
(iii) Currency/payment token: These are created totally as a means of payment for goods and services external to the platform they exist on.

1.15 STEMMING

Stemming is a natural language processing technique. It lowers inflection in words to their root forms; it aids in the preprocessing of text, words and documents for text normalisation.

Inflection is the process by which a word is modified to communicate many grammatical categories, including tense, gender and mood.

We employ stemming to reduce words to their basic form or stem, which may or may not be a legitimate word in the language.

For example, the stem of these three words, connections, connected, connects, is "connect". On the other hand, the root of trouble, troubled, and troubles is "troubl", which is not a recognised term.

The English language has several variants of a single term. The presence of these variances in a text corpus results in data redundancy when developing NLP models. Such models may become ineffective.
models. Such test by removing roe
machine learning to normalise
essential repetition
it is
model,
To build a robust
base form through
stemming.
words to their
and transforming variants of a root/base
that produces word
is a rule-based approach word. This heuristic Dr
Stemming word to its stem
words, it reduces a base
In simple indiscriminate cutting of
two the process involves
as
is the of the and normalise
simpler the look-up
to shorten e
ends of the words. Stemming helps
sentences for a better understanding.

1.15.1 Challenges in Stemming


The process has two main challenges:

Over stemming: The inflected word is cut off so much that the resultant stem is nonsensical. Over stemming can also result in different words with different meanings having the same stem. For example, "universal", "university" and "universe" are all reduced to "univers". Here, even though these three words are
etymologically related, their modern meanings are widely different. Treating
them as synonyms in a search engine will lead to inferior search results.
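This over-stemming behaviour can be checked directly with NLTK's Porter stemmer:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ['universal', 'university', 'universe']:
    print(word, "->", porter.stem(word))  # all three collapse to 'univers'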
Understemming: Here, various inflected words that actually are forms of one another do not resolve to the same stem. An example of understemming in the Porter stemmer is "alumnus" → "alumnu", "alumni" → "alumni", "alumnae" → "alumna". The English word has Latin morphology, and so these near-synonyms are not combined.

1.15.2 Application of Stemming


Stemming is employed in information retrieval, text mining, SEOs, web search results, indexing, tagging systems, and word analysis. For instance, a Google search for a word returns predicted and comparable results.

1.15.3 Types of Stemmer in NLTK

GQ. Discuss different types of stemmer in NLTK.

There are different kinds of stemming algorithms, and all of them are included in Python NLTK; we discuss them.

1.15.3.1 Porter Stemmer

Five steps of word reduction are used in this method. Each step has its own mapping rules. Frequently, the resultant stem is a shorter word with the same root meaning. The Porter stemmer is renowned for its ease of use and rapidity.

PorterStemmer() is a module in NLTK that implements the Porter stemming technique.

We consider an example: we construct an instance of PorterStemmer() and use the Porter algorithm to stem the list of words.
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ['connects', 'connecting', 'connections', 'connected',
         'connection', 'connecting', 'connects']
for word in words:
    print(word, "->", porter.stem(word))

Output:
connects -> connect
connecting -> connect
connections -> connect
connected -> connect
connection -> connect
connecting -> connect
connects -> connect

1.15.3.2 Snowball Stemmer

SnowballStemmer()

The method used in this instance is more precise and is referred to as the "English stemmer" or "Porter2 stemmer". It is rather faster and more logical than the original Porter stemmer.

SnowballStemmer() is a module in NLTK that implements the Snowball stemming technique.

Example of SnowballStemmer()

We first construct an instance of SnowballStemmer() to use the Snowball algorithm to stem the list of words.

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')
words = ['generous', 'generate', 'generously', 'generation']
for word in words:
    print(word, "->", snowball.stem(word))

[Out]:
generous -> generous
generate -> generat
generously -> generous
generation -> generat

1.15.3.3 Lancaster Stemmer

LancasterStemmer()

The Lancaster stemmer is straightforward, although it often produces results with excessive stemming. Over-stemming renders stems non-linguistic or meaningless.

Example of LancasterStemmer()

We construct an instance of LancasterStemmer() and then use the Lancaster algorithm to stem the list of words.

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
words = ['eating', 'eats', 'eaten', 'puts', 'putting']
for word in words:
    print(word, "->", lancaster.stem(word))

[Out]:
eating -> eat
eats -> eat
eaten -> eat
puts -> put
putting -> put


1.15.3.4 Regexp Stemmer: RegexpStemmer()

The Regexp stemmer identifies morphological affixes using regular expressions. Substrings matching the regular expressions will be discarded.

RegexpStemmer() is a module in NLTK that implements the regex stemming technique.

Example of RegexpStemmer()

Here, we first construct an object of RegexpStemmer() and then use the regex stemming method to stem the list of words.

from nltk.stem import RegexpStemmer

regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['mass', 'was', 'bee', 'computer', 'advisable']
for word in words:
    print(word, "->", regexp.stem(word))

[Out]:
mass -> mas
was -> was
bee -> bee
computer -> computer
advisable -> advis

1.15.4 Text Stemming


As already mentioned, stemming is the process of reducing inflexion in words to their "root" forms, such as mapping a group of words to the same stem. Stem words include the suffixes and prefixes that have been added to the root word.
computer science, we need this
process to
produce grammatical variants of
root words. A stemming is provided by the NLP algorithms that are
stemming
algorithms or stemmers. The stemming algorithm removes the stem from the
word. For example,
'walking', 'walks', 'walked 'are made from the root word
Walk'. So here, the stemmer removes
ing, s, ed from the above words to take
out the
meaning that the sentence is about walking in somewhere or on
Something. The words are nothing but different tenses forms of verbs.
Below is an
example of stem 'Consult.' see how addition of different suffixes
generated longer form of the same
stem
This is the
general idea to reduce the different forms of the word to their root
word.

Words that are derived from one another can be mapped to a base word or
symbol, especially if
they have the same meaning.



CONSULT → CONSULTANT, CONSULTING, CONSULTATIVE, CONSULTANTS

GQ. What are the most common types of error associated with stemming in text mining or NLP?

We cannot be sure that stemming will give us a 100% correct result, so we have two types of error in stemming: overstemming and understemming.
GQ. What is an over stemming error?

This kind of error occurs when too many words are cut out. It may be possible that the segmentation of the long-form word gives birth to two stems that are identical but actually differ in contextual meaning. The result could be nonsensical items, where the meaning of the word has been lost, or the stemmer cannot distinguish between two stems, resolving them to the same stem where they should differ from each other.

For example, take the four words university, universities, universal, and universe. A stemmer that resolves these four to the stem "univers" is over stemming. University and universities should be stemmed together, and universal and universe should be stemmed together; all four are not fit for a single stem.

GQ. What is an under stemming error?
Under-stemming is the opposite of over stemming. It occurs when we have different words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately, they do not.

This can be seen if we have a stemming algorithm that stems the words data and datum to "dat" and "datu". And you might be thinking, well, just resolve these both to "dat". However, then what do we do with "date"? And is there a good general rule? So there under stemming occurs.

1.16 LEMMATIZATION
Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item.

Lemmatization is similar to stemming but it brings context to the words, so it links words with similar meaning to one word. Some people treat these as the same, but lemmatization is preferred over stemming because lemmatization does a morphological analysis of words.

Lemmatization is responsible for grouping different inflected forms of words into the root form, having the same meaning.

Tagging systems, indexing, SEOs, information retrieval, and web search all use lemmatization to a vast extent.

Lemmatization involves using a vocabulary and morphological analysis of words, removing inflectional endings, and returning the dictionary form of a word (the lemma).

The process of lemmatization seeks to get rid of inflectional suffixes and prefixes for the purpose of bringing out the word's dictionary form.

Fig. 1.16.1: 'leafs' and 'leaves' both map to the lemma 'leaf'

Lemmatization entails reducing a word to its canonical or dictionary form. The root word is called a 'lemma'. The method entails assembling the inflected parts of a word in a way that can be recognised as a single element. The process is similar to stemming, but the root words have meaning.

Lemmatization has applications in:

1. Biomedicine: Using lemmatization to parse biomedical literature may increase the efficiency of data retrieval tasks.
2. Search engines
3. Compact indexing: Lemmatization is an efficient method for storing data in the form of index values.

For example, NLTK provides the WordNetLemmatizer class, a slim wrapper around the WordNet corpus. This class makes use of a function called morphy() from the WordNetCorpusReader class to find a root word/lemma.
Fig. 1.16.2: 'studying', 'studies' and 'study' all lemmatize to 'study'
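A minimal sketch using WordNetLemmatizer (assumes the 'wordnet' corpus has been downloaded):

import nltk
nltk.download('wordnet', quiet=True)  # one-time download of the WordNet corpus

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos='v' treats the words as verbs; the default pos is 'n' (noun)
for word in ['studying', 'studies', 'study']:
    print(word, "->", lemmatizer.lemmatize(word, pos='v'))  # all give 'study'

print(lemmatizer.lemmatize('leaves'))  # 'leaf' (treated as a noun by default)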

1.16.1 Uses of Lemmatization


GQ. What are the uses of lemmatization?

Lemmatization helps chatbots to understand customer's queries to a better extent.

Since this involves a morphological analysis of the words, chatbots can understand the contextual form of the words in the text, and can gain a better understanding of the overall meaning of the sentence.

Lemmatization is also used to enable robots to speak and converse. This makes lemmatization a rather important part of natural language processing.

1.16.2 Difference Between Stemming and Lemmatization
-
GQ. State the difference between stemming and lemmatization.

Sr. No. | Stemming | Lemmatization
1. | Stemming attempts to reduce the inflectional forms of each word into a common base or root. | Lemmatization also attempts to reduce the inflectional forms of each word into a common base or root.
2. | In stemming, the end or beginning of a word is cut off, keeping common prefixes and suffixes. | Lemmatization uses dictionaries to conduct a morphological analysis of the word and link it to the lemma.
3. | One stem can be common for inflectional forms of many words. | Lemmatization involves greater complexity, because the process needs the words to be classified by part of speech and inflected form; a lemma can be linked to forms with different stems. This is quite a difficult task in any language.
4. | Stemming tends to be a faster process because it chops words without knowing the context of the word in the sentence. | Lemmatization is a slow process because it knows the context of the word before processing.
5. | Stemming is a rule-based approach. | Lemmatization is a dictionary-based approach.
6. | The process of stemming has a lower degree of accuracy. | The process of lemmatization has a comparatively higher degree of accuracy.


Fig. 1.16.3: Stemming vs lemmatization: 'change', 'changing', 'changes', 'changed', 'changer' are stemmed to 'chang' but lemmatized to 'change'

1.16.3 Importance of Lemmatization


(i) Lemmatization is a vital part of Natural Language Understanding (NLU) and Natural Language Processing (NLP).
(ii) It plays critical roles both in Artificial Intelligence and big data analysis.
(iii) Lemmatization is extremely important because it is far more accurate than stemming. This brings great value when working with a chatbot, where it is crucial to understand the meaning of a user's message.

The major disadvantage of lemmatization algorithms is that they are much slower than stemming algorithms.

1.16.4 Applications of Lemmatization


The process of lemmatization is used extensively in text mining. The text mining process enables computers to extract relevant information from a particular set of text.

Some of the other areas where lemmatization can be used are as follows:

1. Sentiment analysis

Sentiment analysis refers to an analysis of people's messages, reviews or comments to understand how they feel about something. Before the text is analysed, it is lemmatized.

2. Information retrieval environments

Lemmatizing is used for the purpose of mapping documents to common topics and displaying search results, by lemmatizing indexes when documents are increasing to large numbers.

3. Biomedicine

Lemmatization can be used while morphologically analysing biomedical literature. The BioLemmatizer tool has been created for this purpose only.

It pulls lemmas based on the use of a word lexicon. But if there is no lemma found in the lexicon, it defines the rules which turn the word into a lemma.
4. Document clustering

Document clustering (or text clustering) is the practice of conducting group analysis on text documents. Topic extraction and rapid information retrieval are its vital applications.

Both stemming and lemmatization are used to diminish the number of tokens that transfer the same information, which boosts up the entire method. After the pre-processing is carried out, features are estimated via the term frequency of each token, and then clustering methods are implemented.

5. Search engines

Search engines like Google make use of lemmatization so that they provide better, more relevant results to their users. Lemmatization even allows search engines to display relevant results and expand them to include other information that readers may find useful.

1.16.5 Advantages and Disadvantages of


Lemmatization

Advantages

(i) Lemmatization is more accurate.
(ii) It is useful to get root words from the dictionary, unlike just cutting the word like stemming.
(iii) Lemmatization gives more context to chatbot conversations as it recognises words based on their exact and contextual meaning.

Disadvantages

(i) Lemmatization is a time-consuming and slow process.
(ii) As it extracts the root words and meaning of the words from the dictionary, most lemmatization algorithms are slower compared to their stemming counterparts.

1.16.6 Example of Lemmatization


Running → Run, Runs → Run, Run → Run (common root: Run)
Creating → Create, Creates → Create, Created → Create (common root: Create)

"The boy's cakes are of different sizes." → "The boy cake be of differ size."


1.17 SYNTAX ANALYSIS
Syntax analysis or parsing is the second phase of a compiler.

A lexical analyser can identify tokens with the help of regular expressions and pattern rules. But a lexical analyser cannot check the syntax of a given sentence due to the limitations of the regular expressions.

Regular expressions cannot check balancing tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG). Thus, CFG is a superset of regular grammar.

Fig. 1.17.1: Regular grammar shown as a subset of context-free grammar

The diagram implies that every regular grammar is also context-free. CFG is an important tool which describes the syntax of a programming language.

1.17.1 Context-Free Grammar

A context-free grammar has four components:

(i) A set of non-terminals (V): Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.

(ii) A set of tokens, known as terminal symbols (Σ): Terminals are the basic symbols from which strings are formed.

(iii) A set of productions (P): The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal, called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.

(iv) One of the non-terminals is designated as the start symbol (S), from where the production begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.
terminal (initially the start symbol) by the right side of a production, for that
non-terminal.

1.17.2 Part-Of-Speech Tagging (POS)
---.
GQ. Explain POS tagging.

Part-Of-Speech (POS) tagging is a process of converting a sentence to a list of tuples (where each tuple has the form (word, tag)). The tag is a part-of-speech tag, and it signifies whether the word is a noun, adjective, verb, and so on.

Parts of speech for which tags are assigned include: noun, verb, adjective and adverb.
We can also say that tagging is a kind of classification that may be defined as the automatic assignment of description to the tokens. The descriptor is called a tag, which may also represent semantic information.

In simple words, we say that POS tagging is a task of labeling each word in a sentence with its appropriate part of speech.
sentence with word
its appropriate part of speech. in a

We have mentioned that parts of speech include nouns, verbs, adverbs, pronouns, adjectives, conjunctions and so on.

Most of the POS tagging falls under rule-based POS tagging, stochastic POS tagging and transformation-based tagging.
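As a quick hands-on sketch (assuming the required NLTK resources are downloaded; the sentence is an assumed example), words in a sentence can be tagged as follows:

import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ...]
# exact tags may vary with the tagger version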

1.17.3 Rule-based POS Tagging

GQ. Explain rule-based POS tagging.
Rule-based taggers use a dictionary or lexicon for obtaining possible tags for tagging each word.


If the word has more than one possible tag, then rule-based taggers use hand-written rules to identify the correct tag.
Rule-based tagging can handle ambiguity by analyzing the linguistic features of a word, its preceding word, as well as its following word. For example, if the preceding word of a word is an article or adjective, then the word must be a noun. All such kind of information in rule-based POS tagging is coded in the form of rules.

These rules may be either:

(i) Context-pattern rules, or
(ii) Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.

Rule-based POS tagging can be visualised by its two-stage architecture:

(i) First stage: Here a dictionary is used to assign each word a list of potential parts-of-speech.
(ii) Second stage: Here, the method uses a large list of hand-written disambiguation rules to sort down the list to a single part-of-speech for each word.

1.17.3.1 Properties of Rule-based POS Tagging

We mention below the properties of rule-based POS taggers:

(i) These taggers are knowledge-driven taggers; the rules in rule-based POS tagging are made manually.
(ii) There are around 1000 rules.
(iii) Smoothing and language modelling are defined explicitly in rule-based taggers.
rule-based taggers and is done
explicitly.

1.17.4 Stochastic POS Tagging

GQ. Explain stochastic POS tagging.

A stochastic model is a model that includes frequency or probability (statistics). Any approach to the problem of part-of-speech tagging that uses a model including probability is referred to as a stochastic tagger.
The simplest stochastic tagger uses the
following approaches for
POS-tagging:
(i) Word-frequency approach

Here the stochastic taggers disambiguate the words based on the probability that a word occurs with a particular tag. The tag that is encountered most frequently with the word in the training set is assigned to an ambiguous instance of that word.

The main problem with this approach is that it may yield an inadmissible sequence of tags. A toy sketch of this approach appears after approach (ii) below.
of tags.

(ii) Tag sequence probabilities

This is a different approach of stochastic tagging. Here the tagger calculates the probability of a given sequence of tags occurring. The best tag for a given word is determined by the probability at which it occurs with the n previous tags. Hence it is also called the n-gram approach.
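Going back to the word-frequency approach in (i), a toy sketch (the tiny tagged corpus is a hypothetical example) might look like this:

from collections import Counter, defaultdict

# Hypothetical tagged training corpus of (word, tag) pairs
training = [("book", "NOUN"), ("book", "VERB"), ("book", "NOUN"),
            ("flight", "NOUN"), ("the", "DET")]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    # assign the tag seen most often with this word in the training set
    return counts[word].most_common(1)[0][0] if word in counts else default

print(most_frequent_tag("book"))  # 'NOUN' (2 of its 3 training occurrences)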

1.17.4.1 Properties of Stochastic POS Tagging

We mention below its properties:

(i) This POS tagging is based on the probability of a tag occurring.
(ii) A training corpus is required here.

(iii) If the words do not exist in the corpus, then there is no probability.
(iv) A different testing corpus, other than the training corpus, is used.
(v) It is the simplest POS tagging because it chooses the most frequent tags associated with a word in the training corpus.
associatea

1.17.5 Transformation-based Tagging (TBL)

Transformation-based tagging is also called Brill tagging.

It is an instance of transformation-based learning. It is a rule-based algorithm for automatic tagging of POS to the given text.

TBL allows us to have linguistic knowledge in a readable form. It transforms one state to another state by using transformation rules.

TBL can be thought of as the mixture of both the above-mentioned taggers: rule-based and stochastic. Like rule-based tagging, it is based on the rules that specify which tags need to be assigned to which words. Like stochastic tagging, it is a machine learning technique, in which rules are automatically induced from data.
technique-in which m8
1.17.5.1 Working of Transformation-Based Learning (TBL)

To understand the concept governing transformation-based taggers, we have to understand the working of transformation-based learning. We mention below the steps of the working of TBL:
(i) Begin with the solution: The TBL usually starts with some solution to the problem and works in cycles.
(ii) Choosing the most beneficial transformation: In each cycle, TBL will choose the most beneficial transformation.
(iii) Applying to the problem: The transformation that is chosen in the last step will be applied to the problem.

The algorithm comes to an end when the transformation selected in step (ii) does not require any further transformation to be selected.

1.17.5.2 Advantages of Transformation-based Learning (TBL)

We mention below the advantages:

(i) We have to learn a small set of simple rules, and these rules are enough for tagging.
(ii) Since the learned rules are very easy to understand, development and debugging are very easy in TBL.

(iii) As in TBL there is interlacing of machine-learned and human-generated rules, its complexity is reduced.
(iv) The transformation-based tagger is much faster than the Markov-model tagger.

1.17.5.3 Disadvantages of Transformation-based Learning (TBL)

The disadvantages are as follows:

(i) Transformation-based learning (TBL) does not provide tag probabilities.
(ii) If the corpus is large, then training time in TBL is very long.

1.18 ADVANTAGES AND DISADVANTAGES OF NLP


The use of natural
language processing comes with advantages as well as

disadvantages.

1.18.1 Advantages of NLP


(i) Once implemented, NLP is less expensive and more time efficient than

employing a person.
(ii) NLP can also help businesses. It offers faster customer service response times.

Customers can receive immediateanswers to their questions.

(iii) Pre-trained learning models are available for developers to facilitate different applications of NLP; this makes them easy to implement.

(iv)
Natural Language Processing is the practice of teaching machines to understand

and interpret conversational inputs from humans.

(v) NLP can be used to establish communication channels between humans and
machines.

(vi) The different implementations of NLP can help businesses and individuals save time, improve efficiency and increase customer satisfaction.

1.18.2 Disadvantages of NLP


(i) Training can be time-consuming. If a new model needs to be developed without the use of a pre-trained model, it can take weeks before achieving a high level of performance.

(i) There is
always a possibility of errors
in predictions and results that need to be

taken into account.

ii) NLP may not show context.

(iv) NLP may require more keystrokes.


(v) NLP is unable to adapt to a new domain, and it has a limited function. That is why NLP is built for a single and specific task only.

1.19 SELF LEARNING TOPICS

Types of tools for regional languages

Various types of tools in Indian regional languages are:

(i) Using the phonetic keyboard
(ii) Fonts download
(iii) Padma plugin

(i) Using the Phonetic Keyboard

Using Indian languages on a computer is very attractive for a layman. Quillpad and Lipikaar are free online typing tools in Indian languages. They support transliteration technologies according to pre-defined rules.

A transliteration technology is one that allows users to type words as they usually would (like 'rashtrabhasha' instead of 'RASHTRASHA'), without typing rules such as case sensitivity. Transliteration tools expect users to type English words phonetically. This allows users to communicate in the regional language of their choice.

(ii) Fonts Download

Technology Development for Indian Languages (TDIL) is a programme initiated by the Department of Electronics and IT (DEIT), Govt. of India, with the objective to develop information processing tools to facilitate human-machine interaction in Indian languages and to develop technologies to access multilingual knowledge resources.

The fonts are being made available free to the public through language CDs and web downloads for the benefit of the masses.

(iii) Padma Plugin

Padma is a technology for transforming Indic text between public and proprietary formats. The technology currently supports Telugu, Malayalam, Tamil, Devanagari (including Marathi), Gujarati, Bengali and Gurmukhi.

Padma's goal is to bridge the gap between closed and open standards until the day Unicode support is widely available on all platforms. Padma automatically transforms Indic text encoded in proprietary formats into Unicode.

Regional languages pre-processing and other functions

If there is a nation where old and morphologically rich varieties of regional languages exist, then it is India.


It is comparatively easy for computers to process data represented in English through standard ASCII codes than other natural languages. But building the machine capability of understanding other natural languages is arduous and is carried out using various techniques.
Nowadays the internet is no more monolingual; contents of the other regional languages are growing rapidly. According to the 2001 census, there are approximately 1000 documented languages and dialects in India.
Much research is being carried out to facilitate users to work and interact with computers in their own regional natural languages.
Google offers translation in 13 Indian regional languages (IRL), like Kannada, Hindi, Bengali, Tamil, Telugu, Malayalam, Marathi, Punjabi, and Gujarati.


The major concentrated tasks on IRL are Machine Translation (MT), Sentiment Analysis (SA), Parts-Of-Speech Tagging (POST) and Named Entity Recognition (NER).
translate
translation is communication where machine
Machine inter-lingual

source language to the target language by preserving its meaning.

Sentiment analysis is the identification of opinions expressed and the orientation of thoughts in a piece of text.
POS tagging is a process in which each word in a sentence is labelled with a tag indicating its appropriate part of speech.
Named Entity Recognition identifies the proper names in structured or unstructured documents and then classifies the names into sets of categories of interest.
Machine-learning algorithms and natural language processing techniques are widely and deeply investigated for English. But not much work has been reported for IRL due to the richness in morphology and complexity in structure.

Fig. 1.19.1: Generic model for language processing: raw text (natural language) → tokenisation → machine transliteration (grapheme, phoneme or hybrid approaches) → pre-processing (POS tagging, stemming, stop-word removal etc.) → lexical and morphological analysis → feature extraction → ML algorithm (rule-based, neural-network and neural-based approaches) → post processing

Fig. 1.19.2: Tasks on IRL (MT, SA, NER, etc.) followed by evaluation

Generic Model for Language Processing


The generic model for language processing consists of various stages: machine transliteration, pre-processing, lexical and morphological analysis, POS tagging, feature extraction and evaluation.

The contributions of techniques for the success of the language processing tasks are as follows:
are

Tokenization
In natural language processing applications, the raw text initially undergoes a process called tokenization. In this process, the given text is tokenized into lexical units, and these are the basic units. After tokenization, each lexical unit is termed a token. Tokenization can be at sentence level or word level, depending upon the category of the problem.

Hence, there are 3 kinds of tokenization:

(a) Sentence level tokenization.

(b) Word level tokenization,

(c) n-gram tokenization.

(a) Sentence-level tokenization deals with the challenges like sentence ending
detection and sentence boundary ambiguity.
(b) In word-level tokenization, words are the lexical units; hence the whole document is tokenised to the set of words. Word-level tokenization is used in various language processing and text processing applications.
(c) The n-gram tokenization is a token of n words, where 'n' indicates the number of words taken together for a lexical unit. If n = 1, then the lexical unit is called a unigram; similarly, if n = 2 the lexical unit is a bigram, and if n is 3, it is a trigram. For n-gram tokenization (n ≥ 2), to satisfy the n words in the tokens there will be overlapping of terms in the tokens, as shown in the sketch below.

ture
(P8-86) ATech-Neo Publications...A SACHIN SHAH
(Introduction to NLP)....Page no.(1-53)
NLP (SPPU-Sem8-Comp.)

Machine transliteration

natural language
processing, machine a vital role in
In transliteration plays
like cross-language machine translation, named
applications entity recognition,
app retrieval etc.
information
a process of
is source
Transliteration converting a word or character from the
languages, alphabetical system to the target languages, alphabetical system
the phonetics of the source
without losing languages word or character.

Before transliteration, words are divided into syllabic units using Unicode and
character encoding standards. Then each of the syllabic units of a word gets
to target language.
converted

For example
Hindi English
|/v al
|/v«l'a'or'a'/

ChapterEnds...
UNIT I1
Language Syntax
8x Semantics
CHAPTER 2

Syllabus
Morphological Analysis :What is Morphologyn Types of Morphemes,
Inflectional morphology & Derivational morphology. Morphological
parsing with Finite State Transducers
(FST)
Syntactic Analysis: Syntactic Representations of Natural Language,
Parsing Algorithms, Probabilistic context-freegrammars, and Statistical
parsing
Semantic Analysis: Lexical Semantic, Relations among lexemes &
WordNet,
theirsenses-Homonymy,Polysemy, Synonymy, Hyponymy,
Word Sense Disambiguation (WSD), Dictionary based approach,
Latent Semantic Analysis.

2.1 ENGLISH MORPHOLOGY


For each token in the text, the NaturalLanguage Providesinformation about its

internal structure(morphology) and its role in the sentence (syntax).

6Q. Explain English Morphology

Morphology it the study of the internal structure of words. Morphology focuses


on how the components within a word (stems,root words, prefixes, suffixes

etc.) are arranged or modified to create different meanings.


English often adds 's' or "es" to the end of count nouns to indicate pluralities

and 'd or 'ed' to a verb to indicate past tense. The suffix

-ly" is added to adjectives to create adverbs (for example, "happy" (adjective)


and "happily"(adverb)).
The natural Language API uses morphological analysis to inter grammatical
information about words.

Morphology varies greatly between languages. Language such as English lacks


affixes indicating case, rely more on the word order in a sentenceto indicate the

respective roles of words.


NLP Syntax
(SPPU.Sem8-Comp.)(Lang. &Semantics).
Hence morphological analysis dependsheavily on the Page
within that ource no.
understanding of what is supported language
anguage,
In English there are numerous examples, such asrenl. and ,
of and -ment, and placem
composed re-"place", "walked", from ement",
and-ed. hiich
elements
English morphology supports language elements (gramma wEAI
language skills (reading, writing, speaking)
yocabular)
ana
a2.1.1 Survey of English Morphology
the of the way the words
Morphology is study are built
built
units called morphemes. Morphemes up from
meaning-bearing are
minimum meaning-bearing units a language. cats'
in
conversion defined
asthe ned
2 Morphological parsing is required
for such a
task. OPreviouslv w
rule?

accommodate both 'cat' and its plural form cats' in a arned


regexp. But, how
represent words such as 'geese, "foxes etc. which are also plurat Can ow a
not follow the 'cat plural forms we
butdn
The words such as 'foxes' are broken down into a stem and an affi
an
affix. The
the root word while the affix is the extension added to
the Stem stemis
either a different form of the same class or a whole new class.
represent
A. In English, morphology can be broadly classified into two types :
(a) Inflectional morphology: It is the combination of a word stem
grammatical
original stem.
morpheme which results in a word of the same
In English infliction is
simple,
th sthe
only nouns, ve
sometimes adjectives can be inflicted. Eg. cat cats, mous
walk walking, etc.
mice,

(b) Derivational morphology It is the combination of a word stem with.


grammatical morpheme which results in a word of a different
class
english it is very hard to predict the meaning of the stem from the
denive
structure. Eg. appoint >
appointee, clue clueless, kill killer etc >
2.2 MORPHEMME
A 'morpheme'is the smallest 'lexical item' in a language. The field of linguistec
study dedicated to morphemes is called morphology.

In English, morphemes are often but 'not necessarily' words. Morphemes tht

stand alone are considered as 'roots' (such as the morpheme cat); other

morphemes, called affixes, are found in combination with other


only
morphemes.
This distinction is not universal and does not apply to, for example, Latin,
which many roots cannot stand alone. For example, the Latin root 'reg ("King
must alwaysbe suffixed with etc.
a case marker:rex (reg-s),reg-is, reg-i,
For a language like Latin, a root can be defined as the main lexical neof
morpneu
a word.

Vve
SHAH
(P8-86) 4Tech-Neo Publications...A SACHIN
NLP (SPPU-Sem8-Comp.) (Lang. Syntax& Semantics)....Pageno.(2-3)

2.2.1 Classification (Freeand Bound Morphemes)


can be classified as free or bound.
Fuery morpheme
Free morphemes'can function independently as words (e.g. town,dog) and can
1) within lexemes (e.g.townhall, doghouse).
appear
with a
Bound morphemes' appear only as
sometimeswith other
of words,always in conjunction
parts

root' and bound morphemes. For example, 'un-' appears

only when accompaniedby


other morphemes to form a word.

2.2.2 Kinds a Morphology


the
Inflectional: Regular, applies to every noun, verb, whatever or at least
them. E.G. count
majority of nouns have singular/plural distinction,
all all
verbs
have tense distinctions, etc. Tend to be very Productive, ie. are found
verb has a
throughout the language; every (count) noun can be pluralized, every
etc.
past tense,
2. Derivational: Morphemes usually change "form class" (""part of speech"), e.g.

makes a verb out of a noun, or an adjective out of a verb, etc. Not always very
regular,
not very productive But useful in various ways, especially in the
formation of abstract nouns, esp. in development of scientific registers.

(n.)> (another kind


Example: Photograph photograph-y of N)
+ness: clearance, clearness: 3 different
clear(adj.)+ance, +ity, clarity,

kinds of N's

-ness, -hood, -dom, -ling. Likeness, likelihood (but not *ike hood,
-ize,

likeliness); kingdom, princeling (but not *king ling, princedom). -ize is

very productive: can be added to many form classes to


make verbs:
potentialize, manhattanize, losangelize, maximize,miniaturize, etc.

nation (n.) + al (adj.) > 'national' + ize> (makes a verb) 'nationalize' +


ation 'nationalization' (back to a noun) "process of making
s.t. belong

to the nation") + de- > denationalization' reversing the process of

making s.t. belong to the nation'"

2.2.3 Inflectional Morphology


It is one of the ways to combine morphemes with stems.

(1) Inflectional morphology conveys grammatical information, such as number,

tense, agreementor case.


2) Due to inflectional morphological process, the meaning and categories of the

new inflected words usually do not change.


That is a noun can be inflected to a noun while adding affixes, a verb can be
inflected to a verb in different tense in English.

One form other words of same


(3) say that the root word (stem) is inflected to
can

meaning and category.


4) Inflection creates different forms ofthe same word.

(P8-86) Tech-Neo Publications...A SACHIN SHAH Venture


NLP (SPPU-Sem8-Comp.) (Lang.Syntax &Semantics) p
(5) In Bnglish, only nouns and verbs can be inflected
(sometime
Page
no. (2-
Inflectional morphemes are very less when compared
mpared with
adiectives
languages.
somhealso
Example other

|Category Stem Aixes Inflected Word


Noun Word box Words
es Boxes
Verb Treat S Treats

-ing

In the
ed | Treating
Treated

above example,
) The
create
Inflectionalmorpheme
noun
s
is combined with
the
the noun stem
stem
plural 'words, word' to
i) The inflectional morpheme -ing' is combined with the verb
verb stem
create a gerund 'treating' 'treat to
6) Inflectional
morphemes do not change the essential
meaning or the pra
category of a word.

Adjectives stay adjectives, nouns remain nouns, and verbs remain


For example, if we add an -s
to the noun carrot to
show
verbs
plurality
remains a noun. lurality,
carrot

(7) Examplesof Inflectional Morphemes


Inflectional morphemes are suffixes that get added to a word, thus,
grammatical value to it. It can assign a tense, a number, a g
addino a

comparison, or a
possession. Here are some examples of inflectional
morphemes.
o Plural: Bikes, Cars, Trucks, Lions, Monkeys, Buses, Matches,
Classes
Possessive : Boy's, Girl's, Man's, Mark's, Robert's,
Samantha's,
Teacher's, Officer's
Tense: cooked, played, marked, waited, watched, roasted, grilled; sang,

O
drank, drove

:
Comparison Faster, Slower, Quicker,
Weaker,Stronger, Sharper, Bigger
Taller, Higher, Shorter, Smaller,

Superlative: Fastest, Slowest, Quickest, Tallest, Highest, Shortest,


Smallest, Biggest, Weakest, Strongest, Sharpest

2.2.4 Types ofMorphology


(1)-S3 person singular present She waits

(2)enPast participle
She has eaten

3)-S Plural Three tables

4) s Possessive Holly's cat

(5)-er Comparative You are taller

Ventud
(PB-86) Tech-Neo Publications...A SACHIN SHAH
sPPU-Sem8-Comp.) (Lang.Syntax & Semantics)...Page no.(2-5)
NLP (SPPU-Ser

2.2.5 Derivational Morphology


is the process of new words from a form of a
Derivation creating steam/base

word.
most common ways to derive new words is combine derivational
One of the to
with root words (stems).
affixes

formed through
The new words morphology may be a stem
deivational for

another affix.

New words are derived from the root words in this type of morphology.

derivation is one of the complex derivations. It is because of one or


(3)
English

more of the following reasons

(a)
Less productive. That is, a morpheme added with a sets of verbs to make
new meaningful words cannot always be added with all verbs.

For example, the base word 'summarise' can be added with the

grammatical morpheme 'ation' results in a word 'summarisation', but this

morpheme cannotbe added with all the verbs to make similar effects.

(4) Complex meaningdifferences among nominalising suffixes.


For example, the words 'conformation' and 'conformity' both derived from the
word stem 'conform but meanings are completely different.
Derivation creates different wordsfrom the same lemma

Example
Category Stem Affixes Derived word Target category

Noun Vapour -ize Vaporize Verb

Verb Read er reader Noun

Real -ize Realize Verb


Adjective|
Noun Mouthful Adjective
Mouth-ful
(1) Some more examples of words which are built up from smaller parts :

Black + bird combine to form black bird, dis + connect combine to form

disconnect.

(2) Some more examplesof English derivational patterns and their suffixes:

(a) adjective to -noun, -ness (slow > slowness)


(6) adjective
- -verb, -en (weak-weaken)
to

(c)adjective to -adjective -ly (personal


-
personally)

(d) nown -to -adjective - al (recreation- recreational)

(P8-86) Tech-Neo Publications...A SACHIN SHAH Venture


(Lang. &
Syntax Semantics)..
.pP
...Page
NLP (SPPU-Sem8-Comp.) no.(2
0.2-4
between Derivation and
a 2.2.6 Comparison
Inflection
Inflection

Derivation Inflection
also may be
formal effect.
by
may be
effected formal means
the
() Derivation
reduplication, etc.
like

means like affixation,


of bases, and
affixation
modification
internal
processes.
other morphological Intlection
nflection prototypical
prototypically
serves
to create new lexemes
serve
(2) Derivation modify to
fit
different
lexemes. grammatical contexts.
Inflection
typically adds
for
(3) Derivation changes category, information about numbergrammatical
(sino
a verb like employ
example taking dual, plural), person (first,
noun (employment,
and making it a third), tense (past, future),
second,
or an adjective aspect
employer,employee) (perfective, impertective,
or taking a noun
like
bitual)
(employable), and case (nominative,
it a verb (unionise) accusative
union and making among other grammatical
categories
or an (unionish,
adjective that languages might mark.
unionesque).
need not
we note that derivation
(4) Also,
For example, the
change category.
of abstract nouns from
creation
ones in English (king-
concrete
is a matter
kingdom, child-childhood)

of derivation
in English tends
Derivational prefixes
but it does
not to change category,
new for
add substantial meaning,
ves (unhappy,
example creating negati

inconsequential).

Remark
are difficult to or that seem to fall

There are instances that categorise,

somewhere between derivation and inflection.

2.3 MORPHOLOGICALPARSING WITH FST


(FINITE STATE TRANSDUCER)

Morphological parsing is the process of determining the morphemesrO


which a given word is constructed.

If must beable to
distinguish between orthographic rules and morpholo
logical

rules.For example, the word 'foxes' can be decomposed into "fox (ne
and 'es' (a suffix indicating plurality).

(PB-86) Tech-Neo Pubications..A SACHIN SHAHVVenture


NLP (SPPU-Sem8-Comp.) (Lang.Syntax &Semantics)..
Page no.(2-7)
The generally accepted approach to
morphological parsing is
through the use of
a finite state transducer
(FST). It inputs words and
modifiers outputs their stem and

The FST is actually


created
through algorithmic of some word
such as a dictionary, complete with parsing
source
modifier
markups.
Another approach is
through the use of an
indexed lookup method. It uses a
constructed radix tree. This is not an
often taken route because it breaks down
for morphologically complex languages.
With the advancement of neural
networks in natural
less common to use FST for
language processing, it is

is a lot of
morphological analysis. For languages for which
there available
training data, FST is less is use.

2.3.1 Orthographic
Orthographic rules are
general rules. They are used when a word into
breaking
its stem and modifiers.
Consider an example : singular words
it ends with - English ending with -y, when it is

pluralised, ies.

Morphological rules which contain corner cases to these


general rules.
Both these rules are used to construct
systems that can do morphological
parsing

2.3.2 Morphological
rules are to the
Morphological exceptions orthographic rules. It is when
breaking a word into its stem and modifiers.

Various models of natural


morphological processing are proposed. Generally
monolingual speakers process words as whole, while bilingual break words into
their corresponding It is because their lexical representations
morphemes. are
not as specific, and also lexical processing in the second language may be less

frequent than processing the mother tongue.


Applications include
of morphologicalprocessing
(a) Machine translation, (b) Spell checker, and (c) Information retrieval

Importance ofmorphology in NLD


Morphological analysis is a field of linguistics that studies the structure of
Words.

It identifies how a word is produced through the use of morphemes.


A morpheme is a basic unit of the English language. The morpheme is the

smallest elementof a word that has grammatical functioning and


meaning
nost of the applications related to the Natural Language processing. findings
of the morphological Analysis and Morphological generation are very

important.

Tech-Neo Publications...ASACHIN SHAH Venture


(P8-86)
NLP (SPPU-Sem8-Comp.) (Lang. Syntax &
Semantics)..
Application text to speech synthesis
Page
no.2-
Various mediums such as computers,
mobiles, are used fo
need. Btdisabled people and thepeople who are not
m ulfilment
such technical medias. They face log of difficulties.

There the need of Text to nteddaily


arises speech synthe
hesis.
Morphol with
sed to reduce the size of lexicon. It is
cause here one ogical
needs to analy
root word and various inflections need not be
remembered.
remember
Analysis can be used to
segregate the
Morphological a compound
form d
into
The applications are not limited to a particular language but hasie
can be
upto Hindi, Arabic, Marathi depending upon the particular field
incude

2.4 LEXICON (FREE FST PORTER STEMMER


ALGORITHM)

Lexicon refers to the component of a NLP system that contains inc.


(semantic, grammatical) about individual words or word strings. iormation

Example of lexicon
An example of lexicon is An example of
a.set
YourDictionary.Com.
lexicon is
of medical terms.

Lexicon in language learning


A lexicon is often used to describe the knowledge that a speaker has about the
words of a language. This includes meanings, use, form,and
relationships with
other words.

A lexicon can thus be thought of as a mental


dictionary.

Lexicon in NLTK
Lexicon is a vocabulary, a of words, a dictionary. In
list
NLTK, any lexicon is

considered a corpus since a of words also a


list is
body of text

Importance of lexicon
Different language research suggest that the lexicon is
representationallyrich
that it is the source of much behaviour. Its lexically
productive speciñic
information plays a critical and early role in the
interpretation of grammatical
structure.

Lexicon in Communication
A lexicon is the collection
of words
Or the intemalised dictionary also
that every speaker of a language has. Iti
a stock of terms used in a part
called Lexicon articular
lexis. also refers to

profession subject or style.

nture
Ven
(P8-86) Tech-Neo Publications...A SACHIN SHAH
Sem8-Comp.) (Lang. Syntax & Semantics)....Page no. (2-9)
(SPPUs
NLP
in AI
Lexicon form, a lexicon is
simplest the
vocabulary of a person, language, or branch
It 1s the
of words used
Lnowledge. catalog
often in conjunction with
of
of rules for the use of
ammar,
the set these words.

4.1 Porter Stemming Algorithm


The porter stemmingalgorithm (or 'porter
stemmer) is a
process for removing
ahe commoner morphological and inflexional
endings from words in English.
main use is as part of a
term normalisation
process that is generally done
while setting up information Retrieval systems.

Porter stemmer with example


Words such as "Likes", "liked", "likely" and
"liking" will be reduced to "like"
after stemming.
In 1980, porter presented a simple
algorithm for stemming
English language
words.

) Porter

stemmer.
stemmerhas two major achievements:
The rules associated with suffix removalare
much less complex in case of porter

G)The second difference is that the


porter's stemmer uses a single unified
to the handling of context. approach

Use of porter stemmer: (Applications):


)The main of porter stemmer include data
applications
mining and information
retrieval. But its
applications are limited only to
English words.
i) The group of stems is
mapped on to the same stem and the
a meaningful word. output stem is not
necessarily

Implementation of
porter stemmer
We follow the
following steps:
Step (1) : Import the NLTK and from NLTK
library import porterstemmer,
import nltk from nltk, stem import
porterstemmer.
Step (2) : Create a variable and store porterstemmer into it, (PS = Porter
Stemmer)

Step (3) : See how to use porter stemmer print (ps.stem ('bat )) print (ps.stem
(batting ))

LPorter stemmer in NLP


The porter
stemming algorithm (or 'porter stemmer is a process for removing )
common morphological and inflexional endings from words in
tne English. Its
main use as part of a term normalization
is
process that is usually done when
setüing up information retrieval systems.

(PB-86) Tech-Neo Publications..A SACHIN SHAH Venture


NLP (SPPU-Sem8-Comp.) (Lang.Syntax &

)
2.4.2 Difference between Semantics)..Page r
Lemmatization Stemming nd
and

Stemming
Stemming is a process that stems
Lemmatization
or removes last few characters Lemmatization
and
converts considersthe
from a word, often leading to the
meaningful base word COntex
incorrect meaningsand spellings called to
Lemma. form,
Gi) For which
example, stemming the wordLemmatizing the
e
'Caring' would returm 'car. would return word
'care',
iii) Stemming is used in case Carine
of large Lemmatization
dataset where is an
performance
expensive since it
issue. tables mputational
involves N

2.5 SYNTACTIC REPRESENTATION OF NLP


Syntactic analysis is the third
used to analyse
syntax,
phase of Natural language
sometimes called as synt

Syntax analysis compares the text to formal


or
processing
parsing NLP)ti
analysis.
grammar
to deten rules
.
meaning. The statement "heated ice-cream," for
by a semantic
example, would
ould he.
be
ermine
t
analyses. discardej

2.5.1 The Parser Concept


It isused to carry out the
parsing process. It is a software
component that t
input data (text)and converts it into a structural kes
representation after verifying i
for valid
syntax using formal grammar.
It creates a data
structure, which can be a parse an abstract
tree,
syntax tree,
another hierarchical structure.
or
token

Input. Lexical analyzer


Parser Output
Get next
token

Symbol table

Fig. 2.5.1

The primary functions ofparse include


To report any errors in syntax.

enture
(P8-86) Tech-Neo Publications...A SACHIN SHAH
m8-Comp.)(Lang. Syntax & Semantics)..Pageno.(2-1
(SPPUu
NLP from a frequently recurring error so that the
from rest of the program may
Torecover
recover
(ti)
be processed.
make a parse
tree.
To
ii) table
To make symbol
a
iv) representations (IR).
interme
Creating

PARSERS AND ITS RELEVANCE IN NLP


2.6
to
The word 'Parsing is used draw
exact
meaning or dictionary meaning from
It is also called syntactic analysis or
the text. syntax analysis.

Svntax analysis checks the


text for meaningfulness. The sentence like "Give me
will be
hot ice-cream,"
rejected by the parser or syntactic analyser.

sense, we can define parsing or syntactic


analysis or syntax analysis as
In this

follows:

I is defined as of
analyzing the strings of symbols in natural
the process
language conforming to the rules of formal grammar.

taken
Lexical analyser Parser
Input

Get next

Symbol table

Fig. 2.6.1

We can understand the relevance of parsing in NLP with the help of following

points

G) Parser is used to report any syntax error.


i) It helps to recover from commonly occurring error so that the processing of
the remainder of program can be continued.
(ii) Parse-tree is created with the help of a parser

(iv) Parser is used to create symbol table, and that plays an important role in

NLP
(v) Parser is also used to produce intermediate representations (IR).
Remark: The word Parsing' whose origin is
from Latin word 'pars' and it

means 'part'.

2.6.1 Parsing Top Down and Bottom Up


There are 2 types of parsing techniques, the first one is Top down parsing and
the second one is Bottom up parsing

(P8-86) ATech-Neo Publications...A SACHIN SHAH Venture


NLP
(SPPU-Sem8-Comp.) (Lang.Syntax & Semanti
Top down parsing is a parsing technique that Page
first
the parse tree and works down the oks no.
parse tree
ree by using the the (d
And Bottom-up ru
parsing is a parsing technique ulesof
highest
that first
level of the rst
parse tree and works up the leve
parse tree looksgrammar
grammar. by at
We mention using he
below the the
differences betweenthese two parsin
Sr. parsing
Top-Down Parsing
No. techniques
Bottom-up
Parsing
1.
It is a
parsing strategy that first It is a
parsing u
looks at the
highest level of parse at the
lowest
strategy that
level of
that
first
f
tree and works down the parse tree and works the
up the lonks
by using the rules of parse parse
grammar. the rules tree
of grammar. by tree
2. Top-down parsing attempts to find Bottom up uSng
the left most derivations for an parsing can be
an attempt to define
reduce fined a
the
input the start input
3. In this
string.

parsing technique uses left This


symbol of a grammar,
tring
m
parsing technique
most derivation. uses
righ
derivation.
4 The main leftmost decision is to The main decision is to
select what select
production rule to use use a when
when
production rule to
to
in order to construct the reduce
string. string to get the ne

Example Recursive Descent Example : starting

Its shift
reduce
symbol.

parser.
parser.

2.6.2 Modelling Constituency


Knowledge of language is the doorway to wisdom. -
Roger Bacon
Roger Bacon gave the above in the 13" quote century, and it still holds.
Today,the way of
understanding languages has changed
completely.
Here, we shall be covering some basic
concepts of modeling
constituency or
constituency parsing, in natural
languages.

2.6.3 ConstituencyParsing
Constituency Parsing is the process of
analyzing the sentences by breaking
down it into sub-phrases also known as
constituents.
These a
sub-phrases belong to
specific category of grammar like NP nou
phrase) and VP (verb phrase).

Constituency parsing is based on context-free


grammars.Constituency cone
free
granmmars are used to parse text.
The parse tree includes sentences that have been rases,
broken down into sub-pr
each of which belongs to a different
class. grammar

(P8-86) Tech-Neo Publications...A SACHIN


SHAR Venture
-Comp.) (Lang.Syntax & Semantics)..Page no.(2-13)
NLP(SPPU-Sem8-C a node is linguistic unit or phrase that has a mother or father node

A termir tag
a part-of-speech
and "A cat" and "a box beneath the bed", are noun phrases, while
vample,"A
an example,
As and "drive a car" are verb phrases.
a letter"
"Write sentence: "I shot an element in my pajamas."
an example
We consider
mention the constituency parse-tree graphically as:

NP

Verb NP
Pronoun

shot Det Nominal

an Nominal PP

Noun in my pajamas

elephant

(1)

Fig. 2.6.2(Contd...)

The parse tree


on
the top ()
represents catching an elephant carrying pajamas,
on the bottom an element in his
while the parse tree (1I) represents capturing

pajamas.
sentence is broken down into sub-phrases till we have got terminal
The entire

phrases remaining.

N
Pronoun VP
PP

Verb NP in my pajamas

Shot Det Nominal

an Noun

elephant

(1) Fig. 2.6.2

(P8-86) LTech-Neo Publications...A SACHIN SHAH Venture


x&
& Semantics)..
Pageno.
Sytax 2-1
(Lang. forNoun
Phra
hrases.

andNP
stands

NLP
(SPPU-Sem8-Comp.)
Verb-Phrases,
for
VPstands Parsing lear, so
so that we car
clear,
parsing
Dependencyofdependency compre
2.6.4 parsing
the concept to the proc
wemnke with constituency refers
in orde r toexamining the
First (DP)
of a sentence
parsing determine
Parsing ts
dependency
Dependency the phrases
term is
The between The proc
The process basedd
on
sections.
dependences sections. each lingu the
into many
structure between nit

grammatical is
divided
is a direct relationship
dependencies.
Consiistic"I prefer
the
A sentence
that
there
are called
links
assumption These hyper Denver." n mod
sentence.
through

morning
flight
dobj
root case
det

n modi
n subj flight
through Denver

the
moming
Prefer
2.6.3
Fig.
bv d-
are expressed direct
or phraseunit,
the sentens
ntence.
each linguistic of preceding
between the pinnacle
varies
The relationships
the tree "prefer two phrases.
root of between
arcs. The
indicates
the relationship of the noun "Denver
the meaning
A dependencetag changes and Denver
is the kid or
the word "fligh"
For example, is the pinnacle
where flight the nominal modifier.
Denver, stands for
flight by nmod, which two phrases, where
It is represented between the
dependent. for dependency
the scenario
as the dependent.
This distinguishes
and the other
the pinnacle
one serves as

V/s Constituency
a 2.6.5 Dependency
Parsing
Parsing

it is
1deal
into sub-phrases,
a sentence
is the best
to interrupt
the man objective
is

But dependency parsing


meinod

parsing.
implementconstituency sentence
between phrases in a
discovering the dependencies
The
to note the difference es.
Let us consider an example
tree denotes the subdivision
of atext into sub-prabthe
ls are
A constituency parse
of phrases, the
terna
are different sort
tree's non-terminals

sentence's words, and the edges are unlabeled. would

"John sees Bill"


A constituency parse for the simple statement
be!
Ventue

(P8-86) Tech-Neo Publications...A SACHIN


oPU-Sem8-Comp.)Lang.Syntax & Semantics)...Page no. (2-15)
SPPU-Sem8-Con
NLP
Sentence

Verb phrase
Noun phrase

John
Verb Noun phrase

Sees Bill

Fig. 2.6.4

parse links words together based on their connections. Each


A dependency
in coTespondsto a word,
the tree child nodes to words that are reliant
on
vertex
and edges to relationships.
the parent,
parse for "John sees Bill" is as
The dependency
Sees

Subject Object
John Bill

Fig. 2.6.5

One should choose the parser type that is closely related to the objective.

For sub-phrases inside a sentence, then constituency-parse is advisable.

But for the connection between words, then dependency-parse


is more
convenient.

2.7 CocKE-YOUNGER-KASAMI (CYK)ALGORITHM

Grammar implies syntactical rules for conversation in natural language. But in


the theory of formal language, grammar is defined as a set of rules that can

generate strings.
The set of all strings that can be generated from a grammar is called the

language of the grammar.

2.7.1 Context Free Grammar


We have a context free grammar
G = (V,X, R, S)and a string w, where:
) Vis a finite set ofvariables or non-terminal symbols.

(P8-86) Tech-Neo Publications...A SACHIN SHAH Venture


NLP(SPPU Sem8-Comp.) (Lang.Syntax & Semantie

(i) X is a finite set of terminal symbols. antics).Page


no,t
(iii) R is a finite set of rule.

Gv) Sis the start symbol, a distinct element V, and


() V and X are assumed to be disjoint sets.

The Membership problem defined as: Grammar


is
G ger
(G). To check whether the given is a memb
mber of L
string
(G).
xates a

2.7.2 Chomsky Normal Form : (CNF)


angage

Chomsky Normal Form (


)
A context free grammar Gis in

is form if
of the each

ii)
ABC
A
[with at most two non-terminal symbols on R.H.S.
or [one terminal symbol on RHS]
nile
ofc
a,

ii) Snullstring[null stringl

2.7.3 Cocke-Younger-Kasami Algorithm


This solves the membership problem using a dynam
The programming
is based on the principle that the solution to
algorithm prohlaPProa mi.jl
constructed from solution to subproblem [i, k] and solu
solution to canbe
The algorithm requires the grammar G be in Chomsky
to subproblem|
Normal F kj
Observe that any context-free grammar can be converted to
restriction is necessary because each problem can only be CNF
divideTi
subproblems and not more-to bound the time complexity.
into t

2.7.4 How CYK Algorithm Works?


For a string of length N, construct a table T of size x N. Each cell inthe N
etable
Ti, j] is the set of all constituents that can produce the substring
spanningi
position i to j.

The process involves filing the table with the solutions to be subproblens
encountered in the bottom-up parsing process. Therefore, cells will be flai

from left to right and bottom to top.

2 3 4 5
1[1,1] [1,2 [1,3] [1,41 [1,51

2,21 12,3] [2,41 2,5]

3 3,3] 13,4] 3,51

4 14,4) 451
5 I5, 51

(P8-86) Tech-Neo Publications..A SACHIN SHA


-Sem8-Comp.) (Lang. Syntax & Semantics)...Page no. (2-17)
NIP(SPPU-Sem8-C number denotes the start index and the column number
the row

1
the

j
il.
In T the end index.
denotes the phrase, "a very heavy orange
the book
us consider
Let (3) range (4) book (5).
(2) heavy
the

23
table from left to right and bottom as
a
the tabl totop, according to the rules
fill up
We
above:

a very heavy orange book


Det NP NP

Adv AP Nom Nom


very
3 A AP Nom Nom
heavy
4 Nom A, AP Nom
orange
Nom
book

2.8 PROBABILISTICCONTEXT FREE GRAMMAR


(PCFG)

PCFGs extend context-free grammars similar to how hidden Markov models

egular grammars.Each production is assigned a probability.

The probability of a parse (derivation) is the product of the probabilities of the

used in that derivation. These probabilities can be viewed as


productions
of the model.
parameters

2.8.1 Some Important Definitions

Derivation: The of strings from a


() process of recursive generation grammar
(i) Parsing: Finding a valid derivation using an automation.

i) Parse tree: The align ment of the grammar to a sequence.

An example of a parser for PCFG grammars is the pushdown automation.

The algorithm parses grammar nonterminals from left to


right in a stack-like

manner. This brute force is not very efficient.

Another of a PCFG is the standard statistical parser which is


example parser
trained
using Treebank.

PB-86) Tech-Neo Publications...A SACHIN SHAH Venture


&
(Lang.Syntax Semantics)...Pa
NLP (SPPU-Sem8-Comp.) no.(2-1

a2.8.2 Formal Definition of PCFG


G is defined by a quintunle
grammar
context-free
A probabilistic

G (M,T. R, S,P)
Where
symbols
(i) M is the set of non-terminal age
ii) Tis the set ofterminal symbols.

(ii) R is the set of production


rules.
28
(iv) S is the start symbol,
on production rules. Provs

(v) Pis the set ofprobabilities


FOr

hidden Markov Models


2.8.3 Relation with OUl,

the total
of all derivations that are
PCFG model computes probability
consistent
based on some PCFG.
with a given sequence,
the probability of the
to
PCFG generating the sequenoo
This is equivalent Itisa
the sequence with the given grammar.
measure of the consistency of
variants of the CYK
algorithm find the Viterhi noo
Dynamic programming parse of

PCFG model. The


for a parse is the most likely onof
a RNA sequence
derivation

the by the given PCFG.


sequence
AD

2.8.4 Viterbi PCFG Parsing


PCFG is a bottom-up that uses dynamic programming to find
Viterbi parser
the single most likely parse for a text.
It
parses texts by iteratively filling
in a most likely constituents table. This table
a2
records the most likely tree structure for each span
and node value.
Shi

2.8.5 How aPCFG Differs from CFG ? Re

A PCFG differs from a CFG by augmenting each rule with a


conditond pOT

probability: A >B P]. Here P expresses the probability that non-terminu a


will be expanded to sequence B.
Associate a probability with each rule. W
grammar

2.8.6 How PCFG Resolves Ambiguity ?


wilh

PCFG parsers resolve ambiguity by preferring constituents (andparse


paise
tree)

the highest probability.

2.8.7 How does PCFG is used ?


The PCFG is used
to predict the prior ofthe the
probability distribution
hm and
whereas posterior probabilities are estimated by the inside-outside algoriu
most likely structure is found by the CYK algorithm.

Ventun

(P8-86) Tech-NeoPublications...A SACHIN SHAH


(Lang.Syntax &Semantics)..Page no. (2-19)
SPPU-Sem8-Com
MLP(
aWhat are Limitations of PCFG?
2.8.8
lexical information into account.
do not take
PCFGs less than ideal.
plausibility
parse
I makes biases, the
thave certain
probability of a smaller tree
certa i.e.,
have a
PCFGs
is
greater than
tree.
larger

2.8.9 Are CFGs Ambiguous?


in a PCFG can be seen as a
Probabilities filtering mechanism.
sentence, the trees
Eor an ambiguous bearing maximum probability are singled
Out,
while all others are discarded.

The level of ambiguity 1S related to the size of the


singled out set of trees.

W 2.9 SHIFT REDUCE PARSER

Shift-Reduce parser attempts for the construction of parse in a similar manner as


is done in bottom-up parsing, 1.e. the parse tree is constructed from leaves
to the root (up).
(bottom)
A more general form of the shift-reduce parser is the LR parser.
This parser requires some data structures i.e.

) An input buffer for storing the input string.


) A stack for storing and accessing the production rules.

2.9.1 BasicOperations
)
()
Shift: This involves
Reduce : moving symbolsfrom the input buffer onto the stack.
the handle appears on
If
top of the stack then, its reduction by using
appropriate production rule is done. It means that RHS of a production rule is
popped out of a stack and LHS of a production rule is
pushed onto the stack.
() Accept:If only the start symbol is present in the stack and the input buffer is
empty, then the parsing action is called accept.
When accepted action is obtained, it implies that successful parsing is done.
iv) Error :This is the situation where the
parser can
i) neither perform shift action

)
i) nor reduce action

not even
accept action.
and

2.9.2 Shift Reduce Parsing in Computer


parser is a type of Bottom-up parser. It generates the parse Tree
nut reduce
from leaves to the Root.
In Snirt
reduce parser, the input string will be reduced to the starting symbol.

(P8-86) Publications..A SACHIN


Tech-Neo SHAH Venture
NLP (SPPU-Sem8-Comp.) (Lang.Syntax &Semantics)
This reduction can be
produced by handling the rightmost Page
from starting symbol
no.
i.e. tothe input string. rivar -20
in
2.9.3 Why Bottom-up Parser is called sh. revere

Parser ? Shift

Bottom-up parsing is also called shift-and-reduce Reduce


parsing
the next token, reduce nere
means that a substring matching the shift
A. right side
ofamean
ans
rea

2.9.4 What arethe 2 Conflicts in Shift


production

Red..
Parser ? duce
In shift reduce parsing, there are two types of
conflicts:
) Shift-reduce (SR) conflict and
(ii) Reduce-reduce conflict (RR)
For example, if a programming language contains a
terminal for ths
word "while", only one entry for "while" can exist in the
the state.
A shift-reduce action is caused when the system does
reserved

not know if
'reduce' for a if to
to
given token. 'shift
or

2.9.5 Example
Ex. 2.9.1: Consider
the grammar

E2 E2,
Perform shift-reduce
E>3 E3, E>4
parsing for input string "32423".

Soln.
Stack Input Buffer Parsing Action
32423 $
shift
3 2423 $ shift

$ 32 423 $ shift

$ 324 23 $ Reduce by E>4


$ 32 E 23 $ shift

$ 32 E2 3$ Reduce by E> 2 E2
$ 3E 3$ shift

| $ 3 E3

SE
$ Reduce by

Accept
E3 E3

(P8-86) L Tech-Neo Publications...A SACHIN AH Venture


epPU-Sem8-Comp.
U-Sem8-Comp
(SPPU (Lang.Syntax & Semantics)..Page no.
NLP (2-21
DOWN PARSER EARLY
TOP Do
TOP
2.1 PARSER
Early Parser 1s an
algorithm for
parsing strings that belong to a given
context-free
language.

upon the variants, it


epending may suffer with certain nullable
proble
grammars
uses dynamic
The algorithm programming.
used for
It is mainly parsing incomputational
linguistics.
EarleyParser
Class Passing grammars that are context-free

Data structure String

Worst-case performance

Best-caseperformance
2 (n) for all

2(n) for
deterministic context-free

unambiguousgrammars
grammars

Average performance e (n)

2.10.1 Functioning of Earley Parser


Early parsers are appealing because they can all
parse
context-free languages.
The early parser executes in cubic time in the
general case ), where n is the O (n
of the parsing string,
length quadratic time for
unambiguousgrammars O (nD.
and linear time for all deterministic context free
grammars.
It performs well when the rule, are written
left-recursively.

a 2.10.2 Earley Recogniser


The following algorithm describes the Earley recognizer.
The recognizer can be modified to create a parse tree as it recognizes, and in

that way it can be turned into a parser.

2.10.3 The Algorithm


Here, o, B and y represent any string of terminals/nonterminals including the
empty string), X and Y represent single nonterminals, and a represents a
terminal symbol.
Earley's algorithm is a top-downdynamicprogramming algorithm.

Here, we use Earley's dot notation: given a production aß the notation X- X


B represents a condition in
which a
has already been parsed and ß is

expected.
is the position
Input position O is the position prior to input. Input positionn
th
after
accepting the n token.

SACHIN SHAH Venture


(P8-86) Tech-Neo Publications...A
NLP (SPPU-Sem8-Comp.) (Lang.Syntax & Semantics)..,Pa
For every input
position,
then parser generates a Page
no.
)
X -B),consisting of
state set 1

h state is
the
production currently being matched (X-a8) a
(i) the current in that tunle
position production (represented by
the
(ii) the
position i in the input at which the matching
the origin
of the
position.
The ction
state set at input position K is called S (K). The parser is began
consisting of only the top-level rule. seeded
The parser then repeatedly executes three operations
completion prediction,

,
6) Scanning
S (K)ofthe form (X
Prediction: For every state in
V and
is the origin position as above), add

production in the grammar with Y


(Y
on the
y.
left-hand
K), to S j).
K)for
(where

) side (Y y) even
Scanning:If ais the next symbol in the
input stream, for
(K) of the form (X
S(K+ 1).
a a5, add (X
ever.

a B)
e $ in

ii) Completion For to


every state in S (K)ofthe form
(Y
> >y,
y
S G) of the form (X
states

(K).
in
Yß, i) and add (X
YB,
aY. j),
find
i)
to s
The Y, O) ends S (n), where
algorithm accepts
top-level-rule and n is
if(X
the input length, otherwise
up in
it
(x> Y)isthe
rejects.

2.11 PREDICTIVEPARSER

Predictive parser is another method that implementsthe technique of Top-doun


parsing without Backtracking.
A predictive parser is an effective
technique of executing
recursive-descent
parsing by managing the stack of activation records.

3 2.11.1 Predictive Parser Components


Predictive parsers has the
following components:
G) Input Buffer: The input buffer includes the string to be parsed following
by an
end marker $ to denote the
end of thestring

Inputstring
Here a, +,b are terminal symbols.

(i) Stack: It contains a combination


of grammarsymbols with $ on the bottom
the stack.
lowed
At the of parsing, the stack contains the
symbol of grammar 10u
start
start

by $.

(P8-86) d Tech-NeoPublications... A SACHIN SHAH Ve


enture
NLP (SPPU-Sem8-Comp.) (Lang.Syntax & Semantics)..Page no. (2-23)
Input buffer

S Predictive parser

Stack
S program Output

a I+bs
B Parsing table

Predictive parser

Fig. 2.11.1

gii) Parsing Table : It is a


nonterminal
two-dimensional
and 'a' is a terminal array or Matrix M [A, a] where A is
symbol.
All the terminals are
written
row-wise.
column-wise, and all the non-terminals are
written

(iv) Parsing program: The parsing program performs some action


the symbol on top of the stack and the current by comparing
input buffer. input symbol to be read on the

(vAction : Parsing
program takes various actions
of the upon the
the top stack and the current depending symbol on
input symbol.

2.11.2 Algorithm to Construct Predictive


Parsing
Table
Input:Context free grammar G.

Output: Predictive M

)
parsing table
Method: For the
production A a ofgrammar G.
For each terminal a in FIRST
(o)add A> a to M [A,a]
e
ii) If

M
is

[A,b].
in FIRST (), and b is in FOLLOW (A), then add A >a to

(ii) If e is in FIRST and $ is in FOLLOW


(a), (A), then add A a
M [A,$]
to

iv) All remaining entries in Table M are errors.

(P8-86) ATech-Neo Publications...A SACHIN SHAH Venture


(Lang.Syntax& Semantics
NLP (SPPU-Sem8-Comp.) symbols a0e
Terminal no.

Non-terminal
symbols
B -M(B,b)

M(C, +)

Fig. 2.11.2

the steps to perform predictiveparsing


mention below
We
Recursion
)Elimination of left
i) Left Factoring
FOLLOW.
and
ii) Computation ofFIRST
Parsing Table
iv) Construction ofPredictive
(v) Parse the Input string.

a 2.11.3 What are the Steps for Predictive


Parsing ?
are:
parsing
Preprocessing steps forpredictive
the left recursion from the grammar.
i) Removing
on the resultant grammar.
(i) Performing left factoring
from the grammar
(iii) Removing ambiguity

2.11.4 Is Predictive Parser Top-down ?


is a recursive descent parser with no backtracking or backup,
A predictive parser

It is a top-down parser that does not require backtracking.

choice of the rule to be expanded is made upon the nett


At each step, the

terminal symbol.

2.11.5 The Difference between Predictive Parser


and Recursive Descent Parser
The main difference between recursive descent parser and predictive pabct
while preicu
or may not require backtracking
that recursive descent parsing may
parsing does not require any backtracking.

a 2.11.6 Drawbacks of Predictive Parsing


Drawbacks or disadvantages of predictiveparser are: e stack

)t is
inherently
a recursive parser, so it consumesa lot of memorya
grows.

SHAHVentue
(P8-86) Tech-Neo Publications.. SACHIN
-Comp.)
(SPPU-Sem8-
(Lang. Syntax & no. (2-25)
ntion
Doing optimisati may not be as Semantics)...Page

weuse simpleasthe comp


this
i) To remove recursion, of
mplexity grammar grows.
LL-parser,
which uses a table for lookup
2.11.7 What is
Recursive
Predictive parser 1sS a Predictive Parsing
recursive
aredict which descent
production is to be used parser,
which has the to
The to capabilny
predictive parser does replace the
not input string.
tasks, the suffer from
redictive
parser uses a backtracking. To its
input symbols. look-ahead accomplish
pointer, which
ich points to the next

2.12
INTRODUCTION TO
SEMANTIC ANALYSIS
ALYSIS
Semantic Analysis is the
process of
computers to finding the
understand and meaning from text.
It can direct
documents, by interpret
analysing sentences,
their
paragraphs, or whole
relationshipsbetween grammatical
individual structure, and
Thus the aim of words of the identifying the
semantic sentence in a
analysis is particular
meaning from the text. to draw
exact
context

The meaning or
dictionary
purpose of asemantic
The most analyser is to check
the textfor
importanttask of semantic
meaningfulness.
sentence. For analysis is to
example, analyse the getthe proper
the sentence "Govind meaningof the
speaker is talking about is
Lord great". In this sentence,
Govind. Govind or about a
person whose name is

2.12.1 Use of Semantic


Analysis
-
GQ Where is
semanticanalysis used ?
Semantic analysis is used in
human level extracting important
information from
accuracy from the achieving
It is used computers.
in tools like
machine translations, chatbots, search
analysis. engines and text

2.12.2 Syntactic and Semantic


Analysis
GQ What is
syntactic and semantic analysis ?

Theoretically, syntactic analysis determines and checks


whether the instance of
the
language is 'well formed' and analyses its
grammatical structure.
Semanticanalysis analyses its
meaning and finds out whether it 'makessense
Syntactic analysis depends on the types of words, but not on their meaning.

(P8-86) Tech-Neo Publications..A SACHIN SHAH Venture


&
(Lang.Syntax emantics)....Pag
NLP (SPPU-Sem8-Comp.) no.
in Natural 2-28
Semantic Analysis tural La
2.12.3 Language
Processing
a subfield of
NLP and Machine Learn:.
is
Semantic analysis the emotione riest
text and makes one realise notions
the context of any clear
inherentin
the
sentence information from achievin
in extracting important ving
This helps human
from the computers.
accuracy

in Syntactic Anak.
2.12.4 Steps to be Carried ysis
boundaries and word boundaries
1) I:Identify clause
Segmentation
parts of speech.
(2) Classification I: Determine
II:Identify constituents.
(3)
4)
Segmentation
Classification II :Determine

the grammatical
the syntactic categories for the constiha

functions of the constituents.


ents.

(5) Determine

2.13 MEANING REPRESENTATIONN


of the meaning of a sentence is created by semantic alysis.
Representation
related to
To understand the concept and approaches meaningrepresentation we
blocks' of semantic system.
first make the idea of 'building

2.13.1 Building Blocks of SemanticSystem

)
In representation of the meaning

(ii)
example, Haryana, Kejari
Concepts:
of words, the following blocks are used

Entities: It represents the individual,


wal, Pune are
e.g particular person, location, etc. Fr
all entities.

This represents the general category


of the individuals such as a
person, nation
etc.

between entities and concept is represented.


(i) Relations: Here relation

For example, Lata Mangeshkar was a singer.

(iv) Predicates: the verb structures.


Itrepresents
of predicates.
For example, case grammar and semantic roles are the examples
the building
Now, it is clear, how the meaning representation combines together
blocks of semantic systems.
relation and predicates to describe a situation.
It puts together entities, concepts,
It enables the reasoning about the semantic world.

2.13.2 Approaches to Meaning Representations


Approaches used by semantic analysis for the representation of meaning
) First order predicate logic (FOPL)
(i) Semantic Nets

(P8-86) LTech-Neo Publications..A SACHIN SHAH V


U-Sem8-Comp (Lang. Syntax no.
&Semantic Page (2-27)
NLP

(
Frames
(i) (CD)
eptual dependency
(i) - based architecture
Rule
Grammar
i) Case Graphs
Conceptual
(171)

3.3 Need of
2.13.3 Meaning Representations
a mention below the reasons to show the need of
meaning representation.
inking of linguistic elements to
i) non-linguistic elements
1inking oflinguisnc Iements tO the
non-linguistic elements can be done
meaning representation. using

Representing Variety atLexical Level


Ticing meaning representation, unambiguous canonical forms can be
at the lexical level. represented

fi) It can be used for Reasoning


Tisingmeaning representation, one can reason out
by verifying the 'truth' in the
world and also infer the knowledge from the semantic
representation.

2.14 LEXICAL SEMANTICS


Lexical semantics is a part of semantic
analysis. It studies the meaning of
individual words. That includes words, subwords, affixes
(sub
-
units), compound
words and phrases.

Allthe words, sub words etc. are collectively called lexical items.

Thus lexical semantics is the relationship between lexical items, meaning of


sentences and syntax of sentences.

) Thesteps
Classification

lexical
involved

semantics.
in lexical
of lexical items
semanticsare as follows:
like words, sub-words, affixes etc. is preformed in

etc. is
items like words, sub-words, affixes, preformed
(11) Decomposition of lexical
in lexical semantics.
between various lexical semantic
(11) Analyse the differences and similarities

structures.

2.15 LEXICAL CHARACTERISTICS


It is a way of analysing and
as lexical approach.
Lexical characteristics, such of lexical units rather than
idea that it is made up
based on the
Cacning language fixed phrases.
Eammatical structures.
The units are words and

SACHIN SHAH Venture


Publications..A
ATech-Neo
(PB-86)
NLP(SPPU-Sem8-Comp.) (Lang.Syntax &Semantics).
age
2.15.1 Advantages of Lexical Approach no.
(2-
The great advantage of the lexical approach is that it is cons.
encourages the process of noticing of the lexical items. And this
And this iousness
is
and
preliminary step when dealing with new vocabulary. the
raising

2.15.2 Main Features of Lexical Unit ndamenta

The lexical unit can be (i) a single word, the na -


two words.
(11)
bitual
bitual e
co -
occurrencace
Second and third notion refers to the definition of a of
collocatin
word unit.
It is common to consider a single word as a lexical unit. multi-

2.15.3 Limitation of the Lexical


Approach
What is a limitation of the Lexical Approach ?
While the lexical
approach can be a quick way for students to
does not produce much creativity. pickup
un nh
phrases, it
It can have the
negative side effect of limiting people's responses
to saf.
phrases. Since they don't have to build responses, fixed
they don't need to
lear
intricacies of language. the

2.15.4 Principle ofLexical Approach


--
GQ What is the principle of Lexical
Approach?
The basic principle of lexical is
approach "Language is grammaticalised lexis,
not lexicalised grammar".
In other words, lexis is central in
creating meanings, grammar plays a subsidiary
managerial role.

2.16 CORPUS STUDY


Corpus study is corpus linguistics and is rapidly growing methodology that uses

the statistical analysis of large collections of written or spoken data to

investigate linguistic phenomena.


is the language that is text corpus, its body of
Corpus linguistics expressed in its
real world" text.
able
Corpus study maintains that a reliable
analysis of a language is more practicani
with corpora, that is collected in the natural context of that
language.
The text -corpus method uses the body of texts written in any natural languas
It derives the set of abstract rules which
governthat language.

enture
(P8-86) Tech-Neo Publications...A. SACHIN SHAR
em8-Comp.)
NLP (SPPU-Sen Syntax & Semantics)... Pageno.(2
results can be used to find the
The relationships between thesubject languag
and these other languages which have been
undergone a similar analysis.
not only
Corpora have be used for
linguistics research, they have also been used
to form dictionaries (e.g The American
Heritage Dictionary of the English
in 1969), and
Language grammar guides, such as A ComprehensiveGrammar
of the English Language,Published in 1985.

2.16.1 Methods of Corpus Study


Corpus study has generated a number of research
methods, and they try to trace
a path from data to Theory
Wallis and Nelson introduced the 3A
Abstraction and perspective. They are: Annotation,
Analysis.

() Annotation
Annotation consists of the
applications of a scheme to texts.
Annotation may include structural
make-up, part-of-speech tagging. parsing.
and numerous other representation.

(ii)
Abstraction

It consists of translation of terms in the scheme to terms in a


theoretically
motivated model or dataset.
Abstraction typically includes
linguist-directed search but may include
e.g.
rule-learning for parsers.

(ii) Analysis

Analysis consists of statistically probing, manipulating and generalising from


the dataset.
Analysis may include statistical
evaluations, optimisation of the rule
-bases or knowledge discovery methods. Most lexical corpora today are part -

of-speech -
tagged (POS tagged).
But even corpus who work with 'unannotated
linguists plain text' also apply
some method to isolate salient terms.

In this situation, annotation and abstractions are combined in a lexical search.


The main advantage of publishing an annotated corpus is that other users can

perform experiments on the corpus.

Linguists with differing perspectives and other interests than the originator's can

exploit this work.


By sharing data, corpus linguists are able to treat the corpus as a locus of

linquistic debate and further study.

(P8-86) Tech-Neo Publications...A


SACHIN SHAH Venture
NLP (SPPU-Sem8-Comp.) (Lang.Syntax & Semantics)
Page
2.16.2 Corpus Approach no.
(2-30

GQ. What
---a corpus Approach?
is

-
The corpus utilizes a large and
Approach principled collectie
Occurring texts as the basis for analysis. lection of
The characteristic of the corpus approach refers to the naturaly
corpus itselc
One may work, with a written corpus, a
spoken corpus,
rpus, an
an
corpus, etc academi C
spoke
a2.16.3 Corpus Linguistic Techniques

GQ What are corpus linguistic techniques ?


In corpus common
linguistics, analytical techniques are
dispersion
clusters,keywords, concordance and collocation.
frequency
This part mentions how these techniques can contribute to
uncovering
practices uncovering
discourse

2.16.4 Corpus Example

GQ. What is a corpus example ?


An example of a general corpus is the British National
Corpus. Some co
contain texts that are chosen from a Corpora
particular variety of a language.
For example,from a
particular dialector from a particular subject area.
These corpora are sometimescalled 'sublanguage corpora'.

2.17 LANGUAGE DICTIONARY LIKE WORLDNET

A dictionary is a listing of lexemes from the lexicon of one or more


specifc
languages. are arranged
They For ideographic
alphabetically. languages by
radical and stroke.

They include information on definitions, usage, etymologies, pronunciations,


translation etc.

It is a lexicographical reference that shows interrelationships among the data

A clear distinction is made between general and specialized dictionaries

dictionaries include a
Specialized words in
specialized fields, rather than

complete range of words in the language.

Lexical items that describe s


concepts in specific fields are usually callea
i
instead of words.
In theory, general dictionaries are word
supposed to be semasiological, mappiu
to definition, while specialized dictionaries are
supposedto be onomasiolog

enture
(PB-86) Tech-Neo Publications...A SACHIN SHAH Ve
(SPPU-Sem8-Con (Lang. Syntax&
NLP Semantics)..Pageno.(2-31)
first iden
lentify
concepts and
They then
In practice,the two establishing the terms used to designate
approaches are used for
both types.
There are other
types of dictionaries that
do not fit into the above
For example, bilingual distinction.
(translation)
(the saurs) and rhyming dictionaries
ries, dictionaries of synonyms
dictionaries.
The word dictionary 1s
usually meant to refer to
dictionary. a general purpose monolinguaal
There is also a difference
between
The prescriptive dictionary Prescriptive and
reflects what descriptive dictionaries.
is seen as
The descriptive reflects recorded correct use of the
actual language.
use,
Stvlistic indications (e.g.
'informal' or
are also considered by some
to be less
'vulgar)in many modern
than dictionaries
objectively descriptive.

2.17.1 Types of Dictionaries


In a general dictionary, each word may have multiple meanings. Some dictionaries include each separate meaning in the order of most common usage while others simply list definitions.
In many languages, words can appear in many different forms, but only the undeclined or unconjugated form appears as the headword in most dictionaries.
Dictionaries are commonly found in the form of a book. But some dictionaries, like New Oxford American Dictionary, are dictionary software running on computers. Many online dictionaries are also available via the internet.

2.17.2 Specialised Dictionaries


According to Manual of Specialised Lexicographies, a specialised dictionary is also referred to as a technical dictionary.
It focuses on a specific subject field.
Lexicographers categorise specialised dictionaries into three types:
(i) A multi-field dictionary. It covers several subject fields, e.g. a business dictionary.
(ii) A single-field dictionary. It covers one particular subject field (e.g. law), and
(iii) A sub-field dictionary. It covers a more specialised field (e.g. constitutional law).

The 23-language Inter-Active Terminology for Europe is a multi-field dictionary. The American National Biography is a single-field dictionary. The African American National Biography project is a sub-field dictionary.
Another variant is the glossary, an alphabetical list of defined terms in a specialised field, such as medicine (medical dictionary).

2.17.3 Defining Dictionaries


A defining dictionary provides a core glossary of the simplest meanings of the simplest concepts.

From these, other concepts can be explained and defined.

In English, the commercial defining dictionaries include only one or two meanings of under 2000 words. With these, the 4000 most common English idioms and metaphors can be defined.

2.17.4 Historical Dictionaries

A historical dictionary is a specific kind of descriptive dictionary. It describes the development of words and senses over time, using citations to original source material to support its conclusions.
Dictionaries for Natural Language Processing

Dictionaries for Natural Language Processing (NLP) are built to be used by computer programs. The direct user is a program, even though the final user is a human being. Such a dictionary does not need to be printed on paper.
The structure of the content is not linear, ordered entry by entry, but has a complex network form.
Since most of these dictionaries control machine translation or cross-lingual information retrieval (CLIR), the content is usually multilingual and usually of huge size.

To allow formalized exchange and merging of dictionaries, an ISO standard called Lexical Markup Framework (LMF) has been defined and is used among the industrial and academic community.

2.18 BABELNET DICTIONARY

BabelNet is a multilingual lexicalised semantic network.
BabelNet was automatically created by linking the most popular computational lexicon of the English language, WordNet, to Wikipedia.
The integration is done using an automatic mapping and by filling in lexical gaps in resource-poor languages by using statistical machine translation.
The result is an encyclopaedic dictionary. It provides concepts and named entities that are lexicalised in many languages and connected with large amounts of semantic relations.

Additional lexicalisations and definitions are added by linking to free-licence wordnets. Similar to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets.
BabelNet provides short definitions (called glosses) in many languages, taken from WordNet.

Fig. 2.18.1: BabelNet
Stable release : BabelNet 5.0 / February 2021
Operating system : Virtuoso Universal Server, Lucene
Type : Multilingual encyclopedic dictionary, linked data
License : Attribution Non-Commercial Share-Alike 3.0 Unported
Website : babelnet.org

2.18.2 Statistics of BabelNet

BabelNet (Version 5.0) covers 500 languages. It contains almost 20 million synsets and around 1.4 billion word senses.
Each Babel synset contains 2 synonyms per language, i.e. word senses, on average.
Version 5.0 also associates around 51 million images with Babel synsets and provides a Lemon RDF encoding of the resource, available via a SPARQL endpoint. 2.67 million synsets are assigned domain labels.

2.18.3 Applications

BabelNet has been shown to enable multilingual Natural Language Processing applications.
The lexicalised knowledge available in BabelNet has been shown to obtain state-of-the-art results in:
(i) Semantic relatedness
(ii) Multilingual Word Sense Disambiguation
(iii) Multilingual Word Sense Disambiguation and Entity Linking with the Babelfy system
(iv) Video games with a purpose.

2.19 RELATIONS AMONG LEXEMES AND THEIR SENSES

We have seen that semantic analysis can be divided into the following two parts:
(1) In the first part of the semantic analysis, the study of the meaning of individual words is performed. This part is called lexical semantics.
(2) In the second part, the individual words will be combined to provide meaning in sentences.

2.19.1 Important Elements of Semantic Analysis


We mention below some important elements of Semantic Analysis.

(1) Hyponymy
It is defined as the relationship between a generic term and instances of that generic term.
The generic term is called hypernym and its instances are called hyponyms.
As an example, the word colour is a hypernym and the colours red, green etc. are its hyponyms.
(2) Homonymy
It is defined as words having the same spelling or same form but different and unrelated meanings.
For example, the word "Bat" is a homonymy word because a bat can be used to hit a ball, and a bat is a flying mammal also.

(3) Polysemy
Polysemy means "many signs". It is a Greek word.
It is a word or phrase with different but related senses. Polysemy has the same spelling but different and related meanings.
For example, the word "bank" is a polysemy word with the following different meanings:
(i) A financial institution
(ii) The building in which such an institution is located
(iii) A synonym for "to rely on".

(4) Difference between Polysemy and Homonymy

Sr. No. | Polysemy | Homonymy
I | It has the same spelling or syntax. | It also has the same spelling or syntax.
II | The meanings of the word are related. | The meanings of the words are not related.
III | For example, the meanings of the word "Bank" are related. | But for the word "Bank" we can write the meanings as a financial institution or a river bank. Here the meanings are not related, so it is an example of Homonymy.

(5) Synonymy
It is a relation between two lexical items having different forms but expressing the same or a close meaning.
Examples are 'author/writer', 'fate/destiny'.

(6) Antonymy
It is a relation between two lexical items possessing symmetry between their semantic components relative to an axis.
The scope of antonymy is as follows:
(i) Application of property or not: Example is 'life/death', 'certitude/incertitude'.
(ii) Application of scalable property: Example is 'rich/poor', 'hot/cold'.
(iii) Application of a usage: Example is 'father/son', 'moon/sun'.
A small WordNet sketch of these relations is given below.
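These lexical relations are encoded directly in WordNet and can be inspected programmatically. The following is a minimal sketch, assuming NLTK and its WordNet data are installed (pip install nltk, then nltk.download('wordnet')); the particular synset names are chosen for illustration.

from nltk.corpus import wordnet as wn

# Hyponymy / hypernymy: instances of the generic term 'colour'.
colour = wn.synset('color.n.01')
print([s.name() for s in colour.hyponyms()][:5])

# Polysemy / homonymy: 'bank' has several senses, some related, some not.
for sense in wn.synsets('bank')[:4]:
    print(sense.name(), '-', sense.definition())

# Synonymy: lemmas that share a synset are synonyms ('writer' / 'author').
print(wn.synset('writer.n.01').lemma_names())

# Antonymy: antonyms are stored on lemmas; the list may be empty for a sense.
for lemma in wn.synset('life.n.01').lemmas():
    print(lemma.name(), '->', [a.name() for a in lemma.antonyms()])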

2.19.2 Ambiguity and Uncertainty in Language

Ambiguity refers to 'double meaning'. Ambiguity in natural language processing refers to the ability of being understood in more than one way.
We note that ambiguity is the capability of being understood in more than one way. Obviously, natural language is very ambiguous.
We discuss various types of ambiguities in NLP.

(i) Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, the word walk can be used as a noun or a verb.

(ii) Syntactic Ambiguity
When the sentence is parsed in different ways, this type of ambiguity occurs.
For example, the sentence "The man saw the girl with the camera" is ambiguous: whether the man saw the girl with the camera, or he saw the girl taking photos.

(iii) Semantic Ambiguity
When the meaning of the words can be misinterpreted, such kind of ambiguity occurs. In short, semantic ambiguity occurs when a sentence contains an ambiguous word or phrase.
For example, the sentence "The bike hit the pole when it was moving" has semantic ambiguity.
The interpretation can be done as "The bike, while moving, hit the pole" and "The bike hit the pole while the pole was moving".
(iv) Anaphoric Ambiguity
This type of ambiguity arises due to the use of anaphora entities in discourse.
For example: The horse ran up the hill. It was very steep. It soon got tired.
Here, the anaphoric reference of "it" in the two situations causes ambiguity.

(v) Pragmatic Ambiguity


When the context of a phrase gives multiple interpretations to the situation, this kind of ambiguity arises. Thus, when the statement is not specific, pragmatic ambiguity arises.
For example, the sentence "I like you too" can have multiple interpretations:
(i) I like you (just as you like me),
(ii) I like you (just like someone else does).

2.20 WORD SENSE DISAMBIGUATION (WSD)

To realise the various usage patterns in the language is important for Natural Language Processing applications.
Word Sense Disambiguation is an important method of NLP, by which the meaning of a word can be determined as it is used in a particular context.
The main problem of NLP systems is to identify words properly and to determine the specific usage of a word in a particular sentence.
Word Sense Disambiguation resolves the ambiguity that arises while determining the meaning of the same word when it is used in different situations.

2.20.1 Word-Sense Disambiguation Applications

We mention below the various applications of WSD in various text processing and NLP fields.
(i) WSD can be used in Lexicography. Much of the modern lexicography is corpus-based. WSD used in Lexicography can provide significant textual indicators.
(ii) WSD can also be used in text mining and Information Extraction tasks. It can be used for the correct labelling of words, because the main aim of WSD is to understand the meaning of a word accurately in a particular sentence.
(iii) From a security point of view, a text system should understand the difference between a coal "mine" and a land "mine".
(iv) We note that the former serves industrial purposes, the latter is a security threat. Hence a text-mining application must be able to determine the difference between the two.
(v) WSD can be used for Information Retrieval purposes. Information Retrieval systems work through text data, and retrieval is based on textual information. Hence knowing the relevance of using a word in any sentence helps.

2.20.2 Challenges in Word Sense Disambiguation


WSD faces many challenges and problems.
(i) The difference between various dictionaries or text corpora is the most common problem. Different dictionaries give different meanings for words. That makes the sense of the words to be perceived differently.
(ii) A lot of text information is available and it is not possible to process everything properly. Different algorithms are to be formed for different applications and that becomes a big challenge for WSD.
(iii) Words cannot be divided into discrete meanings. They have related meanings, and this causes a lot of problems.


2.20.3 Relevance of WSD

Word Sense Disambiguation is related to parts of speech tagging and is an important part of the whole Natural Language Processing process.
The main problem that arises in WSD is the whole meaning of word sense. Word sense is not a numeric quantity that can be measured as true or false, and it cannot simply be denoted by 1 or 0.
The meaning of a word is contextual and depends on its usage.
Lexicography deals with generalising the corpus and explaining the full and extended meaning of a word. But sometimes these meanings fail to apply to the algorithms or data.
But WSD has immense applications and uses. If a computer algorithm can just read a text and come to know the different uses of a text, it will indicate vast improvement in the field of text analytics.

2.21 KNOWLEDGE BASED APPROACH

A knowledge-based system behaviour can be designed in the following approaches:
(1) Declarative Approach
In this approach, starting from an empty knowledge base, the agent can TELL sentences one after another. This is to be continued till the agent has knowledge of how to work with its environment. It stores required information in an empty knowledge-based system. This is known as the declarative approach.
(2) Procedural Approach
In this method, it converts required behaviour directly into program code in an empty knowledge-based system. Compared to the declarative approach, it is a contrast approach. Here, a coding system is designed.

2.21.1 Lesk Algorithm

The Lesk algorithm is based on the assumption that words in a given 'neighbourhood' will tend to share a common topic.
In a simplified manner, the Lesk algorithm is to compare the dictionary definition of an ambiguous word with the terms contained in its neighbourhood. Versions have been adapted to use WordNet. An implementation appears like this:
1. For every sense of the word being disambiguated, one should count the number of words that are in both the neighbourhood of that word and in the dictionary definition of that sense.

2. The sense that is to be chosen is the sense that has the largest count.
We consider an example illustrating this algorithm, for the context "pine cone".
Dictionary definitions are:
PINE
1. Kind of evergreen tree with needle-shaped leaves.
2. Waste away through sorrow or illness.
CONE
1. Solid body which narrows to a point.
2. Something of this shape whether solid or hollow.
3. Fruit of certain evergreen trees.
We note that the best intersection is pine#1 ∩ cone#3 = 2.
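The counting step above can be written out directly. The following minimal sketch reproduces the pine/cone computation with plain set intersection over the glosses quoted above; the naive whitespace tokenisation is an assumption of the sketch.

PINE = {1: "kind of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

def overlap(gloss_a, gloss_b):
    # Number of distinct words shared by the two glosses.
    return len(set(gloss_a.split()) & set(gloss_b.split()))

best = max(((p, c, overlap(PINE[p], CONE[c])) for p in PINE for c in CONE),
           key=lambda t: t[2])
print(best)   # (1, 3, 2): pine#1 and cone#3 share 'evergreen' and 'of'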

2.21.2 Simplified Lesk Algorithm


In the simplified Lesk algorithm, the correct meaning of each word in a given context is determined by noting down the sense that has the most overlap between its dictionary definition and the given context.
Instead of collectively determining the meanings of all words in a given context, this approach takes into account each word individually, independent of the meaning of the other words occurring in the same context.
A comparative evaluation has shown that the simplified Lesk algorithm can outperform the original definition of the algorithm, both in terms of precision and efficiency. Evaluating the disambiguation algorithms on the Senseval-2 English all-words data, they measure 58% precision using the simplified Lesk algorithm compared to only 42% under the original algorithm.
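NLTK ships an implementation of this simplified variant. A short usage sketch, assuming NLTK and its WordNet data are installed:

from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, 'bank')        # returns the best-overlapping synset
print(sense, '-', sense.definition() if sense else 'no sense found')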

2.21.3 Limitations of Lesk-Based Methods


(i) Lesk's approach is very sensitive to the exact wording of definitions.
(ii) The absence of a certain word can change the results considerably.
(iii) The algorithm determines overlaps only among the glosses of the senses being considered.
(iv) Dictionary glosses are fairly short and do not provide sufficient vocabulary to relate sense distinctions. This is a significant limitation.
Different modifications of this algorithm have appeared. These works use other resources for analysis (thesauruses, synonym dictionaries, or morphological and syntactic models). For example, it may use such information as synonyms, different derivatives, or words from definitions of words from definitions.
Vent
(PB-86) Tech-Neo Publications..A SACHIN SHAH
NLP (SPPU-Sem8-Comp.) no. (2-39)
(Lang. Syntax & Semantics)..Page
Lang.
2.22 DICTIONARIES FOR REGIONAL LANGUAGES
(1) Hindi is the official language of India, while English is the second language. But there is no national language as per the constitution.
(2) The Oxford dictionary is one of the most famous English language dictionaries in the world. It has many extra features that augment it as a tool for language learners, like the ability to make notes on definitions and spellings, a flashcard learning system and a great thesaurus.
(3) Hindi is the official language. In addition to the official language, the constitution recognises 22 regional languages as scheduled languages, which do not include English.
(4) The Sanskrit language is the oldest language in India. Sanskrit has been spoken since 5000 years before Christ. Sanskrit is still an official language of India. But, in the present time, Sanskrit has become a language of worship and ritual instead of a language of speech.
(5) There are 22 official regional languages in India. They are: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santhali, Sindhi, Tamil, Telugu and Urdu.
(6) The youngest language in India is Malayalam. It belongs to the Dravidian language group and is considered the smallest and the youngest language of the Dravidian language group. The Government of India declared it a 'classical language of India' in 2013.
(7) Currently six languages enjoy the 'classical status': Tamil (declared 2004), Sanskrit (2005), Kannada (2008), Telugu (2008), Malayalam (2013) and Odia (2014).

2.23 DISTRIBUTIONAL SEMANTICS


(i) Distributional semantics is an important area of research in natural language processing. It aims to describe the meaning of words and sentences with vectorial representation. These vectorial representations are called distributional representations.
(ii) Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorising semantic similarities between linguistic items based on their distributional properties in large samples of language data.
(iii) The aim of distributional semantics is to learn the meanings of linguistic expressions from a corpus of text. The core idea, known as the distributional hypothesis, is that the contexts in which an expression appears give us information about its meaning.
(iv) Distributional evidence in linguistics
The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to have similar meanings.


(v) The underlying idea that 'a word is characterised by the company it keeps' was popularised here also.
(vi) Distributional structure
The distribution of an element will be understood as the sum of all its environments. An environment of an element A is an existing array of its co-occurrents, i.e. the other elements, each in a particular position, with which A occurs to yield an utterance.
(vii) Distributional properties
There are three basic properties of a distribution: location, spread and shape. The location refers to the typical value of the distribution, such as the mean. The spread of the distribution is the amount by which smaller and larger values differ.
(viii) Semantic criteria
A verb's meaning has to do with events. Correspondingly, we can say that a noun denotes an entity, adverbs modify events and so on. One can call this the classification of words on the basis of their meaning: a semantic criterion.
The sketch below illustrates the distributional idea on a toy corpus.
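The distributional hypothesis can be demonstrated with a tiny sketch: represent each word by the counts of its neighbours and compare the resulting vectors with cosine similarity. The four-sentence corpus and the window size are illustrative assumptions.

from collections import Counter
import math

corpus = ["the cat drinks milk", "the dog drinks water",
          "the cat chases the dog", "a dog drinks milk"]

def context_vector(target, window=2):
    # Count the words occurring within 'window' positions of the target.
    vec = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# 'cat' and 'dog' occur in similar contexts, so their vectors are close.
print(cosine(context_vector("cat"), context_vector("dog")))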

2.24 TOPIC MODELS

Topic modelling is recognising the topics from the words present in a document or the corpus of data.
This is useful because extracting the words from a document takes more time and is much more complex than extracting the topics present in the document.
For example, suppose there are 1000 documents and 500 words in each document. To process this it requires (500)(1000) = 500000 threads. But if we divide the documents into groups containing certain topics, and if there are 5 topics present, the processing is just 5(500) = 2500 threads.
This appears simpler than processing the entire document, and this is how topic modelling has come up to solve the problem and also to visualise things better. A short sketch using a standard library follows.
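One standard realisation of topic modelling is Latent Dirichlet Allocation. A minimal sketch, assuming scikit-learn is installed; the four toy documents and the choice of two topics are assumptions for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the stock market fell sharply today",
        "investors fear the market and rising prices",
        "the team won the cricket match",
        "the batsman scored a century in the match"]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Each of the 2 latent topics is a weight distribution over the vocabulary.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]          # 4 highest-weight words
    print("topic", k, ":", [vocab[i] for i in top])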

2.25 LATENT SEMANTIC ANALYSIS (LSA)

Latent semantic analysis is a natural language processing method that uses a statistical approach to identify the association among the words in a document.
LSA deals with the following kind of issue.
Example: mobile, phone, cell phone, telephone are all similar, but if we say "The cell phone is ringing", then the documents which have "cell phone" are retrieved, whereas the documents containing mobile, phone, telephone are not retrieved.


Assumptions of LSA
(1) The words which are used in the same context are analogous to each other.
(2) The hidden semantic structure of the data is unclear due to the ambiguity of the words chosen.
Singular Value Decomposition (SVD)

SVD is the statistical method that is used to find the latent (hidden) semantic structure of words spread across the document.
Let C = collection of documents,
d = number of documents,
n = number of unique words in the collection,
M = the d x n word-to-document matrix.
The SVD decomposes the matrix M into three matrices as follows:
M = U Σ V^T
where
U = distribution of words across the different contexts,
Σ = diagonal matrix of the association among the contexts,
V^T = distribution of contexts across the different documents.
A small sketch follows.
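The decomposition can be sketched directly with NumPy (assuming NumPy and scikit-learn are installed). The toy documents below are assumptions for illustration; the point is that documents about phones land close together in the truncated latent space even when they share few surface words.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cell phone is ringing",
        "my mobile phone battery died",
        "the telephone rang twice",
        "stocks fell on the market today"]

vec = CountVectorizer()
M = vec.fit_transform(docs).toarray().astype(float)   # documents x words

# M = U diag(s) V^T ; keep k = 2 latent dimensions (the "concepts").
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
doc_vectors = U[:, :k] * s[:k]      # document coordinates in concept space
print(np.round(doc_vectors, 2))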

2.26 SELF-LEARNING TOPICS

2.26.1 Dictionary look-up


Dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are all written just once while creating the dictionary.
A dictionary contains a list of single-word terms or multi-word terms. These terms represent instances of a single concept.
For example, there might be a list of countries to extract the concept country.
In addition to the base form for each term, the dictionary can contain several variants of the base form. Each term includes a unique number that is called term ID.
The type of annotations that are created for dictionary entries within text have the same name as the dictionary.

2.26.2 Detail Explanation of Dictionary Lookup

Morphological parsing is a process by which word forms of a language are


associated with corresponding linguistic descriptions. Morphological systems
that
specify these associations by merely enumerating them case by case do not
offer any generalization means. Likewise for systems in which analyzing a word

form is reduced to looking it up verbatim in word lists, dictionaries, or databases, unless they are constructed by and kept in sync with sophisticated models of the language.
In this context, a dictionary is understood as a data structure that directly enables obtaining some precomputed results, in our case word analyses. The data structure can be optimized for efficient lookup, and the results can be shared. Lookup operations are relatively simple and quick. Dictionaries can be implemented, for instance, as lists, binary search trees, tries, hash tables, and so on.
trees, can
Because the set of associations between word forms and their descriptions is declared by plain enumeration, the coverage of the model is finite and the generative potential of the language is not exploited. Developing as well as verifying the association list is tedious, liable to errors, and likely inefficient and inaccurate unless the data are retrieved from reliable linguistic resources.
Despite all that, an enumerative model is often sufficient for the given purpose, deals easily with exceptions, and can implement even complex morphologies. For instance, dictionary-based approaches to Korean [35] depend on a dictionary of all possible combinations of allomorphs and morphological alternations. These approaches do not allow development of reusable morphological rules, though [36].
The word list or dictionary-based approach has been used in ad hoc implementations for many languages. We could assume that, with the availability of immense online data, extracting a high-coverage vocabulary of word forms is feasible these days [37]. The question remains how the word forms and the associated annotations are constructed, and how informative and accurate they are. References to the literature on the unsupervised learning and induction of morphology, which are methods resulting in structured and therefore non-enumerative models, are provided later in this chapter.

Example for dictionaries

The following table shows the contents of the dictionary Countries:

Base form | Variant
Sri Lanka | Ceylon
Germany | BRD
United States | US; USA

As Sri Lanka was known as Ceylon, the following annotations are created:

Annotation 1:
Type: Countries
Base form: Sri Lanka
Id: 1
begin:
end: 9
covered text: Sri Lanka

Annotation 2:
Type: Countries
Base form: Sri Lanka
Id: 1
begin: 20
end: 26
covered text: Ceylon

Annotations include the following features:

Base form : The base form of the recovered variants
id : The ID of the dictionary entry
begin : The offset that indicates the begin of the covered text
end : The offset that indicates the end of the covered text
Covered text : The variant of the base form that is found in the text

One can also import dictionaries in the Design Studio dictionary XML format, or dictionaries that are compatible with the LanguageWare dictionary-resource format.
One can use dictionaries with the Dictionary Lookup operator. These dictionaries might contain more than one annotation type, and the features for these annotation types might vary from type to type. A minimal lookup sketch is given below.
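A dictionary-lookup operator of this kind can be sketched in a few lines. The variant-to-base-form table reuses the Countries example above; a production matcher would also respect token boundaries, which this sketch ignores.

COUNTRIES = {"Sri Lanka": "Sri Lanka", "Ceylon": "Sri Lanka",
             "Germany": "Germany", "BRD": "Germany",
             "US": "United States", "USA": "United States"}

def lookup(text):
    annotations = []
    for variant, base in COUNTRIES.items():
        start = text.find(variant)
        while start != -1:
            annotations.append({"type": "Countries", "base_form": base,
                                "begin": start, "end": start + len(variant),
                                "covered_text": variant})
            start = text.find(variant, start + 1)
    return sorted(annotations, key=lambda a: a["begin"])

for ann in lookup("Sri Lanka was called Ceylon."):
    print(ann)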

2.26.3 Finite State Morphology


By finite-state morphological models, we mean those in which the specifications written by human programmers are directly compiled into finite-state transducers. The two most popular tools supporting this approach, which have been cited in literature and for which example implementations for multiple languages are available online, include XFST (Xerox Finite-State Tool) [9] and LexTools [11].
Finite-state transducers are computational devices extending the power of finite-state automata. They consist of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols. In such a network or graph, nodes are also called states, while edges are called arcs.

Traversing the network from the set of initial states to the set of final states

along the arcs is equivalent to reading the sequences of encountered input


symbols and writing the sequences of corresponding output symbols.

The set of possible input sequences accepted by the transducer defines the input language; the set of possible output sequences emitted by the transducer defines the output language. For example, a finite-state transducer could translate the infinite regular language consisting of the words vnuk, pravnuk, prapravnuk, ... to the matching words in the infinite regular language grandson, great-grandson, great-great-grandson, ...
The role of finite-state transducers is to capture and compute regular relations on sets [38, 9, 11]. That is, they specify relations between the input and output languages. In fact, it is possible to invert the relation, that is, exchange the input and the output.
In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to the output descriptions as lexical strings if the transducer is used for morphological analysis, and vice versa if it is used for morphological generation.
Morphology is a domain of linguistics that studies the formation of words. It is traditional to distinguish between surface forms and their analyses, called lemmas.
The lemma for a surface form such as the English word bigger might be represented as big+Adj+Comp, to indicate that bigger is the comparative form of the adjective big.
In modelling natural language morphology, we come across two challenges:
1. Morphotactics
Words are typically composed of smaller units of meaning, called morphemes. The morphemes that make up a word must be combined in a certain order: piti-less-ness is a word in English but piti-ness-less is not.
Most languages build words by concatenation, but some languages also exhibit non-concatenative processes such as interdigitation and reduplication.

2. Morphological Alternations
The shape of a morpheme often depends on the environment: pity is realised as piti in the context of less, die as dy in dying.

The basic claim of the finite-state approach to morphology is that the relation between the surface forms of a language and their corresponding lemmas can be described as a regular relation.
If the relation is regular, it can be defined using the metalanguage of regular expressions. Then, with a suitable compiler, the regular expression source code can be compiled into a finite-state transducer that implements the relation computationally.
In the resulting transducer, each path (= sequence of states and arcs) from the initial state to a final state represents a mapping between a surface form and its lemma, known as the lexical form.
For example, the comparative of the adjective big is bigger, which can be represented in an English lexical transducer by the path in Fig. 2.26.1, where the 0's are epsilon symbols.
Fig. 2.26.1: A path in a transducer for English

If the symbols in a pair are identical, the pair, such as b:b, is reduced to a single symbol. In standard notation, the path in Fig. 2.26.1 is labelled as big 0:g 0:e +Adj:0 +Comp:r.
Lexical transducers may contain hundreds of thousands, even millions, of states and arcs, and an infinite number of paths in the case of languages such as German whose grammar allows noun compounds of any length.
The regular expressions from which such complex networks are compiled include high-level operators. These are developed to make it possible to describe the constraints and alternations that are commonly found in natural languages in a convenient and perspicuous way. A small sketch of reading off such a path appears below.
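To make the path notation concrete, the following toy sketch stores the Fig. 2.26.1 arcs as (lexical, surface) symbol pairs with 0 as epsilon, and reads off either side. It is an illustration of the path only, not the XFST tool.

PATH = [("b", "b"), ("i", "i"), ("g", "g"),
        ("0", "g"), ("0", "e"), ("+Adj", "0"), ("+Comp", "r")]

def read_side(path, index):
    # Concatenate one side of the arc labels, skipping epsilon ('0').
    return "".join(arc[index] for arc in path if arc[index] != "0")

print(read_side(PATH, 0))   # big+Adj+Comp  (lexical string)
print(read_side(PATH, 1))   # bigger        (surface string)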
Basic Regular Expression Calculus

The notation used here comes from the Xerox finite-state calculus. We use uppercase letters A, B, etc. as variables over regular expressions. Lowercase letters a, b, etc. stand for symbols. There are two special symbols: 0, the epsilon symbol, that stands for the empty string, and ?, the any symbol.
Complex regular expressions can be built up from simpler ones by means of regular expression operators. Square brackets, [ ], are used for grouping expressions.
Because both regular languages and regular relations are closed under concatenation and union, the following basic operators can be combined with any kind of regular expression:
A | B   union
A B     concatenation
(A)     optionality; equivalent to [A | 0]
A+      iteration; one or more concatenations of A
A*      Kleene star; equivalent to (A+)
Regular languages are closed under complementation, subtraction, and intersection, but regular relations are not. Hence, the following operators can be combined only with a regular language:
~A      complement
\A      term complement; all single-symbol strings not in A
A & B   intersection
A - B   subtraction (minus)
Regular relations can be constructed by means of two basic operators:
A X B   cross product

A O B   composition
The cross product operator, X, is used only with expressions that denote a regular language: it constructs a relation between them. [A X B] designates the relation that maps every string of A to every string of B. The notation a:b is a convenient shorthand for [a X b].

Remarks
(1) Replacement and marking expressions in regular expressions have turned out to be very useful for morphology, tokenization and shallow parsing.
(2) Descriptions consisting of regular expressions can be compiled into finite-state networks. They can be determinised and minimised to reduce the size of the network.
(3) They can also be sequentialised, compressed, and optimised in other ways to increase the application speed.
(4) Regular expressions have clean semantics.
(5) They constitute a kind of high-level programming language for manipulating strings, languages and relations, which is clean and declarative.

(6) Regular languages and relations can be encoded as finite automata; they can be manipulated more easily than context-free and more complex languages.
(7) With new regular-expression operators, regular languages and relations can be derived directly without mentioning the new grammar rules. This is a fundamental advantage over other higher-level formalisms.
2.26.4 Noisy Channel Models
(1) The noisy channel model is a framework used in natural language processing (NLP) to identify the correct word in situations where it is unclear. The framework helps detect intended words for spell checkers, virtual assistants, question answering systems, translation programs and speech-to-text software. A minimal sketch of the idea is given below.
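A hedged sketch of the idea for spelling correction: choose the candidate w maximising P(w) * P(x | w). Here P(w) is a tiny made-up unigram language model and the channel model is crudely approximated with a string-similarity score from the standard library; both are assumptions for illustration only.

import difflib

LM = {"the": 0.07, "then": 0.005, "than": 0.004, "them": 0.003}   # P(w)

def channel(x, w):
    # Higher similarity -> w more likely to have been corrupted into x.
    return difflib.SequenceMatcher(None, x, w).ratio()

def correct(x):
    return max(LM, key=lambda w: LM[w] * channel(x, w))

print(correct("thw"))   # 'the': frequent and one substitution away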
(2) Difference Between Noisy Channel and Noiseless Channel
The capacity of a noiseless channel is numerically equal to the rate at which it communicates binary digits. The capacity of a noisy channel is less than this because it is limited by the amount of noise in the channel.
Noiseless channel means that there will not be any kind of disturbance in the path when data is carried forward from sender to receiver.

(3) Capacity of noisy and noiseless channel
The channel capacity is directly proportional to the power of the signal. The signal-to-noise ratio is:
SNR = (Power of signal) / (Power of noise)
A signal-to-noise ratio of 1000 is commonly expressed in decibels as:
SNR_dB = 10 log10(1000) = 10 [3 log10(10)] = 10 (3 x 1) = 30 dB

What is the maximum data rate of a noisy channel?
The SNR is calculated as the ratio of signal power to noise power. Since SNR is the ratio of two powers that varies over a very large range, it is often expressed in decibels, called SNR_dB, and calculated as:
SNR_dB = 10 log10(SNR)
With these channel characteristics, the channel can never transmit much more than 13 Mbps, no matter how many or how few signal levels are used and no matter how often or how infrequently samples are taken.
Example: A telephone line normally has a bandwidth of 3000 Hz (300 Hz to 3300 Hz) assigned for data communication.

(4) Noiseless Channel
An idealistic channel in which no frames are lost, corrupted or duplicated. The protocol does not implement error control in this category.
(5) Various edit distances
(1) In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g. words) are to one another by counting the minimum number of operations required to transform one string into the other.

(2) The maximum edit distance between any two strings (even two identical ones) is infinity, unless we add some restrictions on repetitions of edits. In spite of that, there can be an arbitrarily large edit distance with any arbitrarily large character set.

(3) The minimum edit distance between two strings is defined as the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another.

(4) Operations in edit distance
Most commonly, the edit operations for this purpose are:
(i) insert a character into a string,
(ii) delete a character from a string, and
(iii) replace a character of a string by another character.
For these operations, edit distance is sometimes called Levenshtein distance.
(5) The normalised edit distance is one of the distances derived from the edit distance. It is useful in some applications because it takes into account the lengths of the two strings compared. A small dynamic-programming sketch follows.
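A minimal dynamic-programming sketch of the minimum (Levenshtein) edit distance, using the three operations above at unit cost:

def edit_distance(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                 # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                 # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # replace or match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))   # 3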
