Natural Language Processing
• Natural Language Processing (NLP) refers to the AI method of
communicating with an intelligent system using a natural language
such as English.
• Processing of natural language is required when you want to communicate
with an intelligent system, such as a dialogue-based clinical expert system,
in natural language.
• The field of NLP involves making computers perform useful tasks
with the natural languages humans use. The input and output of an
NLP system can be:
– speech
– written text
Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
• To process spoken language, we need everything required to process written
text, plus the challenges of speech recognition and speech synthesis.
NLP - an inter-disciplinary Field
• NLP borrows techniques and insights from several disciplines.
• Linguistics: How do words form phrases and sentences? What constrains the
possible meanings of a sentence?
• Computational Linguistics: How is the structure of sentences identified?
How can knowledge and reasoning be modeled?
• Computer Science: Algorithms for automata and parsers.
• Engineering: Stochastic techniques for ambiguity resolution.
• Psychology: What linguistic constructions are easy or difficult for people to
learn to use?
• Philosophy: What is meaning, and how do words and sentences acquire it?
Why is NL Understanding hard?
• Natural language is extremely rich in form and structure, and very
ambiguous.
– How should meaning be represented?
– Which linguistic structures map to which meaning structures?
• One input can mean many different things. Ambiguity can be at different
levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect the meaning of that sentence.
• Many inputs can mean the same thing.
• Interaction among components of the input is not clear.
Knowledge of Language
• Phonology – concerns how words are related to the sounds that realize
them.
• Morphology – concerns how words are constructed from more basic
meaning units called morphemes. A morpheme is the primitive unit of
meaning in a language.
• Syntax – concerns how words can be put together to form correct sentences and
determines what structural role each word plays in the sentence and what
phrases are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings combine in
sentences to form sentence meaning. It is the study of context-independent
meaning.
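The morphology level above can be illustrated with a minimal sketch of affix stripping. The prefix and suffix lists are tiny hypothetical examples, not a real lexicon, and a real morphological analyzer would also apply spelling rules.

```python
# Toy affix lists; these are illustrative assumptions, not a real lexicon.
PREFIXES = ["un", "re"]
SUFFIXES = ["ness", "able", "ing", "ed", "s"]

def segment(word):
    """Split a word into prefix, stem and suffix morphemes by greedy affix stripping."""
    before, after = [], []
    # Strip at most one prefix, keeping a stem of at least three letters.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            before.append(prefix + "-")
            word = word[len(prefix):]
            break
    # Strip suffixes repeatedly until none applies.
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                after.insert(0, "-" + suffix)
                word = word[:-len(suffix)]
                stripped = True
                break
    return before + [word] + after

print(segment("unbreakable"))  # ['un-', 'break', '-able']
print(segment("walked"))       # ['walk', '-ed']
```

Greedy stripping over-segments some words (for example, it would treat the "re" in "reading" as a prefix), which is one reason practical systems consult a lexicon of known stems.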
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences
affect the interpretation of the next sentence. For example,
interpreting pronouns and interpreting the temporal aspects of the
information.
• World Knowledge – includes general knowledge about the world, including
what each language user must know about the other’s beliefs and
goals.
Components of NLP
• Natural Language Understanding (NLU)
Understanding involves the following tasks:
– Mapping the given input in natural language into useful
representations.
– Analyzing different aspects of the language.
• Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the
form of natural language from some internal representation.
It involves:
– Text planning − retrieving the relevant content from the
knowledge base.
– Sentence planning − choosing the required words, forming
meaningful phrases, and setting the tone of the sentence.
– Text realization − mapping the sentence plan into sentence structure.
• NLU is generally harder than NLG.
• Natural Language Understanding
– Mapping the given input in the natural language into a useful
representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal
representation.
– Different levels of synthesis are required:
deep planning, syntactic generation
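The three NLG stages (text planning, sentence planning, text realization) can be sketched as three small functions. This is a toy illustration under stated assumptions: the knowledge base, its field names, and the fixed subject-verb-object template are all hypothetical.

```python
# Hypothetical knowledge base; the facts and field names are assumptions.
KNOWLEDGE_BASE = [
    {"subject": "john", "relation": "loves", "object": "mary"},
    {"subject": "mary", "relation": "owns", "object": "duck"},
]
PROPER_NAMES = {"john", "mary"}

def plan_text(topic):
    """Text planning: retrieve the facts that mention the topic."""
    return [f for f in KNOWLEDGE_BASE if topic in (f["subject"], f["object"])]

def plan_sentence(fact):
    """Sentence planning: choose the words for each part of the fact."""
    obj = fact["object"]
    # Proper names are capitalized; common nouns get an indefinite article.
    noun_phrase = obj.capitalize() if obj in PROPER_NAMES else "a " + obj
    return {
        "subject": fact["subject"].capitalize(),
        "verb": fact["relation"],
        "object": noun_phrase,
    }

def realize(plan):
    """Text realization: map the sentence plan onto a subject-verb-object sentence."""
    return f"{plan['subject']} {plan['verb']} {plan['object']}."

def generate(topic):
    return " ".join(realize(plan_sentence(f)) for f in plan_text(topic))

print(generate("mary"))  # John loves Mary. Mary owns a duck.
```

Real generators replace each stage with something much richer, but the division of labor between selecting content, choosing words, and producing surface text is the same.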
Natural language understanding
Raw speech signal
• Speech recognition
Sequence of words spoken
• Syntactic analysis using knowledge of the grammar
Structure of the sentence
• Semantic analysis using info. about meaning of words
Partial representation of meaning of sentence
• Pragmatic analysis using info. about context
Final representation of meaning of sentence
Natural Language Understanding
Each row shows the data entering a stage; the first column of the next row is that stage's output.

Input/Output data                          Processing stage     Other data used
Frequency spectrogram                      speech recognition   frequencies of different sounds
Word sequence: “He loves Mary”             syntactic analysis   grammar of the language
Sentence structure of “He loves Mary”      semantic analysis    meanings of words
Partial meaning: ∃x loves(x, mary)         pragmatics           context of the utterance
Sentence meaning: loves(john, mary)
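The stages after speech recognition can be sketched as a chain of functions that turns “He loves Mary” into loves(john, mary). This is a minimal sketch, not a real parser: it assumes every sentence is exactly subject, verb, object, and the `context` dictionary that supplies the referent of the pronoun is a hypothetical stand-in for discourse context.

```python
PRONOUNS = {"he", "she"}

def syntactic_analysis(words):
    """Assume a fixed subject-verb-object pattern (a toy grammar)."""
    subject, verb, obj = words
    return {"subject": subject, "verb": verb, "object": obj}

def semantic_analysis(structure):
    """Build a predicate; a pronoun becomes the unbound variable x."""
    def term(word):
        return "x" if word.lower() in PRONOUNS else word.lower()
    return (structure["verb"], term(structure["subject"]), term(structure["object"]))

def pragmatic_analysis(partial, context):
    """Bind x to the referent supplied by the context of the utterance."""
    predicate, *args = partial
    return (predicate, *[context["x"] if a == "x" else a for a in args])

def understand(sentence, context):
    structure = syntactic_analysis(sentence.split())
    partial = semantic_analysis(structure)
    return pragmatic_analysis(partial, context)

print(understand("He loves Mary", {"x": "john"}))  # ('loves', 'john', 'mary')
```

The intermediate value `('loves', 'x', 'mary')` corresponds to the partial meaning in the table, where the referent of “He” is still unknown.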
Difficulties in NLU
• NL has an extremely rich form and structure, and it is very ambiguous. There
can be different levels of ambiguity −
• Lexical ambiguity − It occurs at the most primitive level, the word level.
• For example, should the word “board” be treated as a noun or a verb?
• Syntax-level ambiguity − A sentence can be parsed in different ways.
• For example, “He lifted the beetle with the red cap.” − Did he use the cap to lift the
beetle, or did he lift a beetle that had a red cap?
• Referential ambiguity − It is unclear what a pronoun refers to. For
example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
• One input can have several different meanings.
• Many inputs can mean the same thing.
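Syntactic ambiguity can be made concrete by counting how many parse trees a grammar assigns to a sentence. The sketch below uses a CKY-style chart over a tiny hand-written grammar in Chomsky normal form. The grammar and lexicon are assumptions for illustration, and the example sentence is simplified to “he lifted the beetle with the cap”.

```python
from collections import defaultdict

# Toy lexicon and binary rules (parent, left child, right child); assumptions.
LEXICON = {
    "he": ["NP"], "lifted": ["V"], "the": ["Det"],
    "beetle": ["N"], "cap": ["N"], "with": ["P"],
}
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP modifies the action (he used the cap)
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP modifies the noun (the beetle has the cap)
    ("PP", "P", "NP"),
]

def count_parses(words, root="S"):
    """Count the parse trees for words using a CKY chart."""
    n = len(words)
    # chart[i][j][A] = number of ways symbol A derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for symbol in LEXICON[word]:
            chart[i][i + 1][symbol] += 1
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right in RULES:
                    chart[start][end][parent] += (
                        chart[start][mid][left] * chart[mid][end][right]
                    )
    return chart[0][n][root]

print(count_parses("he lifted the beetle with the cap".split()))  # 2
print(count_parses("he lifted the beetle".split()))               # 1
```

The two parses differ only in where the prepositional phrase attaches, which is exactly the choice between the two readings of the beetle sentence.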
Ambiguity is pervasive
Find at least 5 meanings of this sentence:
I made her duck
– I cooked duck for her
– I cooked the duck belonging to her
– I created the (artificial) duck she owns
– I caused her to quickly lower her head or body
– I waved my magic wand and turned her into a duck
‘Duck’ can be a noun or a verb.
‘Her’ can be a possessive (‘of her’) or dative (‘for
her’) pronoun.
Steps in NLP
• There are generally five steps −
• Lexical Analysis − The lexicon of a language is its vocabulary, that is, its words
and expressions. Lexical analysis involves identifying, analyzing, and describing the
structure of words. It includes dividing a text into paragraphs, sentences, and
words. Individual words are analyzed into their components, and
non-word tokens such as punctuation marks are separated from the words.
• Syntactic Analysis (Parsing) − Syntax refers to the principles and
rules that govern the sentence structure of an individual language. Parsing
involves analyzing the words in a sentence for grammar and arranging the
words in a manner that shows the relationships among them. A
sentence such as “The school goes to boy” is rejected by an English syntactic
analyzer.
• Semantic Analysis − Semantic analysis assigns meanings to the structures
created by the syntactic analyzer. This component maps
linear sequences of words into structures that show how the words are
associated with each other. Semantics focuses only on the literal
meaning of words, phrases, and sentences; it abstracts only the
dictionary meaning, or the real meaning, from the given context.
• E.g., “colorless green idea” would be rejected by the semantic
analyzer because “colorless green” does not make any sense. The
semantic analyzer also disregards phrases such as “hot ice-cream”.
• Discourse Integration − The meaning of a sentence
depends upon the meaning of the sentence just before it. In
addition, it also influences the meaning of the immediately
succeeding sentence.
• Pragmatic Analysis − During this step, what was said
is re-interpreted in terms of what was actually meant. It involves
deriving those aspects of language which require real-world
knowledge.
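The first step, lexical analysis, can be sketched with two regular expressions: one that divides text into sentences and one that separates words from punctuation. Both patterns are rough heuristics chosen for illustration; practical tokenizers handle abbreviations, numbers, and many other cases.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Separate words from punctuation; each punctuation mark is its own token."""
    # A word may contain internal hyphens or apostrophes, as in "ice-cream".
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

print(split_sentences("The boy goes to school. He likes hot ice-cream!"))
# ['The boy goes to school.', 'He likes hot ice-cream!']
print(tokenize("The boy goes to school."))
# ['The', 'boy', 'goes', 'to', 'school', '.']
```

The resulting token lists are what a syntactic analyzer would take as input in the next step.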
Natural Language vs. Computer Language
Parameter     Natural Language                                     Computer Language
Ambiguity     They are ambiguous in nature.                        They are designed to be unambiguous.
Redundancy    Natural languages employ lots of redundancy.         Formal languages are less redundant.
Literalness   Natural languages are made of idiom and metaphor.    Formal languages mean exactly what they say.
Advantages of NLP
• Users can ask questions about any subject and get a direct response within
seconds.
• An NLP system provides answers to questions in natural language.
• An NLP system offers exact answers to questions, without unnecessary or
unwanted information.
• The accuracy of the answers increases with the amount of relevant information
provided in the question.
• NLP helps computers communicate with humans in their language and
scales other language-related tasks.
• NLP can impose structure on a highly unstructured data source.
Disadvantages of NLP
• Complex query language: the system may not be able to
provide the correct answer if the question is poorly
worded or ambiguous.
• The system is built for a single, specific task only; it is
unable to adapt to new domains and problems because of
its limited functions.
• An NLP system may lack a user interface with
features that allow users to interact further with the
system.
NLP Applications
Two main areas:
1. Massive management of textual information
sources:
– For human use
– For automatic collection of linguistic resources
2. Person/Machine interaction
NLP Applications (cont.)
Massive management of textual information
sources
– Machine Translation (MT)
– Information Retrieval (IR)
– Question Answering (Q&A)
– Information Extraction (IE)
– Summarization
Machine Translation
The process of translating a text from a source
language to a target language while preserving some
properties
– The main property to preserve (but not the only
one) is the meaning
– MT can be textual or oral
– Different degrees of human intervention are possible
Information Retrieval
Input
– A collection of documents (the Web, a corporate document collection, ...)
– A user need represented as a query
Output
– The documents of the collection that satisfy the
user need.
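The input/output contract above can be sketched as a function that takes a query and a document collection and returns the matching documents, best first. TF-IDF weighting is one common scoring choice and is an assumption here, as are the function names and the whitespace tokenization.

```python
import math
from collections import Counter

def terms_of(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def rank(query, documents):
    """Return (score, index) pairs for matching documents, best first."""
    tokenized = [terms_of(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    results = []
    for index, doc in enumerate(tokenized):
        counts = Counter(doc)
        score = sum(
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in terms_of(query)
            if term in counts
        )
        # A term found in every document has weight 0 and does not count as a match.
        if score > 0:
            results.append((score, index))
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))

docs = [
    "machine translation preserves meaning",
    "question answering returns a fact",
    "information retrieval returns documents",
]
print([index for _, index in rank("returns fact", docs)])  # [1, 2]
```

Document 1 ranks first because it matches both query terms and “fact” is rarer in the collection than “returns”.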
Question Answering
• Natural extension of IR
• A QA system receives a query expressed in NL and
tries to provide not a document containing the
answer but the proper answer (usually a fact).
• QA systems need to use NLP techniques for both
processing the question and looking for the answer.
Automatic Summarization
• A summary is a reductive transformation of a source text into a
summary text by extraction or generation
• Look for the relevant parts of a document and produce a summary of
them
Summarization vs Information Extraction
– Information Extraction
What has to be extracted is defined a priori: “I am interested in this,
look for it”
– Summarization
What is relevant is not always defined a priori
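Summarization by extraction can be sketched as scoring each sentence and keeping the best ones. The scoring rule below (average frequency of a sentence's non-stopword terms) is one simple heuristic chosen for illustration, and the stopword list is a small hypothetical sample.

```python
import re
from collections import Counter

# Small illustrative stopword list; an assumption, not a standard resource.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "to", "in", "it", "for"}

def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    best = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    return " ".join(sentences[i] for i in sorted(best[:max_sentences]))

text = "NLP systems process text. NLP systems also process speech. Ducks swim."
print(summarize(text))  # NLP systems process text.
```

Note that nothing tells the function in advance what is relevant; relevance is estimated from the text itself, which is the contrast with information extraction drawn above.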
Information Extraction
• Extracting useful information from free text
• Named Entity Recognition (NER)
• Named Entity Classification (NEC)
• Both tasks together (NERC)
• Slot Filling
• Relation Extraction
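The tasks listed above can be sketched on a single sentence: recognize entity mentions, classify them, and extract one relation. The gazetteer and the single “went to” pattern are hypothetical; real systems typically use trained sequence models instead of lookups and hand-written patterns.

```python
import re

# Hypothetical gazetteer mapping known names to entity classes.
GAZETTEER = {"Rima": "PERSON", "Gauri": "PERSON", "Paris": "LOCATION"}

def recognize_entities(text):
    """NER: find capitalized tokens that are listed in the gazetteer."""
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    return [c for c in candidates if c in GAZETTEER]

def classify_entities(entities):
    """NEC: look up the class of each recognized entity."""
    return [(e, GAZETTEER[e]) for e in entities]

def extract_relations(text):
    """Relation extraction: match the single pattern '<X> went to <Y>'."""
    pattern = re.compile(r"\b([A-Z][a-z]+) went to ([A-Z][a-z]+)\b")
    return [
        ("went_to", x, y)
        for x, y in pattern.findall(text)
        if x in GAZETTEER and y in GAZETTEER
    ]

text = "Rima went to Gauri. She said she was tired."
print(classify_entities(recognize_entities(text)))
# [('Rima', 'PERSON'), ('Gauri', 'PERSON')]
print(extract_relations(text))
# [('went_to', 'Rima', 'Gauri')]
```

“She” is capitalized but is not in the gazetteer, so it is not reported as an entity.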
Natural Language Processing Challenges
• Contextual words and phrases and homonyms
• Synonyms
• Irony and sarcasm
• Ambiguity
• Errors in text or speech
• Colloquialisms and slang
• Domain-specific language
• Low-resource languages
• Lack of research and development
Sentiment Analysis
• Sentiment analysis is used to identify the sentiment expressed across several
posts.
• Companies use sentiment analysis to identify the opinions and
sentiment of their customers online.
• It helps companies understand what their customers think
about their products and services.
• Beyond determining simple polarity, sentiment analysis considers
sentiment in context to better understand what is behind the
expressed opinion.
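Simple polarity, with one small piece of context, can be sketched with a word list and a negation rule. The lexicon is a tiny hypothetical sample, and flipping the sign of a word that directly follows a negation is only a first approximation of handling sentiment in context.

```python
import re

# Tiny hypothetical polarity lexicon.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Sum word polarities, flipping a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

def label(text):
    score = polarity(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(label("The service was great"))    # positive
print(label("The product is not good"))  # negative
print(label("It arrived on Monday"))     # neutral
```

A word-list approach like this misses irony, sarcasm, and domain-specific language, which are among the challenges listed earlier.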