Unit 1: Overview of the field
• Natural Language Processing (NLP) is the branch of
computer science focused on developing systems that
allow computers to communicate with people using
everyday language.
• Also called Computational Linguistics
• Also concerns how computational methods can aid the
understanding of human language
Reference materials
• Lavid, J. (2005). “Lenguaje y nuevas tecnologías: nuevas perspectivas,
métodos y herramientas para el lingüista del siglo XXI”. Madrid: Cátedra.
Chapters 2, 4 & 7.
• Uszkoreit, H. “Language technology: a first overview”.
• http://www.ailia.ca/Introduction+to+language+technologies
• Dale, R. “Language technology: an overview of commercial applications”.
NLP-related areas
• Artificial Intelligence
• Formal Language (Automata) Theory
• Machine Learning
• Linguistics
• Psycholinguistics
• Cognitive Science
• Philosophy of Language
Human Language Technologies
• Human Language Technologies (HLT) are the
application of linguistic knowledge to the development
of computer systems that are able to recognise,
analyse, interpret and generate language.
• Result: machines that behave as if they understood
human language.
Language Technologies
From “Language Technology: a first overview” by Hans Uszkoreit
Principal components in a LT application
INPUT: recognising text / voice / images
LANGUAGE PROCESSING: reasoning about words to get at their meaning
OUTPUT: rendering meaning as text / voice / images
From “Lenguaje y Nuevas Tecnologías” by Julia Lavid
Main applications
Applications: Language Input
• Speech recognition
• Optical character recognition
• Handwriting recognition
Speech recognition
Speech recognition: fielded products
Optical Character Recognition
Menu-translating pens
OCR: fielded products
Handwriting recognition
• Key focus of the technology:
deriving a computer-readable representation of human handwriting
• Applications:
- Forms processing
- Mail routing
- PDAs
Principal components in a LT application
• Language input
(recognising the words)
• Language processing
(reasoning about the words to get at their meaning)
• Language output
(rendering meaning as words)
Applications of Language Processing
• (Spoken) Dialogue Systems
• Search and information retrieval
• Writing assistance
• Machine Translation
• Text summarisation
• Question-answering systems
Spoken Dialogue Systems
• Key focus of the technology:
Natural, voice-interactive dialogues with computer-based systems
• Applications:
Information services: stock quotes, timetables
Transaction services: banking, betting, flight reservations
Spoken dialogue systems:
current state of the art
• Limited transaction and information services
QTAB betting service
American Airlines flight information
Charles Schwab’s stock broking system
• Limited, finite-state notion of dialogue
• Limited natural language understanding
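To make the "finite-state notion of dialogue" concrete, here is a minimal sketch of a system-initiative dialogue that moves through a fixed sequence of states and fills one slot per state. The flight-information task, the state names, the prompts and the `run_dialogue` helper are all hypothetical illustrations, not taken from any of the fielded systems listed above.

```python
# Hypothetical flight-information dialogue: each state has a prompt,
# the slot it fills, and the single state that follows it.
STATES = {
    "ask_origin": ("Which city are you leaving from?", "origin", "ask_destination"),
    "ask_destination": ("Which city are you flying to?", "destination", "confirm"),
}


def run_dialogue(user_turns):
    """Drive the finite-state dialogue over a scripted list of user turns."""
    slots = {}
    state = "ask_origin"
    transcript = []
    turns = iter(user_turns)
    while state != "done":
        if state == "confirm":
            prompt = f"Flights from {slots['origin']} to {slots['destination']}, correct?"
        else:
            prompt = STATES[state][0]
        transcript.append(("system", prompt))
        answer = next(turns, None)
        if answer is None:
            break  # the user stopped responding
        transcript.append(("user", answer))
        if state == "confirm":
            # Anything other than "yes" restarts the dialogue from the beginning.
            state = "done" if answer.strip().lower() == "yes" else "ask_origin"
        else:
            _, slot, next_state = STATES[state]
            slots[slot] = answer.strip()
            state = next_state
    return state, slots, transcript


final_state, slots, _ = run_dialogue(["Austin", "Dallas", "yes"])
print(final_state, slots)
```

The rigidity is the point: the user can only answer the question that the current state asks, which is why such systems are limited to narrow transaction and information services.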
Spoken dialogue systems:
fielded applications
• Speech engine vendors
Nuance (www.nuance.com)
Philips (www.speech.phillips.com)
Spoken Dialogue System Architecture
From “Lenguaje y Nuevas Tecnologías” by Julia Lavid
Dialogue system modules
• Dialogue manager (gestor del diálogo): the central module of the system, which
coordinates the other modules; it uses information from the task model, the user
model and the discourse model.
• Task model (modelo de la tarea): provides information on the tasks that the user
will carry out, e.g. searching for flights, booking a seat, cancelling a reservation.
• User model (modelo del usuario): provides information on the user’s interests,
beliefs, etc.
• Discourse model (modelo del discurso): provides information on the discourse
history, i.e. the entities mentioned in the discourse, etc.
Search and information retrieval
Search and information retrieval:
current state of the art
Grammar and style checking
An example
Machine Translation
Machine translation
• Some systems:
- Browne Global Solution’s Itranslation
- Systran’s web-based Translation
• Ambiguity problems due to broad coverage by mainstream
translation technologies
• Limited to literal language use
• Main approaches:
- Transfer
- Interlingua
- Example-based
- More recently, statistical machine translation
• Real systems are often Machine-Assisted Translation systems
Machine Translation: fielded products
• Systran (used by AltaVista)
www.systran.co.uk
• Language Weaver
www.languageweaver.com
Text summarisation
• Key focus of the technology:
Producing a document that is shorter than the original one
• Applications:
Information browsing
Voice delivery of web pages and email
Text summarisation:
current state of the art
• Commercial systems work on a ‘sentence extraction’
model
• Sentences are extracted on the basis of:
Location
Linguistic cues
Statistical information
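A sentence-extraction summariser along these lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cue phrases, the stopword list, the equal weighting of the three scores and the helper names `split_sentences` and `summarise` are all invented for the example and do not describe any commercial system.

```python
import re
from collections import Counter

# Hypothetical cue phrases and stopwords, chosen only for this example.
CUE_PHRASES = ("in conclusion", "in summary", "importantly")
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it",
             "that", "this", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarise(text, n=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))
    scored = []
    for i, sentence in enumerate(sentences):
        tokens = content_words(sentence)
        # Statistical information: average document frequency of the words.
        stat = sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0
        # Location: favour the opening sentence.
        location = 1.0 if i == 0 else 0.0
        # Linguistic cues: favour sentences containing a cue phrase.
        cue = 1.0 if any(c in sentence.lower() for c in CUE_PHRASES) else 0.0
        scored.append((stat + location + cue, i))
    top = sorted(scored, reverse=True)[:n]
    return [sentences[i] for i in sorted(i for _, i in top)]


text = ("Machine translation converts text between languages. "
        "The weather was nice yesterday. "
        "In summary, machine translation systems translate text automatically.")
print(summarise(text, n=2))
```

Because the output is made of sentences copied verbatim from the source, the summary can read as disconnected; that is a known limitation of the extraction model.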
Text summarisation: fielded products
ProSum Online summariser
Question-answering systems
• Key focus of the technology:
Given a natural language query, produce an appropriate
response
• Applications:
Web-based information services
Desktop help systems
QA Systems: Fielded applications
Applications of LT: language output
• Text-to-Speech
• Document Generation
Text-to-speech
Text-to-speech: fielded applications
• Nuance RealSpeak
• Cepstral
• AT&T Natural Voices
Document Generation
Document Generation: fielded applications
Types of knowledge required
• Depends on the language technology:
- Phonetic and phonological knowledge
- Lexical and morphological knowledge
- Syntactic knowledge
- Semantic knowledge
- Pragmatic/Discourse knowledge
- World knowledge
Traditional NLP Tasks
Preprocessing tasks:
• word boundary detection and segmentation (tokenisation)
• sentence segmentation
Linguistic tasks:
• morphological analysis
• part-of-speech tagging
• phrase chunking
• syntactic parsing
• semantic tasks & pragmatic tasks
Preprocessing tasks
• Word boundary detection
• Word segmentation (tokenisation)
• Sentence segmentation
Word Segmentation (Tokenisation)
- Breaking a string of characters (graphemes) into a sequence of
words.
- In some written languages (e.g. Chinese) words are not separated
by spaces.
- Even in English, characters other than white-space can be used to
separate words [e.g. , ; . - : ( ) ]
Examples from English URLs:
jumptheshark.com ⇒ jump the shark .com
myspace.com/pluckerswingbar
⇒ myspace .com pluckers wing bar
⇒ myspace .com plucker swing bar
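The ambiguity in the second URL can be reproduced with a small dictionary-based segmenter that enumerates every way of splitting an unspaced string into known words. This is a sketch only: the tiny `VOCAB` set and the `segmentations` function are hypothetical, and a real tokeniser would use a large lexicon plus word probabilities to rank the candidates.

```python
# Hypothetical toy vocabulary, just large enough for the two URL examples.
VOCAB = {"jump", "the", "shark", "pluckers", "plucker", "wing", "swing", "bar"}


def segmentations(text, vocab=VOCAB):
    """Return every way of splitting `text` into a sequence of vocabulary words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results


print(segmentations("jumptheshark"))
print(segmentations("pluckerswingbar"))
```

The first string has a single reading, while the second yields both "plucker swing bar" and "pluckers wing bar"; choosing between them needs information beyond the dictionary.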
Sentence segmentation
Language processing tasks
• Morphological analysis
• Part-of-speech tagging
• Phrase chunking
• Syntactic parsing
Morphological Analysis
• Morphology is the field of linguistics that studies the internal
structure of words. (Wikipedia)
• A morpheme is the smallest linguistic unit that has semantic
meaning (Wikipedia)
e.g. “carry”, “pre”, “ed”, “ly”, “s”
• Morphological analysis is the task of segmenting a word into its
morphemes:
carried ⇒ carry + ed (past tense)
independently ⇒ in + (depend + ent) + ly
Googlers ⇒ (Google + er) + s (plural)
unlockable ⇒ un + (lock + able) ?
⇒ (un + lock) + able ?
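A very rough affix-stripping analyser for these examples might look as follows. The prefix, suffix and root lists, the two spelling rules and the `analyse` function are assumptions made up for this sketch. It returns flat morpheme sequences only, so it cannot distinguish the two bracketings of "unlockable"; that would require a grammar over the morphemes.

```python
# Hypothetical toy lexicon covering only the examples above.
PREFIXES = ("un", "in", "pre")
SUFFIXES = ("ed", "ly", "s", "er", "ent", "able")
ROOTS = {"carry", "depend", "google", "lock"}


def base_forms(stem):
    """Candidate dictionary spellings for a stem left after suffix stripping."""
    forms = [stem, stem + "e"]           # googl -> google
    if stem.endswith("i"):
        forms.append(stem[:-1] + "y")    # carri -> carry
    return forms


def analyse(word):
    """Return all flat segmentations of `word` into prefixes + root + suffixes."""
    found = []

    def walk(rest, prefixes, suffixes):
        for form in base_forms(rest):
            if form in ROOTS:
                analysis = prefixes + [form] + suffixes
                if analysis not in found:
                    found.append(analysis)
        for p in PREFIXES:
            if rest.startswith(p) and len(rest) > len(p):
                walk(rest[len(p):], prefixes + [p], suffixes)
        for s in SUFFIXES:
            if rest.endswith(s) and len(rest) > len(s):
                walk(rest[:-len(s)], prefixes, [s] + suffixes)

    walk(word.lower(), [], [])
    return found


for w in ("carried", "independently", "Googlers", "unlockable"):
    print(w, analyse(w))
```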
Part Of Speech (POS) Tagging
• Annotate each word in a sentence with a part of speech.
I    ate  the  spaghetti  with  meatballs.
Pro  V    Det  N          Prep  N
John  saw  the  saw  and  decided  to    take  it   to    the  table.
PN    V    Det  N    Con  V        Part  V     Pro  Prep  Det  N
• Useful for subsequent syntactic parsing and word sense
disambiguation.
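The second example shows why tagging needs context: "saw" and "to" each receive two different tags. A minimal sketch of a lexicon lookup plus two hand-written contextual rules is given below. The `LEXICON`, the rules and the `tag` function are hypothetical and only cover this sentence; practical taggers learn such preferences from annotated corpora.

```python
# Hypothetical lexicon: each word lists its possible tags, most likely first.
LEXICON = {
    "john": ["PN"], "saw": ["V", "N"], "the": ["Det"], "and": ["Con"],
    "decided": ["V"], "to": ["Prep", "Part"], "take": ["V"],
    "it": ["Pro"], "table": ["N", "V"],
}


def tag(sentence):
    """Tag each word using the lexicon and two contextual rules."""
    tokens = sentence.rstrip(".").split()
    tags = []
    for i, word in enumerate(tokens):
        candidates = LEXICON.get(word.lower(), ["N"])  # unknown words default to N
        previous = tags[-1] if tags else None
        following = (LEXICON.get(tokens[i + 1].lower(), ["N"])
                     if i + 1 < len(tokens) else [])
        if previous == "Det" and "N" in candidates:
            choice = "N"                 # rule 1: after a determiner, prefer a noun
        elif "Part" in candidates:
            # rule 2: "to" is an infinitive particle only before a possible verb
            choice = "Part" if "V" in following else "Prep"
        else:
            choice = candidates[0]
        tags.append(choice)
    return list(zip(tokens, tags))


print(tag("John saw the saw and decided to take it to the table."))
```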
Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb
phrases (VPs) in a sentence.
[NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will
narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP
September ]
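Given part-of-speech tags, a simple chunker can group adjacent words by tag pattern. The sketch below is an assumption-laden illustration: the tag names follow the tagging examples in these slides, and the patterns (NP as optional determiner, adjectives, then nouns or pronouns; VP as a run of verbs; PP as a single preposition) and the `chunk` function are invented for this example.

```python
NOUN_TAGS = {"N", "PN", "Pro"}


def chunk(tagged):
    """Group (word, tag) pairs into flat NP / VP / PP chunks; other words get 'O'."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        word, tag = tagged[i]
        if tag in NOUN_TAGS or tag in ("Det", "Adj"):
            j = i + 1 if tag == "Det" else i
            while j < n and tagged[j][1] == "Adj":
                j += 1
            k = j
            while k < n and tagged[k][1] in NOUN_TAGS:
                k += 1
            if k > j:  # found at least one head noun or pronoun
                chunks.append(("NP", [w for w, _ in tagged[i:k]]))
                i = k
                continue
            chunks.append(("O", [word]))  # determiner or adjective with no head
            i += 1
        elif tag == "V":
            k = i
            while k < n and tagged[k][1] == "V":
                k += 1
            chunks.append(("VP", [w for w, _ in tagged[i:k]]))
            i = k
        elif tag == "Prep":
            chunks.append(("PP", [word]))
            i += 1
        else:
            chunks.append(("O", [word]))
            i += 1
    return chunks


sentence = [("I", "Pro"), ("ate", "V"), ("the", "Det"), ("spaghetti", "N"),
            ("with", "Prep"), ("meatballs", "N")]
print(chunk(sentence))
```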
Syntactic Parsing
• Produce the correct syntactic parse tree for a sentence.
Semantic Tasks
Word Sense Disambiguation (WSD)
• Words in natural language usually have a fair number of
different possible meanings.
Ellen has a strong interest in computational linguistics.
Ellen pays a large amount of interest on her credit card.
• For many tasks (question answering, translation), the
proper sense of each ambiguous word in a sentence
must be determined.
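One simple family of approaches picks the sense whose typical context words overlap most with the sentence. The sketch below applies that idea to the two "interest" sentences. The sense labels, the hand-made signature word sets and the `disambiguate` function are hypothetical; a real system would derive signatures from dictionary glosses or sense-tagged corpora.

```python
import re

# Hypothetical signature words for two senses of "interest".
SIGNATURES = {
    "curiosity": {"curiosity", "strong", "study", "subject", "linguistics", "science"},
    "finance": {"money", "loan", "pays", "amount", "credit", "card", "bank"},
}


def disambiguate(sentence, signatures=SIGNATURES):
    """Return the sense with the largest word overlap, plus all overlap counts."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(context & words) for sense, words in signatures.items()}
    # Ties are resolved in favour of the sense listed first.
    return max(scores, key=scores.get), scores


print(disambiguate("Ellen has a strong interest in computational linguistics."))
print(disambiguate("Ellen pays a large amount of interest on her credit card."))
```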
Semantic Role Labeling (SRL)
• For each clause, determine the semantic role played by
each noun phrase that is an argument to the verb.
John drove Mary from Austin to Dallas in his Toyota Prius.
(agent: John; patient: Mary; source: Austin; destination: Dallas;
instrument: his Toyota Prius)
The hammer broke the window.
• Also referred to as “case role analysis,” “thematic
analysis,” and “shallow semantic parsing”.
Semantic Parsing
• A semantic parser maps a natural-language sentence
to a complete, detailed semantic representation
(logical form).
• For many applications, the desired output is
immediately executable by another program.
• Example: Mapping an English database query to
Prolog:
How many cities are there in the US?
answer(A, count(B, (city(B), loc(B, C),
const(C, countryid(USA))),
A))
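To illustrate what "immediately executable" means, the sketch below maps the same question to a small structured form and runs it against a toy database. Everything here is an assumption for illustration: the single regular-expression template stands in for a real semantic parser, and the `CITY_COUNTRY` data, the `ALIASES` table and the `parse` and `execute` functions are invented.

```python
import re

# Hypothetical toy database and country-name aliases.
CITY_COUNTRY = {"Austin": "USA", "Dallas": "USA", "Madrid": "Spain"}
ALIASES = {"us": "USA", "usa": "USA", "spain": "Spain"}

# A single question template; a real semantic parser covers far more language.
PATTERN = re.compile(r"how many cities are there in (?:the )?(\w+)\?", re.IGNORECASE)


def parse(question):
    """Map a question to a logical form such as ('count', 'city', 'USA'), or None."""
    match = PATTERN.fullmatch(question.strip())
    if match is None:
        return None
    country = ALIASES.get(match.group(1).lower())
    if country is None:
        return None
    return ("count", "city", country)


def execute(form):
    """Evaluate a logical form against the toy database."""
    operator, entity, country = form
    if operator == "count" and entity == "city":
        return sum(1 for c in CITY_COUNTRY.values() if c == country)
    raise ValueError(f"unsupported logical form: {form!r}")


form = parse("How many cities are there in the US?")
print(form, execute(form))
```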
Textual Entailment
• Determine whether one natural language sentence
entails (implies) another under an ordinary
interpretation.
Textual Entailment Problems
from PASCAL Challenge
1. TEXT: Eyeing the huge market potential, currently led by Google,
Yahoo took over search company Overture Services Inc last year.
HYPOTHESIS: Yahoo bought Overture.
ENTAILMENT: TRUE
2. TEXT: Microsoft's rival Sun Microsystems Inc. bought Star Office
last month and plans to boost its development as a Web-based device
running over the Net on personal computers and Internet appliances.
HYPOTHESIS: Microsoft bought Star Office.
ENTAILMENT: FALSE
3. TEXT: The National Institute for Psychobiology in Israel was
established in May 1971 as the Israel Center for Psychobiology by
Prof. Joel.
HYPOTHESIS: Israel was established in May 1971.
ENTAILMENT: FALSE
4. TEXT: Since its formation in 1948, Israel fought many wars with
neighboring Arab countries.
HYPOTHESIS: Israel was established in 1948.
ENTAILMENT: TRUE
Pragmatics/Discourse Tasks
Anaphora Resolution / Co-Reference
• Determine which phrases in a document refer to the
same underlying entity.
John put the carrot on the plate and ate it.
Bush started the war in Iraq. But the president needed the
consent of Congress.
• Some cases require difficult reasoning.
Today was Jack's birthday. Penny and Janet went to the store. They were
going to get presents. Janet decided to get a kite. "Don't do that," said
Penny. "Jack has a kite. He will make you take it back."
Ellipsis Resolution
• Frequently words and phrases are omitted from
sentences when they can be inferred from context.
"Wise men talk because they have something to say;
fools, [talk] because they have to say
something." (Plato)
Modular architecture of an NLP system
sound waves ⇒ [Acoustic/Phonetic] ⇒ words ⇒ [Syntax] ⇒ parse trees
⇒ [Semantics] ⇒ literal meaning ⇒ [Pragmatics] ⇒ meaning (contextualized)