Natural Language Processing
By Abhishek Saini
Lecture Outline
• What is Natural Language Processing?
• Fundamental tasks in NLP
• Some applications of NLP
What is Natural Language Processing?
• A field at the intersection of computer science, artificial intelligence, and
computational linguistics.
• Goal: get computers to perform useful tasks involving human languages
− Human-machine communication
− Improving human-human communication
− E.g., machine translation
− Extracting information from texts
Why is NLP interesting?
• Language is involved in many human activities − reading, writing,
speaking, listening
• Voice can be used as a user interface in many applications − remote
controls, virtual assistants like Siri, ...
• NLP is used to extract insights from massive amounts of textual data −
e.g., hypotheses from medical and health reports
Fundamental Tasks in NLP
• Word Segmentation
• Part-of-speech (POS) tagging
• Syntactic Analysis
• Semantic Analysis
Word Segmentation
• In some languages, there are no spaces between words, or a word may
consist of several smaller syllables.
• In such languages, word segmentation is the first step of NLP systems.
• Word tokenization (also called word segmentation) is the problem
of dividing a string of written language into its component words. In
English and many other languages that use some form of the Latin alphabet,
the space is a good approximation of a word divider.
• ['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to',
'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']
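The space-based approximation above can be sketched in a few lines of Python. This is only an illustrative regex tokenizer, not the tool that produced the slide's output; the pattern and the `tokenize` name are assumptions, and the input sentence is reconstructed from the token list shown.

```python
import re

# Assumed pattern: numbers with thousands separators first,
# then runs of word characters, then single punctuation marks.
TOKEN_PATTERN = re.compile(r"\d+(?:,\d+)*|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sentence = ("Its history can be traced back nearly 5,000 years to "
            "archeological discoveries in the Middle East.")
tokens = tokenize(sentence)
print(tokens)
```

Listing the number alternative first keeps "5,000" as one token; a pattern like this would not work for languages written without spaces, which need a dedicated segmenter.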
Word Segmentation
1. Text Lemmatization
The process of removing inflectional endings only and returning the
base dictionary form of a word; that base form is known as the lemma.
E.g., worse → bad (lemma)
2. Text Stemming
The process of reducing inflected (or sometimes derived) words to their
root form.
E.g., meeting → meet (stem)
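The difference between the two can be sketched with toy resources. The lookup table and suffix list below are hypothetical and far too small for real use; lemmatization normally relies on a full dictionary, and stemming on a complete rule set such as the Porter stemmer.

```python
# Hypothetical toy resources, for illustration only.
LEMMA_TABLE = {"worse": "bad", "better": "good", "went": "go"}
SUFFIXES = ("ing", "ed", "s")

def lemmatize(word):
    """Dictionary lookup: return the lemma if known, else the word itself."""
    w = word.lower()
    return LEMMA_TABLE.get(w, w)

def stem(word):
    """Crude suffix stripping: remove the first matching suffix,
    keeping at least three characters of the word."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

print(lemmatize("Worse"))  # bad
print(stem("Meeting"))     # meet
```

Note that the stemmer cannot map "worse" to "bad": it only rewrites the surface string, whereas the lemmatizer needs lexical knowledge.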
POS (Part-of-Speech) Tagging
• Each word in a sentence can be classified into classes, such as verbs,
adjectives, nouns, etc.
• POS tagging is the process of assigning each word in a sentence to a particular
part of speech, based on:
− Its definition
− Its context in the sentence
Sequence Labeling
• Many NLP problems can be viewed as sequence labeling
• Each token in a sequence is assigned a label.
• Labels of tokens are dependent on the labels of other tokens in the
sequence, particularly their neighbors.
• John saw the saw and decided to take it to the table.
John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN
Sequence Labeling as Classification
• Classify each token independently
• Use information about the surrounding tokens (a sliding window) as
features.
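A minimal sketch of sliding-window feature extraction follows. The feature names and the `<s>` / `</s>` padding symbols are assumptions chosen for illustration; any classifier that accepts feature dictionaries could consume the result.

```python
def window_features(tokens, i, size=1):
    """Features for tokens[i], drawn from `size` tokens on each side."""
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
    }
    for offset in range(1, size + 1):
        left, right = i - offset, i + offset
        # Pad with boundary symbols when the window runs off the sentence.
        feats[f"word-{offset}"] = tokens[left].lower() if left >= 0 else "<s>"
        feats[f"word+{offset}"] = (
            tokens[right].lower() if right < len(tokens) else "</s>"
        )
    return feats

tokens = "John saw the saw and decided to take it to the table .".split()
features = window_features(tokens, 3)  # the second "saw"
print(features)
```

For the second "saw", the preceding word "the" is the feature that lets a classifier prefer the noun reading, even though the token itself is ambiguous.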
Probabilistic Sequence Models
• Model the probabilities of (token sequence, tag sequence) pairs from
an annotated data set.
• Exploit dependencies between tokens
• Typical sequence models
1. Hidden Markov Models (HMMs)
2. Conditional Random Fields (CRFs)
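For an HMM, the most probable tag sequence is usually found with the Viterbi algorithm. The sketch below is a minimal version; all probabilities are made-up toy values rather than estimates from annotated data, and a practical implementation would work with log probabilities and smoothing.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[i][t] = (probability of the best path ending in tag t at i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p].get(t, 0.0)
                 * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy parameters (assumed values, for illustration only).
tags = ["NNP", "VBD", "DT", "NN"]
start_p = {"NNP": 0.6, "DT": 0.4}
trans_p = {
    "NNP": {"VBD": 0.8, "NN": 0.2},
    "VBD": {"DT": 0.7, "NN": 0.3},
    "DT": {"NN": 0.9, "VBD": 0.1},
    "NN": {"VBD": 0.5, "NN": 0.5},
}
emit_p = {
    "NNP": {"john": 1.0},
    "VBD": {"saw": 0.6, "decided": 0.4},
    "DT": {"the": 1.0},
    "NN": {"saw": 0.3, "table": 0.7},
}

tagged = viterbi(["john", "saw", "the", "saw"], tags, start_p, trans_p, emit_p)
print(tagged)  # ['NNP', 'VBD', 'DT', 'NN']
```

With these toy values the two occurrences of "saw" receive different tags because the transition probabilities favour a verb after a proper noun and a noun after a determiner.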
Syntactical Analysis
• The task of recognizing a sentence and assigning a syntactic structure to it
• The purpose of this phase is to determine how the words of a sentence relate
to one another grammatically.
• Syntax analysis checks the text for well-formedness against the rules
of a formal grammar.
Syntactical Analysis
• Ambiguity problem: one sentence may have many possible parse
trees
• Vietnamese language processing (VNLP) still lacks accurate syntax
parsers (in my understanding)
− Accuracy of about 78–84%
Approach to Syntactical Analysis
• Top-down parsing
• Bottom-up parsing
• Dynamic programming methods
− CYK algorithm
− Earley algorithm
− Chart parsing
• Probabilistic Context-Free Grammars (PCFG)
− Assign probabilities to derivations
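As an illustration of the dynamic programming methods listed above, here is a minimal CYK recognizer. It assumes the grammar is already in Chomsky normal form; the toy grammar and the dictionary encoding of its rules are hypothetical choices made for this sketch.

```python
from itertools import product

def cyk_recognize(words, lexical, binary, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar."""
    n = len(words)
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right parts
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary.get((left, right), set())
    return n > 0 and start in table[0][n - 1]

# Toy grammar (assumed): S -> NP VP, VP -> V NP, NP -> Det N.
lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V", "N"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}

print(cyk_recognize("the dog saw the cat".split(), lexical, binary))  # True
print(cyk_recognize("dog the saw".split(), lexical, binary))          # False
```

Each table cell is filled once from smaller spans, so shared sub-analyses are not recomputed. A PCFG variant would store the best probability per nonterminal in each cell instead of a plain set.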
Semantic Analysis
Two levels:
• Lexical semantics
− Representing the meaning of words
− Word sense disambiguation (e.g., the word "bank")
• Compositional semantics
− How words combine to form a larger meaning
Meaning Representations
• First-order predicate calculus
• E.g., Maharani serves vegetarian food. => Serves(Maharani,
vegetarian food)
• E.g., I only have five dollars and I don’t have a lot of time =>
Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)
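One way to see what such a representation buys is to evaluate the formulas against a small set of facts. The sketch below is an assumption-laden toy: the fact set, the `holds` helper, and the closed-world reading of negation (anything not listed is false) are all illustrative choices.

```python
# Hypothetical world model: each fact is a tuple (predicate, arg1, arg2).
facts = {
    ("Serves", "Maharani", "VegetarianFood"),
    ("Have", "Speaker", "FiveDollars"),
}

def holds(predicate, *args):
    """True if the ground atom predicate(args) is among the known facts."""
    return (predicate, *args) in facts

# Serves(Maharani, VegetarianFood)
serves = holds("Serves", "Maharani", "VegetarianFood")

# Have(Speaker, FiveDollars) AND NOT Have(Speaker, LotOfTime)
constraint = (holds("Have", "Speaker", "FiveDollars")
              and not holds("Have", "Speaker", "LotOfTime"))

print(serves, constraint)  # True True
```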
Syntax-driven Semantic Analysis
Some Applications
• Information Retrieval
• Information Extraction
• Question Answering
• Machine Translation
Information Retrieval
• Query: “list of good sushi restaurants in kyoto?”
Architecture of an ad hoc IR system
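The core of such a system is often an inverted index that maps terms to the documents containing them. The sketch below is a deliberately simple version: the three documents are invented, and ranking by the number of matching query terms stands in for a real scoring function such as TF-IDF or BM25.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Invented mini-collection, for illustration only.
docs = {
    "d1": "good sushi restaurants in kyoto",
    "d2": "ramen shops in tokyo",
    "d3": "kyoto temples and gardens",
}
index = build_index(docs)
ranking = search(index, "list of good sushi restaurants in kyoto")
print(ranking)  # ['d1', 'd2', 'd3']
```

Document d2 is returned only because it shares the word "in" with the query, which is why practical systems remove stop words or down-weight frequent terms.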
Information Extraction
• Extract from unstructured text the information that is pre-specified or
pre-defined in templates
− Fill a number of slots/attributes
• Example: use the template [PERSON, go, LOCATION, TIME] to extract
information about where an individual goes.
− “President Obama went to Hanoi yesterday.”
− [PERSON = “President Obama”, go, LOCATION = “Hanoi”, TIME = “yesterday”]
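A rough sketch of template filling for this example is shown below, using a hand-written regular expression. The pattern, the small list of time expressions, and the `fill_template` name are assumptions; real extraction systems typically rely on named-entity recognition and parsing rather than a single regex.

```python
import re

# Assumed pattern: capitalized word sequences stand in for PERSON and LOCATION,
# and TIME is limited to a few fixed expressions.
TEMPLATE = re.compile(
    r"(?P<PERSON>[A-Z][a-z]+(?: [A-Z][a-z]+)*) went to "
    r"(?P<LOCATION>[A-Z][a-z]+(?: [A-Z][a-z]+)*) "
    r"(?P<TIME>yesterday|today|last week)"
)

def fill_template(sentence):
    """Return the filled slots as a dict, or None if the pattern does not match."""
    match = TEMPLATE.search(sentence)
    return match.groupdict() if match else None

slots = fill_template("President Obama went to Hanoi yesterday.")
print(slots)
```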
Question Answering
• A system that automatically returns answers to a user's question by
retrieving information from a collection of documents.
• Differences from an information retrieval system:
− A QA system aims to return an exact answer rather than documents
related to the user's question.
• Q: Who invented the Internet?
• A: Robert E. Kahn and Vint Cerf.
− A QA system requires more sophisticated semantic analysis.
Machine Translation
• The use of computers to automate some or all of the process of
translating from one language to another.
• Fully automatic machine translation is one of the most challenging
and active topics in NLP.
• Recent advances in deep learning have driven the trend toward Neural Machine
Translation.
Thanks
End of Session!