PMDS606L Natural Language Processing
Dr. Kavipriya G
Assistant Professor (Senior Grade 1)
AB3, First Floor, Annexure, Cabin No. 24 (109B)
VIT Chennai
Course Objectives
1. To introduce the fundamental concepts and techniques of Natural
Language Processing for analyzing words based on morphology and
corpora.
2. To examine NLP models and interpret algorithms for the classification
of NLP sentences, using both traditional symbolic approaches and more
recent statistical approaches.
3. To get acquainted with the algorithmic description of the main
language levels, including morphology, syntax, semantics, and
pragmatics, for information retrieval and machine translation
applications.
Course Outcomes
1. Understand the fundamental concepts of natural language
processing.
2. Understand the text pre-processing and corpora.
3. Analyze the words and perform POS tagging.
4. Distinguish between the syntactic and semantic correctness of the
natural language.
5. Develop simple language models using NLTK.
Module -1 Introduction to NLP
Introduction to various levels (stages) of natural language processing,
Ambiguities, varieties and computational challenges in processing
natural languages. Introduction to Real life applications of NLP such as
spell and grammar checkers, information extraction, information
retrieval, question answering, and machine translation.
What is NLP?
Natural Language Processing (NLP)
• Natural language processing aims to build machines that understand text or
voice data and respond with text or speech of their own, in much the same
way humans do.
• NLP combines computational linguistics (rule-based modeling of human
language) with statistical, machine learning, and deep learning models.
• The field of NLP is primarily concerned with getting computers to perform
useful and interesting tasks with human languages.
Goals of NLP
• Scientific Goal
• Identify the computational machinery needed for an
agent to exhibit various forms of linguistic behavior.
• Engineering Goal
• Design, implement, and test systems that process
natural languages for practical applications.
Forms of Natural Language
• The input/output of a NLP system can be:
• written text
• speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
• lexical, syntactic, semantic knowledge about the language
• discourse information, real world knowledge
• To process spoken language, we need everything required to process
written text, plus the challenges of speech recognition and speech
synthesis.
Components of NLP
• Natural Language Understanding
• Mapping the given input in the natural language into a useful representation.
• Different levels of analysis are required:
• Morphological analysis provides the building blocks by breaking words into morphemes and
identifying their grammatical roles.
• Syntactic analysis organizes these words into a structured sentence, revealing the grammatical hierarchy.
• Semantic analysis assigns meaning to the structured sentence, ensuring the machine understands the
intent and context.
• Discourse analysis connects sentences to interpret the broader context, coherence, and relationships in
a text or conversation.
• Natural Language Generation
• Producing output in the natural language from some internal representation.
• Different levels of synthesis are required:
• deep planning (what to say)
• syntactic generation (how to say it)
• NL understanding is much harder than NL generation, but both are hard.
Why is NL Understanding Hard?
• Natural language is extremely rich in form and structure, and very ambiguous.
• How to represent meaning,
• Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can be at different levels.
• Lexical (word-level) ambiguity -- different meanings of words
Example: "I saw a bat in the yard." Explanation: "Bat" could mean a flying mammal or a piece of sports
equipment (e.g., a baseball bat). The sentence is ambiguous without context to clarify which meaning is
intended.
• Syntactic ambiguity -- different ways to parse the sentence
Example: "The man saw the woman with a telescope." Explanation: This can be parsed in two ways: (1) the
man used a telescope to see the woman (telescope as a tool), or (2) the man saw a woman who was
holding a telescope (telescope associated with the woman). The structure creates ambiguity in how the
prepositional phrase "with a telescope" is attached.
• Interpreting partial information -- how to interpret pronouns
Example: "When John met Tom, he was upset." Explanation: The pronoun "he" could refer to either John
or Tom, leaving it unclear who was upset. Without additional context, the referent of the pronoun is
ambiguous.
• Contextual information -- context of the sentence may affect the
meaning of that sentence.
Example: "It's cold in here.“ Explanation: In one context (e.g., a chilly room), this could be a
complaint about the temperature. In another context (e.g., a morgue or an emotionally distant
situation), it could describe the atmosphere or mood. The sentence's meaning shifts based on
the situational context.
• Many input can mean the same thing.
• Interaction among components of the input is not clear.
Language Technologies
• Goal: deep understanding, which requires context, linguistic structure, and meaning.
• Reality: shallow matching, which requires robustness and scale.
• Amazing successes, but fundamental limitations.
Language Processing
• Level 1 – Speech sound (Phonetics & Phonology)
• Level 2 – Words & their forms (Morphology, Lexicon)
• Level 3 – Structure of sentences (Syntax, Parsing)
• Level 4 – Meaning of sentences (Semantics)
• Level 5 – Meaning in context & for a purpose
(Pragmatics)
• Level 6 – Connected sentence processing in a larger
body of text (Discourse)
Examples of Levels
• L1 : sound
• L2 : Dog - Dog(s), Dog(ged)
Lady – Lad(ies)
Should we store all forms of words in the lexicon?
• L3 : Ram goes to market (right)
goes Ram to the market (wrong)
• L4 : translation from unstructured to structured
representation
go : (event)
agent : Ram
source : ?
destination : market
Example (Contd.)
• L5 : User situation & context
“Is that water?” – the action to be performed is
different in a chemistry lab and on a dining table.
• L6 : Backward & forward references –
• Coreference resolution
“The man went near the dog. It bit him.”
Often coreference and ambiguity go together, as in:
"The dog went near the cat. It bit it."
Knowledge of Language
• Phonology – concerns how words are related to the sounds that realize them.
Example: "Cat" vs. "Hat".
Phonological Analysis:
• The words "cat" and "hat" differ only in their initial consonant sounds: /k/ vs. /h/.
• In English, the phonemes /k/ and /h/ are distinct, and this difference changes the word's identity.
• Phonology determines that these sounds follow English pronunciation rules (e.g., /kæt/ for "cat").
• Morphology – concerns how words are constructed from more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language.
Example: Word: "Unhappiness".
Morphological Analysis:
Morphemes: "un-" (prefix, meaning "not"), "happy" (root, meaning "joyful"), "-ness" (suffix, turning an
adjective into a noun).
Combined meaning: the state of not being happy.
In NLP, morphology helps break down words to understand their structure (e.g., for lemmatization, where
"unhappiness" is reduced to "happy" for analysis) and generate correct word forms.
• Syntax – concerns how words can be put together to form correct sentences; it determines what structural role
each word plays in the sentence and which phrases are subparts of other phrases.
Example Sentence: "The quick brown fox jumps over the lazy dog."
• Syntactic Analysis:
• Parse Tree:
• [NP: The quick brown fox] (noun phrase, subject)
• [VP: jumps over the lazy dog] (verb phrase, predicate)
• Subparts: "quick brown" modifies "fox"; "over the lazy dog" is a prepositional phrase.
• Dependency: "jumps" is the main verb, with "fox" as the subject and "dog" as the object of the
preposition "over."
In NLP, syntax ensures the sentence is structured correctly, enabling systems to identify relationships (e.g., who
is performing the action) for tasks like machine translation.
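A minimal sketch of syntactic parsing, assuming NLTK is installed; the toy context-free grammar below is hypothetical, written only to cover this one sentence:

import nltk

# Hypothetical toy grammar covering only the example sentence.
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det Adj Adj N | Det Adj N
VP  -> V PP
PP  -> P NP
Det -> 'the'
Adj -> 'quick' | 'brown' | 'lazy'
N   -> 'fox' | 'dog'
V   -> 'jumps'
P   -> 'over'
""")

parser = nltk.ChartParser(grammar)
tokens = "the quick brown fox jumps over the lazy dog".split()
for tree in parser.parse(tokens):
    tree.pretty_print()  # prints the NP/VP/PP structure described above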
• Semantics – concerns what words mean and how these meanings combine in sentences to form sentence
meaning. The study of context-independent meaning.
• Example Sentence: "John gave Mary a book."
• Semantic Analysis:
• Word Meanings:
• "John" (a person, the agent),
• "gave" (an action of transferring possession),
• "Mary" (a person, the recipient),
• "book" (an object).
• Sentence Meaning: John performed the action of transferring a book to Mary.
• Semantic Roles: John (agent), Mary (recipient), book (theme).
In NLP, semantics helps systems understand the intended meaning (e.g., distinguishing "bank" as a financial
institution vs. a riverbank) for tasks like question answering.
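A minimal sketch of the "bank" disambiguation mentioned above, using NLTK's implementation of the classic Lesk algorithm. Lesk is a simple gloss-overlap heuristic and can pick the wrong sense; the sketch assumes the wordnet and punkt data have been downloaded.

# Assumes nltk.download("wordnet") and nltk.download("punkt") have been run.
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

financial = word_tokenize("I deposited cash at the bank")
river = word_tokenize("They fished from the bank of the river")

# lesk() returns the WordNet synset whose gloss overlaps the context most.
print(lesk(financial, "bank", "n"))
print(lesk(river, "bank", "n"))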
• Pragmatics – concerns how sentences are used in different situations and how use affects the interpretation
of the sentence.
Example Sentence: "Can you pass the salt?“
Pragmatic Analysis:
Literal Meaning: A question about the ability to pass the salt.
Pragmatic Meaning: A polite request to pass the salt, inferred from the dining context.
Context: At a dinner table, the speaker expects action (passing the salt) rather than a yes/no answer.
In NLP, pragmatics helps chatbots interpret implied meanings (e.g., treating the question as a request) to
respond appropriately in conversational settings.
Knowledge of Language (cont.)
• Discourse – concerns how the immediately preceding sentences affect the interpretation of the next sentence. For
example, interpreting pronouns and interpreting the temporal aspects of the information.
• Example Text: "Sarah went to the park. She saw a dog."
• Discourse Analysis:
• Coreference Resolution: "She" refers to "Sarah" based on the preceding sentence.
• Temporal Aspect: The second sentence implies an event that happened after Sarah arrived at the park.
• Coherence: The two sentences form a narrative about Sarah’s experience at the park.
In NLP, discourse analysis ensures conversational systems track context across sentences (e.g., understanding that
"she" is Sarah) for coherent dialogue or text summarization.
• World Knowledge – includes general knowledge about the world. What each language user must know about the
other’s beliefs and goals.
Language Processing: Tasks, Tools, and Algorithms
An NLP system must accurately determine the intended meaning of text or voice data despite
homonyms, homophones, sarcasm, idioms, metaphors, grammar and usage exceptions, and
variations in sentence structure.
Basic Text Processing
1. Tokenization – breaks raw text into units called tokens, such as words or sentences.
2. Stemming – extracts the base form of a word by removing affixes.
3. Spelling correction – detects and corrects errors in input text or queries.
4. Normalization – converts text to a canonical form (e.g., lowercasing).
5. Lemmatization – a more formal way to find roots, analyzing a word's morphology using
vocabulary from a dictionary.
6. Part-of-speech (POS) tagging – assigns a grammatical category to each token (see the sketch below).
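A minimal NLTK sketch of the steps above (tokenization, normalization, stemming, lemmatization, POS tagging). It assumes the punkt, wordnet, and averaged_perceptron_tagger resources have been fetched via nltk.download().

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The striped bats were hanging on their feet. They flew away."

sentences = sent_tokenize(text)                          # sentence tokens
words = word_tokenize(text)                              # word tokens
normalized = [w.lower() for w in words if w.isalpha()]   # simple normalization

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in normalized])             # e.g. 'stripe', 'bat', 'hang'
print([lemmatizer.lemmatize(w) for w in normalized])     # e.g. 'foot' for 'feet'

print(nltk.pos_tag(words))                               # (token, tag) pairs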
Natural Language Processing
Applications:
• Machine Translation
• Information Retrieval
• Question Answering
• Dialogue Systems
• Information Extraction
• Summarization
• Sentiment Analysis
• ...
Core Technologies:
• Language modeling
• Part-of-speech tagging
• Syntactic parsing
• Named-entity recognition
• Word sense disambiguation
• Semantic role labelling
• ...
NLP lies at the intersection of computational linguistics and machine learning.
Ambiguity
I made her duck.
• How many different interpretations does this sentence have?
• What are the reasons for the ambiguity?
• The categories of knowledge of language can be thought of as
ambiguity resolving components.
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more ambiguous?
• Yes – deciding word boundaries
Ambiguity (cont.)
• Some interpretations of: I made her duck.
1. I cooked duck for her.
2. I cooked duck belonging to her.
3. I created a toy duck which she owns.
4. I caused her to quickly lower her head or body.
5. I used magic and turned her into a duck.
• duck – morphologically and syntactically ambiguous:
noun or verb.
• her – syntactically ambiguous: dative or possessive.
• make – semantically ambiguous: cook or create.
• make – syntactically ambiguous:
• Transitive – takes a direct object. => 2
• Di-transitive – takes two objects. => 5
• Takes a direct object and a verb. => 4
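A minimal sketch of how a lexicon exposes this kind of lexical ambiguity: WordNet lists both noun and verb senses of "duck". Assumes nltk.download("wordnet") has been run.

from nltk.corpus import wordnet as wn

for synset in wn.synsets("duck"):
    print(synset.pos(), synset.name(), "-", synset.definition())
# The listing mixes noun senses (the bird, the meat) with verb senses
# (to lower the head or body quickly), mirroring the ambiguity above.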
Ambiguity in a Turkish Sentence
• Some interpretations of: Adamı gördüm.
1. I saw the man.
2. I saw my island.
3. I visited my island.
4. I bribed the man.
• Morphological Ambiguity:
• ada-m-ı ada+P1SG+ACC
• adam-ı adam+ACC
• Semantic Ambiguity:
• gör to see
• gör to visit
• gör to bribe
Type vs Tokens
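Tokens are the running words of a text; types are the distinct word forms. A minimal pure-Python sketch of the distinction and of the type/token ratio:

text = "the cat sat on the mat and the dog sat too"
tokens = text.split()   # simple whitespace tokenization
types = set(tokens)     # distinct word forms

print(len(tokens))               # 11 tokens
print(len(types))                # 8 types
print(len(types) / len(tokens))  # type/token ratio, a crude lexical-diversity measure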
Applications
• Speech processing: get flight information or book a hotel over the
phone
• Information extraction: discover names of people and events they
participate in, from a document
• Machine translation: translate a document from one human language
into another
• Question answering: find answers to natural language questions in a
text collection or database
• Summarization: generate a short biography of Noam Chomsky from
one or more news articles
Machine Translation
Example (People's Daily, August 30, 2017):
• Machine output: "Trump Pope family watch a hundred years a year in the White House balcony"
• Correct translation: "Trump and his family watched a 100-year total solar eclipse on the balcony of the White House"
Speech Recognition
• Spoken Input
• Identify words and phonemes in speech
• Generate text for recognized word parts
• Concatenate text elements
• Perform spelling, grammar and context checking
• Output results
• Research question: How can speech recognition assist a deaf student
taking notes in class?
• VUST – Villanova University Speech Transcriber (http://www.csc.villanova.edu/~tway/publications/wayAT08.pdf)
Textual Analysis - Readability
• Text Input
• Analyze text & estimate “readability”
• Grade level of writing
• Consistency of writing
• Appropriateness for certain educ. level
• Output results
• Research question: How can a computer analyze text and measure
readability?
• Opportunities for hands-on research
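One concrete, widely used readability estimate is the Flesch-Kincaid grade level: 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. A minimal sketch, with a crude vowel-group syllable counter standing in for a real one:

import re

def count_syllables(word):
    # Crude heuristic: count runs of vowels as syllables (at least one).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)

print(fk_grade("The cat sat on the mat. It was happy."))  # low (early) grade level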
Plagiarism Detection
• Text Input
• Analyze text & locate “candidates”
• Find one or more passages that might be plagiarized
• Algorithm tries to do what a teacher does
• Search on Internet for candidate matches
• Output results
• Research question: What algorithms work like humans when finding
plagiarism?
• Experimental CS research
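A minimal sketch of one candidate-matching idea: word n-gram overlap (Jaccard similarity) between a suspect passage and a source. Real detectors layer web search, fingerprinting, and alignment on top of this.

def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

source = "natural language processing enables computers to understand text"
suspect = "language processing enables computers to understand human text"
print(jaccard(source, suspect))   # a high score flags a plagiarism candidate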
Intelligent Agents
• Example: ELIZA
• AIML: Artificial Intelligence Markup Language
• Human types something
• Computer parses, “understands”, and generates response
• Response is viewed by human
• Research question: How can computers “understand” and “generate”
human writing?
• Also good area for experimentation
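A minimal ELIZA-style sketch: regex patterns map user input to canned, partly reflected responses. The historical ELIZA used richer keyword ranking, and AIML generalizes the same pattern-response idea; this is only an illustrative toy.

import re

# (pattern, response template) pairs; the last rule is a catch-all.
RULES = [
    (r"i am (.*)", "Why do you say you are {0}?"),
    (r"i feel (.*)", "What makes you feel {0}?"),
    (r".*\bmother\b.*", "Tell me more about your family."),
    (r".*", "Please go on."),
]

def respond(utterance):
    text = utterance.lower().strip(" .!?")
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return template.format(*match.groups())

print(respond("I am worried about exams"))
# -> Why do you say you are worried about exams?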
Spell and grammar checkers
• Identify and correct spelling errors, grammatical mistakes, and style
issues in text.
Real-Life Use:
- Word processors (e.g., Microsoft Word, Google Docs).
- Email clients and messaging apps.
- Educational tools for students and professionals.
Impact:
- Improves communication clarity.
- Enhances professional writing quality.
How Spell and Grammar Checkers Work
Core Techniques:
- Dictionary-Based Checking: Compares words against a dictionary database.
- Rule-Based Grammar Analysis: Applies linguistic rules to detect errors (e.g.,
subject-verb agreement).
- Machine Learning: Contextual analysis using models like BERT to suggest
corrections.
Challenges:
- Homophones (e.g., "their" vs. "there").
- Context-dependent errors (e.g., "I read a book" vs. "I red a book").
Example Tools: Grammarly, ProWritingAid, LanguageTool.
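A minimal dictionary-based checking sketch: pick the known word needing the fewest edits, using NLTK's edit_distance. The tiny word list is illustrative only; a real checker would use a full dictionary (e.g., nltk.corpus.words) plus word frequencies and context to rank candidates.

from nltk.metrics.distance import edit_distance

DICTIONARY = {"their", "there", "the", "they", "read", "red", "hello", "world"}

def correct(word):
    if word in DICTIONARY:
        return word
    # Pick the dictionary word at the smallest edit distance (transpositions allowed).
    return min(DICTIONARY, key=lambda w: edit_distance(word, w, transpositions=True))

print(correct("thier"))   # 'their' -- one transposition away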
Advancements in Spell and Grammar Checkers
Context-Aware Suggestions:
- Modern tools suggest style improvements (e.g., replacing "very
good" with "excellent").
- Tone detection (e.g., formal vs. casual).
Multilingual Support:
- Checking text in multiple languages (e.g., English, Spanish, French).
Integration:
- Browser extensions, mobile keyboards, and cloud-based platforms.
Future: Real-time feedback in collaborative writing environments.
Information Extraction - Overview
Definition: Automatically extracting structured information (e.g.,
entities, relationships, events) from unstructured text.
Applications:
- Extracting names, dates, and locations from news articles.
- Identifying medical conditions from clinical notes.
- Building knowledge graphs for businesses.
Impact:
- Converts raw text into actionable insights.
- Supports data-driven decision-making.
Techniques in Information Extraction
Named Entity Recognition (NER):
- Identifies entities like people, organizations, and locations (e.g., "Elon Musk" as a person).
Relation Extraction:
- Detects relationships between entities (e.g., "Elon Musk founded xAI").
Event Extraction:
- Identifies events and their details (e.g., "xAI launched Grok in 2024").
Tools and Models:
- SpaCy, Stanford NER, transformer-based models like RoBERTa.
Challenges:
- Ambiguity in entity boundaries.
- Handling noisy or incomplete text.
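A minimal NER sketch with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the example sentence reuses the entities mentioned above, and exact spans and labels depend on the model version.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model is installed
doc = nlp("Elon Musk founded xAI. xAI launched Grok in 2024.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. 'Elon Musk' PERSON, 'xAI' ORG, '2024' DATE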
Real-Life Examples of Information Extraction
Business Intelligence:
- Extracting customer feedback from reviews for sentiment analysis.
Healthcare:
- Pulling patient data from medical reports for research.
Legal Domain:
- Extracting contract terms or case details from legal documents.
Future Directions:
- Zero-shot extraction for low-resource domains.
- Multilingual and cross-domain extraction.
Information Retrieval - Overview
Definition: Finding relevant documents or information from large
datasets based on user queries.
Applications:
- Search engines (e.g., Google, Bing).
- Enterprise search systems.
- Recommendation systems (e.g., news or product suggestions).
Impact:
- Enables quick access to information.
- Enhances user experience in digital platforms.
How Information Retrieval Works
Core Components:
Indexing: Organizing text data for efficient search (e.g., inverted indices).
Query Processing: Parsing user queries to match relevant documents.
Ranking: Scoring documents based on relevance (e.g., TF-IDF, BM25).
Modern Approaches:
- Neural IR with embeddings (e.g., BERT for semantic search).
- Dense retrieval for context-aware results.
Challenges:
- Handling vague or ambiguous queries.
- Scaling to massive datasets.
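A minimal sketch of the indexing / query processing / ranking pipeline using TF-IDF and cosine similarity from scikit-learn; BM25 or neural retrieval would slot into the same ranking step. The three toy documents are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "NLP tools for text analysis",
    "best restaurants in Chennai",
    "a survey of NLP and machine learning tools",
]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(docs)                 # indexing
query = vectorizer.transform(["best NLP tools 2025"])  # query processing

scores = cosine_similarity(query, index)[0]            # ranking
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(score, 3), doc)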
Real-Life Examples of Information Retrieval
Web Search:
- Google uses NLP to understand query intent (e.g., "best NLP tools 2025").
E-Commerce:
- Product search on Amazon or eBay using user queries.
Academic Research:
- Tools like PubMed for retrieving research papers.
Future Trends:
- Personalized search results.
- Multimodal retrieval (text, images, video).
Question Answering - Overview
Definition: Systems that provide direct answers to user questions rather than
returning documents.
Types:
Extractive QA: Extracts answers from a given text (e.g., SQuAD dataset).
Generative QA: Generates answers using language models (e.g., Grok
answering queries).
Applications:
- Virtual assistants (e.g., Siri, Alexa).
- Customer support chatbots.
- Educational tools.
How Question Answering Works
Pipeline:
Question Analysis: Parsing the question to understand intent.
Context Retrieval: Finding relevant text or knowledge base.
Answer Generation: Extracting or generating the answer.
Techniques:
- Transformer models (e.g., BERT, T5) for contextual understanding.
- Knowledge graph integration for factual answers.
Challenges:
- Handling open-domain questions.
- Ensuring factual accuracy.
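A minimal extractive QA sketch using the Hugging Face transformers pipeline. It assumes the transformers library (with a backend such as PyTorch) is installed; the first call downloads a default SQuAD-style model.

from transformers import pipeline

qa = pipeline("question-answering")   # downloads a default SQuAD-tuned model
result = qa(
    question="Who founded xAI?",
    context="Elon Musk founded xAI. xAI launched Grok in 2024.",
)
print(result["answer"], result["score"])   # answer span extracted from the context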
Real-Life Examples of Question Answering
Virtual Assistants:
- "Hey Siri, what’s the weather today?" → Direct response with weather
data.
Customer Support:
- Chatbots answering FAQs on websites.
Education:
- Tools like Quizlet or Duolingo using QA for learning.
Future Directions:
- Multimodal QA (combining text and images).
- Real-time, context-aware answers.
Machine Translation - Overview
Definition: Automatically translating text or speech from one language
to another.
Applications:
- Global communication (e.g., Google Translate).
- Localization of software and websites.
- Subtitling and dubbing for media.
Impact:
- Breaks language barriers.
- Enables cross-cultural collaboration.
How Machine Translation Works
Evolution:
Rule-Based MT: Used predefined linguistic rules (early systems).
Statistical MT: Leveraged bilingual corpora for translation.
Neural MT: Uses deep learning (e.g., Transformer models like mBART).
Key Components:
- Encoder-decoder architectures for sequence-to-sequence tasks.
- Attention mechanisms to focus on relevant words.
Challenges:
- Idioms and cultural nuances.
- Low-resource languages with limited parallel data.
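A minimal neural MT sketch using a pretrained encoder-decoder model via the transformers translation pipeline; it assumes transformers and a backend such as PyTorch are installed, and the first call downloads a default English-to-French model.

from transformers import pipeline

translator = pipeline("translation_en_to_fr")   # downloads a default model
out = translator("Machine translation breaks language barriers.")
print(out[0]["translation_text"])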
Real-Life Examples of Machine Translation
Consumer Tools:
- Google Translate for real-time text and speech translation.
- Microsoft Translator for multilingual meetings.
Business:
- Localizing e-commerce websites for global markets.
Media:
- Automatic subtitling for YouTube or Netflix.
Future Trends:
- Improved translation for low-resource languages.
- Context-aware translations (e.g., preserving tone).
Challenges Across NLP Applications
Linguistic Diversity:
- Supporting thousands of languages and dialects.
Context Understanding:
- Handling ambiguity, slang, and cultural references.
Scalability:
- Processing large datasets in real-time.
Ethical Issues:
- Bias in models (e.g., gender or cultural biases).
- Privacy concerns in text processing.
Future of NLP Applications
Advancements:
- More robust multilingual models.
- Integration with multimodal data (text, audio, images).
Accessibility:
- Expanding NLP tools for low-resource languages.
Ethical Focus:
- Reducing bias and ensuring fairness.
- Transparent and privacy-preserving systems.
Goal: Seamless, inclusive, and efficient NLP solutions.