SNLP
Important Topics:
Introduction to NLP:
Applications of NLP and key challenges.
Understanding the lexicon and morphology in natural language.
Syntactic Parsing:
Analysis of the grammatical structure of sentences.
Top-down and bottom-up parsing strategies for understanding sentence structure.
Semantics:
Understanding the meaning of words and sentences.
Word Sense Disambiguation techniques.
Semantic parsing for extracting meaning from sentences.
Subjectivity and sentiment analysis in text data.
Information Extraction:
Techniques for extracting structured information from unstructured text.
Automatic summarization of documents for condensing information.
Unit 6: Information Retrieval and Question Answering
Information Retrieval:
Retrieval of relevant documents or passages from a large text corpus.
Techniques for indexing, ranking, and retrieving information.
Question Answering:
Methods for automatically generating answers to user queries based on text data.
Additional Topics:
Machine Translation:
Translation of text from one language to another using computational methods.
Question Paper:
a) Explain Natural Language Processing.
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and
linguistics concerned with the interactions between computers and human (natural) languages. It
involves programming computers to process and analyze large amounts of natural language data.
NLP enables machines to read, understand, and derive meaning from human languages. Key
applications include machine translation, speech recognition, sentiment analysis, and chatbots.
b) List and explain different phases of analysis in Natural Language Processing with an
example for each.
1. Lexical Analysis: This phase involves identifying and analyzing the structure of words.
For example, breaking down "unhappiness" into "un-", "happy", and "-ness".
2. Syntactic Analysis (Parsing): This involves analyzing the grammatical structure of
sentences. For example, parsing "The cat sat on the mat" to identify "The cat" as the noun
phrase and "sat on the mat" as the verb phrase.
3. Semantic Analysis: This phase determines the meaning of words and sentences. For
example, understanding that "bat" can mean both an animal and a piece of sports
equipment.
4. Pragmatic Analysis: This involves understanding the context in which a sentence is
used. For example, "Can you pass the salt?" is understood as a request, not a question
about ability.
5. Discourse Analysis: This looks at the structure of texts and conversations. For example,
understanding reference and coherence in a multi-sentence text.
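A minimal sketch of the first three phases using NLTK (assuming the punkt, tagger, and wordnet resources have been downloaded):

    import nltk
    from nltk.corpus import wordnet

    sentence = "The bat flew out of the cave"

    # Lexical analysis: segment the raw string into word tokens
    tokens = nltk.word_tokenize(sentence)

    # Syntactic analysis: assign a part-of-speech tag to each token
    print(nltk.pos_tag(tokens))

    # Semantic analysis: list the candidate senses of the ambiguous word "bat"
    for sense in wordnet.synsets("bat"):
        print(sense.name(), "-", sense.definition())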
Stemming is the process of reducing words to their base or root form. In Information Retrieval
(IR) systems, stemming helps in matching documents with queries by reducing words to a
common form. For example, "running", "runner", and "ran" might all be reduced to "run". This
increases the recall of the system by retrieving documents that contain variations of the query
term, but it may decrease precision as sometimes different words are stemmed to the same root
(e.g., "universe" and "university").
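A quick illustration with NLTK's PorterStemmer (outputs depend on the stemmer's rule set):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "runner", "ran", "universe", "university"]:
        # "running" reduces to "run", while "universe" and "university"
        # collapse to the same stem, illustrating the precision cost
        print(word, "->", stemmer.stem(word))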
1. Deductive Inference: Derives specific conclusions from general rules. For example, "All
men are mortal. Socrates is a man. Therefore, Socrates is mortal."
2. Inductive Inference: Generalizes from specific instances to broader generalizations. For
example, "All observed swans are white; therefore, all swans are white."
3. Abductive Inference: Involves reasoning from effects to causes. For example, "The
grass is wet; therefore, it probably rained."
Semantics: Studies the meaning of words and sentences. For example, understanding that
"bank" can mean the side of a river or a financial institution.
Pragmatics: Studies how context influences the interpretation of meaning. For example,
understanding that "Can you pass the salt?" is a request, not a question about capability.
Discourse: Studies how sentences connect and flow in larger texts and conversations. For
example, ensuring coherence and reference across multiple sentences in a paragraph.
Sentiment Analysis is a subfield of NLP that focuses on determining the emotional tone behind a
body of text. It involves classifying text into categories such as positive, negative, or neutral.
Applications include analyzing customer feedback, monitoring social media for public sentiment,
and improving customer service.
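A minimal sketch using NLTK's VADER analyzer (assuming the vader_lexicon resource has been downloaded):

    from nltk.sentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    for review in ["The product is excellent!", "Terrible service, never again."]:
        # polarity_scores returns neg/neu/pos components plus a compound
        # score in [-1, 1]; the sign gives the overall sentiment
        print(review, "->", analyzer.polarity_scores(review)["compound"])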
5x4=20
a) State the difference between homonymy and polysemy along with examples of each.
Homonymy: Words that sound alike but have different meanings and origins. Example:
"bat" (a flying mammal) and "bat" (a piece of sports equipment).
Polysemy: A single word that has multiple related meanings. Example: "bank" can mean
the side of a river and a financial institution, with both meanings related to a place where
something is stored or managed.
b) Explain lexicon.
A lexicon is a database of words and their meanings, along with other information such as
pronunciation, part of speech, and syntactic properties. In NLP, a lexicon serves as a reference
for various linguistic tasks, providing essential data for understanding and processing language.
c) Explain the different parts of speech. Differentiate between open class and closed class of
words.
Parts of Speech:
o Nouns: Name people, places, things, or ideas (e.g., cat, London).
o Verbs: Describe actions or states (e.g., run, is).
o Adjectives: Describe or modify nouns (e.g., happy, blue).
o Adverbs: Modify verbs, adjectives, or other adverbs (e.g., quickly, very).
o Pronouns: Replace nouns (e.g., he, it).
o Prepositions: Show relationships between nouns and other words (e.g., in, on).
o Conjunctions: Connect words, phrases, or clauses (e.g., and, but).
o Interjections: Express strong emotion (e.g., wow, ouch).
Open Class Words: Categories that frequently add new words (e.g., nouns, verbs,
adjectives, adverbs).
Closed Class Words: Categories that rarely change or add new words (e.g., pronouns,
prepositions, conjunctions).
d) Explain phonology.
Phonology is the study of the sound system of a language, including the organization and
patterning of sounds. It examines how sounds function in particular languages, how they interact
with each other, and how they are used to convey meaning.
S → NP VP
NP → Det N
VP → V NP
These rules generate simple transitive sentences such as "The cat chased the mouse" by breaking them down into their constituent parts. To cover "The cat sat on the mat," the grammar would also need prepositional-phrase rules such as VP → V PP and PP → P NP.
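This can be checked mechanically with NLTK's chart parser; a minimal sketch (the terminal rules for 'the', 'cat', 'chased', and 'mouse' are added here for illustration):

    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Det N
        VP -> V NP
        Det -> 'the'
        N -> 'cat' | 'mouse'
        V -> 'chased'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the cat chased the mouse".split()):
        tree.pretty_print()  # draws the NP/VP constituent structure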
10x2=20
Extractive Summarization: Selects key sentences or phrases directly from the original
text to form a summary.
Abstractive Summarization: Generates new sentences that convey the main points of
the original text, often rephrasing the information.
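A toy extractive summarizer that scores sentences by the frequency of their words (a simplification of real systems, which also weight position and similarity):

    from collections import Counter
    import nltk

    def extractive_summary(text, n=2):
        sentences = nltk.sent_tokenize(text)
        words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
        freq = Counter(words)
        # Rank sentences by the total corpus frequency of their words
        ranked = sorted(sentences,
                        key=lambda s: sum(freq[w.lower()]
                                          for w in nltk.word_tokenize(s)),
                        reverse=True)
        return " ".join(ranked[:n])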
Feature sets are collections of measurable properties or characteristics used to represent data for
machine learning tasks. In NLP, features might include word frequencies, part-of-speech tags,
syntactic dependencies, and more. They are represented as vectors, where each element
corresponds to a specific feature.
For example, a feature vector for text classification might look like:
[term1_frequency, term2_frequency, ..., termN_frequency, avg_sentence_length, num_of_nouns, ...]
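A minimal sketch building term-frequency vectors with scikit-learn's CountVectorizer:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog chased the cat"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # one row per document
    print(vectorizer.get_feature_names_out())   # vocabulary: one column per term
    print(X.toarray())                          # the term-frequency feature vectors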
c) Explain in detail the application of Natural Language Processing.
NLP has a wide range of applications, including but not limited to:
1. Machine Translation: Automatically translating text from one language to another, such
as Google Translate.
2. Speech Recognition: Converting spoken language into text, as used in virtual assistants
like Siri and Alexa.
3. Text Summarization: Producing concise summaries of larger texts, useful for news
aggregation and academic research.
4. Sentiment Analysis: Determining the sentiment expressed in a text, commonly used in
social media monitoring and customer feedback analysis.
5. Chatbots: Enabling conversational agents to interact with users, used in customer service
and support.
6. Information Retrieval: Improving search engines by understanding and retrieving
relevant documents based on user queries.
7. Named Entity Recognition: Identifying and classifying entities (e.g., people,
organizations, locations) in text.
8. Part-of-Speech Tagging: Assigning parts of speech to each word in a text, aiding in
syntactic parsing and text analysis.
10x2=20
a) What is Parsing? For the given CFG, illustrate the steps to draw the Top-down parse
tree for the sentence: "The large can can hold the water."
Parsing is the process of analyzing the syntactic structure of a sentence according to a given
grammar. It involves breaking down the sentence into its constituent parts and identifying their
grammatical relationships.
For the sentence "The large can can hold the water," using the given CFG:
CFG:
o S → NP VP
o NP → DT ADJ N
o NP → DT N
o VP → Aux VP
o VP → V NP
o DT → the
o ADJ → large
o N → can | hold | water
o V → hold
o Aux → can
Top-down Parsing Steps:
1. Start with the start symbol and apply S → NP VP.
2. Expand NP → DT ADJ N; match "The" as DT, "large" as ADJ, and the first "can" as N.
3. Expand VP → Aux VP; match the second "can" as Aux.
4. Expand the inner VP → V NP; match "hold" as V.
5. Expand NP → DT N; match "the" as DT and "water" as N.
Parse Tree:
            S
          /   \
        NP     VP
      / | \   /  \
    DT ADJ N Aux  VP
     |  |  |  |   / \
   the large can can V  NP
                     |  / \
                  hold DT  N
                        |  |
                      the water
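The same derivation can be reproduced with NLTK's recursive-descent parser, which implements exactly this top-down, backtracking strategy (a minimal sketch; the input is lowercased to match the terminals):

    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> DT ADJ N | DT N
        VP -> Aux VP | V NP
        DT -> 'the'
        ADJ -> 'large'
        N -> 'can' | 'hold' | 'water'
        V -> 'hold'
        Aux -> 'can'
    """)

    # RecursiveDescentParser performs top-down parsing with backtracking
    parser = nltk.RecursiveDescentParser(grammar)
    for tree in parser.parse("the large can can hold the water".split()):
        tree.pretty_print()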
Machine Translation (MT) is the process of automatically translating text from one language to another using computational methods. There are several approaches to MT:
1. Rule-based MT: Uses hand-crafted linguistic rules and bilingual dictionaries to transfer vocabulary and structure between languages.
2. Statistical MT: Learns translation probabilities from large parallel corpora (e.g., phrase-based models).
3. Neural MT: Uses neural networks, typically encoder-decoder architectures, to translate whole sentences; this is the dominant approach today.
4. Hybrid MT: Combines rule-based and data-driven methods.
The Earley Algorithm is a dynamic programming algorithm for parsing sentences in context-free
grammars (CFGs). It can parse all CFGs, including ambiguous and left-recursive grammars. The
algorithm consists of three main steps: prediction, scanning, and completion.
Steps:
1. Prediction: Adds new states based on the grammar rules. If a state predicts a non-
terminal symbol, new states are added for each production of that non-terminal.
2. Scanning: Reads the next input symbol and adds new states for matching terminals.
3. Completion: When a state is complete (i.e., all symbols on the right-hand side of the
production have been parsed), it finds and completes previous states that were waiting for
this non-terminal.
The algorithm uses an Earley table with entries corresponding to input positions, where each
entry contains states representing partial parses.
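NLTK ships an Earley chart parser; a minimal sketch on a toy grammar (trace=1 is assumed here to print the predict, scan, and complete operations as the chart fills):

    import nltk
    from nltk.parse.earleychart import EarleyChartParser

    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Det N
        VP -> V NP
        Det -> 'the'
        N -> 'cat' | 'mouse'
        V -> 'chased'
    """)

    # trace=1 prints each chart operation (prediction, scanning, completion)
    parser = EarleyChartParser(grammar, trace=1)
    for tree in parser.parse("the cat chased the mouse".split()):
        print(tree)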
10x2=20
Part-of-Speech (POS) tagging involves assigning a part of speech to each word in a sentence,
such as noun, verb, adjective, etc. It is crucial for understanding the syntactic structure of
sentences.
Rule-based Tagging: Uses hand-written rules to identify the POS tags. Example: If a
word ends in "ing," tag it as a verb (VBG).
Statistical Tagging: Uses machine learning models trained on annotated corpora to
predict POS tags. Example: Hidden Markov Models (HMM), Conditional Random Fields
(CRF).
Neural Tagging: Uses neural networks, such as recurrent neural networks (RNNs) or
transformers, to predict POS tags based on the context provided by surrounding words.
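A sketch contrasting a tiny rule-based tagger with NLTK's pretrained statistical tagger (assuming the tagger model is downloaded):

    import nltk
    from nltk.tag import RegexpTagger

    tokens = nltk.word_tokenize("The striped bats were hanging quietly")

    # Rule-based: hand-written suffix patterns with a default fallback tag
    rule_tagger = RegexpTagger([
        (r".*ing$", "VBG"),   # gerunds
        (r".*ed$",  "VBD"),   # simple past
        (r".*ly$",  "RB"),    # adverbs
        (r".*",     "NN"),    # default: noun
    ])
    print(rule_tagger.tag(tokens))

    # Statistical: NLTK's pretrained averaged-perceptron tagger
    print(nltk.pos_tag(tokens))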
Approaches to WSD:
Knowledge-based Methods: Use dictionaries and lexical resources such as WordNet; the classic example is the Lesk algorithm, which picks the sense whose definition overlaps most with the surrounding context (illustrated below).
Supervised Methods: Train classifiers on sense-annotated corpora to predict the correct sense from contextual features.
Unsupervised Methods: Cluster word occurrences by context so that each cluster corresponds to a sense, without annotated data.
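A quick illustration of the knowledge-based route, using NLTK's implementation of the Lesk algorithm:

    from nltk import word_tokenize
    from nltk.wsd import lesk

    sentence = "I went to the bank to deposit my money"
    # Lesk picks the WordNet sense whose gloss overlaps most with the context;
    # the chosen sense depends entirely on that overlap heuristic
    sense = lesk(word_tokenize(sentence), "bank")
    print(sense, "-", sense.definition())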
Data pre-processing is crucial in NLP as it prepares raw text data for further analysis and
modeling, ensuring that the data is clean, consistent, and structured. Key benefits include:
Improved Accuracy: Cleaning and normalizing text data helps in reducing noise and
variability, leading to better model performance.
Reduced Complexity: Simplifying text data by removing irrelevant parts (e.g.,
stopwords) reduces dimensionality and computational load.
Enhanced Interpretability: Pre-processing steps like tokenization and lemmatization
make the data more understandable for both humans and machines.
Consistency: Standardizing text data (e.g., lowercasing) ensures uniformity across the
dataset, crucial for reliable analysis.
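A minimal pre-processing sketch with NLTK (assuming the punkt, stopwords, and wordnet resources are available):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    text = "The cats were sitting on the mats."
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    tokens = nltk.word_tokenize(text.lower())           # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha()]         # strip punctuation
    tokens = [t for t in tokens if t not in stops]      # remove stopwords
    print([lemmatizer.lemmatize(t) for t in tokens])    # normalize to lemmas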
b) What are the main types of phrases and what are their roles in forming sentences?
Noun Phrase (NP): Contains a noun and its modifiers (e.g., "the big dog"). It functions
as a subject, object, or complement.
Verb Phrase (VP): Contains a main verb and its auxiliaries, objects, or complements
(e.g., "is running quickly"). It expresses action or state.
Adjective Phrase (AdjP): Contains an adjective and its modifiers (e.g., "very tall"). It
describes nouns or pronouns.
Adverb Phrase (AdvP): Contains an adverb and its modifiers (e.g., "quite slowly"). It
modifies verbs, adjectives, or other adverbs.
Prepositional Phrase (PP): Contains a preposition and its object (e.g., "in the park"). It
functions as an adjective or adverb, providing additional information.
These phrases are building blocks of sentences, each contributing to the overall syntactic and
semantic structure.
c) Explain the concept of context-free grammars and how they relate to phrase structure
grammars.
A Context-Free Grammar (CFG) is a type of formal grammar used to define the syntactic
structure of languages. CFGs consist of a set of production rules that specify how symbols (non-
terminals) can be expanded into sequences of other symbols (terminals and non-terminals). Each
rule takes the form A → α, where A is a non-terminal and α is a sequence of terminals and non-terminals.
Phrase Structure Grammars (PSG) are a type of CFG specifically used to describe the
hierarchical structure of phrases in a language. PSGs define how words and phrases combine to
form larger syntactic units (e.g., sentences), emphasizing the nested structure of language.
Syntactic parsing is the process of analyzing the grammatical structure of a sentence to identify
its constituent parts and their relationships. It involves breaking down a sentence into its parts of
speech (e.g., nouns, verbs) and determining how these parts are connected (e.g., subject, object).
Importance in NLP:
Provides the structural foundation for downstream tasks such as machine translation, information extraction, and question answering.
Resolves structural ambiguity (e.g., prepositional-phrase attachment) so that meaning can be derived correctly.
Supports grammar checking and other writing-assistance tools.
e) Discuss briefly the concept of semantic parsing and its relationship to natural language
processing.
Semantic parsing involves converting natural language into a structured representation that
captures the meaning of the text. This structured representation can be in the form of logical
forms, semantic graphs, or other formal structures that facilitate understanding and manipulation
by machines.
Relationship to NLP:
Core Task: Essential for applications like question answering, machine translation, and
dialogue systems, where understanding meaning is crucial.
Integration: Builds on syntactic parsing by adding layers of meaning, enabling more
sophisticated text analysis and interpretation.
Applications: Used in tasks requiring precise understanding of user intents and the
relationships between different entities in the text.
5x4=20
a) What are the main techniques used for tokenization, stemming, and stopword removal
in NLP?
1. Tokenization:
o Whitespace Tokenization: Splits text based on spaces.
o Punctuation-based Tokenization: Uses punctuation marks to define token
boundaries.
o Regex Tokenization: Employs regular expressions to identify tokens based on
patterns.
2. Stemming:
o Porter Stemmer: Uses a series of rules to iteratively remove suffixes.
o Lancaster Stemmer: A more aggressive version of the Porter Stemmer.
o Snowball Stemmer: An improved version of the Porter Stemmer, offering better
performance.
3. Stopword Removal:
o Predefined Lists: Uses standard lists of stopwords (e.g., "the", "is", "in").
o Frequency-based Methods: Removes the most frequent words in a corpus.
o Customized Lists: Tailors stopwords based on specific application needs.
1. Naive Bayes Classifier: Uses probability theory to classify text based on the frequency
of words.
2. Support Vector Machines (SVM): Finds the optimal hyperplane to separate different
classes in a high-dimensional space.
3. Decision Trees: Classifies text by making a series of decisions based on feature values.
4. Neural Networks: Uses layers of interconnected nodes to learn complex patterns in text
data.
5. Ensemble Methods: Combines multiple classifiers (e.g., Random Forests) to improve
accuracy.
6. Deep Learning Models: Includes architectures like Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs) for handling text classification tasks.
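A minimal text-classification sketch with scikit-learn, pairing bag-of-words features with a Naive Bayes classifier (the training examples are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = ["great movie, loved it", "awful film, waste of time",
                   "wonderful acting", "boring and terrible"]
    train_labels = ["pos", "neg", "pos", "neg"]

    # Bag-of-words features feeding a multinomial Naive Bayes classifier
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)
    print(model.predict(["what a wonderful movie"]))  # likely ['pos']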
1. Noun Phrase (NP): Functions as the subject or object (e.g., "The quick brown fox").
2. Verb Phrase (VP): Contains the main verb and its objects or complements (e.g., "jumps
over the lazy dog").
3. Prepositional Phrase (PP): Provides additional context or information (e.g., "over the
lazy dog").
4. Adjective Phrase (AdjP): Modifies nouns (e.g., "very quick").
5. Adverb Phrase (AdvP): Modifies verbs, adjectives, or other adverbs (e.g., "extremely
quickly").
Top-down parsing starts from the highest-level rule and recursively breaks it down into its
constituent parts until reaching the terminal symbols (words) of the sentence.
Steps:
1. Begin with the start symbol S and expand it using a production rule.
2. Recursively expand the leftmost non-terminal.
3. Match the generated terminals against the input; backtrack and try alternative rules on a mismatch.
4. Repeat until the entire input is derived.
Advantages:
Goal-directed: it only builds structures that can form a complete sentence (S), never fragments that cannot.
Simple to implement with recursion and backtracking.
Disadvantages:
May expand many rules that are inconsistent with the input before examining the words, wasting effort.
Left-recursive rules (e.g., NP → NP PP) cause infinite recursion unless the grammar is transformed.
1. Grammar-based Approaches: Use predefined rules and grammars to convert text into
logical forms.
2. Machine Learning-based Approaches: Train models on annotated datasets to learn
mappings from text to semantic representations.
3. Neural Network-based Approaches: Utilize deep learning models, such as sequence-to-
sequence architectures, to generate semantic parses directly from text.
4. Hybrid Approaches: Combine rule-based and statistical methods to leverage the
strengths of both.
f) What is information retrieval, and how does it differ from traditional database systems?
Information Retrieval (IR) involves finding relevant documents or information within large
datasets based on user queries. It focuses on unstructured data (e.g., text, multimedia).
Differences from Traditional Database Systems:
Data Type: IR deals with unstructured or semi-structured data, while databases handle
structured data.
Querying: IR uses keyword-based or natural language queries, whereas databases use
structured query languages like SQL.
Indexing and Searching: IR uses inverted indexes and relevance scoring, while
databases rely on primary and secondary indexes for exact matches.
Flexibility: IR systems handle ambiguity and partial matches better, providing ranked
results based on relevance.
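A toy inverted index, the core data structure behind IR systems (a sketch; real systems add normalization, ranking, and relevance scoring):

    from collections import defaultdict

    docs = {1: "the cat sat on the mat",
            2: "the dog chased the cat",
            3: "dogs and cats make good pets"}

    # Map each term to the set of document IDs that contain it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    print(sorted(index["cat"]))  # documents matching the query "cat" -> [1, 2]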
10x2=20
a) What is Natural Language Processing and what are its main applications? Explain the
difference between rule-based and statistical approaches to NLP.
Natural Language Processing (NLP) is the field of AI that focuses on the interaction between
computers and humans through natural language. It involves processing and analyzing large
amounts of natural language data to enable machines to understand, interpret, and generate
human language.
Main Applications: machine translation, speech recognition, sentiment analysis, chatbots, text summarization, and information retrieval.
Rule-based Approaches: Rely on hand-crafted linguistic rules and lexicons. They are transparent and precise on the phenomena they cover, but brittle and costly to extend to new domains or languages.
Statistical Approaches: Learn patterns automatically from large corpora. They are more robust to noise and easier to scale, but require substantial data and can be harder to interpret.
Model Paper 1
Unit 1: Natural Language Processing: applications and key issues, The lexicon and
morphology
Unit 2: Phrase structure grammars and English syntax, Part of speech tagging
6. Compare and contrast top-down and bottom-up parsing strategies. (10 marks)
Model Paper 2
4. Define information extraction in NLP. Describe the main techniques used for
information extraction. (10 marks)
5. What is automatic summarization? Differentiate between extractive and abstractive
summarization. (10 marks)
6. Explain the basic concepts of information retrieval. How does it differ from
information extraction? (10 marks)
Model Paper 3
Unit 1: Natural Language Processing: applications and key issues, The lexicon and
morphology
1. Outline the historical evolution of NLP and its key milestones. (10 marks)
2. Describe the structure and function of a lexicon in NLP. (10 marks)
3. Explain the concept of morphological analysis with suitable examples. (10 marks)
Unit 2: Phrase structure grammars and English syntax, Part of speech tagging
6. Describe the algorithm for a top-down parsing strategy. Provide an example. (10
marks)
Model Paper 4
6. What are the main components of a question answering system? How does it work?
(10 marks)
Model Paper 5
Unit 1: Natural Language Processing: applications and key issues, The lexicon and
morphology
1. Highlight the current trends and future directions in NLP. (10 marks)
2. Discuss the role of morphology in text normalization. (10 marks)
3. Explain the concept of a morphological analyzer with an example. (10 marks)
Unit 2: Phrase structure grammars and English syntax, Part of speech tagging
4. What are context-free grammars? Explain their relevance in NLP. (10 marks)
5. Describe the process of developing a part of speech tagger. (10 marks)
Model Paper 6
Model Paper 7
Unit 1: Natural Language Processing: applications and key issues, The lexicon and
morphology
1. Discuss the role of NLP in the development of intelligent personal assistants. (10
marks)
2. Explain the importance of lexicons in machine translation systems. (10 marks)
3. Describe different types of morphemes with examples. (10 marks)
Unit 2: Phrase structure grammars and English syntax, Part of speech tagging
Model Paper 8
Model Paper 9
Unit 1: Natural Language Processing: applications and key issues, The lexicon and
morphology
Unit 2: Phrase structure grammars and English syntax, Part of speech tagging
6. Provide an example of a bottom-up parsing strategy and explain its steps. (10 marks)
Model Paper 10
1. Describe the process of creating a semantic network for NLP. (10 marks)
2. What techniques are used for automatic word sense disambiguation? (10 marks)
3. Explain the concept of sentiment polarity and its determination. (10 marks)