
Indian Institute of Science Education and Research, Bhopal

End Semester Examination
Course: DSE 407/607
Course Name: Natural Language Processing
Full marks: 30
Date: 22 Nov 2022

Section A: The following questions are two marks each. Answer any ten of them.

1. Describe
a. ReLU
b. Sigmoid
c. A use-case where you will prefer (a) over (b) and vice-versa.

Ans: (a) ReLU is an activation function defined as y = ReLU(z) = max(z, 0), where z is the input variable and y is the output variable.

(b) Sigmoid is an activation function defined as y = σ(z) = 1 / (1 + e^(−z)). It squashes the outliers toward 0 or 1.

(c) In the hidden-layer units of a feed-forward neural network, ReLU is the preferred activation function. In the output layer of multi-label classification, sigmoid is the preferred activation function. (A short numerical sketch of both functions follows question 2.)

2. What are the different ways with which you can convert a list of word embeddings to a sentence embedding? Define the use-cases in which you will prefer one method over another.

Ans: There are two ways in which we can convert a list of word embeddings to a sentence embedding: (a) pooling and (b) concatenation. In pooling we apply a max/min/average operation across each dimension of the word embeddings to get the sentence embedding; here the dimensions of the word and sentence embeddings are the same. On the other hand, in concatenation we concatenate the word embeddings horizontally to get the sentence embedding. Thus, if we have a list of m words, each with dimension 1×n, the sentence vector will be of dimension 1×mn.

Concatenation preserves the word representations more effectively than pooling, but at the same time it requires more memory to store the sentence embeddings. Thus, if we have ample system memory (GPU/RAM) we can opt for concatenation; on the contrary, pooling is a reasonable choice for low-memory settings.
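As a quick numerical illustration of the two activation functions from question 1, a minimal NumPy sketch (the sample inputs are arbitrary):

import numpy as np

def relu(z):
    # ReLU(z) = max(z, 0), applied element-wise
    return np.maximum(z, 0.0)

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)); squashes values toward 0 or 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
print(relu(z))     # [0.  0.  0.  0.5 4. ]
print(sigmoid(z))  # approx [0.018 0.378 0.5 0.622 0.982]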

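Likewise, a minimal NumPy sketch of the two sentence-embedding strategies from question 2, using made-up 3-dimensional embeddings for a 4-word sentence (m = 4, n = 3):

import numpy as np

# hypothetical word embeddings for a 4-word sentence, each of dimension 1x3
words = np.array([[ 0.1,  0.3, -0.2],
                  [ 0.4,  0.0,  0.5],
                  [-0.3,  0.2,  0.1],
                  [ 0.2, -0.1,  0.4]])

# pooling: reduce across the words, so the sentence embedding keeps dimension n = 3
avg_pooled = words.mean(axis=0)
max_pooled = words.max(axis=0)

# concatenation: lay the word vectors side by side, giving dimension m*n = 12
concatenated = words.reshape(-1)

print(avg_pooled.shape, max_pooled.shape, concatenated.shape)  # (3,) (3,) (12,)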
3. What are the advantages and disadvantages of neural language models (NLM) over n-gram language models? Describe a case where you would prefer an n-gram language model over an NLM.

Ans: Some of the advantages NLMs have over n-gram based LMs are:
(i) NLMs can handle much longer histories, theoretically up to infinite length.
(ii) NLMs can generalize better over contexts of similar words as compared to n-gram models.
(iii) They are more accurate at word prediction than n-gram models.

The disadvantages of NLMs with respect to n-gram models are:
(i) NLMs are more complex, slower and need more energy to train than n-gram models.
(ii) The decision process of NLMs is not as transparent as it is in n-gram models.

So for smaller tasks and in low-cost settings, where the model does not expect to encounter many unknown words, an n-gram language model is still the right tool.

4. Answer the following in the context of the language modeling task:
a. What are the limitations and advantages of feed-forward neural networks?
b. What are the advantages and disadvantages of LSTMs?

Ans: Limitations of FFNN:
1. In an FFNN we cannot give an arbitrary-length context (previous words) to predict the next word. Thus the decision process is locally dependent.
2. It needs simultaneous access to all context words. This makes it difficult to use for applied LM tasks like on-the-spot translation.

Benefits of FFNN:
1. The computation is easily parallelizable.
2. With the help of pretrained embeddings, the context can be generalized.
3. It is more accurate at word prediction than n-gram models.

Advantages of LSTM:
1. It solves the issue of vanishing and exploding gradients that we encounter in RNNs.
2. Context management is better in an LSTM, which makes the decision process more accurate.

Disadvantages of LSTM:
1. The inherently sequential nature of recurrent networks makes it hard to do computation in parallel.
2. Passing information through an extended series of recurrent connections leads to information loss and difficulties in training.

5. What are positional encodings? What is their role in transformer networks?

Ans: Transformers by themselves have no notion of the relative or absolute positions of the tokens in the input. Thus, to preserve positional information, we modify the input embeddings by combining them with positional information specific to each position in an input sequence. The simplest way to generate a positional embedding is to take the position of the token from the starting position and append it to the input embedding. Alternatively, we can use a static function that maps integer positions to real-valued vectors in a way that captures the inherent relationships among the positions.

6. What is perplexity? How can it help us to identify better language models?

Ans: Perplexity is used to measure the quality of a language model. The perplexity (PP) of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length: for a test set W = w1 w2 ... wN, PP(W) = P(w1 w2 ... wN)^(−1/N). Better models will have lower perplexity.

7. Define
a. Teacher forcing. What is its usage?
b. Auto-regression. What is its usage?

Ans: The idea that we always give the model the correct history sequence to predict the next word (rather than feeding it its own best guess from the previous time step) is called teacher forcing. It is used in the training of sequence-to-sequence tasks.

Autoregression, on the other hand, is the idea that we give the currently generated history as input to generate the output for the next step. It is typically used in the deployment/prediction/test phase.
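A minimal sketch contrasting the two regimes from question 7. The decoder_step function here is a hypothetical stand-in for one step of any sequence-to-sequence decoder (it takes the previous token and a state, and returns a predicted token and an updated state):

def decode_with_teacher_forcing(decoder_step, gold_tokens, state):
    # training-time decoding: the input at each step is the correct previous token
    predictions = []
    prev = "<s>"
    for gold in gold_tokens:
        pred, state = decoder_step(prev, state)
        predictions.append(pred)
        prev = gold               # feed the gold history, not the model's own prediction
    return predictions

def decode_autoregressively(decoder_step, max_len, state):
    # test-time decoding: the input at each step is the model's own previous prediction
    predictions = []
    prev = "<s>"
    for _ in range(max_len):
        pred, state = decoder_step(prev, state)
        if pred == "</s>":
            break
        predictions.append(pred)
        prev = pred               # feed the generated history back in
    return predictions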
8. What is
a. a stacked RNN? What are its advantages and disadvantages?
b. a bidirectional RNN? What are its advantages and disadvantages?

Ans: (a) Stacked RNNs consist of multiple RNN layers where the output of one layer serves as the input to a subsequent layer.

Advantages:
1. Stacked RNNs generally outperform single-layer networks, because the network induces representations at differing levels of abstraction across layers.

Disadvantages:
1. Computation is costlier and directly proportional to the number of layers.

(b) Bidirectional RNNs are RNNs which take both the left and the right context as input to produce the output for the current hidden state. Technically, they are a combination of two simple RNNs, where one RNN reads the left context and the other reads the right context to produce hidden representations. These representations at each token are concatenated to get the final representation. Other simple ways to combine the representations include element-wise addition or multiplication.

Advantages:
1. For tasks where the whole input sequence is available, this RNN produces better results due to the availability of both contexts.

Disadvantages:
1. We need both the left and the right context to use this type of RNN.
2. Computational complexity is high as compared to simple RNNs.
3. It also suffers from the vanishing and exploding gradient problems.

9. What is a residual connection? What are its advantages?

Ans: Residual connections are connections that pass information from a lower layer to a higher layer without going through the intermediate layer. Here we add a layer's input vector to its output vector before passing it forward.

Advantages: Allowing information from the activation going forward, and the gradient going backwards, to skip a layer improves learning and gives higher-level layers direct access to information from lower layers. This improves the overall performance of the model.

10. Describe multi-head attention. How is it helpful in building NLP models?

Ans: A multi-head attention layer consists of a set of self-attention layers, called heads, that reside in parallel at the same depth in a model, each with its own set of parameters. Each head i is provided with its own set of key, query and value matrices, which are used to project the inputs into separate key, query and value embeddings for that head. Given these distinct sets of parameters, each head can learn different aspects of the relationships that exist among inputs at the same level of abstraction. In NLP, different words in a sentence can relate to each other in many different ways simultaneously; for example, distinct syntactic, semantic and discourse relationships can hold between verbs and their arguments in a sentence. With the help of multi-head self-attention, we can learn these aspects simultaneously. (A small sketch of the per-head projections follows question 11.)

11. What is referential density? How is it causing a problem in machine translation?

Ans: The referential density of a language refers to the density of pronoun usage in that language. Some languages, like Japanese and Chinese, tend to omit pronouns far more than Spanish and English. Translating a text from a referentially less dense language into a more dense language is a problem because the model must somehow identify each pronoun gap/zero and recover who or what is being talked about in order to insert the proper pronouns.
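A minimal NumPy sketch of the per-head projections described in question 10. The dimensions and random weights are only illustrative; a trained layer would learn the projection matrices:

import numpy as np

def softmax(scores):
    # numerically stable softmax over the last axis
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: one projection matrix per head
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h   # separate query/key/value spaces per head
        scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product comparisons
        alpha = softmax(scores)                  # relevance weights between all token pairs
        heads.append(alpha @ V)                  # weighted sum of values for this head
    return np.concatenate(heads, axis=-1)        # concatenate the per-head outputs

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 5, 8, 2, 4
X = rng.normal(size=(seq_len, d_model))
Wq = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
Wk = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
Wv = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
print(multi_head_self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)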
Section B: The following questions are five marks each. Answer any two of them.

12. Describe the hidden Markov model. Can you use it for named entity recognition? If yes, how will you do it?

Ans: An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, etc.), it computes a probability distribution over possible sequences of labels and chooses the best label sequence. The HMM is based on augmenting the Markov chain.

A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state. All the states before the current state have no impact on the future except via the current state. Formally, consider a sequence of state variables q1, q2, ..., qi; then P(qi = a | q1...qi−1) = P(qi = a | qi−1).

In many cases, however, the events we are interested in are hidden, i.e. we do not observe them directly, for example the part-of-speech tags or named entity tags of a given input text. A hidden Markov model (HMM) allows us to model both observed events and hidden events that we think of as causal factors in our probabilistic model. Formally, let

Q = {q1, q2, ..., qN} be a set of N states;
A = {a11, ..., aij, ..., aNN} be the transition probability matrix, where each aij represents the probability of moving from state i to state j;
O = {o1, o2, ..., oT} be a sequence of T observations, each one drawn from a vocabulary V = {v1, v2, ..., vV};
B = bi(ot) be the emission probabilities, each expressing the probability of an observation ot being generated from a state qi;
π = π1, π2, ..., πN be the initial probability distribution over states, where πi is the probability that the Markov chain will start in state i.

A first-order hidden Markov model instantiates two simplifying assumptions. First is the Markov assumption, i.e. P(qi | q1, ..., qi−1) = P(qi | qi−1). Second, the probability of an output observation oi depends only on the state qi that produced the observation and not on any other states or observations: P(oi | q1, ..., qi, ..., qT, o1, ..., oi, ..., oT) = P(oi | qi). The probability of a state sequence given the observations is then

P(q1, q2, ..., qn | o1, o2, ..., on) = P(o1, o2, ..., on | q1, q2, ..., qn) P(q1, q2, ..., qn) / P(o1, o2, ..., on),

where the conditional probabilities are replaced by the transition and emission probability matrix values under the above two assumptions.

For NER, the observable states are the vocabulary words and the hidden states are the named entity tags.
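To make the NER use concrete, a minimal sketch of HMM decoding with the Viterbi algorithm; the two tags (O, PER), the two-word vocabulary and all probability tables below are made up purely for illustration:

import numpy as np

tags = ["O", "PER"]
vocab = {"john": 0, "runs": 1}
pi = np.array([0.7, 0.3])              # initial tag probabilities
A = np.array([[0.8, 0.2],              # A[i, j] = P(tag j | tag i)  (transitions)
              [0.6, 0.4]])
B = np.array([[0.1, 0.9],              # B[i, o] = P(word o | tag i) (emissions)
              [0.9, 0.1]])

def viterbi(words):
    obs = [vocab[w] for w in words]
    T, N = len(obs), len(tags)
    v = np.zeros((T, N))                # best path probability ending in each tag
    back = np.zeros((T, N), dtype=int)  # backpointers
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] * A[:, j] * B[j, obs[t]]
            back[t, j] = scores.argmax()
            v[t, j] = scores.max()
    best = [int(v[-1].argmax())]        # follow backpointers from the best final tag
    for t in range(T - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return [tags[i] for i in reversed(best)]

print(viterbi(["john", "runs"]))        # ['PER', 'O'] under these toy parameters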
13. Describe the mechanism of self-attention networks. How is it helping the transformer networks to attain more parallel processing?

Ans: A self-attention layer allows a network to directly extract and use information from arbitrarily large contexts without the need to pass it through intermediate recurrent connections as in RNNs. It maps input sequences (x1, ..., xn) to output sequences of the same length (y1, ..., yn). When processing each item in the input, the model has access to all of the inputs up to and including the one under consideration, but it has no access to information about inputs beyond the current one. The computation performed for each item is independent of all the other computations, so we can easily parallelize both forward inference and training of such models.

At the core of an attention-based approach is the ability to compare an item of interest to a collection of other items in a way that reveals their relevance in the current context. In the case of self-attention, the comparisons are to other elements within the given sequence, and the result of these comparisons is used to compute an output for the current input. For example, the computation of y3 is based on a set of comparisons between the input x3 and its preceding elements x1 and x2, and to x3 itself. The simplest form of comparison between elements in a self-attention layer is a dot product, score(xi, xj) = xi · xj. The scores for element i are normalized into weights αij, where αij indicates the proportional relevance of each input j to the input element i that is the current focus of attention. Given the proportional scores in α, we then generate an output value yi by taking the sum of the inputs seen so far, weighted by their respective α values: yi = Σ_{j ≤ i} αij xj.

14. Describe the encoder-decoder architecture. How will you incorporate attention mechanisms to make it better?

Ans: An encoder-decoder network, also known as a sequence-to-sequence network, is a type of architecture where we map from a sequence of input words or tokens to an output sequence that is not merely a direct mapping from individual words. These models are capable of generating contextually appropriate, arbitrary-length output sequences. The key idea underlying these networks is the use of an encoder network that takes an input sequence and creates a contextualized representation of it, often called the context. This representation is then passed to a decoder which generates a task-specific output sequence.

Mathematically, it can be written as c = h^e_n, h^d_t = g(y_{t−1}, h^d_{t−1}, c) and y_t = softmax(h^d_t), where c is the context vector, h^d_t / h^e_t refer to the decoder/encoder hidden state at time step t, and y_t refers to the output at time step t.

In a typical encoder-decoder network, this context vector (the final encoder hidden vector) is either fed to the first unit of the decoder or to every unit of the decoder to generate the output. Thus, the only thing the decoder knows about the source text is what is in this context vector, and this final hidden state therefore acts as a bottleneck.

The attention mechanism is a solution to the bottleneck problem. It allows the decoder to get information from all the hidden states of the encoder, not just the last hidden state. The idea of attention is to create the single fixed-length vector c by taking a weighted sum of all the encoder hidden states. The weights focus on ('attend to') a particular part of the source text that is relevant for the token the decoder is currently producing.
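A minimal NumPy sketch of the attention step from question 14: the context vector for one decoder time step is computed as a weighted sum of all encoder hidden states. Dot-product scoring is assumed here, and the vectors are arbitrary:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    # compare the current decoder state with every encoder hidden state
    scores = encoder_states @ decoder_state   # one score per source token
    weights = softmax(scores)                 # how much to 'attend to' each source token
    context = weights @ encoder_states        # weighted sum of all encoder hidden states
    return context, weights

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 4))      # 6 source tokens, hidden size 4
decoder_state = rng.normal(size=4)            # current decoder hidden state
c, w = attention_context(decoder_state, encoder_states)
print(c.shape, w.sum())                       # (4,) and weights summing to 1.0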
15. What are the advantages of LSTM? Describe the different gate mechanisms it has.

Ans: RNNs need to backpropagate the error signal back through time. As a result, during the backward pass of training, the hidden layers are subject to repeated multiplications, as determined by the length of the sequence. A frequent result of this process is that the gradients are eventually driven to zero, a situation called the vanishing gradients problem.

Long short-term memory (LSTM) provides a solution by adding an explicit context layer to the architecture (in addition to the usual recurrent hidden layer) and through the use of specialized neural units that make use of gates to control the flow of information into and out of the units that comprise the network layers.

The gates in an LSTM share a common design pattern: (i) a feed-forward layer, (ii) followed by a sigmoid activation function, (iii) followed by a pointwise multiplication with the layer being gated. The choice of the sigmoid as the activation function arises from its tendency to push its outputs to either 0 or 1. Combining this with a pointwise multiplication has an effect similar to that of a binary mask: values in the layer being gated that align with values near 1 in the mask are passed through nearly unchanged, while values corresponding to lower values are essentially erased.

An LSTM has three gates: (i) the forget gate, (ii) the add gate, and (iii) the output gate.

Forget gate: The purpose of this gate is to delete information from the context that is no longer needed. The forget gate computes a weighted sum of the previous state's hidden layer and the current input and passes that through a sigmoid. This mask is then multiplied element-wise with the context vector to remove the information from the context that is no longer required.

Add gate: Here the task is to compute the actual information we need to extract from the previous hidden state and the current input (the same basic computation we have been using for all our recurrent networks). We then generate the mask for the add gate to select the information to add to the current context, and add this to the modified context vector to get our new context vector.

Output gate: It is used to decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions).

Given the appropriate weights for the various gates, an LSTM accepts as input the context layer and hidden layer from the previous time step, along with the current input vector, and it generates updated context and hidden vectors as output.
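A minimal NumPy sketch of a single LSTM step following the gate description above; the weights are random placeholders (and bias terms are omitted for brevity), so this only shows the data flow, not a trained unit:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    z = np.concatenate([h_prev, x])    # previous hidden state plus current input
    f = sigmoid(W["f"] @ z)            # forget gate: what to erase from the context
    i = sigmoid(W["i"] @ z)            # add gate: which new information to write
    g = np.tanh(W["g"] @ z)            # candidate information from h_prev and x
    o = sigmoid(W["o"] @ z)            # output gate: what the new hidden state exposes
    c = f * c_prev + i * g             # updated context vector
    h = o * np.tanh(c)                 # updated hidden vector
    return h, c

rng = np.random.default_rng(2)
d_in, d_hid = 3, 4
W = {k: rng.normal(size=(d_hid, d_in + d_hid)) for k in ["f", "i", "g", "o"]}
x, h, c = rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(x, h, c, W)
print(h.shape, c.shape)                # (4,) (4,)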
