Vin AI
Given a grayscale image X of size 4x4 with intensity levels in the range [0, 15],
this image is passed through a point processing function f(X) defined as
$f(X) = a \log_2(1 + X) + b$,
where a and b are two constants.
Assume that some of the pixel values of the input and output are given as in the following
figure:
[Figure: input image X and output image f(X), each 4x4, with some pixel values shown]
What are the values of c and d in the above figure?
O a. c=3, d=7

Consider applying an attention mechanism to an encoder-decoder seq2seq architecture. Which of
the following are applicable?
O a. In the global attention mechanism, the context is derived as the weighted sum of all
input hidden states of the encoder.
O b. Global attention uses all input hidden states of the encoder to derive a single
context c, which is then applied repeatedly to generate the decoded sequence.
O c. In the local attention mechanism, the context is derived as the weighted sum of half
of the input hidden states of the encoder.
O d. Let $h_t$ and $h_s$ be the output hidden state at position t and the input hidden state at
position s, respectively. One way to measure the score between them to build the
attention score is their dot product.

Given an RGBA image of size 200x200, compute its size, assuming no compression.
O a. 120 kB
O b. 80 kB
O c. 160 kB
O d. 200 kB
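As a quick check of the arithmetic, the sketch below computes the raw size as width x height x channels. It assumes 8 bits (1 byte) per channel and 1 kB = 1000 bytes; neither is stated in the question, and the function name is only illustrative.

```python
def raw_image_size_bytes(width: int, height: int, channels: int,
                         bytes_per_channel: int = 1) -> int:
    """Uncompressed size: one stored value per pixel per channel."""
    return width * height * channels * bytes_per_channel


# RGBA has 4 channels (red, green, blue, alpha).
size_bytes = raw_image_size_bytes(200, 200, channels=4)
size_kb = size_bytes / 1000  # assumption: 1 kB = 1000 bytes
print(size_bytes, size_kb)
```

With 16-bit channels the same formula would give twice this size, so the result depends on the assumed bit depth.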
Consider an RNN language model for text modelling, with further reference to the
architecture below, for the sequence of words denoted by $\{w_1, w_2, \ldots, w_n\}$.
Which of the following statements are true?
O a. One can train this RNN language model by maximizing the likelihood
$p(w_1) \prod_{i=2}^{n} p(w_i \mid w_{1:i-1})$.
O b. The task of the RNN language model is to model the conditional probability of a
word in a sentence given the sequence of previous words.
O c. This architecture is an example of a many-to-one seq2seq model.
O d. The language model property implies $p(w_i \mid w_{1:i-1}) = p(w_i \mid h)$.

Suppose we have a collection of 3 documents: document 1 says "lions eat fat cats";
document 2 says "cats eat fat mice"; document 3 says "mice eat fat cheese". Compute the
cosine similarity of document 1 and document 2 with equal TF weighting:
O a. 0.25
O b. 0.5
O c. 1.0
O d. 0.75
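The computation can be sketched as follows, reading "equal TF weighting" as giving every term occurrence a weight of 1 (raw term counts, no IDF). That reading and the helper name are assumptions, not something the question spells out.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two documents using raw term-frequency vectors."""
    tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b)


doc1 = "lions eat fat cats"
doc2 = "cats eat fat mice"
similarity = cosine_similarity(doc1, doc2)
print(similarity)
```

The two documents share three of their four terms and each vector has length 2, so the similarity is 3 / (2 * 2).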
Given the function $f(x) = \frac{1}{n} \sum_{i=1}^{n} (x - i)^2$. We need to solve $\min_x f(x)$ using
stochastic gradient descent with learning rate $\eta = 0.1$. Assume that at iteration $t$, we have
$x_t = 10$ and we sample a batch $i_1 = 1$, $i_2 = 2$, $i_3 = 3$, $i_4 = 4$ of indices. What is the value
of $x_{t+1}$ at the next iteration?
O a. $x_{t+1} = 8.5$.
O b. $x_{t+1} = 1000$.
O c. $x_{t+1} = 8$.
O d. $x_{t+1} = 9$.

What is the main reason for the non-robustness of Transformer?
O a. The masking is not robust against outliers in data
O b. The positional encoding is not robust against outliers in data
O c. The multi-head attention is not robust against outliers in data
O d. The attention matrix in each layer is not robust against outliers in data

Which of the following actions can be taken to avoid overfitting?
O a. Increasing the complexity of the model
O b. Using dropout layers
O c. Collecting more data
O d. Increasing the learning rate
O e. Decreasing the learning rate
O f. Applying regulariser terms

Given a 2D convolution layer with kernel size 11, dilation 1, and stride 1, which padding
should we use for that layer so that the output tensor has the same spatial resolution as the
input?
O a. 0
O b. 3
O c. 1
O d. 7
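For a stride-1 convolution, the output size along one axis is n + 2p - d(k - 1), so the size is preserved when p = d(k - 1) / 2. The sketch below applies that standard formula to a few kernel sizes. The function names and the input width of 32 are illustrative assumptions.

```python
def same_padding(kernel_size: int, dilation: int = 1) -> int:
    """Symmetric padding that preserves spatial size for a stride-1 convolution."""
    effective = dilation * (kernel_size - 1) + 1  # span covered by the dilated kernel
    if effective % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd effective kernel size")
    return (effective - 1) // 2


def conv_output_size(n: int, kernel_size: int, padding: int,
                     stride: int = 1, dilation: int = 1) -> int:
    """Standard output-size formula for a convolution along one spatial axis."""
    return (n + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


for k in (1, 3, 7, 11):
    p = same_padding(k)
    # Prints kernel size, required padding, and the resulting output width.
    print(k, p, conv_output_size(32, k, p))
```

Each kernel size is paired with the padding that keeps the 32-wide input at width 32.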
Given n labelled data samples $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \ldots, n$, we are interested in solving
the linear regression problem
$\min_{\beta} \sum_{i=1}^{n} (x_i^\top \beta - y_i)^2 + \lambda \|\beta\|_p$.
If we want to recover a solution $\beta^*$ that is sparse, what is the recommended value for p?
Select one:
O a. p = 1 or p = ∞
O b. p = 1 or p = 2
O c. None of these.
O d. p = 0 or p = 1

Given the function $f(x) = \frac{1}{n} \sum_{i=1}^{n} (x - i)^2$. We need to solve $\min_x f(x)$ using
stochastic gradient descent with learning rate $\eta = 0.1$. Assume that at iteration $t$, we have
$x_t = 10$ and we sample a batch $i_1 = 1$, $i_2 = 2$, $i_3 = 3$, $i_4 = 4$ of indices. What is the value
of $x_{t+1}$ at the next iteration?
O a. $x_{t+1} = 8.5$.
O b. $x_{t+1} = 1000$.
O c. $x_{t+1} = 8$.
O d. $x_{t+1} = 9$.

Which loss function is most sensitive to outliers?
Select one:
O a. The Huber loss.
O b. The 0-1 loss.
@® c. The square error loss.
O d. The absolute error loss.
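To see how three of these losses react to an outlier, the sketch below evaluates them on residuals of growing size. The Huber threshold delta = 1.0 and the function names are assumptions chosen for illustration. The 0-1 loss is omitted because it is bounded by 1 by definition.

```python
def squared_error(r: float) -> float:
    """Grows quadratically with the residual."""
    return r ** 2


def absolute_error(r: float) -> float:
    """Grows linearly with the residual."""
    return abs(r)


def huber(r: float, delta: float = 1.0) -> float:
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    return 0.5 * r ** 2 if a <= delta else delta * (a - 0.5 * delta)


for r in (0.5, 1.0, 10.0, 100.0):
    # Columns: residual, squared error, absolute error, Huber loss.
    print(r, squared_error(r), absolute_error(r), huber(r))
```

For a residual of 100 the squared error is 10000, while the absolute and Huber losses stay near 100, so a single outlier dominates a squared-error objective far more strongly.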
Neural networks
Select one:
@® a. can be used for regression as well as classification.
O b. optimize a convex cost function.
O c. always output values between 0 and 1.
O d. always use a differentiable loss function.

Let A and B be two 2 x 2 matrices such that the product matrix AB is equal to [2 x 2 matrix not recoverable].
What is the trace of BABA?
O a. 5
O b. 6.
O c. None of these answers.
O d. 4.

Let X be a random variable taking only positive values with mean 1. Which of the following
cannot be true?
O a. P(X > 2) = 0.7
O b. P(X > 2) = 0.4
O c. P(X > 2) = 0.6
O d. P(X > 2) = 0.5

What is the value of cos(α), where α is the angle between the two planes x + y + 2z = 1 and
x - y - 2z = 0?
Oa $
Ob 2
Oc -2

What is the rank of the linear map $f : \mathbb{R}^3 \to \mathbb{R}^3$ where
$f(x, y, z) = (x + y + z,\ x - y - 2z,\ 3x + y + z)$?

What is the anti-derivative of the function $f(x) = x + 2x \ln(x)$?
O a. $x + \ln(x)$
O b. $x^2 \ln(x) - 2$
O c. $x^2 \ln(x)$

What is the value of the sum [expression not recoverable] for |x| > 1?
=z
Oa FA
l+z
Ob 3
lcs 1

What is the use of a matrix in linear algebra?
O a. All of the mentioned.
O b. Store data
O c. Store coordinates of a linear map.

A fair coin is tossed four times. What is the probability that at least two consecutive
Heads appear?
Da
Ob.
ele ole slo
Oc

Which of the following is true for reinforcement learning with human feedback (RLHF) to
train large language models?
© a. Human feedback for RLHF requires an iterative process for model training and
human evaluation
O b. Human feedback for RLHF is based on human annotation of data for the instruction
classification task
O c. RLHF allows multiple people to provide feedback for large language models
© d. RLHF utilizes human feedback during the reinforcement learning process
O e. RLHF avoids human-designed reward functions

What is a true statement for projective dependency parsing?
O a. Multiple stacks are used for arc-standard parsing
O b. Graph-based parsing cannot be used to produce projective dependency trees
O c. Dependency arcs can be projected into a tree shape when words are put in their
linear order
O d. Parsing trees are obtained by adding a swap-transition operation to transition-based
parsing
O e. Organizing all arcs above the words in their linear order, dependency arcs are not
crossed

Which of the following is NOT true for in-context learning with large language models?
© a. In-context learning does not require the models to be trained with demonstrations
O b. The number of demonstrations in in-context learning might be limited by the
context window
O c. In-context learning will mainly emerge when the models are large enough
O d. In-context learning requires some demonstrations as the input

The words room and house are in a lexical semantic relation, in which room is the (1) and
house is the (2).
O a. (1) meronym
(2) holonym
© b. (1) hypernym
(2) hyponym
© c. (1) holonym
(2) meronym
O d. (1) hyponym
(2) hypernym

What is the limitation of Graph Convolutional Networks (GCN)?
O a. The training of GCNs is isomorphism that cannot learn to classify complex
structures.
O b. They cannot learn to distinguish certain simple graph structures.
O c. They cannot be applied to text vision as there is no graph there.
O d. They might need many layers to work well.

Which of the following statements is true when you use 1x1 convolutions in a CNN?
O a. It can be used for feature pooling
O b. It can help in dimensionality reduction
O c. It suffers less overfitting due to small kernel size
O d. All of the above
O e. None of the above

In the Generative Pretrained Transformer model, how is the standard multi-head attention
layer modified?
O a. The dimensions of the hidden vectors are adaptively increased
© b. It must be masked appropriately
O c. More attention heads are introduced
O d. The normalization operation is appended
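The option about masking can be illustrated with a causal mask, which is the usual reading for decoder-only models: position t may attend only to positions s <= t. The question itself only says the layer "must be masked appropriately", so the sketch below is an assumption about what that means, with illustrative names and toy scores.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s], with future positions s > t masked out."""
    weights = []
    for t, row in enumerate(scores):
        # Future positions get -inf so they receive zero weight after the softmax.
        masked = [val if s <= t else float("-inf") for s, val in enumerate(row)]
        m = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy 3x3 score matrix (query position t in rows, key position s in columns).
scores = [[0.0, 2.0, 1.0],
          [1.0, 1.0, 3.0],
          [0.5, 0.5, 0.5]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first position can only attend to itself, and every entry above the diagonal is exactly zero regardless of how large the raw score was.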
With rare exception, the verb die appears without a complement. This is an example of a ___
constraint.
O a. count noun
O b. subcategorization
O c. selection
O d. animacy

Select the correct term for the following blank:
___ tagger uses probabilistic and statistical information to assign tags to words
O a. Stochastic
© b. Rule-based
O c. Statistical
O d. POS

Is the statement "Hierarchical Softmax increases the computation complexity when
compared to Softmax" true?
O a. It depends on the application
© b. Not enough information to determine
O c. False
O d. True

In a Hidden Markov model for part-of-speech tagging, observation likelihoods measure
O a. The likelihood of a word given a POS tag
O b. The likelihood of a POS tag given a word
O c. The likelihood of a POS tag given two preceding tags
O d. The likelihood of a POS tag given the preceding tag

Suppose we wanted to extract reports of people visiting foreign countries, such as “Fred
occasionally visited Morocco.” What is the lexicalized dependency path (LDP) for this
relation?
© a. Fred -> visited -> Morocco
O b. Fred <- visited -> Morocco
O c. Fred -> occasionally <- visited -> Morocco
O d. Fred <- visited <- Morocco
O e. Fred <- occasionally <- visited -> Morocco

The GloVe word embeddings
O a. cannot be applied to the languages that cannot be tokenized by spaces
O b. predicts the current word based on the context words
O c. predicts the context words based on the current word
O d. None of the other three statements

Which of the following steps should be done to freeze a network for inference only?
O a. Put the inference code in 'with torch.inference_mode():'
M b. Set: 'net.eval()'
O c. Set: 'net.train()'
M d. Set: 'for p in net.parameters():
p.requires_grad = False'
O e. Put the inference code in 'with torch.no_grad():'
O f. Set: 'for p in net.parameters():
p.requires_grad = True'

Each image is represented by a color matrix whose elements are normalized to float numbers
in [0, 1]. For an input image A, gamma correction transforms it to a new image $A' = A^{\gamma}$,
where $\gamma$ (gamma) is a hyper-parameter.
Which value of gamma is suitable to get the enhancement result below?
[Figure: output image]
Oas
O c. 0.5
O d. 0.25

In stereo matching, what is the biggest advantage of image rectification?
O a. All epipolar lines are perfectly vertical
O b. Images are scaled to a desirable size.
O c. Epipoles are moved to the center of the image.
O d. All epipolar lines intersect at the vanishing point
O e. All epipolar lines are perfectly horizontal

What are the values of a and b so that the homogeneous 3-vectors (1, a, b) ~ (-2, 4, -6),
where ~ represents the equivalence relation?
O a. a=-2, b=3
O b. a=-2, b=-3
O c. a=2, b=3
O d. a=2, b=-3

In computer vision, there are many data modalities that can be represented as a sequence,
such as videos. What are the common ways to process these kinds of data?
© a. Vision Transformer
O b. Multilayer perceptron (MLP)
M c. Convolutional neural networks
O d. Recurrent neural networks

What is the main reason for the non-robustness of Transformer?
O a. The attention matrix in each layer is not robust against outliers in data
O b. The masking is not robust against outliers in data
O c. The positional encoding is not robust against outliers in data
O d. The multi-head attention is not robust against outliers in data

In machine learning, regularization discourages learning a more complex or flexible model
to prevent overfitting. Which of the following statements about the regularization parameter
λ is not correct?
Select one:
O a. Using too small a value of λ can cause your hypothesis to underfit the data.
O b. None of these.
O c. Using too large a value of λ can cause your hypothesis to overfit the data.
@® d. Using a very large value of λ cannot hurt the performance of your hypothesis.

Assume that we would like to minimize the function f, where f is a convex and smooth
function. Then, the convergence rate of the Nesterov's accelerated method is:
O A. Sublinear convergence 1/t.
O B. Linear convergence
O C. Sublinear convergence 1/t^2.
O D. Superlinear convergence

After training a linear regression model by minimizing the empirical loss of a training data
set, we get a function f*(x) that has zero empirical loss. Which of the following statements
may be WRONG?
Select one or more:
M a. f*(x) may have a good generalization
O b. f*(x) generalizes the best among all linear functions
M c. f*(x) may be overfitting
O d. f*(x) may be underfitting

the linear regression problem
$\min_{\beta} \sum_{i=1}^{n} (x_i^\top \beta - y_i)^2$
If $\beta^*$ is the solution of the above regression problem, then which first-order optimality
condition should $\beta^*$ satisfy?
Select one:
Oa ,
V's - wu =0
ei
6" -v)=0
~
6" we =0
+
Yee, —y)B=0
ei

Consider the optimization problem to train a feed-forward NN:
$\min_{\theta} J(\theta) = \Omega(\theta) + \frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$,
where $\theta = [(W^k, b^k)]_{k=1}^{L}$, $f(x_i; \theta)$ returns the prediction probabilities for $x_i$ with
ground-truth label $y_i$, $\Omega(\cdot)$ is the regularization term, and CE is the cross entropy loss function.
Choose all correct answers.
O a. $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ is known as a regularization term.
M b. $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ is known as the empirical loss.
O c. Minimizing $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ makes the model fit the testing set better.
O d. Minimizing $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ can lead to overfitting.
M e. Minimizing $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ makes the model fit the training
dataset better.

For the K-means clustering problem, we have i.i.d. data $X_1, X_2, \ldots, X_n$, and we would like
to partition the data into K clusters based on their similarity. The quality of the clusters
mainly depends on:
O A. Data generating distribution
O B. The dimension of the data
O C The number of samples
© D. The separation of the data

Which of the following are true about generative models?
Select one or more:
O a. Support Vector Machine is a generative model.
M b. They can be used for classification.
M c. Linear discriminant analysis is a generative model.
O d. The perceptron is a generative model.

When performing regression or classification, which of the following is the correct way to
preprocess the data?
Select one:
O a. PCA -> normalize PCA output -> training.
O b. Normalize the data -> PCA -> training.
O c. Normalize the data -> training -> PCA -> evaluation of performance score.
O d. Normalize the data -> PCA -> normalize PCA output -> training.

A stick is broken in two at random. What is the average ratio of the smaller length to the
larger length?
Oa. 2log2-1
1
Ob ¢
Oc 2log3-2
1
Od 3

Given that f is a function of (x, y) from $\mathbb{R}^2$ to $\mathbb{R}$, which of the following could be the
Hessian matrix of f?
Oa (2 5
ya
Ob. (2 )
yoo
O c. None of these.
Od (z 4)
2y 2x

Let A and B be two events such that P(A) = P(B) = 0.8. Which of the following could be
the value of P(B|A)?
M a. 0.8
O b. 0.65
O c. 0.5

Let f be a function from $\mathbb{R}^2$ to $\mathbb{R}$ with the property $\min_x \max_y f(x, y) = \max_y \min_x f(x, y)$
at $(x_0, y_0)$. Which of the following could be the Hessian matrix of f at $(x_0, y_0)$?
Da ( »)
0 2
Ob. (-1 0
() =
Oc (3 )
0 -2

There are m unit vectors in n-dimensional space such that the angles between any two of
them are the same. What is the maximal value of m?
@® a. n
O b. n+1
O c. n+2
O d. None of these choices.

Let X, Y be independent random variables with uniform distribution on (0, 1). Find
E[(x-Yy].
Oa
ok al
Ob.
O c. None of these.
i
oa t

From which of the following could we deduce that the function f is convex?
O a. $t f(x) + (1 - t) f(y) \ge f(tx + (1 - t)y)$ for all $x, y \in \mathbb{R}$, $t \in [0, 1]$.
O b. $f(x) > x^2$ for all $x \in \mathbb{R}$.
O c. The second derivative of f is always positive.
O d. $f(x) > x^3$ for all $x \in \mathbb{R}$.

A fair coin is tossed four times. What is the probability that at least two consecutive
Heads appear?
9
Oa |
6
Ob. F
mes

Which of the following could be described by a Normal distribution?
O a. Weights of the population.
O b. The number of people queuing to buy a football match's ticket.
O c. Blood pressure.
O d. The number of phone calls in one hour.