Vin AI

Uploaded by

Công Nguyễn
Given a grayscale image X of size 4x4 with intensity levels in the range [0, 15], the image is passed through a point processing function f(X) defined as
$f(X) = a \log_2(1 + X) + b$,
where a and b are two constants. Assume that some of the pixel values of the input and output are given in the following figure:
[Figure: input and output pixel grids; contents not recoverable]
What are the values of c and d in the above figure?
a. c = 3, d = 7

Consider applying the attention mechanism to an encoder-decoder seq2seq architecture. Which of the following are applicable?
a. In the global attention mechanism, the context is derived as the weighted sum of all input hidden states of the encoder.
b. Global attention uses all input hidden states of the encoder to derive a single context c, which is then applied repeatedly to generate the decoded sequence.
c. In the local attention mechanism, the context is derived as the weighted sum of half of the input hidden states of the encoder.
d. Let $h_t$ and $h_s$ be the output hidden state at the t-th position and the input hidden state at the s-th position respectively; one way to measure the score between them to build the attention score is their dot product.

Given an RGBA image of size 200x200, compute its size, assuming no compression.
a. 120 kb
b. 80 kb
c. 160 kb
d. 200 kb

Consider an RNN language model for text modelling, with further reference to the architecture below, for the sequence of words denoted by $\{w_1, w_2, \ldots, w_n\}$. Which of the following statements are true?
[Figure: RNN language model architecture; not recoverable]
a. One can train this RNN language model by maximizing the likelihood $p(w_1) \prod_i p(w_i \mid \ldots)$.
b. The task of the RNN language model is to model the conditional word in a sentence given the sequence of previous words.
c. This architecture is an example of a many-to-one seq2seq model.
d. The language model property implies $p(w_i \mid w_{1:i-1}) = p(w_i \mid h)$.

Suppose we have a collection of 3 documents: document 1 says "lions eat fat cats"; document 2 says "cats eat fat mice"; document 3 says "mice eat fat cheese". Compute the cosine similarity of document 1 and document 2 with equal TF weighting:
a. 0.25
b. 0.5
c. 1.0
d. 0.75
e. 0.0

Given the function f(x) = [coefficient illegible] $\sum_i (x - i)^2$, we need to solve $\min_x f(x)$ using stochastic gradient descent with learning rate $\eta = 0.1$. Assume that at iteration t we have $x_t = 10$ and we sample a batch $i_1 = 1, i_2 = 2, i_3 = 3, i_4 = 4$ of indices. What is the value of $x_{t+1}$ at the next iteration?
a. $x_{t+1} = 8.5$
b. $x_{t+1} = 1000$
c. $x_{t+1} = 8$
d. $x_{t+1} = 9$

What is the main reason for the non-robustness of Transformer?
a. The masking is not robust against outliers in data
b. The positional encoding is not robust against outliers in data
c. The multi-head attention is not robust against outliers in data
d. The attention matrix in each layer is not robust against outliers in data

Which of the following actions can be taken to avoid overfitting?
a. Increasing the complexity of the model
b. Using dropout layers
c. Collecting more data
d. Increasing the learning rate
e. Decreasing the learning rate
f. Applying regulariser terms

Given a 2D convolution layer with kernel size 11, dilation 1, and stride 1. Which padding should we use for that layer so that the output tensor has the same spatial resolution as the input?
a. 0
b. 3
c. 1
d. 7

Given n labelled data samples $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \ldots, n$, we are interested in solving the linear regression problem
$\min_\beta \sum_{i=1}^{n} (\beta^\top x_i - y_i)^2 + \lambda \|\beta\|_p$.
If we want to recover a solution $\beta^*$ that is sparse, what is the recommended value for p?
Select one:
a. p = 1 or p = $\infty$
b. p = 1 or p = 2
c. None of these.
d. p = 0 or p = 1

Given the function f(x) = [coefficient illegible] $\sum_i (x - i)^2$, we need to solve $\min_x f(x)$ using stochastic gradient descent with learning rate $\eta = 0.1$. Assume that at iteration t we have $x_t = 10$ and we sample a batch $i_1 = 1, i_2 = 2, i_3 = 3, i_4 = 4$ of indices.
What is the value of $x_{t+1}$ at the next iteration?
a. $x_{t+1} = 8.5$
b. $x_{t+1} = 1000$
c. $x_{t+1} = 8$
d. $x_{t+1} = 9$

Which loss function is most sensitive to outliers?
Select one:
a. The Huber loss.
b. The 0-1 loss.
c. The square error loss.
d. The absolute error loss.

Neural networks
Select one:
a. can be used for regression as well as classification.
b. optimize a convex cost function.
c. always output values between 0 and 1.
d. always use a differentiable loss function.

Let A and B be two 2 x 2 matrices such that the product matrix AB is equal to [matrix illegible]. What is the trace of BABA?
a. 5
b. 6
c. None of these answers.
d. 4

Let X be a random variable taking only positive values with mean 1. Which of the following cannot be true?
a. P(X > 2) = 0.7
b. P(X > 2) = 0.4
c. P(X > 2) = 0.6
d. P(X > 2) = 0.5

What is the value of $\cos(\alpha)$, where $\alpha$ is the angle between the two planes x + y + 2z = 1 and x - y - 2z = 0?
[options illegible]

What is the rank of the linear map $f : \mathbb{R}^3 \to \mathbb{R}^3$ where f(x, y, z) = (x + y + z, x - y - 2z, 3x + y + z)?

What is the anti-derivative of the function $f(x) = x + 2x \ln(x)$?
a. $x + \ln(x)$
b. $x^2 \ln(x) - 2$
c. $x^2 \ln(x)$

What is the value of the sum [series illegible] for |x| > 1?
[options illegible]

What is the use of a matrix in linear algebra?
a. All of the mentioned.
b. Store data
c. Store coordinates of a linear map.

A fair coin is tossed four times. What is the probability that at least two consecutive Heads appear?
[options illegible]

Which of the following is true for reinforcement learning with human feedback (RLHF) to train large language models?
a. Human feedback for RLHF requires an iterative process for model training and human evaluation
b. Human feedback for RLHF is based on human annotation of data for the instruction classification task
c. RLHF allows multiple people to provide feedback for large language models
d. RLHF utilizes human feedback during the reinforcement learning process
e. RLHF avoids human-designed reward functions

What is a true statement for projective dependency parsing?
a. Multiple stacks are used for arc-standard parsing
b. Graph-based parsing cannot be used to produce projective dependency trees
c. Dependency arcs can be projected into a tree shape when words are put in their linear order
d. Parsing trees are obtained by adding a swap-transition operation to transition-based parsing
e. Organizing all arcs about the words in their linear order, dependency arcs are not crossed

Which of the following is NOT true for in-context learning with large language models?
a. In-context learning does not require the models to be trained with demonstrations
b. The number of demonstrations in in-context learning might be limited by the context window
c. In-context learning will mainly emerge when the models are large enough
d. In-context learning requires some demonstrations as the input

The words room and house are in a lexical semantic relation, in which room is the (1) and house is the (2).
a. (1) meronym (2) holonym
b. (1) hypernym (2) hyponym
c. (1) holonym (2) meronym
d. (1) hyponym (2) hypernym

What is the limitation of Graph Convolutional Networks (GCN)?
a. The training of GCNs is isomorphism that cannot learn to classify complex structures.
b. They cannot learn to distinguish certain simple graph structures.
c. They cannot be applied to text vision as there is no graph there.
d. They might need many layers to work well.

Which of the following statements is true when you use 1x1 convolutions in a CNN?
a. It can be used for feature pooling
b. It can help in dimensionality reduction
c. It suffers less overfitting due to small kernel size
d. All of the above
e. None of the above

In the Generative Pretrained Transformer model, how is the standard multi-head attention layer modified?
a. The dimensions of the hidden vectors are adaptively increased
b. It must be masked appropriately
c. More attention heads are introduced
d. The normalization operation is appended

With rare exception, the verb die appears without a complement. This is an example of a ___ constraint.
a. count noun
b. subcategorization
c. selection
d. animacy

Select the correct term for the following blank: ___ tagger uses probabilistic and statistical information to assign tags to words
a. Stochastic
b. Rule-based
c. Statistical
d. POS

Is the statement "Hierarchical Softmax increases the computation complexity when compared to Softmax" true?
a. It depends on the application
b. Not enough information to determine
c. False
d. True

In a Hidden Markov model for part-of-speech tagging, observation likelihoods measure
a. The likelihood of a word given a POS tag
b. The likelihood of a POS tag given a word
c. The likelihood of a POS tag given two preceding tags
d. The likelihood of a POS tag given the preceding tag

Suppose we wanted to extract reports of people visiting foreign countries, such as "Fred occasionally visited Morocco." What is the lexicalized dependency path (LDP) for this relation?
a. Fred -> visited -> Morocco
b. Fred <- visited -> Morocco
c. Fred -> occasionally <- visited -> Morocco
d. Fred <- visited <- Morocco
e. Fred <- occasionally <- visited -> Morocco

The GloVe word embeddings
a. cannot be applied to the languages that cannot be tokenized by spaces
b. predict the current word based on the context words
c. predict the context words based on the current word
d. None of the other three statements

Which of the following steps should be done to freeze a network for inference only?
a. Put the inference code in 'with torch.inference_mode():'
b. Set: 'net.eval()'
c. Set: 'net.train()'
d. Set: 'for p in net.parameters(): p.requires_grad = False'
e. Put the inference code in 'with torch.no_grad():'
f. Set: 'for p in net.parameters(): p.requires_grad = True'

Each image is represented by a color matrix whose elements are normalized to float numbers in [0, 1]. For an input image A, gamma correction transforms it to a new image $A' = A^{\gamma}$, where gamma is a hyper-parameter. Which value of gamma is suitable to get the enhancement result below?
[Figure: input image and enhanced output; not recoverable]
a. [illegible]
b. [illegible]
c. 0.5
d. 0.25

In stereo matching, what is the biggest advantage of image rectification?
a. All epipolar lines are perfectly vertical
b. Images are scaled to a desirable size.
c. Epipoles are moved to the center of the image.
d. All epipolar lines intersect at the vanishing point
e. All epipolar lines are perfectly horizontal

What are the values for a and b so that the homogeneous 3-vectors (1, a, b) ~ (-2, 4, -6), where ~ represents the equivalence relationship?
a. a = -2, b = 3
b. a = -2, b = -3
c. a = 2, b = 3
d. a = 2, b = -3

In computer vision, there are many data modalities that can be represented as a sequence, such as videos. What are the common ways to process these kinds of data?
a. Vision Transformer
b. Multilayer perceptron (MLP)
c. Convolutional neural networks
d. Recurrent neural networks

What is the main reason for the non-robustness of Transformer?
a. The attention matrix in each layer is not robust against outliers in data
b. The masking is not robust against outliers in data
c. The positional encoding is not robust against outliers in data
d. The multi-head attention is not robust against outliers in data

In machine learning, regularization discourages learning a more complex or flexible model to prevent overfitting. Which of the following statements about the regularization parameter $\lambda$ is not correct?
Select one:
a. Using too small a value of $\lambda$ can cause your hypothesis to underfit the data.
b. None of these.
c. Using too large a value of $\lambda$ can cause your hypothesis to overfit the data.
d. Using a very large value of $\lambda$ cannot hurt the performance of your hypothesis.

Assume that we would like to minimize the function f, where f is a convex and smooth function. Then the convergence rate of Nesterov's accelerated method is:
A. Sublinear convergence 1/t.
B. Linear convergence
C. Sublinear convergence $1/t^2$.
D. Superlinear convergence

After training a linear regression model by minimizing the empirical loss of a training data set, we get a function $f^*(x)$ that has zero empirical loss. Which of the following statements may be WRONG?
Select one or more:
a. $f^*(x)$ may have a good generalization
b. $f^*(x)$ generalizes the best among all linear functions
c. $f^*(x)$ may be overfitting
d. $f^*(x)$ may be underfitting

Consider the linear regression problem [objective illegible]. If $\beta^*$ is the solution of the above regression problem, which first-order optimality condition should $\beta^*$ satisfy?
Select one:
[options illegible]

Consider the optimization problem to train a feed-forward NN:
$\min_\theta J(\theta) = \Omega(\theta) + \frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$,
where $\theta = [(W^k, b^k)]_{k=1}^{L}$, $f(x_i; \theta)$ returns the prediction probabilities for $x_i$ with ground-truth label $y_i$, $\Omega(\cdot)$ is the regularization term, and CE is the cross entropy loss function. Choose all correct answers.
a. $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ is known as a regularization term.
b. $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ is known as the empirical loss.
c. Minimizing $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ makes the model fit the testing set better.
d. Minimizing $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ can lead to overfitting.
e. Minimizing $\frac{1}{N} \sum_{i=1}^{N} CE(y_i, f(x_i; \theta))$ makes the model fit the training dataset better.

For the K-means clustering problem, we have i.i.d. data $X_1, X_2, \ldots, X_n$, and we would like to partition the data into K clusters based on their similarity. The quality of the clusters mainly depends on:
A. Data generating distribution
B. The dimension of the data
C. The number of samples
D. The separation of the data

Which of the following are true about generative models?
Select one or more:
a. Support Vector Machine is a generative model.
b. They can be used for classification.
c. Linear discriminant analysis is a generative model.
d. The perceptron is a generative model.

When performing regression or classification, which of the following is the correct way to preprocess the data?
Select one:
a. PCA -> normalize PCA output -> training.
b. Normalize the data -> PCA -> training.
c. Normalize the data -> training + PCA -> evaluation of performance score.
d. Normalize the data -> PCA -> normalize PCA output -> training.

A stick is broken in two at random. What is the average ratio of the smaller length to the larger length?
a. 2 log 2 - 1
b. 1/[illegible]
c. 2 log 3 - 2
d. 1/3

Given that f is a function of (x, y) from $\mathbb{R}^2$ to $\mathbb{R}$, which of the following could be the Hessian matrix of f?
a. [matrix illegible]
b. [matrix illegible]
c. None of these.
d. [matrix illegible]

Let A and B be two events such that P(A) = P(B) = 0.8. Which of the following could be the value of P(B|A)?
a. 0.8
b. 0.65
c. 0.5

Let f be a function from $\mathbb{R}^2$ to $\mathbb{R}$ with the property $\min_x \max_y f(x, y) = \max_y \min_x f(x, y)$ at $(x_0, y_0)$. Which of the following could be the Hessian matrix of f at $(x_0, y_0)$?
a. [matrix illegible]
b. [matrix illegible]
c. [matrix illegible]

There are m unit vectors in n-dimensional space such that the angles between any two of them are the same. What is the maximal value of m?
a. n
b. n + 1
c. n + 2
d. None of these choices.

Let X, Y be independent random variables with uniform distribution on (0, 1]. Find E[(X - Y)^[exponent illegible]].
a. [illegible]
b. [illegible]
c. None of these.
d. [illegible]

Which of the following could deduce that the function f is convex?
a. $t f(x) + (1 - t) f(y) \ge f(t x + (1 - t) y)$ for all $x, y \in \mathbb{R}$, $t \in [0, 1]$.
b. $f(x) > x^2$ for all $x \in \mathbb{R}$.
c. The second derivative of f is always positive.
d. $f(x) > x^3$ for all $x \in \mathbb{R}$.

A fair coin is tossed four times. What is the probability that at least two consecutive Heads appear?
[options illegible]

Which of the following could be described by the Normal distribution?
a. Weights of the population.
b. The number of people queuing to buy a football match's ticket.
c. Blood pressure.
d. The number of phone calls in one hour.
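
The coin-toss question above (a fair coin tossed four times, at least two consecutive Heads) is small enough to check by brute force. The following is only a sketch: the function name `prob_consecutive_heads` and its parameters are hypothetical, and it assumes independent fair tosses, so every one of the 2^n sequences is equally likely.

```python
from fractions import Fraction
from itertools import product


def prob_consecutive_heads(n_tosses: int, run_length: int = 2) -> Fraction:
    """Probability that n fair tosses contain at least `run_length` Heads in a row.

    Hypothetical helper: enumerates every H/T sequence and counts those
    containing the required run, assuming each sequence is equally likely.
    """
    target = "H" * run_length
    outcomes = ["".join(seq) for seq in product("HT", repeat=n_tosses)]
    hits = sum(1 for seq in outcomes if target in seq)
    return Fraction(hits, len(outcomes))


# Four tosses, at least two consecutive Heads: 8 of the 16 sequences qualify.
print(prob_consecutive_heads(4))  # prints 1/2
```

Enumeration is used instead of a closed form because 16 outcomes are trivial to list, which makes the count easy to verify by hand.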

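
For the cosine-similarity question above (document 1 "lions eat fat cats" versus document 2 "cats eat fat mice"), a minimal sketch of the arithmetic follows. It rests on two assumptions that the question does not spell out: "equal TF weighting" is read as raw term counts with no IDF factor, and tokens are split on whitespace. The function name `cosine_similarity` is illustrative only.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two documents using raw term-frequency vectors.

    Assumption: whitespace tokenization and unweighted term counts (no IDF).
    """
    tf_a = Counter(doc_a.split())
    tf_b = Counter(doc_b.split())
    # Counter returns 0 for terms missing from the other document.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b)


doc1 = "lions eat fat cats"
doc2 = "cats eat fat mice"
# Shared terms: eat, fat, cats -> dot product 3; each vector has norm 2.
print(cosine_similarity(doc1, doc2))  # prints 0.75
```

Under these assumptions the two documents share three of their four terms, which gives 3 / (2 * 2) = 0.75.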