1. What Is Deep Learning?
Deep Learning involves taking large volumes of structured or
unstructured data and using complex algorithms to train neural
networks. It performs complex operations to extract hidden
patterns and features (for instance, distinguishing the image of
a cat from that of a dog).
2. What is a Neural Network?
Neural Networks replicate the way humans learn. They are inspired by how the neurons in our brains fire, only much simpler.
The most common Neural Networks consist of three network
layers:
1. An input layer
2. A hidden layer (this is the most important layer where
feature extraction takes place, and adjustments are made to
train faster and function better)
3. An output layer
Each layer contains neurons called “nodes,” which perform various operations. Neural Networks are used in deep learning
algorithms like CNN, RNN, GAN, etc.
3. What Is a Multi-layer Perceptron(MLP)?
Like other Neural Networks, an MLP has an input layer, one or more hidden layers, and an output layer. It has the same structure as a single-layer perceptron but with one or more hidden layers. A single-layer perceptron can classify only linearly separable classes with binary output (0, 1), whereas an MLP can classify nonlinear classes.
Except for the input layer, each node uses a nonlinear activation function: the node takes the incoming data, adds together its weighted inputs plus a bias, and passes that sum through the activation function to produce its output.
MLP uses a supervised learning method called “backpropagation.” In backpropagation, the neural network calculates the error at the output with the help of a cost function, then propagates this error backward through the network and adjusts the weights to train the model more accurately.
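As a small hedged sketch (plain NumPy, not any particular library's API; the hidden size, learning rate, and sigmoid activations are arbitrary choices), here is a one-hidden-layer MLP trained with backpropagation on XOR, a problem a single-layer perceptron cannot solve:
import numpy as np
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 4))   # input-to-hidden weights
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden-to-output weights
b2 = np.zeros((1, 1))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
lr = 0.5
for epoch in range(5000):
    h = sigmoid(X @ W1 + b1)              # forward pass: hidden activations
    out = sigmoid(h @ W2 + b2)            # forward pass: network output
    d_out = (out - y) * out * (1 - out)   # backward pass: output-layer error signal
    d_h = (d_out @ W2.T) * h * (1 - h)    # backward pass: hidden-layer error signal
    W2 -= lr * h.T @ d_out                # gradient-descent weight updates
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)
print(np.round(out, 2))                   # should approach [0, 1, 1, 0]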
4. What Is Data Normalization, and Why Do We Need It?
The process of standardizing and rescaling data is called “Data Normalization.” It’s a pre-processing step to eliminate data
redundancy. Often, data comes in, and you get the same
information in different formats. In these cases, you should
rescale values to fit into a particular range, achieving better
convergence.
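As a brief hedged sketch (assuming NumPy and made-up feature values), two common rescaling schemes look like this:
import numpy as np
x = np.array([50.0, 20.0, 30.0, 90.0, 10.0])   # hypothetical raw feature values
# Min-max normalization: rescale values into the [0, 1] range.
x_minmax = (x - x.min()) / (x.max() - x.min())
# Z-score standardization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()
print(x_minmax)
print(x_zscore)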
5. What is the Boltzmann Machine?
One of the most basic Deep Learning models is a Boltzmann
Machine, resembling a simplified version of the Multi-Layer
Perceptron. This model features a visible input layer and a
hidden layer -- just a two-layer neural net that makes stochastic
decisions as to whether a neuron should be on or off. Nodes are
connected across layers, but no two nodes of the same layer are
connected.
6. What Is the Role of Activation Functions in a Neural
Network?
At the most basic level, an activation function decides whether a neuron should be activated or not. It takes the weighted sum of the inputs plus a bias as its input. Step
function, Sigmoid, ReLU, Tanh, and Softmax are examples of
activation functions.
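As a rough NumPy sketch of the activation functions named above (the input values are made up for illustration):
import numpy as np
def step(z):       # fires (1) only when the input is above 0
    return (z > 0).astype(float)
def sigmoid(z):    # squashes the input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
def relu(z):       # passes positive values, zeroes out negatives
    return np.maximum(0.0, z)
def tanh(z):       # squashes the input into (-1, 1)
    return np.tanh(z)
def softmax(z):    # converts a vector into probabilities that sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()
weighted_sum = np.array([-1.0, 0.5, 2.0])   # example weighted inputs plus bias
print(relu(weighted_sum), softmax(weighted_sum))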
7. What Is the Cost Function?
Also referred to as “loss” or “error,” cost function is a measure
to evaluate how good your model’s performance is. It’s used to
compute the error of the output layer during backpropagation.
We push that error backward through the neural network and
use that during the different training functions.
8. What Is Gradient Descent?
Gradient Descent is an optimization algorithm used to minimize the cost function, i.e., the error. The aim is to find a local or global minimum of the function; the gradient determines the direction the model should take to reduce the error.
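As a minimal sketch (the cost function and learning rate are made up for illustration), gradient descent on the one-dimensional cost f(w) = (w - 3)^2, whose gradient is 2(w - 3) and whose minimum is at w = 3, looks like this:
w = 0.0                           # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    grad = 2 * (w - 3)            # direction of steepest increase
    w -= learning_rate * grad     # move against the gradient to reduce the error
print(w)                          # converges toward 3.0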
9. What Do You Understand by Backpropagation?
Backpropagation is a technique to improve the performance of
the network. It backpropagates the error and updates the
weights to reduce the error.
10. What Is the Difference Between a Feedforward Neural
Network and Recurrent Neural Network?
In a Feedforward Neural Network, signals travel in one direction, from input to output. There are no feedback loops; the network
considers only the current input. It cannot memorize previous
inputs (e.g., CNN).
A Recurrent Neural Network’s signals travel in both directions,
creating a looped network. It considers the current input with
the previously received inputs for generating the output of a
layer and can memorize past data due to its internal memory.
11. What Are the Applications of a Recurrent Neural Network
(RNN)?
The RNN can be used for sentiment analysis, text mining, and
image captioning. Recurrent Neural Networks can also address
time series problems such as predicting the prices of stocks in a
month or quarter.
12. What Are the Softmax and ReLU Functions?
Softmax is an activation function that generates the output
between zero and one. It divides each output, such that the total
sum of the outputs is equal to one. Softmax is often used for
output layers.
ReLU (or Rectified Linear Unit) is the most widely used
activation function. It outputs x if x is positive and zero otherwise. ReLU is often used for hidden layers.
13. What Are Hyperparameters?
With neural networks, you’re usually working with
hyperparameters once the data is formatted correctly. A
hyperparameter is a parameter whose value is set before the
learning process begins. It determines how a network is trained
and the structure of the network (such as the number of hidden
units, the learning rate, epochs, etc.).
14. What Will Happen If the Learning Rate Is Set Too Low or
Too High?
When the learning rate is too low, training of the model will
progress very slowly as we are making minimal updates to the
weights. It will take many updates before reaching the
minimum point.
If the learning rate is set too high, the drastic weight updates cause undesirable divergent behaviour in the loss function. The model may fail to converge (it overshoots the minimum and oscillates around it) or even diverge (the loss keeps growing).
15. What Is Dropout and Batch Normalization?
Dropout is a technique of dropping out hidden and visible units
of a network randomly to prevent overfitting of data (typically
dropping 20 percent of the nodes). It roughly doubles the number of iterations needed for the network to converge.
Batch normalization is the technique to improve the
performance and stability of neural networks by normalizing
the inputs in every layer so that they have mean output
activation of zero and standard deviation of one.
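As a hedged sketch (assuming TensorFlow 2.x / Keras; the layer sizes and input shape are arbitrary), dropout and batch normalization are typically inserted between layers like this:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),   # normalize layer inputs over each mini-batch
    tf.keras.layers.Dropout(0.2),           # randomly drop about 20 percent of units during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()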
16. What Is the Difference Between Batch Gradient Descent and
Stochastic Gradient Descent?
Batch Gradient Descent: computes the gradient using the entire dataset. It takes time to converge because the volume of data is huge and the weights update slowly.
Stochastic Gradient Descent: computes the gradient using a single sample. It converges much faster than batch gradient descent because it updates the weights more frequently.
17. What Is Overfitting and Underfitting, and How to Combat
Them?
Overfitting occurs when the model learns the details and noise
in the training data to the degree that it adversely impacts the
execution of the model on new information. It is more likely to
occur with nonlinear models that have more flexibility when
learning a target function. An example would be if a model is
looking at cars and trucks, but only recognizes trucks that have
a specific box shape. It might not be able to notice a flatbed
truck because there's only a particular kind of truck it saw in
training. The model performs well on training data, but not in
the real world.
Underfitting refers to a model that neither fits the training data well nor generalizes to new data. This usually happens when there is too little or low-quality data to train the model. An underfit model has both poor performance and poor accuracy.
To combat overfitting and underfitting, you can resample the data to estimate model accuracy (k-fold cross-validation) and use a separate validation dataset to evaluate the model.
18. How Are Weights Initialized in a Network?
There are two methods here: we can either initialize the weights
to zero or assign them randomly.
Initializing all weights to 0: This makes your model similar to a
linear model. All the neurons and every layer perform the same
operation, giving the same output and making the deep net
useless.
Initializing all weights randomly: Here, the weights are
assigned randomly by initializing them very close to 0. It gives
better accuracy to the model since every neuron performs
different computations. This is the most commonly used
method.
19. What Are the Different Layers on CNN?
There are four layers in CNN:
1. Convolutional Layer - the layer that performs a
convolutional operation, creating several smaller picture
windows to go over the data.
2. ReLU Layer - it brings non-linearity to the network and
converts all the negative pixels to zero. The output is a
rectified feature map.
3. Pooling Layer - pooling is a down-sampling operation that
reduces the dimensionality of the feature map.
4. Fully Connected Layer - this layer recognizes and classifies
the objects in the image.
20. What is Pooling on CNN, and How Does It Work?
Pooling is used to reduce the spatial dimensions of a CNN. It
performs down-sampling operations to reduce the
dimensionality and creates a pooled feature map by sliding a
filter matrix over the input matrix.
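As a small NumPy sketch of 2x2 max pooling with stride 2 (the feature-map values are made up for illustration):
import numpy as np
def max_pool_2x2(feature_map):
    # 2x2 max pooling with stride 2 on a 2-D feature map.
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop an odd border row/column if present
    blocks = trimmed.reshape(trimmed.shape[0] // 2, 2, trimmed.shape[1] // 2, 2)
    return blocks.max(axis=(1, 3))                  # take the max of each 2x2 window
fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 8, 9, 4],
                 [3, 1, 2, 6]])
print(max_pool_2x2(fmap))   # [[6 5] [8 9]]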
21. How Does an LSTM Network Work?
Long-Short-Term Memory (LSTM) is a special kind of
recurrent neural network capable of learning long-term
dependencies, remembering information for long periods as its
default behavior. There are three steps in an LSTM network:
Step 1: The network decides what to forget and what to
remember.
Step 2: It selectively updates cell state values.
Step 3: The network decides what part of the current state
makes it to the output.
22. What Are Vanishing and Exploding Gradients?
While training an RNN, your slope can become either too small
or too large; this makes the training difficult. When the slope is
too small, the problem is known as a “Vanishing Gradient.”
When the slope tends to grow exponentially instead of
decaying, it’s referred to as an “Exploding Gradient.” Gradient
problems lead to long training times, poor performance, and
low accuracy.
23. What Is the Difference Between Epoch, Batch, and Iteration
in Deep Learning?
Epoch - Represents one iteration over the entire dataset
(everything put into the training model).
Batch - Refers to when we cannot pass the entire dataset
into the neural network at once, so we divide the dataset into
several batches.
Iteration - the number of batches needed to complete one epoch. If we have 10,000 images as data and a batch size of 200, then an epoch runs for 50 iterations (10,000 divided by 200).
24. Why Is Tensorflow the Most Preferred Library in Deep
Learning?
TensorFlow provides both C++ and Python APIs, making it easier to work with, and it has a faster compilation time than other Deep Learning libraries like Keras and Torch.
Tensorflow supports both CPU and GPU computing devices.
25. What Do You Mean by Tensor in Tensorflow?
A tensor is a mathematical object represented as arrays of
higher dimensions. These arrays of data with different
dimensions and ranks fed as input to the neural network are
called “Tensors.”
26. What Are the Programming Elements in Tensorflow?
Constants - Constants are parameters whose value does not change. To define a constant, we use the tf.constant() command. For example:
a = tf.constant(2.0, tf.float32)
b = tf.constant(3.0)
print(a, b)
Variables - Variables allow us to add new trainable parameters to the graph. To define a variable, we use the tf.Variable()
command and initialize them before running the graph in a
session. An example:
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
Placeholders - these allow us to feed data to a TensorFlow model from outside the model. They permit a value to be assigned later. To define a placeholder, we use the tf.placeholder() command. An example:
a = tf.placeholder(tf.float32)
b = a * 2
with tf.Session() as sess:
    result = sess.run(b, feed_dict={a: 3.0})
    print(result)
Sessions - a session is run to evaluate the nodes. This is called
the “Tensorflow runtime.” For example:
a = tf.constant(2.0)
b = tf.constant(4.0)
c = a+b
# Launch the session
sess = tf.Session()
# Evaluate the tensor c
print(sess.run(c))
27. Explain a Computational Graph.
Everything in TensorFlow is based on creating a computational graph: a network of nodes in which each node performs an operation. Nodes represent mathematical operations, and edges represent tensors. Since data flows through the graph, it is also called a “DataFlow Graph.”
28. Explain Generative Adversarial Network.
Suppose there is a wine shop purchasing wine from dealers,
which they resell later. But some dealers sell fake wine. In this
case, the shop owner should be able to distinguish between fake
and authentic wine.
The forger will try different techniques to sell fake wine and
make sure specific techniques go past the shop owner’s check.
The shop owner would probably get some feedback from wine
experts that some of the wine is not original. The owner would
have to improve how he determines whether a wine is fake or
authentic.
The forger’s goal is to create wines that are indistinguishable
from the authentic ones while the shop owner intends to tell if
the wine is real or not accurately.
Let us map this example onto the components of a GAN.
There is a noise vector coming into the forger who is generating
fake wine.
Here the forger acts as a Generator.
The shop owner acts as a Discriminator.
The Discriminator gets two inputs; one is the fake wine, while
the other is the real authentic wine. The shop owner has to
figure out whether it is real or fake.
So, there are two primary components of Generative
Adversarial Network (GAN) named:
1. Generator
2. Discriminator
The generator is a CNN that keeps producing images that look closer and closer to the real images, while the discriminator tries to determine the difference between real and fake images. The two are trained together: the generator learns to fool the discriminator, while the discriminator learns to identify real and fake images.
29. What Is an Auto-encoder?
This Neural Network has three layers in which the number of input neurons is equal to the number of output neurons. The network’s target output is the same as its input. It uses dimensionality reduction to reconstruct the input: it compresses the input into a latent-space representation and then reconstructs the output from this representation.
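A minimal hedged sketch of such a network (assuming Keras; the 784-pixel input and 32-dimensional bottleneck are arbitrary choices for the example):
import tensorflow as tf
# The encoder compresses 784-pixel inputs into a 32-dimensional latent code;
# the decoder reconstructs the original 784 values from that code.
inputs = tf.keras.Input(shape=(784,))
latent = tf.keras.layers.Dense(32, activation="relu")(inputs)        # bottleneck (latent space)
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(latent)   # reconstruction
autoencoder = tf.keras.Model(inputs, outputs)
# The target output is the input itself, so the loss compares the output to the input.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()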
30. What Is Bagging and Boosting?
Bagging and Boosting are ensemble techniques that train multiple models using the same learning algorithm and then combine their predictions.
With Bagging, we take a dataset and split it into training data and test data. We then randomly sample data points (with replacement) into bags and train a separate model on each bag.
With Boosting, the emphasis is on the data points that were predicted incorrectly: they are given more weight so that subsequent models improve the accuracy.
31. Why is it necessary to introduce non-linearities in a neural
network?
Solution: otherwise, we would have a composition of linear
functions, which is also a linear function, giving a linear model.
A linear model has a much smaller number of parameters, and
is therefore limited in the complexity it can model.
32. Describe two ways of dealing with the vanishing gradient
problem in a neural network.
Solution:
Using ReLU activation instead of sigmoid.
Using Xavier initialization.
33. What are some advantages in using a CNN (convolutional
neural network) rather than a DNN (dense neural network) in
an image classification task?
Solution: while both models can capture the relationship
between close pixels, CNNs have the following properties:
It is translation invariant — the exact location of the pixel
is irrelevant for the filter.
It is less likely to overfit — the typical number of
parameters in a CNN is much smaller than that of a DNN.
Gives us a better understanding of the model — we can
look at the filters’ weights and visualize what the network
“learned”.
Hierarchical nature — learns patterns by describing complex patterns using simpler ones.
34. Describe two ways to visualize features of a CNN in an
image classification task.
Solution:
Input occlusion — cover a part of the input image and see which part affects the classification the most. For instance, given a trained image classification model, feed in several copies of the same image, each with a different region occluded. If one occluded copy is classified as a dog with 98% probability while another is classified with only 65% probability, it means the region occluded in the second copy contains features that are important for the classification.
Activation Maximization — the idea is to create an artificial input image that maximizes the target response (gradient ascent).
35. Is trying the following learning rates: 0.1,0.2,…,0.5 a good
strategy to optimize the learning rate?
Solution: No, it is recommended to try a logarithmic scale to
optimize the learning rate.
36. Suppose you have a NN with 3 layers and ReLU activations.
What will happen if we initialize all the weights with the same
value? What if we only had one layer (i.e., linear/logistic regression)?
Solution: If we initialize all the weights to be the same, we would not be able to break the symmetry; i.e., all the gradients will be updated identically and the network will not be able to learn. In the one-layer scenario, however, the cost function is convex (linear/sigmoid), and thus the weights will always converge to the optimal point regardless of the initial value (convergence may just be slower).
37. Explain the idea behind the Adam optimizer.
Solution: Adam, or adaptive moment estimation, combines two ideas to improve convergence: per-parameter adaptive updates, which give faster convergence, and momentum, which helps to avoid getting stuck in saddle points.
38. Compare batch, mini-batch and stochastic gradient descent.
Solution: batch gradient descent estimates the gradient using the entire dataset, mini-batch gradient descent uses a small sample of datapoints, and SGD updates the weights using one datapoint at a time. The tradeoff here is between how precise the calculation of the gradient is versus what size of batch we can keep in memory. Moreover, taking mini-batches rather than the entire batch has a regularizing effect, since it adds random noise at each step.
39. What is data augmentation? Give examples.
Solution: Data augmentation is a technique to increase the
input data by performing manipulations on the original data.
For instance, in images, one can rotate the image, reflect (flip) the image, or add Gaussian blur.
40. What is the idea behind GANs?
Solution: GANs, or generative adversarial networks, consist of
two networks (D,G) where D is the “discriminator” network and
G is the “generative” network. The goal is to create data — images, for instance, that are indistinguishable from real images. Suppose we want to generate images of cats. The network G generates images. The network D classifies images according to whether they are a cat or not. The cost function of G is constructed such that G tries to “fool” D into always classifying its output as a cat.
41. What are the advantages of using Batchnorm?
Solution: Batchnorm accelerates the training process. It also
(as a byproduct of including some noise) has a regularizing
effect.
42. What is multi-task learning? When should it be used?
Solution: Multi-task learning is useful when we have a small amount of data for some task and would benefit from training a model on a large dataset of another task. Parameters of the
models are shared — either in a “hard” way (i.e the same
parameters) or a “soft” way (i.e regularization/penalty to the
cost function).
43. What is end-to-end learning? Give a few of its advantages.
Solution: End-to-end learning is usually a model which gets
the raw data and outputs directly the desired outcome, with no
intermediate tasks or feature engineering. It has several
advantages, among which: there is no need to handcraft
features, and it generally leads to lower bias.
44. What happens if we use a ReLU activation and then a
sigmoid as the final layer?
Solution: Since ReLU always outputs a non-negative result, the sigmoid receives only non-negative inputs and therefore always outputs at least 0.5, so the network will constantly predict one class for all the inputs!
45. How to solve the exploding gradient problem?
Solution: A simple solution to the exploding gradient problem
is gradient clipping — taking the gradient to be ±M when its
absolute value is bigger than M, where M is some large number.
46. Is it necessary to shuffle the training data when using batch
gradient descent?
Solution: No, because the gradient is calculated at each epoch
using the entire training data, so shuffling does not make a
difference.
47. When using mini batch gradient descent, why is it
important to shuffle the data?
Solution: otherwise, suppose we train a NN classifier and have two classes — A and B. Suppose all the examples of class A come before those of class B. Without shuffling, each mini-batch would contain samples from only one class, so the gradient estimates would be biased toward that class and training would be unstable.
48. Describe some hyperparameters for transfer learning.
Solution: How many layers to keep, how many layers to add,
how many to freeze.
49. Is dropout used on the test set?
Solution: No! It is used only during training. Dropout is a regularization technique that is applied only in the training process.
50. Explain why dropout in a neural network acts as a regularizer.
Solution: There are several (related) explanations to why
dropout works. It can be seen as a form of model averaging — at
each step we “turn off” a part of the model and average the
models we get. It also adds noise, which naturally has a
regularizing effect. It also leads to more sparsity of the weights
and essentially prevents co-adaptation of neurons in the
network.
51. Give examples in which a many-to-one RNN architecture is
appropriate.
Solution: A few examples are sentiment analysis and gender recognition from speech.
52. When can’t we use BiLSTM? Explain what assumption has
to be made.
Solution: in any bi-directional model, we assume that we have access to the next elements of the sequence at a given “time” step. This is the case for text data (e.g., sentiment analysis, translation), but not for time-series data, where future values are not available at prediction time.
53. True/false: adding L2 regularization to a RNN can help with
the vanishing gradient problem.
Solution: false! Adding L2 regularization will shrink the
weights towards zero, which can actually make the vanishing
gradients worse in some cases.
54. Suppose the training error/cost is high and that the
validation cost/error is almost equal to it. What does it mean?
What should be done?
Solution: this indicates underfitting. One can add more
parameters, increase the complexity of the model, or lower the
regularization.
55. Describe how L2 regularization can be explained as a sort of
a weight decay.
Solution: Suppose our cost function is C(w), and that we add a penalization term c·|w|². When using gradient descent, the iterations will look like
w ← w − grad(C)(w) − 2cw = (1 − 2c)·w − grad(C)(w)
In this equation, the weight is multiplied by a factor smaller than 1, so it decays toward zero at every step.
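Written with the learning rate made explicit (a sketch of the same derivation, with the step size denoted eta):
w \leftarrow w - \eta \nabla\!\left( C(w) + c\|w\|^2 \right)
  = w - \eta \nabla C(w) - 2\eta c\, w
  = (1 - 2\eta c)\, w - \eta \nabla C(w)
As long as 0 < 2·eta·c < 1, each update multiplies w by a factor smaller than 1, which is exactly the weight decay.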
1. Present the meaning of Batch Normalization
This can be considered a very good question because it covers much of the knowledge that candidates need when working with a neural network model. You can answer it in different ways, but you need to clarify the following main ideas:
Batch Normalization is an effective method when training a neural network model. Its goal is to normalize the features (the output of each layer after going through the activation) to a zero-mean state with a standard deviation of 1. The opposite phenomenon is a non-zero mean. How does it affect model training?
Firstly, a non-zero mean means the data is not distributed around the value 0; most values are either greater than zero or less than zero. Combined with high variance, the data becomes very large or very small. This problem is common when training neural networks with many layers. When features are not kept within a stable range, the optimization process of the network suffers.
As we all know, optimizing a neural network requires computing derivatives. For a simple layer of the form y = Wx + b, the derivative of y with respect to W is dy/dW = x. Thus the value of x directly affects the value of the derivative (of course, gradients in neural network models are not this simple, but in principle x affects the derivative). Therefore, if x changes in an unstable way, the derivative may become too big or too small, resulting in an unstable learning process. This also means we can use higher learning rates during training when using Batch Normalization.
Batch normalization helps us avoid the value of x falling into the saturated regions of non-linear activation functions, ensuring that no activation becomes excessively high or excessively low. Weights that, without batch normalization, would probably never learn can now be learned normally. This also reduces the dependence on the initial values of the parameters.
Batch Normalization also acts as a form of regularization that helps to minimize overfitting. With batch normalization, we don’t need as much dropout, which makes sense because we no longer need to worry about losing too much information when dropping units from the network. However, it is still advisable to combine both techniques.
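As a rough NumPy sketch of the core computation for a single mini-batch (the learnable scale and shift, gamma and beta, are included; the toy batch is made up):
import numpy as np
def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features); gamma, beta: (num_features,)
    mean = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta               # learnable scale and shift
batch = np.random.randn(8, 3) * 10 + 5        # non-zero-mean, high-variance toy batch
out = batch_norm(batch, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))      # roughly 0 and 1 per feature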
2. Present the concepts of bias and variance and their trade-off relationship.
What is bias? Put simply, bias is the difference between the average prediction of the current model and the actual values that we need to predict. A model with high bias pays too little attention to the training data. This makes the model too simple, so it does not achieve good accuracy on either the training or the test data. This phenomenon is also known as underfitting.
Variance can simply be understood as the spread (or clustering) of the model’s outputs for a given data point. The larger the variance, the more likely it is that the model is paying too much attention to the training data and fails to generalize to data it has never encountered. As a result, the model achieves extremely good results on the training set but very poor results on the test set. This is the phenomenon of overfitting.
The correlation between these two concepts is often visualized with a bull's-eye diagram. The centre of the circle represents a model that perfectly predicts the exact values; in practice, you will never find such a good model. As we get farther away from the centre of the circle, our predictions get worse and worse.
We can adjust the model so that as many predictions as possible fall near the centre of the circle. A balance between the bias and variance values is needed. If our model is too simple and has very few parameters, then it may have high bias and low variance.
Besides, if our model has a large number of parameters then it
will have high variance and low bias. This is the basis for us to
calculate the complexity of the model when designing the
algorithm.
3. Suppose a Deep Learning model has produced 10 million face vectors. How do you find a new face as fast as possible at query time?
This question is about applying Deep Learning algorithms in practice; its key point is the method of indexing data. Indexing is the final step in applying One Shot Learning to face recognition, but it is the most important step for making the application easy to deploy in practice.
Basically, with this question, you should first present an overview of the One Shot Learning approach to face recognition. It can be understood simply as turning each face into a vector; recognizing a new face then means finding the stored vectors that are closest (most similar) to the vector of the input face. Usually, people use a deep learning model trained with a custom loss function called triplet loss to do that.
However, at the scale mentioned above, calculating the distance to all 10 million vectors for every identification is not a smart solution and makes the system much slower. We need methods for indexing the data in the vector space so that queries become more efficient.
The main idea of these methods is to organize the data into structures that are easy to query (often similar to a tree structure). When a new vector arrives, querying the tree quickly finds the stored vector with the closest distance.
There are several methods that can be used for this purpose
such as Locality Sensitive Hashing — LSH, Approximate
Nearest Neighbors Oh Yeah — Annoy Indexing, Faiss…
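As a hedged sketch of such indexing (assuming the Annoy library and random vectors standing in for real face embeddings; the 128-dimensional size and tree count are arbitrary choices):
import numpy as np
from annoy import AnnoyIndex   # pip install annoy
dim = 128                                    # assumed embedding size
rng = np.random.default_rng(0)
# Build the index once, offline, over the stored face vectors.
index = AnnoyIndex(dim, "euclidean")
for i in range(100_000):                     # stand-in for the 10 million stored vectors
    index.add_item(i, rng.normal(size=dim).tolist())
index.build(10)                              # more trees give better recall but a slower build
# At query time, approximate nearest neighbours are found quickly.
query = rng.normal(size=dim).tolist()
print(index.get_nns_by_vector(query, 5))     # ids of the 5 closest stored vectors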
4. For a classification problem, is the accuracy metric completely reliable? Which metrics do you usually use to evaluate your model?
For a classification problem, there are many different ways to evaluate a model. Accuracy simply takes the number of correctly predicted data points divided by the total number of data points. This sounds reasonable, but in reality, for imbalanced data problems, this quantity is not meaningful enough. Suppose we are building a prediction model for network attacks, where attack requests account for about 1 in 100,000 requests.
If the model predicts that all requests are normal, its accuracy is still about 99.999%, which shows how unreliable this figure can be for a classification model. Accuracy tells us what percentage of the data is correctly predicted, but it does not indicate how each class is classified in detail. Instead, we can use the confusion matrix. Basically, the confusion matrix shows how many data points actually belong to each class and into which class they were predicted to fall.
To express how the True Positive and False Positive rates change with each classification threshold, we also have a graph called the Receiver Operating Characteristic — ROC. Based on the ROC curve we can tell whether the model is effective or not: the closer the curve gets to the top-left corner (i.e., a high True Positive rate and a low False Positive rate), the better.
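As a hedged sketch (assuming scikit-learn and made-up labels, where 1 marks an attack and 0 a normal request):
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                        # hypothetical ground truth
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]                        # hypothetical predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]   # predicted probabilities
print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))     # area under the ROC curve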
What is computer vision?
It is a subset of AI. Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from images or videos.
Which languages are supported by computer vision libraries?
C++, Python, and MATLAB.
What are some computer vision libraries?
OpenCV, with bindings for Python and Java.
How many algorithms are in OpenCV?
More than 2,500 optimized algorithms.
What is CUDA?
What is OpenGL?
What machine learning algorithms are available in OpenCV?
Normal Bayes Classifier
K-Nearest Neighbors
Support Vector Machines
Decision Trees
Boosting
Gradient Boosted Trees
Random Trees
Extremely randomized trees
What is image stitching? How can you do it with OpenCV?
What is computational photography? How can you do it with OpenCV?
How can you connect your webcam to OpenCV?
How can you do object detection in OpenCV?
What are some face recognition algorithms?
Haar Cascades, Eigenfaces, Fisherfaces
How does the Haar Cascades algorithm work?
How can you do face detection in OpenCV?
How can you detect eyes in OpenCV?
What is a Cascade Classifier in OpenCV?
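As a hedged sketch of Haar-cascade face detection with a Cascade Classifier (assuming the opencv-python package, which bundles the pre-trained cascades, and a hypothetical image file photo.jpg):
import cv2
# Load OpenCV's pre-trained frontal-face Haar cascade.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)
img = cv2.imread("photo.jpg")                    # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # cascades operate on grayscale images
# Multi-scale sliding-window detection; returns (x, y, w, h) boxes.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)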
How can you detect corners in images using OpenCV?
How many types of image filters are there in OpenCV?
Averaging
Gaussian Filtering
Median Filtering
Bilateral Filtering
How can you do feature detection in OpenCV?
How many types of video filters are there in OpenCV?
Color Conversion
Thresholding
Smoothing
Morphology
Gradients
Canny Edge Detection
Contours
Histograms
How can you do image processing in OpenCV?
How can you do image compression in OpenCV?
How can you resize an image in OpenCV?
How can you convert a black-and-white image to a color image using computer vision?
What is video analysis in OpenCV?
How can you detect objects in video?
How would you create a 3D model of an object?
How can you remove red eye from photos using OpenCV?
How can you use a GPU with OpenCV?
How can you integrate OpenCV with Android?
How can you integrate OpenCV with iOS?
Machine Learning Interview Questions
A collection of technical interview questions for machine learning and computer
vision engineering positions.
1) What's the trade-off between bias and variance? [src]
If our model is too simple and has very few parameters then it may have high bias
and low variance. On the other hand, if our model has a large number of parameters, then it’s going to have high variance and low bias. So we need to find the right/good
balance without overfitting and underfitting the data. [src]
2) What is gradient descent? [src]
[Answer]
3) Explain over- and under-fitting and how to combat them? [src]
[Answer]
4) How do you combat the curse of dimensionality? [src]
Manual Feature Selection
Principal Component Analysis (PCA)
Multidimensional Scaling
Locally linear embedding
[src]
5) What is regularization, why do we use it, and give some examples of common methods?
[src]
A technique that discourages learning a more complex or flexible model, so as to
avoid the risk of overfitting. Examples
Ridge (L2 norm)
Lasso (L1 norm)
The obvious disadvantage of ridge regression, is model interpretability. It will shrink
the coefficients for least important predictors, very close to zero. But it will never
make them exactly zero. In other words, the final model will include all predictors.
However, in the case of the lasso, the L1 penalty has the effect of forcing some of the
coefficient estimates to be exactly equal to zero when the tuning parameter λ is
sufficiently large. Therefore, the lasso method also performs variable selection and is
said to yield sparse models. [src]
6) Explain Principal Component Analysis (PCA)? [src]
[Answer]
7) Why is ReLU better and more often used than Sigmoid in Neural Networks? [src]
Imagine a network with randomly initialized (or normalised) weights, where almost 50% of the network yields 0 activation because of the characteristic of ReLU (output 0 for negative values of x). This means fewer neurons are firing (sparse activation) and the network is lighter. [src]
8) Given stride S and kernel sizes for each layer of a (1-dimensional) CNN, create a function to
compute the receptive field of a particular node in the network. This is just finding how many
input nodes actually connect through to a neuron in a CNN. [src]
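One possible sketch (assuming kernel sizes and strides are given per layer, with no dilation or padding subtleties):
def receptive_field(kernel_sizes, strides):
    # Receptive field (in input nodes) of one neuron after the last layer of a 1-D CNN.
    # Recurrence: rf += (k - 1) * jump, where jump is the product of all previous strides.
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf
# Example: three layers with kernel size 3 and strides 1, 2, 2.
print(receptive_field([3, 3, 3], [1, 2, 2]))   # -> 9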
9) Implement connected components on an image/matrix. [src]
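One possible sketch, labeling 4-connected components of a binary matrix with breadth-first search:
from collections import deque
def connected_components(grid):
    # Label 4-connected components of non-zero cells; zero cells stay unlabeled.
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not labels[r][c]:
                current += 1                      # start a new component
                queue = deque([(r, c)])
                labels[r][c] = current
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels
image = [[1, 1, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 0, 1]]
print(connected_components(image))   # two components, labeled 1 and 2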
10) Implement a sparse matrix class in C++. [src]
11) Create a function to compute an integral image, and create another function to get area
sums from the integral image.[src]
12) How would you remove outliers when trying to estimate a flat plane from noisy samples?
[src]
13) How does CBIR work? [src]
14) How does image registration work? Sparse vs. dense optical flow and so on. [src]
15) Describe how convolution works. What about if your inputs are grayscale vs RGB imagery?
What determines the shape of the next layer? [src]
16) Talk me through how you would create a 3D model of an object from imagery and depth
sensor measurements taken at all angles around the object. [src]
17) Implement SQRT(const double & x) without using any special functions, just fundamental
arithmetic. [src]
18) Reverse a bitstring. [src]
19) Implement non maximal suppression as efficiently as you can. [src]
20) Reverse a linked list in place. [src]
21) What is data normalization and why do we need it? [src]
Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range and assure better convergence during backpropagation. In general,
it boils down to subtracting the mean of each data point and dividing by its standard
deviation. If we don't do this then some of the features (those with high magnitude)
will be weighted more in the cost function (if a higher-magnitude feature changes by
1%, then that change is pretty big, but for smaller features it's quite insignificant).
The data normalization makes all features weighted equally.
22) Why do we use convolutions for images rather than just FC layers? [src]
Firstly, convolutions preserve, encode, and actually use the spatial information from
the image. If we used only FC layers we would have no relative spatial information.
Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.
23) What makes CNNs translation invariant? [src]
As explained above, each convolution kernel acts as its own filter/feature detector.
So let's say you're doing object detection, it doesn't matter where in the image the
object is since we're going to apply the convolution in a sliding window fashion
across the entire image anyways.
24) Why do we have max-pooling in classification CNNs? [src]
Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don't lose too much semantic information since you're taking the maximum activation. There's also a theory that max-pooling contributes a bit to giving CNNs more translation invariance. Check out this great video from Andrew Ng on the benefits of max-pooling.
25) Why do segmentation CNNs typically have an encoder-decoder style / structure? [src]
The encoder CNN can basically be thought of as a feature extraction network, while
the decoder uses that information to predict the image segments by "decoding" the
features and upscaling to the original image size.
26) What is the significance of Residual Networks? [src]
The main thing that residual connections did was allow for direct feature access from
previous layers. This makes information propagation throughout the network much
easier. One very interesting paper about this shows how using local skip connections
gives the network a type of ensemble multi-path structure, giving features multiple
paths to propagate throughout the network.
27) What is batch normalization and why does it work? [src]
Training Deep Neural Networks is complicated by the fact that the distribution of
each layer's inputs changes during training, as the parameters of the previous layers
change. The idea is then to normalize the inputs of each layer in such a way that they
have a mean output activation of zero and standard deviation of one. This is done for
each individual mini-batch at each layer i.e compute the mean and variance of that
mini-batch alone, then normalize. This is analogous to how the inputs to networks
are standardized. How does this help? We know that normalizing the inputs to a
network helps it learn. But a network is just a series of layers, where the output of
one layer becomes the input to the next. That means we can think of any layer in a
neural network as the first layer of a smaller subsequent network. Thought of as a
series of neural networks feeding into each other, we normalize the output of one
layer before applying the activation function, and then feed it into the following layer
(sub-network).
28) Why would you use many small convolutional kernels such as 3x3 rather than a few large
ones? [src]
This is very well explained in the VGGNet paper. There are 2 reasons: First, you can
use several smaller kernels rather than few large ones to get the same receptive field
and capture more spatial context, but with the smaller kernels you are using less
parameters and computations. Secondly, because with smaller kernels you will be
using more filters, you'll be able to use more activation functions and thus have a
more discriminative mapping function being learned by your CNN.
29) Why do we need a validation set and test set? What is the difference between them? [src]
When training a model, we divide the available data into three separate sets:
The training dataset is used for fitting the model’s parameters. However, the accuracy
that we achieve on the training set is not reliable for predicting if the model will be
accurate on new samples.
The validation dataset is used to measure how well the model does on examples that
weren’t part of the training dataset. The metrics computed on the validation data can
be used to tune the hyperparameters of the model. However, every time we evaluate
the validation data and we make decisions based on those scores, we are leaking
information from the validation data into our model. The more evaluations, the more
information is leaked. So we can end up overfitting to the validation data, and once
again the validation score won’t be reliable for predicting the behaviour of the model
in the real world.
The test dataset is used to measure how well the model does on previously unseen
examples. It should only be used once we have tuned the parameters using the
validation set.
So if we omit the test set and only use a validation set, the validation score won’t be
a good estimate of the generalization of the model.
30) What is stratified cross-validation and when should we use it? [src]
Cross-validation is a technique for dividing data between training and validation sets.
On typical cross-validation this split is done randomly. But in stratified cross-
validation, the split preserves the ratio of the categories on both the training and
validation datasets.
For example, if we have a dataset with 10% of category A and 90% of category B, and
we use stratified cross-validation, we will have the same proportions in training and
validation. In contrast, if we use simple cross-validation, in the worst case we may
find that there are no samples of category A in the validation set.
Stratified cross-validation may be applied in the following scenarios:
On a dataset with multiple categories. The smaller the dataset and the more
imbalanced the categories, the more important it will be to use stratified cross-
validation.
On a dataset with data of different distributions. For example, in a dataset for
autonomous driving, we may have images taken during the day and at night. If we do
not ensure that both types are present in training and validation, we will have
generalization problems.
31) Why do ensembles typically have higher scores than individual models? [src]
An ensemble is the combination of multiple models to create a single prediction. The
key idea for making better predictions is that the models should make different
errors. That way the errors of one model will be compensated by the right guesses of
the other models and thus the score of the ensemble will be higher.
We need diverse models for creating an ensemble. Diversity can be achieved by:
Using different ML algorithms. For example, you can combine logistic regression, k-
nearest neighbors, and decision trees.
Using different subsets of the data for training. This is called bagging.
Giving a different weight to each of the samples of the training set. If this is done
iteratively, weighting the samples according to the errors of the ensemble, it’s called
boosting. Many winning solutions to data science competitions are ensembles.
However, in real-life machine learning projects, engineers need to find a balance
between execution time and accuracy.
32) What is an imbalanced dataset? Can you list some ways to deal with it? [src]
An imbalanced dataset is one that has different proportions of target categories. For
example, a dataset with medical images where we have to detect some illness will
typically have many more negative samples than positive samples—say, 98% of
images are without the illness and 2% of images are with the illness.
There are different options to deal with imbalanced datasets:
Oversampling or undersampling. Instead of sampling with a uniform distribution
from the training dataset, we can use other distributions so the model sees a more
balanced dataset.
Data augmentation. We can add data in the less frequent categories by modifying
existing data in a controlled way. In the example dataset, we could flip the images
with illnesses, or add noise to copies of the images in such a way that the illness
remains visible.
Using appropriate metrics. In the example dataset, if we had a model that always made negative predictions, it would achieve an accuracy of 98%. There are other metrics such as precision, recall, and F-score that describe the performance of the model better when using an imbalanced dataset.
33) Can you explain the differences between supervised, unsupervised, and reinforcement
learning? [src]
In supervised learning, we train a model to learn the relationship between input data
and output data. We need to have labeled data to be able to do supervised learning.
With unsupervised learning, we only have unlabeled data. The model learns a
representation of the data. Unsupervised learning is frequently used to initialize the
parameters of the model when we have a lot of unlabeled data and a small fraction
of labeled data. We first train an unsupervised model and, after that, we use the
weights of the model to train a supervised model.
In reinforcement learning, the model has some input data and a reward depending
on the output of the model. The model learns a policy that maximizes the reward.
Reinforcement learning has been applied successfully to strategic games such as Go
and even classic Atari video games.
34) What is data augmentation? Can you give some examples? [src]
Data augmentation is a technique for synthesizing new data by modifying existing
data in such a way that the target is not changed, or it is changed in a known way.
Computer vision is one of fields where data augmentation is very useful. There are
many modifications that we can do to images:
Resize
Horizontal or vertical flip
Rotate
Add noise
Deform
Modify colors
Each problem needs a customized data augmentation pipeline. For
example, on OCR, doing flips will change the text and won’t be beneficial; however,
resizes and small rotations may help.
35) What is Turing test? [src]
The Turing test is a method to test a machine's ability to match human-level intelligence. A machine challenges human intelligence, and if it passes the test, it is considered intelligent. Yet a machine could be viewed as intelligent without sufficiently knowing about people to mimic a human.
36) What is Precision?
Precision (also called positive predictive value) is the fraction of relevant instances
among the retrieved instances
Precision = true positive / (true positive + false positive)
[src]
37) What is Recall?
Recall (also known as sensitivity) is the fraction of relevant instances that have been
retrieved over the total amount of relevant instances. Recall = true positive / (true
positive + false negative)
[src]
38) Define F1-score. [src]
It is the weighted average of precision and recall. It takes both false positives and false negatives into account. It is used to measure the model’s performance.
F1-Score = 2 * (precision * recall) / (precision + recall)
39) What is cost function? [src]
A cost function is a scalar function that quantifies the error of the Neural Network: the lower the cost, the better the network. E.g., for MNIST image classification, the input image is the digit 2 and the Neural Network wrongly predicts it to be 3.
40) List different activation neurons or functions. [src]
Linear Neuron
Binary Threshold Neuron
Stochastic Binary Neuron
Sigmoid Neuron
Tanh function
Rectified Linear Unit (ReLU)
41) Define Learning rate.
Learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. [src]
42) What is Momentum (w.r.t NN optimization)?
Momentum lets the optimization algorithm remember its last step and adds some proportion of it to the current step. This way, even if the algorithm is stuck in a flat
region, or a small local minimum, it can get out and continue towards the true
minimum. [src]
43) What is the difference between Batch Gradient Descent and Stochastic Gradient Descent?
Batch gradient descent computes the gradient using the whole dataset. This is great
for convex, or relatively smooth error manifolds. In this case, we move somewhat
directly towards an optimum solution, either local or global. Additionally, batch
gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
Stochastic gradient descent (SGD) computes the gradient using a single sample. SGD
works well (Not well, I suppose, but better than batch gradient descent) for error
manifolds that have lots of local maxima/minima. In this case, the somewhat noisier
gradient calculated using the reduced number of samples tends to jerk the model
out of local minima into a region that hopefully is more optimal. [src]
44) Epoch vs Batch vs Iteration.
Epoch: one forward pass and one backward pass of all the training examples
Batch: examples processed together in one pass (forward and backward)
Iteration: one update step on a single batch; the number of iterations per epoch = number of training examples / batch size
45) What is vanishing gradient? [src]
As we add more and more hidden layers, back propagation becomes less and less
useful in passing information to the lower layers. In effect, as information is passed
back, the gradients begin to vanish and become small relative to the weights of the
networks.
46) What are dropouts? [src]
Dropout is a regularization technique in which randomly selected neurons are ignored (“dropped out”) during training, which helps prevent the network from overfitting.
47) Define LSTM. [src]
Long Short Term Memory networks are explicitly designed to address the long-term dependency problem by maintaining a state of what to remember and what to forget.
48) List the key components of LSTM. [src]
Gates (forget, Memory, update & Read)
tanh(x) (values between -1 to 1)
Sigmoid(x) (values between 0 to 1)
49) List the variants of RNN. [src]
LSTM: Long Short Term Memory
GRU: Gated Recurrent Unit
End to End Network
Memory Network
50) What is Autoencoder, name few applications. [src]
An autoencoder is basically used to learn a compressed representation of the given data. A few applications include:
Data denoising
Dimensionality reduction
Image reconstruction
Image colorization
51) What are the components of GAN? [src]
Generator
Discriminator
52) What's the difference between boosting and bagging?
Boosting and bagging are similar, in that they are both ensembling techniques,
where a number of weak learners (classifiers/regressors that are barely better than
guessing) combine (through averaging or max vote) to create a strong learner that
can make accurate predictions. Bagging means that you take bootstrap samples
(with replacement) of your data set and each sample trains a (potentially) weak
learner. Boosting, on the other hand, uses all data to train each learner, but instances
that were misclassified by the previous learners are given more weight so that
subsequent learners give more focus to them during training. [src]
53) Explain how a ROC curve works. [src]
The ROC curve is a graphical representation of the contrast between the true positive rate and the false positive rate at various thresholds. It’s often used as a proxy for
the trade-off between the sensitivity of the model (true positives) vs the fall-out or
the probability it will trigger a false alarm (false positives).
54) What’s the difference between Type I and Type II error? [src]
Type I error is a false positive, while Type II error is a false negative. Briefly stated,
Type I error means claiming something has happened when it hasn’t, while Type II
error means that you claim nothing is happening when in fact something is. A clever
way to think about this is to think of Type I error as telling a man he is pregnant,
while Type II error means you tell a pregnant woman she isn’t carrying a baby.
55) What’s the difference between a generative and discriminative model? [src]
A generative model will learn categories of data while a discriminative model will
simply learn the distinction between different categories of data. Discriminative
models will generally outperform generative models on classification tasks.