P.V.P SIDDHARTHA INSTITUTE OF TECHNOLOGY
BRANCH: Computer Science & Engineering
REGULATION: PVP20
Course: B.Tech
Course Name: DATA SCIENCE
Course Code: 20CS4501A
Year and Semester: III-I
QUESTION BANK
UNIT-I
Q. NO. QUESTION CO LEVEL MARKS
1 Explain the phases of Data Science. 1 2 14
2 What is Exploratory Data Analysis? Explain any two
types of visualization. 1 2 14
3 Explain various hyperparameter optimization techniques
with suitable examples. 1 2 14
4 Briefly explain the role of Data Science in various fields. 1 2 14
5 Explain the Data Science phases and lifecycle, and write the
names of tools for Data Science. 1 2 14
6 Explain the roles and stages in a data science project. 1 2 14
7 Explain how data is explored and managed in data science. 1 2 14
8 Explain the various processes for preparing a dataset to
perform a data science task. 1 2 14
9 Define hyperparameter optimization and discuss various
strategies for optimizing hyperparameters. 1 2 14
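As an illustration for Questions 3 and 9 of this unit, the following is a minimal Python sketch contrasting grid search with random search. It is not part of the question bank: the function `validation_loss`, the hyperparameter names `learning_rate` and `depth`, and all candidate values are hypothetical stand-ins for a real model's validation score.

```python
import itertools
import random


def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for a model's validation loss (lower is better)."""
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2 / 100


def grid_search(rates, depths):
    """Grid search: evaluate every combination and return the best pair."""
    return min(itertools.product(rates, depths),
               key=lambda pair: validation_loss(*pair))


def random_search(n_trials, seed=0):
    """Random search: sample n_trials configurations and keep the best one."""
    rng = random.Random(seed)
    trials = [(rng.uniform(0.001, 1.0), rng.randint(1, 10))
              for _ in range(n_trials)]
    return min(trials, key=lambda pair: validation_loss(*pair))


best_grid = grid_search([0.01, 0.1, 0.5], [2, 4, 8])
best_random = random_search(50)
print("grid search best:", best_grid)      # (0.1, 4) for this toy loss
print("random search best:", best_random)
```

Grid search is exhaustive over the listed candidates, so its cost grows multiplicatively with each added hyperparameter. Random search spends a fixed budget of trials and can land on values that lie between the grid points.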
UNIT-II
Q. NO. QUESTION CO LEVEL MARKS
1 Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 23, 29, 35, 41, 44, 53, 62, 69, 72
(i) Use min-max normalization to transform the value 45 for age onto
the range [0, 1].
(ii) Use z-score normalization to transform the value 45 for age, where
the standard deviation of age is 20.64 years. 2 3 14
2 a) Differentiate between data reduction and dimensionality reduction
for data discretization.
b) Explain the role of attributes in classification. 2 2 14
3 Normalize the following group of data by using the following
techniques.
200, 300, 400, 600, 1000
i. Min-max normalization
ii. Z-score normalization
iii. Decimal scaling
iv. Write your observations on the above techniques. 2 3 14
4 How are data transformation and data discretization implemented?
Explain with examples. 2 2 14
5 Given data = {2, 3, 4, 5, 6, 7; 1, 5, 3, 6, 7, 8}. Use the PCA
algorithm to compute the principal component. 2 3 14
6 In real-world data, tuples with missing values for some attributes are
a common occurrence. Apply various pre-processing methods for handling
this problem. 2 3 14
7 The following are the sorted prices (in rupees) of certain items in
a supermarket.
4, 8, 15, 21, 21, 24, 25, 28, 34, 36, 39, 42, 51, 57, 60
Smooth the data by using the following smoothing techniques. Consider
the bin size as 3.
i) Bin means
ii) Bin medians
iii) Bin boundaries 2 3 14
8 a) Why is data transformation important, and when do we need it?
b) When do we use the splitting table technique?
c) How can adding a redundant column cause loss of information?
Justify. 2 2 14
9 Evaluate any two data reduction techniques with examples. What is the
format for reporting results of each? 2 3 14
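The computations asked for in Questions 1, 3 and 7 of this unit can be checked with a short Python sketch such as the one below. It is not an answer key and rests on stated assumptions: "bin size 3" is read as three values per equal-depth bin, the mean for Question 1 is computed from the listed ages while the standard deviation 20.64 is taken as given, and the z-scores for Question 3 use the population standard deviation. All function names are illustrative.

```python
import statistics


def min_max(value, data, new_min=0.0, new_max=1.0):
    """Min-max normalization of value onto [new_min, new_max]."""
    lo, hi = min(data), max(data)
    return (value - lo) / (hi - lo) * (new_max - new_min) + new_min


def z_score(value, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (value - mean) / std


def decimal_scaling(data):
    """Divide by the smallest power of 10 that brings every |v| below 1."""
    j = 0
    while max(abs(v) for v in data) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in data]


def smooth_by_bin_means(sorted_data, depth):
    """Equal-depth binning: replace each value by the mean of its bin."""
    smoothed = []
    for start in range(0, len(sorted_data), depth):
        bin_values = sorted_data[start:start + depth]
        smoothed.extend([statistics.mean(bin_values)] * len(bin_values))
    return smoothed


# Question 1: the value 45 against the listed ages.
ages = [13, 15, 16, 16, 19, 20, 23, 29, 35, 41, 44, 53, 62, 69, 72]
print(round(min_max(45, ages), 4))                          # 0.5424
print(round(z_score(45, statistics.mean(ages), 20.64), 4))  # 0.478

# Question 3: the three normalizations on the same five values.
values = [200, 300, 400, 600, 1000]
print([min_max(v, values) for v in values])  # [0.0, 0.125, 0.25, 0.5, 1.0]
mu, sigma = statistics.mean(values), statistics.pstdev(values)  # assumed: population std
print([round(z_score(v, mu, sigma), 4) for v in values])
print(decimal_scaling(values))               # [0.02, 0.03, 0.04, 0.06, 0.1]

# Question 7: smoothing by bin means, three values per bin (assumed).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 36, 39, 42, 51, 57, 60]
print(smooth_by_bin_means(prices, 3))
```

Smoothing by bin medians follows the same pattern with `statistics.median` in place of `statistics.mean`. Smoothing by bin boundaries replaces each value with the nearer of its bin's minimum and maximum.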
UNIT-III
Q. NO. QUESTION CO LEVEL MARKS
1 Explain the mean and variance of the Binomial distribution and its
properties. 3 2 14
2 A class of 25 students has a mean test score of 60 out of 100, with a
standard deviation of 4 marks, while the other students of the school
have a mean score of 50 on the same test. What is the t-score for
calculating the probability that the school's students scored not less
than 60 on the test? 3 3 14
3 Find the mean and standard deviation of a normal distribution in which
7% of the items are under 35 and 89% are under 63. 3 3 14
4 In a normal distribution, 31% of the items are under 45 and 8% are
over 64. Find the mean and variance of the distribution. 3 3 14
5 From the following data, use the chi-square test to find whether there
is any significant association between the habit of taking soft drinks
and the category of employee. 3 3 14
Soft drinks   Employees
              Clerks   Teachers   Officers
Pepsi         10       25         65
Thums Up      15       30         65
Fanta         50       60         30
6 Explain the various methods of data collection involved in Data
Science. 3 2 14
7 In a sample of 1000 cases, the mean of a particular test is 14 and the
standard deviation is 2.5. Assuming the distribution to be normal,
determine:
i) how many students score between 12 and 15?
ii) how many score above 18?
iii) how many score below 18? 3 3 14
8 Differentiate between stratified sampling and cluster sampling
techniques with examples. 3 2 14
9 A pair of dice is thrown 360 times and the frequency of each sum is
indicated below. Would you say that the dice are fair on the basis of
the chi-square test at the 0.01 level of significance? 3 3 14
Sum         2   3   4   5   6   7   8   9   10   11   12
Frequency   8   24  35  37  44  65  51  42  26   14   14
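For the two chi-square problems in this unit (Questions 5 and 9), the test statistic can be checked with a minimal Python sketch like the one below. It only computes the Pearson statistic and the degrees of freedom; the comparison against a tabulated critical value is left to the reader. The function names are illustrative, and the expected dice frequencies assume two fair six-sided dice.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Question 9: goodness of fit for the sum of two fair dice (360 throws).
observed_sums = [8, 24, 35, 37, 44, 65, 51, 42, 26, 14, 14]
ways = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]   # ways to roll each sum 2..12
expected_sums = [360 * w / 36 for w in ways]
dice_stat = chi_square(observed_sums, expected_sums)
print(round(dice_stat, 3), "with", len(ways) - 1, "degrees of freedom")

# Question 5: test of independence; rows are drinks, columns employee types.
drinks = [[10, 25, 65], [15, 30, 65], [50, 60, 30]]
expected = expected_counts(drinks)
drink_stat = chi_square(
    [o for row in drinks for o in row],
    [e for row in expected for e in row],
)
dof = (len(drinks) - 1) * (len(drinks[0]) - 1)
print(round(drink_stat, 2), "with", dof, "degrees of freedom")
```

For the dice data the statistic comes to about 7.45 on 10 degrees of freedom, well below the tabulated 0.01 critical value of about 23.21, so fairness would not be rejected under these assumptions. For the soft-drink table the statistic is about 60.23 on 4 degrees of freedom.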
UNIT-IV
Q. NO. QUESTION CO LEVEL MARKS
1 If the logit score (linear predictor) is given by -2.4 + 1.5 X1 + 2 X2,
find the estimated P(Y = 1) for each of the following combinations of
the independent variables (IDVs): 4 3
X1: 0 1.5 2 3 -2 -2.5
X2: 1 0 1.5 -1 2 2.5
2 Which specific regressors seem essential in multiple regression? How
will you address this question? Discuss. 4 2
3 Suppose you have the following data with one real-valued input variable
and one real-valued output variable. What is the leave-one-out
cross-validation mean squared error in the case of linear regression
(Y = bX + c)? 4 3
X (independent variable)   Y (dependent variable)
0                          2
2                          2
3                          1
4 a) Discuss the need for fitting the model in multiple regression.
b) Explain Logistic Regression with an example. 4 2
5 a) Obtain the likelihood equation for estimating the parameters of a
logistic regression model.
b) Define the multiple linear regression model. Explain the least squares
method to estimate parameters in multiple linear regression models. 4 2
6 Explain Linear Discriminant Analysis in detail with an example. 4 2
7 Apply linear regression using the method of least squares to the
following data and predict the crop yield for a rainfall of 5 cm. 4 3
Rainfall (in cm):                10.5  8.8   13.4  12.5  18.8  10.3  7.0   15.6  16
Paddy yield (quintal per acre):  30.3  46.2  58.8  59.0  82.4  49.2  31.9  76.0  78.8
8 Explain how overfitting and underfitting issues are handled in
regression modeling. 4 2
9 Build a linear regression model with the following data and test for
overall fit. Also, test for the individual significance of X1 and of X2. 4 3
Y:  12.8 13.9 15.2 18.3 14.5 12.4
X1: 2 3 5 5 4 1
X2: 4 2 5 1 2 3
10 Apply logistic regression to demonstrate binary classification. 4 3
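The numerical parts of Questions 1, 3 and 7 of this unit can be explored with the following minimal Python sketch. It is a sketch under assumptions, not a model answer: the X1 and X2 values of Question 1 are assumed to be paired by position (six pairs in all), and the helper names `logistic_probability`, `fit_line` and `loocv_mse` are illustrative.

```python
import math


def logistic_probability(x1, x2):
    """P(Y = 1) for the logit -2.4 + 1.5*x1 + 2*x2 given in Question 1."""
    logit = -2.4 + 1.5 * x1 + 2 * x2
    return 1 / (1 + math.exp(-logit))


def fit_line(xs, ys):
    """Least-squares slope b and intercept c for y = b*x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


def loocv_mse(xs, ys):
    """Leave-one-out CV: refit without point i, then score the fit on point i."""
    squared_errors = []
    for i in range(len(xs)):
        b, c = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared_errors.append((ys[i] - (b * xs[i] + c)) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Question 1: probabilities for the (X1, X2) pairs, paired by position (assumed).
pairs = [(0, 1), (1.5, 0), (2, 1.5), (3, -1), (-2, 2), (-2.5, 2.5)]
for x1, x2 in pairs:
    print(x1, x2, round(logistic_probability(x1, x2), 4))

# Question 3: leave-one-out error on the three (X, Y) points.
print(round(loocv_mse([0, 2, 3], [2, 2, 1]), 4))   # 1.8148, i.e. 49/27

# Question 7: fitted line and the prediction for 5 cm of rainfall.
rain = [10.5, 8.8, 13.4, 12.5, 18.8, 10.3, 7.0, 15.6, 16]
paddy = [30.3, 46.2, 58.8, 59.0, 82.4, 49.2, 31.9, 76.0, 78.8]
slope, intercept = fit_line(rain, paddy)
print(slope, intercept, slope * 5 + intercept)
```

In Question 7 the requested rainfall of 5 cm lies below the smallest observed value (7.0 cm), so the prediction is an extrapolation of the fitted line.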
UNIT-V
Q. NO. QUESTION CO LEVEL MARKS
1 Explain the bias/variance dilemma with respect to model complexity. 1 2 14
2 Explain k-fold cross validation and how it can be implemented for
building a model. 4 3 14
3 Imagine that you find out that your model has low bias and high
variance. Which algorithm would be best suited to this problem? What is
the reason? 4 3 14
4 a) Derive the estimate of the in-sample error.
b) Explain the Minimum Description Length principle for model building. 1 2 14
5 Briefly explain how you would calculate a cross-validated estimate of
prediction error in a linear regression. Is this estimate likely to be
smaller or larger than the in-sample error? 1 2 14
6 What is the holdout approach? What is the limitation of this approach?
Name four alternative approaches for it. 1 2 14
7 Explain bias and variance in machine learning, and how the bias-variance
decomposition is used for deciding the model complexity. 1 2 14
8 Given a data set STr = {(xi, yi), i = 1, ..., 6}, where xi ∈ ℝ is a
scalar feature and yi ∈ {-1, +1} is a class label.
The data points in the data set are
(x1,y1)=(2,-1) (x2,y2)=(7,-1)
(x3,y3)=(4,+1) (x4,y4)=(1,-1)
(x5,y5)=(3,+1) (x6,y6)=(6,+1)
Suppose you are training a linear classifier f(x;a,b) = sign(ax+b) with
2-fold cross-validation, where sign(z) = +1 if z ≥ 0 and -1 if z < 0.
Split STr into S1={(x1,y1) (x2,y2) (x3,y3)} and
S2={(x4,y4) (x5,y5) (x6,y6)}.
After training the classifier f on S1, we have a1=-1, b1=5, and then try
to validate the classifier on S2.
After training the classifier f on S2, we have a2=2, b2=-3, and then try
to validate the classifier on S1.
Calculate the average training error in the 2-fold cross-validation. 4 3 14
9 Given a data set TTr = {(xi, yi), i = 1, ..., 4}, where xi ∈ ℝ is a data
point and yi ∈ {-1, +1} is the corresponding label.
The data points are
(x1,y1)=(5,-1) (x2,y2)=(6,-1)
(x3,y3)=(1,+1) (x4,y4)=(4,-1)
Suppose you are training a linear classifier f(x;a,b) = sign(bx+a) with
2-fold cross-validation, where sign(z) = +1 if z ≥ 0 and -1 if z < 0.
Split TTr into T1={(x1,y1) (x2,y2)} and
T2={(x3,y3) (x4,y4)}.
After training the classifier f(x;a,b) on T1, we have a1=-3, b1=4, and
then try to validate the classifier on T2.
After training the classifier f(x;a,b) on T2, we have a2=1, b2=-5, and
then try to validate the classifier on T1.
Calculate the average validation error (i.e. the cross-validation
error) in the 2-fold cross-validation. 4 3 14
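The error counts in Questions 8 and 9 of this unit can be verified with a short Python sketch such as the one below. It assumes the parameter values exactly as printed in the questions and takes "error" to mean the fraction of misclassified points. The helper `error_rate` and its argument names `weight` and `bias` are illustrative: Question 8 defines the classifier as sign(ax+b), while Question 9 defines it as sign(bx+a), so the roles of a and b are swapped between the two.

```python
def sign(z):
    """sign(z) as defined in the questions: +1 when z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def error_rate(weight, bias, points):
    """Fraction of (x, y) points where sign(weight * x + bias) differs from y."""
    wrong = sum(1 for x, y in points if sign(weight * x + bias) != y)
    return wrong / len(points)


# Question 8: f(x; a, b) = sign(a*x + b), so a is the weight and b the bias.
s1 = [(2, -1), (7, -1), (4, 1)]
s2 = [(1, -1), (3, 1), (6, 1)]
train_err_q8 = (error_rate(-1, 5, s1) + error_rate(2, -3, s2)) / 2
valid_err_q8 = (error_rate(-1, 5, s2) + error_rate(2, -3, s1)) / 2
print(train_err_q8, valid_err_q8)

# Question 9: f(x; a, b) = sign(b*x + a), so b is the weight and a the bias.
t1 = [(5, -1), (6, -1)]
t2 = [(1, 1), (4, -1)]
valid_err_q9 = (error_rate(4, -3, t2) + error_rate(-5, 1, t1)) / 2
print(valid_err_q9)   # 0.25
```

Each fold's training error is measured on the fold the classifier was fitted on, and its validation error on the held-out fold. The 2-fold figure is the plain average of the two folds. For Question 8 this gives an average training error of 1/6 and an average validation error of 2/3.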