SDSC3006 - Assignment 1

Uploaded by

sze

SDSC 3006 Fundamentals of Machine Learning I

Assignment #1

Deadline: Sunday, October 13 @ 10:00 PM

1. For each of parts (a) through (d), indicate whether we would generally expect the performance
of a flexible statistical learning method to be better or worse than an inflexible method. Justify
your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
(b) The number of predictors p is extremely large, and the number of observations n is small.
(c) The relationship between the predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. 𝜎² = Var(𝜖), is extremely high.

2. We now revisit the bias-variance decomposition.


(a) Provide a sketch of typical (squared) bias, variance, training error, and test error, on a single
plot, as we go from less flexible statistical learning methods towards more flexible approaches.
The x-axis should represent the amount of flexibility in the method, and the y-axis should
represent the values for each curve. There should be four curves. Make sure to label each one.
(b) Explain why each of the four curves has the shape displayed in part (a).
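The training and test error curves in Question 2 can also be explored numerically. The sketch below is one possible illustration, not part of the assignment: it uses one-dimensional KNN regression, where a smaller K means a more flexible fit. The true function sin(2x), the noise level, the sample sizes, and all function names are assumptions chosen only for this example.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def true_f(x):
    # Hypothetical non-linear truth, chosen only for illustration.
    return math.sin(2 * x)


def simulate(n, noise_sd=0.3):
    """Draw n points with x uniform on [0, 3] and y = f(x) + Gaussian noise."""
    xs = [rng.uniform(0, 3) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def mse(x_eval, y_eval, x_fit, y_fit, k):
    """Mean squared error of KNN (fitted on x_fit, y_fit) on the eval set."""
    errors = [(y - knn_predict(x, x_fit, y_fit, k)) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)


x_train, y_train = simulate(100)
x_test, y_test = simulate(100)

# Flexibility increases as K decreases.
results = {}
for k in (50, 20, 10, 5, 1):
    train = mse(x_train, y_train, x_train, y_train, k)
    test = mse(x_test, y_test, x_train, y_train, k)
    results[k] = (train, test)
    print(f"K={k:2d}  train MSE={train:.3f}  test MSE={test:.3f}")
```

With K = 1 every training point is its own nearest neighbour, so the training error is zero while the test error is not. The other values depend on the simulated data and should be read from the printed output rather than assumed.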

3. Suppose we have a data set with five predictors, 𝑋1 = GPA, 𝑋2 = IQ, 𝑋3 = Gender (1 for
Female and 0 for Male), 𝑋4 = Interaction between GPA and IQ, and 𝑋5 = Interaction between
GPA and Gender. The response is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get 𝛽̂0 = 50, 𝛽̂1 = 20, 𝛽̂2 = 0.07, 𝛽̂3 = 35,
𝛽̂4 = 0.01, 𝛽̂5 = −10.
(a) Which answer is correct, and why?
i. For a fixed value of IQ and GPA, males earn more, on average, than females.
ii. For a fixed value of IQ and GPA, females earn more, on average, than males.
iii. For a fixed value of IQ and GPA, males earn more, on average, than females provided that
the GPA is high enough.
iv. For a fixed value of IQ and GPA, females earn more, on average, than males provided that the
GPA is high enough.
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is
very little evidence of an interaction effect. Justify your answer.
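The fitted model in Question 3 can be written as a small function, which makes it easier to see how the two interaction terms enter a prediction. This is a minimal sketch using the coefficients stated in the question; the function and argument names are hypothetical.

```python
def predicted_salary(gpa, iq, female):
    """Fitted starting salary in thousands of dollars.

    female is 1 for Female and 0 for Male, matching the coding of X3.
    """
    return (50
            + 20 * gpa
            + 0.07 * iq
            + 35 * female
            + 0.01 * gpa * iq        # X4: GPA x IQ interaction
            - 10 * gpa * female)     # X5: GPA x Gender interaction


def female_minus_male(gpa):
    """Fitted salary gap (female minus male) at a given GPA.

    IQ cancels out, because it enters both predictions identically.
    """
    return 35 - 10 * gpa


# Example call with arbitrary inputs (not taken from the question).
print(predicted_salary(gpa=3.0, iq=100, female=0))
print(female_minus_male(3.0))
```

The sign of `female_minus_male` changes with GPA, which is the quantity part (a) asks about.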

4. Use the Carseats data set (ISLP package) to answer the following questions.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables
in the model are qualitative!
(c) Write out the model in equation form, being careful to handle the qualitative variables
properly.
(d) For which of the predictors can you reject the null hypothesis H0: 𝛽𝑗 = 0?
(e) On the basis of your response to the previous question, fit a smaller model that only uses the
predictors for which there is evidence of association with the outcome.
(f) How well do the models in (a) and (e) fit the data?
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
(h) Is there evidence of outliers or high leverage observations in the model from (e)?

5. Suppose we collect data for a group of students in a statistics class with variables 𝑋1 = hours
studied, 𝑋2 = undergrad GPA, and 𝑌 = receive an A. We fit a logistic regression and produce
estimated coefficients 𝛽̂0 = −6, 𝛽̂1 = 0.05, 𝛽̂2 = 1.
(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5
gets an A in the class.
(b) How many hours would the student in part (a) need to study to have a 50% chance of getting
an A in the class?
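Both parts of Question 5 come down to the logistic function and its inverse. The sketch below assumes the usual logistic regression form p = 1 / (1 + exp(−(𝛽̂0 + 𝛽̂1·X1 + 𝛽̂2·X2))) with the coefficients given in the question; the function names are hypothetical.

```python
import math

# Estimated coefficients from the question.
B0, B1, B2 = -6.0, 0.05, 1.0


def prob_a(hours, gpa):
    """Estimated probability of receiving an A."""
    z = B0 + B1 * hours + B2 * gpa          # linear predictor (log-odds)
    return 1.0 / (1.0 + math.exp(-z))


def hours_for_probability(p, gpa):
    """Hours of study at which the estimated probability equals p."""
    log_odds = math.log(p / (1.0 - p))      # inverse of the logistic function
    return (log_odds - B0 - B2 * gpa) / B1


print(prob_a(hours=40, gpa=3.5))
print(hours_for_probability(p=0.5, gpa=3.5))
```

A probability of 0.5 corresponds to log-odds of zero, so part (b) reduces to solving a linear equation in the number of hours.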

6. Answer the following questions about the differences between LDA and QDA.
(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the
training set? On the test set?
(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on
the training set? On the test set?
(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA
relative to LDA to improve, decline, or be unchanged? Why?
(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will
probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible
enough to model a linear decision boundary. Justify your answer.
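For reference when answering Question 6, the standard discriminant functions are written out below, assuming Gaussian class-conditional densities. The notation is an assumption of this note and does not appear in the question: μ_k is the mean of class k, π_k its prior probability, Σ a covariance matrix shared by all classes (LDA), and Σ_k a class-specific covariance matrix (QDA).

```latex
\delta_k^{\mathrm{LDA}}(x) = x^{\top}\Sigma^{-1}\mu_k
  - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k

\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\,(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
  - \tfrac{1}{2}\,\log\lvert\Sigma_k\rvert + \log \pi_k
```

The LDA function is linear in x, while the QDA function is quadratic in x and requires estimating a separate covariance matrix for each class.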

7. This question should be answered using the Weekly data set (ISLP package). This data is similar
in nature to the Smarket data from this chapter’s lab, except that it contains 1089 weekly returns
for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be
any patterns?
(b) Use the full data set to perform a logistic regression with Direction as the response and the five
lag variables plus Volume as predictors. Use the summary function to print the results. Do any of
the predictors appear to be statistically significant? If so, which ones?
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the
confusion matrix is telling you about the types of mistakes made by logistic regression.
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2
as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions
for the held out data (that is, the data from 2009 and 2010).
(e) Repeat (d) using LDA.
(f) Repeat (d) using QDA.
(g) Repeat (d) using KNN with K = 1.
(h) Which of these methods appears to provide the best results on this data?
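Parts (c) through (g) of Question 7 all ask for a confusion matrix and the overall fraction of correct predictions. The helper below is a minimal standard-library sketch of those two quantities. The function names are hypothetical, and the toy labels are made up for illustration; they are not taken from the Weekly data.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=("Down", "Up")):
    """Nested dict indexed as matrix[predicted_label][actual_label]."""
    counts = Counter(zip(predicted, actual))
    return {p: {a: counts[(p, a)] for a in labels} for p in labels}


def accuracy(actual, predicted):
    """Overall fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Toy labels, invented for this example only.
toy_actual = ["Up", "Up", "Down", "Down", "Up", "Down"]
toy_predicted = ["Up", "Up", "Up", "Down", "Down", "Up"]

matrix = confusion_matrix(toy_actual, toy_predicted)
for p in ("Down", "Up"):
    print(f"predicted {p:4s}: {matrix[p]}")
print("accuracy:", accuracy(toy_actual, toy_predicted))
```

The off-diagonal cells separate the two kinds of mistakes (predicting Up when the market went Down, and the reverse), which is what part (c) asks you to interpret.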
