R Class 20

The document outlines an assignment consisting of multiple statistical analysis tasks involving datasets related to cable failures, auto claims, and marketing spend. It includes fitting linear and generalized linear models, evaluating model significance, performing hypothesis tests, and analyzing correlations and principal components. Each question requires specific statistical methods and interpretations, along with justifications for data manipulations and model adjustments.

CHAPTER 11 & 12 ASSIGNMENT PREPARED BY – RAKESH GUPTA

Question 1: A statistician is analysing a dataset that describes the failure times of outdoor telephone cables with respect to the
cable material quality (graded 1 to 4) and the level of rainfall, in centimeters, to which the cable is exposed. The data given
in the file “Cables_dataset.csv” show failure times in years for 20 different cables.
(i) Fit a linear model to the data with the failure time as the response, including both cable material quality and level of rainfall as the
two covariates. Your answer should include a summary of the fitted model. [5]
(ii) (a) State the formula of the model fitted in part (i), clearly explaining the notation that you use.
(b) Comment on the significance of the parameters of the model fitted in part (i). [6]
(iii) (a) Plot the residuals of the model in part (i).
(b) Comment on the plot created in (iii)(a). [4]

An analyst suggests that the 6th row of the original data should be removed.
(iv) (a) Construct a new data set from the original data “Cables_dataset.csv” with the 6th row removed. [2]
(b) Justify the removal of the 6th row from the original data. [2]

(v) (a) Fit a linear model to the new data set constructed in part (iv)(a). [1]

(b) Comment on the fit of the model from part (v)(a) compared to the model fitted in part (i), by comparing suitable statistics from the
R outputs. [3]
(vi) (a) Fit a generalized linear model (GLM) to the data set constructed in part (iv)(a) using a Gamma distribution.
(b) State the formula of the model fitted in part (vi)(a), clearly explaining the notation that you use.
(c) Comment on the significance of the parameters of the model fitted in part (vi)(a). [6]
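
The workflow in Question 1 can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated rather than read from “Cables_dataset.csv”, the column names failure_time, quality and rainfall are hypothetical, and quality is treated as a numeric covariate. The Gamma GLM is fitted with R's default inverse link because the question does not name a link function.

```r
# Simulated stand-in for "Cables_dataset.csv" (20 cables).
# Column names are assumptions; the real file may use different ones.
set.seed(1)
cables <- data.frame(
  quality  = rep(1:4, each = 5),
  rainfall = round(runif(20, min = 50, max = 150))
)
# Gamma responses whose mean is 1 / eta, so the inverse link is appropriate.
eta <- 0.4 - 0.06 * cables$quality + 0.0005 * cables$rainfall
cables$failure_time <- rgamma(20, shape = 20, rate = 20 * eta)

# Part (i): linear model with both covariates.
fit_lm <- lm(failure_time ~ quality + rainfall, data = cables)
summary(fit_lm)

# Part (iv)(a): new data set with the 6th row removed.
cables_new <- cables[-6, ]

# Part (v): refit and compare suitable statistics from the two summaries.
fit_lm_new <- lm(failure_time ~ quality + rainfall, data = cables_new)
comparison <- rbind(
  original    = c(adj_r2 = summary(fit_lm)$adj.r.squared,
                  sigma  = summary(fit_lm)$sigma),
  row6_removed = c(adj_r2 = summary(fit_lm_new)$adj.r.squared,
                   sigma  = summary(fit_lm_new)$sigma)
)
print(round(comparison, 4))

# Part (vi)(a): Gamma GLM; family = Gamma uses the inverse link by default.
fit_glm <- glm(failure_time ~ quality + rainfall, family = Gamma,
               data = cables_new)
summary(fit_glm)
```

With the real file, the simulated data frame would be replaced by a call such as read.csv("Cables_dataset.csv"), and the formula would use the file's actual column names.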

Question 2
Refer to the dataset “AutoClaims.csv” and answer the following questions.

(i) Fit a linear regression model to predict the “PAID” claim amount based on the other variables (treat AGE as a numerical
variable and all others as categorical). Interpret the model by explaining the R-squared, the adjusted R-squared, the p-value
of the model and the p-value of each coefficient. Identify the significant variables in the prediction of “PAID” claims.

(ii) Comment on the applicability of the linear regression model by plotting “Residuals vs. Fitted Values” and a “QQ Plot of the
residuals”.

(iii) Your actuarial friend has suggested that you use the natural logarithm of the “PAID” claims instead of the actual “PAID” claim
amount, because log_e(PAID) is closer to a normal distribution than the “PAID” claims. Verify your friend's statement by
comparing the skewness and excess kurtosis of the PAID claims with those of log_e(PAID). Write appropriate custom functions to
compute both measures.

(iv) Refit the model in (i) above, taking the suggestion in (iii) into account. Identify and comment on the key differences between
the two models.

(v) Your manager has suggested that the model can be improved by adding interaction effects between STATE and CLASS, STATE
and GENDER, and CLASS and GENDER as additional variables to the set of independent variables used in (i). Evaluate the merit
of this suggestion.
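
For part (iii), one possible form of the custom functions is sketched below. It uses the moment-based definitions (central moments with divisor n), which is an assumption; other conventions with bias corrections exist and give slightly different values. The claim amounts are simulated from a log-normal distribution as a stand-in for the PAID column, so the printed numbers say nothing about the real data.

```r
# Moment-based skewness: m3 / m2^(3/2), with m_k the k-th central moment
# computed using divisor n.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: m4 / m2^2 - 3 (zero for a normal distribution).
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Simulated stand-in for the PAID column (an assumption, not the real data).
set.seed(42)
paid <- rlnorm(5000, meanlog = 7, sdlog = 1)

shape_stats <- rbind(
  PAID      = c(skewness = skewness(paid),
                excess_kurtosis = excess_kurtosis(paid)),
  log_PAID  = c(skewness = skewness(log(paid)),
                excess_kurtosis = excess_kurtosis(log(paid)))
)
print(round(shape_stats, 3))
```

Values of both measures close to zero are consistent with a normal distribution, so the row with the smaller absolute values is the one closer to normality.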

Question 3: Refer to the data file “Indices_Returns.csv” and answer the following questions:
Indices_Returns.csv file is provided in the system.

(i) Compute the pairwise Pearson correlation coefficients between the returns of the 10 sectors (BM, CD, EN, FM, FI,
HC, IN, IT, TE and UT), rounded to three digits after the decimal point. Display the correlation matrix in the output.
(ii) Identify the pair with the highest correlation coefficient and the pair with the lowest correlation coefficient.
(iii) Perform principal component analysis on the return values of the 10 sectors.
(iv) How many principal components have an eigenvalue of more than 1?
(v) What is the approximate proportion of total variation explained by the first two principal components?
(vi) Compute the pairwise correlations among the 10 principal components (rounded to 3 digits after the
decimal point) and display the results. What do you infer about the resulting correlations?
(vii) Using a scree plot, comment on the number of significant components in the model.
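
The steps in Question 3 can be sketched as below. The returns are simulated with a single common factor, not read from “Indices_Returns.csv”, and the sketch assumes the components are extracted from standardised returns (scale. = TRUE), which is what makes the "eigenvalue greater than 1" rule in part (iv) meaningful. Only the sector codes come from the question.

```r
# Simulated stand-in for the 10 sector return series (an assumption).
set.seed(7)
sectors <- c("BM", "CD", "EN", "FM", "FI", "HC", "IN", "IT", "TE", "UT")
market  <- rnorm(250)
returns <- sapply(sectors, function(s) 0.6 * market + rnorm(250))

# Parts (i)-(ii): correlation matrix and the extreme off-diagonal pairs.
corr  <- round(cor(returns), 3)
print(corr)
pairs <- which(lower.tri(corr), arr.ind = TRUE)
vals  <- corr[lower.tri(corr)]
pair_name <- function(i) {
  paste(rownames(corr)[pairs[i, "row"]], colnames(corr)[pairs[i, "col"]],
        sep = "-")
}
highest_pair <- pair_name(which.max(vals))
lowest_pair  <- pair_name(which.min(vals))

# Part (iii): PCA on centred and scaled returns.
pca <- prcomp(returns, center = TRUE, scale. = TRUE)

# Part (iv): eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca$sdev^2
n_above_one <- sum(eigenvalues > 1)

# Part (v): cumulative proportion of variance explained by the first two.
prop_first_two <- sum(eigenvalues[1:2]) / sum(eigenvalues)

# Part (vi): correlations among the component scores.
pc_corr <- round(cor(pca$x), 3)
print(pc_corr)
```

For part (vii), screeplot(pca, type = "lines") draws the scree plot from the same fitted object.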

Question 4: Five years of marketing spend and company sales by month

i) Construct a scatterplot of the data. Comment on the relationship between Sales and Spend based on the plot. (4)

ii) Calculate Pearson’s correlation coefficient between Sales and Spend of the company. (2)

iii) Perform a hypothesis test for the null hypothesis that Pearson’s population correlation coefficient is equal to zero,
against the alternative that it is positive. You should report the p-value of the test and a clear conclusion. (5)

iv) Perform a simple linear regression analysis on the data. Your answer should report the estimate of parameter
sigma. (6)

v) Plot the fitted line on the data scatterplot. (2)

vi) State the proportion of the total variability of the responses explained by the model based
on your output in (iv). (1)

vii) Plot a graph of the residuals of the model fitted in (iv) against the explanatory variable. (2)

viii) Obtain a 99% confidence interval for parameter sigma. (4)

ix) Comment on the validity of the model based on results in part (vii) and part (viii). (2)

x) It is suggested that the slope parameter is equal to 10. Calculate the p-value of a hypothesis test for this suggestion by
creating a suitable test statistic. (7)

xi) Comment on the suggestion in point (x). (2)

xii) Calculate the predicted amount of sales when the marketing spend is INR 4500. (2)
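
The calculations in parts (ii), (iii), (iv), (viii), (x) and (xii) can be sketched as follows. The 60 monthly observations are simulated and the column names spend and sales are hypothetical, since the question does not name the data file. The test in part (x) is taken to be two-sided, which is also an assumption.

```r
# Simulated stand-in for five years of monthly data (an assumption).
set.seed(123)
spend <- round(runif(60, min = 1000, max = 6000))
sales <- 500 + 10.5 * spend + rnorm(60, sd = 3000)
mkt   <- data.frame(spend = spend, sales = sales)

# Parts (ii)-(iii): Pearson correlation and one-sided test of rho = 0.
r_hat    <- cor(mkt$spend, mkt$sales)
rho_test <- cor.test(mkt$spend, mkt$sales, alternative = "greater",
                     method = "pearson")
rho_test$p.value

# Parts (iv) and (vi): simple linear regression, sigma estimate, R-squared.
fit       <- lm(sales ~ spend, data = mkt)
sigma_hat <- summary(fit)$sigma
r_squared <- summary(fit)$r.squared

# Part (viii): 99% CI for sigma, using
# (n - 2) * sigma_hat^2 / sigma^2 ~ chi-square with n - 2 degrees of freedom.
df_res   <- df.residual(fit)
ci_sigma <- sqrt(df_res * sigma_hat^2 / qchisq(c(0.995, 0.005), df_res))

# Part (x): t statistic for H0: slope = 10 against a two-sided alternative.
slope_est <- coef(summary(fit))["spend", "Estimate"]
slope_se  <- coef(summary(fit))["spend", "Std. Error"]
t_stat    <- (slope_est - 10) / slope_se
p_value   <- 2 * pt(-abs(t_stat), df_res)

# Part (xii): predicted sales at a marketing spend of 4500.
pred_4500 <- predict(fit, newdata = data.frame(spend = 4500))
```

The lower limit of ci_sigma comes from the upper chi-square quantile, which is why the quantiles are passed in the order 0.995, 0.005.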
