Lecture 09

Lecture 9 of MAT 212 discusses regression analysis, focusing on the relationship between dependent and independent variables in engineering and scientific problems. It introduces least squares linear regression, explaining how to derive the regression coefficients and the importance of minimizing error to find a 'best fit' line. The lecture also covers the relevance of linear models through Pearson's correlation coefficient and explores other models reducible to linear forms, such as exponential and inverse functions.

Uploaded by

abdussalam362436

Lecture Notes on

MAT 212 Probability & Statistics for Science & Engineering

Lecture 9
Regression

Introduction

Many engineering and scientific problems are concerned with determining a relationship
between a set of variables. For instance, in a chemical process, we might be interested in
the relationship between the output of the process, the temperature at which it occurs, and
the amount of catalyst employed. In another instance, we may wish to understand the
response of the atmospheric temperature to changes in pressure, particulate content,
and humidity of the atmosphere. Knowledge of such dependencies helps us to understand
the chemical and physical dynamics of the systems.

Least square linear regression

In many situations, there is a single response variable y, also called the dependent
variable, which depends on the values of a set of input variables, also called independent
variables, x1, x2, . . ., xr. The simplest type of relationship between the dependent
variable y and the input variables x1, x2, . . ., xr is a linear relationship. That is, for
some constants β0, β1, . . ., βr, the equation
y = β0 + β1x1 + β2x2 + . . . + βrxr (1)
would hold. If this were the relationship between y and xi (i = 1, 2, . . ., r), then it would
be possible (once the β’s were known) to predict the response exactly for any set of input
values. However, in practice, such precision is almost never attainable, and the most that
one can expect is that equation (1) holds subject to a random error. By this we
mean that the explicit relationship is
y = β0 + β1x1 + β2x2 + . . . + βrxr + e (2)

Lecture Notes on Independent University, Bangladesh


Probability and Statistics
184 Regression

where e, representing the random error, is assumed to be a random variable having mean
0. Equation (2) is called a linear regression equation. We say that it describes the regression
of y on the set of independent variables x1, x2, . . ., xr. The quantities β0, β1, . . ., βr are
called the regression coefficients, and must usually be estimated from a set of data. A
regression equation containing a single independent variable is called a simple regression
equation, whereas one containing several independent variables is called a multiple
regression equation. We will learn the fundamental techniques of regression through the
simple regression equation, and then extend these concepts to the multiple regression
case.

Thus, a simple linear regression model supposes a linear relationship between the mean
response and the value of a single independent variable. It can be expressed as
y = α + βx + e (3)
where x is the value of the independent variable, also called the input level, y is the
response, and e, representing the random error, is a random variable having mean 0.

Let us put things in perspective. We have a set of points (xi, yi), shown graphically in
figure 1. We want to pass a straight line through these points. Since we do not have any
reason to assume that all the points lie exactly on a straight line, each point (xi, yi) will
have an error, ei, associated with it, in relation to the straight line. Our objective is to
choose the values of α and β in such a way that the line can be considered a ‘best fit’.

Figure 1. A set of points (xi, yi), a straight line through them, and the error associated with it.

Even though we are clear about our objective, the problem begins with the definition of a
‘best fit’ line. As a first definition, we notice that with respect to the line, some errors
are negative, whereas others are positive. Therefore, we could decide to position the line
so that the total error is zero. Unfortunately, this would result in an infinite number of
lines, because for any line passing through the mean point (x̄, ȳ), the total error is zero.
To handle the problem arising from the positive and negative nature of the errors, we
instead take the square of each error. Squared errors are never negative, so they cannot
cancel one another. Therefore, we choose the values of α and β so that the sum of the
squared errors is a minimum. This leads us to what is known as least-squares linear
regression.
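
The failure of the zero-total-error criterion can be checked numerically. The following Python sketch uses made-up data (the points and the helper name `errors` are assumptions for illustration, not part of the lecture). It forces several lines of different slope through the mean point and shows that the total error vanishes for all of them, while the sum of squared errors changes with the slope.

```python
# Illustrative data (assumed for this sketch, not taken from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

def errors(alpha, beta):
    """Errors e_i = y_i - alpha - beta * x_i for the line y = alpha + beta * x."""
    return [y - alpha - beta * x for x, y in zip(xs, ys)]

results = []
for beta in (0.0, 1.0, 2.0, 3.0):
    alpha = y_bar - beta * x_bar            # forces the line through (x_bar, y_bar)
    e = errors(alpha, beta)
    total = sum(e)                          # zero, up to rounding, for every slope
    squares = sum(ei ** 2 for ei in e)      # changes with the slope
    results.append((beta, total, squares))
    print(f"beta={beta:.1f}  total error={total:+.2e}  sum of squares={squares:.3f}")
```

Among the four trial slopes, the sum of squares is smallest for the slope closest to the trend of the data, which is why the squared criterion can single out one line.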

For a given set of points (xi, yi), to obtain the expressions for α and β, we start with the
expression of error from equation (3).

ei = yi – α – βxi
Squaring this, we get the square of the error.
$$e_i^2 = (y_i - \alpha - \beta x_i)^2$$
Therefore the sum of the squares of the errors is
$$S = \sum_i e_i^2 = \sum_i (y_i - \alpha - \beta x_i)^2 \qquad (4)$$

Our intent is to develop a model so that S is a minimum. In equation (4), the quantities xi
and yi are the given set of points, hence fixed. The quantity S, therefore, depends only upon
the values of α and β. To minimize S, we differentiate S with respect to α and β
separately, and set each derivative equal to zero. The partial derivatives are
$$\frac{\partial S}{\partial \alpha} = -2 \sum_i (y_i - \alpha - \beta x_i)$$
and
$$\frac{\partial S}{\partial \beta} = -2 \sum_i (y_i - \alpha - \beta x_i)\, x_i$$
Setting these two derivatives equal to 0 gives
$$\sum_i (y_i - \alpha - \beta x_i) = 0 \qquad (5.1)$$
and
$$\sum_i (y_i - \alpha - \beta x_i)\, x_i = 0 \qquad (5.2)$$
Equations (5.1) and (5.2) can be written as
$$\alpha n + \beta \sum_i x_i = \sum_i y_i \qquad (6.1)$$
$$\alpha \sum_i x_i + \beta \sum_i x_i^2 = \sum_i x_i y_i \qquad (6.2)$$
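
As a sanity check on the differentiation, the sketch below compares the two analytic partial derivatives of S with central finite differences. The data points, the trial values `alpha0` and `beta0`, and all function names are assumptions made for this illustration.

```python
# Illustrative data (assumed for this sketch, not taken from the lecture).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 3.5, 4.0, 6.5]

def S(alpha, beta):
    """Sum of squared errors, equation (4)."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

def dS_dalpha(alpha, beta):
    """Analytic partial derivative of S with respect to alpha."""
    return -2.0 * sum(y - alpha - beta * x for x, y in zip(xs, ys))

def dS_dbeta(alpha, beta):
    """Analytic partial derivative of S with respect to beta."""
    return -2.0 * sum((y - alpha - beta * x) * x for x, y in zip(xs, ys))

def central_difference(f, alpha, beta, wrt, h=1e-6):
    """Numerical derivative of f(alpha, beta) with respect to 'alpha' or 'beta'."""
    if wrt == "alpha":
        return (f(alpha + h, beta) - f(alpha - h, beta)) / (2 * h)
    return (f(alpha, beta + h) - f(alpha, beta - h)) / (2 * h)

alpha0, beta0 = 0.3, 1.2                     # arbitrary trial values
num_a = central_difference(S, alpha0, beta0, "alpha")
num_b = central_difference(S, alpha0, beta0, "beta")
print("dS/dalpha: analytic", dS_dalpha(alpha0, beta0), "numeric", num_a)
print("dS/dbeta:  analytic", dS_dbeta(alpha0, beta0), "numeric", num_b)
```

Because S is quadratic in α and β, the central difference agrees with the analytic derivative up to rounding error.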
In these equations, n is the number of data points in the set (xi, yi). These two equations
form a pair of linear equations for α and β. We recognize that if we divide equation (6.1)
by n, it can be written as
$$\alpha + \beta \bar{x} = \bar{y} \qquad (7)$$
where x̄ and ȳ are the means of the xi and the yi. Solving equations (6.1) and (6.2), we
find α and β as
$$\alpha = \frac{\sum_i y_i \sum_i x_i^2 - \sum_i x_i \sum_i x_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2} \qquad (8.1)$$
$$\beta = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2} \qquad (8.2)$$
Using equation (7), equation (8.1) can also be written as
$$\alpha = \bar{y} - \beta \bar{x} \qquad (8.3)$$
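
The closed-form expressions translate directly into code. The following is a minimal Python sketch of equations (8.1) and (8.2); the function name `least_squares_line` and the test points (chosen to lie exactly on y = 1 + 2x) are assumptions for illustration. It also checks that equation (8.3) gives the same intercept.

```python
def least_squares_line(xs, ys):
    """Return (alpha, beta) for y = alpha + beta * x using equations (8.1) and (8.2)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2           # common denominator of (8.1) and (8.2)
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    alpha = (sum_y * sum_xx - sum_x * sum_xy) / denom
    beta = (n * sum_xy - sum_x * sum_y) / denom
    return alpha, beta

# Points chosen (for illustration only) to lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha, beta = least_squares_line(xs, ys)

# Equation (8.3): the intercept from the means and the slope.
alpha_from_means = sum(ys) / len(ys) - beta * sum(xs) / len(xs)
print(alpha, beta, alpha_from_means)
```

The guard on the denominator reflects the fact that the slope cannot be determined when every x value is the same.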

Example 10.1
The raw material used in the production of a certain synthetic fiber is stored in a location
without humidity control. Measurements of the relative humidity in the storage location
and the moisture content of a sample of the raw material were taken over 15 days. The
following data were recorded.
Relative humidity  46  53  29  61  36  39  47  49  52  38  55  32  57  54  44
Moisture content   12  15   7  17  10  11  11  12  14   9  16   8  18  14  12
Construct a least-squares regression model.

Considering the relative humidity as x, and moisture content as y, we can calculate
$$\sum x_i = 692, \quad \sum y_i = 186, \quad \sum x_i^2 = 33212, \quad \sum x_i y_i = 8997.$$
Therefore, using equations (8.1) and (8.2), we can calculate
$$\alpha = \frac{186 \times 33212 - 692 \times 8997}{15 \times 33212 - (692)^2} = -2.51$$
$$\beta = \frac{15 \times 8997 - 692 \times 186}{15 \times 33212 - (692)^2} = 0.32$$

Figure 2. Example 10.1 (moisture content versus relative humidity).

Therefore, the equation of the line is y = –2.51 + 0.32 x

The points and the least-square regression line are shown in figure 2.
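
The arithmetic of Example 10.1 can be reproduced with a few lines of Python. This is a sketch using the data from the example; the variable names are chosen here for illustration.

```python
# Data from Example 10.1.
humidity = [46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]   # x
moisture = [12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12]      # y

n = len(humidity)
sum_x = sum(humidity)
sum_y = sum(moisture)
sum_xx = sum(x * x for x in humidity)
sum_xy = sum(x * y for x, y in zip(humidity, moisture))

denom = n * sum_xx - sum_x ** 2
alpha = (sum_y * sum_xx - sum_x * sum_xy) / denom    # equation (8.1)
beta = (n * sum_xy - sum_x * sum_y) / denom          # equation (8.2)

print("sums:", sum_x, sum_y, sum_xx, sum_xy)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

The sums and the rounded coefficients agree with the values quoted in the example.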

Goodness of fit

If we critically observe the procedure for obtaining the values of α and β, we will notice
that all we need are a set of values for x and a set of values for y. The values obtained
for α and β give us a model for the linear trend of x and y. Interestingly, x and y do not
have to have a linear trend for us to obtain a linear model between them. For example,
the data could have a graphical distribution as shown in figure 3, yet we can still find
values of α and β that give a ‘linear’ model. It is clear that such a ‘linear’ model would
be meaningless in this case. Therefore, we need a method to estimate the relevance of the
linear model that we construct.

Figure 3
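
To see how the formulas return a line even when no linear trend exists, consider the sketch below. The data are an assumption made for illustration (points on the parabola y = x², symmetric about x = 0), not the data of figure 3.

```python
# Assumed illustrative data: points on y = x**2, symmetric about x = 0.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)   # equation (8.2)
alpha = sum_y / n - beta * sum_x / n                              # equation (8.3)

print(f"fitted line: y = {alpha:.2f} + {beta:.2f} x")
```

The procedure happily reports a horizontal line through the mean of y, although y is completely determined by x through a curve. The fitted coefficients alone say nothing about whether a straight line is a sensible description.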

A standard way to assess the relevance of the linear model is Pearson’s correlation
coefficient. It is defined as
$$\rho = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} \qquad (9.1)$$
or, equivalently,
$$\rho = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \qquad (9.2)$$
Both these expressions can be used to find the correlation coefficient. The correlation
coefficient gives a quantitative measure of the linear dependency between two variables x
and y. We state without proof that the value of ρ ranges between –1 and 1. The
correlation coefficient gives us an estimate of how well a line y = α + βx fits the data
series x and y. A correlation coefficient of 1 indicates that the points (x, y) lie exactly
on a line y = α + βx with β > 0. On the other hand, a correlation coefficient of –1
indicates that the points (x, y) lie exactly on a line y = α + βx with β < 0. The different
cases of the association between x and y, and the corresponding correlation coefficient ρ,
are shown in figure 4.

Figure 4. Different dependencies between X and Y and their corresponding Pearson’s
correlation coefficients (panels: ρ = 1, 0 < ρ < 1, ρ = –1, −1 < ρ < 0, ρ = 0).
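
The limiting cases described above can be illustrated with a short Python sketch of equation (9.2). The function name `pearson_r` and the three small data sets are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient, following equation (9.2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when x or y is constant")
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(xs, [2 + 3 * x for x in xs])      # points on a line with beta > 0
r_down = pearson_r(xs, [2 - 3 * x for x in xs])    # points on a line with beta < 0
r_none = pearson_r([-2.0, -1.0, 0.0, 1.0, 2.0],
                   [4.0, 1.0, 0.0, 1.0, 4.0])      # symmetric parabola, no linear trend

print(r_up, r_down, r_none)
```

Note that the third data set has a perfectly deterministic (quadratic) relationship, yet its correlation coefficient is zero: ρ measures linear association only.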

Example 2
We will find the correlation coefficient for the data given in Example 10.1, for which
x̄ = 46.133 and ȳ = 12.4.

 x    y    x − x̄     y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
46   12    −0.133    −0.4      0.018      0.16        0.053
53   15     6.867     2.6     47.151      6.76       17.853
29    7   −17.133    −5.4    293.551     29.16       92.52
61   17    14.867     4.6    221.018     21.16       68.387
36   10   −10.133    −2.4    102.684      5.76       24.32
39   11    −7.133    −1.4     50.884      1.96        9.987
47   11     0.867    −1.4      0.751      1.96       −1.213
49   12     2.867    −0.4      8.218      0.16       −1.147
52   14     5.867     1.6     34.418      2.56        9.387
38    9    −8.133    −3.4     66.151     11.56       27.653
55   16     8.867     3.6     78.618     12.96       31.92
32    8   −14.133    −4.4    199.751     19.36       62.187
57   18    10.867     5.6    118.084     31.36       60.853
54   14     7.867     1.6     61.884      2.56       12.587
44   12    −2.133    −0.4      4.551      0.16        0.853

The calculations are shown in the table above. Averaging each of the last three columns
over the n = 15 points, we find
$$\frac{1}{n}\sum (x - \bar{x})^2 = 85.848, \quad \frac{1}{n}\sum (y - \bar{y})^2 = 9.84, \quad \frac{1}{n}\sum (x - \bar{x})(y - \bar{y}) = 27.75.$$
The common factor 1/n cancels in equation (9.2), so Pearson’s correlation coefficient is
$$\rho = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2 \cdot \frac{1}{n}\sum (y - \bar{y})^2}} = \frac{27.75}{\sqrt{85.848 \times 9.84}} = 0.95$$
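
The figures in this example can be reproduced with the following Python sketch, which uses the data of Example 10.1. The variable names are chosen here for illustration; the averages use the divisor n, matching the values quoted in the example.

```python
import math

# Data from Example 10.1.
x = [46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]
y = [12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Averages of the last three table columns (divisor n).
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

rho = cov_xy / math.sqrt(var_x * var_y)    # equation (9.1)
print(f"var_x = {var_x:.3f}, var_y = {var_y:.2f}, cov_xy = {cov_xy:.2f}, rho = {rho:.2f}")
```

The square root in the denominator matters: without it the ratio would not reproduce the value 0.95.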

Other problems reducible to linear problems

We have learnt how to obtain a linear trend for data which have a linear nature. Even
though this is extremely useful for understanding physical phenomena, in many cases it
becomes necessary to obtain a mathematical model where the data are not linear, or
where we know that the problem is not supposed to be linear. For example, we know that
the populations of biological species can grow exponentially. Therefore, if we have data
showing the growth of a biological species, we should try to find an exponential model for
the data. In this section, we are going to see some models which are reducible to the
linear model.

Exponential problem

We have seen in this course how the exponential function appears in the modeling of
time-dependent events. In this case the data are meant to fit a mathematical model like
$$y = a e^{bx} \qquad (10)$$
This problem can be reduced to a linear problem by taking the natural logarithm of both
sides (which requires y > 0 and a > 0). Doing this, we get
$$\ln y = \ln a + bx \qquad (11)$$
If we define ln y as ŷ, and ln a as â, then equation (11) can be written as
ŷ = â + bx
This is a linear equation of exactly the same form as the one that we have just studied.
Using the expressions we have learnt, we can find the values of â and b. Finally we can
return to the form of equation (10) by taking the anti-logarithm of â, that is, a = e^â.
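
The reduction can be sketched in Python as follows. The helper names `fit_line` and `fit_exponential`, and the noise-free data generated from a = 2 and b = 0.5, are assumptions for illustration. The slope is computed in the centered form Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is algebraically equivalent to equation (8.2).

```python
import math

def fit_line(xs, ys):
    """Least-squares (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta        # intercept from equation (8.3)

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by a straight-line fit of ln(y) against x."""
    if any(y <= 0 for y in ys):
        raise ValueError("the logarithmic transform needs y > 0")
    a_hat, b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(a_hat), b                # a is the anti-logarithm of a_hat

# Assumed noise-free data generated from a = 2, b = 0.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")
```

One point worth keeping in mind: this procedure minimizes the squared errors in ln y, not in y itself, so with noisy data it generally differs from a direct least-squares fit of equation (10).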

Inverse function
The inverse function appears in engineering in the study of birth and death processes. In
the context of engineering, a birth and death process refers to events such as the birth
and death of telephone calls in a telecommunication network.

The inverse function is defined as
$$y = \frac{1}{a + bx} \qquad (12)$$
This function can be converted to a linear problem by taking the reciprocal of both sides
and defining a new variable ŷ as 1/y:
$$\frac{1}{y} = a + bx$$
$$\hat{y} = a + bx \qquad (13)$$
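
A minimal Python sketch of this reduction follows. The helper names `fit_line` and `fit_inverse`, and the noise-free data generated from a = 0.5 and b = 0.25, are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta

def fit_inverse(xs, ys):
    """Fit y = 1 / (a + b * x) by a straight-line fit of 1/y against x."""
    if any(y == 0 for y in ys):
        raise ValueError("the reciprocal transform needs y != 0")
    return fit_line(xs, [1.0 / y for y in ys])

# Assumed noise-free data generated from a = 0.5, b = 0.25.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 / (0.5 + 0.25 * x) for x in xs]
a, b = fit_inverse(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")
```

As with the exponential case, the fit is least-squares in the transformed variable 1/y rather than in y.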

Exercises

9.1 An article in Concrete Research (“Near Surface Characteristics of Concrete:
Intrinsic Permeability,” Vol. 41, 1989) presented data on compressive strength x
and intrinsic permeability y of various concrete mixes and cures. Summary
quantities are n = 14, ∑yi = 572, ∑yi² = 23,530, ∑xi = 43, ∑xi² = 157.42,
and ∑xiyi = 1697.80. Assume that the two variables are related according to the
simple linear regression model.
(a) Calculate the least squares estimates of the slope and intercept.

(b) Use the equation of the fitted line to predict what permeability would be
observed when the compressive strength is x = 4.3.
(c) Give a point estimate of the mean permeability when compressive strength is
x = 3.7.
(d) Suppose that the observed value of permeability at x = 3.7 is y = 46.1.
Calculate the value of the corresponding residual.

9.2 Regression methods were used to analyze the data from a study investigating the
relationship between roadway surface temperature (x) and pavement deflection
(y). Summary quantities were n = 20, ∑yi = 12.75, ∑yi² = 8.86, ∑xi = 1478,
∑xi² = 143,215.8, and ∑xiyi = 1083.67.
(a) Calculate the least squares estimates of the slope and intercept. Graph the
regression line.
(b) Use the equation of the fitted line to predict what pavement deflection would
be observed when the surface temperature is 85ºF.
(c) What is the mean pavement deflection when the surface temperature is 90ºF?
(d) What change in mean pavement deflection would be expected for a 1ºF
change in surface temperature?

9.3 Consider the regression model developed in Exercise 9.2.
(a) Suppose that temperature is measured in ºC rather than ºF. Write the new
regression model that results.
(b) What change in expected pavement deflection is associated with a 1ºC change
in surface temperature?

9.4 Montgomery, Peck, and Vining (2001) present data concerning the performance
of the 28 National Football League teams in 1976. It is suspected that the number
of games won (y) is related to the number of yards gained rushing by an opponent
(x). The data are shown in the following table.
Team              Games    Yards rushing      Team              Games    Yards rushing
                  won (y)  by opponent (x)                      won (y)  by opponent (x)
Washington        10       2205               Minnesota         11       2096
New England       11       1847               Oakland           13       1903
Pittsburgh        10       1457               Baltimore         11       1848
Los Angeles       10       1564               Dallas            11       1821
Atlanta            4       2577               Buffalo            2       2476
Chicago            7       1984               Cincinnati        10       1917
Cleveland          9       1761               Denver             9       1709
Detroit            6       1901               Green Bay          5       2288
Houston            5       2072               Kansas City        5       2861
Miami              6       2411               New Orleans        4       2289
New York Giants    3       2203               New York Jets      3       2592
Philadelphia       4       2053               St. Louis         10       1979
San Diego          6       2048               San Francisco      8       1786
Seattle            2       2876               Tampa Bay          0       2560
(a) Calculate the least squares estimates of the slope and intercept. What is the
estimate of σ²? Graph the regression model.
(b) Find an estimate of the mean number of games won if the opponents can be
limited to 1800 yards rushing.
(c) What change in the expected number of games won is associated with a
decrease of 100 yards rushing by an opponent?
(d) To increase by 1 the mean number of games won, how much decrease in
rushing yards must be generated by the defense?
(e) Given that x = 1917 yards (Cincinnati), find the fitted value of y and the
corresponding residual.

9.5 An article in Technometrics by S. C. Narula and J. F. Wellington (“Prediction,
Linear Regression, and a Minimum Sum of Relative Errors,” Vol. 19, 1977)
presents data on the selling price and annual taxes for 24 houses. The data are
shown in the following table.
Sale         Taxes (Local, School,   Sale         Taxes (Local, School,
Price/1000   County)/1000            Price/1000   County)/1000
25.9 4.9176 29.5 5.0208
27.9 4.5429 25.9 4.5573
29.9 5.0597 29.9 3.8910
30.9 5.8980 28.9 5.6039
35.9 5.8282 31.5 5.3003
31.0 6.2712 30.9 5.9592
30.0 5.0500 36.9 8.2464
41.9 6.6969 40.5 7.7841
43.9 9.0384 37.5 5.9894
37.9 7.5422 44.5 8.7951
37.9 6.0831 38.9 8.3607
36.9 8.1400 45.8 9.1416
(a) Assuming that a simple linear regression model is appropriate, obtain the least
squares fit relating selling price to taxes paid. What is the estimate of σ²?
(b) Find the mean selling price given that the taxes paid are x = 7.50.
(c) Calculate the fitted value of y corresponding to x = 5.8980. Find the
corresponding residual.
(d) Calculate the fitted value ŷi for each value of xi used to fit the model. Then
construct a graph of ŷi versus the corresponding observed value yi and comment on
what this plot would look like if the relationship between y and x were a
deterministic (no random error) straight line. Does the plot actually obtained
indicate that taxes paid is an effective regressor variable in predicting selling
price?

9.6 The number of pounds of steam used per month by a chemical plant is thought to
be related to the average ambient temperature (in ºF) for that month. The past
year’s usage and temperature are shown in the following table:
Month Temp. Usage/1000 Month Temp. Usage/1000
Jan. 21 185.79 July 68 621.55
Feb. 24 214.47 Aug. 74 675.06
Mar. 32 288.03 Sept. 62 562.03
Apr. 47 424.84 Oct. 50 452.93
May 50 454.58 Nov. 41 369.95
June 59 539.03 Dec. 30 273.98
(a) Assuming that a simple linear regression model is appropriate, fit the
regression model relating steam usage (y) to the average temperature (x).
What is the estimate of σ²?
(b) What is the estimate of expected steam usage when the average temperature is
55ºF?
(c) What change in mean steam usage is expected when the monthly average
temperature changes by 1ºF?
(d) Suppose the monthly average temperature is 47ºF. Calculate the fitted value of
y and the corresponding residual.

9.7 The data shown in the following table are highway gasoline mileage performance
and engine displacement for a sample of 20 cars.
Make Model MPG (highway) Engine displacement (in³)
Acura Legend 30 97
BMW 735i 19 209
Buick Regal 29 173
Chevrolet Cavalier 32 121
Chevrolet Celebrity 30 151
Chrysler Conquest 24 156
Dodge Aries 30 135
Dodge Dynasty 28 181
Ford Escort 31 114
Ford Mustang 25 302
Ford Taurus 27 153
Ford Tempo 33 90
Honda Accord 30 119
Mazda RX-7 23 80
Mercedes 260E 24 159
Mercury Tracer 29 97
Nissan Maxima 26 181
Oldsmobile Cutlass 29 173
Plymouth Laser 37 122
Pontiac Grand Prix 29 173

(a) Fit a simple linear model relating highway miles per gallon (y) to engine
displacement (x) using least squares.
(b) Find an estimate of the mean highway gasoline mileage performance for a car
with 150 cubic inches engine displacement.
(c) Obtain the fitted value of y and the corresponding residual for a car, the Ford
Escort, with engine displacement of 114 cubic inches.

9.8 An article in the Tappi Journal (March, 1986) presented data on green liquor
Na₂S concentration (in grams per liter) and paper machine production (in tons per
day). The data (read from a graph) are shown as follows:
y 40 42 49 46 44 48
x 825 830 890 895 890 910
y 46 43 53 52 54 57 58
x 915 960 990 1010 1012 1030 1050
(a) Fit a simple linear regression model with y = green liquor Na₂S concentration
and x = production. Find an estimate of σ². Draw a scatter diagram of the data
and the resulting least squares fitted model.
(b) Find the fitted value of y corresponding to x = 910 and the associated residual.
(c) Find the mean green liquor Na₂S concentration when the production rate is
950 tons per day.

9.9 An article in the Journal of Sound and Vibration (Vol. 151, 1991, pp. 383–394)
described a study investigating the relationship between noise exposure and
hypertension. The following data are representative of those reported in the
article.
y 1 0 1 2 5 1 4 6 2 3
x 60 63 65 70 70 70 80 90 80 80
y 5 4 6 8 4 5 7 9 7 6
x 85 89 90 90 90 90 94 100 100 100
(a) Draw a scatter diagram of y (blood pressure rise in millimeters of mercury)
versus x (sound pressure level in decibels). Does a simple linear regression
model seem reasonable in this situation?
(b) Fit the simple linear regression model using least squares. Find an estimate of
σ².
(c) Find the predicted mean rise in blood pressure level associated with a sound
pressure level of 85 decibels.

9.10 An article in Wear (Vol. 152, 1992, pp. 171–181) presents data on the fretting
wear of mild steel and oil viscosity. Representative data follow, with x = oil
viscosity and y = wear volume (10⁻⁴ cubic millimeters).
(a) Construct a scatter plot of the data. Does a simple linear regression model
appear to be plausible?
(b) Fit the simple linear regression model using least squares. Find an estimate of
σ².
(c) Predict fretting wear when viscosity x = 30.
(d) Obtain the fitted value of y when x = 22.0 and calculate the corresponding
residual.

9.11 An article in the Journal of Environmental Engineering (Vol. 115, No. 3, 1989,
pp. 608–619) reported the results of a study on the occurrence of sodium and
chloride in surface streams in central Rhode Island. The following data are
chloride concentration y (in milligrams per liter) and roadway area in the
watershed x (in percentage).
y 4.4 6.6 9.7 10.6 10.8 10.9
x 0.19 0.15 0.57 0.70 0.67 0.63
y 11.8 12.1 14.3 14.7 15.0 17.3
x 0.47 0.70 0.60 0.78 0.81 0.78
y 19.2 23.1 27.4 27.7 31.8 39.5
x 0.69 1.30 1.05 1.06 1.74 1.62
(a) Draw a scatter diagram of the data. Does a simple linear regression model
seem appropriate here?
(b) Fit the simple linear regression model using the method of least squares. Find
an estimate of σ².
(c) Estimate the mean chloride concentration for a watershed that has 1%
roadway area.
(d) Find the fitted value corresponding to x = 0.47 and the associated residual.

9.12 A rocket motor is manufactured by bonding together two types of propellants, an
igniter and a sustainer. The shear strength of the bond y is thought to be a linear
function of the age of the propellant x when the motor is cast. Twenty
observations are shown in the following table.
Observation Strength y Age x Observation Strength y Age x
Number (psi) (weeks) Number (psi) (weeks)
1 2158.70 15.50 11 2165.20 13.00
2 1678.15 23.75 12 2399.55 3.75
3 2316.00 8.00 13 1779.80 25.00
4 2061.30 17.00 14 2336.75 9.75
5 2207.50 5.00 15 1765.30 22.00
6 1708.30 19.00 16 2053.50 18.00
7 1784.70 24.00 17 2414.40 6.00
8 2575.00 2.50 18 2200.50 12.50
9 2357.90 7.50 19 2654.20 2.00
10 2277.70 11.00 20 1753.70 21.50
(a) Draw a scatter diagram of the data. Does the straight-line regression model
seem to be plausible?
(b) Find the least squares estimates of the slope and intercept in the simple linear
regression model. Find an estimate of σ².
(c) Estimate the mean shear strength of a motor made from propellant that is 20
weeks old.

9.13 The final test and exam averages for 20 randomly selected students taking a
course in engineering statistics and a course in operations research (OR) follow.
Assume that the final averages are jointly normally distributed.
(a) Find the regression line relating the statistics final average to the OR final
average.
Statistics 86 75 69 75 90 94 83 86 71 65
OR 80 81 75 81 92 95 80 81 76 72
Statistics 84 71 62 90 83 75 71 76 84 97
OR 85 72 65 93 81 70 73 72 80 98
(c) Estimate the correlation coefficient.

9.14 The weight and systolic blood pressure of 26 randomly selected males in the age
group 25 to 30 are shown in the following table. Assume that weight and blood
pressure are jointly normally distributed.
Subject Weight Systolic BP Subject Weight Systolic BP
1 165 130 14 172 153
2 167 133 15 159 128
3 180 150 16 168 132
4 155 128 17 174 149
5 212 151 18 183 158
6 175 146 19 215 150
7 190 150 20 195 163
8 210 140 21 180 156
9 200 148 22 143 124
10 149 125 23 240 170
11 158 133 24 235 165
12 169 135 25 192 160
13 170 150 26 187 159
(a) Find a regression line relating systolic blood pressure to weight.
(b) Test for significance of regression using α = 0.05.
(c) Estimate the correlation coefficient.

9.15 The following data gave X = the water content of snow on April 1 and Y = the
yield from April to July (in inches) on the Snake River watershed in Wyoming for
1919 to 1935. (The data were taken from an article in Research Notes, Vol. 61,
1950, Pacific Northwest Forest Range Experiment Station, Oregon)
x y x y
23.1 10.5 37.9 22.8
32.8 16.7 30.5 14.1
31.8 18.2 25.1 12.9
32.0 17.0 12.4 8.8
30.4 16.3 35.1 17.4
24.0 10.5 31.5 14.9
39.5 23.1 21.1 10.5
24.2 12.4 27.6 16.1
52.5 24.9
(a) Draw a scatter diagram of these data. Does a straight-line relationship seem
plausible?
(b) Fit a simple linear regression model to these data.
(c) Find the correlation coefficient.

Supplemental Exercises

9.17 An article in the IEEE Transactions on Instrumentation and Measurement
(“Direct, Fast, and Accurate Measurement of VT and K of MOS Transistor Using
VT-Sift Circuit,” Vol. 40, 1991, pp. 951–955) described the use of a simple linear
regression model to express drain current y (in milliamperes) as a function of
ground-to-source voltage x (in volts). The data are as follows:
y x y x
0.734 1.1 1.50 1.6
0.886 1.2 1.66 1.7
1.04 1.3 1.81 1.8
1.19 1.4 1.97 1.9
1.35 1.5 2.12 2.0
(a) Estimate the correlation between Y and X.
(b) Find a least-square regression fit between x and y.

9.18 The strength of paper used in the manufacture of cardboard boxes (y) is related to
the percentage of hardwood concentration in the original pulp (x). Under
controlled conditions, a pilot plant manufactures 16 samples, each from a
different batch of pulp, and measures the tensile strength. The data are shown in
the table that follows:
y 101.4 117.4 117.1 106.2 131.9 146.9 146.8 133.9
x 1.0 1.5 1.5 1.5 2.0 2.0 2.2 2.4
y 111.0 123.0 125.1 145.2 134.3 144.5 143.7 146.9
x 2.5 2.5 2.8 2.8 3.0 3.0 3.2 3.3
(a) Fit a simple linear regression model to the data.
(b) Find the correlation coefficient.
(c) Construct a 95% confidence interval on the mean strength at x = 2.5.

9.19 The vapor pressure of water at various temperatures follows:
Observation   Temperature   Vapor pressure
Number, i     (K)           (mm Hg)
 1            273             4.6
 2            283             9.2
 3            293            17.5
 4            303            31.8
 5            313            55.3
 6            323            92.5
 7            333           149.4
 8            343           233.7
 9            353           355.1
10            363           525.8
11            373           760.0
Draw a scatter diagram of these data. What type of relationship seems appropriate
in relating y to x? Fit a linear regression model to these data. Find the correlation
coefficient.

9.20 An electric utility is interested in developing a model relating peak hour demand
(y, in kilowatts) to total monthly energy usage during the month (x, in kilowatt
hours). Data for 50 residential customers are shown in the following table.
Customer x y Customer x y
1 679 0.79 26 1434 0.31
2 292 0.44 27 837 4.20
3 1012 0.56 28 1748 4.88
4 493 0.79 29 1381 3.48
5 582 2.70 30 1428 7.58
6 1156 3.64 31 1255 2.63
7 997 4.73 32 1777 4.99
8 2189 9.50 33 370 0.59
9 1097 5.34 34 2316 8.19
10 2078 6.85 35 1130 4.79
11 1818 5.84 36 463 0.51
12 1700 5.21 37 770 1.74
13 747 3.25 38 724 4.10
14 2030 4.43 39 808 3.94
15 1643 3.16 40 790 0.96
16 414 0.50 41 783 3.29
17 354 0.17 42 406 0.44
18 1276 1.88 43 1242 3.24
19 745 0.77 44 658 2.14
20 795 3.70 45 1746 5.71
21 540 0.56 46 895 4.12
22 874 1.56 47 1114 1.90
23 1543 5.28 48 413 0.51
24 1029 0.64 49 1787 8.33
25 710 4.00 50 3560 14.94
(a) Draw a scatter diagram of y versus x.
(b) Fit the simple linear regression model.

9.21 Consider the following data. Suppose that the relationship between Y and x is
hypothesized to be Y = (β0 + β1x + ε)⁻¹. Fit an appropriate model to the data. Does
the assumed model form seem reasonable?
X 10 15 18 12 9 8 11 6
Y 0.1 0.13 0.09 0.15 0.20 0.21 0.18 0.24

9.22 The following data, adapted from Montgomery, Peck, and Vining (2001), present
the number of certified mental defectives per 10,000 of estimated population in
the United Kingdom ( y) and the number of radio receiver licenses issued (x) by
the BBC (in millions) for the years 1924 through 1937. Fit a regression model
relating y and x. Comment on the model. Specifically, does the existence of a
strong correlation imply a cause-and-effect relationship?
Year y x Year y x
1924 8 1.350 1931 16 4.620
1925 8 1.960 1932 18 5.497
1926 9 2.270 1933 19 6.260
1927 10 2.483 1934 20 7.012
1928 11 2.730 1935 21 7.618
1929 11 3.091 1936 22 8.131
1930 12 3.674 1937 23 8.593

9.23 An article in Air and Waste (“Update on Ozone Trends in California’s South
Coast Air Basin,” Vol. 43, 1993) studied the ozone levels on the South Coast air
basin of California for the years 1976–1991. The author believes that the number
of days that the ozone level exceeds 0.20 parts per million depends on the
seasonal meteorological index (the seasonal average 850 millibar temperature).
The data follow:
Year Days Index Year Days Index
1976 91 16.7 1984 81 18.0
1977 105 17.1 1985 65 17.2
1978 106 18.2 1986 61 16.9
1979 108 18.1 1987 48 17.1
1980 88 17.2 1988 61 18.2
1981 91 18.2 1989 43 17.3
1982 58 16.0 1990 33 17.5
1983 82 17.2 1991 36 16.6
(a) Construct a scatter diagram of the data.
(b) Fit a simple linear regression model to the data. Find the correlation
coefficient.

9.24 An article in the Journal of Applied Polymer Science (Vol. 56, pp. 471–476,
1995) studied the effect of the mole ratio of sebacic acid on the intrinsic viscosity
of copolyesters. The data follow:
Mole ratio
X 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3
Viscosity
Y 0.45 0.20 0.34 0.58 0.70 0.57 0.55 0.44
(a) Construct a scatter diagram of the data.
(b) Fit a simple linear regression model.
(c) Test for significance of regression. Calculate R² for the model.
(d) Analyze the residuals and comment on model adequacy.

9.25 The grams of solids removed from a material ( y) is thought to be related to the
drying time. Ten observations obtained from an experimental study follow:
Y 4.3 1.5 1.8 4.9 4.2 4.8 5.8 6.2 7.0 7.9
X 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
(a) Construct a scatter diagram for these data.
(b) Fit a simple linear regression model.
(c) Test for correlation coefficient.
(d) Based on these data, what is your estimate of the mean grams of solids
removed at 4.25 hours? Find a 95% confidence interval on the mean.

9.26 Two different methods can be used for measuring the temperature of the solution
in a Hall cell used in aluminum smelting: a thermocouple implanted in the cell
and an indirect measurement produced from an IR device. The indirect method is
preferable because the thermocouples are eventually destroyed by the solution.
Consider the following 10 measurements:
Thermocouple 921 935 916 920 940 936 925 940 933 927
IR 918 934 924 921 945 930 919 943 932 935
(a) Construct a scatter diagram for these data, letting x = thermocouple
measurement and y = IR measurement.
(b) Fit a simple linear regression model.
(c) Test for significance of regression and calculate R². What conclusions can you
draw?
