Lecture 09
Regression
Introduction
Many engineering and scientific problems are concerned with determining a relationship
between a set of variables. For instance, in a chemical process, we might be interested in
the relationship between the output of the process, the temperature at which it occurs, and
the amount of catalyst employed. In another instance, we may wish to understand the
response of the atmospheric temperature with the change in pressure, particulate content,
and humidity of the atmosphere. Knowledge of such dependencies helps us to understand
the chemical and physical dynamics of the systems.
In many situations, there is a single response variable y, also called the dependent
variable, which depends on the values of a set of input variables, also called independent
variables, x1, x2, . . ., xr. The simplest type of relationship between the dependent variable y
and the input variables x1, x2, . . ., xr is a linear relationship. That is, for some constants
β0, β1, . . ., βr the equation
y = β0 + β1x1 + β2x2 + . . . + βrxr (1)
would hold. If this were the relationship between y and the xi (i = 1, 2, . . ., r), then it would
be possible (once the β’s were learned) to predict exactly the response for any set of input
values. However, in practice, such precision is almost never attainable, and the most that
one can expect is that equation (1) would be valid subject to a random error. By this we
mean that the explicit relationship is
y = β0 + β1x1 + β2x2 + . . . + βrxr + e (2)
where e, representing the random error, is assumed to be a random variable having mean
0. Equation (2) is called a linear regression equation. We say that it describes the regression
of y on the set of independent variables x1, x2, . . ., xr. The quantities β0, β1, . . ., βr are
called the regression coefficients, and must usually be estimated from a set of data. A
regression equation containing a single independent variable is called a simple regression
equation, whereas one containing many independent variables is called a multiple
regression equation. We will learn the fundamental techniques of regression through the
simple regression equation, and will then extend these concepts to the multiple regression
case.
Thus, a simple linear regression model supposes a linear relationship between the mean
response and the value of a single independent variable. It can be expressed as
y = α + βx + e (3)
where x is the value of the independent variable, also called the input level, y is the
response, and e, representing the random error, is a random variable having mean 0.
Let us put things in perspective. We have a set of points (xi, yi), shown graphically in
figure 1. We want to pass a straight line through this set of points. Since we do not
have any reason to assume that all the points lie exactly on a straight line, each point
(xi, yi) will have an error, ei, associated with it, in relation to the straight line. Our
objective is to choose the values of α and β in such a way that the line can be considered
a ‘best fit’.
Figure 1. The points (xi, yi), a straight line through them, and the error e.
For a given set of points (xi, yi), to obtain the expressions for α and β, we start with the
expression of error from equation (3).
ei = yi – α – βxi
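Squaring this, we get the square of the error.
$$ e_i^2 = (y_i - \alpha - \beta x_i)^2 $$
Summing over all the points gives the sum of squared errors,
$$ S = \sum_i e_i^2 = \sum_i (y_i - \alpha - \beta x_i)^2 \qquad (4) $$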
Our intent is to develop a model so that S is a minimum. In equation (4), the quantities xi
and yi are the given set of points; hence fixed. The quantity S, therefore, depends upon the
values of α and β. To minimize S, we differentiate S with respect to α and β
separately, and set those equal to zero. Therefore,
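$$ \frac{\partial S}{\partial \alpha} = -2 \sum_i (y_i - \alpha - \beta x_i) $$
and
$$ \frac{\partial S}{\partial \beta} = -2 \sum_i (y_i - \alpha - \beta x_i) x_i $$
Setting these derivatives equal to zero gives
$$ \sum_i (y_i - \alpha - \beta x_i) = 0 \qquad (5.1) $$
and
$$ \sum_i (y_i - \alpha - \beta x_i) x_i = 0 \qquad (5.2) $$
Rearranging, these become
$$ n\alpha + \beta \sum_i x_i = \sum_i y_i \qquad (6.1) $$
$$ \alpha \sum_i x_i + \beta \sum_i x_i^2 = \sum_i x_i y_i \qquad (6.2) $$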
In these equations, n is the number of data points in the set (xi, yi). These two equations
give us a set of linear equations for α and β. We recognize that if we divide equation (6.1)
by n, it can be written as
$$ \alpha + \beta \bar{x} = \bar{y} \qquad (7) $$
where x̄ and ȳ are the sample means of the xi and the yi.
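As a quick numerical sanity check on these conditions, the following sketch solves the two linear equations n α + β Σxi = Σyi and α Σxi + β Σxi² = Σxi yi for a small data set and confirms that both derivative conditions vanish and that nearby values of α and β give a larger sum of squared errors. The data and all variable names are hypothetical, chosen only for illustration.

```python
# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)


def S(alpha, beta):
    """Sum of squared errors for the line y = alpha + beta * x."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solve the two linear equations by eliminating alpha.
beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
alpha = (sy - beta * sx) / n

# Both derivative conditions should vanish at the solution (up to rounding).
r1 = sum(y - alpha - beta * x for x, y in zip(xs, ys))
r2 = sum((y - alpha - beta * x) * x for x, y in zip(xs, ys))

# Perturbing alpha or beta away from the solution should increase S.
s_min = S(alpha, beta)
perturbed = [S(alpha + da, beta + db) for da in (-0.1, 0.1) for db in (-0.1, 0.1)]

print(alpha, beta, s_min)
```

The perturbation check is not a proof that the solution is a minimum, only an illustration; it works here because S is a convex quadratic in α and β whenever the xi are not all equal.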
Solving these equations, we can find the solutions for α and β as
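$$ \alpha = \frac{\sum_i y_i \sum_i x_i^2 - \sum_i x_i \sum_i x_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \qquad (8.1) $$
$$ \beta = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \qquad (8.2) $$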
Using equation (7), equation (8.1) can also be written as
$$ \alpha = \bar{y} - \beta \bar{x} \qquad (8.3) $$
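The elimination step that leads to the slope formula is not written out in the text. One way to sketch it, starting from the two linear equations n α + β Σxi = Σyi and α Σxi + β Σxi² = Σxi yi, is to solve the first for α, substitute into the second, and multiply through by n:

```latex
\begin{aligned}
\alpha &= \frac{1}{n}\Big(\sum_i y_i - \beta \sum_i x_i\Big) \\
\frac{1}{n}\sum_i x_i \sum_i y_i - \frac{\beta}{n}\Big(\sum_i x_i\Big)^2 + \beta \sum_i x_i^2 &= \sum_i x_i y_i \\
\beta \Big( n \sum_i x_i^2 - \Big(\sum_i x_i\Big)^2 \Big) &= n \sum_i x_i y_i - \sum_i x_i \sum_i y_i
\end{aligned}
```

Dividing the last line by the bracket on the left gives the expression for β, and the first line is the intercept written in terms of the sample means.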
Example 10.1
The raw material used in the production of a certain synthetic fiber is stored in a location
without humidity control. Measurements of the relative humidity in the storage location
and the moisture content of a sample of the raw material were taken over 15 days. The
following data were recorded.
Relative humidity: 46 53 29 61 36 39 47 49 52 38 55 32 57 54 44
Moisture content:  12 15  7 17 10 11 11 12 14  9 16  8 18 14 12
Construct a least-squares regression model.
Figure 2. Example 10.1: moisture content versus relative humidity.
The points and the least-squares regression line are shown in figure 2.
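The fitted coefficients for this example are not listed in the text. A minimal sketch of the calculation, assuming the slope β = (n Σxy − Σx Σy) / (n Σx² − (Σx)²) and the intercept α = ȳ − β x̄, is given below; the variable names are illustrative.

```python
# Data of Example 10.1: relative humidity (x) and moisture content (y).
humidity = [46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]
moisture = [12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12]

n = len(humidity)
sx = sum(humidity)                                      # sum of x_i
sy = sum(moisture)                                      # sum of y_i
sxx = sum(x * x for x in humidity)                      # sum of x_i^2
sxy = sum(x * y for x, y in zip(humidity, moisture))    # sum of x_i * y_i

# Slope from the closed-form least-squares expression.
beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
# Intercept from the condition that the line passes through the means.
alpha = sy / n - beta * sx / n

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

With these data the sketch gives an intercept of about −2.51 and a slope of about 0.323, that is, moisture content ≈ −2.51 + 0.323 × relative humidity.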
Goodness of fit
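A standard way to assess how well the linear model fits the data is Pearson’s
correlation coefficient. It is defined as
$$ \rho = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} \qquad (9.1) $$
or, equivalently, in terms of the data,
$$ \rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \qquad (9.2) $$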
Both these expressions can be used to find the correlation coefficient. Because ρ is
normalized by the spread of x and of y, it does not depend on the units in which the two
variables are measured, which makes it a robust measure of the dependency between x and
y. We state without proof that the value of ρ ranges between –1 and 1.
The correlation coefficient gives us an estimate of how well a line y = α + βx would fit the
data series x and y. A correlation coefficient of 1 indicates that the points (x, y) are
exactly on the line y = α + βx with β > 0. On the other hand, a correlation coefficient of
–1 indicates that the points (x, y) are exactly on the line y = α + βx with β < 0. The different
cases of the association between x and y, and the corresponding correlation coefficient ρ,
are shown in figure 4.
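The two limiting cases can be illustrated with a short sketch. It assumes the data form of the definition, ρ = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² Σ(y − ȳ)²); the function name `pearson` and the sample points are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
rising = [1.0 + 2.0 * x for x in xs]    # points exactly on a line with beta > 0
falling = [1.0 - 2.0 * x for x in xs]   # points exactly on a line with beta < 0
scattered = [1.0, 4.0, 2.0, 5.0, 3.0]   # points that do not lie on a line

rho_rising = pearson(xs, rising)
rho_falling = pearson(xs, falling)
rho_scattered = pearson(xs, scattered)
print(rho_rising, rho_falling, rho_scattered)
```

The first two values come out at +1 and –1 up to floating-point rounding, while the scattered points give a value strictly between the two.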
Example 2
We will find the correlation coefficient for the data given in Example 10.1.
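x    y    x − x̄    y − ȳ    (x − x̄)²    (y − ȳ)²    (x − x̄)(y − ȳ)
The calculations are shown through the table above. From the table, averaging each
column over the n = 15 data points, we find
$$ \frac{1}{n}\sum_i (x_i - \bar{x})^2 = 85.848, \quad \frac{1}{n}\sum_i (y_i - \bar{y})^2 = 9.84, \quad \frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y}) = 27.75 $$
The common factor 1/n cancels in equation (9.2), so Pearson’s correlation coefficient is
$$ \rho = \frac{27.75}{\sqrt{85.848 \times 9.84}} = 0.95 $$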
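Since the rows of the table are not reproduced here, the quoted values can be checked directly from the humidity and moisture data. The sketch below assumes that each quoted quantity is a sum of deviations divided by n = 15, which is what reproduces 85.848, 9.84 and 27.75; variable names are illustrative.

```python
from math import sqrt

# Data of the humidity example: relative humidity (x), moisture content (y).
x = [46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]
y = [12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Averages of the squared deviations and of the cross products.
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# The factor 1/n cancels between numerator and denominator.
rho = cov_xy / sqrt(var_x * var_y)

print(f"var_x = {var_x:.3f}, var_y = {var_y:.2f}, cov_xy = {cov_xy:.2f}")
print(f"rho = {rho:.2f}")
```

The result agrees with the value of 0.95 quoted in the example.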
We have learnt how to obtain a linear trend for data which have a linear nature. Even
though this is extremely useful for understanding physical phenomena, in many cases it
becomes necessary to obtain a mathematical model where the data are not linear, or
where we know that the problem is not supposed to be linear. For example, we know that
the populations of biological species grow exponentially. Therefore, if we have data
showing the growth of biological species, we should try to find an exponential model for
the data. In this section, we are going to see some models which are reducible to the
linear model.
Exponential problem
We have seen in this course how the exponential function appears in the modeling of
time-dependent events. In this case the data are meant to fit a mathematical model like
$$ y = a e^{bx} \qquad (10) $$
This problem can be reduced to a linear problem by taking the natural logarithm of both
sides. Doing this, we get
$$ \log y = \log a + bx \qquad (11) $$
If we define log y as ŷ, and log a as â, then equation (11) can be written as
$$ \hat{y} = \hat{a} + bx $$
This is a linear equation of exactly the same form as the one that we have just studied.
Using the expressions we have learnt, we can find the values of â and b. Finally, we can
return to the form of equation (10) by taking the anti-logarithm of â, that is, a = e^â.
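The reduction can be sketched in a few lines: take logarithms of the y values, fit a straight line to (x, log y), and exponentiate the intercept. The data below are hypothetical and generated without noise from y = 2 e^(0.5x), so the fit should recover a = 2 and b = 0.5; the helper `fit_line` is an illustrative name.

```python
from math import exp, log


def fit_line(xs, ys):
    """Least-squares intercept and slope of the line y = alpha + beta * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = (sy - beta * sx) / n
    return alpha, beta


# Hypothetical noise-free data from y = 2 * exp(0.5 * x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * exp(0.5 * x) for x in xs]

# Fit the linearized model log y = log a + b x.
log_a, b = fit_line(xs, [log(y) for y in ys])
a = exp(log_a)  # anti-logarithm recovers a

print(a, b)
```

Note that this minimizes the squared error in log y rather than in y itself, so with noisy data the result can differ from a direct nonlinear fit of the exponential model.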
Inverse function
The inverse function appears in engineering in the understanding of birth and death
processes. In the context of engineering, a birth and death process refers to events such as
the birth and death of telephone calls in a telecommunication network.
Exercises
(b) Use the equation of the fitted line to predict what permeability would be
observed when the compressive strength is x = 4.3.
(c) Give a point estimate of the mean permeability when compressive strength is
x = 3.7.
(d) Suppose that the observed value of permeability at x = 3.7 is y = 46.1.
Calculate the value of the corresponding residual.
9.2 Regression methods were used to analyze the data from a study investigating the
relationship between roadway surface temperature (x) and pavement deflection
(y). Summary quantities were n = 20, ∑ yi = 12.75, ∑ yi² = 8.86, ∑ xi = 1478,
∑ xi² = 143,215.8, and ∑ xi yi = 1083.67.
(a) Calculate the least squares estimates of the slope and intercept. Graph the
regression line.
(b) Use the equation of the fitted line to predict what pavement deflection would
be observed when the surface temperature is 85ºF.
(c) What is the mean pavement deflection when the surface temperature is 90ºF?
(d) What change in mean pavement deflection would be expected for a 1ºF
change in surface temperature?
9.4 Montgomery, Peck, and Vining (2001) present data concerning the performance
of the 28 National Football League teams in 1976. It is suspected that the number
of games won (y) is related to the number of yards gained rushing by an opponent
(x). The data are shown in the following table.
Teams    Games Won (y)    Yards Rushing by Opponent (x)    Teams    Games Won (y)    Yards Rushing by Opponent (x)
Washington 10 2205 Minnesota 11 2096
New England 11 1847 Oakland 13 1903
Pittsburgh 10 1457 Baltimore 11 1848
Los Angeles 10 1564 Dallas 11 1821
Atlanta 4 2577 Buffalo 2 2476
Chicago 7 1984 Cincinnati 10 1917
Cleveland 9 1761 Denver 9 1709
Detroit 6 1901 Green Bay 5 2288
Houston 5 2072 Kansas City 5 2861
Miami 6 2411 New Orleans 4 2289
New York Giants 3 2203 New York Jets 3 2592
Philadelphia 4 2053 St. Louis 10 1979
9.6 The number of pounds of steam used per month by a chemical plant is thought to
be related to the average ambient temperature (in ºF) for that month. The past
year’s usage and temperature are shown in the following table:
Month Temp. Usage/1000 Month Temp. Usage/1000
Jan. 21 185.79 July 68 621.55
Feb. 24 214.47 Aug. 74 675.06
Mar. 32 288.03 Sept. 62 562.03
Apr. 47 424.84 Oct. 50 452.93
May 50 454.58 Nov. 41 369.95
June 59 539.03 Dec. 30 273.98
(a) Assuming that a simple linear regression model is appropriate, fit the
regression model relating steam usage (y) to the average temperature (x).
What is the estimate of σ²?
(b) What is the estimate of expected steam usage when the average temperature is
55ºF?
(c) What change in mean steam usage is expected when the monthly average
temperature changes by 1ºF?
(d) Suppose the monthly average temperature is 47ºF. Calculate the fitted value of
y and the corresponding residual.
9.7 The data shown in the following table are highway gasoline mileage performance
and engine displacement for a sample of 20 cars.
Make    Model    MPG (highway)    Engine Displacement (in³)
Acura Legend 30 97
BMW 735i 19 209
Buick Regal 29 173
Chevrolet Cavalier 32 121
Chevrolet Celebrity 30 151
Chrysler Conquest 24 156
Dodge Aries 30 135
Dodge Dynasty 28 181
Ford Escort 31 114
Ford Mustang 25 302
Ford Taurus 27 153
Ford Tempo 33 90
Honda Accord 30 119
Mazda RX-7 23 80
Mercedes 260E 24 159
Mercury Tracer 29 97
Nissan Maxima 26 181
Oldsmobile Cutlass 29 173
Plymouth Laser 37 122
Pontiac Grand Prix 29 173
(a) Fit a simple linear model relating highway miles per gallon (y) to engine
displacement (x) using least squares.
(b) Find an estimate of the mean highway gasoline mileage performance for a car
with 150 cubic inches engine displacement.
(c) Obtain the fitted value of y and the corresponding residual for a car, the Ford
Escort, with engine displacement of 114 cubic inches.
9.8 An article in the Tappi Journal (March, 1986) presented data on green liquor
Na2S concentration (in grams per liter) and paper machine production (in tons per
day). The data (read from a graph) are shown as follows:
y 40 42 49 46 44 48
x 825 830 890 895 890 910
y 46 43 53 52 54 57 58
x 915 960 990 1010 1012 1030 1050
(a) Fit a simple linear regression model with y = green liquor Na2S concentration
and x = production. Find an estimate of σ². Draw a scatter diagram of the data
and the resulting least squares fitted model.
(b) Find the fitted value of y corresponding to x = 910 and the associated residual.
(c) Find the mean green liquor Na2S concentration when the production rate is
950 tons per day.
9.9 An article in the Journal of Sound and Vibration (Vol. 151, 1991, pp. 383–394)
described a study investigating the relationship between noise exposure and
hypertension. The following data are representative of those reported in the
article.
y 1 0 1 2 5 1 4 6 2 3
x 60 63 65 70 70 70 80 90 80 80
y 5 4 6 8 4 5 7 9 7 6
x 85 89 90 90 90 90 94 100 100 100
(a) Draw a scatter diagram of y (blood pressure rise in millimeters of mercury)
versus x (sound pressure level in decibels). Does a simple linear regression
model seem reasonable in this situation?
(b) Fit the simple linear regression model using least squares. Find an estimate of
σ².
(c) Find the predicted mean rise in blood pressure level associated with a sound
pressure level of 85 decibels.
9.10 An article in Wear (Vol. 152, 1992, pp. 171–181) presents data on the fretting
wear of mild steel and oil viscosity. Representative data follow, with x = oil
viscosity and y = wear volume (10⁻⁴ cubic millimeters).
(a) Construct a scatter plot of the data. Does a simple linear regression model
appear to be plausible?
(b) Fit the simple linear regression model using least squares. Find an estimate of
σ².
(c) Predict fretting wear when viscosity x = 30.
(d) Obtain the fitted value of y when x = 22.0 and calculate the corresponding
residual.
9.11 An article in the Journal of Environmental Engineering (Vol. 115, No. 3, 1989,
pp. 608–619) reported the results of a study on the occurrence of sodium and
chloride in surface streams in central Rhode Island. The following data are
chloride concentration y (in milligrams per liter) and roadway area in the
watershed x (in percentage).
y 4.4 6.6 9.7 10.6 10.8 10.9
x 0.19 0.15 0.57 0.70 0.67 0.63
y 11.8 12.1 14.3 14.7 15.0 17.3
x 0.47 0.70 0.60 0.78 0.81 0.78
y 19.2 23.1 27.4 27.7 31.8 39.5
x 0.69 1.30 1.05 1.06 1.74 1.62
(a) Draw a scatter diagram of the data. Does a simple linear regression model
seem appropriate here?
(b) Fit the simple linear regression model using the method of least squares. Find
an estimate of σ².
(c) Estimate the mean chloride concentration for a watershed that has 1%
roadway area.
(d) Find the fitted value corresponding to x = 0.47 and the associated residual.
9.13 The final test and exam averages for 20 randomly selected students taking a
course in engineering statistics and a course in operations research (OR) follow.
Assume that the final averages are jointly normally distributed.
Statistics 86 75 69 75 90 94 83 86 71 65
OR 80 81 75 81 92 95 80 81 76 72
Statistics 84 71 62 90 83 75 71 76 84 97
OR 85 72 65 93 81 70 73 72 80 98
(a) Find the regression line relating the statistics final average to the OR final
average.
(c) Estimate the correlation coefficient.
9.14 The weight and systolic blood pressure of 26 randomly selected males in the age
group 25 to 30 are shown in the following table. Assume that weight and blood
pressure are jointly normally distributed.
Subject    Weight    Systolic BP    Subject    Weight    Systolic BP
1 165 130 14 172 153
2 167 133 15 159 128
3 180 150 16 168 132
4 155 128 17 174 149
5 212 151 18 183 158
6 175 146 19 215 150
7 190 150 20 195 163
8 210 140 21 180 156
9 200 148 22 143 124
10 149 125 23 240 170
11 158 133 24 235 165
12 169 135 25 192 160
13 170 150 26 187 159
(a) Find a regression line relating systolic blood pressure to weight.
(b) Test for significance of regression using α = 0.05.
(c) Estimate the correlation coefficient.
9.15 The following data gave X = the water content of snow on April 1 and Y = the
yield from April to July (in inches) on the Snake River watershed in Wyoming for
1919 to 1935. (The data were taken from an article in Research Notes, Vol. 61,
1950, Pacific Northwest Forest Range Experiment Station, Oregon)
x y x y
23.1 10.5 37.9 22.8
32.8 16.7 30.5 14.1
31.8 18.2 25.1 12.9
32.0 17.0 12.4 8.8
30.4 16.3 35.1 17.4
24.0 10.5 31.5 14.9
39.5 23.1 21.1 10.5
24.2 12.4 27.6 16.1
52.5 24.9
(a) Draw a scatter diagram of these data. Does a straight-line relationship seem
plausible?
(b) Fit a simple linear regression model to these data.
(c) Find the correlation coefficient.
Supplemental Exercises
9.18 The strength of paper used in the manufacture of cardboard boxes (y) is related to
the percentage of hardwood concentration in the original pulp (x). Under
controlled conditions, a pilot plant manufactures 16 samples, each from a
different batch of pulp, and measures the tensile strength. The data are shown in
the table that follows:
6 323 92.5
7 333 149.4
8 343 233.7
9 353 355.1
10 363 525.8
11 373 760.0
Draw a scatter diagram of these data. What type of relationship seems appropriate
in relating y to x? Fit a linear regression model to these data. Find the correlation
coefficient.
9.20 An electric utility is interested in developing a model relating peak hour demand
(y, in kilowatts) to total monthly energy usage during the month (x, in kilowatt
hours). Data for 50 residential customers are shown in the following table.
Customer x y Customer x y
1 679 0.79 26 1434 0.31
2 292 0.44 27 837 4.20
3 1012 0.56 28 1748 4.88
4 493 0.79 29 1381 3.48
5 582 2.70 30 1428 7.58
6 1156 3.64 31 1255 2.63
7 997 4.73 32 1777 4.99
8 2189 9.50 33 370 0.59
9 1097 5.34 34 2316 8.19
10 2078 6.85 35 1130 4.79
11 1818 5.84 36 463 0.51
12 1700 5.21 37 770 1.74
13 747 3.25 38 724 4.10
14 2030 4.43 39 808 3.94
15 1643 3.16 40 790 0.96
16 414 0.50 41 783 3.29
17 354 0.17 42 406 0.44
18 1276 1.88 43 1242 3.24
19 745 0.77 44 658 2.14
20 795 3.70 45 1746 5.71
21 540 0.56 46 895 4.12
22 874 1.56 47 1114 1.90
23 1543 5.28 48 413 0.51
24 1029 0.64 49 1787 8.33
25 710 4.00 50 3560 14.94
(a) Draw a scatter diagram of y versus x.
(b) Fit the simple linear regression model.
9.21 Consider the following data. Suppose that the relationship between Y and x is
hypothesized to be Y = (β0 + β1x + ε)⁻¹. Fit an appropriate model to the data. Does
the assumed model form seem reasonable?
X 10 15 18 12 9 8 11 6
9.22 The following data, adapted from Montgomery, Peck, and Vining (2001), present
the number of certified mental defectives per 10,000 of estimated population in
the United Kingdom (y) and the number of radio receiver licenses issued (x) by
the BBC (in millions) for the years 1924 through 1937. Fit a regression model
relating y and x. Comment on the model. Specifically, does the existence of a
strong correlation imply a cause-and-effect relationship?
Year y x Year y x
1924 8 1.350 1931 16 4.620
1925 8 1.960 1932 18 5.497
1926 9 2.270 1933 19 6.260
1927 10 2.483 1934 20 7.012
1928 11 2.730 1935 21 7.618
1929 11 3.091 1936 22 8.131
1930 12 3.674 1937 23 8.593
9.23 An article in Air and Waste (“Update on Ozone Trends in California’s South
Coast Air Basin,” Vol. 43, 1993) studied the ozone levels on the South Coast air
basin of California for the years 1976–1991. The author believes that the number
of days that the ozone level exceeds 0.20 parts per million depends on the
seasonal meteorological index (the seasonal average 850 millibar temperature).
The data follow:
Year Days Index Year Days Index
1976 91 16.7 1984 81 18.0
1977 105 17.1 1985 65 17.2
1978 106 18.2 1986 61 16.9
1979 108 18.1 1987 48 17.1
1980 88 17.2 1988 61 18.2
1981 91 18.2 1989 43 17.3
1982 58 16.0 1990 33 17.5
1983 82 17.2 1991 36 16.6
(a) Construct a scatter diagram of the data.
(b) Fit a simple linear regression model to the data. Find the correlation
coefficient.
9.24 An article in the Journal of Applied Polymer Science (Vol. 56, pp. 471–476,
1995) studied the effect of the mole ratio of sebacic acid on the intrinsic viscosity
of copolyesters. The data follow:
Mole ratio
X 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3
Viscosity
Y 0.45 0.20 0.34 0.58 0.70 0.57 0.55 0.44
(a) Construct a scatter diagram of the data.
(b) Fit a simple linear regression model.
(c) Test for significance of regression. Calculate R² for the model.
9.25 The grams of solids removed from a material (y) is thought to be related to the
drying time. Ten observations obtained from an experimental study follow:
Y 4.3 1.5 1.8 4.9 4.2 4.8 5.8 6.2 7.0 7.9
X 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
(a) Construct a scatter diagram for these data.
(b) Fit a simple linear regression model.
(c) Test for correlation coefficient.
(d) Based on these data, what is your estimate of the mean grams of solids
removed at 4.25 hours? Find a 95% confidence interval on the mean.
9.26 Two different methods can be used for measuring the temperature of the solution
in a Hall cell used in aluminum smelting: a thermocouple implanted in the cell
and an indirect measurement produced from an IR device. The indirect method is
preferable because the thermocouples are eventually destroyed by the solution.
Consider the following 10 measurements:
Thermocouple 921 935 916 920 940 936 925 940 933 927
IR 918 934 924 921 945 930 919 943 932 935
(a) Construct a scatter diagram for these data, letting x = thermocouple
measurement and y = IR measurement.
(b) Fit a simple linear regression model.
(c) Test for significance of regression and calculate R². What conclusions can you
draw?