Spreadsheet Problem Solving
fitting models to data
straight-line regression
multilinear regression
nonlinear regression
model building and selection
using
Data Analysis Regression tool
Trendline
Solver
Review of Straight-line Linear Regression
[ from Class #6 ]
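Model: $y = ax + b$
[Figure: data points and the model line on x-y axes. At a data point $x_i$, the error $e_i$ is the vertical distance between the measured $y_i$ and the model value $\hat{y}_i$.]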
For each data point, there is an error between that
point and the model line. Fitting the model has to do
with minimizing these errors.
Finding the model parameters that give the best fit
For the straight-line model, the model parameters are
the slope (a) and the intercept (b).
The problem is then to find the values of a and b that
give the best fit. What is meant by the best fit?
The standard measure of goodness of fit is the sum
of squares of the errors:
$$\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = a x_i + b$$
So, the problem reduces to finding the minimum of
SSE by adjusting a and b.
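As a minimal sketch of this measure (written in Python rather than Excel; the function name `sse` and the four data points are made up for illustration), SSE can be computed for any trial values of a and b:

```python
def sse(xs, ys, a, b):
    """Sum of squared errors between measured ys and the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly following y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

good = sse(xs, ys, a=2.0, b=1.0)  # line close to the trend of the data
poor = sse(xs, ys, a=1.0, b=1.0)  # slope too small
print(good, poor)
```

The better line gives the smaller SSE, which is why minimizing SSE over a and b picks out the best fit.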
Fitting a straight-line model to data
The minimization of SSE can be solved by calculus
to give formulas for the best values of a and b:
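$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$
$$b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}$$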
and Excel solves problems like this with either formulas
or built-in tools (Data Analysis Regression & Trendline).
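The closed-form formulas for a and b can be sketched outside Excel as well. The Python below is only an illustration: the function name `fit_line` and the test data (which lie exactly on y = 2x + 1) are assumptions, not part of the class example.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the model y = a*x + b."""
    n = len(xs)
    sx = sum(xs)                               # sum of x_i
    sy = sum(ys)                               # sum of y_i
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of x_i * y_i
    sxx = sum(x * x for x in xs)               # sum of x_i squared
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b


# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

Because the data are exact, the fit should return the slope and intercept used to generate them.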
Example: straight-line fit
Transfer the data to an Excel spreadsheet
and create a graph
[Figure: CO2 Emissions for the US. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1320 to 1520.]
Calculating the slope and intercept using Excel formulas
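$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$
$$b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}$$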
The formulas behind the numbers
Use the model straight-line equation to compute the predictions, then copy these to the graph, displaying them as a straight line.
[Figure: CO2 Emissions for the US with the fitted line y = 21.32x - 41090. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1300 to 1550.]
Using an alternate, shortcut approach
Trendline
Start with a simple graph of the data
Select the data series by clicking on it
Right-click on a data point to get the context-sensitive menu
Select the Add Trendline option
[Figure: CO2 Emissions for the US. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1320 to 1520.]
The Add Trendline dialog box
Linear is selected by default, which is OK for this problem
Click on the Options tab
Options tab
Set for "Display equation on chart"
Click OK
Initial form of graph with straight line added
Fix up equation display
[Figure: CO2 Emissions for the US with trendline and equation y = 21.315x - 41090. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1300 to 1550.]
[Figure: CO2 Emissions for the US after fixing up the equation display, y = 21.315x - 41090. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1300 to 1550.]
Looks just like before, but we got there quicker.
But neither of these approaches gives us much information about the model, how good it is, etc.
A 2nd alternate approach
Data Analysis Regression tool
Tools > Data Analysis
Recall that, if Data Analysis does not appear on the Tools menu, you will need to check Analysis ToolPak in the Add-ins dialog box [if it's not there, you will have to go back to Microsoft Office/Excel set-up].
Initial, empty Regression dialog box
Regression dialog box set up for our problem
Checking Residuals will also give us the model predictions
Initial (poorly formatted) Regression output display
[ on new worksheet ]
Format > Autoformat > OK, then fix up the display for appropriate significant figures
Final Display of Regression Output
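[ tons of info, most of which you will not understand for a couple of years ]
Regression statistics (R, R^2, Adjusted R^2, Standard Error): used to judge goodness of fit
Coefficients: intercept and slope values
P-values: used to judge whether terms belong in the model
Predicted values (from the Residuals option): add to data graph for visual comparison with model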
Judging Goodness of Fit
Correlation coefficient (R): if close to +1 or -1, indicates strong correlation between x and y [something we already know from the original graph!]
Coefficient of determination (R^2): percentage of the variability in y that's accounted for by the model
Standard Error: gives an idea of how far off the model predictions will be
Adjusted R^2: adjustment to R^2 that penalizes the value for using a model with too many terms
Adjusted R^2 or Standard Error can be used to compare different models and choose which fits best. The higher the value of Adjusted R^2 the better, the lower the value of Standard Error the better.
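To make these measures concrete, here is a hedged sketch of how they can be computed from measured values and model predictions. The formulas are the standard definitions for a model with an intercept; the function name `fit_statistics`, the data, and the predictions (taken from the least-squares line y = 1.94x + 1.15 through x = 1, 2, 3, 4) are illustrative assumptions.

```python
import math


def fit_statistics(ys, predictions, n_terms):
    """Return R^2, adjusted R^2 and standard error for a model with intercept.

    n_terms counts the x terms in the model, not the intercept
    (1 for a straight line).
    """
    n = len(ys)
    y_mean = sum(ys) / n
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained
    sst = sum((y - y_mean) ** 2 for y in ys)                  # total variability
    dof = n - n_terms - 1                                     # degrees of freedom
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    std_err = math.sqrt(sse / dof)
    return r2, adj_r2, std_err


# Hypothetical measured values and straight-line predictions
ys = [3.1, 4.9, 7.2, 8.8]
predictions = [3.09, 5.03, 6.97, 8.91]
r2, adj_r2, std_err = fit_statistics(ys, predictions, n_terms=1)
print(r2, adj_r2, std_err)
```

Adding a term to a model always lowers SSE a little, so R^2 alone never gets worse; the degrees-of-freedom factor in adjusted R^2 and in the standard error is what penalizes the extra term.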
Judging whether terms belong in the model
A P-value estimates the probability of getting a coefficient this far from zero by chance if its true value were zero.
A P-value of 5% (0.05) or greater causes suspicion that the coefficient may not be significant and that the term should probably be dropped from the model.
P-values that are quite small, like these, indicate that there is little question about the significance of the term coefficients. In our case here, that means that both the intercept term and the slope term belong in the model.
The Data Analysis Regression tool appears much more
complicated and involved than the shortcut Trendline tool, so . . .
Why use Data Analysis Regression?
1) It provides more information that lets us
judge the goodness of fit and significance
of model terms
2) It can handle model forms that cannot be
handled by Trendline
So, generally, when using Excel, we prefer
the Data Analysis Regression tool over Trendline
but Trendline is still quite good for quick and dirty
looks at the data
Learn to use both!
More complicated models
Polynomial models
$y = a + bx + cx^2 + dx^3 + \cdots$
Note: it is called linear regression,
even when there are nonlinear
terms in x, because the terms are
linear in the model parameters,
a, b, c, etc.
General linear models
$y = a\,f_1(x) + b\,f_2(x) + c\,f_3(x) + d\,f_4(x) + \cdots$
Examples:
polynomial models above
$y = a + \dfrac{b}{x} + c \ln x$
Multilinear models
$y = a\,f_1(x_1, x_2, \ldots) + b\,f_2(x_1, x_2, \ldots) + c\,f_3(x_1, x_2, \ldots) + \cdots$
Examples:
$y = a + b x_1 + c x_2 + d x_1 x_2$
y ae
x1
x2
Nonlinear models
Transformable to linear
$y = a\,e^{bx} \;\Rightarrow\; \ln y = \ln a + b\,x$   (straight-line regression!)
Not transformable
$P = 10^{\,A - B/(T + C)}$
We can use the Data Analysis Regression tool for everything
except the nonlinear models that can't be transformed into
linear form. For those, we can use the Solver.
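As an illustration of the transformable case, the sketch below fits y = a e^(bx) by straight-line regression of ln y against x and then converts the intercept back to a. It is written in Python as a stand-in for the spreadsheet steps; the function name `fit_exponential` and the synthetic data (generated with a = 2, b = 0.5) are assumptions.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a*exp(b*x) via straight-line regression of ln(y) on x."""
    n = len(xs)
    ln_ys = [math.log(y) for y in ys]          # requires every y > 0
    sx = sum(xs)
    sl = sum(ln_ys)
    sxl = sum(x * l for x, l in zip(xs, ln_ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxl - sx * sl) / (n * sxx - sx ** 2)
    intercept = (sl - slope * sx) / n
    return math.exp(intercept), slope          # a = exp(ln a), b = slope


# Synthetic data generated from a = 2, b = 0.5 (hypothetical values)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)
```

One design point to keep in mind: the transformed fit minimizes the squared errors in ln y, not in y, so with noisy data it can differ somewhat from a direct nonlinear fit of the original equation.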
Example: polynomial regression
[Figure: Viscosity of Water at Atmospheric Pressure; curvature evident. x-axis: Temperature (degF), 0 to 250; y-axis: Viscosity (cp), 0 to 2.]
Setting up for polynomial fits
Select for quadratic model, etc
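Setting up a polynomial fit amounts to adding one column per power of x and running linear regression on those columns. The sketch below shows the same idea in Python by building the columns and solving the least-squares normal equations directly. It is not what Excel does internally; the helper names and the synthetic quadratic data are assumptions for illustration.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_polynomial(xs, ys, order):
    """Least-squares coefficients [c0, c1, ...] of c0 + c1*x + c2*x^2 + ..."""
    # One column per model term: 1, x, x^2, ... (like extra spreadsheet columns)
    X = [[x ** k for k in range(order + 1)] for x in xs]
    m = order + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(m)]
           for i in range(m)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(m)]
    return solve_linear_system(XtX, Xty)


# Synthetic data from y = 1 - 2x + 0.5x^2 (hypothetical)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 - 2.0 * x + 0.5 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, order=2)
print(coeffs)
```

The model is nonlinear in x but linear in the coefficients, which is why a linear solve is enough. For high orders or widely spread x values the normal equations become poorly conditioned, so this direct approach is only meant for small examples.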
Data Analysis Regression tool
Check Labels because headings are included in the selections for Y and X
Check Residuals
Quadratic model regression results
model performance: adjusted R^2
model coefficients
copy to graph
Quadratic model really doesn't capture behavior of data
[Figure: Viscosity of Water at Atmospheric Pressure, showing Data and Quadratic model. x-axis: Temperature (degF), 0 to 250; y-axis: Viscosity (cp), 0 to 2.]
Continue with fits of cubic, 4th- & 5th-order polynomials
Summary of results
Looks like 5th-order offers best performance
but improvement is marginal over 4th-order.
Resulting model:
$\mathrm{Visc} = 3.161 - 0.05699\,T + 5.023\times10^{-4}\,T^2 - 2.162\times10^{-6}\,T^3 + 3.593\times10^{-9}\,T^4$
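A short sketch of using the resulting 4th-order model for predictions. The coefficient magnitudes are the ones given above; the alternating signs are an assumption (they are the choice that gives physically reasonable viscosities for water), and the function name `viscosity_cp` is made up for illustration.

```python
def viscosity_cp(temp_f):
    """4th-order polynomial model: temp_f in degF, result in cp.

    Signs of the coefficients are assumed to alternate.
    """
    coeffs = [3.161, -0.05699, 5.023e-4, -2.162e-6, 3.593e-9]  # T^0 .. T^4
    result = 0.0
    for c in reversed(coeffs):       # Horner's rule: fewer multiplications
        result = result * temp_f + c
    return result


for t in (32, 100, 212):
    print(t, round(viscosity_cp(t), 3))
```

Like any polynomial fit, the model should only be trusted inside the temperature range covered by the data.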
[Figure: Viscosity of Water at Atmospheric Pressure, showing Data with Quadratic, Cubic and 4th Order models. x-axis: Temperature (degF), 20 to 220; y-axis: Viscosity (cp), 0 to 2.]
Precautions on polynomial fitting
Try to use the lowest-order model that gives a good fit.
Higher-order models will have wiggles between data
points that will cause prediction errors.
In fact, an (n-1)th-order polynomial will provide a perfect
fit to the n data points, but it will usually do bizarre things
in between the data points.
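The perfect-fit claim can be checked with a small sketch. Lagrange interpolation is used here as one convenient way to evaluate the unique (n-1)th-order polynomial through n points; the function name `lagrange_eval` and the six nearly flat data points are assumptions for illustration.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique (n-1)th-order polynomial through n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical data: nearly flat, with small scatter between 0.8 and 1.2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.2, 0.9, 1.1, 0.8, 1.0]

# The 5th-order polynomial reproduces every data point ...
for xi, yi in zip(xs, ys):
    print(xi, yi, lagrange_eval(xs, ys, xi))

# ... but check what it does between the first two points
print(lagrange_eval(xs, ys, 0.5))
```

For this made-up data the value at x = 0.5 comes out near 1.41, above every measured y, even though the neighboring points are 1.0 and 1.2.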
Example: multi-linear regression
Model 1: $y = a + b x_1 + c x_2$
Model 2: $y = b x_1 + c x_2$
X-input range includes two independent variables: x1 and x2
High P-value for the intercept in Model 1 suggests Model 2 without intercept, but there is a significant loss in adjusted R^2
Multilinear Model Performance
Model performance isn't that great for either model, and Model 1 doesn't appear dramatically better than Model 2
[Figure: Predicted y vs Measured y for Model 1 and Model 2, both axes 0 to 12.]
Note: for multi-linear models, we plot Predicted vs Measured y.
A perfect model would place points directly on the 45-degree line.
Nonlinear Regression
Fitting the parameters of the van der Waals equation of state
Data for SO2
$P = \dfrac{RT}{V - b} - \dfrac{a}{V^2}$
Find the values of a and b that give the best predictions for P, when compared to the measured values of P
Strategy for Nonlinear Regression
1) estimate initial values for a and b
2) compute predicted Ps using data for V and T
3) compute errors between predicted Ps and measured Ps
4) sum the squares of these errors to compute SSE
5) have the Solver minimize SSE
by adjusting the values of a and b
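The five steps above can be sketched in Python. This is not the Solver's algorithm: a damped Gauss-Newton iteration is swapped in as the minimizer, the data are synthetic (generated from hypothetical values a = 0.68, b = 5.64e-5 in SI units rather than measured for SO2), and all function names are assumptions.

```python
R = 8.314  # gas constant, J/(mol K)


def vdw_pressure(T, V, a, b):
    """van der Waals pressure (Pa); T in K, molar volume V in m^3/mol."""
    return R * T / (V - b) - a / V ** 2


def sse(data, a, b):
    """Steps 2-4: predictions, errors, and the sum of squared errors."""
    return sum((P - vdw_pressure(T, V, a, b)) ** 2 for T, V, P in data)


def fit_vdw(data, a=0.0, b=0.0, iterations=50):
    """Step 5: minimize SSE over a and b (damped Gauss-Newton, b >= 0)."""
    v_min = min(V for _, V, _ in data)
    for _ in range(iterations):
        saa = sab = sbb = ga = gb = 0.0
        for T, V, P in data:
            r = P - vdw_pressure(T, V, a, b)   # error for this point
            ja = -1.0 / V ** 2                 # d(model)/da
            jb = R * T / (V - b) ** 2          # d(model)/db
            saa += ja * ja
            sab += ja * jb
            sbb += jb * jb
            ga += ja * r
            gb += jb * r
        # Solve the 2x2 normal equations for the step (da, db)
        det = saa * sbb - sab * sab
        da = (ga * sbb - gb * sab) / det
        db = (gb * saa - ga * sab) / det
        # Halve the step until SSE does not increase and 0 <= b < smallest V
        current = sse(data, a, b)
        step = 1.0
        for _halving in range(40):
            a_new = a + step * da
            b_new = max(0.0, b + step * db)
            if b_new < v_min and sse(data, a_new, b_new) <= current:
                a, b = a_new, b_new
                break
            step *= 0.5
    return a, b


# Synthetic "measurements" generated from hypothetical parameter values
A_TRUE, B_TRUE = 0.68, 5.64e-5
data = [(T, V, vdw_pressure(T, V, A_TRUE, B_TRUE))
        for T in (300.0, 400.0, 500.0)
        for V in (3e-4, 6e-4, 1.2e-3)]

# Step 1: initial estimates a = 0, b = 0 correspond to the ideal gas law
a_fit, b_fit = fit_vdw(data)
print(a_fit, b_fit)
```

Starting from a = 0 and b = 0 is a natural initial estimate because it reduces the model to the ideal gas law. With real, noisy data the minimum SSE would not be zero, and the fitted values would depend on the measurements.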
Basic data
Calculated Pressure by both ideal gas law and van der Waals
Sum of squares of this column
Ideal Gas Calculation
van der Waals Calculation
Error Calculation
Sum of Squares Calculation
Setting up Solver Parameters
SSE as Target Cell
Minimize by adjusting a and b with b>=0 constraint
Results
Fit of van der Waals Eqn for SO2 and Comparison to Ideal Gas Law
Note departure of ideal gas predictions at higher pressures
[Figure: Predicted Pressure (Pa) vs Measured Pressure (Pa) for van der Waals and Ideal Gas, both axes 0 to 12,000,000 Pa.]