
Regression Analysis: Model building, fitting and criticism

Statistics 201b, Winter 2011

Regression and its uses

Suppose we have a response Y (also known as an output or a dependent
variable) and predictors X_1, ..., X_p (also known as inputs or independent
variables) -- We find several goals lurking behind the banner of
regression modeling
1. We might want to examine the relationship between inputs and
outputs -- Do they tend to vary together? What does the structure of
the relationship look like? Which inputs are important?
2. We are often interested in making predictions -- Given a new set of
predictor values X_1, ..., X_p, what can we say about an unseen Y*? How
accurate can we be?
3. Regressions can also be little more than slightly sophisticated
descriptive statistics, providing us with summaries that we interpret like
we would the sample mean or standard deviation
4. Finally, regression tools often serve as a building block for more
advanced methodologies -- Smoothing by local polynomials, for example,
involves fitting lots of regression models locally, while iteratively fitting
weighted regressions is at the heart of the standard computations for
generalized linear models

Regression and its uses

We actually find tension between the first two bullets on the previous slide --
Models that are good for prediction are often not the most transparent from a
data modeling or data analytic perspective

Put another way, the price of interpretability might be diminished predictive
power, especially when we start to look at more modern tools that depend on
averages of large numbers of models (boosting) or willfully amplify the
dimensionality of the problem (support vector machines)

Breiman's piece on the two cultures is still instructive reading...

Some history

Yes, you knew it was coming -- It's not possible for me to present material
without at least the barest hint of context or history

So let's have a look at the two cultures (more or less) through two major
advances that produced the technology we'll come to know as regression...

The contested origins of least squares

Stephen Stigler, a well-known statistician who writes extensively on the history
of our field, begins a 1981 article on least squares with the sentence "The most
famous priority dispute in the history of statistics is that between Gauss and
Legendre, over the discovery of the method of least squares."

Legendre is undisputedly the first to publish on the subject, laying out the
whole method in an article in 1805 -- Gauss claimed to have used the
method since 1795 and that it was behind his computations of the meridian arc
published in 1799

In that paper, Gauss used a famous data set collected to define the first
meter -- In 1793 the French had decided to base the new metric system upon
a unit, the meter, equal to one 10,000,000th part of the meridian quadrant, the
distance from the north pole to the equator along the meridian passing
through Paris...

Least squares

The relationships between the variables in question (arc length, latitude, and
meridian quadrant) are all nonlinear -- But for short arc lengths, a simple
approximation holds

a = S/d = \alpha + \beta \sin^2 L

Having found values for \alpha and \beta, one can estimate the meridian quadrant via

meridian quadrant = 90(\alpha + \beta/2)

Label the four data points in the previous table

(a_1, L_1), (a_2, L_2), (a_3, L_3) and (a_4, L_4)

and apply the method of least squares -- That is, we identify values for \alpha
and \beta such that the sum of squared errors

\sum_{i=1}^{4} (a_i - \alpha - \beta \sin^2 L_i)^2

is a minimum
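To make the computation concrete, here is a minimal sketch in Python (using NumPy) with made-up values for the four (a_i, L_i) pairs standing in for the table referenced above; only the form of the fit, not the numbers, reflects the original survey data.

```python
import numpy as np

# Hypothetical (a_i, L_i) pairs standing in for the four points in the table
# referenced on the slide; the numbers are made up for illustration only.
L = np.radians([42.0, 44.5, 47.0, 49.5])             # latitudes
a = np.array([25500.0, 25580.0, 25650.0, 25720.0])   # arc length per degree, a_i = S_i / d_i

# Least squares fit of the linearized model a = alpha + beta * sin^2(L)
X = np.column_stack([np.ones_like(L), np.sin(L) ** 2])
alpha, beta = np.linalg.lstsq(X, a, rcond=None)[0]

# Estimated meridian quadrant, in the same (arbitrary) units as a
quadrant = 90 * (alpha + beta / 2)
print(alpha, beta, quadrant)
```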

Least squares

Given a set of predictor-response pairs (x_1, y_1), ..., (x_n, y_n), we can write the
ordinary least squares (OLS) criterion (as opposed to a weighted version that
we'll get to) as

\mathrm{argmin}_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2

Least squares

Graphically, in this simple case, we are doing nothing more than hypothesizing
a linear relationship between the x and y variables and choosing that line that
minimizes the (vertical) errors between model and data

Least squares

Given a set of predictor-response pairs (x_1, y_1), ..., (x_n, y_n), we can write the
ordinary least squares (OLS) criterion (as opposed to a weighted version that
we'll get to) as

\mathrm{argmin}_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2

By now you are all familiar with the idea that we can minimize this quantity by
simply taking derivatives with respect to the parameters

\frac{\partial}{\partial \alpha} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)

\frac{\partial}{\partial \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i)

Least squares

Setting these to zero yields the so-called normal equations

\alpha + \beta \bar{x} = \bar{y}

\alpha \bar{x} + \beta \overline{x^2} = \overline{xy}

or in matrix form

\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}

After a little (familiar?) algebra, we can rewrite the expression in the form

\frac{y - \bar{y}}{\mathrm{sd}(y)} = r \, \frac{x - \bar{x}}{\mathrm{sd}(x)}

where r is the usual correlation coefficient (we write it this way because we'll
see it again in a moment)
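As a quick numerical check (a sketch with simulated data, not anything from the lecture), we can solve the 2x2 normal equations directly and confirm that the resulting slope equals r times sd(y)/sd(x):

```python
import numpy as np

# Simulated data just to check the algebra numerically
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

# Solve the 2x2 normal equations directly
A = np.array([[len(x), x.sum()], [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
alpha_hat, beta_hat = np.linalg.solve(A, b)

# The fitted slope agrees with r * sd(y) / sd(x)
r = np.corrcoef(x, y)[0, 1]
print(beta_hat, r * y.std(ddof=1) / x.std(ddof=1))
```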

Gauss and least squares

Stigler attempts to reproduce Gauss's calculations, but cannot, given the
simple linearization (and a couple not-so-simple linearizations) on the
previous slide

Ultimately, he reckons that because Gauss was a mathematician and not a
statistician, he might have derived a more elaborate expansion -- No matter
what form was used, Stigler seems convinced that something like least
squares was required

Gauss eventually publishes on least squares in 1809, and his account of the
method is much more complete than Legendre's -- Linking the method to
probability and providing computational approaches

Gauss and least squares

My point in bringing up this example is that least squares, as a method, has
been around for a long time -- Numerical analysts often use the procedure for
fitting curves to data, whether the underlying functional form is known (or an
approximation is known, as in the last example) or not (using a flexible basis of,
say, polynomials or piecewise polynomials)

We will see many examples of similar problems in statistics -- Smoothing
procedures borrow tools directly from approximation theory (from
polynomials to smooth, piecewise polynomials to wavelets)

But statistics brings with it an emphasis on model interpretation, model
assessment, and the formal incorporation of uncertainty -- It's interesting
to compare how these same tools are used by statisticians and numerical
analysts and how the analysis shifts as they pass across disciplinary
boundaries

Galton and regression

While least squares, as a method, was developed by several people at around
the same time (often ideas are in the air), regression, as we have come to
understand it, was almost entirely the work of one man

Stigler writes "Few conceptual advances in statistics can be as unequivocally
associated with a single individual. Least squares, the central limit theorem, the
chi-squared test -- all of these were realized as the culmination of many years
of exploration by many people. Regression too came as the culmination of
many years work, but in this case it was the repeated efforts of one
individual."

Galton and regression

Francis Galton (1822-1911) was at various points in his career an inventor, an
anthropologist, a geographer, a meteorologist, a statistician and even a
tropical explorer -- The latter gig paid quite well as his book The Art of Travel
was a best seller

Among his many innovations was the first modern weather map, appearing in
The Times in 1875 -- To draw it, Galton requested data from meteorological
stations across Europe

He also developed the use of fingerprints as a means of identification -- This
work is just one small part of his larger interest in how human characteristics
(physical or even mental) varied across populations

Galton and regression

Galton was also a half-cousin of Charles Darwin (sharing the same
grandfather) and took a strong interest in how physical and mental
characteristics move from generation to generation -- Heredity

His work on regression started with a book entitled Hereditary Genius from
1869 in which he studied the way talent ran in families -- The book has lists
of famous people and their famous relatives (great scientists and their families,
for example)

He noted that there was a rather dramatic reduction in awesomeness as you
moved up or down a family tree from the great man in the family (the Bachs or
the Bernoullis, say) -- And thought of this as a kind of regression toward
mediocrity

Galton and regression

In some sense, his work builds on that of Adolphe Quetelet -- Quetelet
saw normal distributions in various aggregate statistics on human
populations

Galton writes "Order in Apparent Chaos -- I know of scarcely anything
so apt to impress the imagination as the wonderful cosmic order
expressed by the Law of Frequency of Error. The law would have been
personified by the Greeks and deified, if they had known of it."

Galton and regression

Relating the normal curve (and the associated central limit theorem) to heredity,
however, proved difficult for Galton -- He could not connect the curve to the
transmission of abilities or physical characteristics from one generation to the
next, writing

If the normal curve arose in each generation as the aggregate of a large
number of factors operating independently, no one of them of overriding or
even significant importance, what opportunity was there for a single factor
such as a parent to have a measurable impact?

So at first glance, the normal curve that Galton was so fond of in Quetelet's
work was at odds with the possibility of inheritance -- Galton's solution to the
problem would be the formulation of regression and its link to the bivariate
normal distribution

Galton and regression

In 1873, Galton had a machine built which he christened the Quincunx -- The
name comes from the similarity of the pin pattern to the arrangement of fruit
trees in English agriculture (quincunxial because it was based on a square of
four trees with a fifth in the center)

The machine was originally devised to illustrate the central limit theorem and
how a number of independent events might add up to produce a normal
distribution -- Lead shot were dropped at the top of the machine and piled up
according to the binomial coefficients at the bottom

The other panels in the previous slide illustrate a thought experiment by Galton
(it's not clear the other devices were ever made) -- The middle region (between
the A's) in the central machine could be closed, preventing the shot from
working its way down the machine

Galton and regression

By imagining holding back a portion of the shot, Galton expected to still see a
normal distribution at the bottom of the machine, but one with less variation --
As he opened each barrier, the shot would deposit itself according to
small normal curves, adding to the pattern already established

Once all the barriers had been opened, you'd be left with the original normal
distribution at the bottom -- Galton, in effect, showed how the normal curve
could be dissected into components which could be traced back to the
location of the shot at the A-A level of the device

In effect, he established that a normal mixture of normals is itself normal -- But
with this idea in hand, we see his tables of human measurements in a
different light...

Galton and regression

Looking at these tables, we see the Quincunx at work -- The right-hand column
labeled "Total number of Adult Children" gives the counts of shot at the A-A
level, while the row marked "Totals" can be thought of as the distribution one
would see at the bottom of the device when all the barriers are opened, and
the individual counts in each row as the corresponding normal curves

By 1877, Galton was starting to examine these ideas mathematically -- He
essentially discovered the important properties of the bivariate normal
distribution (the bivariate normal had been derived by theorists unknown to
Galton, but they did not develop the idea of regression, nor did they attempt to
fit it from data as Galton did)

Galton and regression

In his text Natural Inheritance, he approached a table like this by first examining
the heights of the mid-parents, noting that their distribution appeared to be
normal -- He then looked at the marginal distribution of child heights and found
them to also be normally distributed

He then considered the heights of the children associated with different
columns in his table, plotting median values against mid-parental height and
finding a straight line (which he fit by eye)

He found that the slope was about 2/3 -- If children were on average as tall as
their parents, he'd expect a slope of 1, leading him to coin the phrase
"regression toward mediocrity"

Galton and regression

What Galton found through essentially geometric means was the following
relationship (which we've seen earlier in the lecture)

\frac{y - \bar{y}}{\mathrm{sd}(y)} = r \, \frac{x - \bar{x}}{\mathrm{sd}(x)}

where we might take x to be the heights of mid-parents and y to be the heights
of their adult offspring -- The quantity r is the correlation coefficient between x
and y (another Galton innovation)

This gives a precise meaning to his phrase "regression to the mean"

Galton and regression

Galton also noticed, however, that a similar kind of regression happened in
reverse -- That is, if you transposed the table, you'd find a slope of 1/3
relating the average mid-parents' height to that of their children

He surmised that the regression effect was more a fact about the bivariate
normal distribution than anything else -- This is a lesson that many
researchers have failed to appreciate even now
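A small simulation illustrates both regressions; the heights below are made up, chosen only so that the two slopes land near Galton's 2/3 and 1/3 (this is not his actual data), and the product of the two slopes is r^2.

```python
import numpy as np

# Made-up mid-parent/child heights, chosen so the two slopes come out near
# Galton's 2/3 and 1/3 -- an illustration, not Galton's data.
rng = np.random.default_rng(1)
midparent = rng.normal(68.0, 1.26, size=100_000)   # mid-parents vary less (they are averages)
child = 68.0 + (2 / 3) * (midparent - 68.0) + rng.normal(0.0, 1.57, size=midparent.size)

b_child_on_parent = np.polyfit(midparent, child, 1)[0]   # about 2/3
b_parent_on_child = np.polyfit(child, midparent, 1)[0]   # about 1/3
r = np.corrcoef(midparent, child)[0, 1]
print(b_child_on_parent, b_parent_on_child, r ** 2)      # the two slopes multiply to r^2
```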

Here's Galton -- Notice that he's not content to just invent regression, but he
also exhibits one of the first (if not the first) bivariate kernel density estimates!

Galton and regression

To complete this story, Galton enlisted the help of a mathematician, Hamilton
Dickson -- The problem he wanted solved was the following

Suppose x and y are expressed as deviations from the mean and that x is
normal with mean zero and standard deviation Q_x

Also suppose that conditional on a fixed value of x, y is also normal with
mean \mu_{y|x} and standard deviation Q_{y|x}

What is the joint density of x and y and, in particular, are the contours of
equal probability elliptical?

What is the conditional distribution of x given y, and in particular, what is the
relation between the two regression coefficients?

Galton and regression

In his response, Dickson derived the bivariate normal distribution and the
associated marginals and conditionals -- For simplicity, let X and Y have
standard normal distributions with correlation \rho

f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left( -\frac{x^2 - 2\rho x y + y^2}{2(1 - \rho^2)} \right)

Then, after a little algebra, the conditional density of Y given X = x is just

f(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{1}{\sqrt{2\pi}\sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} (y - \rho x)^2 \right)

which we recognize as a normal with mean \rho x and standard deviation \sqrt{1 - \rho^2}
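A quick way to see this numerically is to simulate from the bivariate normal and condition on a thin slice of X values; this is only a sketch, with arbitrary choices of rho and x0.

```python
import numpy as np

# Simulation check of the conditional distribution Y | X = x for a standard
# bivariate normal with correlation rho; rho and x0 are arbitrary choices.
rng = np.random.default_rng(2)
rho, x0 = 0.6, 1.0

cov = [[1.0, rho], [rho, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)

# Keep draws whose X lands in a thin slice around x0 and look at their Y values
y_slice = xy[np.abs(xy[:, 0] - x0) < 0.02, 1]

print(y_slice.mean(), rho * x0)                      # close to rho * x0
print(y_slice.std(ddof=1), np.sqrt(1 - rho ** 2))    # close to sqrt(1 - rho^2)
```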

Galton and regression

Despite the tremendous influence Galton had on the practice and
(indirectly) theory of statistics, it's worth asking: why was he so concerned
with heredity?

Tables of heights seem innocent, tracking familial eminence is maybe
less so, but his photographic work...

Galton

As we've seen, Galton was deeply committed to the idea of the normal curve
as an important force in nature and (as with Quetelet) thought the mean
value had particular importance as an indicator of type

Quetelet was more extreme than Galton, however, in that he believed
deviations from the mean were more like small errors, and regarded the
mean as something perfect or ideal

For Galton, these types were stable from generation to generation -- You can
see this in his work on fingerprints or even in his composite photography

Regression today

Regression has become a powerful tool in many quantitative disciplines -- In
many cases, a regression model acts as a kind of social probe, providing
researchers with a glimpse into the workings of some larger phenomenon

OK, that's generous. It's also a highly abused tool, one for which the elegant
mathematics breaks down rather quickly once you hit modern practice --
Researchers often choose between many competing models, often through
exhaustive searches; data go missing and some form of imputation is often
required; the underlying functional form is rarely linear and must also be
estimated...

But here's what regression looks like in other fields...

Regression

In the two major examples from this lecture (Gauss and Galton), we see two
approaches to regression -- One based on a loss function (the least squares
criterion) and one that involves a conditioning argument using a formal data
model (the bivariate normal)

Let's see if we can bring these two approaches into agreement...

Regression

Let's recall a little probability and take Y to be a random variable representing
our output, and X = (X_1, ..., X_p) a random vector denoting our inputs --
Suppose we want to find a function h(X) for predicting values of Y

This seems to require some criterion to judge how well a function h is doing --
Let L(Y, h(X)) represent a loss function that penalizes bad predictions

For convenience, we start this quarter with squared error loss, or simply

L(Y, h(X)) = (Y - h(X))^2

and define the expected (squared) prediction error to be

E \, L(Y, h(X)) = E (Y - h(X))^2 = \int\!\!\int [y - h(x)]^2 f(x, y) \, dx \, dy

Regression

We can rewrite this expression, conditioning on X, to yield

E \, L(Y, h(X)) = E_X \, E_{Y|X}\left( [Y - h(X)]^2 \mid X \right)

which we can then consider solving pointwise

h(x) = \mathrm{argmin}_z \, E_{Y|X}\left( [Y - z]^2 \mid X = x \right)

This produces the simple conditional mean h(x) = E(Y \mid X = x) -- So, under
squared error loss, the conditional mean is the best prediction of Y at any
point X = x
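To illustrate, here is a short simulation under an assumed model Y = X^2 + noise (not a model from the lecture): predicting with the conditional mean E(Y | X = x) = x^2 achieves a lower mean squared error than the best straight line, with the noise variance as the floor.

```python
import numpy as np

# Simulation under an assumed model Y = X^2 + noise: the conditional mean
# E(Y | X = x) = x^2 beats other predictors in squared error -- here we
# compare it to the best straight line.
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=200_000)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)

cond_mean_pred = x ** 2                        # the conditional mean
lin_pred = np.polyval(np.polyfit(x, y, 1), x)  # best linear predictor

print(np.mean((y - cond_mean_pred) ** 2))      # about 0.25, the noise variance
print(np.mean((y - lin_pred) ** 2))            # noticeably larger
```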

Regression

In the case of a bivariate normal distribution, this conditional expectation is, in
fact, linear -- There are certainly plenty of other situations in which an
assumption of linearity is (at best) an approximation (all smooth functions h
look roughly linear in small neighborhoods)

Regression, then, has come to stand for a suite of algorithms that attempt to
estimate the mean (or some centrality parameter) of an output conditional on
one or more input variables

Modern regression

As an analysis tool, the practice of regression seems to have undergone a
massive shift in the 1970s -- Writing to some of the big names publishing at the
time (Cook, Allen, Weisberg), this seems to be due in part to the shifting nature
of computing

It was also noted that an interest in regression was in the air as it was the hot
topic of the decade (What is the hot topic today? What's in the air today?)

Practicalities

OK so that wasn't quite what I wanted, but in the 1970s you have the following
innovations appearing
1. Diagnostic tools (leave-one-out measures, influence, Cook's distance)
2. Automatic criteria for variable selection (Cp, AIC, BIC)
3. Simulation techniques for inference (the bootstrap)
4. Computational schemes for subset selection (leaps and bounds, say)
5. Computational tools for fitting (SVD, QR decomposition -- well, mid-1960s)
6. Biased or penalized estimates (ridge regression)
7. Alternate loss functions (robust regression, quantile regression)
8. Flexible modeling (local polynomials, global B-splines, smoothing splines)
9. New data types (generalized linear models)
10. The Bayesian linear model

Practicalities

Since the 1970s, regression has continued to flourish with new advances in
nonparametric methods (wavelets, averaged or boosted predictors, kernel
methods), new approaches to penalties (the lasso, say) and an explosion in
Bayesian tools

We intend to cover all of this during the quarter!

To start

Let's go back to our general framework -- We have an output or response Y
and inputs or predictors X_1, ..., X_p, both of which we can think of as random
variables (although given most of you are coming from 201a, you can think of
the Xs as deterministic)

We can express a stochastic relationship between inputs and outputs with the
formula

Y = \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon

where we assume the error term \epsilon is independent of the predictors and has
mean zero and constant (finite) variance \sigma^2

To start

Suppose we are now given data from this model -- That is, we have n data
pairs (x_1, y_1), ..., (x_n, y_n) where (with a slight abuse of notation) x_i = (x_{i1}, ..., x_{ip})

The simple linear model (regression model) relating inputs to outputs is then

y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i

For the moment, we'll assume that if the model has an intercept, it is
represented by one of the p predictors -- It's a boring predictor that's simply 1
for each data pair

To start

We determine estimates for the unknown parameters \beta_1, ..., \beta_p and \sigma^2 via
ordinary least squares -- That is, we want to minimize the quantity

\sum_i (y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2

Taking partial derivatives as we had for the case of simple regression (and now
with p > 1, we use the term multiple regression), we can derive the so-called
normal equations

\beta_1 \sum_i x_{i1}^2 + \beta_2 \sum_i x_{i1} x_{i2} + \cdots + \beta_p \sum_i x_{i1} x_{ip} = \sum_i y_i x_{i1}

\beta_1 \sum_i x_{i1} x_{i2} + \beta_2 \sum_i x_{i2}^2 + \cdots + \beta_p \sum_i x_{i2} x_{ip} = \sum_i y_i x_{i2}

\vdots

\beta_1 \sum_i x_{i1} x_{ip} + \beta_2 \sum_i x_{i2} x_{ip} + \cdots + \beta_p \sum_i x_{ip}^2 = \sum_i y_i x_{ip}

To start

While this seems tedious, we can again appeal to a matrix formulation -- Let X
denote the so-called design matrix and y the vector of responses

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1,p-1} & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2,p-1} & x_{2p} \\ \vdots & & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{n,p-1} & x_{np} \end{pmatrix} \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}

Then, we can (and this is not news to any of you, I am certain) rewrite the
normal equations in the form

X^t X \beta = X^t y

where we have collected our regression coefficients into the vector \beta = (\beta_1, ..., \beta_p)^t

To start

Now, assuming the matrix X^t X is invertible (it's time to dust off your linear
algebra books!), we can form an estimate of our regression coefficients using
the (symbolic!) manipulation

\hat{\beta} = (X^t X)^{-1} X^t y

Similarly, the estimated conditional mean for the ith data point is simply

\hat{y}_i = \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}

which we can write in matrix notation as

\hat{y} = X \hat{\beta} = X (X^t X)^{-1} X^t y = H y
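Here is a minimal numerical sketch of these formulas, with simulated data and hypothetical coefficients; note that in practice one solves the normal equations (or uses a QR decomposition) rather than forming the inverse explicitly.

```python
import numpy as np

# Minimal sketch of the matrix formulation: simulate from a linear model with
# hypothetical coefficients, then recover beta-hat from the normal equations.
rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column plays the intercept role
beta = np.array([1.0, 2.0, -0.5])                               # made-up true coefficients
y = X @ beta + rng.normal(scale=0.3, size=n)

# Solve X^t X beta = X^t y; solving is preferred to forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix
y_hat = H @ y                            # fitted values, identical to X @ beta_hat

print(beta_hat)
print(np.allclose(y_hat, X @ beta_hat))  # True
```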

To start

The matrix H is known as the hat matrix (for the obvious reason that it carries
our observed data into an estimate of the associated conditional means, in
effect placing a hat on y)

We can derive some properties of H somewhat easily -- For example, H is
symmetric (easy) and it's idempotent

H^2 = H H = X (X^t X)^{-1} X^t X (X^t X)^{-1} X^t = X (X^t X)^{-1} X^t = H

We can compute the residuals from our fit as \hat{\epsilon}_i = y_i - \hat{y}_i, or \hat{\epsilon} = (I - H) y, so
that the residual sum of squares can be written as

\sum_i \hat{\epsilon}_i^2 = \hat{\epsilon}^t \hat{\epsilon} = y^t (I - H)^t (I - H) y = y^t (I - H) y
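And a short check of these identities on simulated data (again just a sketch, continuing the previous example):

```python
import numpy as np

# Quick check of the hat matrix identities on simulated data
rng = np.random.default_rng(5)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(n)
resid = (I - H) @ y

print(np.allclose(H, H.T))                          # symmetric
print(np.allclose(H @ H, H))                        # idempotent
print(np.allclose(resid @ resid, y @ (I - H) @ y))  # both forms of the RSS agree
```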
