Regression Analysis
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
[Figure: diagram of y with X1, X2, and X3]
STATISTICAL DATA ANALYSIS
COMMON TYPES OF ANALYSIS?
1. Compare Groups
a. Compare Proportions (e.g., Chi-Square Test, $\chi^2$)
H0: $P_1 = P_2 = P_3 = \dots = P_k$
b. Compare Means (e.g., Analysis of Variance)
H0: $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
2. Examine Strength and Direction of Relationships
a. Bivariate (e.g., Pearson Correlation, $r$)
Between one variable and another: $Y = a + b_1 x_1$
b. Multivariate (e.g., Multiple Regression Analysis)
Between one dependent variable and each of several independent variables,
while holding all other independent variables constant:
$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
Simple and Multiple Regression Analysis
What does regression analysis do?
Examines whether changes/differences in values of one variable
(dependent variable Y) are linked to changes/differences in values
of one or more other variables (independent variables X1, X2, etc.),
while controlling for the changes in values of all other Xs.
E.g., Relationship between salary and gender for people who have the same
levels of education, work experience, position level, seniority, etc.
The DV (Y) must be metric.
The IVs (Xs) must be either metric or dummy variables.
Central Question Addressed:
Is Y a function of X1, X2, etc.? How?
Is there a relationship between Y and X1, X2, etc. (in each case,
after controlling for the effects of all other Xs)? In what way?
What is the relative impact of each X on Y, holding all other Xs
constant (that is, all other Xs being equal)?
More specifically:
Do values of Y tend to increase/decrease as values of X1, X2, etc. increase/decrease?
If so, by how much?
And how strong is the connection/relationship between the Xs and Y?
That is, what % of differences/variations in Y values (e.g., income) among
study subjects can be explained by (or attributed to) differences in
X values (e.g., years of education, years of experience, etc.)?
[Figure: diagram of y with X1, X2, and X3]
NOTE: Once we can determine how values of Y change as a
function of values of X1, X2, etc., we will also be able to
predict/estimate the value of Y from specific values of X1, X2,
etc.
$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k + e$ (where $e$ is the error in estimation)
Therefore, regression analysis, in a sense, is about
ESTIMATING values of Y, using information about
values of Xs:
Estimation, by definition, involves?
The objective?
To minimize error in estimation.
Or, to compute estimates that are
as close to the true/actual values as possible.
QUESTION: What is the simplest way to obtain an
estimate for some population characteristic
(e.g., number of credit cards per U.S. household)?
ANSWER:
1. Select a representative sample from the population and
2. Compute the mean for that sample (e.g., compute the
average number of CCs for the sample households), $\bar{x}$.
Regression analysis can be viewed as a technique that often
significantly improves the accuracy of estimation results relative
to using the mean value.
So, suppose we were to estimate the number of credit cards for
U.S. households, based on information from a random sample of,
say, n = 8 families.
Estimating Number of Credit Cards*
Family (i)    Actual # of Credit Cards (yi)
1             4
2             6
3             6
4             7
5             8
6             7
7             8
8             10
Total         56
Estimate? $\bar{y} = \sum y_i / n = 56 / 8 = 7$
QUESTION: Can we determine how much error in estimation we are committing
by using $\bar{y} = 7$ as our estimate for each of these households?
* This example was adapted from Hair, Black, Babin, Anderson, & Tatham (2006). Multivariate Data Analysis, 6th ed., Prentice Hall.
Estimating Number of Credit Cards
Family (i)    Actual (yi)    Estimate (ybar)    Error in Estimation (yi - ybar)
1             4              7                  -3
2             6              7                  -1
3             6              7                  -1
4             7              7                   0
5             8              7                  +1
6             7              7                   0
7             8              7                  +1
8             10             7                  +3
$\sum y_i = 56$, $\bar{y} = 56 / 8 = 7$
Let's now see all this graphically.
[Figure: actual # of credit cards (vertical axis, 0 to 10) for families F1 through F8, plotted against the baseline estimate $\bar{y} = 7$. The vertical gap between each family's actual value and the estimate is its estimation error.]
Can we determine the total estimation error for all 8 families?
What would be the total estimation error for all 8 families combined?
$\sum (y_i - \bar{y}) = (-3) + (-1) + (-1) + 0 + 1 + 0 + 1 + 3 = 0$
The positive and negative errors cancel out, so the plain sum cannot serve as a measure of total error.
Solution?
Estimating Number of Credit Cards
Family (i)    Actual (yi)    Estimate (ybar)    Error (yi - ybar)    Error Squared
1             4              7                  -3                   9
2             6              7                  -1                   1
3             6              7                  -1                   1
4             7              7                   0                   0
5             8              7                  +1                   1
6             7              7                   0                   0
7             8              7                  +1                   1
8             10             7                  +3                   9
Total         56                                 0                   22
$\bar{y} = 56 / 8 = 7$
$\sum (y_i - \bar{y}) = 0$
$\sum (y_i - \bar{y})^2 = 22$
SST = Sum of Squares Total
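The baseline calculation can be checked with a few lines of Python. This is a minimal sketch: the eight credit card counts are the sample values from this example, and the variable names are illustrative only.

```python
# Actual number of credit cards for the 8 sample families.
cards = [4, 6, 6, 7, 8, 7, 8, 10]

# Baseline estimate: the sample mean.
mean_cards = sum(cards) / len(cards)

# Error in estimation for each family when the mean is used as the estimate.
errors = [y - mean_cards for y in cards]

# The raw errors cancel out, so their sum says nothing about total error.
total_error = sum(errors)

# Squaring removes the signs; the sum of squared errors is SST.
sst = sum(e ** 2 for e in errors)

print(mean_cards, total_error, sst)
```

Squaring before summing is what keeps the positive and negative errors from cancelling.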
22 = SST = Index for the total (combined) amount of estimation error
for all families (observations) in the sample when using the mean
as the estimate.
SST is also the sum of squared deviations from the mean.
Remember the formula for computing variance?
Objective in Estimation?
Minimize error, maximize precision.
Can we cut down the amount of estimation error (SST)? How?
Yes, we can, by using information about other variables suspected
to be strong predictors of (strongly related to) the # of credit cards
possessed by families (e.g., family size, family income, etc.).
Family (i)    Actual # of Credit Cards (y)    Family Size (x)
1             4                               2
2             6                               2
3             6                               4
4             7                               4
5             8                               5
6             7                               5
7             8                               6
8             10                              6
We now can attempt to estimate # of credit cards from the information on
family size, rather than from its own mean.
Let's first see this graphically!
[Figure: actual # of credit cards (Y, 0 to 10) plotted against family size (X) for families F1 through F8 (e.g., F1 at x = 2, y = 4), with a horizontal green line at $\bar{y} = 7$, the original (baseline) estimate.]
QUESTION: Does the mean ($\bar{y}$) appear to represent the
closest estimate of the actual CC numbers for our
sample families?
That is, is the green line the best line to represent the
location of estimates of # of CCs for these families?
Generic equation for any straight line: $Y = a + bx$
[Figure: scatterplot of # of credit cards (Y) against family size (X) showing candidate lines $\hat{y} = a_2 + b_2 x$ and $\hat{y} = a_3 + b_3 x$ alongside the original (baseline) estimate, the horizontal line $\hat{y} = a + 0x = \bar{y}$.]
Regression Line (Line of Best Fit): the new, improved location for CC estimates.
[Figure: scatterplot of # of credit cards (Y) against family size (X) with the original (baseline) estimate $\bar{y}$ and the regression line (line of best fit) $\hat{y} = a + bx$, the new, improved location for CC estimates. The estimation error for each family is the vertical distance $(y - \hat{y})$.]
The regression line will minimize $\sum (y - \hat{y})^2$ = total estimation error.
But how do we know the values of a and b in $\hat{y} = a + bx$ (the regression line)?
EQUATION FOR REGRESSION LINE (LINE OF BEST FIT)
Values of a and b for the regression line $\hat{y} = a + bx$:
$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
$a = \bar{y} - b\bar{x}$
Let's use the above formulas to compute the values of a
and b for the regression line in our example.
We will need: $\bar{y}$, $\bar{x}$, $\sum (x - \bar{x})(y - \bar{y})$, and $\sum (x - \bar{x})^2$.
We need: $\bar{y}$, $\bar{x}$, $\sum (x - \bar{x})(y - \bar{y})$, and $\sum (x - \bar{x})^2$
Family (i)    y     x     x - xbar    y - ybar    (x - xbar)(y - ybar)    (x - xbar)^2
1             4     2     -2.25       -3          6.75                    5.0625
2             6     2     -2.25       -1          2.25                    5.0625
3             6     4     -.25        -1          .25                     .0625
4             7     4     -.25         0          0                       .0625
5             8     5     .75          1          .75                     .5625
6             7     5     .75          0          0                       .5625
7             8     6     1.75         1          1.75                    3.0625
8             10    6     1.75         3          5.25                    3.0625
Total         56    34                            17                      17.5
$\bar{y} = 56 / 8 = 7$
$\bar{x} = 34 / 8 = 4.25$
$\sum (x - \bar{x})(y - \bar{y}) = 17$
$\sum (x - \bar{x})^2 = 17.5$
REGRESSION LINE (LINE OF BEST FIT): $\hat{y} = a + bx$
$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{17}{17.5} = .971$
$a = \bar{y} - b\bar{x} = 7 - .971(4.25) = 2.87$
a = 2.87 (Y-Intercept)
b = .97 (Regression Coefficient)
$\hat{y} = 2.87 + .97x$
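The same slope and intercept can be reproduced in code. The sketch below applies the two formulas directly to the sample data; the variable names (`s_xy`, `s_xx`) are illustrative and not part of the original material.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]   # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]     # x: family size

n = len(cards)
x_bar = sum(size) / n
y_bar = sum(cards) / n

# Sum of cross-products of deviations, and sum of squared deviations of x.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, cards))
s_xx = sum((x - x_bar) ** 2 for x in size)

b = s_xy / s_xx          # regression coefficient (slope)
a = y_bar - b * x_bar    # Y-intercept

print(f"b = {b:.3f}, a = {a:.2f}")
```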
[Figure: scatterplot of # of credit cards against family size with the original (baseline) estimate $\bar{y}$ and the regression line $\hat{y} = 2.87 + .97x$ (new, improved estimates).]
Can we tell how much estimation error we have
committed by using the new regression line?
Yes, examine the differences between our households'
actual # of CCs and their new/regression estimates.
$\hat{y} = 2.87 + .97x$
For Family 1: $\hat{y} = 2.87 + .97(2) = 4.81$
Family (i)    y     x     Regression Estimate (yhat)    Error/Residual (y - yhat)    Error Squared
1             4     2     4.81                          -.81                         .66
2             6     2     4.81                          1.19                         1.42
3             6     4     6.76                          -.76                         .58
4             7     4     6.76                          .24                          .06
5             8     5     7.73                          .27                          .07
6             7     5     7.73                          -.73                         .53
7             8     6     8.7                           -.7                          .49
8             10    6     8.7                           1.3                          1.69
$\sum (y - \hat{y})^2 = 5.486$ (computed from unrounded estimates)
SSE = Sum of Squares Error (SS Residual)
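The residual table can be regenerated as a check. This is a sketch under the same data; `fit_line` is a hypothetical helper written for this illustration, not a library function.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]   # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]     # x: family size


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a, b = fit_line(size, cards)
estimates = [a + b * x for x in size]
residuals = [y - y_hat for y, y_hat in zip(cards, estimates)]

# SSE: the estimation error that remains after using the regression line.
sse = sum(r ** 2 for r in residuals)

for family, (y, y_hat, r) in enumerate(zip(cards, estimates, residuals), start=1):
    print(f"F{family}: actual={y}  estimate={y_hat:.2f}  residual={r:+.2f}")
print(f"SSE = {sse:.3f}")
```

Because the code keeps full precision in a and b, its SSE matches the 5.486 figure rather than the sum of the rounded squared errors.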
Total Baseline Error using the mean (SS Total): 22.0
New or Remaining Error (SS Error or SS Residual): 5.486 ≈ 5.5
QUESTION: How much of the original estimation error have we explained
away (eliminated) by using the regression model (instead of the mean)?
22 - 5.486 = 16.514 (SS Regression or SS Explained)
[Figure: Venn diagram of the total variation in Y (22) and X1; the overlap (explained) is 16.5 and the remainder (unexplained) is 5.5.]
QUESTION: What % of estimation error have we explained (eliminated) by
using the regression model?
$R^2$ = 16.514 / 22 = .751 or 75%. What is this called?
% of differences in # of CCs among households that is
explained by differences in their family size.
What does the remaining 25% represent?
Percent of variation (differences) in number of credit cards owned by families
that can be accounted for by: (a) all other potential predictors not included in the
model, beyond family size, and (b) unexplainable random/chance variations.
$R^2$ = SS Regression / SS Total = 16.5 / 22 = 75%
$R^2$ is a measure of our success regarding accuracy of our estimation effort.
$R^2$ = % of estimation error that we have been able to explain away by
using the regression model, instead of using the mean.
$R^2$ indicates how much better we can predict Y from information about
Xs, rather than from using its own mean.
$R^2$ = % of differences (variations) in Y values that is explained by
(attributable to) differences in X values.
Note: When dealing with only two variables (a single X and Y):
$r = \sqrt{R^2} = \sqrt{16.514 / 22} = \sqrt{.75} = .866$
This is the Pearson correlation of Y with X1 (NOT controlling for any other variable).
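As a check on the link between $R^2$ and Pearson's r, the sketch below computes both from the sample data. It relies on the identity SS Regression = $(\sum (x - \bar{x})(y - \bar{y}))^2 / \sum (x - \bar{x})^2$, which holds for simple (one-predictor) regression; variable names are illustrative.

```python
import math

cards = [4, 6, 6, 7, 8, 7, 8, 10]   # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]     # x: family size

n = len(cards)
x_bar = sum(size) / n
y_bar = sum(cards) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, cards))
s_xx = sum((x - x_bar) ** 2 for x in size)
s_yy = sum((y - y_bar) ** 2 for y in cards)

sst = s_yy                   # total (baseline) error
ssr = s_xy ** 2 / s_xx       # error explained by the regression line
sse = sst - ssr              # remaining (residual) error

r_squared = ssr / sst
pearson_r = s_xy / math.sqrt(s_xx * s_yy)   # computed independently of r_squared

print(f"R^2 = {r_squared:.3f}, r = {pearson_r:.3f}")
```

The two routes agree: the square root of $R^2$ equals the directly computed correlation, as expected when there is a single X.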
Let's now examine all this graphically!
[Figure: scatterplot of # of credit cards against family size with the original (baseline) estimate $\bar{y}$ and the regression line (new, improved estimates) $\hat{y} = 2.87 + .97x$. For family F1, the original baseline error $(y - \bar{y})$ is split into the part explained by the regression model, $(\hat{y} - \bar{y})$, and the new error (unexplained/residual), $(y - \hat{y})$.]
5.5 = SSE = The amount of estimation error for the 8 sample families
when using simple regression (i.e., a regression model that includes
only information about family size).
Can we reduce the amount of estimation
error (SSE) to an even lower level and
thus improve the estimation process? How?
Yes, by adding information on a second variable suspected to be
strongly related to # of credit cards (e.g., family income, X2).
Family (i)    Actual # of Credit Cards (y)    Family Size (x1)    Family Income in $000 (x2)
1             4                               2                   14
2             6                               2                   16
3             6                               4                   14
4             7                               4                   17
5             8                               5                   18
6             7                               5                   21
7             8                               6                   17
8             10                              6                   25
We now can attempt to estimate # of CCs from our information
on family size and family income!
Our regression model will now be a linear plane, rather than a
straight line!
Generic equation for a linear plane: $\hat{y} = a + b_1 x_1 + b_2 x_2$
Let's examine the regression plane for our example graphically.
[Figure: 3-D plot of Y = # of credit cards (0 to 12) against X1 = family size and X2 = family income, showing the actual values and the regression estimates on the plane $\hat{y} = a + b_1 x_1 + b_2 x_2$.]
Formulas are available for computing the values of a, b1, and b2.
MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63x_1 + .216x_2$
Let's now see how much error in estimation we are committing
by using this multiple regression model.
$\hat{y} = .482 + .63x_1 + .216x_2$
For Family 1: $\hat{y} = .482 + .63(2) + .216(14) = 4.77$
Family (i)    y     x1 (Size)    x2 (Income, $000)    Regression Estimate (yhat)    Error/Residual (y - yhat)    Error Squared
1             4     2            14                   4.77                          -.77                         .59
2             6     2            16                   5.20                          .80                          .64
3             6     4            14                   6.03                          -.03                         .00
4             7     4            17                   6.68                          .32                          .10
5             8     5            18                   7.53                          .47                          .22
6             7     5            21                   8.18                          -1.18                        1.39
7             8     6            17                   7.95                          .05                          .00
8             10    6            25                   9.67                          .33                          .11
$\sum (y - \hat{y})^2 = 3.05$
SSE = Sum of Squares Error (Residual)
Unique (additional) contribution of X2 (family income) beyond X1 = ? 5.5 - 3.05 = 2.45
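The coefficients of the two-predictor model can be reproduced without statistical software. The sketch below uses the closed-form least-squares solution for exactly two predictors, written in terms of deviation sums; this is one standard way to get a, b1, and b2 and is not necessarily the formula set the course has in mind. The helper names `mean` and `cross` are illustrative.

```python
cards = [4, 6, 6, 7, 8, 7, 8, 10]            # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]              # x1: family size
income = [14, 16, 14, 17, 18, 21, 17, 25]    # x2: family income ($000)


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y = cross(size, cards), cross(income, cards)

# Least-squares solution for two predictors.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(cards) - b1 * mean(size) - b2 * mean(income)

estimates = [a + b1 * x1 + b2 * x2 for x1, x2 in zip(size, income)]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(cards, estimates))

print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}, SSE = {sse:.2f}")
```

For more than two predictors this closed form no longer applies, and a general least-squares routine would be the usual choice.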
The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63x_1 + .216x_2$
.482 = Y-Intercept (a)
b1 = .63 and b2 = .216 = Regression Coefficients
(NOTE: Only when all Xs can meaningfully take on a value of zero
will the intercept have a meaningful/direct/practical interpretation.
Otherwise, it is simply an aid in increasing the accuracy of estimation.)
0.63: Among families of the same income, an increase in
family size by one person would, on average, result in .63
more credit cards.
0.216: Among families of the same size, an income increase
of $1,000 results in an average increase of about 0.2 credit cards.
The bs represent the effect of each X on Y when all other Xs are
controlled for/held constant/taken into account,
i.e., after the impacts of all other variables are accounted
for (remember the high blood pressure-hearing
problem connection?).
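The "holding all other Xs constant" reading of the coefficients can be made concrete. In the sketch below, `estimate_cards` is a hypothetical helper that simply evaluates the fitted equation with its rounded coefficients; the family of 4 with $17,000 income is an arbitrary starting point.

```python
def estimate_cards(family_size, income_thousands):
    """Evaluate the fitted model y_hat = .482 + .63*x1 + .216*x2."""
    return 0.482 + 0.63 * family_size + 0.216 * income_thousands


base = estimate_cards(4, 17)

# Change only family size (income held constant).
one_more_person = estimate_cards(5, 17)

# Change only income by $1,000 (family size held constant).
one_more_thousand = estimate_cards(4, 18)

print(f"effect of one more person: {one_more_person - base:.3f}")
print(f"effect of $1,000 more income: {one_more_thousand - base:.3f}")
```

Whatever starting point is chosen, the two differences equal b1 and b2, because the model is linear in each X.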
The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63x_1 + .216x_2$
SST = 22
SSE = 3.05
What is our new $R^2$?
SS Regression = 22 - 3.05 = 18.95
$R^2$ = 18.95 / 22 = .861 or 86%
Percent of differences in households' number of CCs that is explained by
differences in family size and family income.
The Remaining 14%? (3.05 / 22 = .14)
Percent of variation in number of credit cards that can be accounted for by (a) all
other relevant factors not included in the model, beyond family size and income, and
(b) unexplainable random/chance variations.
[Figure: Venn diagram of Y = # of CC overlapping X1 = Family Size and X2 = Family Income. Region a is explained by X1 only, b by X2 only, c by both X1 and X2, and d is unexplained.]
Total Variation/Error in Y = SS Total = a + b + c + d = 22
Simple regression on X1 (family size): $\hat{y} = 2.87 + .97X_1$
SSR = a + c = 16.5
$r^2$ = ?
$r^2$ = (a + c) / (a + b + c + d) = 16.5 / 22 = 0.75
What do we call the square root of this?
Pearson/simple correlation of Y with X1 (not controlling for X2):
$r_{yx_1} = \sqrt{\frac{a + c}{a + b + c + d}} = \sqrt{\frac{16.5}{22}} = \sqrt{0.75} = 0.866$
Simple regression on X2 (family income): $\hat{y} = -0.063 + .398X_2$
SSR = b + c = 15.12
$r^2$ = (b + c) / (a + b + c + d) = 15.12 / 22 = 0.687
Pearson/simple correlation of Y with X2 (not controlling for X1):
$r_{yx_2} = \sqrt{\frac{b + c}{a + b + c + d}} = \sqrt{\frac{15.12}{22}} = \sqrt{0.687} = 0.829$
Multiple regression on X1 and X2: $\hat{y} = .482 + .63x_1 + .216x_2$
Graphically = ?
NOTE: c is explained by both X1 and X2.
SSR = a + b + c = 18.95
SST = a + b + c + d = 22
$R^2$ = SSR / SST = (a + b + c) / (a + b + c + d) = 18.95 / 22 = 86%
SSE = ?
SSE = d = 22 - 18.95 = 3.05
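The four regions of the diagram follow from the three regression sums of squares by simple subtraction. The sketch below takes its inputs from this example (16.514 for X1 alone, 15.12 for X2 alone, 18.95 for both together); only the arithmetic is new, and the region names a, b, c, d follow the diagram.

```python
sst = 22.0           # a + b + c + d
ssr_size = 16.514    # a + c: explained by family size alone in a simple regression
ssr_income = 15.12   # b + c: explained by family income alone in a simple regression
ssr_both = 18.95     # a + b + c: explained by the two-predictor model

a = ssr_both - ssr_income              # explained only by family size
b = ssr_both - ssr_size                # explained only by family income
c = ssr_size + ssr_income - ssr_both   # shared: explained by both predictors
d = sst - ssr_both                     # unexplained (SSE)

print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}, d = {d:.3f}")
```

Region b (about 2.44) is the unique contribution of family income, which matches the 5.5 - 3.05 = 2.45 figure up to rounding.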
Remember:
SSE using family size only (X1) = 5.5
SSE using family size and family income (X1 and X2) = $\sum (y - \hat{y})^2$ = 3.05
Unique (additional) contribution of X2 = 5.5 - 3.05 = 2.45
Exercise 1: Redo the credit card
analysis with SPSS.
First, Correlations and Simple Regression
Next, Multiple Regression (also ask for part
and partial correlations.)
SPSS CREDIT CARD FILE
EXERCISE 2:
Using the gss_2 data file, we are interested in
understanding the role that the following demographics (age, educ, sibs,
agewed), as well as respondent income (rincmdol), job satisfaction (satjob_2),
and marriage satisfaction (hapmar_2), play in determining/predicting one's
general happiness (happy_2).
We also wish to know which of the above variables is the strongest predictor of
general happiness (Standardized Reg. Coefficients).
Use the gss_2 data file and conduct the appropriate analysis.
NOTE:
satjob_2 is coded as:
1 = Very Dissatisfied
2 = A Little Dissatisfied
3 = Pretty Satisfied
4 = Very Satisfied
hapmar_2 is coded as:
1 = Not Too Happy
2 = Pretty Happy
3 = Very Happy
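Exercise 2 relies on standardized regression coefficients to rank predictors. As a hedged illustration on the credit card example (not the gss_2 data, which is not reproduced here), the sketch below converts the unstandardized coefficients to standardized ones with the usual relation beta_j = b_j × (s_xj / s_y). Helper names are illustrative.

```python
import math

cards = [4, 6, 6, 7, 8, 7, 8, 10]            # y: actual number of credit cards
size = [2, 2, 4, 4, 5, 5, 6, 6]              # x1: family size
income = [14, 16, 14, 17, 18, 21, 17, 25]    # x2: family income ($000)


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


s11, s22, s12 = cross(size, size), cross(income, income), cross(size, income)
s1y, s2y, syy = cross(size, cards), cross(income, cards), cross(cards, cards)

# Unstandardized coefficients (closed form for two predictors).
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Standardized coefficients: the (n - 1) terms in the standard deviations cancel.
beta1 = b1 * math.sqrt(s11 / syy)
beta2 = b2 * math.sqrt(s22 / syy)

print(f"beta(size) = {beta1:.3f}, beta(income) = {beta2:.3f}")
```

Because standardized coefficients are unit-free, they can be compared across predictors measured on different scales; here family size has the larger one.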
Interpreting Regression Results
Ho: $R^2$ = 0. That is, there is NO RELATIONSHIP between the DV
and ANY OF the IVs included in the regression model.
1. Is the overall F significant (i.e., significance level < 0.05)?
No: Don't reject Ho; no independent variable has a
significant relationship with the dependent variable. Stop.
Yes: Reject Ho; one or more independent
variables are significantly related to the dependent variable.
2. Which independent variable(s) have significant relationships with the
dependent variable? In the Coefficients table, look up the result of the t-test for
each independent variable's regression coefficient (b). Ho for the t-test of a given
variable hypothesizes that the coefficient b = 0. That is, there is no
relationship between the corresponding independent variable and the
dependent variable. If a t-test's significance level is < 0.05, reject the null and
conclude that the corresponding variable has a significant relationship with the
dependent variable.
3. Look up the sign of the regression coefficient (b) ONLY FOR
those independent variables that are found to have a significant
relationship with the dependent variable (i.e., those
with significance level < 0.05), and state your conclusions accordingly.
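The three-step reading of the output can be sketched as a small decision procedure. Everything in the block below is hypothetical: `interpret_regression` is an illustrative helper, and the significance values and coefficients passed to it are made-up numbers, not results from any data set in this course.

```python
def interpret_regression(overall_p, coefficients, alpha=0.05):
    """Apply the three interpretation steps.

    coefficients maps a variable name to (b, p_value) from the Coefficients table.
    """
    # Step 1: overall F-test.
    if overall_p >= alpha:
        return ["Overall F not significant: do not reject Ho; stop."]
    conclusions = ["Overall F significant: reject Ho."]

    for name, (b, p_value) in coefficients.items():
        # Step 2: t-test for each coefficient.
        if p_value < alpha:
            # Step 3: read the sign only for significant predictors.
            direction = "positive" if b > 0 else "negative"
            conclusions.append(f"{name}: significant {direction} relationship (b = {b}).")
        else:
            conclusions.append(f"{name}: no significant relationship.")
    return conclusions


# Hypothetical output values, for illustration only.
example = interpret_regression(0.001, {"educ": (0.12, 0.003), "sibs": (-0.02, 0.41)})
for line in example:
    print(line)
```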
Regression Analysis Using Categorical Variables:
General Rule: Categorical variables should NOT be used in multiple
regression since interpretation of the variable's regression coefficient becomes
nonsensical.
EXAMPLE: Income = 24000 + 1400 × Political Party
Coded: Democrat = 1, Republican = 2, Independent = 3, Other = 4
Exception to the above Rule: Dummy variables (i.e., categorical
variables representing only two groups, such as gender, when coded as 0 and 1)
can be used as independent variables in regression analysis. The reason is that a
dummy variable's values (0, 1) can go up or down by only 1 unit, signifying a
change from one group to another.
EXAMPLE: Income = 24000 + 1400 × Gender
Coded: Female = 0, Male = 1
Meaning?
Note: A dummy variable's regression coefficient represents the
average difference in the value of the dependent variable between the
two groups represented by the dummy variable.
EXAMPLE 1:
Coded: Female = 0, Male = 1
Income = 24000 + 1400 × Gender
Meaning?
24000: Average income of females is $24,000.
1400: Males on average make $1,400 more than females.
MULTIPLE REGRESSION EXAMPLE 2:
Coded: Female = 0, Male = 1
Income = 12000 + 1000 × Education Years + 800 × Gender
Meaning?
12000: Average income of females with no education is $12,000.
1000: Among people of the same gender, every
additional year of education results in an
average additional income of $1,000.
800: Males make, on average, $800 more in
comparison with females who have the
same number of years of education.
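The interpretation of a dummy variable's coefficient can be verified numerically. The sketch below fits a simple regression of income on a 0/1 gender dummy; the six incomes are invented for illustration so that the group means come out to $24,000 and $25,400, matching Example 1.

```python
# Hypothetical incomes, invented for illustration only.
gender = [0, 0, 0, 1, 1, 1]                            # Female = 0, Male = 1
income = [22000, 24000, 26000, 24400, 25400, 26400]

n = len(income)
x_bar = sum(gender) / n
y_bar = sum(income) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gender, income))
s_xx = sum((x - x_bar) ** 2 for x in gender)

b = s_xy / s_xx          # coefficient of the dummy variable
a = y_bar - b * x_bar    # intercept

female_mean = sum(y for x, y in zip(gender, income) if x == 0) / gender.count(0)
male_mean = sum(y for x, y in zip(gender, income) if x == 1) / gender.count(1)

print(f"Income = {a:.0f} + {b:.0f} x Gender")
print(f"female mean = {female_mean:.0f}, male - female = {male_mean - female_mean:.0f}")
```

The intercept equals the mean of the group coded 0 and the coefficient equals the difference between the two group means, which is exactly the reading given in Example 1.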
Exercise 3: Suppose we are interested
in knowing what role, if any, the following
demographic characteristics play in
determining one's income (rincmdol):
Age,
Sex_Dummy (0=male, 1=female),
Age first married (agewed),
Years of education completed (educ), and
Political party affiliation (republic: 0=Democrat, 1=Republican).
Use the gss_2 data file and conduct the
appropriate analysis.
Exercise 4: Suppose we are interested in
knowing what role, if any, demographic
characteristics (i.e., age, sex_Dummy,
educ, sibs, agewed, incomdol), as well as
job satisfaction (satjob_2) and marriage
satisfaction (hapmar_2), play in determining
one's overall happiness in life (happy_2).
Use the gss_2 data file and conduct the
appropriate analysis.
Assignment 5
Data file Salary.sav contains information about 474 employees hired by a Midwestern bank
between 1969 and 1971 (NOTE: Due to SPSS site license restrictions, this hyperlink will
not work if you are off campus). Of the 474 employees, 258 were men, 216 women, 370
white, and 104 non-white. The bank was subsequently involved in EEOC litigation; the
bank was accused of gender and race discrimination in its hiring and compensation
practices. The two issues that were of particular interest in the litigation were alleged
gender and racial inequalities not only in the bank's beginning salaries (variable salbeg),
but also in its later salaries (variable salnow).
1. Print, examine, and interpret correlation coefficients between beginning salary
(salbeg) and age in years (age), education in years (edlevel), employment category or job
classification level (jobcat, rated from 1=lowest to 8=highest), and work experience in
months (work).
2. Conduct the appropriate analysis to see: (a) What role did each of the variables age,
education (edlevel), employment category (jobcat), and work experience (work) play,
holding all other variables constant, in determining the bank's beginning salaries? For
example, what was the differential pay for one additional year of education among new
hires who otherwise had the same age, employment category, and work experience? (b)
Which of the above demographic characteristics had the strongest influence on beginning
pay? How can you tell? (c) What percent of the differences in employees' beginning
salaries can be explained by/attributed to differences in all of the above characteristics?
3. Now conduct the appropriate analysis to indicate, holding all other variables
constant, what role gender (sex, male=0, female=1) played in determining beginning
salaries at the bank. That is, what was the differential beginning pay between male and
female employees who otherwise had the same age, education, employment category, and
work experience? Does this evidence support the charges of gender discrimination in the
bank's practices regarding initial compensation?
4. During litigation, it was charged that the bank's unfair compensation practices had
continued beyond its initial salary decisions. That is, the prosecution claimed that with
time, the beginning salary disparities between men and women not only did not shrink but
widened further. Conduct the appropriate analysis to indicate (a) everything else being
equal, what role gender played in determining employees' later salaries at the bank
(salnow). That is, what was the average differential pay between male and female
employees who otherwise had the same age, education, employment category, work
experience, and job seniority (variable time represents seniority in terms of number of
months employed at the bank)? (b) Compare the later pay disparities you have just
identified with the beginning pay disparities you had found in question 3 above to explain
whether the evidence supports the prosecution's charges of continued gender discrimination
beyond initial salary decisions, resulting in widening disparities in later pay.
NOTE: For each question, provide thorough explanations on corresponding pages and
parts of your printout.
QUESTIONS OR COMMENTS?