Lecture Notes 3
Confidence Intervals
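• A large-sample confidence interval for log θ is given by
\[ \log\hat\theta \pm z_{\alpha/2}\,\mathrm{ASE}(\log\hat\theta), \qquad \mathrm{ASE}(\log\hat\theta) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \]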
Example: Aspirin Usage
• Sample Odds Ratio = 1.832
• Sample log odds ratio (natural logarithm): log θ̂ = log(1.832) = 0.605
• ASE of log θ̂:
\[ \mathrm{ASE}(\log\hat\theta) = \sqrt{\frac{1}{189} + \frac{1}{10933} + \frac{1}{10845} + \frac{1}{104}} = 0.123 \]
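The interval for this example can be worked through numerically. The sketch below assumes the four cell counts are 189, 10845, 104 and 10933 (the values that reproduce the quoted odds ratio 1.832 and ASE 0.123) and assumes the row layout placebo/aspirin, which is not shown in these notes. The 95% level is also an assumption.

```r
# Assumed cell counts for the aspirin example; they reproduce
# the quoted odds ratio (1.832) and ASE (0.123).
n11 <- 189; n12 <- 10845   # assumed: placebo row (MI, no MI)
n21 <- 104; n22 <- 10933   # assumed: aspirin row (MI, no MI)

theta_hat <- (n11 * n22) / (n12 * n21)        # sample odds ratio
log_theta <- log(theta_hat)                   # natural logarithm
ase <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)    # ASE of the log odds ratio

z <- qnorm(0.975)                             # critical value for a 95% interval
ci_log   <- log_theta + c(-1, 1) * z * ase    # interval for log(theta)
ci_theta <- exp(ci_log)                       # back-transformed interval for theta

round(c(theta_hat = theta_hat, log_theta = log_theta, ase = ase), 3)
round(ci_log, 3)
round(ci_theta, 2)
```

Because the interval for θ excludes 1, the data suggest that the true odds of MI differ between the two groups.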
Observations
• A sample odds ratio of 1.832 does not mean that p1 is 1.832
times p2.
• A simple relation:
\[ \text{Odds Ratio} = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} = \text{Relative Risk} \times \frac{1 - p_2}{1 - p_1} \]
• If p1 and p2 are close to 0, the odds ratio and relative risk take
similar values.
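• This relationship is useful when the relative risk cannot be
estimated directly (e.g., in case-control studies): for a rare
outcome, the odds ratio approximates the relative risk.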
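A small numerical check of the relation between the odds ratio and the relative risk, using the aspirin counts. The table layout (placebo: 189 MI, 10845 no MI; aspirin: 104 MI, 10933 no MI) is an assumption, since the table itself does not appear in these notes.

```r
# Assumed layout of the aspirin table (not shown in these notes)
mi    <- c(placebo = 189,   aspirin = 104)
no_mi <- c(placebo = 10845, aspirin = 10933)

p  <- mi / (mi + no_mi)          # proportion with MI in each group
p1 <- p[["placebo"]]
p2 <- p[["aspirin"]]

relative_risk <- p1 / p2
odds_ratio    <- (p1 / (1 - p1)) / (p2 / (1 - p2))

# Right-hand side of the identity: relative risk * (1 - p2) / (1 - p1)
identity_rhs <- relative_risk * (1 - p2) / (1 - p1)

round(c(p1 = p1, p2 = p2, relative_risk = relative_risk,
        odds_ratio = odds_ratio), 4)
all.equal(odds_ratio, identity_rhs)
```

Both proportions are below 0.02, so the factor (1 − p2)/(1 − p1) is close to 1 and the two measures nearly coincide.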
Ever Smoker    Myocardial Infarction    Controls
Some Properties
• The $\chi^2$ statistic takes its minimum value of zero when all
$n_{ij} = \mu_{ij}$.
• For a fixed sample size, greater differences between $\{n_{ij}\}$ and
$\{\mu_{ij}\}$ produce larger $\chi^2$ values and stronger evidence against
$H_0$.
• The $\chi^2$ statistic has approximately a chi-squared distribution
with appropriate degrees of freedom for large sample sizes.
Properties
• The likelihood ratio $\Lambda$ cannot exceed 1.
• A small likelihood ratio is evidence against $H_0$.
• The likelihood-ratio test statistic is $-2\log\Lambda$, which for
large samples has approximately a chi-squared distribution with
appropriate degrees of freedom.
• For a two-way contingency table, this statistic reduces to
\[ G^2 = 2 \sum_{i}\sum_{j} n_{ij} \log\left(\frac{n_{ij}}{\mu_{ij}}\right) \]
Test Statistics
• Pearson’s chi-square test statistic
\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}} \]
                   Party Identification
Gender    Democrat   Independent   Republican   Total
Female         279            73          225     577
Male           165            47          191     403
Total          444           120          416     980
Example: Continued · · ·
• The test statistics are $\chi^2 = 7.01$ and $G^2 = 7.00$.
• Degrees of freedom = (I − 1)(J − 1) = (2 − 1)(3 − 1) = 2.
• p-value = 0.03 for both statistics.
• Thus, at the 5% level, both test statistics give evidence that
party identification and gender are associated.
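The two statistics, the degrees of freedom and the p-values can be reproduced by hand from the gender-by-party counts. This sketch assumes the usual estimated expected frequencies under independence, µ̂ij = ni+ n+j / n, which these notes do not show explicitly. The names `n_obs` and `mu_hat` are illustrative.

```r
# Observed counts from the gender-by-party table
n_obs <- matrix(c(279, 73, 225,
                  165, 47, 191),
                nrow = 2, byrow = TRUE,
                dimnames = list(Gender  = c("Female", "Male"),
                                PartyID = c("Democrat", "Independent",
                                            "Republican")))

n <- sum(n_obs)
# Assumed estimate of expected frequencies under independence
mu_hat <- outer(rowSums(n_obs), colSums(n_obs)) / n

X2 <- sum((n_obs - mu_hat)^2 / mu_hat)        # Pearson statistic
G2 <- 2 * sum(n_obs * log(n_obs / mu_hat))    # likelihood-ratio statistic

df   <- (nrow(n_obs) - 1) * (ncol(n_obs) - 1)
p_X2 <- pchisq(X2, df, lower.tail = FALSE)
p_G2 <- pchisq(G2, df, lower.tail = FALSE)

round(c(X2 = X2, G2 = G2, df = df, p_X2 = p_X2, p_G2 = p_G2), 4)
```

The values should agree with the SAS and R output shown elsewhere in these notes, up to rounding.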
Output
The FREQ Procedure
Table of Gender by Party
Gender Party
Frequency
Expected Democrat Independent Republican Total
Output
Statistics for Table of Gender by Party
Statistic DF Value Prob
-------------------------------------------------
Chi-Square 2 7.0095 0.0301
Likelihood Ratio Chi-Square 2 7.0026 0.0302
Mantel-Haenszel Chi-Square 1 6.7581 0.0093
Phi Coefficient 0.0846
Contingency Coefficient 0.0843
Cramer’s V 0.0846
Sample Size = 980
R Codes
> gendergap <- matrix(c(279, 73, 225, 165, 47, 191),
+                     byrow = TRUE, ncol = 3)
> dimnames(gendergap) <-
+   list(Gender = c("Female", "Male"),
+        PartyID = c("Democrat", "Independent", "Republican"))
> gendergap
        PartyID
Gender   Democrat Independent Republican
  Female      279          73        225
  Male        165          47        191
R Codes
> chisq.test(gendergap)
An Alternative Way
> Gender <- c("Female", "Female", "Female", "Male", "Male", "Male")
> Party <- c("Democrat", "Independent", "Republican",
+            "Democrat", "Independent", "Republican")
> count <- c(279, 73, 225, 165, 47, 191)
> gender1 <- data.frame(Gender, Party, count)
> gender <- xtabs(count ~ Gender + Party, data = gender1)
> gender
> summary(gender)
Output
Party
Gender Democrat Independent Republican
Female 279 73 225
Male 165 47 191
Call: xtabs(formula = count ~ Gender + Party,
data = gender1)
Number of cases in table: 980
Number of factors: 2
Test for independence of all factors:
Chisq = 7.01, df = 2, p-value = 0.03005
Residuals (p.31)
• To understand better the nature of evidence against H0, a cell
by cell comparison of observed and estimated frequencies is
necessary.
• Define the adjusted residuals
\[ r_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}\,(1 - p_{i+})(1 - p_{+j})}} \]
Computing Residuals in R
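> # 'gender' is the Gender-by-Party table built with xtabs() earlier
> n <- sum(gender)                       # total sample size
> rowp <- rowSums(gender)/n              # row marginal proportions
> colp <- colSums(gender)/n              # column marginal proportions
> gd <- n * outer(rowp, colp)            # estimated expected frequencies
> pd <- outer(1 - rowp, 1 - colp)        # (1 - p_i+)(1 - p_+j)
> resid <- (gender - gd)/sqrt(gd * pd)   # adjusted residuals
> resid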
Residuals Output
Party
Gender Democrat Independent Republican
Female 2.2931603 0.4647941 -2.6177798
Male -2.2931603 -0.4647941 2.6177798
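As a cross-check on the residuals output, base R's `chisq.test()` returns the same quantities in its `stdres` component (standardized residuals). A minimal sketch, rebuilding the table so the block runs on its own:

```r
# Gender-by-party counts, rebuilt here so the block is self-contained
gendergap <- matrix(c(279, 73, 225, 165, 47, 191),
                    byrow = TRUE, ncol = 3,
                    dimnames = list(Gender  = c("Female", "Male"),
                                    PartyID = c("Democrat", "Independent",
                                                "Republican")))

fit <- chisq.test(gendergap)
fit$expected            # estimated expected frequencies
round(fit$stdres, 4)    # standardized (adjusted) residuals
```

Residuals larger than about 2 in absolute value point to cells that depart from independence: here the Democrat and Republican columns.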
Alcohol                 Malformation
Consumption    Absent(0)    Present(1)     Total
0                 17,066            48    17,114
<1                14,464            38    14,502
1-2                  788             5       793
3-5                  126             1       127
≥6                    37             1        38
Alcohol                 Malformation                    Percent    Adjusted
Consumption    Absent(0)    Present(1)     Total    Present    Residual
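The two derived columns named in this header can be recomputed from the malformation counts. The sketch below uses the adjusted-residual formula from the residuals slide and assumes µ̂ij = n pi+ p+j under independence. The object names are illustrative.

```r
# Counts from the malformation table (rows: alcohol consumption)
tab <- matrix(c(17066, 48,
                14464, 38,
                  788,  5,
                  126,  1,
                   37,  1),
              ncol = 2, byrow = TRUE,
              dimnames = list(Alcohol = c("0", "<1", "1-2", "3-5", ">=6"),
                              Malformation = c("Absent", "Present")))

n    <- sum(tab)
rowp <- rowSums(tab) / n               # row marginal proportions
colp <- colSums(tab) / n               # column marginal proportions
mu_hat <- n * outer(rowp, colp)        # assumed expected frequencies

pct_present <- 100 * tab[, "Present"] / rowSums(tab)
adj_res <- (tab - mu_hat) / sqrt(mu_hat * outer(1 - rowp, 1 - colp))

round(cbind(Total = rowSums(tab),
            PercentPresent = pct_present,
            AdjResidPresent = adj_res[, "Present"]), 2)
```

The percentage present rises with alcohol consumption, but the counts in the last three rows are very small.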
SAS Codes
data infants;
input malform alcohol count @@;
datalines;
1 0 17066 2 0 48
1 0.5 14464 2 0.5 38
1 1.5 788 2 1.5 5
1 4.0 126 2 4.0 1
1 7.0 37 2 7.0 1
;
run;
proc format;
value malform 2='Present' 1='Absent';
value Alcohol 0='0' 0.5='<1' 1.5='1-2' 4.0='3-5'
              7.0='>=6';
run;
SAS Codes
proc freq data = infants;
format malform malform. alcohol alcohol.;
weight count;
tables alcohol*malform / chisq cmh1 norow
nocol nopercent;
run;
Partial Output
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
-------------------------------------------------------
1 Nonzero Correlation 1 6.5699 0.0104
Notes
• The correlation r has limited use as a descriptive measure of
tables.
• Different choices of monotone scores usually give similar results.
• This may fail, however, when the data are very unbalanced,
i.e., when some categories have many more observations than
others.
• If we had taken (1, 2, 3, 4, 5) as the row scores in our example,
then $M^2 = 1.83$ with p-value 0.18, a much weaker conclusion
than the $M^2 = 6.57$ (p-value 0.01) obtained with the scores
(0, 0.5, 1.5, 4.0, 7.0).
• It is usually better to use one’s own judgment by selecting
scores that reflect distances between categories.
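The sensitivity to the choice of scores can be checked directly. The sketch below assumes the standard definition M² = (n − 1) r², where r is the correlation between row and column scores weighted by the cell counts. That formula is not written out in these notes. `m2_stat()` is a hypothetical helper, not a function from any package.

```r
# Counts from the malformation table
# (rows: alcohol level; columns: absent, present)
counts <- matrix(c(17066, 48,
                   14464, 38,
                     788,  5,
                     126,  1,
                      37,  1),
                 ncol = 2, byrow = TRUE)

# Hypothetical helper: M^2 = (n - 1) * r^2 for row scores u, column scores v
m2_stat <- function(counts, u, v = c(0, 1)) {
  n  <- sum(counts)
  w  <- as.vector(counts)             # cell counts, column-major order
  uu <- u[as.vector(row(counts))]     # row score attached to each cell
  vv <- v[as.vector(col(counts))]     # column score attached to each cell
  ubar <- sum(w * uu) / n
  vbar <- sum(w * vv) / n
  r <- sum(w * (uu - ubar) * (vv - vbar)) /
    sqrt(sum(w * (uu - ubar)^2) * sum(w * (vv - vbar)^2))
  M2 <- (n - 1) * r^2
  c(r = r, M2 = M2, p.value = pchisq(M2, df = 1, lower.tail = FALSE))
}

midpoint <- m2_stat(counts, u = c(0, 0.5, 1.5, 4, 7))  # scores from the SAS data step
equal_sp <- m2_stat(counts, u = 1:5)                   # equally spaced scores
round(rbind(midpoint, equal_sp), 4)
```

The two rows should match the two CMH "Nonzero Correlation" values in the SAS output. Shifting the scores by a constant, for example using 0 to 4 instead of 1 to 5, leaves r and M² unchanged.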
SAS Codes
data infantsx;
input malform alcoholx count @@;
datalines;
1 0 17066 2 0 48
1 1 14464 2 1 38
1 2 788 2 2 5
1 3 126 2 3 1
1 4 37 2 4 1
;
run;
proc freq data = infantsx;
weight count;
tables alcoholx*malform / cmh1 norow nocol nopercent;
run;
Partial Output
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
--------------------------------------------------------
1 Nonzero Correlation 1 1.8278 0.1764
              Guess
Poured     Milk    Tea    Total
Milk          3      1        4
Tea           1      3        4
Total         4      4        8
Partial Output
R Codes
> Poured <- c("Milk", "Milk", "Tea", "Tea")
> Guess <- c("Milk", "Tea", "Milk", "Tea")
> count <- c(3, 1, 1, 3)
> teadata <- data.frame(Poured, Guess, count)
> tea <- xtabs(count ~ Poured + Guess, data = teadata)
> fisher.test(tea, alternative = "greater")
Output
        Fisher's Exact Test for Count Data

data:  tea
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio
  6.408309
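The one-sided p-value can be verified from the hypergeometric distribution. With both margins fixed at 4, the count n11 (milk-first cups guessed as milk-first) determines the whole table, and the p-value is P(n11 ≥ 3). A minimal sketch with illustrative variable names:

```r
# Observed count in the (Milk, Milk) cell of the tea table
n11_obs <- 3

# Hypergeometric probabilities P(n11 = 0), ..., P(n11 = 4):
# 4 milk-first cups, 4 tea-first cups, 4 cups guessed as milk-first
p_each <- dhyper(0:4, m = 4, n = 4, k = 4)

# One-sided p-value: probability of a table at least as extreme
p_value <- sum(p_each[(0:4) >= n11_obs])
p_value

# Equivalent upper-tail call: P(n11 > 2)
phyper(2, m = 4, n = 4, k = 4, lower.tail = FALSE)
```

The result equals (16 + 1)/70, matching the `fisher.test()` output. The odds ratio reported by `fisher.test()` is the conditional maximum likelihood estimate, which is why it differs from the sample odds ratio (3 × 3)/(1 × 1) = 9.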