ST3241 Categorical Data Analysis I


Two-way Contingency Tables

Odds Ratio and Tests of Independence


Inference For Odds Ratio (p. 24)


• For small to moderate sample sizes, the distribution of the sample
odds ratio θ̂ is highly skewed.
• When θ = 1, θ̂ cannot be much smaller than θ, but it can be much
larger than θ with non-negligible probability.
• Consider instead the log odds ratio, log θ.
• Independence of X and Y implies log θ = 0.


Log Odds Ratio


• The log odds ratio is symmetric about zero: reversing the rows or
reversing the columns only changes its sign.
• The sample log odds ratio log θ̂ has a less skewed distribution
and is well approximated by the normal distribution.
• The asymptotic standard error of log θ̂ is given by

ASE(log θ̂) = √( 1/n11 + 1/n12 + 1/n21 + 1/n22 )


Confidence Intervals
• A large sample confidence interval for log θ is given by

log(θ̂) ± zα/2 ASE(log θ̂)

• A large sample confidence interval for θ is given by

exp[log(θ̂) ± zα/2 ASE(log θ̂)]

Example: Aspirin Usage
• Sample Odds Ratio = 1.832
• Sample log odds ratio: log θ̂ = log(1.832) = 0.605
• ASE of log θ̂:

√( 1/189 + 1/10933 + 1/10845 + 1/104 ) = 0.123

• 95% confidence interval for log θ:

0.605 ± 1.96 × 0.123 = (0.365, 0.846)

• The corresponding confidence interval for θ is
(e^0.365, e^0.846) = (1.44, 2.33).

Recall SAS Output


Estimates of the Relative Risk (Row1/Row2)
Type of Study               Value    95% Confidence Limits

Case-Control (Odds Ratio)   1.8321   1.4400   2.3308
Cohort (Col1 Risk)          1.8178   1.4330   2.3059
Cohort (Col2 Risk)          0.9922   0.9892   0.9953

Sample Size = 22071


A Simple R Function For Odds Ratio


> odds.ratio <- function(x, pad.zeros=FALSE, conf.level=0.95) {
    # optionally add 0.5 to every cell when any count is zero
    if(pad.zeros) {
      if(any(x==0)) x <- x + 0.5
    }
    theta <- x[1,1]*x[2,2]/(x[2,1]*x[1,2])  # sample odds ratio
    ASE <- sqrt(sum(1/x))                   # ASE of log odds ratio
    CI <- exp(log(theta) +
      c(-1,1)*qnorm(0.5*(1+conf.level))*ASE)
    list(estimator=theta, ASE=ASE,
      conf.interval=CI, conf.level=conf.level) }
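
As a quick check, applying this function to the aspirin table from the
earlier example (rows: placebo, aspirin; columns: MI yes, MI no)
reproduces the hand computation:

> x <- matrix(c(189, 10845, 104, 10933), byrow=TRUE, ncol=2)
> odds.ratio(x)
# estimator = 1.832, ASE = 0.123, conf.interval = (1.44, 2.33)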


Notes (p. 25)


• Recall the formula for the sample odds ratio:

θ̂ = (n11 n22) / (n12 n21)

• The sample odds ratio is 0 or ∞ if any nij = 0, and it is
undefined if both entries in a row or column are zero.
• Consider the slightly modified formula

θ̃ = [(n11 + 0.5)(n22 + 0.5)] / [(n12 + 0.5)(n21 + 0.5)]

• In the ASE formula, too, each nij is replaced by nij + 0.5.
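
For instance, with a hypothetical 2 × 2 table containing a zero cell,
the pad.zeros option of the odds.ratio function above applies exactly
this adjustment:

> x <- matrix(c(12, 0, 8, 5), byrow=TRUE, ncol=2)
> odds.ratio(x)                  # estimator = Inf, interval undefined
> odds.ratio(x, pad.zeros=TRUE)  # theta-tilde = 16.18, finite ASE and CI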


Observations
• A sample odds ratio of 1.832 does not mean that p1 is 1.832 times p2.
• A simple relation:

Odds Ratio = [p1/(1 − p1)] / [p2/(1 − p2)] = Relative Risk × (1 − p2)/(1 − p1)

• If p1 and p2 are close to 0, the odds ratio and the relative risk
take similar values.
• This relationship between the odds ratio and the relative risk is
useful, as the next example shows.


Example: Smoking Status and Myocardial Infarction

Ever          Myocardial
Smoker        Infarction   Controls
Yes              172          173
No                90          346

• Odds Ratio = ? (3.8222)
• How do we get the relative risk? (2.4152; see the sketch below)
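
A quick numerical check of the relation in R, treating the rows as the
groups being compared:

> p1 <- 172/(172+173)        # proportion of cases among smokers
> p2 <- 90/(90+346)          # proportion of cases among non-smokers
> RR <- p1/p2                # 2.4152
> OR <- RR * (1-p2)/(1-p1)   # 3.8222, matching the odds ratio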


Chi-Squared Tests (p.27)


• To test H0 that the cell probabilities equal certain fixed values
{πij }.
• Let {nij } be the cell frequencies and n be the total sample size.
• Then µij = nπij are the expected cell frequencies under H0 .
• Pearson's (1900) chi-squared test statistic, summing over all cells:

χ² = Σ (nij − µij)² / µij
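
For instance, a minimal sketch in R with hypothetical counts, testing
that all four cell probabilities of a 2 × 2 table equal 0.25 (the table
is flattened to a vector):

> n <- c(30, 20, 25, 25)          # observed cell counts
> chisq.test(n, p=rep(0.25, 4))   # X-squared = 2, df = 3, p-value = 0.57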


Some Properties
• This statistic takes its minimum value of zero when all
nij = µij .
• For a fixed sample size, greater differences between {nij } and
{µij } produce larger χ2 values and stronger evidence against
H0 .
• The χ2 statistic has approximately a chi-squared distribution
with appropriate degrees of freedom for large sample sizes.


Likelihood-Ratio Test (p. 28)


• The likelihood ratio:

Λ = (maximum likelihood when H0 is true) /
    (maximum likelihood when parameters are unrestricted)

• In mathematical notation, if L(θ) denotes the likelihood function
with θ as the set of parameters, and the null hypothesis is
H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1, the likelihood ratio is given by

Λ = sup{θ∈Θ0} L(θ) / sup{θ∈Θ0∪Θ1} L(θ)


Properties
• The likelihood ratio cannot exceed 1.
• A small likelihood ratio indicates deviation from H0.
• The likelihood-ratio test statistic is −2 log Λ, which has a
chi-squared distribution with appropriate degrees of freedom
for large samples.
• For a two-way contingency table, this statistic reduces to

G² = 2 Σ nij log(nij / µij)

• The test statistics χ² and G² have the same large-sample
distribution under the null hypothesis.


Tests of Independence (p.30)


• To test: H0 : πij = πi+ π+j for all i and j.
• Equivalently, H0 : µij = nπij = nπi+ π+j .
• Usually, {πi+ } and {π+j } are unknown.
• We estimate them using the sample proportions:

µ̂ij = n pi+ p+j = n (ni+/n)(n+j/n) = ni+ n+j / n

• These {µ̂ij} are called the estimated expected cell frequencies.


Test Statistics
• Pearson's chi-squared test statistic:

χ² = Σi Σj (nij − µ̂ij)² / µ̂ij

• Likelihood-ratio test statistic:

G² = 2 Σi Σj nij log(nij / µ̂ij)

where the sums run over i = 1, …, I and j = 1, …, J.

• Both have a large-sample chi-squared distribution with
(I − 1)(J − 1) degrees of freedom.


Party Identification By Gender (p.31)

                     Party Identification
Gender     Democrat   Independent   Republican   Total
Females    279        73            225          577
           (261.4)    (70.7)        (244.9)
Males      165        47            191          403
           (182.6)    (49.3)        (171.1)
Total      444        120           416          980

(Estimated expected frequencies under independence in parentheses.)


Example: Continued
• The test statistics are: χ2 = 7.01 and G2 = 7.00
• Degrees of freedom = (I − 1)(J − 1) = (2 − 1)(3 − 1) = 2.
• p-value = 0.03.
• Thus, the above test statistics suggest that party identification
and gender are associated.
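
These values can be reproduced directly from the definitions; a
minimal sketch in R:

> n <- matrix(c(279, 73, 225, 165, 47, 191), byrow=TRUE, ncol=3)
> mu <- outer(rowSums(n), colSums(n)) / sum(n)  # estimated expected counts
> sum((n - mu)^2 / mu)                    # Pearson chi-squared = 7.0095
> 2 * sum(n * log(n / mu))                # likelihood ratio G^2 = 7.0026
> pchisq(7.0095, df=2, lower.tail=FALSE)  # p-value = 0.0301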


SAS Codes: Read The Data


data Survey;
length Party $ 12;
input Gender $ Party $ count;
datalines;
Female Democrat 279
Female Independent 73
Female Republican 225
Male Democrat 165
Male Independent 47
Male Republican 191
;
run;


SAS Codes: Use Proc Freq


proc freq data=survey order=data;
weight count;
tables gender*party / chisq expected
nopercent norow nocol;
run;


Output
The FREQ Procedure
Table of Gender by Party

Gender      Party
Frequency
Expected    Democrat   Independent   Republican   Total
Female      279        73            225          577
            261.42     70.653        244.93
Male        165        47            191          403
            182.58     49.347        171.07
Total       444        120           416          980


Output
Statistics for Table of Gender by Party
Statistic DF Value Prob
-------------------------------------------------
Chi-Square 2 7.0095 0.0301
Likelihood Ratio Chi-Square 2 7.0026 0.0302
Mantel-Haenszel Chi-Square 1 6.7581 0.0093
Phi Coefficient 0.0846
Contingency Coefficient 0.0843
Cramer’s V 0.0846
Sample Size = 980


R Codes
>gendergap<-matrix(c(279,73,225,165,47,191),
byrow=T,ncol=3)
>dimnames(gendergap)<-
list(Gender=c("Female","Male"),
PartyID=c("Democrat","Independent",
"Republican"))
>gendergap
PartyID
Gender Democrat Independent Republican
Female 279 73 225
Male 165 47 191


R Codes
>chisq.test(gendergap)

Pearson’s Chi-squared test


data: gendergap
X-squared = 7.0095, df = 2, p-value = 0.03005


An Alternative Way
>Gender<-c("Female","Female","Female","Male",
"Male","Male")
>Party<-c("Democrat","Independent", "Republican",
"Democrat","Independent", "Republican")
>count<-c(279,73,225,165,47,191)
>gender1<-data.frame(Gender,Party,count)
>gender<-xtabs(count~Gender+Party, data=gender1)
>gender
>summary(gender)


Output

Party
Gender Democrat Independent Republican
Female 279 73 225
Male 165 47 191
Call: xtabs(formula = count ~ Gender + Party,
data = gender1)
Number of cases in table: 980
Number of factors: 2
Test for independence of all factors:
Chisq = 7.01, df = 2, p-value = 0.03005


Table of Expected Cell Counts


> rowsum<-apply(gender,1,sum)  # row totals ni+
> colsum<-apply(gender,2,sum)  # column totals n+j
> n<-sum(gender)
> gd<-outer(rowsum,colsum/n)   # expected counts ni+ n+j / n
> # outer() keeps the names, so gd inherits the table's dimnames


Table of Expected Cell Counts


> gd
Democrat Independent Republican
Female 261.4163 70.65306 244.9306
Male 182.5837 49.34694 171.0694


Residuals (p.31)
• To better understand the nature of the evidence against H0, a
cell-by-cell comparison of observed and estimated expected
frequencies is necessary.
• Define the adjusted residuals

rij = (nij − µ̂ij) / √( µ̂ij (1 − pi+)(1 − p+j) )

• If H0 is true, each rij has a large-sample standard normal
distribution.
• If |rij| exceeds 2 in a cell, it indicates lack of fit of H0 in
that cell.
• The sign of the residual also describes the nature of the
association.

Computing Residuals in R
> rowp<-rowsum/n   # row marginal probabilities pi+
> colp<-colsum/n   # column marginal probabilities p+j
> pd<-outer(1-rowp,1-colp)   # (1 - pi+)(1 - p+j) for each cell
> resid<-(gender-gd)/sqrt(gd*pd)   # adjusted residuals
> resid


Residuals Output

Party
Gender Democrat Independent Republican
Female 2.2931603 0.4647941 -2.6177798
Male -2.2931603 -0.4647941 2.6177798


Some Comments (p.33)


• Pearson's χ² tests only indicate the degree of evidence for an
association; they cannot answer other questions, such as the
nature of the association.
• These χ² tests are not always applicable. Large samples are
needed; the approximation is often poor when n/(IJ) < 5.
• The values of χ² and G² do not depend on the ordering of the
rows. Thus they ignore some information when the data are
ordinal.


Testing Independence For Ordinal Data (p.34)


• For ordinal data, it is important to look for the type of
association when there is dependence.
• It is quite common to assume a monotone trend: as the level of X
increases, responses on Y tend to increase toward higher levels,
or tend to decrease.
• The simplest and most common analysis assigns scores to the
categories and measures the degree of linear trend or correlation.
• The method used is known as the “Mantel-Haenszel chi-square”
test (Mantel and Haenszel 1959).


Linear Trend Alternative to Independence


• Let u1 ≤ u2 ≤ · · · ≤ uI denote scores for the rows.
• Let v1 ≤ v2 ≤ · · · ≤ vJ denote scores for the columns.
• The scores have the same ordering as the category levels.
• Define the correlation between X and Y as

r = [ Σi Σj ui vj nij − (Σi ui ni+)(Σj vj n+j)/n ]
    / √( [Σi ui² ni+ − (Σi ui ni+)²/n] [Σj vj² n+j − (Σj vj n+j)²/n] )


Test For Linear Trend Alternative


• Independence between the variables implies that the true value of
this correlation equals zero.
• The larger the correlation is in absolute value, the farther the
data fall from independence in this linear dimension.
• A test statistic is given by M² = (n − 1)r².
• For large samples, it has approximately a chi-squared
distribution with 1 degree of freedom.


Infant Malformation and Mothers' Alcohol Consumption

Alcohol                Malformation
Consumption   Absent(0)   Present(1)   Total
0             17,066      48           17,114
<1            14,464      38           14,502
1-2           788         5            793
3-5           126         1            127
≥6            37          1            38


Infant Malformation and Mothers' Alcohol Consumption

Alcohol                Malformation               Percent   Adjusted
Consumption   Absent(0)   Present(1)   Total     Present   Residual
0             17,066      48           17,114    0.28      -0.18
<1            14,464      38           14,502    0.26      -0.71
1-2           788         5            793       0.63       1.84
3-5           126         1            127       0.79       1.06
≥6            37          1            38        2.63       2.71


Example: Tests For Independence


• Pearson's χ² = 12.1, d.f. = 4, p-value = 0.02.
• Likelihood ratio test: G² = 6.2, d.f. = 4, p-value = 0.19.
• The two tests give inconsistent signals.
• The percent present and adjusted residuals suggest that there
may be a linear trend.


Test For Linear Trend


• Assign scores v1 = 0, v2 = 1 and
u1 = 0, u2 = 0.5, u3 = 1.5, u4 = 4.0, u5 = 7.0.
• We have r = 0.014, n = 32,574 and M² = 6.6 with p-value = 0.01.
• This suggests strong evidence of a positive linear trend between
mothers' alcohol consumption and infant malformation.
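
A minimal sketch reproducing these numbers in R from the correlation
formula above:

> counts <- matrix(c(17066, 48, 14464, 38, 788, 5, 126, 1, 37, 1),
                   byrow=TRUE, ncol=2)
> u <- c(0, 0.5, 1.5, 4, 7); v <- c(0, 1)   # row and column scores
> n <- sum(counts); ni <- rowSums(counts); nj <- colSums(counts)
> num <- sum(outer(u, v) * counts) - sum(u*ni) * sum(v*nj) / n
> den <- sqrt((sum(u^2*ni) - sum(u*ni)^2/n) *
              (sum(v^2*nj) - sum(v*nj)^2/n))
> r <- num/den                         # 0.0142
> M2 <- (n - 1) * r^2                  # 6.57
> pchisq(M2, df=1, lower.tail=FALSE)   # 0.0104

Replacing u with 1:5 reproduces the weaker M² = 1.83 discussed on a
later slide.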


SAS Codes
data infants;
input malform alcohol count @@;
datalines;
1 0 17066 2 0 48
1 0.5 14464 2 0.5 38
1 1.5 788 2 1.5 5
1 4.0 126 2 4.0 1
1 7.0 37 2 7.0 1
;
run;
proc format;
value malform 2=’Present’ 1=’Absent’;
value Alcohol 0=’0’ 0.5=’<1’ 1.5=’1-2’ 4.0=’3-5’
7.0=’>=6’;
run;


SAS Codes
proc freq data = infants;
format malform malform. alcohol alcohol.;
weight count;
tables alcohol*malform / chisq cmh1 norow
nocol nopercent;
run;


Partial Output

Statistic                     DF     Value     Prob
---------------------------------------------------------
Chi-Square                     4   12.0821   0.0168
Likelihood Ratio Chi-Square    4    6.2020   0.1846
Mantel-Haenszel Chi-Square     1    6.5699   0.0104
Phi Coefficient                     0.0193
Contingency Coefficient             0.0193
Cramer's V                          0.0193


Partial Output
Cochran-Mantel-Haenszel Statistics (Based on Table
Scores)
Statistic Alternative Hypothesis DF Value Prob
-------------------------------------------------------
1 Nonzero Correlation 1 6.5699 0.0104

Total Sample Size = 32574


Notes
• The correlation r has limited use as a descriptive measure of
tables.
• Different choices of monotone scores usually give similar results.
• However, it may not happen when the data are very
unbalanced, i.e. when some categories have many more
observations than other categories.
• If we had taken (1, 2, 3, 4, 5) as the row scores in our example,
then M² = 1.83 with p-value = 0.18, a much weaker conclusion.
• It is usually better to use one’s own judgment by selecting
scores that reflect distances between categories.


SAS Codes
data infantsx;
input malform alcoholx count @@;
datalines;
1 0 17066 2 0 48
1 1 14464 2 1 38
1 2 788 2 2 5
1 3 126 2 3 1
1 4 37 2 4 1
;
run;
proc freq data = infantsx;
weight count;
tables alcoholx*malform / cmh1 norow nocol nopercent;
run;


Partial Output
Cochran-Mantel-Haenszel Statistics (Based on Table
Scores)
Statistic Alternative Hypothesis DF Value Prob
--------------------------------------------------------
1 Nonzero Correlation 1 1.8278 0.1764

Total Sample Size = 32574


Fisher’s Tea Tasting Experiment (p.39)

Guess Poured First

Poured First Milk Tea Total

Milk 3 1 4
Tea 1 3 4
Total 4 4 8


Example: Tea Tasting


• Goal: to test whether the lady could accurately tell which was
poured first, milk or tea.
• Test H0 : θ = 1 against H1 : θ > 1.
• We cannot use the previously discussed tests because the sample
size is very small.


Fisher’s Exact Test


• For a 2 × 2 table, under the assumption of independence, i.e. θ
= 1, the conditional distribution of n11 given the row and
column totals is hypergeometric.
• For given row and column marginal totals, the value for n11
determines the other three cell counts. Thus, the
hypergeometric formula expresses probabilities for the four cell
counts in terms of n11 alone.


Fisher’s Exact Test


• When θ = 1, the probability of a particular value n11 for that
count equals

p(n11) = C(n1+, n11) C(n2+, n+1 − n11) / C(n, n+1)

where C(a, b) denotes the binomial coefficient “a choose b”.

• To test independence, the p-value is the sum of hypergeometric
probabilities for outcomes at least as favorable to the
alternative hypothesis as the observed outcome.


Example: Tea Tasting


• The outcomes at least as favorable as the observed data, given
the row and column totals, are n11 = 3 and n11 = 4.
• Hence,

p(3) = C(4,3) C(4,1) / C(8,4) = 16/70 = 0.2286,

p(4) = C(4,4) C(4,0) / C(8,4) = 1/70 = 0.0143.
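
The same probabilities can be obtained in R from the hypergeometric
density:

> dhyper(3, 4, 4, 4)         # 16/70 = 0.2286
> dhyper(4, 4, 4, 4)         # 1/70  = 0.0143
> sum(dhyper(3:4, 4, 4, 4))  # one-sided p-value = 0.2429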


Example: Tea Tasting


• Therefore, the p-value = p(3) + p(4) = 0.243.
• There is not much evidence against the null hypothesis of
independence.
• The experiment did not establish an association between the
actual order of pouring and the guess.


SAS Codes: Exact Test


data tea;
input poured $ guess $ count @@;
datalines;
Milk Milk 3 Milk Tea 1
Tea Milk 1 Tea Tea 3
;
proc freq data=tea order=data;
weight count;
tables poured*guess / exact;
run;


Partial Output

Fisher’s Exact Test


----------------------------------
Cell (1,1) Frequency (F) 3
Left-sided Pr <= F 0.9857
Right-sided Pr >= F 0.2429
Table Probability (P) 0.2286
Two-sided Pr <= P 0.4857
Sample Size = 8


R Codes
> Poured<-c("Milk","Milk","Tea","Tea")
> Guess<-c("Milk","Tea","Milk","Tea")
> count<-c(3,1,1,3)
> teadata<-data.frame(Poured,Guess,count)
> tea<-xtabs(count~Poured+Guess,data=teadata)
> fisher.test(tea,alternative="greater")


Output
Fisher’s Exact Test for Count Data
data: tea
p-value = 0.2429
alternative hypothesis: true odds ratio is
greater than 1
95 percent confidence interval:
0.3135693 Inf
sample estimates:
odds ratio
6.408309
