
STT553: Categorical Data Analysis (CDA)
Lectured by Md. Kaderi Kibria, STT-HSTU

Lecture # 3: Contingency Table

Objectives of this lecture:

After reading this unit, you should be able to
• understand the basic concepts of the two-way contingency table
• understand the odds ratio
• understand the test of independence

Contingency Table
A rectangular table having I rows for categories of X and J columns for categories of Y displays the
IJ possible combinations of outcomes. The cells of the table represent the IJ possible outcomes.
When the cells contain frequency counts of outcomes for a sample, the table is called a contingency
table, a term introduced by Karl Pearson (1904). Another name is cross-classification table.
A contingency table with I rows and J columns is called an I × J table.
In the abstract, a contingency table looks like:
nij      Y=1    Y=2    ...   Y=J    Total
X=1      n11    n12    ...   n1J    n1+
X=2      n21    n22    ...   n2J    n2+
...      ...    ...    ...   ...    ...
X=I      nI1    nI2    ...   nIJ    nI+
Total    n+1    n+2    ...   n+J    n++

If subjects are randomly sampled from the population and cross-classified, both X and Y are random
and (X, Y) has a bivariate discrete joint distribution. Let πij = P(X = i, Y = j), the probability of
falling into the cell in row i and column j of the table.
Example: The Physicians' Health Study was a 5-year randomized study of whether regular aspirin
intake reduces mortality from cardiovascular disease. Every other day, physicians participating in
the study took either one aspirin tablet or a placebo. The study was blind: those in the study did not
know whether they were taking aspirin or a placebo.
Table 1: Cross-Classification of Aspirin Use and Myocardial Infarction

           Myocardial Infarction
           Fatal Attack    Non-fatal Attack    No Attack
Placebo    18              171                 10,845
Aspirin    5               99                  10,933

Categorical Data Analysis | Lecture 3 | Two-way Contingency table



Joint/ Marginal for Contingency Table


Let πij denote the probability that (X, Y) occurs in the cell in row i and column j. The probability
distribution { πij } is the joint distribution of X and Y. The marginal distributions are the row and
column totals that result from summing the joint probabilities. We denote these by {πi + } for the
row variable and {π+ j } for the column variable, where the subscript ‘‘+’’ denotes the sum over
that index; that is,
P(X = i) = ∑_{j=1}^{J} P(X = i, Y = j) = ∑_{j=1}^{J} πij = πi+   and

P(Y = j) = ∑_{i=1}^{I} P(X = i, Y = j) = ∑_{i=1}^{I} πij = π+j

These satisfy ∑i πi+ = ∑j π+j = ∑i ∑j πij = π++ = 1. The marginal distributions provide single-
variable information.
When X is fixed rather than random, the notion of a joint distribution for X and Y is no longer
meaningful. However, for a fixed category of X, Y has a probability distribution.

Example: The following 2 × 3 contingency table is from a report by the Physicians' Health Study
Research Group on n = 22,071 physicians who took either a placebo or aspirin every other day.
           Myocardial Infarction
           Fatal Attack    Non-fatal Attack    No Attack
Placebo    18              171                 10,845
Aspirin    5               99                  10,933

Here we have placed the sample joint probabilities into each cell, with the marginal probabilities in
the margins:

           Myocardial Infarction
           Fatal Attack    Non-fatal Attack    No Attack    Row marginal
Placebo    0.00082         0.00775             0.49137      0.49993
Aspirin    0.00023         0.00449             0.49536      0.50007
Column     0.00104         0.01223             0.98672      1.000
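These joint and marginal proportions can be checked with a few lines of Python (a sketch of ours using numpy; the array layout and variable names follow Table 1 but are not part of the lecture):

```python
import numpy as np

# Cell counts from Table 1: placebo row, then aspirin row.
counts = np.array([[18, 171, 10845],
                   [5, 99, 10933]])
n = counts.sum()                      # total sample size, 22071

joint = counts / n                    # sample joint distribution {pi_ij}
row_marginal = joint.sum(axis=1)      # {pi_i+}: sum each row over j
col_marginal = joint.sum(axis=0)      # {pi_+j}: sum each column over i

print(np.round(joint, 5))
print(np.round(row_marginal, 5))
print(np.round(col_marginal, 5))
```

The printed marginals sum to 1, matching π++ = 1 above.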


Conditional distribution for Contingency Table


Given that a subject is classified in row i of X, πj|i denotes the probability of classification in
column j of Y, j = 1, 2, ..., J. Note that ∑_{j=1}^{J} πj|i = 1. The probabilities {π1|i, ..., πJ|i} form the
conditional distribution of Y at category i of X.


The conditional distribution of Y given X relates to the joint distribution by

πj|i = P(Y = j | X = i) = πij / πi+   for i = 1, 2, ..., I

                Column
Row      1              2              Total
1        π11 (π1|1)     π12 (π2|1)     π1+ (1.0)
2        π21 (π1|2)     π22 (π2|2)     π2+ (1.0)
Total    π+1            π+2            1.0

Sample distributions use similar notation, with p or π̂ in place of π. For instance, {π̂ij}
denotes the sample joint distribution. The cell frequencies are denoted by {nij}, and
n = ∑i ∑j nij is the total sample size. Thus π̂ij = nij / n.

Example: The following 2 × 3 contingency table is from a report by the Physicians' Health Study
Research Group on n = 22,071 physicians who took either a placebo or aspirin every other day.
           Myocardial Infarction
           Fatal Attack    Non-fatal Attack    No Attack
Placebo    18              171                 10,845
Aspirin    5               99                  10,933

Here we denote the conditional probability of each MI classification, given treatment group:

           Myocardial Infarction
           Fatal Attack    Non-fatal Attack    No Attack
Placebo    π1|1            π2|1                π3|1
Aspirin    π1|2            π2|2                π3|2
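The conditional distributions of MI status given treatment group follow by dividing each row by its total, πj|i = nij / ni+. A short Python sketch (ours, not from the lecture):

```python
import numpy as np

# Cell counts: placebo row, then aspirin row.
counts = np.array([[18, 171, 10845],
                   [5, 99, 10933]])
row_totals = counts.sum(axis=1, keepdims=True)   # n_{i+} for each row

# Conditional distribution of Y given X: each row now sums to 1.
conditional = counts / row_totals

print(np.round(conditional, 5))
```

Each printed row sums to 1, matching ∑j πj|i = 1 above.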


Independence of Categorical Variables


Two categorical response variables are defined to be independent if all joint probabilities equal the
product of their marginal probabilities:

P(X = i, Y = j) = P(X = i) P(Y = j), or
πij = πi+ π+j   for i = 1, 2, ..., I and j = 1, 2, ..., J

When X and Y are independent,


πj|i = P(Y = j | X = i) = πij / πi+ = πi+ π+j / πi+ = π+j   for i = 1, 2, ..., I
πi|j = P(X = i | Y = j) = πij / π+j = πi+ π+j / π+j = πi+   for j = 1, 2, ..., J

Each conditional distribution of Y is identical to the marginal distribution of Y.

Thus, two variables are independent when {πj|1 = ... = πj|I, for j = 1, 2, ..., J}; that is, the
probability of any given column response is the same in each row. When Y is a response and X is an
explanatory variable, this is a more natural way to define independence. Independence is then often
referred to as homogeneity of the conditional distributions.

Comparing Two Proportions


Difference of Proportions
We use the generic terms success and failure for the outcome categories. Let X and Y be
dichotomous. Let π1 = P(Y = 1 | X = 1) and let π2 = P(Y = 1 | X = 2).
The difference in probability of Y = 1 when X = 1 versus X = 2 is π1 − π2.

The difference π1 − π2 compares the success probabilities for the two groups. It falls between −1
and +1, equaling zero when π1 = π2, that is, when the response variable is independent of the
group classification. Let π̂1 and π̂2 denote the sample proportions of successes. The sample
difference of proportions π̂1 − π̂2 estimates π1 − π2.
For sample sizes n1 and n2 for the two groups, when we treat the two samples as independent
binomial samples, the estimated expectation and standard error of π̂1 − π̂2 are
E(π̂1 − π̂2) = π1 − π2   and

SE = √( π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2 )
As the sample sizes increase, the standard error decreases and the estimate of π 1 − π 2 tends to
improve. A large-sample 100(1 − α)% Wald confidence interval for π 1 − π 2 is


(π̂1 − π̂2) ± zα/2 (SE)

For a significance test of H 0 :π 1=π 2 , the standard Z-test statistic divides ( π^ 1 − π^2 ) by a pooled
SE that applies under H 0.
Example: The study population consisted of over 22,000 male physicians who were randomly
assigned to either low-dose aspirin or a placebo (an identical-looking pill that was inert). They
followed these physicians for about five years. Some of the data are summarized in the 2 × 2 table
shown below.
           Myocardial Infarction (MI)
           Yes    No
Placebo    189    10845
Aspirin    104    10933

We can calculate the sample proportions of MI within each treatment group:

           Myocardial Infarction (MI)
           Yes           No
Placebo    0.01712887    0.9828711
Aspirin    0.00942285    0.9905771

Output from R's two-sample proportion test (prop.test) for these data:

data: MI
X-squared = 24.429, df = 1, p-value = 7.71e-07
alternative hypothesis: two.sided
95 percent confidence interval:
 0.004597134 0.010814914
sample estimates:
    prop 1     prop 2
0.01712887 0.00942285
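The Wald formula above can be applied directly to these counts. The following Python sketch (ours) reproduces the difference of proportions; its interval is close to, but slightly narrower than, the R output above, which appears to come from prop.test and so includes a continuity correction:

```python
from math import sqrt

# Group sizes: placebo (189 + 10845) and aspirin (104 + 10933).
n1, n2 = 11034, 11037
p1, p2 = 189 / n1, 104 / n2        # sample proportions of MI

diff = p1 - p2                     # estimate of pi1 - pi2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se   # 95% Wald interval

print(f"diff = {diff:.5f}, 95% CI = ({lo:.5f}, {hi:.5f})")
```

Since the interval excludes 0, the data are inconsistent with π1 = π2.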

Ratio of Proportions (Relative Risk)


A difference between two proportions of a certain fixed size usually is more important when both
proportions are near 0 or 1 than when they are near the middle of the range. Consider a comparison
of two drugs on the proportion of subjects who had adverse reactions when using them. The
difference between 0.010 and 0.001 is the same as the difference between 0.410 and 0.401, namely
0.009.
Relative risk is the ratio of the risk of an event in the exposed group to the risk of the event in the
unexposed group.


For 2 × 2 tables, the ratio of probabilities is often called the relative risk:
nij      Y=1    Y=2    Total
X=1      n11    n12    n1+
X=2      n21    n22    n2+
Total    n+1    n+2    n++

Relative Risk = (n11/n1+) / (n21/n2+)
A relative risk of 1.00 occurs when π 1=π 2 , that is, when the response variable is independent of
the group.
Interpretation:
 Relative Risk < 1: The event is less likely to occur in the treatment group
 Relative Risk = 1: The event is equally likely to occur in each group
 Relative Risk > 1: The event is more likely to occur in the treatment group
Example: data from a flu vaccination study (Beran et al., 2009). This study was a randomized
controlled trial, the gold standard for identifying causal relationships.

Treatment    Flu infections    Non-infections
Vaccine      49                5054
Placebo      74                2475

Now, let's plug these numbers into the relative risk formula:

RR = [49/(49 + 5054)] / [74/(74 + 2475)] = 0.331

The risk ratio is 0.331, indicating the vaccine is a protective factor. The vaccinated are about 1/3 as
likely to catch the flu as the unvaccinated; that is, the probability of getting flu is lower for
vaccinated individuals than for the unvaccinated.
Another example: The study population consisted of over 22,000 male physicians who were
randomly assigned to either low-dose aspirin or a placebo (an identical-looking pill that was inert).
They followed these physicians for about five years. Some of the data are summarized in the 2 × 2
table shown below.

           Myocardial Infarction (MI)
           Yes    No       Total     Cumulative Incidence
Placebo    189    10845    11,034    189/11,034 = 0.0171
Aspirin    104    10933    11,037    104/11,037 = 0.0094


Relative Risk = (104/11,037) / (189/11,034) = 0.55

Those who take low-dose aspirin regularly have 0.55 times the risk of myocardial infarction
compared to those who do not take aspirin.
OR
Subjects taking aspirin had 45% less risk of having a myocardial infarction compared to subjects
taking the placebo.
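Both relative-risk examples can be verified with a small helper function (a Python sketch of ours; the function name is our own, not part of the lecture):

```python
# Relative risk from the two rows [a, b] (exposed) and [c, d] (unexposed)
# of a 2x2 table: RR = (a/(a+b)) / (c/(c+d)).
def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

# Flu vaccine study: vaccine row vs placebo row
print(round(relative_risk(49, 5054, 74, 2475), 3))      # about 0.331

# Physicians' Health Study: aspirin row vs placebo row
print(round(relative_risk(104, 10933, 189, 10845), 2))  # about 0.55
```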

Odds Ratio (OR)


The odds ratio is another measure of association for 2 × 2 contingency tables. It also occurs as a
parameter in the most important model for categorical responses, logistic regression.
For a probability of success π, the odds of success are defined to be

Odds = π / (1 − π)

For instance, when π = 0.75, the odds of success equal 0.75/0.25 = 3.0. The odds are non-negative,
with value greater than 1.0 when a success is more likely than a failure. When odds = 3.0, we
expect to observe three successes for every one failure. When odds = 1/3, a failure is three times as
likely as a success. We then expect to observe one success for every three failures.
In 2 × 2 tables, within row 1 the odds of success are odds1 = π1/(1 − π1), and within row 2 the
odds of success equal odds2 = π2/(1 − π2).

nij      Y=1    Y=2    Total
X=1      n11    n12    n1+
X=2      n21    n22    n2+
Total    n+1    n+2    n++

The ratio of the odds from the two rows,

θ = Odds1 / Odds2 = [π1/(1 − π1)] / [π2/(1 − π2)] = (n11/n12) / (n21/n22) = n11 n22 / (n12 n21)

is the odds ratio. Whereas the relative risk is a ratio π1/π2 of two probabilities, the odds ratio θ is
a ratio of two odds.

Properties of Odds Ratio


1. The odds ratio can equal any non-negative number.
2. It doesn’t depend on the marginal distribution of either variable.
3. If the categories of both variables are interchanged, the value doesn’t change.


4. If the categories of only one variable are interchanged, the odds ratio in the rearranged table
   equals the reciprocal 1/θ of its original value.
Example: data from a flu vaccination study (Beran et al., 2009). This study was a randomized
controlled trial, the gold standard for identifying causal relationships.

Treatment    Flu infections    Non-infections
Vaccine      49                5054
Placebo      74                2475

Now, let's plug these numbers into the odds ratio formula:

OR = (49 × 2475) / (74 × 5054) = 0.324

So, the odds of flu infection for vaccinated individuals were about one-third the odds for the
unvaccinated; that is, vaccinated individuals were less likely to be infected than the unvaccinated.

Inference for Odds Ratios (OR) and Log OR


The sampling distribution of the odds ratio is highly skewed unless the sample size is extremely
large. Because of this skewness, statistical inference uses its natural logarithm, log(θ).
Independence corresponds to log(θ) = 0.

The sample log odds ratio, log θ̂, has a less-skewed, bell-shaped sampling distribution. Its
approximating normal distribution has a mean of log θ and a standard error of

SE(log θ̂) = √( 1/n11 + 1/n12 + 1/n21 + 1/n22 )

The SE decreases as the cell counts increase. Inference is based on the log scale because the
sampling distribution is closer to normality for log θ̂ than for θ̂.

A large-sample Wald confidence interval for log θ is

log θ̂ ± zα/2 SE(log θ̂)

Exponentiating its endpoints gives a confidence interval for θ.
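As a numerical illustration, the following Python sketch (ours, not from the lecture) computes θ̂ and the large-sample Wald interval for the flu vaccination table:

```python
from math import exp, log, sqrt

# Flu vaccination table: vaccine row, then placebo row.
n11, n12 = 49, 5054     # vaccine: infected, not infected
n21, n22 = 74, 2475     # placebo: infected, not infected

theta = (n11 * n22) / (n12 * n21)                  # sample odds ratio
se_log = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)       # SE of log(theta-hat)

# 95% Wald interval on the log scale, then exponentiate the endpoints.
lo = exp(log(theta) - 1.96 * se_log)
hi = exp(log(theta) + 1.96 * se_log)

print(f"OR = {theta:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The interval lies entirely below 1, consistent with a protective vaccine effect.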

Relative Risk and Odds Ratios are often confused despite being distinct
concepts. Why?
Well, both measure the association between a binary outcome variable and a continuous or binary
predictor variable.
The basic difference is that the odds ratio is a ratio of two odds (yep, it's that obvious) whereas the
relative risk is a ratio of two probabilities. (The relative risk is also called the risk ratio.)
θ = [π1/(1 − π1)] / [π2/(1 − π2)] = Relative Risk × (1 − π2)/(1 − π1)


When π1 and π2 are both close to zero, the fraction (1 − π2)/(1 − π1) in the last term of this
expression equals approximately 1.0. The odds ratio and relative risk then take similar values.
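This approximation is easy to verify numerically (a Python sketch of ours, with illustrative probabilities we chose):

```python
# Odds ratio for success probabilities p1 and p2 in the two rows.
def odds_ratio(p1, p2):
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Near zero: OR and RR nearly coincide (both about 2).
print(odds_ratio(0.010, 0.005), 0.010 / 0.005)

# Mid-range: OR = 2.667 vs RR = 2, inflated by (1-0.20)/(1-0.40) = 4/3.
print(odds_ratio(0.40, 0.20), 0.40 / 0.20)
```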

Testing Independence in Two-Way Contingency Tables


Pearson Chi-square Statistic: In two-way contingency tables with joint probabilities {πij} for two
response variables, the null hypothesis of statistical independence is
H0: πij = πi+ π+j   for all i and j

To test H0, we identify μij = n πij = n πi+ π+j as the expected frequency of cell (i, j) under
independence. Usually, {πi+} and {π+j} are unknown, as is this expected value. To obtain estimated
expected frequencies, we substitute sample proportions for the unknown marginal probabilities,
giving

μ̂ij = n π̂i+ π̂+j = n (ni+/n)(n+j/n) = ni+ n+j / n
based on the row marginal totals {ni+} and the column marginal totals {n+j}. The μ^ij have the
same row and column totals as the cell counts {nij}, but they display the pattern of independence.
Then the chi-square statistic equals

χ² = ∑i ∑j (nij − μ̂ij)² / μ̂ij   with df = (I − 1)(J − 1)

Likelihood-Ratio Statistic: For multinomial sampling, the kernel of the likelihood is

∏i ∏j πij^nij,   where all πij ≥ 0 and ∑i ∑j πij = 1

Let us consider the null hypothesis

H0: πij = πi+ π+j   for all i and j

Under H0, π̂ij = π̂i+ π̂+j = ni+ n+j / n². In the general case, π̂ij = nij / n.
The ratio of the likelihoods equals

Λ = [ ∏i ∏j (ni+ n+j)^nij ] / [ n^n ∏i ∏j nij^nij ]

The likelihood-ratio chi-squared statistic is denoted by G²; it equals

G² = −2 log Λ = 2 ∑i ∑j nij log(nij / μ̂ij),   where μ̂ij = ni+ n+j / n

The larger the values of G2, the more evidence exists against independence.
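Both statistics can be computed for the aspirin 2 × 2 table with scipy (a sketch of ours; chi2_contingency's lambda_ option selects the likelihood-ratio statistic, and correction=False turns off the 2 × 2 continuity correction so the result matches the formulas above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Aspirin 2x2 table: placebo row, then aspirin row (MI yes / no).
table = np.array([[189, 10845],
                  [104, 10933]])

# Pearson X^2 (no continuity correction) plus the mu-hat_ij table.
x2, p_x2, df, expected = chi2_contingency(table, correction=False)

# Likelihood-ratio G^2 via the power-divergence family.
g2, p_g2, _, _ = chi2_contingency(table, correction=False,
                                  lambda_="log-likelihood")

print(np.round(expected, 2))      # estimated expected frequencies
print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}, p = {p_x2:.2e}")
```

Both statistics are far out in the tail of the chi-squared distribution with df = 1, so independence of treatment and MI status is rejected.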


Reference Book:
i. Agresti, A. (2019). An Introduction to Categorical Data Analysis, 3rd edition. John Wiley & Sons,
Inc.

<><><><><><><><><> End <><><><><><><><><>
