Categorical Data Analysis

Sahadeb Sarkar
IIM Calcutta

• Slides Adapted from Prof Ayanendranath Basu’s Class-notes


• R Programs and Data Sets in Textbook (Tang, He & Tu):
http://accda.sph.tulane.edu/r.html
• Readings: Chapters 1-6, Text

1
Terminology
Discrete data: data on discrete outcomes, modeled by discrete
distributions
• Categorical data: discrete data with finitely many
possible values on a nominal scale (e.g., the state a
person lives in, the political party one might vote for,
the blood type of a patient; multinomial, Bernoulli
distributions). Central tendency is given by the mode.
• Count data (non-negative integer valued): records
the frequency of an event and may not have an upper
bound (e.g., Poisson, binomial, negative binomial
distributions). It arises from counting, not ranking.

2
Discrete Data Types
• Dichotomous data: can take only two values
such as “Yes” and “No”
• Nonordered polytomous data: five different
detergents
• Ordered polytomous data: grades A, B, C, D;
“old”, “middle-aged”, “young” employees

• Integer valued: nonnegative counts


3
Derivation Tools in CDA, Text p.18
Delta Method:
If θ̂_n →_d N(θ, Σ/n) and g(θ) is an m×1 differentiable
function of the k×1 vector θ, then

    g(θ̂_n) →_d N( g(θ), D Σ Dᵀ / n ),

where D_{m×k} = ∂g(θ)/∂θᵀ is the matrix of partial derivatives.

4
Derivation Tools in CDA, Text p.18
Slutsky’s Theorem:
Suppose X_n →_d X and Y_n →_d c, a constant. Then
1. X_n + Y_n →_d X + c
2. Y_n X_n →_d cX
3. If c ≠ 0, X_n / Y_n →_d X / c

5
Inference for One-way Frequency
Table

• Binary case (Sec 2.1.1, Text)


• Inference for Multinomial Variable (Sec 2.1.2)
• Inference for Count Variable (Sec 2.1.3)

R Programs and Data Sets in Textbook (Tang, He & Tu):


http://accda.sph.tulane.edu/r.html

6
Binomial Distribution
(leading to One-Way Frequency Table)
Suppose Y is a random variable with 2 possible outcome
categories c1, c2 with probabilities π1, π2 = (1 − π1).
Suppose there are n observations on Y; we can summarize
the responses through the vector of observed frequencies
(random variables), (X1, X2 = n − X1).

Then (X1, X2 = n − X1) is said to have a binomial distribution
with parameters n and (π1, π2 = 1 − π1), or simply X1 is said to
have a binomial distribution with parameters n and π1.

    P(X1 = x1) = [n! / (x1!(n − x1)!)] π1^{x1} (1 − π1)^{n − x1},  x1 = 0, 1, …, n

Then, E(X1) = nπ1, V(X1) = nπ1(1 − π1) < E(X1)

7
Example 1.1, p. 6, Text

What is Metabolic Syndrome ?

8
Metabolic syndrome
(https://en.wikipedia.org/wiki/Metabolic_syndrome)

Metabolic syndrome, sometimes known by other


names, is a clustering of at least three of the following
five medical conditions (giving a total of 16
combinations that qualify as the syndrome):
 Abdominal (central) obesity
 High blood pressure
 High blood sugar
 High serum triglycerides
 Low high-density lipoprotein (HDL) levels

9
Example 1.1 (Binary Case), p. 37, Text
• Test if the prevalence of Metabolic Syndrome is 40% in this
study population:

    Z = (π̂ − π0) / √(π0(1 − π0)/n) = (48/93 − 0.4) / √(0.4 × 0.6/93) = 2.286;
    P-value = 2Φ(−2.286) = 0.0223

• Construct a 95% confidence interval for the prevalence in this
population:

    π̂ ± Z_{α/2} × √(π̂(1 − π̂)/n)
    = 48/93 ± 1.96 × √((48/93)(1 − 48/93)/93)
    = [0.4146, 0.6177]
10
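The Z statistic and Wald interval above can be checked numerically. A minimal sketch in Python (standard library only; `norm_cdf` is a helper built from `math.erf`):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, x, pi0 = 93, 48, 0.40            # 48 of 93 subjects with metabolic syndrome
pi_hat = x / n

# Wald test of H0: pi = 0.40
z = (pi_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
p_value = 2 * (1 - norm_cdf(abs(z)))

# 95% Wald confidence interval
half = 1.96 * math.sqrt(pi_hat * (1 - pi_hat) / n)
ci = (pi_hat - half, pi_hat + half)

print(round(z, 3), round(p_value, 3))    # 2.286 0.022
print(round(ci[0], 4), round(ci[1], 4))  # 0.4146 0.6177
```

The interval endpoints match the slide to four decimals.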
Negative Binomial Distribution (p. 41)
• Consider a sequence of independent Bernoulli trials, each with two
potential outcomes, "success" and "failure". In each trial the probability
of success is p and of failure is (1 − p). Observe this sequence until a
predefined number r of failures has occurred. Then X = number of
successes observed has the negative binomial distribution:

• P(X = k) = [Γ(r + k) / (Γ(r) k!)] (1 − p)^r p^k,  k = 0, 1, 2, …

• μ = E(X) = rp/(1 − p), V(X) = rp/(1 − p)² > E(X).
  Put α = 1/r and μ = rp/(1 − p) for a reparameterization.

• Note: C(k + r − 1, k) = (−1)^k C(−r, k)

11
Negative Binomial Distribution (p. 41)

• P(X = k) = [Γ(r + k) / (Γ(r) k!)] (1 − p)^r p^k ……… (1a)
• E(X) = rp/(1 − p), V(X) = rp/(1 − p)² > E(X) …….. (1b)
• Extension through reparameterization:
  α = 1/r (> 0), μ = rp/(1 − p) in (1), so that
  p = αμ/(1 + αμ) and 1 − p = 1/(1 + αμ)

• Then, P(X = k) = [Γ(1/α + k) / (Γ(1/α) k!)] (1/(1 + αμ))^{1/α} (αμ/(1 + αμ))^k ……… (2a)

• E(X) = μ; V(X) = μ + αμ² ……………………(2b)

12
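A quick numerical check of the reparameterization: under α = 1/r and μ = rp/(1 − p), the pmf mass sums to 1 with mean μ and variance μ + αμ². The values r = 4, p = 0.3 below are illustrative only, not from the text:

```python
import math

r, p = 4, 0.3                       # illustrative values (not from the text)
alpha, mu = 1 / r, r * p / (1 - p)

# NB pmf in (r, p): P(X = k) = C(k+r-1, k) (1-p)^r p^k
def pmf(k):
    return math.comb(k + r - 1, k) * (1 - p) ** r * p ** k

ks = range(0, 200)                  # truncation; the tail mass is negligible here
total = sum(pmf(k) for k in ks)
mean = sum(k * pmf(k) for k in ks)
var = sum(k * k * pmf(k) for k in ks) - mean ** 2

# mean should equal mu, and variance mu + alpha*mu^2
print(round(total, 6), round(mean, 4), round(var, 4))
```

With these values, mean = 1.2/0.7 ≈ 1.714 and variance = 1.2/0.49 ≈ 2.449, agreeing with (1b) and (2b).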
Hypergeometric Distribution
• Randomly sample n elements from a finite (dichotomous)
population of size N, without replacement, having K
“success”-type and (N − K) “failure”-type elements (e.g.,
Pass/Fail or Employed/Unemployed).
• The probability of a success changes on each draw, as each
draw depletes the population.
• X = number of successes in the sample. Then X has the
hypergeometric distribution:

    P(X = x) = C(K, x) C(N − K, n − x) / C(N, n)

• E(X) = n(K/N), V(X) = {n(K/N)(1 − K/N)} × [(N − n)/(N − 1)]

13
Multivariate Hypergeometric Distribution
• Randomly sample n elements from a finite (polytomous)
population of size N, without replacement, having K1, K2, ..., Kc
elements of types 1, 2, …, c.
• Xi= number of i-th type elements in the sample, i=1,…,c. Then
X has multivariate hypergeometric distribution:
    P(X_i = x_i, i = 1, …, c) = [∏_{i=1}^{c} C(K_i, x_i)] / C(N, n)
• E(Xi)=n(Ki/N),
• V(Xi) = {n(Ki/N)(1 – (Ki/N) )}×[(N-n)/(N-1)]
• Cov(Xi, Xj) = {n(Ki/N)(Kj/N) }×[(N-n)/(N-1)]

14
Inference for Multinomial Case

15
Multinomial Distribution
(may lead to One-Way, Two-Way, … Frequency Table)
Suppose Y is a random variable with k possible
outcome categories c1, c2, …, ck with probabilities π1,
π2, …, πk = (1 − π1 − … − πk−1).
Suppose there are n observations on Y; we can
summarize the responses through the vector of
observed frequencies (random variables), X = (X1,
X2, …, Xk), where Xk = n − X1 − … − Xk−1.

Then X = (X1, X2, …, Xk) is said to have a multinomial
distribution with parameters n and (π1, π2, …, πk).

    P(X1 = x1, …, Xk = xk) = [n! / (x1! x2! … xk!)] π1^{x1} … πk^{xk}

16
Multinomial Distribution
(may lead to One-Way, Two-Way, … Frequency Table)
X = (X1, X2, …, Xk) has a multinomial distribution
with parameters n and (π1, π2, …, πk).

    P(X1 = x1, …, Xk = xk) = [n! / (x1! x2! … xk!)] π1^{x1} … πk^{xk}

E(Xi) = nπi,
V(Xi) = nπi(1 − πi);  Cov(Xi, Xj) = −nπi πj,  i ≠ j

MLE of πi = Xi/n (prove it as an exercise), obtained by
maximizing the Lagrangian

    L(π1, …, πk, λ) = log-likelihood + λ(1 − Σ_{i=1}^{k} πi)

17
Example 1.1, p. 6, Text
One-Way Frequency Table for Metabolic Syndrome Study
MS
Present Absent Total
48 45 93

Two-Way Frequency Table for Metabolic Syndrome Study


MS
Gender Present Absent Total
male 31 31 62
female 17 14 31
Total 48 45 93
18
Pearson’s Chi-square (χ2) Test
H0: πi = π0i,  i = 1, …, k        (1)

The fit of the model is assessed by comparing the
frequencies expected in each cell against the observed
frequencies. If there is substantial discrepancy between
the observed frequencies and those expected under the
null model, then it would be wise to reject the null model.
The best known goodness-of-fit statistic used to test the
hypothesis in (1) is Pearson’s Chi-Square (PCS):

    PCS, χ² = Σ_{i=1}^{k} (Observed_i − Expected_i)² / Expected_i = Σ_{i=1}^{k} (Xi − nπ0i)² / (nπ0i)

19
Example: Pearson’s χ2 Test
Testing whether a die is a fair die is a test of a simple
hypothesis. Suppose we roll it 120 times and summarize
the observed frequencies of the six faces.
In this case, k = 6 and n = 120. H0: πi = 1/6 (= π0i), i = 1, 2, …, 6

20
Pearson’s Chi-Square (contd.)
The hypothesis presented in Equation (1) is an
example of a simple hypothesis. (Simple in the sense
that the hypothesis completely specifies the true
distribution).

The hypothesis becomes composite when the null is


not completely spelt out, but is specified in terms of
d parameters (d < k − 1).

21
Multinomial Example, p.38,Text
Multinomial Case:
Depression Diagnosis in the DOS Study
Major Dep Minor Dep No Dep Total
128 136 481 745
DOS = Depression Of Seniors

Test H0: P(No Dep) = 0.65, P(Minor Dep) = 0.2,
P(Major Dep) = 0.15
Here, k = 3, n = 745

    PCS = (481 − 484.25)²/484.25 + (136 − 149)²/149 + (128 − 111.75)²/111.75 = 3.519
df= (no. of categories -1) = 2; P-value= 0.1721
22
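The PCS computation for the DOS data can be reproduced directly; for df = 2 the chi-square tail probability reduces to exp(−x/2), so no special functions are needed:

```python
import math

obs = {"major": 128, "minor": 136, "none": 481}     # DOS depression counts
pi0 = {"major": 0.15, "minor": 0.20, "none": 0.65}  # null cell probabilities
n = sum(obs.values())                               # 745

# Pearson chi-square statistic over the three cells
pcs = sum((obs[c] - n * pi0[c]) ** 2 / (n * pi0[c]) for c in obs)

# For df = 2 the chi-square survival function is exp(-x/2)
p_value = math.exp(-pcs / 2)

print(round(pcs, 3), round(p_value, 4))             # 3.519 0.1721
```

Both numbers match the slide.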
Example 2.2, p.4, p.38, Text

Conclusion: The null hypothesis claim appears to be true
23


Testing Composite Hypothesis
in Inference for Count Data

24
Poisson Distribution Case
Suppose Y is a random variable taking integer values y = 0,
1, 2, …, with probability P(Y = y) = e^{−μ} μ^y / y!
Suppose there are n observations on Y; we can summarize
the observations through the vector of observed
frequencies for the value-categories 0, 1, 2, …

Suppose all counts ≥ 6 are combined to make the combined
frequency more than 5. Then with k = 7 value-categories
0, 1, 2, 3, 4, 5, and ≥6 (say), the observed frequencies X =
(X1, X2, …, Xk), where Xk = n − X1 − … − Xk−1, have a multinomial
distribution with parameters n and (π1, π2, …, πk), where
π1 = P(Y = 0), π2 = P(Y = 1), …, πk = P(Y ≥ 6).
25
Example 2.3, p.42, Text

Exercise: Check that the MLE of μ is 9.1

Conclusion: Null hypothesis claim appears to be false (df = 7 − 1 − 1) 26


MLE of  = 9.1 ?
32 4 𝜃 2 5
𝜃 5 6
𝐿 𝜃 = 𝑒 −𝜃 𝑒 −𝜃 𝜃 𝑒 −𝜃 … 𝑒 −𝜃 ×
2 120
2 5 41
𝜃 𝜃
1 − 𝑒 −𝜃 − 𝑒 −𝜃 𝜃 − 𝑒 −𝜃
… −𝑒 −𝜃
2 120
Maximize this function 𝐿 𝜃 w.r.t. 𝜃
Need to do numerical maximization
Do it as an Exercise

27
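Since the cell frequencies of Example 2.3 are not fully legible here, the sketch below uses hypothetical grouped frequencies (labeled as such) just to illustrate the numerical maximization of the grouped-Poisson likelihood by a crude grid search:

```python
import math

# Hypothetical grouped frequencies for counts 0,1,2,3,4,5 and >=6
# (illustrative only -- NOT the Example 2.3 data)
freqs = [3, 2, 4, 5, 6, 5]
freq_ge6 = 41

def loglik(theta):
    # log-likelihood of the grouped Poisson model
    cell = [math.exp(-theta) * theta ** y / math.factorial(y) for y in range(6)]
    tail = 1 - sum(cell)                      # P(Y >= 6)
    return sum(f * math.log(c) for f, c in zip(freqs, cell)) + freq_ge6 * math.log(tail)

# Crude grid search over theta in (1, 20); a proper optimizer could replace this
grid = [t / 1000 for t in range(1000, 20000)]
theta_hat = max(grid, key=loglik)
print(round(theta_hat, 2))
```

The same search applied to the actual Example 2.3 frequencies should reproduce the quoted MLE of 9.1.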
Intentionally Kept Blank

28
Sampling Schemes
Leading to (2×2) Contingency Tables

29
Layout of the 2×2 table
                          Column factor (‘Response’)
                          Level 1      Level 2
Row factor     Level 1    n11          n12          R1 = n1+
(‘Explanatory’)
               Level 2    n21          n22          R2 = n2+    (Row totals)

Column marginal totals:   C1 = n+1     C2 = n+2     T = n (Grand total)
30
Sampling schemes
leading to 2×2 contingency tables

Sampling scheme    Marginal total fixed in advance
Poisson            None
Multinomial        Grand total (sample size)
Prospective        Row (explanatory) totals
Retrospective      Column (response) totals

31
Poisson Sampling
• Poisson Sampling (French mathematician Simeon
Denis Poisson): Here a fixed amount of time (or space,
volume, money etc.) is employed to collect a random
sample from a single population and each member of
the population falls into one of the four cells in the
2×2 table.
• In the CVD death example (Example 1, next slide), researchers
spent a certain amount of time sampling the health
records of 3112 women, who were cross-classified as
obese or nonobese against died of CVD or not. In
this case, none of the marginal totals or the sample
size was known in advance.
32
Example-1: Cardio-Vascular Deaths and Obesity among
women in American Samoa

[7.76 (=16/2061) observed deaths versus 6.66 (=7/1051) deaths per


thousand.]
Test equal proportions of CVD deaths in populations of obese and
nonobese Samoan women
This is an “Observational Study“, an example of “Poisson Sampling”
[Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.] 33
Multinomial Sampling
• This is the same as the Poisson sampling scheme except
that here the overall sample size is
predetermined, rather than the amount of time for
sampling (or space or volume or money, etc.)
• If in the CVD Death Example 1, researchers decided to
sample the health records of exactly 3112 women
and then note (i) who were obese and non obese and
(ii) who died of CVD or did not die of CVD, then it
would have been multinomial sampling.

34
Prospective Product Binomial Sampling
• Prospective Product Binomial Sampling
(“cohort” study): First identify explanatory variable(s)
thought to explain “causation”. The population is categorized according
to the levels of the explanatory variable, and random samples are then
selected from each explanatory group.
If separate lists of obese and non obese American Samoan
women were available in Example 1, a random sample of
2500 could have been selected from each. The term Binomial
refers to the dichotomy of the explanatory variable. The term
Product refers to the fact that sampling is done from more
than one population independently.

35
Example-2: Vitamin-C versus Common Cold
Outcome

COLD NO COLD TOTAL


PLACEBO 335 76 411
VITAMIN-C 302 105 407
TOTAL 637 181 818

Testing equal proportions of Colds in populations of


Placebo and Vitamin-C takers. One sided P-value for this
example is 0.0059 [Observed proportion 82% versus 74%]
This is a ‘Double-Blind Randomized’ study (not just
observational) [Ramsey and Schafer]

36
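The one-sided P-value of 0.0059 quoted above comes from the pooled two-proportion Z-test; a minimal check in Python:

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Colds among placebo vs vitamin C takers
n1, x1 = 411, 335                 # placebo: 335 colds
n2, x2 = 407, 302                 # vitamin C: 302 colds

p1, p2 = x1 / n1, x2 / n2
pc = (x1 + x2) / (n1 + n2)        # pooled proportion under H0

z = (p1 - p2) / math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))
p_one_sided = 1 - norm_cdf(z)

print(round(z, 2), round(p_one_sided, 4))   # about 2.52 and 0.0059
```

The observed proportions are 335/411 ≈ 0.82 and 302/407 ≈ 0.74, as on the slide.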
Retrospective Product Binomial Sampling

• Retrospective Product Binomial Sampling


(“Case-Control” study): This sampling scheme is
technically the same as the previous one. However, the roles
of the response and the explanatory factors are
reversed. In this scheme, we categorize the
population according to the identified response
levels, and random samples are selected from each
response group.

37
Example 3: Smoking versus Lung Cancer
Outcome
CANCER CONTROL TOTAL

SMOKER 83 72 155
NON-SMOKER 3 14 17
TOTAL 86 86 172

Testing equality of proportions of smokers in


populations of cancers and non-cancers
(Homogeneity).

A retrospective observational study [Ramsey and Schafer]

38
Retrospective Product Binomial
Sampling
• We cannot test for the equality of proportions along the
explanatory variable if the sampling scheme is
retrospective.
• We only get odds ratio from a case control study which is
an inferior measure of strength of association as
compared to relative risk.
• Why do retrospective sampling at all, then?
Compared to prospective cohort studies, they tend to be less
costly and shorter in duration. Case-control studies are often
used in the study of rare diseases, or as a preliminary study
where little is known about the association between a possible
risk factor and the disease of interest.

39
Retrospective Product Binomial
Sampling (Continued)
• If the probabilities of the “Yes” response are very
small, it may need a huge sample size to get any
“Yes” response at all through prospective sampling.
• Retrospective sampling guarantees that we have at
least a reasonable number of “Yes” responses for
each level of explanatory variable.
• In the smoking versus lung cancer study (Example 3),
retrospective sampling may be accomplished without
having to follow the subjects throughout their
lifetime.

40
Prospective: subjects are selected according to the levels
of the explanatory variable.

    Explanatory Variable  →  Response Variable

Retrospective: subjects are selected according to the
levels of the response variable.
41
Layout of the 2×2 table
                          Column factor (Response)
                          Level 1      Level 2
Row factor     Level 1    n11          n12          R1 = n1+
(Explanatory)
               Level 2    n21          n22          R2 = n2+    (Row totals)

Column marginal totals:   C1 = n+1     C2 = n+2     T = n (Grand total)
42
Estimated Proportions
• Proportion of “Yes” (Level 1) responses in the
first level of the explanatory variable:

    π̂1 = n11 / R1

• Similarly, the proportion of “Yes” responses in
the second level of the explanatory variable:

    π̂2 = n21 / R2
43
Assumption
• We will assume that the frequencies of all the entries
in the 2x2 table are greater than 5.
• This ensures that the “asymptotic tests” performed
on the 2x2 tables are reasonably accurate.
(“asymptotic” means ‘appropriate in large samples’)

• If all the entries in the 2x2 table are not greater than
5, one may try Fisher’s Exact test.

44
Example-1: Cardio-Vascular Deaths and Obesity among
women in American Samoa

[7.76 observed deaths versus 6.66 deaths per thousand.]


Testing equal proportions of CVD deaths in populations of obese
and nonobese Samoan women.
This is an “Observational Study“, an example of “Poisson Sampling”
[Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.]
45
Pearson’s Chi-square (PCS) Test
    PCS = Σ_{category c} (O_c − E_c)² / E_c

where O_c = observed count in category c, E_c = expected
count in category c under the proposed model.

H0: Proposed model generated the observed data


Ha: Proposed model did not generate the data
If H0 is true then PCS has a chi-square distribution with
appropriate degrees of freedom (df).
46
Chi-square Distribution
Let Z1, …, Zk be independent random variables, each
having the N(0,1) distribution. Then Z1² + … + Zk² is said to
follow the chi-square (χ²_k) distribution with k degrees of freedom (df).

Result: The expected value and variance of a chi-square
(χ²_k) random variable are given by: E(χ²_k) = k (= df);
Var(χ²_k) = 2k (= 2·df).

For given k and α, let χ²_{k,α} denote the real number that is
exceeded with probability α by a χ²_k random variable.
Chi-square Distribution

48
Calculations

DF = 1; Two-sided P-value = 1 − CHISQ.DIST(0.115, 1, TRUE)
One-sided P-value = 1 − NORM.S.DIST(0.34, TRUE)
49
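The spreadsheet formulas above refer to the CVD example (Z = 0.34 and PCS = Z² = 0.115); the same numbers can be reproduced directly from the counts:

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# CVD deaths: 16 of 2061 obese, 7 of 1051 nonobese Samoan women
n1, x1 = 2061, 16
n2, x2 = 1051, 7

p1, p2 = x1 / n1, x2 / n2
pc = (x1 + x2) / (n1 + n2)          # pooled proportion under H0

z = (p1 - p2) / math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))
chi_sq = z ** 2                     # equals the Pearson chi-square for a 2x2 table

p_one_sided = 1 - norm_cdf(z)
print(round(z, 2), round(chi_sq, 3), round(p_one_sided, 3))  # 0.34 0.115 0.367
```

With so small a Z, the data give no evidence against equal CVD death rates.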
Example-2: Vitamin-C versus Common Cold
Outcome

COLD NO COLD TOTAL


PLACEBO 335 76 411
VITAMIN-C 302 105 407
TOTAL 637 181 818

Testing equal proportions of colds in populations of Placebo


and Vitamin-C takers. One sided P-value for this example is
0.0059 [Observed proportion 82% versus 74%]

A randomized study, also double blind [Ramsey and
Schafer]
50
Calculations

51
Example 3: Smoking versus Lung Cancer
Outcome
CANCER CONTROL TOTAL

SMOKER 83 72 155
NON-SMOKER 3 14 17
TOTAL 86 86 172

Testing equal proportions of smokers in


populations of cancers and non-cancers. One
sided p-value = 0.0025
A retrospective observational study (retrospective
sampling) [Ramsey and Schafer]
52
Calculations

53
Intentionally Kept Blank

54
Exact Test: Independence of Two Attributes
• Example: Data collected on a random sample of
people attending preview of a movie

• Question: did the movie have equal appeal to the young and
the old, or was it more liked by the young?
• Test H0: the two attributes are independent against Ha:
they are positively associated.
55
Exact Test: Independence of Two Attributes
• To test if two qualitative characters (attributes) A and B
are independent, let P(A = Ai, B = Bj) = pij, i = 1, …, k, j = 1, …, l.
• Let P(A = Ai) = Σ_{j=1}^{l} pij = p_{i0};  let P(B = Bj) = Σ_{i=1}^{k} pij = p_{0j}
• Test H0: pij = p_{i0} p_{0j}, for all i, j.
• nij = observed frequency for cell AiBj. The marginal frequencies
of Ai and Bj are n_{i0} = Σ_{j=1}^{l} nij and n_{0j} = Σ_{i=1}^{k} nij

56
Exact (Conditional) Test: Independence of
Two Attributes
• To test if two qualitative characters (attributes) A and B are
independent, let P(A = Ai, B = Bj) = pij, i = 1, …, k, j = 1, …, l.
• Test H0: pij = p_{i0} p_{0j}, for all i, j.
• nij = observed frequency for cell AiBj. The marginal frequencies of Ai and Bj
are n_{i0} = Σ_{j=1}^{l} nij and n_{0j} = Σ_{i=1}^{k} nij
• Under H0, the conditional distribution of {nij, all i, j} given the current
sample marginals {n_{i0}, n_{0j}, all i, j} has the (multivariate
hypergeometric) pmf

    P({nij} | marginals) = [∏_i n_{i0}!] [∏_j n_{0j}!] / [n! ∏_{i,j} nij!]

57
Exact (Conditional) Test: Independence of
Two Attributes
• Add up probabilities, under H0, of the given table and of
those indicating more extreme positive association (and
having the same marginals). These tables and
corresponding probabilities are:

• So, P-value = 0.0198 < 0.05  Ha seems to be true


58
Intentionally Kept Blank

59
Homogeneity versus Independence
Hypotheses
• Hypothesis of homogeneity
H0: π1 = π2
Not done in Retrospective Product Binomial Sampling

• Hypothesis of Independence
(At this stage qualitatively expressed)
Done only in Poisson or Multinomial Sampling

60
Homogeneity versus Independence
Hypotheses (contd.)
• The hypothesis of independence is used to
investigate an association between row and column
factors without specifying one of them as a
response. Although the hypotheses may be
expressed in terms of parameters, it is more
convenient to use the qualitative wording:
• H0: The row categorization is independent of the
column categorization

61
Sampling scheme versus Hypotheses
Sampling scheme   Marginal total          Independence   Homogeneity
                  fixed in advance        hypothesis     hypothesis
Poisson           None                    YES            YES
Multinomial       Grand total             YES            YES
                  (sample size)
Prospective       Row (explanatory)                      YES
                  totals
Retrospective     Column (response)                      YES (through the
                  totals                                 “odds ratio” only)

62
Inference for 22 Table
(Sec 2.2, Text)

Measures of Association:
• (i) Relative Risk (or Incidence Rate Ratio or
‘Probability Ratio’)
• (ii) Difference Between Proportions,
• (iii) Odds Ratio

63
Is “Tutoring” Helpful in a Business Stat Course?

                Success   Failure   Row marginal
Tutoring        a         b         (a + b)
No Tutoring     c         d         (c + d)
Col. marginal   (a + c)   (b + d)   n = (a + b + c + d)

Estimated Risk Ratio = [a/(a + b)] / [c/(c + d)] = a(c + d) / [(a + b)c] = (ad + ac)/(bc + ac),
where π̂1 = a/(a + b), π̂2 = c/(c + d)

Estimated Odds Ratio = (a/b) / (c/d) = ad/(bc)
64
Relative Risk vs Odds Ratio

• Relative risk tells how much ‘risk’ (probability) is increased or


decreased from an initial level. It is readily understood. A
relative risk of 0.5 means the initial risk has halved. A relative
risk of 2 means initial risk has increased twofold.
• Odds ratio is simply the ratio of odds in two groups of interest.
If the odds ratio is less than one then the odds (and therefore
the risk too) has decreased, and if the odds ratio is greater
than one then they have increased. But by how much?
• How to interpret an odds ratio of, say, 0.5 or an odds ratio of
2? Lack of familiarity with odds implies no intuitive feel for the
size of the difference when expressed in this way.

65
Layout of the 2×2 table
Column factor
(Response)

Level 1 Level 2
Level 1 n11 n12 R1=n1+
Row Factor Row
(Explanatory) Total
Level 2 n21 n22 R2=n2+

C1=n+1 C2=n+2 T=n

Grand
Total
Column
Marginal
Total 66
Totals
(i) Relative Risk (RR) or Incidence Rate Ratio (IRR)
(Text, p.53)

(Population proportion p is also denoted by the Greek letter π)

• The relative risk (RR) of response Y = 1 of population
X = 1 to population X = 0 is the ratio of the two population
proportions:

    RR = P(Y = 1 | X = 1) / P(Y = 1 | X = 0) = π1/π2

• RR > 1 means the probability of response is larger in
population X = 1 than in population X = 0

• Estimate of RR:  RR̂ = (n11/n1+) / (n21/n2+)

67
Confidence Intervals for Relative Risk (RR)
(Text, p.54)
• Estimate of RR (π1/π2):  RR̂ = (n11/n1+) / (n21/n2+)
• Estimate of the “asymptotic” variance of loge(RR̂):

    Var(loge RR̂) = (1 − π̂1)/n11 + (1 − π̂2)/n21

• 100(1 − α)% CI for RR:

    RR̂ × exp(−Z_{α/2} √Var(loge RR̂))  to  RR̂ × exp(Z_{α/2} √Var(loge RR̂))

  Z_α = 100(1 − α)-th percentile of the N(0,1) distribution

Note: RR cannot be estimated with retrospective sampling

68
Difference Between two Proportions
• Interpreting the difference between two proportions
may not always be easy.
• Two proportions π1 = 0.5 and π2 = 0.45 have the same
difference as π1 = 0.1 and π2 = 0.05 (even though in
the second case one is twice the other). This is when
relative risk is a better measure.
• An alternative to comparing proportions (i.e., π1
versus π2) is to compare the corresponding odds (i.e.,
ω1 = π1/(1 − π1) versus ω2 = π2/(1 − π2)).

69
Confidence Interval for π1  π2
• Estimate of π1 − π2:  π̂1 − π̂2 = n11/n1+ − n21/n2+

• Var(π̂1 − π̂2) = π̂1(1 − π̂1)/n1+ + π̂2(1 − π̂2)/n2+

• s.e.(π̂1 − π̂2) = √[ π̂1(1 − π̂1)/n1+ + π̂2(1 − π̂2)/n2+ ]

• 100(1 − α)% CI for π1 − π2:

    (π̂1 − π̂2) ± Z_{α/2} √[ π̂1(1 − π̂1)/n1+ + π̂2(1 − π̂2)/n2+ ]

  where Z_{α/2} = 1.96 for α = .05

70
Testing H0: π1  π2 = 0
• Estimate of π1 − π2:  π̂1 − π̂2 = n11/n1+ − n21/n2+

• Pooled estimate:  π̂ = (n11 + n21)/(n1+ + n2+)

• Under H0, Var(π̂1 − π̂2) is estimated by π̂(1 − π̂)(1/n1+ + 1/n2+)

• Test statistic  Z = (π̂1 − π̂2) / √[ π̂(1 − π̂)(1/n1+ + 1/n2+) ]
  is asymptotically N(0,1) under H0, if n1+, n2+ are ‘large’

  Z_{α/2} = 1.96 for α = .05

71
Exact Test of Two Proportions
• Example. Compare two methods of treatment of an allergy.
Method 1 (A) uses 15 patients and Method 2 (B) uses 14. Is
Method 2 better than Method 1?

• Here n1+ = 15, n2+ = 14, n11 = 6, n21 = 11 and Ha: p1 < p2. Here the
sample sizes are not large, hence asymptotic tests are not
applicable. Need to use exact tests.

72
Exact (Conditional) Test of Two Proportions
(GGD, Fundamentals, Vol 1)

• Consider two populations in which the proportions of subjects
with a certain characteristic are p1 and p2. Random
samples of sizes n1 (same as the n1+ notation) and n2
(same as the n2+ notation) are drawn independently from
the two populations. Let X1 and X2 denote the numbers of
members having the characteristic in the samples.
• Want to test H0: p1 = p2 (= p, unknown)
• Make use of the statistics X1, X2, but concentrate on
samples for which X = X1 + X2 is fixed, the same as the observed
sum (x1 + x2).
73
Exact Test of Two Proportions
• The conditional pmf of X1 given X = x1 + x2 = x is hypergeometric:

    f(x1 | x) = C(n1, x1) C(n2, x − x1) / C(n1 + n2, x)

• If the observed value of X1 is x10 and that of X is x0, then
use the conditional pmf of X1, f(x1 | x0), for testing H0.

74
Exact Test of Two Proportions

• H0: p1 = p2 against Ha: p1 > p2: the P-value is
computed as  Σ_{x1 ≥ x10} f(x1 | x0)

• H0: p1 = p2 against Ha: p1 < p2: the P-value is
computed as  Σ_{x1 ≤ x10} f(x1 | x0)

75
Example: Exact Test of Two Proportions
• Example. Compare two methods of treatment of an allergy.
Method 1 (A) uses 15 patients and Method 2 (B) uses 14. Is
Method 2 better than Method 1?

• Here n1 = 15, n2 = 14, x0 = 17, x10 = 6 and Ha: p1 < p2

76
(iii) Odds, and Odds Ratio
Odds of an outcome: Let  be the population
proportion of “YES” outcomes. Then the
corresponding odds is given by,

   /(1   )
The sample odds is given by,

ˆ  ˆ /(1  ˆ )

77
(iii) Odds, and Odds Ratio (contd)
πi = population proportion of “YES” responses for
Group X = i. Then the odds of “YES” happening is given
by:  ωi = πi/(1 − πi),  0 ≤ ωi < ∞.
The sample proportions of “YES” in Group i give the
estimate:  ω̂i = π̂i/(1 − π̂i).
Odds Ratio of “YES” response in Group 1 to that in
Group 2:

    φ = ω1/ω2 = [π1/(1 − π1)] × [(1 − π2)/π2]

78
Odds versus Probabilities
Given the probability  of a “YES” outcome, the
corresponding odds is given by,

   /(1   )
Similarly, given the odds ω of a “YES” response, the
corresponding probability  is given by

   /(1  )

79
Odds versus Probabilities (contd.)
Interpretation: An event with chance of
occurrence 0.95 means the event has odds of 19
to 1 in favour of its occurrence while an event with
chances 0.05 has the same odds 19 to 1, against it.

We generally express the larger number first.

80
Relation between Probability, Odds & Logit
Probability   Odds   Log(Odds) = Logit
0             0      NC
0.1           0.11   −2.20          Odds maps probability
0.2           0.25   −1.39          from [0,1] to [0,∞)
0.3           0.43   −0.85          asymmetrically,
0.4           0.67   −0.41          while Logit maps it to
0.5           1.00    0.00          (−∞, ∞) symmetrically
0.6           1.50    0.41
0.7           2.33    0.85
0.8           4.00    1.39
0.9           9.00    2.20
1             NC     NC

81
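The table above can be regenerated in a few lines (the NC entries are skipped, since odds and logit are undefined at probabilities 0 and 1):

```python
import math

# Reproduce the probability -> odds -> logit table
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    odds = p / (1 - p)            # maps [0,1) to [0, infinity)
    logit = math.log(odds)        # maps (0,1) to the whole real line
    print(f"{p:.1f}  {odds:5.2f}  {logit:6.2f}")
```

Note the symmetry of the logit column about p = 0.5, which the odds column lacks.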
Example: NFL Football
TEAM ODDS against (Prob of Win)
San Francisco 49ers Even (1/2)
Denver Broncos 5 to 2 (2/7)
New York Giants 3 to 1 (1/4)
Cleveland Browns 9 to 2 (2/11)
Los Angeles Rams 5 to 1 (1/6)
Minnesota Vikings 6 to 1 (1/7)
Buffalo Bills 8 to 1 (1/9)
Pittsburgh Steelers 10 to 1 (1/11)

Total probability is 1.73!!


[Christensen]
82
Odds versus Probabilities (contd.)
Some facts:
1. Odds must be greater than or equal to zero but
have no upper limit.
2. Odds are not defined for proportions that are
exactly 0 or 1.
3. If the odds of a “YES” outcome is ω, then the odds
of a “NO” is 1/ω.

83
The Following are Equivalent
• The proportions π1, π2 are equal.

• The odds are equal.

• The odds ratio is equal to 1.

• The log(odds ratio) is equal to 0.

84
Confidence Intervals for Odds Ratio (OR)
(Text, p.52)
• Estimate of OR:  OR̂ = (n11 n22)/(n21 n12)
• Estimate of the “asymptotic” variance of loge(OR̂):

    Var(loge OR̂) = 1/n11 + 1/n22 + 1/n21 + 1/n12

• 100(1 − α)% CI for OR:

    OR̂ × exp(−Z_{α/2} √Var(loge OR̂))  to  OR̂ × exp(Z_{α/2} √Var(loge OR̂))

Note: unlike RR, the OR can be estimated even with retrospective sampling

85
Test for Homogeneity
• Hypothesis of homogeneity
H0: π1 = π2

• Alternatively,
H0 : ω1 = ω2, or H0: φ = 1, or H0: log(φ) = 0

86
Odds Ratio (Contd.)
Interpretation:
If the odds ratio φ = ω1/ω2 equals 4, then ω1 = 4ω2.
This means that the odds of a “yes” outcome in the
first group are four times the odds of a “yes” outcome in
the second group.

87
Advantages of Odds Ratio over
Risk Ratio or Difference of Proportions
1. The estimate of the Odds Ratio (OR) remains invariant over
the sampling design (i.e., it works even in the case of
retrospective sampling), and it is given by
OR̂ = (n11 n22)/(n12 n21), since

    OR = [P(Y=1|X=1)/P(Y=0|X=1)] / [P(Y=1|X=0)/P(Y=0|X=0)]
       = [P(Y=1,X=1) P(Y=0,X=0)] / [P(Y=0,X=1) P(Y=1,X=0)]
       = [P(X=1|Y=1)/P(X=0|Y=1)] / [P(X=1|Y=0)/P(X=0|Y=0)]

2. Comparison of odds extends nicely to regression
analysis when the response (Y) is a categorical variable. 88
Computation of odds ratio in a 2x2 table
Cold No Cold

Placebo 335 76

Vitamin C 302 105

Odds ratio = (335)(105)/(302)(76) = 4.41/2.88 = 1.53

Calculate the odds ratio by dividing the product of the diagonal elements of the
table by the product of the off-diagonal elements of the table.

The above result indicates that the odds of getting cold on a placebo
treatment is 1.53 times larger than that of getting cold on vitamin C
treatment.
89
Example: Computation of odds ratio
Cancer Control

Smoker 83 72

Non-Smoker 3 14

Odds ratio = (83)(14)/(3)(72) = 5.38

Calculate the odds ratio by dividing the product of the diagonal elements of the
table by the product of the off-diagonal elements of the table.

The above result indicates that the odds of getting cancer for a smoker is
5.38 times larger than that of getting cancer for a non-smoker.

90
Sampling Distribution of the
Loge of Estimated Odds Ratio

Let  be the odds ratio. Then it can be shown that for


the estimated odds ratio 𝜑, using DELTA method,
1 1
ln(𝜑) ~ N 𝑙𝑛 (𝜑), +
𝑛1+ 𝜋1 (1− 𝜋1 ) 𝑛2+ 𝜋2 (1− 𝜋2 )
for large samples, where ‘ln’ denotes loge

91
Two Formulae of Standard Errors for the
Loge of Odds Ratio
• The estimated variance is obtained by substituting
sample quantities for unknowns in the variance
formula of the estimator. The sample quantities used
to replace the unknowns depend on the usage.
– For a confidence interval, π1 and π2 are replaced by their
individual sample estimates.
– For the test of hypothesis, they are replaced by their
pooled sample estimate from the combined sample.

92
• Testing: The Odds are equal  then odds ratio=1
 ln(odds ratio) = 0.
– If the sample sizes are large, resulting P-value for testing
ln(1/2) = 0, is nearly identical to that obtained with
the Z-test for equal proportions (π1 = π2).

• Confidence interval for odds ratio (OR):


– Construct a confidence interval for log(odds ratio) and
take the antilogarithm of the endpoints.
– A shortcut formula (p.52, text) for the standard
error of log(OR) is the square root of the sum of
the reciprocals of the four cell counts in the 2x2
table.

93
Testing Equality of proportions π1 and π2,
i.e., log(OR)=0 :
• To test the equality of odds of “YES” 1 and 2 in two
Groups ( H0: 1/ 2 =1) , one estimates the common
proportion from combined sample and compute
standard error based on it.
• Estimated st. dev. for constructing Test Statistic:
1 1
s.e.(𝑙𝑛( 𝜔1 /𝜔2 )) = 𝑛1+ 𝜋𝑐 (1− 𝜋𝑐 )
+
𝑛2+ 𝜋𝑐 (1− 𝜋𝑐 )
(𝑛11 +𝑛21 )
𝑤ℎ𝑒𝑟𝑒 𝜋𝑐 =
(𝑛1+ +𝑛2+ )
𝑙𝑛( 𝜔1 /𝜔2 )
• Test statistic=
s.e.( 𝑙𝑛( 𝜔1 /𝜔2 )) ~ N(0,1)
Reject H0 if |Test statistic value| > Z/2
94
Example: Cardio-Vascular Deaths and Obesity
among women in American Samoa

[7.76 (=16/2061) observed deaths versus 6.66 (=7/1051) deaths per


thousand.]
Test equality of proportions of CVD deaths in populations of obese
and nonobese Samoan women ( ln(Odds Ratio) =0 )
This is an “Observational Study“, an example of “Poisson Sampling” [Ramsey, F. L.
and Schafer, D. W. (1997). The Statistical Sleuth. Duxbury Press, Belmont,
California.]
95
Testing equality of two population odds:
Cardiovascular disease and obesity data

1. Estimate the odds of CVD death in group 1 (obese)
   and in group 2 (nonobese).
2. Compute the odds ratio and its log.
3. Compute the pooled proportion from the combined sample.
4. Compute the SE for the log odds ratio estimate (test
   version).
5. Compute the Z-statistic.
6. Compute the one-sided P-value.
96
Confidence Interval for Odds Ratio
(through that for loge of Odds Ratio)

Estimated st. dev. for Confidence Interval for 𝑙𝑛(1/ 2 ):


1 1
s.e.(𝑙𝑛( 𝜔1 /𝜔2 )) = 𝑛1+ 𝜋1 (1− 𝜋1 )
+
𝑛2+ 𝜋2 (1− 𝜋2 )

1 1 1 1
= + + + (short-cut formula, p.52, text)
𝒏𝟏𝟏 𝒏𝟏𝟐 𝒏𝟐𝟏 𝒏𝟐𝟐

Confidence interval for odds ratio:


– First construct a confidence interval for log(odds ratio):
𝒏 𝒏 1 1 1 1
𝐥𝐧 𝟏𝟏 𝟐𝟐  𝒁/𝟐 + + +
𝒏𝟐𝟏 𝒏𝟏𝟐 𝒏𝟏𝟏 𝒏𝟏𝟐 𝒏𝟐𝟏 𝒏𝟐𝟐
– Then take the antilogarithm of the endpoints to get
confidence interval for odds ratio.
97
Confidence interval for Odds Ratio:
Smoking and Cancer Data
              CANCER   CONTROL
SMOKER        83       72
NON-SMOKER    3        14

1. Odds ratio and its log:  φ̂ = 5.38,  ln(φ̂) = 1.683

2. Shortcut method for the SE of the log odds ratio:
   √(1/83 + 1/72 + 1/3 + 1/14) = 0.656

3. 95% interval for the log of the odds ratio:
   1.683 ± 1.96 × 0.656 = [0.396, 2.969]

4. 95% interval for the odds ratio: exp(0.396) to exp(2.969),
   or 1.486 to 19.471

Conclusion: The odds of cancer for the smokers are estimated to be 5.38 times the
odds of cancer for non-smokers (approximate 95% CI: 1.486 to 19.471)
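The smoking-data interval can be reproduced with the shortcut SE formula:

```python
import math

# Smoking vs lung cancer 2x2 counts
n11, n12 = 83, 72      # cancer, control among smokers
n21, n22 = 3, 14       # cancer, control among non-smokers

or_hat = (n11 * n22) / (n12 * n21)
log_or = math.log(or_hat)

# Shortcut SE: square root of the sum of reciprocals of the four cells
se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# 95% CI on the log scale, then exponentiate the endpoints
lo = math.exp(log_or - 1.96 * se)
hi = math.exp(log_or + 1.96 * se)
print(round(or_hat, 2), round(se, 3), round(lo, 2), round(hi, 2))
# 5.38 0.656 1.49 19.47
```

The wide interval reflects the tiny non-smoker cancer cell (count 3).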
Confidence interval for odds ratio:
Vitamin C and Cold data
              Cold   No Cold
Placebo       335    76
Vitamin C     302    105

1. Odds ratio and its log:  φ̂ = (335 × 105)/(302 × 76) = 1.53,  ln(φ̂) = 0.427

2. Shortcut method for the SE of the log odds ratio:
   √(1/335 + 1/76 + 1/302 + 1/105) = 0.170

3. 95% interval for the log odds ratio:
   0.427 ± 1.96 × 0.170 = [0.093, 0.761]

4. 95% interval for the odds ratio: exp(0.093) to exp(0.761), or 1.10 to 2.14

Conclusion: The odds of a cold for the placebo group are estimated to be 1.53
times the odds of a cold for the vitamin C group (approximate 95% CI: 1.10 to 2.14)
99
Intentionally Kept Blank

100
Test for Marginal Homogeneity
(McNemar’s Test, Text, p.55-56)

Comparing dependent proportions in a matched-pair or
pre-post treatment study design

H0: Prevalence of depression at the two time points is equal
(P(X=1) = p1+ = p+1 = P(Y=1), i.e., the treatment has no effect)

McNemar’s (Chi-square) test statistic = (n12 − n21)²/(n12 + n21) ~ χ²1
101
Test for Marginal Homogeneity
(McNemar’s Test, Text, p.55-56)

H0: Prevalence of depression at the two time points is
equal (i.e., the treatment has no effect)

McNemar’s (Chi-square) test statistic = (n12 − n21)²/(n12 + n21) ~ χ²1

Here, (9 − 41)²/(9 + 41) = 20.48; P-value = 6.02E-06;

Conclusion: The treatment seems to be effective
102
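A check of McNemar's statistic and its df = 1 P-value; for df = 1, P(χ²₁ > x) = 2(1 − Φ(√x)) = erfc(√(x/2)):

```python
import math

n12, n21 = 9, 41        # discordant pair counts from the depression study

# McNemar's chi-square statistic uses only the discordant cells
stat = (n12 - n21) ** 2 / (n12 + n21)

# df = 1 chi-square tail probability via the complementary error function
p_value = math.erfc(math.sqrt(stat / 2))

print(round(stat, 2), p_value)   # 20.48 and about 6e-06
```

The tiny P-value reproduces the slide's conclusion that the treatment is effective.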
Intentionally Kept Blank

103
Cochran-Mantel-Haenszel Test for no row-by-
column association in any of the 2×2 Tables
(pp. 94-101)

104
Cochran-Mantel-Haenszel Test (pp. 94-101)

    Q_CMH = [ Σ_{h=1}^{q} (n11^(h) − m11^(h)) ]² / Σ_{h=1}^{q} v11^(h)

where m11^(h) = n1+^(h) n+1^(h) / n^(h)  and
v11^(h) = n1+^(h) n+1^(h) n2+^(h) n+2^(h) / [ (n^(h))² (n^(h) − 1) ]

Text, p. 100 (here h = 1, 2):
Q_CMH = (18 − 16.4 + 32 − 28.8)² / (2.3855 + 3.7236) = 3.7714;
P-value = 0.052 with the χ²1 distribution
105
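The Q_CMH arithmetic can be verified from the per-stratum quantities quoted from the text (the m11 and v11 values as given on p. 100):

```python
import math

# Per-stratum observed counts, null means, and null variances (Text, p.100)
n11 = [18, 32]
m11 = [16.4, 28.8]
v11 = [2.3855, 3.7236]

# CMH statistic: squared sum of deviations over summed variances
q_cmh = sum(o - m for o, m in zip(n11, m11)) ** 2 / sum(v11)

# df = 1 chi-square tail probability
p_value = math.erfc(math.sqrt(q_cmh / 2))
print(round(q_cmh, 4), round(p_value, 3))   # about 3.7714 and 0.052
```

The P-value of 0.052 sits just above the conventional 5% threshold.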
Intentionally Kept Blank

106
Cochran-Armitage Trend Test
(See Text, p.60-61)
Binary categorical (row) variable X, ordered (column)
response variable Y.

Test H0: the proportions of X = 1 show no (linear) trend
as a function of the levels of Y.
The Cochran-Armitage test statistic ~ N(0,1).
It basically tests whether the slope of a linear regression of the
proportions of X = 1 on the levels of Y is zero or not. [Exercise:
check the calculations on p. 61, text] 107
References
• Agresti, A. (2012). Categorical Data Analysis, Wiley Series in
Probability and Statistics.
• Bishop, Y., Fienberg, S. E. and Holland, P. W. (1975). Discrete
Multivariate Analysis, MIT Press, Cambridge.
• Christensen, R. (1990). Loglinear Models. Springer-Verlag,
New York.
• Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.
• Read, T. R. C. and Cressie, N. (1988). Goodness of fit Statistics
for Discrete Multivariate Data. Springer-Verlag, New York.
• Goon, Gupta and Dasgupta. Fundamentals of Statistics, Volume
One.
108
