Lecture 17: Statistical Inference by Dr. Javed Iqbal
Chi Square Procedures: Chi Square Test of Independence
Weiss, p-599, Chi Square Distribution: A positively skewed probability
distribution with an associated degrees of freedom parameter. The domain of a Chi
Square RV, just like an F distribution, is from zero to infinity. This distribution has
many applications in statistical inference including the test of independence and
goodness of fit test.
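As a small illustration of the distribution itself: in the special case of 2 degrees of freedom, the Chi Square upper-tail probability has the closed form P(X > x) = exp(−x/2), so critical values can be computed without a table. The sketch below relies on that special case only (it does not hold for other degrees of freedom), and the function names are illustrative rather than taken from any textbook or library.

```python
import math

def chi2_sf_2df(x):
    """Upper-tail probability P(X > x) for a Chi Square RV with 2 df."""
    if x < 0:
        return 1.0  # the domain starts at zero, so all mass lies above x
    return math.exp(-x / 2.0)

def chi2_critical_2df(alpha):
    """Value c such that P(X > c) = alpha, valid for 2 df only."""
    return -2.0 * math.log(alpha)

critical_05 = chi2_critical_2df(0.05)
print(round(critical_05, 3))  # 5.991
```

The value 5.991 is the same critical value that a Chi Square table gives for a 0.05 level of significance and 2 degrees of freedom.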
Contingency Table: We know that data on one variable are grouped into a
frequency distribution. Data from two variables are called bivariate data, and a
frequency distribution for bivariate data is called a contingency table or two-way
table or cross tabs.
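A contingency table can be tallied directly from raw bivariate data. The following is a minimal sketch using a handful of made-up (class level, party) pairs; the records are hypothetical and are not the data of any textbook table.

```python
from collections import Counter

# Hypothetical raw bivariate data: one (class level, party) pair per student.
records = [
    ("Freshman", "Democratic"),
    ("Freshman", "Republican"),
    ("Junior", "Democratic"),
    ("Senior", "Other"),
    ("Junior", "Democratic"),
    ("Freshman", "Democratic"),
]

cell_counts = Counter(records)                       # joint (cell) frequencies
row_totals = Counter(level for level, _ in records)  # marginal totals, variable 1
col_totals = Counter(party for _, party in records)  # marginal totals, variable 2

rows = sorted(row_totals)
cols = sorted(col_totals)
print("columns:", cols)
for level in rows:
    counts = [cell_counts[(level, party)] for party in cols]
    print(level, counts, "total:", row_totals[level])
print("column totals:", [col_totals[p] for p in cols], "grand total:", len(records))
```

Each cell counts the individuals who fall in one category of the first variable and one category of the second at the same time; the margins are the ordinary one-variable frequency distributions.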
Weiss Example 13.5, p-610. Table 13.7 shows the classification of students with
respect to two attributes, i.e. class level (Freshman, Sophomore, Junior, Senior) and
political party affiliation (Democratic, Republican, Other). Table 13.9 shows the
contingency table for the data.
Association between variables: Two variables of a population are associated if
knowledge of the value of one variable can help in predicting the value of the other
variable. Two associated variables are also called statistically dependent variables.
Conversely, two non-associated variables are called statistically independent
variables.
Chi Square Test of Independence: If we have bivariate sample data arranged in a
contingency table, this test helps us to assess whether the two variables involved are
statistically independent. This test is commonly used for qualitative (nominal or
ordinal) data. For quantitative variables, regression is the usual tool for this
purpose. However, the Chi Square test can also be used for quantitative data when
the normality or constant-variance assumption is violated. To do so, we first
categorize the quantitative variable into classes.
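Categorizing a quantitative variable amounts to choosing cut points and assigning each value to a class. A minimal sketch follows; the cut points, class labels and readings are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical cut points and class labels (assumptions for illustration).
cut_points = [120, 140]
labels = ["low", "medium", "high"]

def categorize(value):
    """Return the class label for a quantitative value.

    Values below 120 are "low", values from 120 up to but not
    including 140 are "medium", and values of 140 or more are "high".
    """
    return labels[bisect_right(cut_points, value)]

readings = [118, 120, 135, 150]
print([categorize(v) for v in readings])  # ['low', 'medium', 'medium', 'high']
```

Once every observation carries a class label, the variable can be cross-tabulated like any qualitative variable.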
Example: Consider the data given in the following contingency table, where 180
individuals are classified with respect to smoking pattern and hypertension status.
                                   Smoking Pattern
Hypertension Status    Non-Smokers   Moderate Smokers   Heavy Smokers   Total
Hypertension                21              36                30           87
No Hypertension             48              26                19           93
Total                       69              62                49          180
Test the hypothesis that the presence or absence of hypertension is independent of
smoking pattern. Use a 0.05 level of significance.
Sol: Here we want to test whether the smoking pattern of adults is associated with
their hypertension status.
H0: Smoking pattern and hypertension status are not associated (i.e. independent)
H1: Smoking pattern and hypertension status are associated (i.e. dependent)
From probability theory we know that, if two events A and B are independent, then
$$P(A \cap B) = P(A)\,P(B)$$
Applying this to the events defined over the bivariate distribution, e.g.,
$$P(\text{Hypertension and Non-Smokers}) = P(\text{Hypertension})\,P(\text{Non-Smokers}) = \frac{87}{180} \times \frac{69}{180}$$
Thus, under the condition of independence, the expected frequency of persons who
have hypertension and are non-smokers, out of a total of 180 people, is:
$$180 \times P(\text{Hypertension and Non-Smokers}) = 180 \times \frac{87}{180} \times \frac{69}{180} = \frac{87 \times 69}{180} = 33.35$$
In general, the expected frequency of a cell (under independence) is found as:
$$\text{Expected Cell Frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
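The rule can be applied to every cell at once. The sketch below does so for the smoking and hypertension counts of the example; the function name `expected_frequencies` is an illustrative choice, not part of any library.

```python
# Observed counts from the example: rows are hypertension status,
# columns are non-smokers, moderate smokers, heavy smokers.
observed = [
    [21, 36, 30],  # Hypertension
    [48, 26, 19],  # No hypertension
]

def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
# [33.35, 29.97, 23.68]
# [35.65, 32.03, 25.32]
```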
The Chi Square test of independence is based on a comparison of the observed frequency
($O_i$) of two events occurring simultaneously with the expected frequency ($E_i$) of their
joint occurrence under the assumption of statistical independence. If the two sets of
frequencies are quite similar, the data are consistent with independence and the null
hypothesis is not rejected. Otherwise, statistical dependence is concluded. The test
statistic is given by:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
Under the null hypothesis, this test statistic approximately follows a Chi-Square
distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the number of
rows and the number of columns in the contingency table under consideration.
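The statistic and its degrees of freedom are straightforward to compute once the observed and expected tables are available. A minimal sketch follows, using a hypothetical 2 × 2 table whose expected counts are all 15; the function names and the numbers are assumptions for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

def degrees_of_freedom(table):
    """(r - 1)(c - 1) for a table with r rows and c columns."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Hypothetical 2 x 2 table: every row and column total is 30, grand total 60,
# so each expected count is 30 * 30 / 60 = 15.
toy_observed = [[10, 20], [20, 10]]
toy_expected = [[15.0, 15.0], [15.0, 15.0]]

toy_statistic = chi_square_statistic(toy_observed, toy_expected)
toy_df = degrees_of_freedom(toy_observed)
print(round(toy_statistic, 3), toy_df)  # 6.667 1
```

Each of the four cells contributes 25/15, which is why the statistic equals 100/15 here.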
Note: The test is valid if each expected frequency is at least 5. In some cases, we
need to merge two adjacent rows or columns of the contingency table to fulfil this
condition.
[However, a note in Weiss, p-622, mentions that the statistician Cochran considers this rule of at
least 5 too restrictive.]
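Checking the condition and merging adjacent columns can be sketched as follows. The table is hypothetical, chosen so that one column is too sparse, and the helper names are illustrative.

```python
def expected_frequencies(table):
    """Expected cell counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def all_expected_at_least(table, threshold=5.0):
    """True if every expected frequency meets the threshold."""
    return all(e >= threshold for row in expected_frequencies(table) for e in row)

def merge_adjacent_columns(table, j):
    """Return a new table with columns j and j + 1 added together."""
    return [row[:j] + [row[j] + row[j + 1]] + row[j + 2:] for row in table]

# Hypothetical counts: the middle column has only 5 observations in total.
sparse = [[12, 3, 5], [18, 2, 20]]
print(all_expected_at_least(sparse))   # False

merged = merge_adjacent_columns(sparse, 1)
print(merged)                          # [[12, 8], [18, 22]]
print(all_expected_at_least(merged))   # True
```

Merging reduces the number of columns, so the degrees of freedom of the test must be recomputed from the merged table.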
For the smoking pattern and hypertension example, the observed and expected
frequencies (shown in brackets) are as follows:
                                   Smoking Pattern
Hypertension Status    Non-Smokers   Moderate Smokers   Heavy Smokers   Total
Hypertension            21 (33.35)      36 (29.97)        30 (23.68)       87
No Hypertension         48 (35.65)      26 (32.03)        19 (25.32)       93
Total                       69              62                49          180
Note that the sum of the expected frequencies in any row or column must equal the
sum of the observed (actual) frequencies in that row or column. Any discrepancy is
due to rounding.
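The reason can be seen in one line. Writing $R_i$ for the total of row $i$, $C_j$ for the total of column $j$, and $n$ for the grand total (symbols introduced here only for this illustration), the expected frequencies in row $i$ sum to:

```latex
\sum_{j} E_{ij}
  = \sum_{j} \frac{R_i \, C_j}{n}
  = \frac{R_i}{n} \sum_{j} C_j
  = \frac{R_i}{n} \cdot n
  = R_i
```

The same argument applies to columns. In the hypertension row of the example, 33.35 + 29.97 + 23.68 = 87.00, which matches the observed row total of 87.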
Then the Chi Square test statistic can be calculated in the following table.
Cell #   Observed Frequency (O)   Expected Frequency (E)   (O − E)^2 / E
1                 21                      33.35                4.573
2                 36                      29.97                1.213
3                 30                      23.68                1.687
4                 48                      35.65                4.278
5                 26                      32.03                1.135
6                 19                      25.32                1.578
Sum              180                     180                  14.463
The test statistic has the value 14.463, and the critical value is Chi Square
[0.05, (2−1)(3−1)] = Chi Square (0.05, 2 df) = 5.991 (Table on Weiss p-776).
Since 14.463 > 5.991, the null hypothesis of independence of smoking pattern and
hypertension status is rejected, and we conclude that smoking pattern and
hypertension status are dependent.
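The whole calculation for this example can be reproduced in a few lines. This is a sketch rather than a general-purpose routine: the critical value 5.991 is taken from the table cited above, and the p-value line uses the closed form exp(−x/2), which is valid only because this table has 2 degrees of freedom. The p-value is an addition for illustration and is not part of the worked solution.

```python
import math

observed = [
    [21, 36, 30],  # Hypertension
    [48, 26, 19],  # No hypertension
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all six cells, with E = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991                 # tabulated value for alpha = 0.05 and 2 df
p_value = math.exp(-chi_square / 2.0)  # closed form, valid only for df == 2
reject_null = chi_square > critical_value

print(round(chi_square, 2), df, reject_null)  # 14.46 2 True
```

Because the code uses unrounded expected frequencies, the statistic can differ from the hand calculation in the third decimal place.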
Many discoveries in the sciences and social sciences are first flagged using such
statistical tests, e.g., the relationship between smoking and hypertension (high blood
pressure). The actual mechanisms through which smoking affects the human body, and
hence hypertension, are studied later in medical experiments and clinical trials.
Similarly, findings such as that physical activity can be as effective as
antidepressants or psychotherapy in treating mild or moderate depression, or that
sports improve mood, are first discovered using statistical methodology.
Weiss Ex 13.85 (p-627); Anderson Ex 11 (pdf p-599), Ex 12 (pdf p-600), Ex 16 (pdf p-601)
[Anderson Ex 11: Verify that the expected frequencies are
35.58913, 150.7304, 455.6804, 15.41087, 65.26957, 197.3196.
The calculated test statistic is 100.4 and Chi Square (0.05, 2) = 5.991, so the null
hypothesis of independence is rejected.]