Chapter Two (Estimation and Hypothesis Testing)

Chapter 2 discusses statistical estimation and hypothesis testing as methods for making inferences about population parameters. It explains point and interval estimation, the properties of good estimators, and the process of constructing confidence intervals for population means and proportions. The chapter also provides examples to illustrate the application of these statistical concepts in real-world scenarios.

Uploaded by

abebenegash436

Chapter 2

2. Statistical Estimation and Hypothesis testing


Introduction
Inference, specifically decision making and prediction, is centuries old and plays a very
important role in our lives. Each of us faces daily personal decisions and situations that
require predictions concerning the future. The inferences that individuals make should
be based on relevant facts, which we call observations, or data.
Many individuals tend to feel that their own built-in inference-making equipment is quite
good. However, experience suggests that most people are incapable of utilizing large
amounts of data, mentally weighing each bit of relevant information, and arriving at a
good inference. The statistician, rather than relying upon his or her own intuition, uses
statistical results to aid in making inferences. Although we touched on some of the
notions involved in statistical inference, we will now collect our ideas in a presentation of
some of the basic ideas involved in statistical inference.
Methods for making inferences about parameters fall into one of two categories. Either
we will estimate (predict) the value of the population parameter of interest or we will test
a hypothesis about the value of the parameter. These two methods of statistical
inference—estimation and hypothesis testing—involve different procedures, and, more
important, they answer two different questions about the parameter. In estimating a
population parameter, we are answering the question, ‘‘what is the value of the
population parameter?’’ In testing a hypothesis, we are answering the question, ‘‘is the
parameter value equal to this specific value?’’
These types of questions can be addressed through statistical hypothesis testing, which is
a decision-making process for evaluating claims about a population. In hypothesis
testing, the researcher must define the population under study, state the particular
hypotheses that will be investigated, give the significance level, select a sample from the
population, collect the data, perform the calculations required for the statistical test, and
reach a conclusion.
Inference is the process of making interpretations or conclusions from sample data for
the totality of the population. Inferential statistics uses the sample results to make
decisions and draw conclusions about the population from which the sample is drawn. In
statistics there are two ways through which inference can be made.
• Statistical estimation
• Statistical hypothesis testing
Both involve using sample statistics to make inferences about the population parameter.

[Figure: diagram relating Population, Sample, Numerical data, Analyzed data, and Inference]

2.1. Statistical Estimation:


This is one way of making inference about the population parameter, used when the
investigator does not have any prior notion about the values or characteristics of the
population parameter.
i. Point estimation: The goal of point estimation is to make a reasonable guess of the
unknown value of a designated population quantity, e.g., the population mean. The
quality of an individual estimate depends on the individual sample from which it was
computed and is therefore affected by chance variation. A point estimate is a single
value, computed from sample information, that is used to estimate a parameter. The best
point estimate of the population mean $\mu$ is the sample mean $\bar{X}$.

ii. Interval estimation: This is the procedure that yields an interval of values as an
estimate of a parameter, that is, an interval that contains the likely values of the
parameter. It deals with identifying the upper and lower limits of a parameter.
Estimator and Estimate
An estimator is a rule (a random variable) that helps us to approximate a population
parameter. An estimate is one of the possible values that the estimator can assume for a
particular sample.

For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is an estimator of the population mean, and
$\bar{X} = 10$ is an estimate, which is one of the possible values of $\bar{X}$.


Properties of a good estimator
The following are some desirable qualities of an estimator:
o It should be unbiased.
o It should be consistent.
o It should be relatively efficient.
To explain these properties, let $\hat{\theta}$ be an estimator of $\theta$.
1. Unbiased estimator: An estimator whose expected value is the value of the parameter
being estimated, i.e. $E(\hat{\theta}) = \theta$.
2. Consistent estimator: An estimator which gets closer to the value of the parameter as
the sample size increases, i.e. $\hat{\theta}$ gets closer to $\theta$ as the sample size increases.
3. Relatively efficient estimator: Among two or more estimators of the same parameter,
the one with the smallest variance is relatively efficient.
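The first two properties can be illustrated with a small simulation. This is only a sketch under assumed settings: the population is taken to be normal with hypothetical parameters `MU` and `SIGMA`, and the helper name `simulate_sample_means` is made up for this illustration. The average of many sample means should sit close to the population mean (unbiasedness), and their spread should shrink as the sample size grows (consistency).

```python
import random
from statistics import mean, pstdev


def simulate_sample_means(mu, sigma, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n from N(mu, sigma); return their means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        mean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repetitions)
    ]


MU, SIGMA = 50.0, 10.0  # hypothetical population parameters (assumption)
for n in (25, 100):
    means = simulate_sample_means(MU, SIGMA, n, repetitions=2000)
    # Unbiasedness: the average of the estimates is close to MU.
    # Consistency: the spread of the estimates shrinks as n grows.
    print(n, round(mean(means), 2), round(pstdev(means), 2))
```

The spread printed for n = 100 should be roughly half the spread for n = 25, in line with a standard deviation of $\sigma/\sqrt{n}$ for the sample mean.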
2.1.1. Point and Interval Estimation of the population mean: μ
i. Point estimation of the population mean: μ
Another term for statistic is point estimate, since we are estimating the parameter value.
A point estimator is the mathematical way we compute the point estimate. For instance,
the sum of the $X_i$ divided by $n$ is the point estimator used to compute the estimate of the
population mean $\mu$. That is, $\bar{X} = \frac{\sum X_i}{n}$ is a point estimator of the population mean.
ii. Confidence interval estimation of the population mean
Although $\bar{X}$ possesses nearly all the qualities of a good estimator, because of sampling
error we know that our sample statistic is unlikely to equal the population parameter
exactly; instead it will fall somewhere in an interval of values around it. We will have to
be satisfied knowing that the statistic is "close to" the parameter. That leads to the
obvious question, what is "close"?
We can phrase the latter question differently: How confident can we be that the value of
the statistic falls within a certain "distance" of the parameter? Or, what is the probability
that the parameter's value is within a certain range of the statistic's value? This range is
the confidence interval. A confidence interval is a specific interval estimate of a
parameter determined by using data obtained from a sample and the specific confidence
level of the estimate.
The confidence level is the probability that an interval constructed in this way around
the statistic contains the value of the parameter. There are different
conditions to be considered to construct confidence intervals of the population mean, $\mu$.

Condition-1: The population variance $\sigma^2$ is known and the population is normal,
whatever the sample size.
Recall the Central Limit Theorem, which applies to the sampling distribution of the mean
of a sample. Consider samples of size $n$ drawn, with replacement and order important,
from a population whose mean is $\mu$ and standard deviation is $\sigma$. The population can have
any frequency distribution. The sampling distribution of $\bar{X}$ will have a mean $\mu_{\bar{X}} = \mu$
and a standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, and approaches a normal distribution as $n$ gets large.
This allows us to use the normal distribution curve for computing confidence intervals.

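$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$

$$\Rightarrow \mu = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} = \bar{X} \pm \varepsilon, \quad \text{where } \varepsilon \text{ is a measure of error}$$

$$\Rightarrow \varepsilon = Z\frac{\sigma}{\sqrt{n}}$$
- For the interval estimator to be good, the error should be small. How can it be made small?
• By making n large
• Small variability
• Taking Z small
- To obtain the value of Z, we have to attach this to a theory of chance. That is, there is an
area of size $1-\alpha$ such that:
$$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$$

Where: $\alpha$ is the probability that the parameter lies outside the interval, and
$Z_{\alpha/2}$ is the value of the standard normal variable to the right of which a
probability of $\alpha/2$ lies, i.e. $P(Z > Z_{\alpha/2}) = \alpha/2$.

$$\Rightarrow P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$$

$$\Rightarrow P\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$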

If the population has a normal distribution and $\sigma$ is known, then a $(1-\alpha)100\%$
confidence interval for $\mu$ is given by:

$$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
Note: When (as is often the case) we don't know the population standard deviation $\sigma$
and $n$ is large ($n \ge 30$), we can approximate it by the sample standard deviation $S$, and
obtain the following (good) approximation of the $(1-\alpha)100\%$ confidence interval for $\mu$:

$$\left(\bar{X} - Z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{S}{\sqrt{n}}\right)$$
$Z_{\alpha/2}$ is the Z-value with an area of $\alpha/2$ to its right (obtained from a table).
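The interval formula can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `z_confidence_interval` and the inputs in the loop are hypothetical, and the table lookup for $Z_{\alpha/2}$ is replaced by the standard library's `statistics.NormalDist().inv_cdf`.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper) for a (1 - alpha)100% z-interval for the mean."""
    alpha = 1.0 - confidence
    # Z_{alpha/2}: the standard normal value with area alpha/2 to its right.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sd / sqrt(n)  # the error term: Z * sd / sqrt(n)
    return mean - margin, mean + margin


# Hypothetical inputs (assumption): sample mean 50, standard deviation 8.
for n in (25, 100, 400):
    lower, upper = z_confidence_interval(50.0, 8.0, n)
    print(n, round(lower, 3), round(upper, 3))
```

Running the loop shows the interval narrowing as n grows, which is the "make n large" route to a small error noted in the derivation.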

Condition-2: The population variance $\sigma^2$ is not known, $n$ is small ($n < 30$), and the
population is normal.

In most practical research, the standard deviation for the population of interest is not
known. In this case, the standard deviation $\sigma$ is replaced by the sample standard
deviation $S$, and the standard error of the mean is estimated by $S/\sqrt{n}$. Since $S$ is only
an estimate of the true value of the standard deviation, the standardized sample mean is no
longer normally distributed. Instead, it follows the t-distribution. The t-distribution
is described by its degrees of freedom. For a sample of size $n$, the t-distribution
will have $n-1$ degrees of freedom. The notation for a t-distribution with $n-1$
degrees of freedom is $t(n-1)$. As the sample size $n$ increases, the t-distribution
becomes closer to the normal distribution, since $S$ approaches the true
standard deviation for large $n$.

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$ has a t-distribution with $n-1$ degrees of freedom.

- The value of $t_{\alpha/2}$ can be obtained from a table with an area of $\alpha/2$ to the right with
$n-1$ degrees of freedom.
Therefore, the $(1-\alpha)100\%$ confidence interval for $\mu$ when the population is normally
distributed and $\sigma$ is not known is given by:

$$\left(\bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}\right)$$
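A minimal sketch of the t-interval follows. Python's standard library has no t-distribution quantile function, so this sketch assumes the critical value $t_{\alpha/2}$ is read from a t table and passed in; the function name `t_confidence_interval` and the numbers below are hypothetical.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_critical):
    """Return (lower, upper) for the interval mean +/- t_{alpha/2} * S / sqrt(n).

    t_critical must be read from a t table with n - 1 degrees of freedom.
    """
    margin = t_critical * s / sqrt(n)
    return mean - margin, mean + margin


# Hypothetical inputs (assumption): n = 16, so df = 15 and t_{0.025} = 2.131.
lower, upper = t_confidence_interval(mean=20.0, s=4.0, n=16, t_critical=2.131)
print(round(lower, 3), round(upper, 3))
```

Passing the table value explicitly keeps the sketch dependency-free and mirrors the hand calculation in the text.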
Example 2.1: A random sample of 900 workers showed an average height of 67 inches
with a standard deviation of 5 inches.
a) Find a 95% confidence interval of the mean height of all workers
b) Find a 99% confidence interval of the mean height of all workers
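Solution:

a) $\bar{X} = 67$, $S = 5$, $n = 900$
$(1-\alpha)100\% = 95\% \Rightarrow 1-\alpha = 0.95$
$\Rightarrow \alpha = 0.05,\ \alpha/2 = 0.025$
$\Rightarrow Z_{\alpha/2} = Z_{0.025} = 1.96$, from the table.

The required interval will be:

$$\left(\bar{X} - Z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{S}{\sqrt{n}}\right) = \left(67 - 1.96 \cdot \frac{5}{30},\ 67 + 1.96 \cdot \frac{5}{30}\right) = (66.673,\ 67.327)$$

b) $(1-\alpha)100\% = 99\% \Rightarrow 1-\alpha = 0.99$
$\Rightarrow \alpha = 0.01,\ \alpha/2 = 0.005$
$\Rightarrow Z_{\alpha/2} = Z_{0.005} = 2.58$, from the table.

The required interval will be:

$$\left(\bar{X} - Z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{S}{\sqrt{n}}\right) = \left(67 - 2.58 \cdot \frac{5}{30},\ 67 + 2.58 \cdot \frac{5}{30}\right) = (66.57,\ 67.43)$$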

Example 2.2: A Drug Company is testing a new drug which is supposed to reduce blood
pressure. From the six people who are used as subjects, it is found that the average drop
in blood pressure is 2.28 points, with a standard deviation of 0.95 points. What is the 95%
confidence interval for the mean change in pressure?
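Solution:
$\bar{X} = 2.28$, $S = 0.95$, $n = 6$
$(1-\alpha)100\% = 95\% \Rightarrow 1-\alpha = 0.95$
$\Rightarrow \alpha = 0.05,\ \alpha/2 = 0.025$
$\Rightarrow t_{\alpha/2} = t_{0.025} = 2.571$, from the table, with $df = 5$.

The required interval will be:

$$\left(\bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}\right) = \left(2.28 - 2.571 \cdot \frac{0.95}{\sqrt{6}},\ 2.28 + 2.571 \cdot \frac{0.95}{\sqrt{6}}\right) = (1.28,\ 3.28)$$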

Example 2.3: Suppose we want to estimate a 95% confidence interval for the average
quarterly returns of all fixed-income funds in Ethiopia. We draw a sample of 100
observations and calculate the sample mean to be 0.05 and the standard deviation 0.03.
We assume that those returns are normally distributed with known variance.
Solution:
$\bar{X} = 0.05$, $\sigma = 0.03$, $n = 100$
$(1-\alpha)100\% = 95\% \Rightarrow 1-\alpha = 0.95$
$\Rightarrow \alpha = 0.05,\ \alpha/2 = 0.025$
$\Rightarrow Z_{\alpha/2} = Z_{0.025} = 1.96$, from the table.

The confidence interval is:

$$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 0.05 \pm 1.96\left(\frac{0.03}{10}\right) = (0.04412,\ 0.05588)$$
2.1.2. Point and Interval Estimation of the Population proportion:
If $P$ represents the population proportion, then the sample proportion $\hat{P} = X/n$ provides
a good estimate of $P$. Therefore, the sample proportion $\hat{P}$ is the point estimator of the
population proportion. To construct the confidence interval for the proportion we require
the following conditions:
Conditions: The population proportion is not too close to zero or one, and
the sample size is large (at least 30).
• Under these conditions, the sampling distribution of $\hat{P} = X/n$ can be approximated by
a normal distribution that has mean $P$ and standard deviation $\sqrt{P(1-P)/n}$.
• To construct a confidence interval for $P$, we can now adopt the same argument
that was used in finding a confidence interval for $\mu$ and write:

$$P\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}} < P < \hat{P} + Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}\right) = 1 - \alpha$$

Hence a $(1-\alpha)100\%$ confidence interval for the population proportion $P$ is given by:

$$\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}},\ \hat{P} + Z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}\right)$$

Because $P$ is unknown, it is replaced by $\hat{P}$ under the square root. An approximate
$(1-\alpha)100\%$ confidence interval for the population proportion $P$ is then given by:

$$\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}},\ \hat{P} + Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}\right)$$

if the sample size is large (usually $n \ge 30$).
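The approximate interval for a proportion can be sketched as follows. The function name `proportion_confidence_interval` and the counts in the usage line are hypothetical; the critical value comes from `statistics.NormalDist` rather than a table, so results may differ from hand calculations in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for the large-sample interval for a proportion."""
    p_hat = successes / n  # sample proportion X / n
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # Z_{alpha/2}
    margin = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical inputs (assumption): 45 successes in 150 trials, 90% confidence.
lower, upper = proportion_confidence_interval(successes=45, n=150, confidence=0.90)
print(round(lower, 3), round(upper, 3))
```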

Example 2.4: In a sample of 400 people who were questioned regarding their
participation in sports, 160 said that they did participate. Construct a 98% confidence
interval for P, the proportion of people in the population who participate in sports.

Solution:
Let X be the number of people who participate in sports.

$X = 160$, $n = 400$, $\alpha = 0.02$, hence $Z_{\alpha/2} = Z_{0.01} = 2.33$

$$\hat{P} = \frac{X}{n} = \frac{160}{400} = 0.4$$
$$\sqrt{\frac{\hat{P}(1-\hat{P})}{n}} = \sqrt{\frac{0.4(0.6)}{400}} = 0.0245$$
As a result, an approximate 98% confidence interval for P is given by:

$$\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}},\ \hat{P} + Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}\right) = (0.4 - 2.33 \cdot 0.0245,\ 0.4 + 2.33 \cdot 0.0245) = (0.343,\ 0.457)$$
Hence, we are about 98% confident that the true proportion of people in
the population who participate in sports lies between 0.343 and 0.457.

2.2. STATISTICAL HYPOTHESIS TESTING


A statistical hypothesis test is a method of making statistical decisions using
experimental data.
Hypothesis Testing: Is a common method of drawing inferences about a population
based on statistical evidence from a sample.
Definitions:
Statistical hypothesis: An assertion, statement, or claim about the population whose
plausibility is to be evaluated on the basis of the sample data.
Test statistic: A statistic whose value serves to determine whether to reject or not
reject the hypothesis being tested.
There are two types of statistical hypotheses for each situation: the null hypothesis
and the alternative hypothesis.
a. Null hypothesis: A claim or statement about a population parameter that is
assumed to be true from the very beginning until it is declared false. It is a statistical
hypothesis that states a hypothesis of equality, or of no difference,
between a parameter and a specific value. It is usually denoted by $H_0$.
b. Alternative hypothesis: A claim or statement about a population parameter that
will be true if the null hypothesis is false. It is a statistical hypothesis that states a
hypothesis of difference between a parameter and a specific value. It is usually
denoted by $H_1$ or $H_A$.

Types and size of errors:


Hypothesis testing is based on sample data, which may involve sampling and
non-sampling errors.
• Type I error: Rejecting the null hypothesis when it is actually true. The
significance level ($\alpha$) can be interpreted as the probability of rejecting the null
hypothesis when it is actually true. The distribution of the test statistic under the
null hypothesis determines the probability $\alpha$ of a type I error.
$\alpha$ = P(type I error) = level of significance
• Type II error: Occurs when a false null hypothesis is not rejected. The null
hypothesis is actually false, but we wrongly fail to reject it. $\beta$
represents the probability that $H_0$ is not rejected when $H_0$ is actually false. The
distribution of the test statistic under the alternative hypothesis determines the
probability $\beta$ of a type II error.
$\beta$ = P(type II error)
• The power of a test, $1-\beta$, is the probability of correctly rejecting a false null
hypothesis.
$1-\beta$ = power of the test
Note: The two types of errors that occur in tests of hypothesis depend on each other.
We cannot lower the values of $\alpha$ and $\beta$ simultaneously for a test of hypothesis with a
fixed sample size. Lowering the value of $\alpha$ will raise the value of $\beta$, and lowering
the value of $\beta$ will raise the value of $\alpha$. However, we can decrease both $\alpha$ and $\beta$
simultaneously by increasing the sample size.
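The trade-off in the note can be made concrete with a short calculation. The sketch below assumes a right-tailed z-test of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ with known $\sigma$, for which $\beta = \Phi\left(Z_\alpha - \frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}}\right)$ when the true mean is $\mu_1$. This formula, the function name `type_ii_error`, and all numbers are assumptions added for illustration and do not come from the text above.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """beta for the right-tailed z-test of H0: mu = mu0 vs H1: mu > mu0.

    H0 is rejected when Z_cal > Z_alpha; beta is the probability that this
    does not happen when the true mean is mu_true (> mu0).
    """
    z_alpha = STANDARD_NORMAL.inv_cdf(1.0 - alpha)
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # mean of Z_cal under mu_true
    return STANDARD_NORMAL.cdf(z_alpha - shift)


# Hypothetical setting (assumption): mu0 = 100, true mean 103, sigma = 10.
for n in (25, 100):
    for alpha in (0.05, 0.01):
        beta = type_ii_error(100.0, 103.0, 10.0, n, alpha)
        print(n, alpha, round(beta, 3), round(1.0 - beta, 3))  # n, alpha, beta, power
```

For a fixed n, the smaller $\alpha$ gives the larger $\beta$; moving from n = 25 to n = 100 lowers $\beta$ at both significance levels.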
The following table gives a summary of possible results of any hypothesis test:

                       Actual situation (condition)
Decision               H0 is true          H0 is false
Do not reject H0       Correct decision    Type II error
Reject H0              Type I error        Correct decision
General steps in hypothesis testing:
1. State the appropriate hypotheses.
2. Select the level of significance, $\alpha$.
3. Select an appropriate test statistic.
4. Identify the critical region.
5. Compute the test value.
6. Make the decision.
7. Summarize the results.
2.2.1 Hypothesis tests about a population mean: $\mu$
Suppose the assumed or hypothesized value of $\mu$ is denoted by $\mu_0$; then one can
formulate two-sided (1) and one-sided (2 and 3) hypotheses as follows:
1. $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$

2. $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$

3. $H_0: \mu = \mu_0$ vs $H_1: \mu < \mu_0$

Condition-1: The population standard deviation $\sigma$ is known and sampling is from a
normal distribution, whatever the sample size.

The formula for the test statistic is:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
After specifying $\alpha$ we have the following test criteria corresponding to the above three
hypotheses.

Null hypothesis      Alternative hypothesis      Decision rule: reject H0 if
$\mu = \mu_0$        $\mu \neq \mu_0$            $|Z_{cal}| > Z_{\alpha/2}$
$\mu = \mu_0$        $\mu > \mu_0$               $Z_{cal} > Z_{\alpha}$
$\mu = \mu_0$        $\mu < \mu_0$               $Z_{cal} < -Z_{\alpha}$

Note: When we don't know the population standard deviation $\sigma$ and $n$ is large ($n \ge 30$),
we can approximate it by the sample standard deviation $S$, and obtain the following test
statistic:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim N(0,1)$$

The decision rule is the same as in Condition-1.
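The three decision rules can be collected into one small function. This is a sketch rather than a reference implementation: the name `z_test_mean`, the string labels for the alternatives, and the sample figures are all hypothetical, and the critical value is computed with `statistics.NormalDist` instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sd, n, alpha, alternative):
    """Return (z_cal, reject_h0) for a z-test about a population mean.

    alternative: "two-sided" (mu != mu0), "greater" (mu > mu0) or "less" (mu < mu0).
    sd is sigma when it is known, or S for a large sample.
    """
    z_cal = (sample_mean - mu0) / (sd / sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":
        reject = abs(z_cal) > std_normal.inv_cdf(1.0 - alpha / 2.0)
    elif alternative == "greater":
        reject = z_cal > std_normal.inv_cdf(1.0 - alpha)
    elif alternative == "less":
        reject = z_cal < -std_normal.inv_cdf(1.0 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z_cal, reject


# Hypothetical inputs (assumption): sample mean 52, mu0 = 50, sd = 6, n = 36.
z_cal, reject = z_test_mean(52.0, 50.0, 6.0, 36, alpha=0.05, alternative="two-sided")
print(round(z_cal, 3), reject)
```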

Condition-2: The population standard deviation $\sigma$ is unknown, the population is
normally or approximately normally distributed, and the sample size is small ($n < 30$).

The formula for the test statistic is:

$$t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1)$$
After specifying $\alpha$ we have the following test criteria corresponding to the above three
hypotheses, where the tabulated t values have $n-1$ degrees of freedom.

Null hypothesis      Alternative hypothesis      Decision rule: reject H0 if
$\mu = \mu_0$        $\mu \neq \mu_0$            $|t_{cal}| > t_{\alpha/2}$
$\mu = \mu_0$        $\mu > \mu_0$               $t_{cal} > t_{\alpha}$
$\mu = \mu_0$        $\mu < \mu_0$               $t_{cal} < -t_{\alpha}$

Example 2.5: The Tele Co. provides telephone service in an area. According to the
company's records, the average length of all calls placed was 12.5 minutes. A sample of
150 such calls placed through this Co. produced a mean length of 13 minutes with a
standard deviation of 2.6 minutes. Can you conclude that the mean length of all current
calls is different from 12.5 minutes? Use the 0.05 level of significance and assume that
the distribution of all calls is normal.
Solution:
Let $\mu$ = the population mean length of all current calls.
1. State the null and alternative hypotheses:
$H_0: \mu = 12.5$ (The mean length of all current calls is 12.5 minutes)
$H_1: \mu \neq 12.5$ (The mean length of all current calls is different from 12.5 minutes)
2. Select the level of significance: $\alpha = 0.05$ (given)
3. Select an appropriate test statistic:
The Z-statistic is appropriate because the sample size is large.
4. Identify the critical region:
Here we have two critical regions since we have a two-tailed hypothesis. The
critical region is $|Z_{cal}| > Z_{0.025} = 1.96$, so $(-1.96, 1.96)$ is the acceptance region.
5. Compute the test value. Given that $\bar{X} = 13$, $S = 2.6$, $n = 150$:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{13 - 12.5}{2.6/\sqrt{150}} = \frac{0.5}{0.212} = 2.36$$

6. Decision: Reject $H_0$, since $Z_{cal}$ is not in the acceptance region.
7. Conclusion: At the 5% level of significance, we have evidence to say that the average
length of all such calls is not equal to 12.50 minutes.
Example 2.6: Ten individuals are chosen at random from a population and their heights are
found to be, in inches, 63, 63, 66, 67, 68, 69, 70, 70, 71 and 71. The average height of the
population is taken to be 66 inches. In the light of the data, can we conclude that the
average height exceeds 66 inches? (Use $\alpha = 0.05$ and assume the normality of the population.)
Solution:
Let $\mu$ = the population mean height.
1. State the null and alternative hypotheses:
$H_0: \mu = 66$ vs $H_1: \mu > 66$
2. Select the level of significance: $\alpha = 0.05$ (given)
3. Select an appropriate test statistic: The t-statistic is appropriate because the
population standard deviation is unknown and the sample size is small.
4. Critical region: $t_{cal} > t_{\alpha, n-1} = t_{0.05, 9} = 1.8331$, so $(-\infty, 1.8331)$ is the acceptance
region.
5. Compute the test value:

$$\bar{X} = \frac{\sum_{i=1}^{10} X_i}{10} = 67.8, \quad S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} = 3.01, \quad n = 10$$

$$t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{67.8 - 66}{3.01/\sqrt{10}} = 1.891$$
6. Decision: Reject $H_0$, since $t_{cal}$ is not in the acceptance region.
7. Conclusion: At the 5% level of significance, we have evidence to say that the average
height of an individual is greater than 66 inches.
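The Condition-2 procedure can be sketched from summary statistics. Since the Python standard library provides no t quantiles, the sketch assumes the critical value is looked up in a t table and passed in; the function name `t_test_mean`, the alternative labels, and the inputs in the usage line are hypothetical.

```python
from math import sqrt


def t_test_mean(sample_mean, mu0, s, n, t_critical, alternative):
    """Return (t_cal, reject_h0) for a t-test about a population mean.

    t_critical is the table value with n - 1 degrees of freedom:
    t_{alpha/2} for "two-sided", t_alpha for "greater" or "less".
    """
    t_cal = (sample_mean - mu0) / (s / sqrt(n))
    if alternative == "two-sided":
        reject = abs(t_cal) > t_critical
    elif alternative == "greater":
        reject = t_cal > t_critical
    elif alternative == "less":
        reject = t_cal < -t_critical
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return t_cal, reject


# Hypothetical inputs (assumption): n = 16 (df = 15), alpha = 0.05, two-sided,
# so the table value is t_{0.025} = 2.131.
t_cal, reject = t_test_mean(21.5, 20.0, 4.0, 16, t_critical=2.131, alternative="two-sided")
print(round(t_cal, 3), reject)
```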
Example 2.7: A national magazine claims that the average college student watches less
television than the general public. The national average is 29.4 hours per week with a
standard deviation of 2 hours. A sample of 25 college students has a mean of 27 hours.
Test the claim at $\alpha = 0.01$ and assume normality of the population.
Solution:
1. State the null and alternative hypotheses:
$H_0: \mu = 29.4$ vs $H_1: \mu < 29.4$
2. Select the level of significance: $\alpha = 0.01$ (given)
3. Select an appropriate test statistic:
The Z-statistic is appropriate because the population standard deviation is known.
4. Critical region:
$Z_{cal} < -Z_{\alpha} = -Z_{0.01} = -2.33$, so $(-2.33, \infty)$ is the acceptance region for the null hypothesis.
5. Compute the test value:
$\bar{X} = 27$, $\sigma = 2$, $n = 25$

$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{27 - 29.4}{2/\sqrt{25}} = -6$$
6. Decision:
Reject $H_0$, since $Z_{cal}$ is not in the acceptance region.
7. Conclusion: At the 1% level of significance, there is evidence that the average college
student watches less television.
Example 2.8: An authority from a district power station of the town told reporters
recently that the average monthly electric bill of households in AA is not more than Birr
100. A random sample of 400 households from the city produces a mean bill of Birr 105
with a standard deviation of Birr 40. Test the claim of the authority at the 5% level of
significance.

Solution:
1. State the null and alternative hypotheses:
$H_0: \mu \le 100$ (claim) vs $H_1: \mu > 100$
2. Select the level of significance: $\alpha = 0.05$ (given)
3. Select an appropriate test statistic:
The Z-statistic is appropriate because the sample size is large, so the population
need not be normal.
4. Critical region:
$Z_{cal} > Z_{\alpha} = Z_{0.05} = 1.645$, so $(-\infty, 1.645)$ is the acceptance region for the null hypothesis.
5. Compute the test value:

$$Z_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{105 - 100}{40/\sqrt{400}} = 2.5$$
6. Decision:
Reject $H_0$, since $Z_{cal}$ is not in the acceptance region.
7. Conclusion: At the 5% level of significance, the claim of the authority is not supported.
2.2.2 Tests about a population proportion: P
The procedure to make tests of hypothesis about the population proportion P for large
samples is similar in many respects to that for the population mean. The procedure includes the
same seven steps. Similarly, the test can be two-tailed or one-tailed. When the sample
size is large, the sample proportion $\hat{P}$ is approximately normally distributed with its
mean equal to $P$ and standard deviation equal to $\sqrt{P(1-P)/n}$. Hence, we use the normal
distribution to perform a test of hypothesis about the population proportion P for a large
sample. The sample size is considered to be large when $n\hat{P}$ and $n(1-\hat{P})$ are both greater
than 5.
Suppose the assumed or hypothesized value of P (parameter of the binomial
distribution) is denoted by $P_0$; then one can formulate two-sided (1) and one-sided (2 and
3) hypotheses as follows:
1. $H_0: P = P_0$ vs $H_1: P \neq P_0$

2. $H_0: P = P_0$ vs $H_1: P > P_0$

3. $H_0: P = P_0$ vs $H_1: P < P_0$

The choice of $H_1$ depends on the prior information we have on the value of $P$.

Decision Rule:

Null hypothesis      Alternative hypothesis      Decision rule: reject H0 if
$P = P_0$            $P \neq P_0$                $|Z_{cal}| > Z_{\alpha/2}$
$P = P_0$            $P > P_0$                   $Z_{cal} > Z_{\alpha}$
$P = P_0$            $P < P_0$                   $Z_{cal} < -Z_{\alpha}$

$$Z_{cal} = \frac{\hat{P} - P_0}{\sqrt{\frac{P_0(1-P_0)}{n}}} \sim N(0,1)$$
Example 2.9: A manufacturing company has submitted a claim that 90% of items
produced by a certain process are non-defective. An improvement in the process is being
considered that they feel will lower the proportion of defectives below the current 10%. In
an experiment 100 items are produced with the new process and 5 are defective. Is this
evidence sufficient to conclude that the method has been improved? Use a 0.05 level of
significance.

Solution: As usual, we follow the steps, with P the proportion of non-defective items:

1. $H_0: P = 0.9$ (actually $P \le 0.9$) vs $H_1: P > 0.9$
2. $\alpha = 0.05$
3. Critical region: $Z_{cal} > Z_{0.05} = 1.645$
4. Computation:
$$\hat{P} = \frac{X}{n} = \frac{95}{100} = 0.95$$

$$Z_{cal} = \frac{\hat{P} - P_0}{\sqrt{\frac{P_0(1-P_0)}{n}}} = \frac{0.95 - 0.90}{\sqrt{\frac{0.9 \cdot 0.1}{100}}} = 1.67$$
5. Decision: Reject $H_0$, since $Z_{cal} = 1.67 > 1.645$.
6. Conclusion: At the 0.05 level of significance we have evidence to say that the
improvement has reduced the proportion of defectives.

Example 2.10: The unemployment rate in a given country at a given period is believed to
be 10%. The government embarked on a series of projects to reduce unemployment. It
was of interest to determine whether unemployment decreased as a result of the projects.
A random sample of 500 people was chosen, and 48 of them were found to be
unemployed. Test at the 1% level of significance whether the government projects reduced
the unemployment rate.

Solution: As usual, we follow the steps:

1. $H_0: P = 0.1$ vs $H_1: P < 0.1$
2. $\alpha = 0.01$
3. Critical region: $Z_{cal} < -Z_{\alpha} = -Z_{0.01} = -2.33$
4. Computation:
$$\hat{P} = \frac{X}{n} = \frac{48}{500} = 0.096$$

$$Z_{cal} = \frac{\hat{P} - P_0}{\sqrt{\frac{P_0(1-P_0)}{n}}} = \frac{0.096 - 0.1}{\sqrt{\frac{0.1 \cdot 0.9}{500}}} = -0.30$$
$Z_{tab} = -Z_{\alpha} = -Z_{0.01} = -2.33$
5. Decision: Do not reject $H_0$, since $Z_{cal} > Z_{tab}$.
6. Conclusion: At the 1% level of significance, there is not sufficient evidence that the
government projects reduced unemployment.

Example 2.11: A large sample of 200 students from the students of a certain high school
is interviewed and 85 of them are found to use city bus. Can you conclude that at least
40% of the students use city bus? Use a 0.05 level of significance (Exercise)

2.3 Test of Association

In the previous section we saw how to test hypotheses for numeric data given
in the form of a mean or proportion. It is also possible to apply hypothesis testing to
categorical data.
- Suppose we have a population consisting of observations having two attributes or
qualitative characteristics, say A and B.
- If the attributes are independent, then the probability of possessing both A and B is
$P_A \times P_B$,
where $P_A$ is the probability that a member has attribute A and
$P_B$ is the probability that a member has attribute B.
- Suppose A has r mutually exclusive and exhaustive classes and
B has c mutually exclusive and exhaustive classes.
- The entire set of data can be represented using an $r \times c$ contingency table.

A \ B    B1     B2     ...    Bj     ...    Bc     Total
A1       O11    O12    ...    O1j    ...    O1c    R1
A2       O21    O22    ...    O2j    ...    O2c    R2
...
Ai       Oi1    Oi2    ...    Oij    ...    Oic    Ri
...
Ar       Or1    Or2    ...    Orj    ...    Orc    Rr
Total    C1     C2     ...    Cj     ...    Cc     n

- The chi-square test procedure is used to test the hypothesis of independence of two
attributes.

- The statistic is given by:

$$\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2 \text{ with } (r-1)(c-1) \text{ degrees of freedom}$$

where $O_{ij}$ = the number of units that belong to category i of A and j of B, and
$e_{ij}$ = the expected frequency of units that belong to category i of A and j of B, given by

$$e_{ij} = \frac{R_i \times C_j}{n}$$
where $R_i$ = the i-th row total,
$C_j$ = the j-th column total,
n = total number of observations.
Remarks:
$$\sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{c} e_{ij}$$

- The null and alternative hypotheses may be stated as:

H0: There is no association between A and B.
H1: not H0 (There is association between A and B).
Decision Rule:
- Reject $H_0$ (independence) at the $\alpha$ level of significance if the calculated value of $\chi^2$
exceeds the tabulated value with degrees of freedom equal to $(r-1)(c-1)$.

Example 2.12: A researcher is interested in assessing the effect of literacy on family planning
use. Accordingly, he collected data and tabulated the findings in the following manner.
Can we say there is association between educational status and family planning use?

FP use     Educational status            Total
           Illiterate     Literate
Yes        a = 63         b = 49         112
No         c = 15         d = 33         48
Total      78             82             160
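The chi-square procedure can be sketched on the family planning table above. This is an illustration only: the function name `chi_square_independence` is made up, the example does not state a significance level so $\alpha = 0.05$ is assumed here, and the critical value 3.841 is the standard chi-square table entry for 1 degree of freedom at that level.

```python
def chi_square_independence(observed):
    """Return (chi2, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency e_ij = R_i * C_j / n under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Observed counts from the table above (rows: Yes, No; columns: illiterate, literate).
observed = [[63, 49], [15, 33]]
chi2, df, expected = chi_square_independence(observed)
CRITICAL_5_PERCENT_DF1 = 3.841  # table value; alpha = 0.05 is an assumption
print(round(chi2, 3), df, chi2 > CRITICAL_5_PERCENT_DF1)
```

If the calculated value exceeds the tabulated one, the decision rule rejects the hypothesis of no association between educational status and family planning use at the assumed level.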

Example 2.13: A geneticist took a random sample of 300 men to study whether there is
association between father and son regarding boldness. He obtained the following
results.

                 Son
Father     Bold     Not
Bold       85       59
Not        65       91

Using α = 5%, test whether there is association between father and son regarding boldness.
Example 2.14: A random sample of 200 men, all retired, was classified according to
education and number of children, as shown below.

                         Number of children
Education level          0-1     2-3     Over 3
Elementary               14      37      32
Secondary and above      31      59      27

1. Define null and alternative hypotheses, and give an example of each.


2. What is meant by a type I error? A type II error? How are they related?
3. What is meant by a statistical test?
4. Explain the difference between a one-tailed and a two-tailed test.
5. What is meant by the critical region? The non critical region?
