
Sample Size Calculation

This document discusses practical issues in calculating sample sizes for studies estimating population prevalence. It addresses how to determine appropriate values for the parameters used in the standard sample size formula. The key parameters discussed are precision (d) and expected proportion or prevalence (P). The document recommends setting d to half of P if P is below 10%, or to 0.5(1-P) if P is above 90%, and, if P is reported as a range, using the value in that range that gives the largest sample size. It emphasizes choosing parameter values that allow a reasonably precise prevalence estimate given the limitations of the study.


Archives of Orofacial Sciences 2006; 1: 9-14

MEDICAL STATISTICS

Practical Issues in Calculating the Sample Size for Prevalence Studies


L. Naing (1,2)*, T. Winn (2), B.N. Rusli (1,2)
(1) Department of Community Dentistry, School of Dental Sciences, (2) Department of Community Medicine, School of Medical Sciences, Universiti Sains Malaysia, Health Campus, 16150 Kubang Kerian, Kelantan, Malaysia
*Corresponding author: [email protected]

ABSTRACT
Calculating the sample size for a prevalence study needs only a simple formula. However, there are a number of practical
issues in selecting values for the parameters required in the formula. This paper addresses several of these issues and
gives appropriate recommendations. It also suggests the use of a software calculator that checks
the normal approximation assumption and incorporates finite population correction in the sample size calculation.

Keywords: sample size calculator, prevalence study


______________________________________________________________________________________

INTRODUCTION

“How big a sample do I require?” is one of the questions most frequently asked by investigators. Sample size calculation for a study estimating a population prevalence is described in many books (Daniel, 1999; Lwanga and Lemeshow, 1991). The aim of the calculation is to determine an adequate sample size to estimate the population prevalence with good precision. The calculation itself uses a simple formula and needs only a few steps. However, selecting appropriate values for the parameters required in the formula is not simple in some situations. In this paper, we highlight the problems commonly encountered and give recommendations for handling them.

HOW TO CALCULATE THE SAMPLE SIZE

The following simple formula (Daniel, 1999) can be used:

$$ n = \frac{Z^2 P(1 - P)}{d^2} $$

where n = sample size,
Z = Z statistic for a level of confidence,
P = expected prevalence or proportion (in proportion of one; if 20%, P = 0.2), and
d = precision (in proportion of one; if 5%, d = 0.05).

Z statistic (Z): For the conventional level of confidence of 95%, the Z value is 1.96. In these studies, investigators present their results with 95% confidence intervals (CI). For investigators who want to be more confident (say 99%) about their estimates, the value of Z is set at 2.58.

Expected proportion (P): This is the proportion (prevalence) that investigators are going to estimate in the study. Sometimes investigators feel a bit puzzled, and a common response is ‘We don’t know this P. That is why we are going to conduct this study’. We need to understand that P ranges from zero to one and that the sample size varies depending on the value of P (Figure 1). Therefore, we have to obtain an estimate of the prevalence (P) in order to calculate the sample size. In many cases, we can get this estimate from previous studies. In this paper, P is expressed as a proportion of one, not as a percentage, in all formulae. For example, if the prevalence is 20%, then P is equal to 0.2.

Precision (d): It is very important for investigators to understand this value well. From the formula, it can be seen that the sample size varies inversely with the square of the precision (d^2).

At the end of a study, we need to present the prevalence with its 95% confidence interval. For instance, suppose the prevalence in a sample is 40% and the 95% CI is 30% to 50%. This means that the study has estimated the population prevalence as between 30% and 50%. Notice that the precision (d) for this estimate is 10% (i.e. 40% ± 10% = 30% to 50%). The width of the CI is therefore twice the precision (width of CI = 2d).

If the CI is wide, as in this example (30% to 50%, a width of 20%), the estimate may be considered poor. Most investigators want a narrower CI. To obtain a narrower CI, we need to design a study with a smaller d (good precision or smaller error of


estimate). For instance, if investigators want the width of the CI to be 10% (0.1), d should be set at 0.05. Again, d in the formula should be a proportion of one rather than a percentage.

PRACTICAL ISSUES IN DETERMINING SAMPLE SIZE PARAMETERS

Determining Precision (d)
What is the appropriate precision for prevalence studies? Most books and guides show the steps to calculate the sample size, but give no definite recommendation for an appropriate d. Investigators generally end up with ball-park figures for the study size, usually based on limitations such as financial resources, time or availability of subjects. However, we should first calculate the sample size with a reasonable or acceptable precision and then allow for other limitations.

In our experience, it is appropriate to have a precision of 5% if the prevalence of the disease is expected to be between 10% and 90%. This precision gives a 95% CI width of 10% (e.g. 30% to 40%, or 60% to 70%). However, when the prevalence is expected to be below 10% or above 90%, a precision of 5% seems inappropriate. For example, if the prevalence is 1% (a rare disease), a precision of 5% is obviously crude and may cause problems. The obvious problem is that the 95% CIs of the estimated prevalence will end up with meaningless negative lower bounds or upper bounds larger than 1, as seen in Table 1.

Therefore, we recommend setting d to half of P if P is below 0.1 (10%), and to 0.5(1-P) if P is above 0.9 (90%). For example, if P is 0.04, investigators may use d=0.02, and if P is 0.98, we recommend d=0.01. Figure 1 is plotted with this recommendation. Investigators may also select a smaller d than we suggest if they wish.

However, if resources are limited, investigators may use a larger d. In a preliminary study, investigators may also use a larger d (e.g. >10%). The justification for the selected d (e.g. limitation of resources) should be stated clearly in the research proposal so that reviewers are well informed. In addition, the larger d should still meet the assumption of normal approximation, which we discuss later.

Estimating P
Speculating P may be intriguing in practice. The investigator may find several values of P in the literature. Preferably, P should come from the most recent studies with a similar study design and study population.

If we have a range of P, for instance 20% to 30%, we should use 30% as it will give a larger sample size (Figure 1). If the range is 60% to 80%, we should use 60% as it will give a larger sample size. If the range is 40% to 60%, 50% will give a larger sample size. Macfarlane (1997) also suggested that if there is doubt about the value of P, it is best to err towards 50% as this leads to a larger sample size.
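The formula and the recommendations above for d and for a range of P can be sketched in Python. This is only an illustrative sketch, not the authors' calculator: the function names are hypothetical, the result is rounded up to a whole subject (an assumption that matches the figure of 385 quoted for P=0.5 and d=0.05), and the range rule assumes d is held fixed.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence: float) -> float:
    """Two-sided Z statistic for a confidence level (0.95 gives about 1.96)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size(P: float, d: float, Z: float = 1.96) -> int:
    """n = Z^2 * P * (1 - P) / d^2, rounded up (rounding up is an assumption)."""
    return math.ceil(Z ** 2 * P * (1 - P) / d ** 2)


def recommended_precision(P: float) -> float:
    """Rule of thumb from the text: P/2 below 0.1, (1 - P)/2 above 0.9, else 0.05."""
    if P < 0.1:
        return P / 2
    if P > 0.9:
        return (1 - P) / 2
    return 0.05


def conservative_p(low: float, high: float) -> float:
    """Value in [low, high] closest to 0.5; it maximises n when d is held fixed."""
    return min(max(0.5, low), high)


if __name__ == "__main__":
    print(round(z_for_confidence(0.95), 2), round(z_for_confidence(0.99), 2))
    for P in (0.04, 0.3, 0.5, 0.98):
        d = recommended_precision(P)
        print(P, d, sample_size(P, d))
    print(conservative_p(0.2, 0.3), conservative_p(0.6, 0.8), conservative_p(0.4, 0.6))
```

Keeping the precision rule separate from the formula lets an investigator override d (for example, a larger d under resource limits) without touching the calculation itself.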

[Figure 1: plot of sample size (n), vertical axis from 0 to 800, against expected proportion (P), horizontal axis from 0 to 1]

Figure 1 Relationship between Sample Size and Expected Proportion (Prevalence) (Plotted using Microsoft Excel)


Table 1 95% CI of rare diseases (P ≤ 0.05) and common diseases (P ≥ 0.95) with a precision (d) set at 0.05

P       n     95% CI lower   95% CI upper
0.01    16    -0.04          0.06
0.02    31    -0.03          0.07
0.03    45    -0.02          0.08
0.04    60    -0.01          0.09
0.05    73     0.00          0.10
0.95    73     0.90          1.00
0.96    60     0.91          1.01
0.97    45     0.92          1.02
0.98    31     0.93          1.03
0.99    16     0.94          1.04
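The rows of Table 1 can be reproduced with a short Python sketch. It assumes, as the table appears to, that the interval is simply P ± d and that n is rounded up; the function name `naive_interval` is hypothetical, and the rounding to two decimals assumes P and d are given to two decimals.

```python
import math


def naive_interval(P: float, d: float = 0.05, Z: float = 1.96):
    """Return (n, lower, upper, plausible) for the interval P - d to P + d."""
    n = math.ceil(Z ** 2 * P * (1 - P) / d ** 2)
    lower = round(P - d, 2)
    upper = round(P + d, 2)
    # A proportion cannot be below 0 or above 1, so such bounds are implausible.
    plausible = 0.0 <= lower and upper <= 1.0
    return n, lower, upper, plausible


if __name__ == "__main__":
    for P in (0.01, 0.03, 0.05, 0.95, 0.97, 0.99):
        print(P, naive_interval(P))
```

Rows flagged as not plausible are the ones where a fixed precision of 0.05 is too crude for the expected prevalence.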

Setting P=0.5 does not necessarily provide the biggest sample size
Some books or guides suggest that if it is impossible to come up with a good estimate for P, one may set P equal to 0.5 to yield the maximum sample size (Daniel, 1999; Lwanga and Lemeshow, 1991). In our opinion, this suggestion should be taken with caution. If P is between 10% and 90%, it is a good guide to take P as 0.5 (if it is impossible to make a better estimate) as it will give the biggest sample size. However, if P is quite small (<10%) or very large (>90%), we may need a larger sample size than the one calculated using P=0.5 (Figure 1). Our arguments are as follows. First, suppose the calculation was done using P=0.5 and d=0.05 because an investigator could not estimate P, giving a sample size of 385. If the real P is unfortunately 1%, we would get, on average, 3 or 4 cases (diseases) from 385 subjects, or perhaps no case at all. Second, with this small number of cases, the assumption of normal approximation used in this sample size calculation is not met. Similarly, if P is too large (e.g. 99%), a sample of 385 may contain only a few non-cases (non-diseases) or perhaps none, and again the normal approximation assumption, which is discussed later, may not be met.

In practice, investigators should be a little cautious before applying this ‘P=0.5’ suggestion. It is not very difficult for an investigator to judge from experience whether P is below 10%, between 10% and 90%, or above 90%. Otherwise, a very crude pilot study (e.g. with a sample size of 20 to 30) can easily determine this. If P is within 10% to 90%, it is safe to apply the ‘P=0.5’ suggestion.

Assumption of Normal Approximation
The above sample size formula is based on the assumption of normal approximation. It requires that both nP and n(1-P) be greater than 5 (Daniel, 1999). In other words, the expected numbers of both cases and non-cases in the selected sample must be greater than 5. Small sample sizes might not fulfil this assumption, so we should check it after calculating the sample size. Our recommendation to set the precision (d) to half of P, or to 0.5(1-P), also ensures that this assumption is met (Table 2).

For those who wish to know more about the normal approximation assumption, the following is a worked example:

Suppose we wish to estimate the proportion of a population (P) who are regular cigarette smokers in a village. We would like our sample estimate (p) to have a high probability of falling between P-d and P+d. (Notice that P is a population proportion and p is a sample estimate.) Before calculating the sample size, we must take what seems to be an ironic step: we must speculate what this proportion (P) is. The basic reason for this step is that the sample size depends on the standard error (SE) of the distribution of the prevalence of smokers (p). As a matter of fact, the sample size formula shown earlier has been derived from the following SE equations.


Table 2 Checking assumption, nP and n(1-P) for calculated sample sizes

P d n nP n(1-P)
0.01 0.005 1521 15.2 1506.1
0.02 0.010 753 15.0 737.9
0.03 0.015 497 14.9 481.9
0.04 0.020 369 14.8 354.0
0.05 0.025 292 14.6 277.4
0.06 0.030 241 14.4 226.3
0.07 0.035 204 14.3 189.9
0.08 0.040 177 14.1 162.6
0.09 0.045 155 14.0 141.4
0.10 0.050 138 13.8 124.5
0.20 0.050 246 49.2 196.7
0.30 0.050 323 96.8 225.9
0.40 0.050 369 147.5 221.3
0.50 0.050 384 192.1 192.1
0.60 0.050 369 221.3 147.5
0.70 0.050 323 225.9 96.8
0.80 0.050 246 196.7 49.2
0.90 0.050 138 124.5 13.8
0.91 0.045 155 141.4 14.0
0.92 0.040 177 162.6 14.1
0.93 0.035 204 189.9 14.3
0.94 0.030 241 226.3 14.4
0.95 0.025 292 277.4 14.6
0.96 0.020 369 354.0 14.8
0.97 0.015 497 482.0 14.9
0.98 0.010 753 737.9 15.1
0.99 0.005 1521 1506.1 15.2
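The check behind Table 2 can be sketched in a few lines of Python. The function names are hypothetical, and the threshold of 5 is the rule quoted in the text from Daniel (1999).

```python
def expected_counts(n: int, P: float):
    """Expected numbers of cases (nP) and non-cases (n(1 - P)) in a sample of n."""
    return n * P, n * (1 - P)


def normal_approximation_ok(n: int, P: float, minimum: float = 5.0) -> bool:
    """True when both nP and n(1 - P) exceed the minimum expected count."""
    cases, non_cases = expected_counts(n, P)
    return cases > minimum and non_cases > minimum


if __name__ == "__main__":
    # n = 385 comes from P = 0.5, d = 0.05; it is too small if the true P is 1%.
    print(expected_counts(385, 0.01), normal_approximation_ok(385, 0.01))
    # n = 1521 is the Table 2 row for P = 0.01, d = 0.005.
    print(expected_counts(1521, 0.01), normal_approximation_ok(1521, 0.01))
```

Running the check after every sample size calculation makes the failure at very low or very high prevalence visible immediately.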

$$ d = Z \times SE(p) \quad \text{and} \quad SE(p) = \sqrt{\frac{P(1 - P)}{n}}, $$

and therefore

$$ d = Z \sqrt{\frac{P(1 - P)}{n}}. $$

If we judge the proportion of smokers in the village (P) to be 0.3, d as 0.05, and Z as 1.96, then

$$ 0.05 = 1.96 \sqrt{\frac{0.3 \times 0.7}{n}} $$

The final step in our derivation expresses the prior opinion that we would like 1.96 standard errors of our estimate to equal 0.05. If 1.96 × SE(p) is equal to 0.05, then our sample estimate has a 95% chance of being within five percentage points of the true P. This assumes that n is large enough to make the distribution of all possible p approximately normal. Solving the equation for n gives n = 1.96^2 × 0.3 × 0.7 / 0.05^2 ≈ 322.7, which agrees with the row in Table 2 at P=0.3 (30%) and d=0.05, where the sample size n is 323. This means that if we take a random sample of 323 villagers and measure smoking prevalence using a survey questionnaire, we would get a smoking prevalence (p) as low as 25% or as high as 35% (30% ± 5%). The idea behind this example is that if we repeat this survey 100 times, a hundred different values of smoking prevalence (p) will be obtained. Out of these 100 values, about 95 would fall between 25% and 35%. All these values (p) are approximately normally distributed (Figure 2). This example also illustrates why we need to have the values of the parameters (Z, P and d) before we start calculating the sample size.

Figure 2 Normally distributed sample estimates (p)
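The claim in the worked example, that roughly 95 of 100 repeated surveys of 323 villagers would give an estimate between 25% and 35%, can be checked without simulation by summing exact binomial probabilities. This is a sketch under the assumption of simple random sampling with a true P of 0.3; the function name `coverage_probability` is hypothetical, and the loop is only suited to moderate n.

```python
from math import comb


def coverage_probability(n: int, P: float, d: float) -> float:
    """Exact binomial probability that the sample proportion k/n lies within P ± d."""
    total = 0.0
    for k in range(n + 1):
        # Small tolerance so that values exactly on the boundary are included.
        if abs(k / n - P) <= d + 1e-12:
            total += comb(n, k) * P ** k * (1 - P) ** (n - k)
    return total


if __name__ == "__main__":
    print(coverage_probability(323, 0.3, 0.05))
```

The result is close to, but not exactly, 0.95, because the sample proportion can only take the discrete values k/n.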


Finite Population Correction
The above sample size formula is valid if the calculated sample size is smaller than or equal to 5% of the population size (n/N ≤ 0.05) (Daniel, 1999). If this proportion is larger than 5% (n/N > 0.05), we need to use the formula with finite population correction (Daniel, 1999) as follows.

$$ n' = \frac{N Z^2 P(1 - P)}{d^2 (N - 1) + Z^2 P(1 - P)} $$

where
n' = sample size with finite population correction,
N = population size,
Z = Z statistic for a level of confidence,
P = expected proportion (in proportion of one), and
d = precision (in proportion of one).

Cluster or Multistage Sampling
The above sample size formulae are valid only if we apply simple random or systematic random sampling. Cluster or multistage sampling requires a larger sample size to achieve the same precision. Therefore, the sample size calculated using the above formulae needs to be multiplied by the design effect (deff) (Cochran, 1977). For example, in immunization coverage cluster surveys, the design effect has been found to be approximately two (Macfarlane, 1997). This means that such cluster sampling requires double the sample size given by the above calculation.

However, in practice, investigators rarely report their design effects in the literature. One option is to contact the authors of the published articles and request their design effect. We strongly recommend reporting the design effect if investigators apply cluster or multistage sampling in their studies. If the design effect is still not available, a pilot study can be done to estimate it. Normally, cluster or multistage sampling is applied in large-scale surveys, where a pilot study at the first stage is worthwhile for several reasons. Investigators should consult a statistician before conducting such a study.

Table 3 Gain in precision (error reduction) by increasing the sample size while Z (1.96) and P (0.5) remain constant

n      d       % Gain in the precision
97     0.100   -
194    0.070   30.0
291    0.057   43.0
388    0.050   50.0

"The larger the sample size the better the study" is not always true
The aim of applying an appropriate sample size formula is not to obtain the biggest possible sample size. The aim is to get an optimum or adequate sample size. An unnecessarily large sample is not cost-effective, and in some circumstances it is unethical. In drug trials, for instance, a very large sample could lead to the conclusion that a new drug is significantly better than an old drug in the statistical sense although the difference may be clinically insignificant. If one examines Table 3 carefully, a two-fold increase in sample size improves the precision by 30% (i.e. reduces the error of estimate by 30%). Only when the sample size is quadrupled is the error of estimate halved (i.e. reduced by 50%).

Other Objectives of the Study
This article discusses the sample size calculation for estimating a prevalence as one of the study objectives. Most studies, however, have more than one objective. It is therefore recommended to calculate the sample size required for each of the other study objectives as well, and to take the biggest sample size obtained from all calculations as the one that accommodates all study objectives.

Anticipating Non-Response or Missing Data
The calculated sample size gives the desired precision or CI width only if there is no problem with non-response or missing values. If non-response or missing data do occur, the investigators will not achieve the desired precision. Therefore, it is wise to oversample by 10% to 20% of the computed number, depending on how much non-response or missing data the investigators anticipate.
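The adjustments described above (finite population correction, design effect, and oversampling for non-response) can be sketched as separate Python helpers. The function names and the population size of 2000 are hypothetical, results are rounded up to whole subjects, and the text does not prescribe an order for combining the adjustments, so each helper is shown on its own.

```python
import math


def base_sample_size(P: float, d: float, Z: float = 1.96) -> float:
    """Unrounded n = Z^2 * P * (1 - P) / d^2."""
    return Z ** 2 * P * (1 - P) / d ** 2


def needs_fpc(n: float, N: int) -> bool:
    """Finite population correction is needed when n/N exceeds 0.05."""
    return n / N > 0.05


def fpc_sample_size(N: int, P: float, d: float, Z: float = 1.96) -> int:
    """n' = N Z^2 P(1-P) / (d^2 (N-1) + Z^2 P(1-P)), rounded up."""
    numerator = N * Z ** 2 * P * (1 - P)
    denominator = d ** 2 * (N - 1) + Z ** 2 * P * (1 - P)
    return math.ceil(numerator / denominator)


def apply_design_effect(n: int, deff: float) -> int:
    """Multiply by the design effect; round() guards against float noise before ceil."""
    return math.ceil(round(n * deff, 9))


def inflate_for_nonresponse(n: int, rate: float) -> int:
    """Add a fraction (e.g. 0.10 to 0.20) of the computed number."""
    return math.ceil(round(n * (1 + rate), 9))


if __name__ == "__main__":
    n = math.ceil(base_sample_size(0.5, 0.05))
    print(n, needs_fpc(n, 2000), fpc_sample_size(2000, 0.5, 0.05))
    print(apply_design_effect(n, 2.0), inflate_for_nonresponse(n, 0.10))
```

Keeping each adjustment in its own function makes it easy to state in a proposal which corrections were applied and with what values.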


Role of Statisticians and Reviewers
In this calculation, statisticians may assist investigators to ensure correct application of the formulae, including the normal approximation assumption and finite population correction. The appropriate precision must be well understood and decided by the investigators, using up-to-date knowledge of the research area that they are going to study. Similarly, specialists (in the study area) on research committees and reviewers of proposals must have a good understanding of precision and look into it critically, to ensure that the study is worth conducting with the precision proposed by the investigators.

A Helpful Calculator
In many situations, investigators need to calculate repeatedly, check the assumption, and check whether the finite population correction is needed. The authors have developed a calculator (Naing et al. 2006) using Microsoft Excel, which can be downloaded freely from http://www.kck.usm.my/ppsg/stats_resources.htm. The calculator is designed to give the sample size for various precisions (error of estimate) with or without finite population correction, and will also suggest when the finite population correction needs to be applied. It will also determine whether the normal approximation assumption is met or not.

CONCLUSION
The paper highlights a number of practical issues in deciding on the parameters applied in the sample size calculation formulae. Investigators may use a precision of 0.05 if P is between 0.1 and 0.9. However, a smaller d should be applied if P<0.1 or P>0.9. Investigators should obtain a range of P and apply the value that gives the biggest sample size. Investigators should be cautioned that setting P=0.5 doesn’t necessarily provide the biggest sample size. In addition, the calculation should consider the assumption of normal approximation, finite population correction and the sampling method. A calculator that takes these issues into account will be helpful in this sample size calculation.

REFERENCES
Cochran WG (1977). Sampling Techniques, 3rd edition. New York: John Wiley & Sons.
Daniel WW (1999). Biostatistics: A Foundation for Analysis in the Health Sciences. 7th edition. New York: John Wiley & Sons.
Lwanga SK and Lemeshow S (1991). Sample Size Determination in Health Studies: A Practical Manual. Geneva: World Health Organization.
Macfarlane SB (1997). Conducting a Descriptive Survey: 2. Choosing a Sampling Strategy. Trop Doct, 27(1): 14-21.
Naing L, Winn T and Rusli BN (2006). Sample Size Calculator for Prevalence Studies. Available at: http://www.kck.usm.my/ppsg/stats_resources.htm
