Module I - Sampling Methodology
Module I - Sampling Methodology
SAMPLING METHODOLOGY
Chapter 1. GENERAL INTRODUCTION
Decision makers make better decisions when they use all available information in an effective
and meaningful way. The primary role of statistics is to provide decision makers with
methods for obtaining and analyzing information to help make these decisions. Statistics is
used to answer long-range planning questions, such as when and where to locate facilities to
handle future sales.
Definition
Types of Statistics
1. Descriptive Statistics is concerned with summary calculations, graphs, charts and tables.
Statistical Population
Variable
A variable is an item of interest that can take on many different numerical values.
1. Qualitative Variables are nonnumeric variables and can't be measured. Examples include
gender, religious affiliation, state of birth.
2. Quantitative Variables are numerical variables and can be measured. Examples include
balance in your checking account, number of children in your family. Note that quantitative
variables are either discrete (which can assume only certain values, and there are usually
"gaps" between the values, such as the number of bedrooms in your house) or continuous
(which can assume any value within a specific range, such as the air pressure in a tire.)
1. Nominal Data: The weakest data measurement. Numbers are used to represent an item or
characteristic. Example: designating male=1 and female=2
2
2. Ordinal or Rank Data: Numbers are used to rank. Example is excellent, good, fair and
poor. The main difference between ordinal data and nominal data is that ordinal data contain
both an equality (=) and a greater-than (>) relationship, whereas the nominal data contain only
an equality (=) relationship.
3. Interval Data: If we have data with ordinal properties (> & =) and can also measure the
distance between two data items, we have an interval measurement.
Interval data are preferred over ordinal data because, with them, decision makers can
precisely determine the difference between two observations, i.e., distances between numbers
can be measured. For example, frozen-food packagers have daily contact with a common
interval measurement--temperature.
4. Ratio Data: Is the highest level of measurement and allows for all basic arithmetic
operations, including division and multiplication. Data measured on a ratio scale have a fixed
or no arbitrary zero point. Examples include business data, such as cost, revenue and profit.
Sources of Data:
1. Secondary Data: Data which are already available. An example: statistical abstract of USA.
Advantage: less expensive. Disadvantage: may not satisfy your needs.
There are many ways to collect a sample. The most commonly used methods are:
A. Statistical Sampling:
1. Simple Random Sampling: Is a method of selecting items from a population such that every
possible sample of specific size has an equal chance of being selected. In this case, sampling
may be with or without replacement.
2. Stratified Random Sampling: Is obtained by selecting simple random samples from strata
(or mutually exclusive sets). Some of the criteria for dividing a population into strata are: Sex
(male, female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional,
other).
3
A sample of clusters could then be randomly selected, and every household within these
clusters could be interviewed to find the average annual household income.
B. Nonstatistical Sampling:
1. Judgment Sampling: In this case, the person taking the sample has direct or indirect control
over which items are selected for the sample.
2. Convenience Sampling: In this method, the decision maker selects a sample from the
population in a manner that is relatively easy and convenient.
3. Quota Sampling: In this method, the decision maker requires the sample to contain a
certain number of items with a given characteristic. Many political polls are, in part, quota
sampling.
SAMPLING
Sampling is that part of statistical practice concerned with the selection of a subset of
individual observations within a population of individuals intended to yield some knowledge
about the population of concern, especially for the purposes of making predictions based on
statistical inference. Sampling is an important aspect of data collection.
Researchers rarely survey the entire population for two reasons: the cost is too high, and the
population is dynamic in that the individuals making up the population may change over time.
The three main advantages of sampling are that the cost is lower, data collection is faster, and
since the data set is smaller it is possible to ensure homogeneity and to improve the accuracy
and quality of the data.
Each observation measures one or more properties (such as weight, location, color) of
observable bodies distinguished as independent objects or individuals. In survey sampling,
survey weights can be applied to the data to adjust for the sample design. Results from
probability theory and statistical theory are employed to guide practice. In business and
medical research, sampling is widely used for gathering information about a population.
SAMPLING PROCESS
POPULATION DEFINITION
4
everything in a population, the goal becomes finding a representative sample (or subset) of
that population.
Sometimes that which defines a population is obvious. For example, a manufacturer needs to
decide whether a batch of material from production is of high enough quality to be released to
the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the
batch is the population.
Although the population of interest often consists of physical objects, sometimes we need to
sample over time, space, or some combination of these dimensions. For instance, an
investigation of supermarket staffing could examine checkout line length at various times, or
a study on endangered penguins might aim to understand their usage of various hunting
grounds over time. For the time dimension, the focus may be on periods or discrete occasions.
SAMPLING FRAME
In the most straightforward case, such as the sentencing of a batch of material from
production (acceptance sampling by lots), it is possible to identify and measure every single
item in the population and to include any one of them in our sample. However, in the more
general case this is not possible. There is no way to identify all rats in the set of all rats.
Where voting is not compulsory, there is no way to identify which people will actually vote at
a forthcoming election (in advance of the election). These imprecise populations are not
amenable to sampling in any of the ways below and to which we could apply statistical
theory.
As a remedy, we seek a sampling frame which has the property that we can identify every
single element and include any in our sample. The most straightforward type of frame is a list
of elements of the population (preferably the entire population) with appropriate contact
information. For example, in an opinion poll, possible sampling frames include:
• Electoral register
• Telephone directory
Not all frames explicitly list population elements. For example, a street map can be used as a
frame for a door-to-door survey; although it doesn't show individual houses, we can select
streets from the map and then visit all houses on those streets. (One advantage of such a frame
is that it would include people who have recently moved and are not yet on the list frames
discussed above.)
The sampling frame must be representative of the population and this is a question outside the
scope of statistical theory demanding the judgment of experts in the particular subject matter
being studied. All the above frames omit some people who will vote at the next election and
contain some people who will not; some frames will contain multiple records for the same
person. People not in the frame have no prospect of being sampled. Statistical theory tells us
about the uncertainties in extrapolating from a sample to the frame. In extrapolating from
frame to population, its role is motivational and suggestive.
To the scientist, however, representative sampling is the only justified procedure for choosing
individual objects for use as the basis of generalization, and is therefore usually the only
acceptable basis for ascertaining truth.
5
BASIC PROBLEM OF SAMPLING FRAMES:
Missing elements: Some members of the population are not included in the frame.
Foreign elements: The non-members of the population are included in the frame.
Duplicate entries: A member of the population is surveyed more than once.
Groups or clusters: The frame lists clusters instead of individuals.
A frame may also provide additional 'auxiliary information' about its elements; when this
information is related to variables or groups of interest, it may be used to improve survey
design. For instance, an electoral register might include name and sex; this information can be
used to ensure that a sample taken from that frame covers all demographic categories of
interest. (Sometimes the auxiliary information is less explicit; for instance, a telephone
number may provide some information about location.)
Having established the frame, there are a number of ways for organizing it to improve
efficiency and effectiveness.
It's at this stage that the researcher should decide whether the sample is in fact to be the whole
population and would therefore be a census.
A probability sampling scheme is one in which every unit in the population has a chance
(greater than zero) of being selected in the sample, and this probability can be accurately
determined. The combination of these traits makes it possible to produce unbiased estimates
of population totals, by weighting sampled units according to their probability of selection.
Example: We want to estimate the total income of adults living in a given street. We visit
each household in that street, identify all adults living there, and randomly select one adult
from each household. (For example, we can allocate each person a random number, generated
from a uniform distribution between 0 and 1, and select the person with the highest number in
each household). We then interview the selected person and find their income. People living
on their own are certain to be selected, so we simply add their income to our estimate of the
total. But a person living in a household of two adults has only a one-in-two chance of
selection. To reflect this, when we come to such a household, we would count the selected
person's income twice towards the total. (In effect, the person who is selected from that
household is taken as representing the person who isn't selected.)
In the above example, not everybody has the same probability of selection; what makes it a
probability sample is the fact that each person's probability is known. When every element in
the population does have the same probability of selection, this is known as an 'equal
probability of selection' (EPS) design. Such designs are also referred to as 'self-weighting'
because all sampled units are given the same weight.
PROBABILITY SAMPLING
6
NON PROBABILITY SAMPLING
Nonprobability sampling is any sampling method where some elements of the population
have no chance of selection (these are sometimes referred to as 'out of
coverage'/'undercovered'), or where the probability of selection can't be accurately
determined. It involves the selection of elements based on assumptions regarding the
population of interest, which forms the criteria for selection. Hence, because the selection of
elements is nonrandom, nonprobability sampling does not allow the estimation of sampling
errors.
These conditions give rise to exclusion bias, placing limits on how much information a
sample can provide about the population. Information about the relationship between sample
and population is limited, making it difficult to extrapolate from the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer
the door. In any household with more than one occupant, this is a nonprobability sample,
because some people are more likely to answer the door (e.g. an unemployed person who
spends most of their time at home is more likely to answer than an employed housemate who
might be at work when the interviewer calls) and it's not practical to calculate these
probabilities.
SAMPLING METHODS
Within any of the types of frame identified above, a variety of sampling methods can be
employed, individually or in combination. Factors commonly influencing the choice between
these designs include:
• Nature and quality of the frame
• Availability of auxiliary information about units on the frame
• Accuracy requirements, and the need to measure accuracy
• Whether detailed analysis of the sample is expected
• Cost/operational concerns
7
Chapter 2. SIMPLE RANDOM SAMPLING
In a simple random sample ('SRS') of a given size, all such subsets of the frame are given an
equal probability. Each element of the frame thus has an equal probability of selection: the
frame is not subdivided or partitioned. Furthermore, any given pair of elements has the same
chance of selection as any other such pair (and similarly for triples, and so on). This
minimizes bias and simplifies analysis of results. In particular, the variance between
individual results within the sample is a good indicator of variance in the overall population,
which makes it relatively easy to estimate the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness of the selection
may result in a sample that doesn't reflect the makeup of the population. For instance, a
simple random sample of ten people from a given country will on average produce five men
and five women, but any given trial is likely to over represent one sex and under represent the
other. Systematic and stratified techniques, discussed below, attempt to overcome this
problem by using information about the population to choose a more representative sample.
SRS may also be cumbersome and tedious when sampling from an unusually large target
population. In some cases, investigators are interested in research questions specific to
subgroups of the population. For example, researchers might be interested in examining
whether cognitive ability as a predictor of job performance is equally applicable across racial
groups. SRS cannot accommodate the needs of researchers in this situation because it does not
provide subsamples of the population. Stratified sampling, which is discussed below,
addresses this weakness of SRS.
• Expected Value
The expected value is the average value of a feature from all possible samples of the same
size. Mathematically, we define the expected value or average of a random variable Y as
follows:
E ( y ) = ∑ y. p( y )
y
where ∑ p( y ) = 1
y
The expected value is the sum of products of all possible values of variable Y and their
associated probabilities p(y).
For example, consider the following table data from a census where the variable Y is the size
of a randomly selected household:
8
5 persons 5 996 0,076
6 persons 2 499 0,032
7 persons or more 1 748 0,022
Total 79 108 1,00
This means that the average of these different sizes of household is 2.75. Note that the size of
household "7 or more" is the aggregation of data for households where Y = 7, 8, 9 ... We
determined the average size of all households with 7 or more persons, or 7.7.
• Unbiased estimator
An estimator is unbiased when its expected value is equal to the parameter that it is
advantageous to estimate. Thus, the bias is the difference between the expected value of an
estimate and the true population-value (or parameter) is zero.
• Consisting Estimator
An estimator is consistent if its values tend to concentrate around the true value over time as
the sample size increases.
Let us return to the sampling distribution. This is the probability distribution of all possible
values that can make an estimate in a sampling as possible.
Let us use the following table that links each population area to the unit size of household.
We can make the sampling distribution of the mean by creating a list of all possible random
samples of size n = 2 that can be drawn from a population of size N = 5 units.
9
If one draws a sample of size 2 in this population, there are C25 = 10 possibilities as
follows:
From the table of samples of size n = 2, one can calculate the expected value of
−
10
_
1 70 −
E ( y) = ∑ 10
10 i=1
y = = 7 = Y
Thus, in simple random sampling, the sample average is an unbiased estimate of the mean-
population.
10
It should be noted that we would get the same results for samples of any size.
• Standard deviation
The Variability in the population is measured by the standard deviation (σ) whose square is
the variance (σ2). The population variance is defined as the average of squared deviations of
all individual observations from their average value, or the following formula:
Population variance with bias:
− − −
2
(Y1 − Y ) + (Y2 − Y ) + ......+ (YN − Y )2 1 N
2 2 _
σ =
2
= ∑(Yi − Y )
N N i=1
Population variance unbiased (corrected):
2
1 N _
S =
2
∑ i )
N − 1 i=1
(Y − Y
Variance in sample:
2
1 n _
s =
2
∑ ( yi − y)
n − 1 i=1
Note that s2 is an unbiased estimator of S2 (σ2 no unbiased estimator).
N −1 2 N −1 2
Note also that σ2 = S and S2 = σ
N N
• Standard error of the sample average
The variance of the sample average is the average of the average of the square average over
the true value for all possible samples of size n. The true variance is denoted S2 ( ) and is
calculated as follows:
S2 ⎛ N − n ⎞
−
S ( y) =
2
⎜ ⎟
n ⎝ N ⎠
The square root of the variance of is the standard error of averages from samples of size n.
Its formula is:
−
S ⎛ N −n⎞
S ( y) = ⎜ ⎟
n ⎝ N ⎠
It is important to note that the error varies with the size of the sample, the error decreases by
increasing the sample size.
11
⎛ N −n⎞
In the formula for the variance of , the factor ⎜ ⎟ is known as factor of finite
⎝ N ⎠
population correction
⎛ N −n⎞
If n ≤ 0.05, ⎜ ⎟ can be ignored because its value is almost equal to 1.
⎝ N ⎠
Illustration
We know that the probability for an estimate to be equal to the true value (parameter) is zero
for continuous variables. So it is more useful to state how much it is likely that an interval
based on our estimate may contain the parameter value that it is advantageous to estimate.
An interval estimator is a formula that uses the observations of a sample to calculate two
numbers that define an interval containing the parameter with some probability. The interval
obtained is called confidence interval and the probability that it contains the parameter value
is called confidence coefficient. When a confidence interval has a confidence coefficient of
0.95, it is said to set a confidence line of 95%.
In general, the confidence interval for a parameter is given by [ ± tS ( )]. The symbol t is
the value of the variable that corresponds to the normal value of the probability of desired
confidence.
In practice, we do not know S2 and s2, the sample variance is used as an estimate of the value
of S2.
It should be noted that if the sample size n is large, s provides a good estimate of S, while for
small samples; the estimate is not very good.
−
For the parameter Y , the confidence interval is:
− − −
s ⎛ N −n⎞
y ± ts ( y ) = y ± t ⎜ ⎟
n ⎝ N ⎠ (Ignore fpc if n ≤ 0.05 N)
The value of t depends on the level of confidence. For large samples, the values used are:
12
An estimation by confidence interval has two parts: a ponctual estimate and a value ± which
describes the precision of the estimate. We call the value ± the margin of error.
If the sample size is below 30, the values must be obtained from the table for t distribution
with (n-1) degrees of freedom.
Example
A simple random sample of 100 households was selected from a village where the average
monthly expenditure for electricity is $ 75 with a standard deviation of $ 15. Find the
confidence interval at 95% of the estimated average expenditure of the entire population.
15
75 ± 1,96 = 75 ± 2,94
100
The confidence interval is [72.06, 77.94]
The average monthly expenditure is [72.06, 77.94].
The population may be confident at 95% that the average monthly expenditure is between
72.06 and 77.94.
Note
the difference between S2 and σ 2 has no importance for large populations where σ 2 is most
used.
Note also that in simple random sampling, s2 is an unbiased estimator of S2.
N −1 2 N 2
σ2 = S and
S2 = σ
N N−1
• Population-value: estimates and precision measures
^
The estimate of the total population size (Y), denoted by Y is given by the following
expression:
^ _
N N n
Y = N y = y = ∑ yi
n n i=1
^
The error type of Y is given by : NS( )
Illustration
• Relative error
We can express the error as a proportion (or percentage) of the estimate value. The proportion
is called relative standard error or coefficient of variation CV and written in parentheses.
The CV allows comparing two distributions of values.
To estimate the size (total), the true coefficient of variation is
13
^
N −n
^
S (Y ) NS N
CV (Y ) = ^
= _
Y n Ny
1 S N − n CV N −n
= _
= and with
n y N n N
N −n
≈1
N
CV
=
n
Similarly, the true coefficient of variation for the average-sample is given as follows:
_
N −n
S
_
S ( y) 1 N
CV ( y ) = _
= _
y n y
CV N −n
= and with
n N
N −n
≈1
N
CV
=
n
^ _
We observe that : CV (Y ) = CV ( y )
It may be noted that the standard error of the estimated population is N times the average,
while the coefficients of variation are the same for both. This result is not surprising because
the estimation of a number is obtained by multiplying the sample mean (an estimate) by the
number of elements of the population (a known number), the only source of error is the
average sample.
So, we can predict the error of the size number, when it is expressed as a proportion or
percentage. That error is equal to one for the average. However, when the actual error is
14
expressed in absolute terms, it will be equal to N times the average because N acts as a
multiplication factor.
The variance of the variable Yi is given by:
2
1 N _
σ = ∑ (Yi − Y )
2
N i=1
and the variance of the distribution is given by:
N −
∑ (Y − Y )
i =1
i
2
N σ2
(CV ) = 2
_
=
−2
2
(Y ) Y
Illustration
In many situations, we use the sample proportion p to perform statistical inference on the
proportion of the population P.
In statistical analysis, the proportions appear in two different ways:
a) we can look at proportions rather than numbers or averages. For example, the proportion
of the unemployed population, the percentage of households whose income falls below the
poverty line, the proportion of business firms interested in a product
b) it may be desirable to classify a population into groups and determine the percentage of the
population falling into each group. For example, distribution of population by five year age
group, the classification of firms by revenue.
Assuming a total population and a sample and considering a particular class of units, we can
use the following notations:
15
We can apply all the formulas discussed previously in this particular case by considering each
unit in the population as having a characteristic that may take two values: either the value 0 (if
the unit is not in the class of interest) or the value 1 (if the unit belongs to). If we add up the
values of the population, we obtain the size number A.
N
To estimate the proportions in the case of simple random sampling, we use the following
formulas:
^
a NPQ
p= p= S2 =
n and
N −1
The population variance is PQ. This is the variance of the distribution after having coupled
the value 1 or 0 to the element if it is or not in the class of interest.
^
N − n PQ
σ ( P) =
2
N −1 n
The estimate of this variance derived from a single sample of size n is given by:
^ pq ⎛ N − n ⎞
s ( P) =
2
⎜ ⎟
n −1 ⎝ N ⎠
In this case, the relative standard error (coefficient of variation) of a mean and standard error
of an estimated population data are given by the following formulas:
^
Q ⎛ N −n⎞
CV ( P) = ⎜ ⎟
Pn ⎝ N − 1 ⎠
q ⎛ N −n⎞
cv( p ) = ⎜ ⎟
p(n − 1) ⎝ N ⎠
And
16
^
PQ ⎛ N − n ⎞ ^ pq ⎛ N − n ⎞
S ( A) = N ⎜ ⎟ s ( A) = N ⎜ ⎟
n ⎝ N −1 ⎠ (n − 1) ⎝ N ⎠
Note that the error the number size is equal to that of the average.
The confidence interval for the proportion, assuming that the sample proportion follows a
normal distribution, is equal to:
pq ⎛ N − n ⎞
p±t ⎜ ⎟ where the value of t depends on the desired confidence level.
n ⎝ N ⎠
Illustration
One of the first questions to the statistician when designing a survey concerns the sample size
necessary for estimating the population parameter with a desired accuracy. The decision on
sample size for the survey is very important because a size too large, leads to a waste of
resources while a too small size reduces the reliability of results.
E = Absolute error of the estimate. E is measured in the same unit as the variable.
RE = Relative error of the estimate. E is expressed as a proportion (or percentage) of the true
value of the parameter that is estimated. For example, if E = 5 hectares and the true parameter
value is 100.
Note:
The confidence level indicates the probability for obtaining the expected degree of accuracy
for the calculated sample size n. For example, a confidence level of 95% means that, except
for a low probability of 5%, we can be sure at 95% that the expected accuracy will be reached
with the calculated value for n. This means that the level of acceptable risk that the true
parameter value is outside the limits specified in the confidence limits is 5%.
In the case of large samples with known σ, we establish the following result on the sampling
error when the sample mean is used to estimate the population mean:
• There is a probability equal to (1-α) that the value of the average sample generates a
sampling error ≤ to zα/2 σ .
17
• And as σ = σ/√ n, we can rephrase this result as follows: there is a probability equal
to (1-α) that the value of the average sample generates a sampling error ≤ to
zα/2(σ/√n).
This quantity zα/2 (σ/√ n) is the margin of error. So, the margin of error is determined both by
the values of zα/2, of σ and the sample size n.
When the confidence coefficient 1-α is selected, the value zα/2 can be obtained using statistical
tables and σ being known, one can determine the sample size n required to obtain the
predefined margin of error E.
More samples are larger; more the margin of error is low and better is the accuracy of the
estimate.
σ z σ z 2α / 2σ 2
E = zα / 2 and n = α /2 thus, n=
n E E2
In this formula where n is the minimum size that satisfies the condition imposed to the margin
of error, the value E is the margin of error that the user is willing to accept for a given
confidence level, and the value of zα/2 corresponds to the confidence level used to perform
limits estimation. Although the user has a choice, it should be noted that the confidence level
of 95% is the most used, with the corresponding value zα/2 equal to 1.96.
Applying the formula given above assumes that the standard deviation σ of the population is
well known. Where σ is not known, the three following methods allow obtaining an initial
value of σ.
1) Use the sample standard deviation obtained with a previous sample having similar
characteristics.
2) Use a pilot study to select a preliminary sample and use the deviation from this sample
as the initial value of σ.
3) Use intuition to evaluate σ by estimating the extent of the population (maximum value
minus minimum value), which is then divided by four. The extent divided by four is
often considered a good estimate of the standard deviation σ.
Illustration
b) Sample size required for limits estimation of the proportion of a population with an
absolute error E
The margin of error associated with the estimate of the proportion of the population is zα/2 σ ,
p(1 − p)
with σ =
−
p n
It is bases on the value zα/2 the proportion p and sample size n.
18
Given E the desired margin of error,
p(1 − p ) ( zα / 2 ) 2 p (1 − p )
E = zα / 2 so, n=
n E2
In this formula, the margin of error E must be specified by the user and it is in most cases less
than or equal to 0.10. The user also specifies the confidence level and therefore the
corresponding value zα/2. Finally, the application of the formula for determination of n
requires an initial value of the proportion of the population p. It is obtained by one of the
following four ways:
1. Use the sample proportion obtained from a previous sample having similar
characteristics.
2. Use a pilot study to select a preliminary sample and use the proportion of that
sample as the initial value of p.
4. If no information is available, use the p-value = 0.50. Indeed, the sample size is
proportional to the quantity p (1-p) and the greatest value of this quantity obtained
when p = 0.50 ensures that all estimated proportions satisfy the conditions imposed to
the margin error.
Illustration
1. Sampling frame
For determining a sample, we need a sampling frame. The sampling frame is a list of all
individuals in the population, without omission or duplication, so that the probability of
selection of each individual is known in advance. Before drawing a sample, we need to make
sure that the frame is the best possible.
The practical procedure to select a simple random sample is to choose the units, one at a time,
so that the remaining units in the population have the same probability of being selected.
This selection is done using a random number table that presents sets of random digits
classified into groups in horizontal order and vertical order.
To select a set of random numbers, one can start from any point in the table. In addition, after
choosing the first digit, we can do by going in any direction: towards the bottom or top of the
column to the right or left of the line
19
Illustration
The selection of a simple random sample can be tedious when the number of units to choose
is great. To get a 5% sample of 10000 individuals (N), it will be necessary to choose 500
random numbers (n) in a table. In practice, we use a different method that involves obtaining
a sampling interval (I) N / n = 20 for the case, then choose a random number between 1 and
20 to serve as a random starting point. Finally, select the frame after every 20th individual
posterior element.
If the elements of the population are sorted in an almost random, ie with little correlation
between successive elements, the results of systematic sampling will agree with those of
simple random sampling. Experience shows that, generally, the two methods give results with
approximately the s The systematic sample has a sampling error often a bit lower, since it
ensures that the dispersal of One can make use of simple random sampling formulas for
evaluating the reliability of estimates of a systematic sample.
1. Assign serial numbers 1 through N to the population units (N is the total number of
individuals in the population).
3. Choose a random number between 0 and I from the table of random numbers. This number
is called the random starting point, expressed by R. If using a calculator that gives random
numbers between zero and one, multiply the random number by the value of I to obtain a
random number between zero and I. do not round off.
4. Begin the series of cumulative numbers starting with R. To this figure add I to determine
the second number in the series. Then add I to the second number to get the third, and so on.
5. Stop when the cumulative last number exceeds N. This must happen after accumulating n
numbers.
7. On the list of the population unites, circle the item numbers that correspond to the
determined integers. These are the units that constitute the survey sample.
Generalization:
The jth sample unit (Sj) in the population can be selected as follows:
S j = R + ( j − 1) I With j = 1, 2... n
20
where: Sj = selected number of jth individual in the population
R = random starting point;
I = drawing line (interval)
n = sample size
1. Principle
Where the population embraces a number of distinct categories, the frame can be organized
by these categories into separate "strata". Each stratum is then sampled as an independent sub-
population, out of which individual elements can be randomly selected.
Stratified random sampling is a method in which the elements of the population are divided
into homogeneous groups, and then selecting a simple random sample within each group. This
group setting process is called stratification and groups called are called strata. Strata may
reflect country's regions, highly or less populated areas, various ethnic groups, religious
groups or various other groups.
In stratification, we group similar components together for having the variance within the
strata low, at the same time, It is important that the means from different strata can be
different as much as possible.
The initial population size N is divided into L groups G1, G2, ..., Gh, ..., GL of respective sizes
N1, N2, ...., Nh, ..., NL such as:
L
N = ∑ Nh
h=1
Note that the letter h is used to identify the strata so that, if L strata are determined, h ranges
from 1 to L.
In stratified sampling, the probabilities of selection can be either equal in all groups, or
different. In stratified sampling, all items in particular strata have the same probability of
being selected. The draw of sampling units, a count of selected units, distribution and
supervision of fieldwork and, in general, the entire administration of the survey is greatly
simplified. However, the procedure requires a prior knowledge of the sizes of strata, that is to
say, the number of sampling units in each strata and the existence of a sampling frame that
can be used to draw the sample within each strata.
2. Notation
we use the same notation as for simple random sampling, except that there will be a sub-index
to indicate a particular strata when referring to information relating to that strata. The sub
index h denotes strata and sub-index i, the unit within the strata. As in the simple random
sampling, the capital letters refer to population values, while lowercase letters refer to values
corresponding to samples.
21
Illustration
3. Estimation
• Mean estimation
The mean of population can be expressed in terms of numbers within strata, as follows:
_
1 L Y L
Y = ∑ Yh = Where the population-number Y = ∑ Yh
N h =1 N h =1
_ _
1 L _
Illustration
As in the simple random sampling, we get the estimate of the population size by
multiplying the estimate of mean by the number of elements in the population:
22
^ _ L _ L
Nh
Y st = N y st = ∑ N h y h = ∑ yh
h =1 n
h =1 h
• Proportion estimate
The procedure of estimating the population-proportion is similar to that of the mean, except
that the possible values of Yi are 0 and 1. In this case,
_
1 Nh
Yh = Ph =
Nh
∑Y
i =1
hi where Yhi= 0 or 1
23
CHAPITRE IV. SONDAGE PAR GRAPPES
Après ce regroupement, un échantillon aléatoire simple des grappes est ensuite sélectionné.
L’échantillonnage par grappes tend à fournir de meilleurs résultats lorsque les éléments
contenus dans les grappes sont hétérogènes et, dans le cas idéal, chaque grappe est une
représentation à petite échelle de la population entière. La valeur de l’échantillonnage par
grappes dépend du degré de représentativité de la population entière, de chaque grappe. Si
toutes les grappes sont semblables dans ce sens, échantillonner un petit nombre de grappes
fournira de bonnes estimations des paramètres de la population.
Après la sélection des grappes échantillons, celles-ci déterminent à leur tour les unités à
inclure. La détermination peut s’effectuer de deux façons suivantes :
a) l’échantillon peut comprendre toutes les unités à l’intérieur des grappes choisies. Cette
procédure est appelée généralement « sondage par grappes à un degré) ».
b) on peut tirer un sous-échantillon des unités des grappes échantillonnées. Ceci s’appelle
« sondage par grappes à plusieurs degrés » ou simplement « sondage à plusieurs degrés ».
Il y a deux raisons pour l’emploi du sondage par grappes. Fréquemment, il n’existe pas de
base de sondage adéquate à partir de laquelle effectuer le tirage des éléments de la population,
et le coût de la construction d’une telle base de sondage serait énorme.
D’autre part, une telle base pourrait exister, mais l’épargne dans les coûts des travaux de
terrain obtenue avec le sondage par grappes peut faire de cette méthode une alternative plus
efficace qu’un échantillon aléatoire simple.
Dans la plupart de cas pratiques, un échantillon d’un effectif quelconque d’unités choisies
aléatoirement aura une variance plus faible qu’un échantillon de la même taille choisi en
grappes. Cependant, quand il faut balancer le coût contre la fiabilité, l’échantillon par grappes
pourrait être plus efficace.
Quoique les unités d’analyse ne soient pas choisies directement, la probabilité de sélection
d’une grappe et de chaque unité à l’intérieur de celle-ci est connue préalablement ; ainsi donc,
le sondage par grappes satisfait le critère de sondage probabiliste.
24
IV.1. Sondage aréolaire
Le sondage aréolaire (area sample) est un échantillonnage à plusieurs degrés, dont l’avant
dernier degré s’appuie sur des aires géographiques faisant office de grappes. Il est très utile
quand une ou toutes les deux conditions suivantes existent :
b) quand les frais de déplacement sont élevés, c’est-à-dire il y a un coût élevé associé aux
déplacements des enquêteurs d’une unité-échantillon choisie au hasard à une autre choisie
aussi au hasard.
Pour le tirage d’un échantillon aréolaire, on utilisera les pâtés d’une ville comme illustration
(dans les zones rurales, on peut utiliser des segments administratifs ayant des limites
identifiables). On pourra procéder de la manière suivante :
1) obtenir un plan raisonnablement fidèle de la ville, qui donne le plus possible de détails sur
les pâtés. Si le plan n’est pas récent, il faut prendre certaines dispositions au moyen des
renseignements au niveau local pour l’actualiser ;
2) numéroter tous les pâtés successivement, en plaçant les numéros directement sur le plan, un
système serpentin est conseillé pour ne pas omettre un pâté ;
4) interviewer tous les ménages dans les pâtés-échantillons ou sélectionner quelques ménages
à l’intérieur des pâtés-échantillons, soit avec une table de nombres aléatoires, soit par sondage
systématique en utilisant les numéros affectés aux unités de logement.
IV.2. Estimation
Il existe une analogie étroite entre le sondage par grappes et le sondage stratifié. Dans les
deux cas, on regroupe les éléments individuels ensemble avant de choisir l’échantillon. La
différence est que dans le sondage stratifié, il faut prélever un échantillon à l’intérieur de
chaque strate, tandis que dans le sondage par grappes, on prélève un échantillon des grappes
avant de choisir les éléments individuels dans les grappes-échantillons. Il convient de noter
toutefois que la combinaison des deux types de sondages est souvent utilisée.
• Notation
M = ∑ Mi
N
25
Mi = effectif des USS dans le ième UPS (grappe), où i = 1, …, N.
_
1 N
M =
N
∑M
i =1
i = nombre moyen de USS par UPS dans la population ou taille moyenne
de la grappe.
m = ∑ mi
n
_
1 n
m = ∑ mi = nombre moyen de USS par UPS-échantillon.
n i=1
Yij = valeur d’une caractéristique pour la jème USS dans la ième UPS-population.
Mi
Y = ∑ Yi
N
_
1 Mi
Yi =
Mi
∑Y
j =1
ij = valeur moyenne de la caractéristique dans la ième UPS.
_
1 N
Yc = ∑ Yi = valeur moyenne de la caractéristique par UPS.
N i=1
_
Y
µ =Y = = moyenne-population.
M
mi
26
_
1 mi
yi =
mi
∑y
j =1
ij = moyenne-échantillon de la caractéristique dans la ième UPS-
échantillon.
Les formules présentées au chapitre précédent (Stratification) pour estimer les moyennes-
populations sont valables quand l’unité de sondage est identique à l’unité d’analyse. Or, une
caractéristique particulière du sondage par grappes est que l’unité de sondage (au 1er degré)
n’est pas l’unité d’analyse.
Considérons un plan de sondage à deux étapes/degrés dans lequel les unités de la deuxième
étape sont les unités d’analyse ; on choisit n grappes parmi N, par sondage aléatoire simple ;
puis on choisit mi unités dans la ième UPS par sondage aléatoire simple avec i = 1, …, n.
Les unités à l’intérieur de la grappe étant choisies par sondage aléatoire simple, on sait qu’on
peut estimer cette moyenne sans biais, à l’aide de la formule suivante :
_
1 mi
yi =
mi
∑y
j =1
ij
Ces estimations des moyennes au niveau de la grappe pour les n grappes-échantillons doivent
être combinées pour estimer l’effectif-population global (Y) et la moyenne-population
_
Y
µ =Y =
M
Pour l’effectif-population de la caractéristique Y, on construit d’abord un estimateur non
biaisé de Yi, l’effectif de la caractéristique Y dans la ième UPS, qui est donnée par :
^ Mi
Yi = yi
mi
Ainsi, un estimateur non biaisé pour l’effectif-population est :
^ n ^ n n _ n mi
N N Mi N N Mi
Y =
n
∑
i =1
Yi =
n
∑i =1 mi
yi =
n
∑
i =1
M i yi =
n
∑
i =1 mi
∑y
j =1
ij
_
Y ∑Y i
µ =Y = = i =1
N
∑M
De la même façon, on peut estimer la moyenne-population M
i
i =1
27
_
Un estimateur de Y est :
^ _ mi
N n
Mi yij ^ n
∑ ∑m
N
µ=y=
n i =1
^
j =1
où M =
n
∑Mi =1
i
M i
Il importe de noter que pour l’estimation d’une proportion, la formule est la même, sauf que
l’on considère seulement les yij = 1, c’est-à-dire les unités possédant l’attribut.
Illustration
On est par exemple dans une ville où l’on doit procéder à une enquête par grappes à plusieurs
étapes/degrés. La ville est constituée de :
Parallèlement aux sondages à probabilités égales, il existe des plans de sondage où les
individus ont des probabilités d’inclusion inégales. Généralement, les tirages stratifiés et les
tirages à plusieurs degrés sont des tirages où les individus de la base de sondage n’ont pas
tous la même probabilité d’être sélectionnés. Une sélection avec probabilité proportionnelle à
la taille signifie, par exemple, qu’une grappe qui est 5 fois plus grande qu’une autre aura 5
fois de probabilité de tomber dans l’échantillon.
Comme résultat du tirage à plusieurs degrés, la possibilité que n’importe quelle unité de la
dernière étape soit comprise dans l’échantillon est le produit des probabilités de sélection de
toutes les toutes les étapes effectuées.
L’illustration du sondage à plusieurs degrés nous amènera au tirage systématique déjà évoqué,
mais avec des probabilités égales. Pour le cas présent, l’algorithme du tirage systématique est
appliqué avec des probabilités inégales.
M hi
P1h = ah
Mh
avec Mh = nombre de ménages dans la strate h
mhij
P2 h =
M hi
avec mhij = nombre de ménages tirés dans l’UPS (grappe) i de la strate h
Ph = P1h .P2 h
4. Facteur d’extrapolation dans une strate
Wh = 1 / Ph = 1 / P1h .P2 h
5. Probabilité de tirage d’un ménage dans toutes les strates (niveau national, par
exemple)
P = ∏ Ph
h
6. Facteur d’extrapolation dans toutes les strates (niveau national, par exemple)
W = 1 / P = 1 / ∏ Ph
h
29
CHAPTER VI. SAMPLING IN TIME (Panel)
Taking into account the time will add a dimension to the issue of the survey. The difference
between the values of a variable Y on the individuals of a given population is likely to be not
only spatial but also temporal. Indeed, the time will intervene in the definition of two
fundamental concepts: population and variables. No one can deny that populations evolve
over time, as the variables are affected by time.
If we look at a complex parameter ∆ that compares two similar parameters, but set at two
different times t and t +1. If both parameters are noted θt and θt+1 (∆ can be the difference
θt+1- θt or the ratio θt+1/θt).
One question will be raised about the reference population at time t Ωt or involve both
population Ωt (where we define θ t) and population Ωt +1 (where we θ t +1)
This duality of approaches will impact on the choice of sampling method, but it may also
reconcile the two by opting for a rotating sample plan.
If we study a population Ωt defined at time t and we measure the changes over time in this
population, we are in presence of a longitudinal survey. In another case, when the study
intends to describe the status of a population at one point, we talk about a cross-sectional
survey.
Whether the approach is longitudinal or transverse, we can use three types of surveys:
samples drawn independently at each date, panels and rotating samples. However, we often
use panels or rotating samples.
When attempting to measure the evolution of a parameter over time, the population being that
of the original date, it often happens that one chooses to take a sample from the original date,
and follow the individuals in the sample as long as it is relevant for the purposes of the study.
Such a sample when individuals are interviewed at least twice is called a panel.
Panels can be used because the parameter we want to estimate explicitly involves the
individual developments of the variable of interest. Example, for calculation of monthly price
indexes, it is necessary to establish a range of products to estimate the index.
The establishment of a panel can be done to minimize errors of observation which necessarily
affect the results of surveys.
Using a panel seems inconsistent with the estimation of a parameter to the current date.
Indeed, a panel is a sample that represents the population set to its drawing. To move from an
initial population (old) to the current one (contemporary),we need to take into account the
new individuals.
To move from the panel to the transverse sample, we must ensure that each individual of the
current population can be sampled. The principle is to avoid probabilities of zero selection.
For this, there are three methods:
30
• Making a stratum (or strata) gathering new individuals, defined as individuals who belong to
the current population without being in the initial population. Then, we draw from this
stratum to form the necessary complement. This method has the major drawback of being
based on the knowledge of all new individuals.
• Either you draw, regardless of the panel, a sample in the cross-sectional population. Thus,
each individual will have a new non-zero probability of inclusion in the final sample.
• Either we use a combination "natural" individuals and we decide to investigate the entire
group if and only if it contains at least one individual of the panel. This methodology known
as "indirect sampling" is not infallible, but significantly reduces the problem. For example, to
conduct a physical survey of individuals, as individuals are grouped into units or households,
they may decide to investigate the current date all the people who are in housing or
households that contain at least one individual panel.
a) Principle
The rotary rotational sampling consists of taking regular (quarterly or annual) samples
panelized for a given period, common to all panels. Thus, at each survey date (called a
"wave"), a panel enters the system while a panel comes out.
The main advantages of the rotary sampling are those of the panel, namely the gain of
accuracy in estimating the changes and the reduction of the observation error. However, this
sampling has some drawbacks; including difficulties in monitoring field where a change of
panel address of individuals (this is sometimes called "tracing" or "tracking" of individuals).
b) Longitudinal exploitation
In times of change between dates t and t +1, assume that each panel must be interviewed four
times, eg every quarter for one year. The sample taken at the time α will be investigated to
waves α, α +1, α+ 2 and α +3.
c) Transverse exploitation
In this approach, it is estimated, for example, the total Tt where t is the current date. At each
date t, we practice an inference on the entire population Ωt. Regarding the protocol for the
collection, we follow all the individuals of the panel in time and we question all individuals
living in the same housing or household as the sample person.
31