Chapter 3
Preliminaries
This chapter presents the basic theory needed to design a good sampling method, based on
Cochran (1977) and Lohr (2010). Sampling was not always accepted by researchers; complete
enumeration was long preferred. Although this view is no longer common, it is worth pointing
out the advantages of sampling over complete enumeration:
(1) Reduced cost.
(2) Greater speed.
(3) Greater scope. The types of information that can be gathered through surveys that use
sampling are more varied and flexible.
(4) Greater accuracy. Because the volume of work is reduced, higher-quality personnel can be
hired and given intensive training, and more careful supervision of the field work and
processing of the results become possible.
3.1 Steps in a Design of Sample Survey
The main steps in designing a sample survey are listed below. Surveys vary substantially
in complexity, so the 11 steps need not be followed in strict order, and the list is not
exhaustive.
1. Objectives of the Study
2. Population to be Sampled
3. Data to be Collected
4. Degree of Precision Desired
5. Methods of Measurement
6. The Frame
7. Selection of the Sample
8. The Pretest
9. Organization of the Field Work
10. Summary and Analysis of the Data
11. Information Gained for Future Surveys
Some of these concepts are worth elaborating: the population to be sampled, the data to
be collected, and the degree of precision desired.
Definition 1. Population. The group from which the sample is drawn is referred to as the
population or, in inferential statistics, the target population.
The population should be clearly defined. The enumerator must be able to determine
quickly in the field whether a doubtful case belongs to the population or not.
Any additional information that can be acquired about the nature of the differences between
the sampled and target populations may be helpful. The population must be divided into
parts known as sampling units, or Primary Sampling Units (PSU), before the sample
is chosen. These units must cover the whole population and must not overlap, so that
every element of the population belongs to one and only one unit. In
survey sampling it is important to have a list of the members of the target population.
Definition 2. Sampling Frame. A list, map, or other specification of sampling units in the
population from which a sample may be selected [1].
Examples of sampling frames are a list of every home phone number in a city or a list of
street addresses. In an agricultural survey, the sampling frame may be a list of all farms
or a map of the areas with farms.
Ideally, the sampling frame should coincide with the target population, but this is rare.
For example, the target population may be all employees in the Philippines, while the
sampling frame excludes employees living in institutional places such as prisons,
dormitories, and homes for the elderly.
For the data to be collected, a lengthy questionnaire lowers the quality of the responses.
For the precision desired, the results of sample surveys are always subject to some degree
of uncertainty, because only a portion of the population has been measured and because of
errors of measurement. Specifying the level of precision desired in the results is a crucial
step, and it is the responsibility of the person who will use the data. This may present
difficulties, since many administrators are not used to thinking in terms of the amount of
error that can be tolerated in estimates while still making good decisions. The statistician
can often help at this stage.
This subsection gave a quick overview of how to design a sample survey. This study focuses
on one part of survey design, the statistical theory, which is introduced in the next
subsection, "Role of Sampling Theory."
3.2 Role of Sampling Theory
Several types of skills are needed to develop a survey: defining the population,
determining the data to be collected, choosing the methods of measurement, and organizing
the field work. Sampling theory plays only a small part. However, poor work in one phase
affects everything else.
Sampling theory prioritizes two factors: (1) efficiency and (2) the lowest possible cost.
Sample selection methods are developed to provide estimates that are precise enough at the
minimum possible cost. To apply this principle, the precision and cost expected from a
procedure should be predictable. This is better understood in the succeeding sections.
3.3 Properties of Probability Sampling
The sampling theory discussed in this study is based on probability sampling.
Probability sampling has the following mathematical properties:
1. The set of distinct samples S1, S2, ..., Sv that the procedure can select from the
sampling frame is defined. For example, for a population of six units numbered 1 to 6 and a
sample of size 2, the possible samples may be S1 = (1, 4), S2 = (2, 5), S3 = (3, 6). Note
that not all possible samples of size 2 need be included.
2. Each possible sample Si has assigned to it a known probability of selection, πi:
πi = P(sample Si is selected)
3. The method for computing the estimate from the sample must be stated and must
lead to a unique estimate for any specific sample. For example, the estimate may be the
average of the measurements on the units in the sample.
4. The frequency distribution of the estimates can then be derived: if the sampling
procedure is applied repeatedly to the same population, we know how frequently any
particular sample Si will be selected and what estimate it gives.
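The four properties can be sketched with the six-unit example above. The unit values and the equal selection probabilities of 1/3 per sample are assumptions made only for illustration; they do not come from the cited texts.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical measurements for the six population units (illustrative only).
y = {1: 10, 2: 14, 3: 9, 4: 12, 5: 6, 6: 16}

# Property 1: the set of possible samples is defined in advance.
samples = {"S1": (1, 4), "S2": (2, 5), "S3": (3, 6)}

# Property 2: each sample has a known selection probability (assumed equal here).
prob = {name: Fraction(1, 3) for name in samples}
assert sum(prob.values()) == 1


def estimate(units):
    """Property 3: the estimate is the average of the units in the sample."""
    return Fraction(sum(y[u] for u in units), len(units))


# Property 4: the frequency distribution of the estimate under repeated sampling.
distribution = Counter()
for name, units in samples.items():
    distribution[estimate(units)] += prob[name]

for value, p in sorted(distribution.items()):
    print(f"estimate = {float(value):.1f} with probability {p}")
```

Exact fractions are used so that the probabilities sum to exactly 1; with unequal selection probabilities only the `prob` dictionary would change.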
3.4 Nonprobability Sampling
Probability sampling is better understood by contrast with its opposite, nonprobability
sampling. Common types of nonprobability sampling are:
1. The sample is restricted to a part of the population that is readily accessible.
For example, a sample of coal from an open wagon may be taken from the top 6 to 9 inches.
2. The sample is selected haphazardly. In picking 10 rabbits from a large cage in a
laboratory, the investigator may select those that his hands rest on, without conscious
planning.
3. With a small but heterogeneous population, the sampler inspects the whole population
and selects a small sample of "typical" units, that is, units that are close to his
impression of the average of the population.
4. In studies in which the measuring process is unpleasant or troublesome to the person
being measured, the sample consists essentially of volunteers.
Under the right conditions, these methods can give useful results. However, they are not
amenable to the development of a sampling theory.
3.5 Normal Distribution
Assumption of Normality
A part of sample survey theory is finding formulas for means and variances. These are
derived from the frequency distribution of the estimates given by the samples Si under
repeated sampling from the population. A simplification comes from the size of the sample.
In practice, a survey is carried out only once, usually with a large sample, and with large
samples there is good reason to assume that the sample estimates are approximately normally
distributed.
Before going further, it is important to establish the notation and definitions used in
survey sampling. Most of the following notation is used in survey books such as Cochran
(1977) and Lohr (2010).
Definition 3. Estimator. The estimator, denoted by µ̂, denotes the rule by which an
estimate of some population parameter, µ , is calculated from the sample results.
Definition 4. Estimate. It is the value obtained for the estimator from a sample, Si .
Definition 5. Expected Value. An estimator µ̂ of µ, taken over all possible samples provided
by the plan, has an expected value denoted by
\[ E(\hat{\mu}) = \sum_{i=1}^{v} \pi_i \hat{\mu}_i \tag{3.1} \]
where µ̂i is the estimate given by the i-th sample, and πi is the probability of selecting Si.
The symbol E stands for "the expected value of."
Definition 6. Unbiased Estimator. An estimator µ̂ is called unbiased if the mean value of µ̂,
taken over all possible samples provided by the plan, is equal to the true population
value µ (Cochran, 1977):
E(µ̂) = µ (3.2)
The amount of bias is
Bias(µ̂) = E(µ̂) − µ (3.3)
It was mentioned earlier that the role of sampling theory is to provide efficient
estimators. One factor in efficiency is precision.
Definition 7. Precision. Precision refers to the size of the deviations from the mean E(µ̂)
obtained by repeated application of the sampling procedure. It is measured by the variance
of the estimator.
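Definition 8. Variance of estimator. The variance of µ̂, denoted as σ²µ̂, is defined as
\[ \sigma_{\hat{\mu}}^{2} = E\left[\hat{\mu} - E(\hat{\mu})\right]^{2} \tag{3.4} \]
For an unbiased estimator, E(µ̂) = µ, so this equals E(µ̂ − µ)².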
Definition 9. Standard deviation. The standard deviation of the estimator µ̂, denoted as
SE(µ̂) or σµ̂, is the square root of the variance, σµ̂ = √(σ²µ̂). It is also known as the
standard error.
How Good is an Unbiased Estimate?
Suppose a sampling procedure gives an unbiased estimator. A sample Si is selected
and yields an estimate µ̂ with standard deviation σµ̂ (for an estimate computed from a
sample, this is usually called the standard error).
We cannot know the exact value of the error of estimate (µ̂ − µ), since the population
parameter µ is unknown, but from the properties of the normal curve, the chances are
0.32 (about 1 in 3) that the absolute error |µ̂ − µ| exceeds σµ̂
0.05 (1 in 20) that the absolute error |µ̂ − µ| exceeds 1.96σµ̂ ≈ 2σµ̂
0.01 (1 in 100) that the absolute error |µ̂ − µ| exceeds 2.58σµ̂
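These tail probabilities can be reproduced from the standard normal distribution. The following is a minimal sketch using Python's standard library; the function name `two_sided_tail` is chosen here only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def two_sided_tail(t):
    """Probability that a standard normal deviate exceeds t in absolute value."""
    return 2 * (1 - std_normal.cdf(t))


# Multiples of the standard error used in the text.
for t in (1.0, 1.96, 2.58):
    print(f"P(|error| > {t} standard errors) = {two_sided_tail(t):.3f}")
```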
Definition 10. Confidence Limits. Lower limit
µ̂L = µ̂ − tσµ̂ (3.5)
Upper limit
µ̂U = µ̂ + tσµ̂ (3.6)
The symbol t is the value of the normal deviate corresponding to the desired confidence
probability. The most common values are

Confidence probability (%)   50     80     90     95     99
Normal deviate t             0.67   1.28   1.64   1.96   2.58
For example, if a probability sample of the records of batteries in routine use in a large
factory shows an average life µ̂ = 394 days, with a standard error σµ̂ = 4.6 days, the chances
are 99 in 100 that the average life in the population of batteries lies between
µ̂L = 394 − (2.58)(4.6) = 382 days
and
µ̂U = 394 + (2.58)(4.6) = 406 days
It is not correct to interpret this as a statement about a single estimate from a single
survey, such as "µ lies between 382 and 406 days." The correct interpretation is that, if
the same sampling plan were used repeatedly, each sample would provide its own estimate and
confidence limits, and about 99% of the resulting statements would be correct and 1% wrong.
The example above is just one such estimate and pair of confidence limits.
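The battery example can be recomputed as follows. This is a minimal sketch: the function name `confidence_limits` is illustrative, and the normal deviate is taken from the exact normal quantile (about 2.576) rather than the rounded 2.58 in the table, which does not change the limits after rounding to whole days.

```python
from statistics import NormalDist


def confidence_limits(estimate, std_error, confidence):
    """Lower and upper limits: estimate -/+ t * standard error, t a normal deviate."""
    t = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - t * std_error, estimate + t * std_error


# Average battery life 394 days, standard error 4.6 days, 99% confidence.
lower, upper = confidence_limits(394, 4.6, 0.99)
print(f"{lower:.0f} to {upper:.0f} days")
```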
What to do with a biased estimator?
It is ideal to have an unbiased estimator, but in sampling theory there are reasons to
consider biased estimators:
1. In some common problems, especially the estimation of ratios, the estimators in use are
biased.
2. Errors of measurement and nonresponse may produce biases in the data, even with
estimators that are unbiased under probability sampling. This occurs, for example, when
those who decline to be interviewed are almost all opposed to a certain public expenditure,
whereas those who are interviewed are equally divided between supporters and opponents.
Unbiased estimators are compared by their precision, that is, their variance. To compare a
biased estimator with an unbiased estimator, or two biased estimators, a useful measure
is accuracy, given by the mean square error (MSE) of the estimate.
Definition 11. Accuracy. Accuracy refers to the size of the deviations from the true mean µ.
It is measured by the mean square error, which is the sum of the variance of the estimator
and the square of its bias.
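Definition 12. Mean Square Error.
\[
\begin{aligned}
\mathrm{MSE}(\hat{\mu}) &= E(\hat{\mu} - \mu)^2 = E[(\hat{\mu} - m) + (m - \mu)]^2 \\
&= E(\hat{\mu} - m)^2 + 2(m - \mu)E(\hat{\mu} - m) + (m - \mu)^2 \\
&= (\text{variance of } \hat{\mu}) + (\text{bias})^2
\end{aligned}
\tag{3.7}
\]
where m = E(µ̂), and the cross-product term vanishes since E(µ̂ − m) = 0.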
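The decomposition of the MSE into variance plus squared bias can be checked numerically. The sketch below uses a hypothetical sampling plan with three possible estimates, equal selection probabilities, and an assumed true value µ = 11; none of these numbers come from the source.

```python
from fractions import Fraction

# Hypothetical plan: the estimate given by each possible sample and its probability.
estimates = [Fraction(10), Fraction(11), Fraction(25, 2)]
probs = [Fraction(1, 3)] * 3
mu = Fraction(11)  # assumed true parameter value

# m = E(mu_hat): probability-weighted mean of the estimates.
m = sum(p * e for p, e in zip(probs, estimates))

variance = sum(p * (e - m) ** 2 for p, e in zip(probs, estimates))
bias = m - mu
mse = sum(p * (e - mu) ** 2 for p, e in zip(probs, estimates))

print(f"variance = {variance}, bias = {bias}, MSE = {mse}")
```

Because exact fractions are used, `mse` equals `variance + bias ** 2` exactly rather than only up to rounding error.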
Survey design
Survey design defines and plans the elements needed in survey sampling, such as the primary
sampling unit (PSU). The questionnaire design should also be in line with the objective of
the survey. The choice of PSUs from the sampling frame affects the precision and accuracy
of the estimates obtained from the sample.
The simplest design to consider is Simple Random Sampling (SRS). If the sampling frame
can be arranged in an order that is related to the variable of interest, Systematic
Sampling can give a more representative sample than SRS, although no general theory
guarantees this. If more details are available in the sampling frame, Stratified Sampling
can give better precision than SRS. To manage the cost of face-to-face interviews, Cluster
Sampling is the most practical. Cluster sampling may increase the variance compared to SRS
or stratified sampling, but the increase may be acceptable given the savings in the cost of
data gathering. Details of the
Methodology of data gathering is also part of the design because it affects the budget
for the survey. The cheapest way of gathering data is through mail or email, but it also
results in a low response rate (Lohr, 2010) [1]. There are several ways to increase
the response rate of telephone interviews, such as sending mail beforehand to inform
respondents about the interview. The face-to-face interview has the highest response rate
among the forms of data collection, but it is also very costly.
Another factor in survey design is the sample size. Finding the sample size needed for
data collection is very important, and it is only possible if a population count exists.
In finding the sample size, it is important to ask the following:
(1) How variable is the characteristic in the population (population standard deviation, S)?
This can be based on previous studies, a pilot study, or an expert's guess.
(2) How much error is tolerable (e)? Usually e = 0.03.
We want a 95% confidence interval with width at most 2e, that is, with z(α/2) = 1.96
(Lohr, 2010).
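As a rough illustration of how S, e, and the normal deviate combine, the sketch below uses the standard SRS sample-size calculation n0 = (zS/e)², with the finite population correction n = n0 / (1 + n0/N) when N is known. The function name and the example values (a proportion with worst-case S = 0.5 and a hypothetical N = 10,000) are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def srs_sample_size(S, e, N=None, confidence=0.95):
    """Sample size for estimating a mean under SRS with margin of error e."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    n0 = (z * S / e) ** 2                           # ignores the population size
    if N is not None:
        n0 = n0 / (1 + n0 / N)                      # finite population correction
    return math.ceil(n0)


# Worst-case standard deviation of a proportion is 0.5; margin of error e = 0.03.
print(srs_sample_size(S=0.5, e=0.03))
print(srs_sample_size(S=0.5, e=0.03, N=10_000))
```

The result is rounded up because a fractional sample size cannot be drawn and rounding down would miss the target precision.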
Even in the 1950s, sampling was not confined to partial coverage: the censuses of
population, agriculture, commerce, and industry in the United States and Canada included
concurrent and supplementary samples, not only to broaden the scope of the information
but also to study and evaluate the errors and biases of the census so that the data could be
made more useful (Deming, 1950). The chief concern of any government statistical agency
should be to assure the necessity, the success, and the economy of any survey that is
authorized at public expense. Such obligations cannot be met in the absence of knowledge and
research in statistical theory. Statistical research is constantly lowering costs and
enhancing the reliability and usefulness of statistical information.
Probability samples are frequently tailored to answer specific research and policy
questions, but they come with a variety of drawbacks. Response rates are declining globally,
with a typical telephone survey response rate of less than 10% (Kohut et al., 2012), far
below the 95 percent response rate for mail surveys predicted by Deming (1950, p. 35). Even
for face-to-face surveys such as the U.S. National Health Interview Survey (NHIS), the
response rate fell from 92 percent in 1997 to 70 percent in 2015 (National Center for Health
Statistics, 2016), with additional nonresponse among individuals within sampled households.
At least for some statistics, investigations have not identified a strong relationship
between response rate and bias (Groves, 2006), but declining response rates have contributed
to higher data collection costs. Because probability samples are expensive to conduct,
sample sizes are limited. As a result, if reliable estimates for subpopulations of interest
can be produced at all, they may require multiple years of data, and the estimates may be
out of date by the time they are produced (Lohr and Raghunathan, 2017).
One example of a survey is the Gallup World Poll. It monitors the most pressing
global concerns, including food security, employment, leadership performance, and
well-being (Gallup, Inc., 2022). In countries where telephone interviewing is practiced,
Gallup uses either a random-digit-dial (RDD) approach or a nationally representative list
of phone numbers. Telephone methodology is commonplace in the United States, Canada,
Western Europe, Japan, Australia, and other countries.
Chapter 4
Alternative Sampling Strategies
4.1 Development of Sampling Schemes
This chapter gives an overview of the alternative sampling strategies considered
in this study. It starts with a recap of the design and data collection of the 2018 FIES,
followed by the common problems in the history of FIES and a short recap of FIES as sampling
on successive occasions. It then discusses the alternative sampling strategies under Simple
Random Sampling.
As a quick recap from Chapter 1, the design of the 2018 FIES makes use of the 2013 Master
Sample (2013 MS) (PSA, 2020). The sampling frame of the 2018 FIES consists of groups of
replicates from the 2013 MS. These replicates are groups of Primary Sampling Units (PSUs)
with 100-400 households. The details of how replicates are formed are in the FIES 2018
technical report (PSA, 2020), and a summary is provided in the introduction under the
Philippine Household Survey. For the 2018 FIES, there are samples of around 12 replicates in
Highly Urbanized Cities (HUC) and 16 replicates in the province domain. The replicate
allocation per domain was deemed sufficient as follows: 1 replicate for national-level
estimates, 4 replicates for regional-level estimates, and 16 replicates for
provincial-level estimates (PSA, 2021). The 16 replicates for province-level estimates lead
to 170,917 sampled households.
The data collection of the 2018 FIES took place in June 2018 and January 2019, and the
170,917 households were interviewed twice. This is illustrated in Figure 4.1. The reference
period for both interviews is the past 6 months. FIES conducts the data collection twice per
survey year to minimize respondent memory bias and, at the same time, capture seasonality
(PSA, 2020).
Figure 4.1: Illustration of the coverage and duration of the two visit data collection of 2018
FIES
4.2 Theoretical Framework
The purpose of this chapter is to illustrate how the alternative sampling strategies are
compared with the FIES 2018 strategy using theories of simple random sampling (SRS).
Assume a simple random sample of size n. There are four sampling schemes which are
denoted as: S∅, S1, S2 and S3, wherein,
Strategy ∅: the complete-overlap sample of the two visits of FIES 2018, where all n
sample units are interviewed in both visits.
Strategy 1: the alternative sampling strategy of non-overlapping samples between the two
visits, where the n sample units are divided into two sets.
Strategy 2: the alternative sampling strategy that assumes the first-visit values are equal
to the second-visit values, where all n sample units are interviewed in the first visit
only.
Strategy 3: the alternative sampling strategy of partially overlapping samples between the
two visits, where the n sample units are divided into three sets. The first and second sets
are interviewed in the first visit; the second and third sets are interviewed in the second
visit.
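The four strategies differ only in which sample units are interviewed at each visit. A minimal sketch of that assignment is given below; the function name `assign_visits`, the random shuffling, and the equal-sized splits in Strategies 1 and 3 are assumptions made for illustration, since the split sizes are not specified here.

```python
import random


def assign_visits(sample_ids, strategy, seed=0):
    """Return (visit1_ids, visit2_ids) for strategy 0 (complete overlap), 1, 2, or 3."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # random order before splitting (assumption)
    n = len(ids)
    if strategy == 0:   # all n units interviewed in both visits
        return ids, ids
    if strategy == 1:   # two non-overlapping sets, one per visit
        return ids[: n // 2], ids[n // 2:]
    if strategy == 2:   # all n units interviewed in the first visit only
        return ids, []
    if strategy == 3:   # three sets A, B, C: visit 1 = A + B, visit 2 = B + C
        a, b = n // 3, 2 * n // 3
        return ids[:b], ids[a:]
    raise ValueError("strategy must be 0, 1, 2, or 3")


visit1, visit2 = assign_visits(range(12), strategy=3)
print(sorted(set(visit1) & set(visit2)))  # units interviewed in both visits
```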
In this section, we begin by assuming SRS and measurements taken under SRS. The primary
objective is to estimate annual parameters.
4.3 Simple Random Sampling
Chapter 3 discussed the role of survey design in survey sampling. Simple
random sampling (SRS) is the simplest design to start with and provides the theoretical
basis for the more complicated designs. Given a population of N sampling units from which
n units are to be selected, SRS is the method of choosing the sample so that every unit has
an equal chance of being selected. The units in the population are numbered
i = 1, 2, 3, ..., N. A simple way to select n units with equal probability is the fishbowl
method or a computer program.
There are NCn = N!/(n!(N − n)!) possible samples of n units from a population of N units.
The probability that one of the n specified units is selected at the first draw is n/N. The
probability that one of the remaining (n − 1) specified units is selected at the second draw
is (n − 1)/(N − 1), and so on. Hence the probability that all n specified units are
selected in n draws is
\[
\frac{n}{N}\cdot\frac{n-1}{N-1}\cdot\frac{n-2}{N-2}\cdots\frac{1}{N-n+1}
= \frac{n!\,(N-n)!}{N!} = \frac{1}{{}_{N}C_{n}}
\]
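The product of draw-by-draw probabilities can be checked against 1/NCn, and a simple random sample can be drawn by computer. This is a minimal sketch using Python's standard library; the function name and the values N = 6, n = 2 are illustrative only.

```python
import math
import random
from fractions import Fraction


def prob_specified_sample(N, n):
    """Multiply n/N, (n-1)/(N-1), ..., 1/(N-n+1) draw by draw."""
    p = Fraction(1)
    for k in range(n):
        p *= Fraction(n - k, N - k)
    return p


N, n = 6, 2
print(prob_specified_sample(N, n), Fraction(1, math.comb(N, n)))

# Drawing an SRS without replacement from units numbered 1..N (seed fixed for repeatability).
sample = random.Random(1).sample(range(1, N + 1), n)
print(sample)
```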
4.4 Definition and Notations
The primary objective of this study is to estimate annual parameters, for example the
average annual income of a Filipino household based on FIES 2019. Let the households in the
population be indexed by i = 1, 2, 3, ..., N, where household i has xi as the
characteristic of interest. Capital letters refer to characteristics of the population and
lowercase letters refer to characteristics of the sample. Note that the sample does not
necessarily consist of the first n units in the population; the sample units are indexed
1 to n only to keep the notation simple.
Population Mean
\[ \bar{X} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} \]
Sample Mean
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Suppose {πi ≥ 0 : i = 1, 2, ..., N} is a set of known selection probabilities for the units
of the population, such that πi = n/N for SRS.
4.5 Strategy ∅ (S∅)
This section discusses the first strategy, which represents the data collection method of
FIES 2018, where t1 and t2 denote visit one and visit two, respectively. Suppose a simple
random sample of size n is drawn from a finite population of N units. From this same sample,
measurements are taken at two time points, t1 and t2. Let x1i and x2i be the measurements
of variable X on unit i at times 1 and 2, respectively, and let xi = x1i + x2i. In the
context of FIES, x1i represents the income or expenditure incurred by household i for the
period January to June, and x2i for the period July to December. In this context, xi
represents the annual value for household i = 1, 2, 3, ..., N. Figure 4.2 illustrates the
complete overlapping data collection of Strategy ∅.
Figure 4.2: Strategy ∅ with complete overlap of interviews for two visits. Each household
is interviewed in both visits, visit one (t1) and visit two (t2).
Suppose one is interested in estimating the population mean X̄, defined as
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{4.1} \]
This can be interpreted as the mean annual value of a specific income or expenditure
variable of FIES. Further, equation 4.1 can be expressed as the sum of the means of the
first visit and the second visit:
\[
\begin{aligned}
\bar{X} &= \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{1}{N}\sum_{i=1}^{N} (x_{1i} + x_{2i})
= \frac{1}{N}\sum_{i=1}^{N} x_{1i} + \frac{1}{N}\sum_{i=1}^{N} x_{2i} \\
\bar{X} &= \bar{X}_1 + \bar{X}_2
\end{aligned}
\tag{4.2}
\]
where X̄1 is the true mean for the first visit and X̄2 is the true mean for the second visit.
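Simple random sampling of n units gives an unbiased estimator of the parameter X̄ as
\[
\begin{aligned}
\hat{\bar{x}} &= \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{i=1}^{n} (x_{1i} + x_{2i})
= \frac{1}{n}\sum_{i=1}^{n} x_{1i} + \frac{1}{n}\sum_{i=1}^{n} x_{2i} \\
\hat{\bar{x}} &= \hat{\bar{x}}_1 + \hat{\bar{x}}_2
\end{aligned}
\tag{4.3}
\]
where $\hat{\bar{x}}_1$ is the sample mean for the first semester and $\hat{\bar{x}}_2$ is
the sample mean for the second semester.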
To show that $\hat{\bar{x}}$ is unbiased, note that the bias of an estimator is the
difference between the expected value of the estimator and the parameter. If the expected
value of the estimator is equal to its corresponding parameter, then the estimator is
unbiased.
From SRS theory,
\[
E(\hat{\bar{x}}) = E(\hat{\bar{x}}_1 + \hat{\bar{x}}_2)
= E(\hat{\bar{x}}_1) + E(\hat{\bar{x}}_2) = \bar{X}_1 + \bar{X}_2 = \bar{X}
\tag{4.4}
\]
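This shows that the estimator $\hat{\bar{x}}$ of Strategy ∅ is unbiased. We now turn to the
variance of Strategy ∅. Using SRS theory, the variance of $\hat{\bar{x}}$ is
\[
\begin{aligned}
V(\hat{\bar{x}}) &= V(\hat{\bar{x}}_1 + \hat{\bar{x}}_2)
= V(\hat{\bar{x}}_1) + V(\hat{\bar{x}}_2) + 2\,\mathrm{cov}(\hat{\bar{x}}_1, \hat{\bar{x}}_2) \\
V(\hat{\bar{x}}) &= \left(\frac{1}{n} - \frac{1}{N}\right)
\left(s_{x_1}^2 + s_{x_2}^2 + 2\rho_{x_1 x_2} s_{x_1} s_{x_2}\right)
\end{aligned}
\tag{4.5}
\]
where $\rho_{x_1 x_2}$ is the correlation between the value from the first visit and the
value from the second visit in the population, and $s_{x_1}^2$ and $s_{x_2}^2$ are the
variances of the values from the first visit and the second visit.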
Note that in SRS x̄ is an unbiased estimator of X̄, with unbiased estimator of its variance
\[ v(\bar{x}) = s_{\bar{x}}^2 = \left(\frac{1}{n} - \frac{1}{N}\right) s_x^2 \]
where
\[ s_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X})^2 \]
It is important to look into the relationship between the variances of the two visits.
Since the variances are constants, one can be written as a multiple of the other. Let
$s_{x_2}^2 = k\,s_{x_1}^2$ for some constant k > 0, so that $s_{x_2} = \sqrt{k}\,s_{x_1}$.
The variance of Strategy ∅ then becomes
\[
\begin{aligned}
V(\hat{\bar{x}}) &= \left(\frac{1}{n} - \frac{1}{N}\right)
\left(s_{x_1}^2 + k\,s_{x_1}^2 + 2\sqrt{k}\,\rho_{x_1 x_2}\, s_{x_1}^2\right) \\
V(\hat{\bar{x}}) &= \left(\frac{1}{n} - \frac{1}{N}\right) s_{x_1}^2
\left(1 + k + 2\sqrt{k}\,\rho_{x_1 x_2}\right)
\end{aligned}
\tag{4.6}
\]
Using Stata as the statistical software, an empirical illustration of the values of k and
$\rho_{x_1 x_2}$ is given in Chapter 5. For illustration purposes, FIES 2009 is used because
FIES 2018 data is not yet available. Take note that these values assume simple random
sampling for this chapter of the study.