Chapter 1
OBTAINING DATA
Introduction
Statistics may be defined as the science that deals with the collection, organization, presentation, analysis, and interpretation of data in order to draw judgments or conclusions that help in the decision-making process. The two parts of this definition correspond to the two main divisions of Statistics: Descriptive Statistics and Inferential Statistics. Descriptive Statistics, referred to in the first part of the definition, deals with the procedures that organize, summarize, and describe quantitative data; it seeks merely to describe data. Inferential Statistics, implied in the second part of the definition, deals with making a judgment or conclusion about a population based on the findings from a sample taken from that population.
Intended Learning Outcomes
At the end of this module, it is expected that the students will be able to:
1. Demonstrate an understanding of the different methods of obtaining data.
2. Explain the procedures in planning and conducting surveys and experiments.
Statistical Terms
Before proceeding to the discussion of the different methods of obtaining data, let us first define some statistical terms:
Population or Universe refers to the totality of objects, persons, places, or things used in a particular study: all members of a particular group of objects (items) or people (individuals) who are the subjects or respondents of the study.
Sample is any subset of a population, or a few members of a population.
Data are facts, figures and information collected on some characteristics of a population
or sample. These can be classified as qualitative or quantitative data.
Ungrouped (or raw) data are data that are not organized in any specific way; they are simply the collection of data as they are gathered.
Grouped Data are raw data organized into groups or categories with corresponding frequencies. Organized in this manner, the data are referred to as a frequency distribution.
Parameter is a descriptive measure of a characteristic of a population.
Statistic is a measure of a characteristic of a sample.
Constant is a characteristic or property of a population or sample which is common to all
members of the group.
Variable is a measure, characteristic, or property of a population or sample that may take a number of different values. It differentiates a particular member from the rest of the group and is the characteristic or property that is measured, controlled, or manipulated in research. Variables differ in many respects, most notably in the role they play in the research and in the type of measures that can be applied to them.
1.1 Methods of Data Collection
Collection of data is the first step in conducting a statistical inquiry. It refers to data gathering, a systematic method of collecting and measuring data from different sources of information in order to provide answers to relevant questions. This involves acquiring information from published literature, surveys through questionnaires or interviews, experimentation, documents and records, tests or examinations, and other data-gathering instruments. The person who conducts the inquiry is an investigator, the one who helps in collecting the information is an enumerator, and the information is collected from a respondent. Data can be primary or secondary. According to Wessel, “Data collected in the process of investigation are known as primary data.” These are collected from the primary source for the investigator’s own use. Secondary data, on the other hand, are collected by some other organization for its own use, but the investigator also obtains them for his use. According to M.M. Blair, “Secondary data are those already in existence for some other purpose than answering the question in hand.”
In the field of engineering, the three basic methods of collecting data are the retrospective study, the observational study, and the designed experiment. A retrospective study uses the population or a sample of historical data that has been archived over some period of time. It may involve a significant amount of data, but those data may contain relatively little useful information about the problem: some relevant data may be missing, recording or transcription errors may be present, or other important data may never have been gathered and archived. As a result, statistical analysis of historical data can identify interesting phenomena, but solid and reliable explanations are difficult to obtain.
In an observational study, by contrast, a process or population is observed and disturbed as little as possible, and the quantities of interest are recorded. In a designed experiment, deliberate or purposeful changes are made in the controllable variables of the system or process, the resulting output data are observed, and an inference or decision is made about which variables are responsible for the observed changes in output performance. Experiments designed with basic principles such as randomization
are needed to establish cause-and-effect relationships. Much of what we know in the
engineering and physical-chemical sciences is developed through testing or
experimentation. In engineering, there are problem areas in which no scientific or engineering theory is directly or completely applicable, so experimentation and observation of the resulting data are the only way to solve them. At other times there is a good underlying scientific theory to explain the phenomena of interest, yet tests or experiments are almost always still necessary to confirm the applicability and validity of the theory in a specific situation or environment. Designed experiments are very important in engineering design and development and in the improvement of manufacturing processes, in which statistical thinking and statistical methods play an important role in planning, conducting, and analyzing the data. (Montgomery, et al., 2018)
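As a small sketch of the randomization principle described above, the following Python snippet randomly assigns experimental units to treatment groups. The unit labels and temperature settings are hypothetical, invented only for illustration:

```python
import random

def randomize_assignment(units, treatments, seed=None):
    """Randomly assign each experimental unit to a treatment group.

    Randomization helps establish cause-and-effect relationships by
    preventing systematic bias in which units receive which treatment.
    """
    rng = random.Random(seed)
    shuffled = units[:]          # copy so the original list is untouched
    rng.shuffle(shuffled)
    # Deal the shuffled units round-robin into the treatment groups.
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Example: six test specimens assigned at random to two temperature settings.
groups = randomize_assignment(["S1", "S2", "S3", "S4", "S5", "S6"],
                              ["150C", "180C"], seed=42)
```

Because the shuffle is random, neither group is systematically favored; the fixed seed here only makes the illustration reproducible.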
1.2 Planning and Conducting Surveys
A survey is a method of asking respondents well-constructed questions. It is an efficient and easy-to-administer way of collecting a wide variety of information. The researcher can stay focused and stick to the questions that interest him and are necessary to his statistical inquiry or study.
However, surveys depend on the respondents' honesty, motivation, memory, and ability to respond, and answers can sometimes yield vague data. Surveys can be done
through face-to-face interviews or self-administered through the use of questionnaires.
The advantages of face-to-face interviews include fewer misunderstood questions, fewer
incomplete responses, higher response rates, and greater control over the environment
in which the survey is administered; also, the researcher can collect additional information
if any of the respondents’ answers need clarifying. The disadvantages of face-to-face
interviews are that they can be expensive and time-consuming and may require a large
staff of trained interviewers. In addition, responses can be biased by the appearance or attitude of the interviewer.
Self-administered surveys are less expensive than interviews. They can be administered in large numbers, do not require many interviewers, and put less pressure on respondents. However, in self-administered surveys, respondents are more likely to stop participating midway through the survey, and the researcher cannot ask them to clarify their answers. Response rates are also lower than in personal interviews.
When designing a survey, the following steps are useful:
1. Determine the objectives of your survey: What questions do you want to answer?
2. Identify the target population sample: Whom will you interview? Who will be the
respondents? What sampling method will you use?
3. Choose an interviewing method: face-to-face interview, phone interview, self-
administered paper survey, or internet survey.
5. Decide what questions you will ask, in what order, and how to phrase them.
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.
In choosing the respondents, sampling techniques are necessary. Sampling is the process of selecting units (e.g., people, organizations) from a population of interest. The sample must be representative of the target population, which is the entire group the researcher is interested in: the group about which the researcher wishes to draw conclusions.
There are two ways of selecting a sample: non-probability sampling and probability sampling.
Non-Probability Sampling
Non-probability sampling is also called judgment or subjective sampling. This
method is convenient and economical but the inferences made based on the findings are
not so reliable. The most common types of non-probability sampling are convenience sampling, purposive sampling, and quota sampling.
In convenience sampling, the researcher obtains information from the respondents who are most conveniently available. This favors the researcher's convenience but can introduce bias into the sample.
In purposive sampling, the selection of respondents is predetermined according to
the characteristic of interest made by the researcher. Randomization is absent in this type
of sampling.
There are two types of quota sampling: proportional and non-proportional. In proportional quota sampling, the major characteristics of the population are represented by sampling a proportional amount of each.
For instance, if you know the population has 40% women and 60% men, and that
you want a total sample size of 100, you will continue sampling until you get those
percentages and then you will stop.
Non-proportional quota sampling is a bit less restrictive. In this method, a minimum number of sampled units is specified for each category, without concern for having numbers that match the proportions in the population.
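The quota idea above can be sketched in Python. This is a hypothetical illustration: the sampling frame, category labels, and quotas are invented, and members are taken in the order encountered (no randomization), as is typical of quota sampling:

```python
def quota_sample(frame, quotas):
    """Fill quotas by taking members in the order encountered (non-random).

    `frame` is a list of (member, category) pairs; `quotas` maps each
    category to the number of members required from it.
    """
    remaining = dict(quotas)
    sample = []
    for member, category in frame:
        if remaining.get(category, 0) > 0:
            sample.append(member)
            remaining[category] -= 1
        if all(v == 0 for v in remaining.values()):
            break  # all quotas filled; stop sampling
    return sample

# Proportional quotas for a sample of 10 from a 40% women / 60% men
# population: keep sampling until 4 women and 6 men are obtained.
frame = [(f"P{i}", "woman" if i % 2 == 0 else "man") for i in range(40)]
picked = quota_sample(frame, {"woman": 4, "man": 6})
```

Once a category's quota is full, further members of that category are skipped, mirroring the "continue sampling until you get those percentages and then stop" rule in the text.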
Probability Sampling
In probability sampling, every member of the population has a known, nonzero chance of being selected as part of the sample. There are several probability sampling techniques. Among
these are simple random sampling, stratified sampling and cluster sampling.
Simple Random Sampling
Simple random sampling is the basic sampling technique where a group of
subjects (a sample) is selected for study from a larger group (a population). Each
individual is chosen entirely by chance and each member of the population has an equal
chance of being included in the sample. Every possible sample of a given size has the
same chance of selection; i.e. each member of the population is equally likely to be
chosen at any stage in the sampling process.
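A minimal sketch of simple random sampling in Python, using a hypothetical population of 100 labeled units; `random.sample` draws without replacement, so every subset of the requested size is equally likely:

```python
import random

population = list(range(1, 101))   # hypothetical population: units labeled 1..100
random.seed(7)                     # fixed seed only to make the draw reproducible
sample = random.sample(population, 10)  # every 10-element subset is equally likely
```

In practice the seed would be omitted (or drawn from a physical source of randomness) so the selection is unpredictable.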
Stratified Sampling
There may often be factors which divide up the population into sub-populations
(groups / strata) and the measurement of interest may vary among the different sub-
populations. This has to be accounted for when a sample from the population is selected
in order to obtain a sample that is representative of the population. This is achieved by
stratified sampling.
A stratified sample is obtained by taking samples from each stratum or sub-group
of a population. When a sample is to be taken from a population with several strata, the
proportion of each stratum in the sample should be the same as in the population.
Stratified sampling techniques are generally used when the population is
heterogeneous, or dissimilar, where certain homogeneous, or similar, sub-populations
can be isolated (strata). Simple random sampling is most appropriate when the entire
population from which the sample is taken is homogeneous. Some reasons for using
stratified sampling over simple random sampling are:
1. the cost per observation in the survey may be reduced;
2. estimates of the population parameters may be wanted for each subpopulation;
3. accuracy may be increased at a given cost.
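The proportional-allocation rule described above can be sketched in Python. The strata and their sizes here are hypothetical; each stratum contributes to the sample in proportion to its share of the population:

```python
import random

def stratified_sample(strata, total_size, seed=None):
    """Draw a proportionally allocated stratified sample.

    `strata` maps each stratum name to its list of members; each
    stratum's share of the sample matches its share of the population
    (up to rounding).
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, n)  # simple random sample per stratum
    return sample

# Hypothetical population: 60 freshmen and 40 sophomores; sample of 10
# should contain 6 freshmen and 4 sophomores.
strata = {"freshman": [f"F{i}" for i in range(60)],
          "sophomore": [f"S{i}" for i in range(40)]}
s = stratified_sample(strata, 10, seed=1)
```

Within each stratum the draw is a simple random sample, which matches the idea that stratification is applied on top of random selection.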
Cluster Sampling
Cluster sampling is a sampling technique where the entire population is divided
into groups, or clusters, and a random sample of these clusters is selected. All
observations in the selected clusters are included in the sample.
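A brief sketch of cluster sampling in Python, with hypothetical clusters (for example, villages of households); whole clusters are drawn at random and every member of a selected cluster enters the sample:

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and include every member of each.

    `clusters` maps each cluster name to its list of members.
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)  # pick cluster names at random
    sample = []
    for name in chosen:
        sample.extend(clusters[name])  # all observations in a chosen cluster
    return sample

# Hypothetical clusters of unequal sizes, e.g., villages of households.
clusters = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"],
            "C": ["C1", "C2", "C3", "C4"], "D": ["D1"]}
households = cluster_sample(clusters, 2, seed=3)
```

Note the contrast with stratified sampling: there, some members are drawn from every stratum; here, all members are taken from only the selected clusters.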
1.3 Planning and Conducting Experiments: Introduction to Design of Experiments
The products and processes in the engineering and scientific disciplines are mostly
derived from experimentation. An experiment is a series of tests conducted in a
systematic manner to increase the understanding of an existing process or to explore a
new product or process. Design of Experiments, or DOE, is a tool to develop an
experimentation strategy that maximizes learning using minimum resources. Design of
Experiments is widely and extensively used by engineers and scientists to improve existing processes by maximizing the yield and decreasing the variability, or to develop new products and processes. It is a technique for identifying the "vital few" factors in the most efficient manner and then directing the process to its best setting to meet the ever-increasing demand for improved quality and increased productivity.
The methodology of DOE ensures that all factors and their interactions are systematically investigated, resulting in reliable and complete information. There are five
stages to be carried out for the design of experiments. These are planning, screening,
optimization, robustness testing and verification.
1. Planning
It is important to carefully plan the course of experimentation before embarking on the process of testing and data collection. At this stage, the objectives of the experiment or investigation are identified, and the time and resources available to achieve them are assessed. The team conducting the investigation should be composed of individuals from the different disciplines related to the product or process. They are to identify the possible factors to investigate and the most appropriate responses to measure. A team approach promotes synergy, giving a richer set of factors to study and thus a more complete experiment. Carefully planned experiments always lead to increased understanding of the product or process, and well-planned experiments are easy to execute and analyze using available statistical software.
2. Screening
Screening experiments are used to identify, out of a large pool of potential factors, the important factors that affect the process under investigation. The screening process eliminates unimportant factors so that attention can be focused on the key factors. Screening experiments are usually efficient designs that require few runs and focus on the vital factors rather than on interactions.
3. Optimization
After the important factors affecting the process have been narrowed down, the next step is to determine the best settings of these factors to achieve the objectives of the investigation. Depending on the product or process under investigation, the objective may be to increase yield, to decrease variability, or to find settings that achieve both at the same time.
4. Robustness Testing
Once the optimal settings of the factors have been determined, it is important to make the
product or process insensitive to variations resulting from changes in factors that affect
the process but are beyond the control of the analyst. Such factors are referred to as
noise or uncontrollable factors that are likely to be experienced in the application
environment. It is important to identify such sources of variation and take measures to
ensure that the product or process is made robust or insensitive to these factors.
5. Verification
This final stage involves validating the optimum settings by conducting a few follow-up experimental runs to confirm that the process functions as expected and that all objectives are achieved.