LECTURE 2
Data Collection
Contents
Basic Concepts
Scales of Measurement
Sampling Concepts
Sampling Methods
Surveys
Objectives
Explain the distinction between numerical and categorical data
Explain the difference between time series and cross-sectional data
Recognize levels of measurement in data
Explain the common sampling methods and how to implement them
Describe basic elements of survey design, survey types, and sources of error
Recognize a Likert scale and know how to use it
Definition
What is data
Data are facts and figures… collected for analysis, presentation and interpretation.
Variable: a characteristic of the items that we want to study (e.g., student name, gender, date of birth).
Observation: a single member of the collection of items that we want to study, such as a student, firm, or region.
Data set: all the values of all of the variables for all of the observations we chose.
What is data
Each column is a variable; each row is an observation.

Employee Name     Gender    DOB            Annual Income in $
Gladys Simpson    Female    1-May-1971     120,000
Divid Hinds       Male      17-Dec-1968    135,000
Kenneth Henry     Male      3-Sep-1965     98,000

A data set with 3 observations
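As a rough sketch of how these terms map onto code, the data set above can be held as a list of records in Python. The field names (`name`, `gender`, `dob`, `annual_income`) are assumptions chosen for illustration; the values are copied from the table.

```python
# Each dictionary is one observation; each key is one variable.
data_set = [
    {"name": "Gladys Simpson", "gender": "Female", "dob": "1-May-1971", "annual_income": 120_000},
    {"name": "Divid Hinds", "gender": "Male", "dob": "17-Dec-1968", "annual_income": 135_000},
    {"name": "Kenneth Henry", "gender": "Male", "dob": "3-Sep-1965", "annual_income": 98_000},
]

# Number of observations (rows) and the list of variables (columns).
n_observations = len(data_set)
variables = list(data_set[0].keys())

# All values of a single variable across the observations.
incomes = [row["annual_income"] for row in data_set]
```

Here `gender` is a categorical variable and `annual_income` is a numerical one.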
Types of Data
Categorical or qualitative data have values that
are described by words; the values may be coded numerically.
Numerical or quantitative data come from
counting, measuring, or some mathematical operation.
Time Series Data
Each observation represents a different
equally spaced point in time.
Periodicity may be annual, quarterly,
monthly, weekly, daily, hourly, etc.
To study trends and patterns over time
Example: daily closing price of a certain
stock recorded last week.
Cross-Sectional Data
Each observation represents a different
individual unit at the same point in time.
To study variation among observations or
relationships.
Example: daily closing prices of a group of
20 stocks recorded on December 1, 2015.
Pooled Data
Combine the two data types to get pooled
cross-sectional and time series data.
Example: daily closing price of a group of
20 stocks recorded last week.
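One way to see the distinction is to look at which units and which time periods appear in a data set. The sketch below is only an illustration: the function name `classify`, the stock labels, and the day labels are hypothetical, and no prices are included.

```python
def classify(observations):
    """Classify a data set from its list of (unit, period) pairs."""
    units = {unit for unit, _ in observations}
    periods = {period for _, period in observations}
    if len(units) > 1 and len(periods) > 1:
        return "pooled"            # several units, several periods
    if len(periods) > 1:
        return "time series"       # one unit followed over time
    if len(units) > 1:
        return "cross-sectional"   # several units at one point in time
    return "undetermined"          # fewer than two observations

stocks = [f"stock_{i}" for i in range(1, 21)]   # 20 hypothetical stocks
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]      # one trading week

one_stock_one_week = [("stock_1", day) for day in days]
twenty_stocks_one_day = [(stock, "2015-12-01") for stock in stocks]
twenty_stocks_one_week = [(stock, day) for stock in stocks for day in days]
```

Calling `classify` on the three lists mirrors the three examples in the slides: one stock over a week, 20 stocks on one day, and 20 stocks over a week.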
Collecting Data
Primary (data collection):
Observation
Survey
Experimentation
Secondary (data compilation):
Print or electronic sources
Scales of Measurement
Scales of measurement include:
Nominal
Ordinal
Interval
Ratio
The scale determines the amount of information
contained in the data.
The scale indicates the data summarization and
statistical analyses that are most appropriate.
Nominal Scale
Data are labels or names used to identify an
attribute of the element.
Example:
Students of a university are classified by school as Business,
Humanities, Education, and so on.
Alternatively, a numeric code could be used for the
school variable (e.g. 1 denotes Business, 2 denotes
Humanities, 3 denotes Education, and so on).
No ordering.
Ordinal Scale
The data have the properties of nominal data and
the order or rank of the data is meaningful.
Example:
Students of a university are classified as Freshman,
Sophomore, Junior, or Senior.
Alternatively, a numeric code could be used for the
class standing variable (e.g. 1 denotes Freshman, 2
denotes Sophomore, and so on).
Ordering, but differences have no meaning.
Interval Scale
The data have the properties of ordinal data, and the difference
between measurements is a meaningful quantity, but the
measurements have no true zero value.
Example:

Celsius    Fahrenheit
0 °C       32.0 °F
1 °C       33.8 °F
2 °C       35.6 °F
3 °C       37.4 °F
4 °C       39.2 °F
5 °C       41.0 °F
6 °C       42.8 °F

The difference between a temperature of 0 °C and 2 °C is the
same as the difference between 2 °C and 4 °C, but we could not
say that 4 °C is twice as hot as 2 °C.
Differences have meaning, but ratios have no
meaning.
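A short calculation makes the point concrete. The sketch below uses the standard Celsius-to-Fahrenheit conversion and the table values for 0 °C, 2 °C, and 4 °C; the variable names are chosen only for illustration.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32

# Equal differences in Celsius stay equal after converting to Fahrenheit.
diff_f_low = celsius_to_fahrenheit(2) - celsius_to_fahrenheit(0)
diff_f_high = celsius_to_fahrenheit(4) - celsius_to_fahrenheit(2)

# The ratio of the same two temperatures depends on the unit used.
ratio_c = 4 / 2
ratio_f = celsius_to_fahrenheit(4) / celsius_to_fahrenheit(2)
```

Both differences come out the same, but the ratio is 2 in Celsius and roughly 1.10 in Fahrenheit. Because the zero point is arbitrary, "twice as hot" has no unit-independent meaning.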
Ratio Scale
The data have all the properties of interval data
and the ratio of two values is meaningful.
The measurements have a true zero value.
Example:
Kevin has $200, while Melissa has $100. Kevin has
twice as much money as Melissa.
Ratios have meaning.
Scales of Measurement
Ratio data: differences between measurements and ratios (e.g., money)
Interval data: differences between measurements but no ratios (e.g., year, temperature, …)
Ordinal data: ordered categories
Nominal data: categories (no ordering)
Quiz
Classify each of the following as Nominal, Ordinal, Interval
or Ratio data:
1. letter grade you will receive in this class: ordinal
2. country you were born in: nominal
3. amount of money you have: ratio
4. gender of customer: nominal
5. brand of chocolate you prefer: nominal
6. year of your birth: interval
7. weight of a package: ratio
8. satisfaction rating from 1 to 5: ordinal
9. pizza sizes (Small, Medium, Large, Extra Large): ordinal
Quiz
An investment firm rates bonds for AardCo Inc. as
"B+" while bonds of Deva Corp. are rated "AA."
Which level of measurement would be appropriate
for such data?
Quiz
We can perform different operations on the various
types of data. The operations available for each type,
numbered as in the list below, are
Nominal: 1; Ordinal: 1, 2; Interval: 1, 2, 3; Ratio: 1, 2, 3, 4.
1. count the frequencies
2. put in order
3. add items
4. divide one item by another
Sampling Concepts
Population vs. Sample
A population is the collection of all items of
interest or under investigation, could be finite or
infinite.
A census is an examination of all items in a
defined population.
A sample is an observed subset of the
population.
Population vs. Sample
[Figure: a population of 26 items labeled a to z, and a sample of 9 of them: b, c, g, i, n, o, r, u, y.]
Parameters vs. Statistics
A parameter is a specific characteristic of a population
A statistic is a specific characteristic of a sample
Sampling Concepts
The population must be carefully specified and
the sample must be drawn scientifically so that
the sample is representative.
The target population is the population we are
interested in (e.g., U.S. gasoline prices).
The sampling frame is the group from which we
take the sample (e.g., 115,000 stations).
Sampling Concepts
If we allow duplicates when sampling, then we
are sampling with replacement.
Duplicates are unlikely when n is much smaller
than N.
If we do not allow duplicates when sampling,
then we are sampling without replacement.
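The claim that duplicates are unlikely when n is much smaller than N can be checked numerically. When n items are drawn with replacement from N equally likely items, the probability that no item repeats is (N/N) × ((N-1)/N) × … × ((N-n+1)/N). The function name below is hypothetical.

```python
from math import prod

def prob_no_duplicates(N, n):
    """Probability that n draws with replacement from N items are all distinct."""
    return prod((N - i) / N for i in range(n))

# n much smaller than N: duplicates are very unlikely.
p_small_n = prob_no_duplicates(1_000_000, 10)

# n not small relative to N: duplicates become likely.
p_large_n = prob_no_duplicates(365, 23)
```

With N = 1,000,000 and n = 10, the probability of no duplicates is above 0.9999, so sampling with and without replacement behave almost identically. With N = 365 and n = 23 it falls below one half.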
Sampling Methods
Nonstatistical (non-random) sampling:
Convenience
Judgment
Focus group
Statistical sampling:
Simple random
Systematic
Stratified
Cluster
Nonstatistical Sampling
(Non-random Sampling)
Convenience Sample
Use a sample that happens to be available (e.g., ask
co-worker opinions at lunch).
Judgment Sample
Use expert knowledge to choose “typical” items (e.g.,
which employees to interview).
Focus Groups
In-depth dialog with a representative panel of
individuals (e.g. iPhone users).
Statistical Sampling
Items of the sample are chosen based on
known or calculable probabilities.
Statistical sampling (probability sampling) methods:
simple random, systematic, stratified, and cluster.
Simple Random Sampling
Every member of the population has an equal
chance of being selected
Every possible sample of a given size has an equal
chance of being selected
Selection may be with replacement or without
replacement
The sample can be obtained using a table of random
numbers or computer random number generator
Computer Methods
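A minimal sketch of drawing a simple random sample with a computer random number generator, using Python's standard `random` module. The function name `simple_random_sample` and the frame of ID numbers are assumptions made for illustration.

```python
import random

def simple_random_sample(frame, n, replace=False, rng=None):
    """Draw n items from the frame, each item equally likely on every draw."""
    if rng is None:
        rng = random.Random()
    if replace:
        # With replacement: the same item may be chosen more than once.
        return rng.choices(frame, k=n)
    # Without replacement: every item appears at most once.
    return rng.sample(frame, n)

# Hypothetical sampling frame: items identified by the numbers 1 to 100.
frame = list(range(1, 101))
rng = random.Random(2015)   # fixed seed (arbitrary) so the draw is repeatable
sample_without = simple_random_sample(frame, 10, rng=rng)
sample_with = simple_random_sample(frame, 10, replace=True, rng=rng)
```

Fixing the seed is a design choice that makes the sample reproducible for auditing; omit it to get a fresh sample on each run.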
Systematic Random Sampling
Decide on sample size: n
Divide frame of N individuals into n groups of k
individuals: k=N/n
Randomly select one individual from the first
group
Select every kth individual thereafter
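Example: N = 64, n = 8, so k = 64/8 = 8. One individual is chosen
at random from the first group of 8, then every 8th individual thereafter.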
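The steps above can be sketched as follows. This is an illustration rather than a definitive implementation: the function name is hypothetical, and it assumes the frame is a list whose length N is at least n (any remainder of N/n is ignored).

```python
import random

def systematic_sample(frame, n, rng=None):
    """Pick a random start in the first group, then every kth item."""
    if rng is None:
        rng = random.Random()
    N = len(frame)
    if n < 1 or n > N:
        raise ValueError("n must be between 1 and the frame size")
    k = N // n                  # group size, k = N / n
    start = rng.randrange(k)    # random position within the first group
    return [frame[start + i * k] for i in range(n)]

# The slide example: N = 64 individuals, n = 8, so k = 8.
frame = list(range(1, 65))
sample = systematic_sample(frame, 8, rng=random.Random(1))
```

The first selected individual falls in positions 1 to 8, and each later one is exactly 8 positions after the previous one.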
Stratified Random Sampling
Divide population into subgroups (called strata)
according to some common characteristic (e.g.
age, gender, occupation)
Select a simple random sample from each
subgroup
Combine samples from subgroups into one sample
[Figure: a population divided into 4 strata, with a sample drawn from each stratum.]
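A minimal sketch of these three steps, assuming the same number of items is drawn from every stratum (the slides do not say how to allocate the sample across strata, so this is only one possible choice). The function name, the `stratum_of` parameter, and the stratum labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n_per_stratum, rng=None):
    """Simple random sample of n_per_stratum items from each stratum."""
    if rng is None:
        rng = random.Random()
    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    # Steps 2 and 3: sample within each stratum and combine the results.
    combined = []
    for label in sorted(strata):
        combined.extend(rng.sample(strata[label], n_per_stratum))
    return combined

# Hypothetical population: 4 strata (S1 to S4) of 10 items each.
population = [(f"S{s}", i) for s in range(1, 5) for i in range(10)]
sample = stratified_sample(population, lambda item: item[0], 2,
                           rng=random.Random(3))
```

Every stratum is guaranteed to appear in the sample, which a simple random sample of the whole population does not guarantee.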
Cluster Sampling
Divide population into several “clusters” (e.g.
regions), each representative of the population
One-stage cluster sampling: randomly select k clusters and
include all of their elements.
Two-stage cluster sampling: randomly select k clusters and
then choose a random sample of elements within each cluster.
[Figure: a population divided into 16 clusters, with randomly selected clusters forming the sample.]
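Both variants can be sketched with one function. This is an illustration under assumptions: the function name, the parameter `m` (elements drawn per selected cluster in the second stage), and the cluster labels are hypothetical. One-stage sampling is taken to mean that every element of each selected cluster enters the sample.

```python
import random

def cluster_sample(clusters, k, m=None, rng=None):
    """Select k clusters at random; take all elements (m=None) or m from each."""
    if rng is None:
        rng = random.Random()
    chosen = rng.sample(sorted(clusters), k)   # stage 1: pick clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if m is None:
            sample.extend(members)                   # one-stage: whole cluster
        else:
            sample.extend(rng.sample(members, m))    # two-stage: sample within
    return chosen, sample

# Hypothetical population: 16 clusters of 5 elements each.
clusters = {f"C{i:02d}": [f"C{i:02d}-{j}" for j in range(5)] for i in range(16)}
chosen_1, one_stage = cluster_sample(clusters, k=4, rng=random.Random(7))
chosen_2, two_stage = cluster_sample(clusters, k=4, m=2, rng=random.Random(7))
```

Unlike stratified sampling, only the selected clusters contribute elements, which is why each cluster should be representative of the population.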
Quiz
Professor Hardtack chose a sample of 7 students from his
statistics class of 35 students by picking every student who
was wearing red that day. Which kind of sample is this?
Convenience
Thirty work orders are selected from a filing cabinet containing
500 work order folders by choosing every 15th folder. Which
sampling method is this?
Systematic
Quiz
A manager chose two people from her team of eight to give an
oral presentation because she felt they were representative of
the whole team's views. What sampling technique did she use
in choosing these two people?
Judgment
From its 32 regions, the FAA selects 6 regions, and then
randomly audits 25 departing commercial flights in each
region for compliance with legal fuel and weight requirements.
This is an example of what sampling technique?
Cluster (two-stage)
Survey
Basic Steps of Survey Research
Step 1: State the goals of the research
Step 2: Develop the budget (time, money, staff)
Step 3: Create a research design (target population,
frame, sample size).
Step 4: Choose a survey type and method.
Step 5: Design a data collection instrument
(questionnaire).
Step 6: Pretest the survey instrument and revise as
needed.
Step 7: Conduct the survey.
Step 8: Code and analyze the data.
Questionnaire Design
Begin with short, clear instructions.
State the survey purpose.
Assure anonymity.
Instruct on how to submit the completed survey.
Break the survey into naturally occurring sections.
Let respondents bypass sections that are not applicable
(e.g., “if you answered no to question 7, skip directly to
question 15”).
Types of Questions
Open-ended
Fill-in-the-blank
Check boxes
Ranked choices
Pictograms
Likert scale
Likert Scales
Likert Scales (examples)
Question Wording
The way a question is asked has a profound
influence on the response. For example,
1. Shall state taxes be cut?
2. Shall state taxes be cut, if it means reducing
highway maintenance?
3. Shall state taxes be cut, if it means firing teachers
and police?
Question Wording
Make sure you have covered all the
possibilities, for example:
Are you married? Yes No
Avoid overlapping classes or unclear
categories, for example:
How old is your father?
35 – 45
45 – 55
55 – 65
65 or older
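The problem with these age classes can be checked mechanically: every possible answer should fall into exactly one class. In the sketch below the function names and the revised classes are assumptions made for illustration; the original classes are copied from the example, treating each range as including both endpoints.

```python
def classes_containing(age, classes):
    """Labels of all classes whose range includes the given age."""
    return [label for label, low, high in classes
            if low <= age and (high is None or age <= high)]

def problem_ages(classes, ages=range(0, 121)):
    """Ages that match no class (a gap) or more than one class (an overlap)."""
    problems = {}
    for age in ages:
        labels = classes_containing(age, classes)
        if len(labels) != 1:
            problems[age] = labels
    return problems

# Classes from the example; None means no upper limit.
original = [("35 to 45", 35, 45), ("45 to 55", 45, 55),
            ("55 to 65", 55, 65), ("65 or older", 65, None)]

# One possible revision (an assumption, not from the slides).
revised = [("Under 35", 0, 34), ("35 to 44", 35, 44), ("45 to 54", 45, 54),
           ("55 to 64", 55, 64), ("65 or older", 65, None)]

original_problems = problem_ages(original)
revised_problems = problem_ages(revised)
```

For the original classes, ages 45, 55, and 65 each match two classes, and ages under 35 match none. The revised classes leave no problem ages.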
Sources of Errors
Source of Error      Characteristics
Nonresponse bias     Respondents differ from non-respondents
Selection bias       Self-selected respondents are atypical
Response error       Respondents give false information
Coverage error       Incorrect specification of frame or population
Measurement error    Unclear survey instrument wording
Interviewer error    Responses influenced by interviewer
Sampling error       Random and unavoidable
Coding and Data Screening
Responses are usually coded numerically
(e.g., 1 = male, 2 = female).
Missing values are typically denoted by special
characters (e.g., blank, “.” or “*”).
Discard questionnaires that are flawed or
missing many responses.
Watch for multiple responses, inconsistent
replies, or answers given as a range.
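A minimal sketch of these coding and screening rules. The gender codes and missing-value markers come from the examples above; the function names, the 1 to 5 rating range, and the 50% cutoff for discarding a questionnaire are assumptions made for illustration.

```python
GENDER_CODES = {"male": 1, "female": 2}
MISSING_MARKERS = {"", ".", "*"}

def code_gender(raw):
    """Return the numeric code, or None if the answer is missing or unrecognized."""
    text = raw.strip().lower()
    if text in MISSING_MARKERS:
        return None
    return GENDER_CODES.get(text)

def screen_rating(raw, low=1, high=5):
    """Return (value, problem); problem is None when the answer is usable."""
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None, "missing"
    try:
        value = int(text)
    except ValueError:
        # Catches range answers such as "3-4" and multiple responses such as "2,5".
        return None, "not a single number"
    if not low <= value <= high:
        return None, "out of range"
    return value, None

def too_incomplete(raw_answers, max_missing_share=0.5):
    """Flag a questionnaire when more than the given share of answers is missing."""
    missing = sum(1 for a in raw_answers if a.strip() in MISSING_MARKERS)
    return missing / len(raw_answers) > max_missing_share
```

For instance, `screen_rating("3-4")` reports a problem rather than guessing a value, so the analyst decides how to treat the answer.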
Online Survey Tools
www.surveymonkey.com
www.sogosurvey.com