Collecting, organizing, and interpreting statistical data
This is a foundational process in data analysis. It involves several steps, from gathering the raw
data to drawing meaningful conclusions.
Definition of key terms
Data: Facts and statistics collected for a stated purpose, to help draw conclusions
Population: The entire set of people or objects under study, for example everyone in a given area
Sample: A carefully chosen part of the population, selected to represent it
Census: A complete official count of every member of a given population
Sampling: The process of selecting a sample from a population
The reasons for collecting data include:
To get tangible returns.
To grow an existing database.
To improve existing products.
To obtain customer feedback.
To understand the audience.
To increase engagement.
2 AUDIENCE / POPULATION / SAMPLE UNIT
There are two main measures of a population:
Population size (the number of individuals)
Population density: this is the number of individuals per unit area or volume.
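As a small illustration of the two measures, the sketch below computes a density from a size and an area. The function name and the figures (12,000 individuals on 40 square kilometres) are hypothetical and only serve the example.

```python
def population_density(size, area):
    """Return the number of individuals per unit area (or volume)."""
    if area <= 0:
        raise ValueError("area must be positive")
    return size / area

# Hypothetical figures: 12,000 individuals living on 40 square kilometres.
size = 12_000
density = population_density(size, 40.0)
print(f"size = {size}, density = {density} individuals per square kilometre")
```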
A sample is usually used to estimate the characteristics of a given population, because studying
a sample is cheaper and faster than studying the whole population.
There are various methods of choosing a sample from a population. They fall into two groups:
random and non-random.
Random (Probability) Sampling Methods
These methods ensure that every individual in the population has a known (and often equal)
chance of being selected. This helps avoid selection bias and makes the sample more
representative of the population.
1. Simple Random Sampling (SRS)
Definition: Every individual in the population has an equal and independent chance of
being selected.
How It Works: Typically, individuals are randomly selected using methods like random
number generators or drawing lots.
o Example: If you have a list of 100 people, you assign each person a number from
1 to 100 and randomly pick a set of numbers to form your sample.
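The example above can be sketched in Python using the standard library's random module. The helper name simple_random_sample and the seed value are assumptions made for illustration; the seed only makes the draw repeatable.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)  # sampling without replacement

people = list(range(1, 101))  # people numbered 1 to 100
sample = simple_random_sample(people, 10, seed=42)
print(sample)
```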
2. Stratified Sampling
Definition: The population is divided into subgroups (strata) based on specific
characteristics, and a random sample is taken from each stratum.
How It Works:
o First, divide the population into non-overlapping strata (e.g., age groups, gender,
income levels).
o Then, sample randomly from each group.
o Can be done in two ways:
Proportional Stratified Sampling: Sample size from each stratum is
proportional to its size in the population.
Equal Allocation Stratified Sampling: Each stratum has the same sample
size, regardless of its size in the population.
o Example: A school may divide students into strata by grade level and then
randomly select students from each grade.
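The two allocation rules can be compared with a minimal sketch. The function stratified_sample, the grade sizes, and the total sample size of 60 are hypothetical. Note that with proportional allocation the rounded stratum sizes may not always add up exactly to the requested total.

```python
import random

def stratified_sample(strata, total_n, proportional=True, seed=None):
    """Sample randomly within each stratum.

    strata maps a stratum name to the list of its members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    result = {}
    for name, members in strata.items():
        if proportional:
            # Share of the sample equals the stratum's share of the population.
            n = round(total_n * len(members) / population_size)
        else:
            # Equal allocation: the same sample size for every stratum.
            n = total_n // len(strata)
        result[name] = rng.sample(members, min(n, len(members)))
    return result

# Hypothetical school: 200, 100 and 100 students in three grades.
grades = {
    "grade_9": [f"g9_{i}" for i in range(200)],
    "grade_10": [f"g10_{i}" for i in range(100)],
    "grade_11": [f"g11_{i}" for i in range(100)],
}
prop_sample = stratified_sample(grades, 60, proportional=True, seed=1)
equal_sample = stratified_sample(grades, 60, proportional=False, seed=1)
print({name: len(chosen) for name, chosen in prop_sample.items()})
print({name: len(chosen) for name, chosen in equal_sample.items()})
```

With these sizes, proportional allocation takes 30, 15 and 15 students, while equal allocation takes 20 from each grade.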
3. Systematic Sampling
Definition: Every k-th individual is selected from a list after a random starting point.
How It Works:
o First, you determine the sampling interval (k), which is the population size
divided by the desired sample size.
o Select a random starting point and then select every k-th individual.
o Example: For a population of 1,000 and a sample size of 100, k = 1,000 / 100 = 10, so
you pick a random start among the first 10 individuals and then select every 10th individual.
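The steps above can be sketched as follows. The function name systematic_sample is an assumption, and the sketch assumes the population size is a multiple of the sample size so that the interval k is a whole number.

```python
import random

def systematic_sample(population, n, seed=None):
    """Select every k-th member after a random start within the first interval."""
    rng = random.Random(seed)
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start: index 0 .. k-1
    return population[start::k][:n]

population = list(range(1, 1001))  # individuals numbered 1 to 1,000
sample = systematic_sample(population, 100, seed=7)
print(sample[:5])
```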
4. Multistage Sampling
Definition: A combination of various sampling methods. It is used when the population
is large and geographically spread out.
How It Works:
o This technique involves selecting a sample in stages, often using a combination of
cluster sampling and random sampling.
o First, the population is divided into clusters. Then, a random sample of clusters is
selected, and within those clusters, random sampling is done again.
o Example: A researcher studying national education might first randomly select
schools (clusters) from different states (regions) and then randomly sample
students within those selected schools.
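A two-stage version of the school example might look like the sketch below. The names multistage_sample and schools, and the sizes used (8 schools of 50 students, 3 schools drawn, 5 students per school), are hypothetical.

```python
import random

def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick members within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical data: 8 schools with 50 students each.
schools = {
    f"school_{s}": [f"s{s}_student_{i}" for i in range(50)]
    for s in range(8)
}
sample = multistage_sample(schools, n_clusters=3, n_per_cluster=5, seed=3)
for school, students in sample.items():
    print(school, students)
```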
Non-Random (Non-Probability) Sampling Methods
These methods do not give every individual in the population a known chance of being selected.
They are often used when the researcher is unable to obtain a random sample or when the
study is exploratory.
1. Judgment Sampling (Purposive Sampling)
Definition: The researcher uses their judgment to select individuals who they believe are
most representative or knowledgeable about the phenomenon of interest.
How It Works:
o The researcher intentionally chooses individuals who have specific characteristics
that are believed to be relevant to the study.
o Example: A researcher studying the experiences of experienced teachers might
specifically select veteran teachers from a school.
2. Quota Sampling
Definition: The population is divided into subgroups (or quotas), and participants are
selected non-randomly from each subgroup until the quota for each subgroup is filled.
How It Works:
o The researcher sets quotas for certain characteristics (e.g., age, gender, income)
and selects individuals based on these characteristics, but not randomly.
o Example: If a survey requires 50 male and 50 female participants, the researcher
would select individuals from the population to meet these quotas, but not
randomly.
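The 50/50 example can be sketched by filling each quota in the order people happen to be encountered, which is what makes the method non-random. The function quota_sample and the stream of respondents are hypothetical.

```python
def quota_sample(respondents, quotas, key):
    """Accept respondents in arrival order until every quota is filled."""
    counts = {group: 0 for group in quotas}
    selected = []
    for person in respondents:
        group = key(person)
        if group in counts and counts[group] < quotas[group]:
            selected.append(person)
            counts[group] += 1
        if counts == quotas:  # all quotas met, stop recruiting
            break
    return selected

# Hypothetical stream of passers-by: every third person is female.
stream = [{"id": i, "sex": "female" if i % 3 == 0 else "male"} for i in range(300)]
sample = quota_sample(stream, {"male": 50, "female": 50}, key=lambda p: p["sex"])
print(len(sample))
```

Because nobody is drawn at random, the people who arrive first fill the quotas, and those who arrive later have no chance of selection.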
3. Cluster Sampling (Non-random in some contexts)
Definition: The population is divided into clusters, and then clusters are selected, with
all members of the chosen clusters being surveyed.
How It Works:
o While cluster sampling is typically a probability sampling method when the
clusters are selected randomly, it can become non-random if the researcher uses
judgment to select the clusters.
o Example: In an educational study, the researcher might pick a few schools
(clusters) and sample all students within those schools.
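For comparison, the sketch below shows the probability version of cluster sampling, where the clusters themselves are drawn at random and every member of a chosen cluster is surveyed. Replacing the random draw with a hand-picked list of schools would give the non-random variant described above. The function and data names are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and survey every member of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical data: 5 schools with 30 students each.
schools = {
    f"school_{s}": [f"s{s}_student_{i}" for i in range(30)]
    for s in range(5)
}
surveyed = cluster_sample(schools, n_clusters=2, seed=11)
print(len(surveyed))
```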
4. Snowball Sampling
Definition: This method is used when the target population is hard to reach, and initial
participants refer others to the study.
How It Works:
o Start with a few participants, and then ask them to refer others who meet the
study’s criteria. This process continues, creating a "snowball" effect.
o Example: A researcher studying drug users might begin with a few known
individuals and have them refer others involved in drug use.
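The referral process can be modelled as following a chain of referrals from a few starting participants. In this sketch the referrals dictionary (who refers whom) and the participant labels are made up for illustration.

```python
from collections import deque

def snowball_sample(seeds, referrals, max_size):
    """Recruit the seeds first, then the people they refer, and so on."""
    sample, seen = [], set()
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in seen:
            continue  # already recruited through another referral
        seen.add(person)
        sample.append(person)
        queue.extend(referrals.get(person, []))
    return sample

# Hypothetical referral network: A refers B and C, B refers D, and so on.
referrals = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["F"]}
recruited = snowball_sample(["A"], referrals, max_size=5)
print(recruited)  # ['A', 'B', 'C', 'D', 'E']
```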
5. Purposive Sampling (Non-random)
Definition: Closely related to judgment sampling (method 1 above): the researcher intentionally
selects participants who have specific characteristics or experiences related to the research
topic.
How It Works:
o The researcher identifies individuals with particular knowledge or characteristics,
selecting them based on purpose or judgment rather than random chance.
o Example: A researcher studying medical professionals might purposively select
doctors working in a specific field of medicine.
Other Sampling Methods
1. Convenience Sampling
Definition: Involves selecting the sample based on ease of access or availability.
How It Works:
o The researcher selects the most convenient individuals to survey, such as those
who are readily available or near at hand.
o Example: A researcher might survey people in a shopping mall or students in
their class simply because they are easy to reach.
2. Panel Sampling
Definition: A type of longitudinal sampling, where the same individuals are surveyed
multiple times over a period.
How It Works:
o A sample of individuals is chosen, and they are repeatedly surveyed at different
points in time, allowing the researcher to observe changes over time.
o Example: A market research company surveys the same group of consumers
every few months to track their changing preferences.
1. Classification Based on Nature of Data
a. Qualitative (Categorical) Data
Qualitative data, also known as categorical data, refers to data that can be categorized based on
characteristics or qualities. This data is descriptive and does not have a numeric value.
Nominal Data:
o Data that can be categorized without any order or ranking.
o Example: Gender (Male, Female), Eye color (Blue, Brown, Green), Marital status
(Single, Married, Divorced).
Ordinal Data:
o Data that can be categorized and ranked in a specific order but does not have a
consistent difference between categories.
o Example: Education level (High school, Undergraduate, Postgraduate), Customer
satisfaction (Very dissatisfied, Dissatisfied, Neutral, Satisfied, Very satisfied).
b. Quantitative (Numerical) Data
Quantitative data refers to data that is numeric and can be measured or counted. It allows for
mathematical operations and is often used for statistical analysis.
Discrete Data:
o Data that takes on distinct, separate values, often counted in integers.
o Example: Number of children in a family (0, 1, 2, etc.), Number of cars in a
parking lot.
Continuous Data:
o Data that can take on any value within a range and is measured, not counted.
Continuous data can have an infinite number of values within a given interval.
o Example: Height, weight, temperature, distance.
2. Classification Based on Measurement Level (Scale of Measurement)
This classification is based on the level of measurement of the data. It is important because it
determines the types of analysis that can be applied to the data.
a. Nominal Scale
The lowest level of measurement.
Data is classified into categories that have no order or ranking.
Examples: Gender, Eye color, Nationality, Religion.
b. Ordinal Scale
Data is classified into categories that can be ordered or ranked, but the differences
between the categories are not meaningful or consistent.
Examples: Education level (High school, College, Graduate), Customer satisfaction (Very
dissatisfied, Dissatisfied, Neutral, Satisfied, Very satisfied).
c. Interval Scale
Data is ordered, and the differences between values are meaningful and consistent, but
there is no true zero point.
Examples: Temperature (Celsius or Fahrenheit), IQ scores, Calendar years.
d. Ratio Scale
The highest level of measurement, where the data has an absolute (true) zero, so both the
differences and the ratios between values are meaningful.
Examples: Height, Weight, Age, Income, Speed.
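The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the conversion helpers and the sample values are assumptions, and 2.20462 is an approximate kilogram-to-pound factor. Changing the unit keeps ratios intact on a ratio scale (weight) but not on an interval scale (temperature), while equal differences stay equal on both.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (an interval scale: the zero is arbitrary)."""
    return celsius * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (a ratio scale: zero means no weight)."""
    return kg * 2.20462  # approximate conversion factor

# Interval scale: 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_c = 20 / 10                    # 2.0
ratio_f = c_to_f(20) / c_to_f(10)    # 68 / 50 = 1.36, the ratio changed

# Equal differences remain equal after conversion.
diff_low = c_to_f(20) - c_to_f(10)   # 18.0
diff_high = c_to_f(30) - c_to_f(20)  # 18.0

# Ratio scale: 20 kg is twice 10 kg in any unit.
ratio_kg = 20 / 10
ratio_lb = kg_to_lb(20) / kg_to_lb(10)

print(ratio_c, ratio_f, diff_low, diff_high, ratio_kg, ratio_lb)
```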
Measures of Central Tendency:
1. Arithmetic Mean (Average)
The mean is the sum of all data points divided by the number of data points.
2. Mode
The mode is the value that appears most frequently in a data set. It is the observation that
occurs more times than any other. A data set can have more than one mode if several values
tie for the highest frequency.
3. Median
The median is the middle value in a data set when arranged in either ascending or descending
order.
If the number of observations is odd, the median is the middle number.
If the number of observations is even, the median is the average of the two middle
numbers.
These measures of central tendency (mean, mode, and median) help summarize the data by
identifying a central or typical value.
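The three definitions can be written out directly, as in the sketch below. The function names and the two small data sets are hypothetical, and modes returns a list so that ties for the most frequent value are all reported.

```python
from collections import Counter

def mean(values):
    """Sum of all data points divided by the number of data points."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data; average of the two middle values if even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

even_scores = [3, 7, 7, 9, 10, 12]  # six observations
odd_scores = [3, 7, 7, 9, 10]       # five observations

print(mean(even_scores))    # 8.0
print(median(even_scores))  # 8.0, the average of 7 and 9
print(median(odd_scores))   # 7, the middle observation
print(modes(even_scores))   # [7]
```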