Descriptive Statistics
By
Dr. Jupiter Simbeye
Preliminary issues
STA121: Descriptive Statistics
• Course outline uploaded on the classroom
• Classroom code:
• Date for tests:
Test 1: 21st April, 2023
Test 2: 26th May, 2023
2020 STA121 Performance
[Figure: bar chart of the percentage of students (vertical axis: Percent) obtaining each grade: A+, A, B+, B, C+, C, C-, D, E, F]
Module Aims and Learning Outcomes
Aim:
• To introduce students to basic descriptive statistical analysis
Learning outcomes:
On successful completion of this module, students should be able to:
• Summarise data in the form of central measures, frequencies, tables and
graphs,
• Interpret summary statistics,
• Apply descriptive statistics to answer practical questions.
Indicative Content
• Review of statistical concepts: definition of statistics, types of
statistics, data and types, scales of measurement.
• Tables and graphs for frequencies and other statistics: use and
interpretation of multi-way tables.
• Numerical summaries for quantitative data: percentile, quartile,
deciles, mean, median, mode, range, variance and standard deviation,
relative variation, coefficient of variation, skewness and kurtosis.
Indicative Content
• Processing single and multiple variables: concepts and calculations
applied on real data, effect of outliers on calculation of standard
deviation, summarising single columns of data
• Risk and return periods: cumulative frequency distributions and their
interpretations.
• Introducing a statistics package: working with SPSS and Stata.
• Common complications when analysing survey data: analysis of
multiple response questions, presence of missing values in the data,
need to produce weighted tables, presence of zero values.
Descriptive Statistics
Terms and definitions
• Statistics: the discipline that concerns the collection, organization,
analysis, interpretation and presentation of data
• Two types of statistical methods are used in analyzing
data: descriptive statistics and inferential statistics.
• Descriptive statistics are used to summarize data from a sample, e.g.
in the form of a mean or standard deviation.
• Inferential statistics are used to draw conclusions about a population
when the data are viewed as a sample from that population.
Terms and definitions
• Population: In statistics, a
population is a set of similar
items or events which is of
interest for some question or
experiment.
• Sample: a sample is a set of
individuals or objects collected
or selected from
a statistical population
More terms and definitions
1. Variable: A variable is any quantity or attribute whose value varies
from one unit of investigation to another.
Examples
a) Age: if you record the ages of students in this class, you are likely
to get a different value from one student to the next
b) Sex of babies at birth: each baby is either male or female. In this
case, “male” and “female” are the two possible values of the variable sex
c) GPA: As you progress with your studies, your end of semester GPA
is likely to differ from semester to semester.
More terms and definitions
2. Observation: An observation is the value taken by a variable for a
particular unit of investigation
Example
Below are percentage point grades obtained by 10 students in STA121:
67, 70, 55, 62, 40, 81, 90, 60, 69, 56
Each of the ten values listed (e.g. 67) is an observation.
More terms and definitions
3. Quantitative variable: A quantitative variable is a variable whose
values are numerical.
Examples
a) Age in years (25, 15, 74, etc)
b) Birth-weight of babies in kg (3.1, 2.5, 2.9, 3.5, 4.2, etc)
c) Number of antenatal care (ANC) visits by a pregnant mother (0, 1, 4,
7, etc)
More terms and definitions
Quantitative variables can be divided into two types: continuous or
discrete
4. Continuous variable: A continuous variable is a variable which may
take all values within a given range.
5. Discrete variable: A discrete variable is a variable whose values change
by steps or jumps.
More terms and definitions
Thus age and birth-weight are continuous, because they can take any
value within a range, such as 25.5237873244 years or 2.93927634529 kg,
respectively, even if we may not have instruments that could measure this
accurately!
However, the number of antenatal care (ANC) visits by a pregnant mother
is discrete, since its values must be whole numbers (0, 1, 2, …, 9);
decimal values cannot be accommodated.
More terms and definitions
6. Qualitative variable or attribute: A qualitative variable or attribute is a
variable whose values are not numerical.
Examples
a) Names of countries (Malawi, Zambia, Egypt, Mozambique)
b) Answer to opinion question (strongly disagree, disagree, agree, strongly
agree)
c) Sex of an individual ( male, female)
Note: In most analyses, qualitative variables that take a limited number of
values are coded numerically (e.g. 1=strongly disagree, …, 4=strongly
agree)
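The coding described in the note can be sketched in Python as follows. This is a minimal illustration, assuming the four response options are coded 1 to 4 in the order listed in example (b). The names `LIKERT_CODES` and `encode` are hypothetical, and the responses are made-up values for illustration, not survey data.

```python
# Codes follow the order of the options in example (b):
# 1 = strongly disagree, ..., 4 = strongly agree (assumed ordering).
LIKERT_CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "agree": 3,
    "strongly agree": 4,
}


def encode(responses):
    """Replace each text response with its numeric code."""
    return [LIKERT_CODES[r] for r in responses]


# Illustrative responses only (not real survey data).
responses = ["agree", "strongly disagree", "agree", "strongly agree"]
codes = encode(responses)
print(codes)
```

The codes are only labels for the categories: the variable remains qualitative even though it is stored as numbers.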
More terms and definitions
7. Frequency Distribution: A frequency distribution is an overview of all
distinct values in some variable and the number of times they occur
• Frequency distributions are mostly used for summarizing discrete /
categorical variables. Metric (continuous) variables tend to have many
distinct values. These result in huge tables and charts that don't give insight
into your data.
Example:
• A sample of 183 students was asked to state which study major they are
following. Part of these data is shown below.
Study majors
SN     Name of student    Sex       Major
1      Andrew Gondwe      Male      Biology
2      John Samale        Male      Mathematics
3      Pempho Yasini      Female    Other
4      Felix Wadabwa      Male      Mathematics
:      :                  :         :
:      :                  :         :
182    Maren Dickson      Female    Physics
183    Jack Filipo        Male      Chemistry
Observations
• Just looking at our 183 values cannot provide much useful
information about majoring subjects.
• A more viable approach is to simply tabulate each distinct study
major in our data and its frequency: the number of times it occurs.
• The resulting table (below) shows how frequencies are
distributed over values (majoring subjects in this example) and
hence is a frequency distribution.
Frequency distribution table
What is currently your majoring subject?    N      Percent
Mathematics                                 62     33.9%
Biology                                     35     19.1%
Chemistry                                   33     18.0%
Physics                                     37     20.2%
Others                                      16     8.7%
Total                                       183    100%
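The tabulation step can be sketched in Python with the standard library's `collections.Counter`. This is only a small illustration: it uses the six students visible in the excerpt of the data (the other 177 rows are elided in the source), so its counts are not those of the full table.

```python
from collections import Counter

# Majors of the six students visible in the excerpt of the data
# (serial numbers 1-4, 182 and 183); the remaining rows are elided.
majors = ["Biology", "Mathematics", "Other", "Mathematics", "Physics", "Chemistry"]

frequency = Counter(majors)  # maps each distinct value to its count

# Print the distinct values from most to least frequent, then the total.
for major, n in frequency.most_common():
    print(f"{major:<12} {n}")
print(f"{'Total':<12} {sum(frequency.values())}")
```

Applied to all 183 rows, the same steps would give the frequency column of the table.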
Observations
• The most popular study major is mathematics (n = 62).
• “Other” is the least popular major (n = 16).
• The remaining majors are roughly equally popular (n between 33 and
37).
• Note that the frequencies add up to our sample size of 183 students.
This is always the case unless a variable contains missing values:
respondents can sometimes skip a question or answer “no answer”
or something similar.
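The effect of missing values on the total can be illustrated with a short Python sketch. The five answers below are hypothetical, with `None` standing for a skipped question; they are not taken from the survey of 183 students.

```python
from collections import Counter

# Hypothetical answers from five respondents; None marks a skipped question.
answers = ["Physics", None, "Biology", "Physics", None]

# Tabulate only the valid (non-missing) answers.
valid = [a for a in answers if a is not None]
frequency = Counter(valid)

n_sample = len(answers)            # sample size
n_valid = sum(frequency.values())  # what the frequencies add up to
n_missing = n_sample - n_valid     # shortfall caused by missing values
print(frequency, n_sample, n_valid, n_missing)
```

Here the frequencies add up to 3 rather than the sample size of 5, because two respondents gave no answer.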
Relative frequencies
Optionally, a frequency distribution may contain relative frequencies:
frequencies relative to (divided by) the total number of values. Relative
frequencies are often shown as percentages or proportions.
What is currently your majoring subject?    N      Percent
Mathematics                                 62     33.9%
Biology                                     35     19.1%
Chemistry                                   33     18.0%
Physics                                     37     20.2%
Others                                      16     8.7%
Total                                       183    100%
(The Percent column holds the relative frequencies.)
Relative frequencies
• Relative frequencies provide easy insight into frequency distributions.
Besides, they facilitate comparisons.
• For example,
“33.9% of students major in mathematics”
is much easier to interpret than
“62 out of 183 students major in mathematics”.
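The percentages in the table can be reproduced by dividing each frequency by the total. The Python sketch below uses the frequencies from the study-major table; the variable names are illustrative only.

```python
# Frequencies from the study-major table.
frequency = {
    "Mathematics": 62,
    "Biology": 35,
    "Chemistry": 33,
    "Physics": 37,
    "Others": 16,
}

total = sum(frequency.values())  # sample size: 183

# Relative frequency = frequency / total, expressed here as a percentage.
percent = {major: round(100 * n / total, 1) for major, n in frequency.items()}

for major, n in frequency.items():
    print(f"{major:<12} {n:>4} {percent[major]:>5.1f}%")
```

For mathematics this gives 100 × 62 / 183 ≈ 33.9%, matching the table.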
Frequency distributions for continuous variables
• Sometimes, we are interested in summarizing continuous variables into
frequency tables.
• However, as we noted, metric variables tend to have many distinct values.
These result in huge tables and charts that don't give insight into your data.
• Therefore, instead of looking at the frequency of each variable-value that
occurs we can first group the values of the variable into intervals, that is
subdivisions of the total range of possible values of the variable.
Example: Malawi birth-weights (MDHS, 2010)
• The Demographic and Health Survey of 2010 collected birth-weights from
13,079 babies. These values are so many that a frequency table of the
individual values may not be a reasonable way of summarizing
birth-weight.
• However, if we decide to do it anyway, we get the results tabulated below.
Frequency table – birth-weight in grams
Children's birth weight in grams    Frequency    Percent    Cumulative
200 1 0.01 0.01
300 6 0.05 0.05
400 2 0.02 0.07
500 1 0.01 0.08
600 1 0.01 0.08
700 2 0.02 0.1
1000 52 0.4 0.5
1100 10 0.08 0.57
1200 16 0.12 0.7
1300 12 0.09 0.79
1400 4 0.03 0.82
1500 53 0.41 1.22
1600 20 0.15 1.38
1700 20 0.15 1.53
1800 38 0.29 1.82
1900 30 0.23 2.05
Frequency table - continued
2000 584 4.47 6.51
2100 187 1.43 7.94
2200 111 0.85 8.79
2300 233 1.78 10.57
2400 144 1.1 11.68
2500 788 6.02 17.7
2600 199 1.52 19.22
2700 182 1.39 20.61
2800 439 3.36 23.97
2900 315 2.41 26.38
3000 2,428 18.56 44.94 (most frequent value)
3100 454 3.47 48.41
3200 1,029 7.87 56.28
3300 288 2.2 58.48
3400 538 4.11 62.6
3500 1,267 9.69 72.28
3600 250 1.91 74.2
3700 204 1.56 75.76
3800 326 2.49 78.25
3900 208 1.59 79.84
Frequency table - continued
4000 1,074 8.21 88.05
4100 113 0.86 88.91
4200 224 1.71 90.63
4300 155 1.19 91.81
4400 53 0.41 92.22
4500 320 2.45 94.66
4600 44 0.34 95
4700 23 0.18 95.18
4800 41 0.31 95.49
4900 30 0.23 95.72
5000 302 2.31 98.03
5100 27 0.21 98.23
5200 36 0.28 98.51
5300 25 0.19 98.7
5400 15 0.11 98.81
5500 36 0.28 99.09
5600 14 0.11 99.2
5700 1 0.01 99.2
Frequency table - continued
6000 48 0.37 99.63
6100 2 0.02 99.64
6200 4 0.03 99.67
6300 4 0.03 99.7
6400 4 0.03 99.73
6500 7 0.05 99.79
6600 2 0.02 99.8
6700 4 0.03 99.83
7000 10 0.08 99.91
7100 1 0.01 99.92
7200 1 0.01 99.92
7300 2 0.02 99.94
7500 1 0.01 99.95
8000 1 0.01 99.95
8500 2 0.02 99.97
9000 3 0.02 99.99
9100 1 0.01 100
Total 13,079 100
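The Cumulative column can be reproduced from running totals of the Frequency column. The Python sketch below uses only the first seven rows of the table and the total of 13,079. Rounding to two decimals is an assumption chosen to match the figures shown.

```python
from itertools import accumulate

TOTAL = 13079  # total number of babies in the table

# First seven rows of the table: (birth-weight in grams, frequency).
rows = [(200, 1), (300, 6), (400, 2), (500, 1), (600, 1), (700, 2), (1000, 52)]

frequencies = [n for _, n in rows]
running = list(accumulate(frequencies))  # running totals of the frequencies
# Cumulative percent is rounded from the running total, not from a sum of
# already-rounded percentages.
cumulative = [round(100 * c / TOTAL, 2) for c in running]

for (weight, n), cum in zip(rows, cumulative):
    print(f"{weight:>5} {n:>4} {100 * n / TOTAL:>6.2f} {cum:>6.2f}")
```

Because the cumulative figure is computed from the running total, the row for 300 grams shows 0.05 (7 / 13,079 ≈ 0.054%) rather than 0.01 + 0.05 = 0.06.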
Observations
• The most common single birth-weight is 3000 grams (2,428 babies, 18.56%).
• One important pattern we observe is heaping at 2000, 2500, 3000, 3200, 3500,
4000, 4500 and 5000 grams. This could be due to recording errors by birth attendants
or to mothers rounding the figures when recalling birth-weights.
• Since the summary is not very informative, it is a good idea to group the birth-weights
into some sensible groups before tabulating, say: 1 – 1000, 1001 – 2000, 2001 – 3000, 3001
– 4000, 4001 – 5000, 5001 – 6000, 6001 – 7000, 7001 – 8000, 8001 – 9000, 9001 – 10000
grams.
• The table below provides a frequency table for the ten groups that we have created.
Observations
• The largest group of babies (5,638) is born weighing between 3001
and 4000 grams. This represents 43.11% of all 13,079 babies whose
birth-weight was recorded in the survey.
• The second largest group of babies is born weighing between 2001 and
3000 grams.
• Overall, over 80% of the babies are born weighing between 2001 and
4000 grams.
• The least likely birth-weights are birth-weights over 6000 grams.
Grouped birth-weight
in grams        Frequency    Percent    Cumulative
1-1000          65           0.5        0.5
1001-2000       787          6.02       6.51
2001-3000       5,026        38.43      44.94
3001-4000       5,638        43.11      88.05
4001-5000       1,305        9.98       98.03
5001-6000       209          1.6        99.63
6001-7000       37           0.28       99.91
7001-8000       6            0.05       99.95
8001-9000       5            0.04       99.99
9001-10000      1            0.01       100
Total           13,079       100
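The grouping step can be sketched in Python as follows. The example uses the first seventeen rows of the ungrouped table (200 to 2000 grams) and assigns each weight to a group of width 1000 grams. The function name `class_interval` is illustrative, and the sketch assumes weights are positive whole numbers of grams.

```python
from collections import Counter

# First seventeen rows of the ungrouped table: birth-weight in grams -> frequency.
raw = {
    200: 1, 300: 6, 400: 2, 500: 1, 600: 1, 700: 2, 1000: 52,
    1100: 10, 1200: 16, 1300: 12, 1400: 4, 1500: 53, 1600: 20,
    1700: 20, 1800: 38, 1900: 30, 2000: 584,
}

WIDTH = 1000  # width of each group in grams


def class_interval(weight):
    """Return the (lower, upper) group containing a positive whole-gram weight."""
    lower = ((weight - 1) // WIDTH) * WIDTH + 1
    return lower, lower + WIDTH - 1


# Add each row's frequency to the group its weight falls in.
grouped = Counter()
for weight, n in raw.items():
    grouped[class_interval(weight)] += n

for (lower, upper), n in sorted(grouped.items()):
    print(f"{lower}-{upper}: {n}")
```

The two totals, 65 and 787, agree with the first two rows of the grouped table.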
More terms and definitions
8. Class-interval: A class interval is a subdivision of the total range of
values which a (continuous) variable may take
In our example above, the birth-weight variable is grouped into class-intervals
of 1 – 1000, 1001 – 2000, … , 8001 – 9000, 9001 – 10000 grams.
9. Class-frequency: A class-frequency is the number of observations of the
variable which fall in a given interval.
Therefore,
10. The frequency distribution of a (continuous) variable is the set of
class-intervals for the variable, together with the associated
class-frequencies.
Grouped birth-weight    Frequency
in grams                (class frequencies)    Percent    Cumulative
1-1000                  65                     0.5        0.5
1001-2000               787                    6.02       6.51
2001-3000               5,026                  38.43      44.94
3001-4000               5,638                  43.11      88.05
4001-5000               1,305                  9.98       98.03
5001-6000               209                    1.6        99.63
6001-7000               37                     0.28       99.91
7001-8000               6                      0.05       99.95
8001-9000               5                      0.04       99.99
9001-10000              1                      0.01       100
Total                   13,079                 100
Disaggregated frequency distributions
• At times it is important to disaggregate frequency distributions by factors
that are known or thought to affect them.
• For example, in our DHS data, if we suspect that birth-weights
may differ between babies born in rural areas and those born in urban areas, a
frequency distribution disaggregated by rural/urban residence may prove
to be useful.
• Since the total frequencies in the two groups differ, it is difficult to
compare the frequencies directly. In this case, relative frequencies
become useful.
Frequency table of birth-weights distributed by rural and urban residence
                                 Rural                   Urban
Grouped birth-weight in grams    Frequency    Percent    Frequency    Percent
1-1000                           65           0.56       0            0
1001-2000                        702          6.09       85           5.45
2001-3000                        4,461        38.73      565          36.19
3001-4000                        4,850        42.11      788          50.48
4001-5000                        1,200        10.42      105          6.73
5001-6000                        195          1.69       14           0.9
6001-7000                        34           0.3        3            0.19
7001-8000                        6            0.05       0            0
8001-9000                        4            0.03       1            0.06
9001-10000                       1            0.01       0            0
Total                            11,518       100        1,561        100
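The Percent columns are relative frequencies computed within each residence group, which is what makes the two groups comparable despite their different sizes. A minimal Python sketch, using the class frequencies from the table (the helper name `to_percent` is illustrative):

```python
# Class intervals in grams; the last group is written as 9001-10000,
# following the grouping listed earlier in the section.
intervals = ["1-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000",
             "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-10000"]

# Class frequencies by residence, copied from the table.
counts = {
    "Rural": [65, 702, 4461, 4850, 1200, 195, 34, 6, 4, 1],
    "Urban": [0, 85, 565, 788, 105, 14, 3, 0, 1, 0],
}


def to_percent(frequencies):
    """Convert class frequencies to percentages of their own group total."""
    total = sum(frequencies)
    return [round(100 * n / total, 2) for n in frequencies]


percent = {group: to_percent(freqs) for group, freqs in counts.items()}

for i, interval in enumerate(intervals):
    print(f"{interval:<11} {percent['Rural'][i]:>6.2f} {percent['Urban'][i]:>6.2f}")
```

For the 3001-4000 group, the rural frequency (4,850) is about six times the urban one (788), yet the urban percentage is higher: 50.48% against 42.11%.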
Frequency polygons
• Comparisons between the two groups can be more visible using
graphs.
• A frequency polygon is obtained by plotting class-frequencies or
relative frequencies (percentages) as ordinates against the centre-points
of class-intervals as abscissae. The plotted points are then joined by
straight lines.
• The figure below contains frequency polygons of birth-weight for
rural and urban babies.
Frequency polygons of birth-weight of babies born in rural versus urban areas
[Figure: frequency polygons, one line each for Rural and Urban. Vertical axis: percent of babies (0 to 60). Horizontal axis: class-interval centre-points from 500 to 9,500 grams.]
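The points of each frequency polygon can be computed as in the Python sketch below, which pairs the centre-point of every class interval with the group's percentage from the table of birth-weights by residence. The variable names are illustrative. The sketch stops at the coordinates: drawing the lines is left to any plotting tool (a library such as matplotlib would be one option, but it is not used here).

```python
# Class intervals in grams: (1, 1000), (1001, 2000), ..., (9001, 10000).
bounds = [(lower, lower + 999) for lower in range(1, 10000, 1000)]

# Relative frequencies (percent) by residence, copied from the table.
percent = {
    "Rural": [0.56, 6.09, 38.73, 42.11, 10.42, 1.69, 0.3, 0.05, 0.03, 0.01],
    "Urban": [0, 5.45, 36.19, 50.48, 6.73, 0.9, 0.19, 0, 0.06, 0],
}

# Centre-point of each class interval: the abscissa of each plotted point.
midpoints = [(lower + upper) / 2 for lower, upper in bounds]

# Each polygon is the list of (centre-point, percent) pairs joined in order.
polygons = {group: list(zip(midpoints, values)) for group, values in percent.items()}

for group, points in polygons.items():
    print(group, points[:4])
```

The exact centre-points are 500.5, 1500.5 and so on; the figure's axis labels (500, 1500, ...) appear to be these values rounded.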
Observations
• The two distributions have similar shapes and overlap greatly. It appears that babies
in urban and rural areas are about equally likely to be heavier or lighter.
• The biggest gap is observed around 3,500 grams, where the percentage of
babies is higher in urban than in rural areas.