Chapter 3 - Central Tendency & Variability
Introduction
Learning Outcomes
At the end of the chapter, you are expected to:
1. compute the mean, median, and mode of a given set of data;
2. decide which measure of central tendency should be used for certain types of data;
3. compute the standard deviation and variance of a given set of data; and
4. interpret the computed measures.
A measure of central tendency is a summary statistic that represents the center point or
typical value of a data set. These measures indicate where most values in a distribution fall and are
also referred to as the central location of a distribution. You can think of it as the tendency of data to
cluster around a middle value. The three most common measures of central tendency are
the mean, median, and mode. Each of these measures calculates the location of the central point
using a different method. Colloquially, measures of central tendency are often called averages.
Choosing the best measure of central tendency depends on the type of data you have.
Use the following summary table to identify the best measure of central tendency for each type of
variable.
nds* 2020-2021
Psychological Statistics
modal class. A distribution with only one mode is said to be unimodal. When two values share the
highest frequency, the set is said to be bimodal. If the set has more than two modes, then the set is
multimodal. It is also possible for a distribution to have no mode. The set 3, 4, 5, 7, 9, 12, 15 has no
mode.
Consider this data set showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
It is also possible to have more than one mode for the same distribution of data (bimodal or
multimodal). The presence of more than one mode can limit the ability of the mode to describe the
center or typical value of the distribution, because a single value to describe the center cannot be
identified.
In some cases, particularly where the data are continuous, the distribution may have no mode at
all. This happens when all values are different. In such cases, it may be better to consider using
the median or mean, or to group the data into appropriate intervals and find the modal class.
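The counting procedure described above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `modes` is hypothetical, and returning an empty list for "no mode" is a convention chosen here.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency.

    An empty list is returned when every value occurs exactly once,
    which the text describes as a distribution with no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
no_mode_set = [3, 4, 5, 7, 9, 12, 15]

print(modes(retirement_ages))  # [54]
print(modes(no_mode_set))      # []
```

A list with two entries would indicate a bimodal set, and more than two entries a multimodal set.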
Looking at the retirement age distribution (which has 11 observations), the median is the middle
value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the two
middle values. In the following distribution, the two middle values are 56 and 57, therefore the
median equals 56.5 years:
As the mean includes every value in the distribution, it is influenced by outliers and skewed
distributions.
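To make the contrast between the median and the mean concrete, the sketch below uses Python's standard `statistics` module on the retirement age data. The ten-value set and the added value of 95 are hypothetical, introduced only to illustrate the even-count rule and the effect of an outlier.

```python
from statistics import mean, median

retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(median(retirement_ages))           # 57
print(round(mean(retirement_ages), 2))   # 56.64

# Hypothetical even-sized set: the same data with the last value dropped.
# The two middle values are 56 and 57, so the median is their mean.
even_set = retirement_ages[:-1]
print(median(even_set))                  # 56.5

# Hypothetical outlier: one person retiring at 95.
with_outlier = retirement_ages + [95]
print(median(with_outlier))              # 57.0
print(round(mean(with_outlier), 2))      # 59.83
```

The single extreme value moves the mean by more than three years while the median stays at 57.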
How does the shape of a distribution influence the Measures of Central Tendency?
Symmetrical distributions:
When a distribution is symmetrical, the mode, median and mean are all in the middle of the
distribution. The following graph shows a larger retirement age data set with a distribution which is
symmetrical. The mode, median and mean all equal 58 years.
Skewed distributions:
When a distribution is skewed, the mode remains the most commonly occurring value and the median
remains the middle value in the distribution, but the mean is generally ‘pulled’ in the direction of the
tail. In a skewed distribution, the median is often a preferred measure of central tendency, as the
mean is not usually in the middle of the distribution.
A distribution is said to be positively or right skewed when the tail on the right side of the
distribution is longer than the left side. In a positively skewed distribution, it is common for the mean
to be ‘pulled’ toward the right tail of the distribution. Although there are exceptions to this rule,
generally, most of the values, including the median value, tend to be less than the mean value.
Data which are arranged in a frequency distribution are called grouped data. When the
number of items is too large, it is best to compute the measures of central tendency and
variability using the frequency distribution.
The Quantiles
The quantiles are a natural extension of the median concept in that they are values which
divide a set of data into equal parts. While the median divides the distribution into two parts, the
quantiles divide it into four, or ten, or one hundred equal parts. The quantiles which divide the
distribution into four parts are called quartiles; those which divide the distribution into ten parts are
called deciles; and those which divide the distribution into one hundred parts are called percentiles.
Step 1: Order the data from smallest to largest. The data in the question is already in ascending order.
Step 2: Count how many observations you have in your data set. This particular data set has 40 items.
Step 3: Convert any percentage to a decimal for “q”. We are looking for the number where 20 percent
of the values fall below it, so convert that to 0.20.
Step 4: Insert your values into the formula:
ith observation = q (n + 1)
ith observation = 0.20 (40 + 1) = 8.2
Answer: The ith observation is at 8.2, so we round down to 8 (remembering that this formula is an
estimate). The 8th number in the set is 13, which is the number where 20 percent of the values fall
below it.
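The four steps above can be sketched as a short Python function. The names `percentile_position` and `percentile_value` are hypothetical, and because the 40-item data set from the question is not reproduced here, the usage lines fall back on placeholder data (the integers 1 to 40).

```python
import math

def percentile_position(q, n):
    """Position of the observation for proportion q, using i = q(n + 1)."""
    return q * (n + 1)

def percentile_value(sorted_data, q):
    """Value at the rounded-down position, as in the worked answer above."""
    n = len(sorted_data)
    position = percentile_position(q, n)
    # Round down and keep the rank inside 1..n (ranks are 1-based).
    rank = min(n, max(1, math.floor(position)))
    return sorted_data[rank - 1]

placeholder_data = list(range(1, 41))  # assumption: stands in for the 40 items

print(round(percentile_position(0.20, 40), 1))   # 8.2
print(percentile_value(placeholder_data, 0.20))  # 8 (the 8th value)
```

With the data from the question, the 8th value would be 13, as stated in the answer.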
Summary statistics such as the median, first quartile and third quartile are measurements of position.
This is because these numbers indicate where a specified proportion of the distribution of data lies.
For instance, the median is the middle position of the data under investigation. Half of the data have
values less than the median. Similarly, 25% of the data have values less than the first quartile and 75%
of the data have values less than the third quartile.
This concept can be generalized. One way to do this is to consider percentiles. The 90th percentile
indicates the point where 90% of the data have values less than this number. More generally,
the pth percentile is the number x for which p% of the data is less than x.
Example: Let us consider the frequency distribution that was organized in chapter 2. Solve for the
mean, median and mode of the distribution and interpret.
$$\bar{x} \text{ (mean)} = \frac{15{,}831}{100} = 158.31$$
On average, the height of the 100 students is 158.31 cm (this now serves as the
representative height of the 100 students).
$$\frac{n}{2} = \frac{100}{2} = 50$$
The median is the mean of the 50th and 51st observations when arranged
into an array, and these two observations are within the class interval 158-160, as indicated by the
“less than” cumulative frequency. Hence, the median class interval is 158-160, with 157.5 as the lower
class boundary. The size of the class interval (i) is 3. Therefore,
$$\text{Median} = 157.5 + \left(\frac{50 - 40}{23}\right)(3) = 158.8$$
This means that ½ of 100 or 50 students have heights greater than 158.8 cm and the other 50
students have heights lower than 158.8 cm.
The modal class is the interval 158-160 since it has the greatest frequency. The lower
boundary of the modal class is 157.5. Δ1 = 23 – 18 = 5 and Δ2 = 23 – 18 = 5. The size of the class
interval is 3. Hence,
$$\text{Mode} = 157.5 + \left(\frac{5}{5 + 5}\right)(3) = 159$$
The mode of the heights of the 100 students is 159.
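The grouped-data mode computation can be checked with a small Python sketch. The function name `grouped_mode` and its parameter names are hypothetical; the numbers are the ones quoted in the example (lower boundary 157.5, modal frequency 23, neighbouring frequencies 18 and 18, class size 3).

```python
def grouped_mode(lower_boundary, delta1, delta2, class_size):
    """Mode of grouped data: L + (d1 / (d1 + d2)) * i.

    delta1 and delta2 are the differences between the modal class
    frequency and the two frequencies subtracted from it in the example.
    """
    return lower_boundary + (delta1 / (delta1 + delta2)) * class_size

modal_freq = 23
delta1 = modal_freq - 18  # 5
delta2 = modal_freq - 18  # 5

print(grouped_mode(157.5, delta1, delta2, 3))  # 159.0
```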
To illustrate finding the quartiles, let us consider the same data about the heights of 100 students.
Since there are 100 observations, the first quartile lies between the 25th and 26th observations and the
third quartile lies between the 75th and 76th observations. Hence, the first quartile is within the class
interval 155-157 and the third quartile is within the class interval 161-163. Hence, the first and third
quartiles are:
$$Q_1 = 154.5 + \left(\frac{25 - 22}{18}\right)(3) = 155$$
and
$$Q_3 = 160.5 + \left(\frac{75 - 63}{18}\right)(3) = 162.5$$
A similar process is applied in the computation of deciles (D1, D2, D3, …, D9) and percentiles (P1, P2, P3, P4,
P5, …, P99).
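Because the median, quartiles, deciles and percentiles of grouped data all follow one pattern, a single function can sketch them. The name `grouped_quantile` is hypothetical. The median inputs (cumulative frequency 40, class frequency 23) come from the worked example. The quartile inputs (cumulative frequencies 22 and 63, class frequency 18) are assumptions inferred from the frequencies 18, 23 and 18 quoted in the mode example, since the frequency table itself is not reproduced here.

```python
def grouped_quantile(lower_boundary, position, cum_freq_before,
                     class_freq, class_size):
    """Quantile of grouped data: L + ((position - cf) / f) * i.

    position is the target rank, e.g. n/2 for the median, n/4 for Q1,
    3n/4 for Q3, k*n/10 for the kth decile, k*n/100 for the kth percentile.
    """
    return lower_boundary + ((position - cum_freq_before) / class_freq) * class_size

n = 100
class_size = 3

median = grouped_quantile(157.5, n / 2, 40, 23, class_size)
q1 = grouped_quantile(154.5, n / 4, 22, 18, class_size)      # 22, 18: inferred
q3 = grouped_quantile(160.5, 3 * n / 4, 63, 18, class_size)  # 63, 18: inferred

print(round(median, 1))  # 158.8
print(round(q1, 1))      # 155.0
print(round(q3, 1))      # 162.5
```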
Measures of Variability
Variability refers to how spread apart the scores of the distribution are or how much the scores vary
from each other. When descriptive statistics are presented, there is usually at least one measure of
central tendency and at least one measure of variability reported. While measures of central
tendency are useful statistics for summarizing the scores in a distribution, they are not sufficient. Two
distributions may have identical means and medians yet be quite different in other ways. There is
a need, therefore, for measures researchers can use to describe the variability that exists within a
distribution.
The Range
The range is the difference between the largest and smallest values in a set of values.
For example, consider the following numbers: 1, 3, 4, 5, 5, 6, 7, 11. For this set of numbers, the range
would be 11 - 1 or 10.
The Interquartile Range (IQR)
The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are
called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.
Q2 is the median of the entire data set, that is, the middle value. In this example, we have an even number of
data points, so the median is equal to the average of the two middle values. Thus, Q2 = (4 + 5)/2 or Q2
= 4.5. Q1 is the middle value in the first half of the data set. Since there is an even number of data
points in the first half of the data set, the middle value is the average of the two middle values; that is,
Q1 = (2 + 3)/2 or Q1 = 2.5. Q3 is the middle value in the second half of the data set. Again, since the
second half of the data set has an even number of observations, the middle value is the average of the
two middle values; that is, Q3 = (6 + 7)/2 or Q3 = 6.5. The interquartile range is Q3 minus Q1, so
IQR = 6.5 - 2.5 = 4.
The interquartile range indicates the distance between the two values which determine the middle
50% of all observations within the distribution. One-half this distance is called the semi-interquartile
range or the quartile deviation (QD). Thus,
$$QD = \frac{Q_3 - Q_1}{2}$$
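The range, interquartile range and quartile deviation can be sketched together in Python, using the "median of each half" method described above. The function names are hypothetical. The data set 1 to 8 is an assumption, chosen because it reproduces the quartile values quoted in the text (2.5, 4.5 and 6.5); the example's own data are not shown here.

```python
from statistics import median

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def quartiles(values):
    """Q1, Q2, Q3 as the medians of the lower half, whole set, and upper half.

    For an odd number of values the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

print(value_range([1, 3, 4, 5, 5, 6, 7, 11]))  # 10

assumed_data = [1, 2, 3, 4, 5, 6, 7, 8]  # assumption, see lead-in
q1, q2, q3 = quartiles(assumed_data)
iqr = q3 - q1
qd = iqr / 2

print(q1, q2, q3)  # 2.5 4.5 6.5
print(iqr)         # 4.0
print(qd)          # 2.0
```

Other quartile conventions exist (statistical software often interpolates), so results can differ slightly from this method on other data.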
Mean Deviation
The mean deviation measures the average deviation of the values from the arithmetic mean.
It gives equal weight to the deviation of every observation. The mean deviation is used in
determining the extent of the differences or variabilities among the members of a group. It is also an
indicator of how compact the group is on a certain measure.
The formula to calculate the mean deviation for the given data set is given below.
Mean Deviation = [Σ |X – x̅ |]/n
Here,
Σ represents the addition of values
X represents each value in the data set
x̅ represents the sample mean
n represents the number of data values
|| represents the absolute value, which ignores the “-” symbol
Example 1:
Determine the mean deviation for the data values 5, 3,7, 8, 4, 9.
Solution:
Given data values are 5, 3, 7, 8, 4, 9.
First, find the mean for the given data:
Mean, x̅ = ( 5+3+7+8+4+9)/6
x̅ = 36/6
x̅ = 6
Therefore, the mean value is 6.
Now, subtract the mean from each data value and take the absolute value, ignoring the minus
sign if any:
|5 – 6| = 1
|3 – 6| = 3
|7 – 6| = 1
|8 – 6| = 2
|4 – 6| = 2
|9 – 6| = 3
Now, the obtained data set is 1, 3, 1, 2, 2, 3.
Finally, find the mean value for the obtained data set
Therefore, the mean deviation is
= (1+3 + 1+ 2+ 2+3) /6
= 12/6
=2
Hence, the mean deviation for 5, 3,7, 8, 4, 9 is 2.
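The same steps can be written as a few lines of Python. This is a minimal sketch; the function name `mean_deviation` is hypothetical.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    # abs() plays the role of "ignoring the minus symbol".
    return sum(abs(x - mean) for x in values) / n

data = [5, 3, 7, 8, 4, 9]
print(mean_deviation(data))  # 2.0
```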
For grouped data,
Mean Deviation = [Σ f|X – x̅ |]/n
Here,
Σ represents the addition of values
f represents the frequency of a class interval
X represents the midpoint or class mark of a class interval
x̅ represents the sample mean
n represents the total number of observations
|| represents the absolute value, which ignores the “-” symbol
Example:
Class interval    Frequency (f)    Class mark (X)    |X – x̅| = |X – 158.31|    f|X – x̅|
The computed mean is 158.31, and the mean deviation is
$$\frac{490.8}{100} = 4.908$$
This number means that some values are greater than the mean and some are less, but on average
each value differs from the mean by 4.908 cm.
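For grouped data, each absolute deviation is weighted by its class frequency before averaging. The sketch below illustrates this with a small hypothetical table (class marks 10, 20, 30 with frequencies 2, 5, 3), because the frequency table for the 100 heights is not reproduced here. The function name is also hypothetical.

```python
def grouped_mean_deviation(class_marks, frequencies):
    """Return (mean, mean deviation) for grouped data.

    mean           = sum(f * X) / n
    mean deviation = sum(f * |X - mean|) / n, where n = sum(f)
    """
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n
    total_abs_dev = sum(f * abs(x - mean)
                        for x, f in zip(class_marks, frequencies))
    return mean, total_abs_dev / n

# Hypothetical frequency table, for illustration only.
class_marks = [10, 20, 30]
frequencies = [2, 5, 3]

mean, md = grouped_mean_deviation(class_marks, frequencies)
print(mean)  # 21.0
print(md)    # 5.4
```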
The variance is equal to the sum of the squared deviations about the mean divided by the
number of observations. The standard deviation is the square root of the average of the squares of
the deviation of each observation from the mean. It is calculated as the square root of the variance.
They are used when the mean is the preferred measure of central tendency. They show whether or
not the values are grouped closely around the mean of the distribution. The symbols for sample and
population variances are s² and σ², respectively. Variance is frequently discussed by researchers as
an indicator of how much variability there is in an entire distribution of values. The standard
deviation is used to determine how far the data are from the mean.
If the values are clustered tightly about their mean, the standard deviation is small and if the
values become more and more scattered about the mean, the standard deviation of these sets is
large.
If the data points are further from the mean, there is a higher deviation within the data set; thus, the
more spread out the data, the higher the standard deviation. A low standard deviation indicates that
the values tend to be close to the mean of the set, while a high standard deviation indicates that the
values are spread out over a wider range.
We are normally interested in knowing the population standard deviation because our population
contains all the values we are interested in. Therefore, you would normally calculate the population
standard deviation if: (1) you have the entire population or (2) you have a sample of a larger
population, but you are only interested in this sample and do not wish to generalize your findings to
the population. However, in statistics, we are usually presented with a sample from which we wish to
estimate (generalize to) a population, and the standard deviation is no exception to this. Therefore, if
all you have is a sample, but you wish to make a statement about the population standard deviation
from which the sample is drawn, you need to use the sample standard deviation. Confusion can often
arise as to which standard deviation to use due to the name "sample" standard deviation incorrectly
being interpreted as meaning the standard deviation of the sample itself and not the estimate of the
population standard deviation based on the sample.
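The practical difference between the two is only the divisor: n for the population standard deviation and n – 1 for the sample standard deviation. A minimal sketch with Python's standard `statistics` module follows; the six values are reused from the mean deviation example purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [5, 3, 7, 8, 4, 9]  # mean is 6, sum of squared deviations is 28

# Population versions divide by n = 6.
print(round(pvariance(scores), 3))  # 4.667
print(round(pstdev(scores), 3))     # 2.16

# Sample versions divide by n - 1 = 5.
print(round(variance(scores), 3))   # 5.6
print(round(stdev(scores), 3))      # 2.366
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.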
What type of data should you use when you calculate a standard deviation?
The standard deviation is used in conjunction with the mean to summarize continuous data, not
categorical data. In addition, the standard deviation, like the mean, is normally only appropriate when
the continuous data are not significantly skewed and have no outliers.
Q. A teacher sets an exam for her students. The teacher wants to summarize the results the students
attained as a mean and standard deviation. Which standard deviation should be used?
A. Population standard deviation. Why? Because the teacher is only interested in this class of
students' scores and nobody else.
Q. A researcher has recruited males aged 45 to 65 years old for an exercise training study to
investigate risk markers for heart disease (e.g., cholesterol). Which standard deviation would most
likely be used?
A. Sample standard deviation. Although not explicitly stated, a researcher investigating health related
issues will not simply be concerned with just the participants of their study; they will want to show
how their sample results can be generalized to the whole population (in this case, males aged 45 to
65 years old). Hence, the use of the sample standard deviation.
Q. One of the questions on a national census survey asks for respondents' age. Which standard
deviation would be used to describe the variation in all ages received from the census?
A. Population standard deviation. A national census is used to find out information about the
nation's citizens. By definition, it includes the whole population. Therefore, a population standard
deviation would be used.
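Sample standard deviation:
$$s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}}$$
where x̅ is the sample mean and n is the number of observations in the sample.
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
where μ is the population mean and N is the number of observations in the population.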
Example: Compute the standard deviation of the heights of 100 students in the activity.
Class interval    Frequency (f)    Class mark (X)    (X – x̅) = (X – 158.31)    (X – x̅)²
If treated as a sample:
$$s = \sqrt{\frac{1{,}461.1392}{100 - 1}} = \sqrt{14.7590} \approx 3.84$$
If treated as a population:
$$\sigma = \sqrt{\frac{1{,}461.1392}{100}} = \sqrt{14.6114} \approx 3.82$$
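The two square roots can be evaluated with a short Python sketch that starts from the sum of squared deviations used in the example (1,461.1392) and the number of students (100). The function name `std_from_sum_of_squares` is hypothetical.

```python
import math

def std_from_sum_of_squares(sum_sq_dev, n, sample=True):
    """Standard deviation from a precomputed sum of squared deviations.

    Divides by n - 1 when the data are treated as a sample,
    and by n when they are treated as a population.
    """
    divisor = n - 1 if sample else n
    return math.sqrt(sum_sq_dev / divisor)

sum_sq_dev = 1461.1392
n = 100

s = std_from_sum_of_squares(sum_sq_dev, n, sample=True)
sigma = std_from_sum_of_squares(sum_sq_dev, n, sample=False)

print(round(s, 2))      # 3.84
print(round(sigma, 2))  # 3.82
```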
Coefficient of Variation
to the value of this estimate. The lower the value of the coefficient of variation, the more precise the
estimate.
Mathematically, the standard formula for the coefficient of variation is expressed in the following
way:
$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100\%$$
where s is the standard deviation and x̅ is the mean.
Two sets of data with known means and standard deviations may be compared quantitatively by
taking the coefficient of variation of each group.
Example 1. Suppose a set of data has mean = 32 and s = 5, and another set has mean = 26 and s = 4.
For the first set,
$$CV = \frac{5}{32} \times 100 \approx 15.63\%$$
For the second set,
$$CV = \frac{4}{26} \times 100 \approx 15.38\%$$
Since the CV of the second group is smaller, the second group is more consistent than the first. While its
mean is a little lower than that of the first group, its values are less variable relative to the mean.
Thus, a high mean does not always imply a better set of values. The standard deviation, together with
the mean, gives a better description of the set of data.
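The comparison in Example 1 can be reproduced with a small Python sketch. The function name `coefficient_of_variation` is hypothetical, and the check for a positive mean is an added safeguard rather than part of the formula.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean <= 0:
        raise ValueError("the coefficient of variation needs a positive mean")
    return std_dev / mean * 100

cv_first = coefficient_of_variation(5, 32)
cv_second = coefficient_of_variation(4, 26)

print(round(cv_first, 3))    # 15.625
print(round(cv_second, 2))   # 15.38
print(cv_second < cv_first)  # True: the second set is relatively less variable
```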
Example 2.
A researcher is comparing two multiple-choice tests with different conditions. In the first test, a
typical multiple-choice test is administered. In the second test, alternative choices (i.e. incorrect
answers) are randomly assigned to test takers. The results from the two tests are:
        Regular Test    Randomized Answers
SD      10.2            12.7
Trying to compare the two test results is challenging. Comparing standard deviations doesn’t really
work, because the means are also different. Calculating the coefficient of variation helps to make
sense of the data:
        Regular Test    Randomized Answers
SD      10.2            12.7
Looking at the standard deviations of 10.2 and 12.7, you might think that the tests have similar
results. However, when you adjust for the difference in the means, the difference in variability
becomes much clearer:
Regular test: CV = 17.03%
Randomized answers: CV = 28.35%
The coefficient of variation can also be used to compare variability between different measures. For
example, you can compare IQ scores to scores on the Woodcock-Johnson III Tests of Cognitive
Abilities.
Note: The Coefficient of Variation should only be used to compare positive data on a ratio scale. The CV has little or no
meaning for measurements on an interval scale. Examples of interval scales include temperatures in Celsius or Fahrenheit,
while the Kelvin scale is a ratio scale that starts at zero and cannot, by definition, take on a negative value (0 degrees Kelvin
is the absence of heat).
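A short Python sketch shows why the note matters: the same three temperatures give very different coefficients of variation depending on the scale, because shifting the zero point changes the mean but not the standard deviation. The temperature values are hypothetical, and the helper name `cv` is an assumption for illustration.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation in percent, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100

celsius = [10.0, 20.0, 30.0]            # hypothetical readings
kelvin = [c + 273.15 for c in celsius]  # the same readings on a ratio scale

print(round(cv(celsius), 2))  # 50.0
print(round(cv(kelvin), 2))   # 3.41
```

Only the Kelvin figure is meaningful; the Celsius figure depends on where that scale happens to place zero.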
Assessment Task
From the same data considered in the activity, organize the heights of the 100 students into a
frequency distribution with a class interval of 5. The highest value must be the upper limit of the
highest class interval.