• Central tendency refers to the statistical measure that identifies the
centre or typical value of a dataset.
• It is the value around which the data points tend to cluster.
• It summarizes a dataset by providing a single value that represents the
middle or "centre" of the data.
• Mean (Arithmetic Average):
• What it tells you: The mean is the arithmetic average of a dataset: the sum of all
the values divided by the total number of observations.
• When to use it:
• When the data lacks outliers.
• When you need to calculate a general average for comparison or prediction
(e.g., calculating the average score of students, or average sales per month).
• How it informs understanding: The mean provides a summary measure that
represents the "typical" value of the dataset. However, it's sensitive to extreme
values, which can skew the mean and make it unrepresentative of the data as a
whole when outliers are present.
• Example: If most students in a class scored between 60 and 80 on a test, but one
student scored 200, the mean will be pulled up toward that outlier, even
though most students scored in the lower range.
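The effect of the outlier in the example above can be checked with a minimal sketch. The individual scores are hypothetical values chosen to fall between 60 and 80; only the outlier of 200 comes from the example.

```python
from statistics import mean

# Hypothetical test scores (assumed for illustration): all between 60 and 80.
scores = [62, 65, 70, 72, 75, 78, 80]

# The same class with the single extreme score of 200 from the example.
scores_with_outlier = scores + [200]

mean_without = mean(scores)              # average of the typical scores
mean_with = mean(scores_with_outlier)    # average once the outlier is included

print(f"Mean without outlier: {mean_without:.2f}")
print(f"Mean with outlier:    {mean_with:.2f}")
```

With these assumed scores, the mean with the outlier ends up above 80, higher than every typical score in the class.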
• Median (Middle Value):
• What it tells you: The median is the middle value in a dataset when it is sorted
in ascending or descending order. If there is an even number of
observations, it is the average of the two middle values.
• When to use it:
• When the data contains outliers that may distort the mean.
• When you want to understand the central position of the data without being
affected by extreme values.
• How it informs understanding: The median represents the middle value,
so it gives you an indication of the typical value that divides the dataset
into two halves.
• Example: In a dataset of house prices where most homes are priced
between $200,000 and $400,000, but one house is priced at $10 million,
the median would provide a better understanding of the central price of
homes, as it is barely affected by the outlier.
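The house price example can be sketched as follows. The individual prices are hypothetical values within the stated range; only the $10 million outlier comes from the example.

```python
from statistics import mean, median

# Hypothetical house prices in dollars (assumed for illustration);
# the last value is the $10 million outlier from the example.
prices = [200_000, 250_000, 300_000, 350_000, 400_000, 10_000_000]

# Even number of observations: the median is the average of the two middle values.
median_price = median(prices)
mean_price = mean(prices)

print(f"Median price: {median_price:,.0f}")
print(f"Mean price:   {mean_price:,.0f}")
```

Here the median stays inside the $200,000 to $400,000 range, while the mean is pulled far above it.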
• Mode (Most Frequent Value):
• What it tells you: The mode is the value that appears most frequently in the
dataset. A dataset may have no mode, one mode, or multiple modes (bimodal or
multimodal).
• When to use it:
• When identifying the most common value or category in categorical or
discrete data.
• When the data is non-numeric (e.g., the most common colour or the most
frequent category in survey responses).
• How it informs understanding: The mode helps identify the most frequent
occurrence in a dataset, which can be useful when analyzing categorical data or
when you're interested in the most common event or occurrence.
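A short sketch of finding the mode, assuming Python 3.8 or later for statistics.multimode. The survey responses and numbers are hypothetical. Note that when every value occurs equally often, multimode returns all of them, which corresponds to the "no mode" case described above.

```python
from statistics import multimode

# Hypothetical survey responses (non-numeric, categorical data).
colours = ["red", "blue", "red", "green", "blue", "red"]
colour_modes = multimode(colours)      # one mode

# Hypothetical discrete data with two values tied for most frequent (bimodal).
values = [1, 2, 2, 3, 3, 4]
value_modes = multimode(values)        # two modes

# Every value occurs once: all values are returned, i.e. no single mode.
unique_values = [1, 2, 3]
unique_modes = multimode(unique_values)

print(colour_modes)
print(value_modes)
print(unique_modes)
```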
• Why It's Important to Choose the Appropriate Measure
• The choice of central tendency measure can dramatically impact your
interpretation of the data and subsequent conclusions.
1. Sensitivity to Skewed Data and Outliers:
o The mean is sensitive to outliers and skewed distributions. If your data has
extreme values, the mean might not be representative of the majority of your
data.
o The median, however, is robust to outliers and gives a better sense of the
central position in skewed datasets.
o The mode helps identify trends and common occurrences, but it doesn't
necessarily represent the centre of the data, especially in continuous datasets.
2. Type of Data:
For interval or ratio data (where the differences between values are meaningful, e.g., height, weight, age),
the mean and median are appropriate. The mean is typically preferred if the data is symmetric, while the
median is better for skewed distributions.
For nominal (categorical) data (e.g., colour, type of fruit, gender), the mode is the only valid measure of
central tendency, since it identifies the most common category.
3. Interpretation and Comparison:
The choice of central tendency measure impacts how you interpret data, especially when comparing
multiple datasets. For example:
o If you compare the average income of two cities and one city has a few ultra-wealthy individuals, using
the mean may mislead you, as it could be inflated. The median income would provide a more accurate
reflection of the typical income in each city.
o If you're comparing exam scores where the majority of students perform similarly, the mean gives a
good overall picture, but if a few students performed poorly, the median could show you the
performance of the majority without being skewed by extreme cases.
• Quartiles are values that divide a dataset into four equal parts, each
containing 25% of the data.
• The Quartile Points
There are three main quartile points in a dataset, which divide it into
four sections. These are usually labelled as follows:
• Q1 (First Quartile):
The 25th percentile of the data.
This is the value below which 25% of the data falls.
Q1 marks the boundary between the lowest 25% of data and the rest.
• Q2 (Second Quartile or Median):
The 50th percentile, also known as the median.
This is the value below which 50% of the data falls, meaning it divides the
dataset in half.
Q2 represents the midpoint of the data.
• Q3 (Third Quartile):
The 75th percentile of the data.
This is the value below which 75% of the data falls.
Q3 marks the boundary between the lowest 75% of data and the top 25%.
• Q4 is sometimes mentioned, but it simply represents the maximum value in the
dataset (100th percentile).
• Example
• Consider the dataset: 2, 4, 7, 10, 12, 15, 18, 20, 22
1. Arrange the data in ascending order (already done).
2. Find Q2 (Median):
o Since there are 9 values, the median (Q2) is the 5th value: Q2 = 12.
3. Find Q1:
o The lower half of the data (the values below the median) is: 2, 4, 7, 10
o The median of the lower half (Q1) is the average of 4 and 7: Q1 = 5.5.
4. Find Q3:
o The upper half of the data (the values above the median) is: 15, 18, 20, 22
o The median of the upper half (Q3) is the average of 18 and 20: Q3 = 19.
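The steps in this example can be sketched as a small function. The name quartiles_by_halves is hypothetical, and the sketch assumes the convention used above: for an odd number of values, the median itself is left out of both halves. Other conventions, such as interpolation-based percentiles, can give slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values, so that both halves are non-empty.
    """
    data = sorted(values)          # step 1: arrange in ascending order
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles_by_halves([2, 4, 7, 10, 12, 15, 18, 20, 22])
print(q1, q2, q3)
```

For the dataset above, this reproduces Q1 = 5.5, Q2 = 12 and Q3 = 19.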
• Quartile Deviation (or Semi-Interquartile Range)
• The Quartile Deviation, also known as the Semi-Interquartile Range, is a measure of the spread
of the middle 50% of the data. It is half the difference between the third quartile (Q3) and
the first quartile (Q1).
Formula:
• Quartile Deviation = (Q3 − Q1) / 2
• where:
o Q1 is the 25th percentile, or the value below which 25% of the data falls.
o Q3 is the 75th percentile, or the value below which 75% of the data falls.
Example: For the dataset: 1, 3, 5, 7, 9, 11, 13
o Q1 = 3
o Q3 = 11
o Quartile Deviation = (11-3)/2=4
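A minimal sketch of this calculation, assuming Python 3.8 or later. It uses statistics.quantiles with its default "exclusive" method, which gives Q1 = 3 and Q3 = 11 for this dataset; other quartile conventions may give different values for other data. The function name quartile_deviation is hypothetical.

```python
from statistics import quantiles

def quartile_deviation(values):
    """Return (Q3 - Q1) / 2, the semi-interquartile range."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _q2, q3 = quantiles(values, n=4)
    return (q3 - q1) / 2

qd = quartile_deviation([1, 3, 5, 7, 9, 11, 13])
print(qd)
```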
Pros:
o Less affected by outliers, as it focuses on the middle 50% of data.
o Gives a more robust measure of dispersion compared to the range.
Cons:
o Does not consider data points outside the middle 50%, so it may overlook
certain aspects of variability.
• Range
• The range is the simplest measure of dispersion. It shows the
difference between the maximum and minimum values in a dataset.
Formula:
• Range = Maximum Value − Minimum Value
Example: Suppose we have a dataset: 2, 5, 7, 10, 12.
o Maximum Value = 12
o Minimum Value = 2
o Range = 12 - 2 = 10
Pros:
o Easy to calculate.
o Provides a quick sense of the spread of the data.
Cons:
o Highly sensitive to outliers (extreme values).
o Does not provide information about how values are spread between the
minimum and maximum.
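The range calculation and its sensitivity to outliers can be sketched as follows. The function name value_range and the added extreme value of 100 are assumptions made for illustration; the dataset itself is the one from the example.

```python
def value_range(values):
    """Return the difference between the maximum and minimum values."""
    return max(values) - min(values)

data = [2, 5, 7, 10, 12]
data_range = value_range(data)

# Adding one hypothetical extreme value changes the range drastically,
# even though the other five values are unchanged.
range_with_outlier = value_range(data + [100])

print(data_range)
print(range_with_outlier)
```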