Summarizing Data

Chapter Three discusses methods of summarizing data, focusing on measures of central tendency such as mean, median, and mode, as well as measures of variation like range and standard deviation. It highlights the importance of understanding these statistics for effective data analysis, including their advantages and disadvantages. Additionally, it covers concepts like skewness and the use of box-and-whisker plots to visualize data distribution.

Uploaded by

eldana.endale77

CHAPTER THREE

SUMMARIZING DATA
Mengistu Y.

4/25/2025
Objectives
• At the end of this lesson, the student will be able to:
• Identify the different methods of data summarization
• Compute appropriate summary values for a set of data
• Appreciate the properties and limitations of summary
values

Summary Measures

Describing Data Numerically

Central Tendency     Variation                   Shape
Arithmetic Mean      Range                       Skewness
Median               Interquartile Range
Mode                 Variance
Geometric Mean       Standard Deviation
Quartiles            Coefficient of Variation

MEASURES OF CENTRAL TENDENCY

• The tendency of statistical data to concentrate at certain values is called the "central tendency" or "average".
• Mean
• Median
• Mode

The Arithmetic Mean or Simple Mean
• The mean is the average of the numbers: add up all the numbers, then divide by how many numbers there are.
• It is written in statistical terms as:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Example 1: What is the Mean of these numbers? 6, 11, 7
• Add the numbers: 6 + 11 + 7 = 24
• Divide by how many numbers (there are 3 numbers): 24 / 3 = 8
• The Mean is 8
Why Does This Work?
• It is because 6, 11 and 7 added together is the same as 3 lots of 8: 6 + 11 + 7 = 8 + 8 + 8 = 24.
• It is like you are "flattening out" the numbers.

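The "add up, then divide" recipe can be sketched in a few lines of Python. This is only an illustration; the function name `arithmetic_mean` is an assumption introduced here, not something from the slides.

```python
def arithmetic_mean(values):
    """Add up all the numbers, then divide by how many numbers there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Example 1 from the slide: 6 + 11 + 7 = 24, and 24 / 3 = 8
print(arithmetic_mean([6, 11, 7]))  # 8.0
```
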
Example 2
• Birth weights (g) of all live-born infants born at a private hospital in a city during a 1-week period.
• What is the arithmetic mean for the sample birth weights?

Weighted Mean
• When averaging quantities, it is often necessary to account for the fact that not all of them are equally important in the phenomenon being described.
• To give the quantities being averaged their proper degree of importance, it is necessary to assign them measures of relative importance, called weights, and then calculate a weighted mean.

• The weighted mean of a set of numbers X1, X2, …, Xn, whose relative importance is expressed numerically by a corresponding set of numbers w1, w2, …, wn, is given by
$\bar{X}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$

• Example: In a given drug shop, four different drugs were sold at unit prices of 60, 85, 95 and 50 birr, and the numbers of units sold were 10, 10, 5 and 20, respectively. What is the average price of the four drugs in this drug shop?
• Solution: We use the weighted mean, taking the number of units sold as the weight for each drug's price:
Weighted mean $= \frac{60 \times 10 + 85 \times 10 + 95 \times 5 + 50 \times 20}{10 + 10 + 5 + 20} = \frac{2925}{45} = 65$ birr
• If we don't consider the weights, the average price will be (60 + 85 + 95 + 50) / 4 = 72.5 birr.

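As a rough sketch, the drug shop calculation can be reproduced in Python. The helper name `weighted_mean` and the variable names are illustrative assumptions; the figures are the ones given in the example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


prices = [60, 85, 95, 50]      # unit prices in birr
units_sold = [10, 10, 5, 20]   # weights: number of units sold

print(weighted_mean(prices, units_sold))  # 65.0
print(sum(prices) / len(prices))          # 72.5 (unweighted mean)
```

The gap between 65 and 72.5 arises because the cheapest drug (50 birr) accounts for the most sales, so it pulls the weighted mean down.
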
Weighted Mean
• We can also calculate a weighted mean using some weighting factor:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
e.g. What is the average income of all people in cities A, B, and C?
City   Avg. Income   Population
A      $23,000       100,000
B      $20,000       50,000
C      $25,000       150,000
Here, population is the weighting factor and the average income is the variable of interest.

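The slide does not state the answer to the city example. Working it through with the weighted mean formula, taking each city's population as the weight $w_i$ and its average income as $x_i$, would give:

```latex
\bar{x}
  = \frac{(100{,}000)(23{,}000) + (50{,}000)(20{,}000) + (150{,}000)(25{,}000)}
         {100{,}000 + 50{,}000 + 150{,}000}
  = \frac{7{,}050{,}000{,}000}{300{,}000}
  = 23{,}500
```

So the average income across all people in the three cities is $23,500. The unweighted average of the three city figures, (23,000 + 20,000 + 25,000) / 3 ≈ $22,667, would understate it because the largest city also has the highest average income.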
Geometric Mean
• The Geometric Mean is a special type of average where we multiply
the numbers together and then take a square root (for two numbers),
cube root (for three numbers) etc.
Example: What is the Geometric Mean of 2 and 18?
• First we multiply them: 2 × 18 = 36
• Then (as there are two numbers) take the square root: √36 = 6
• Geometric Mean of 2 and 18 = √(2 × 18) = 6
• It is like the area is the same: a 2 × 18 rectangle has the same area as a 6 × 6 square.

Example: What is the Geometric Mean of 10, 51.2 and 8?
• First we multiply them: 10 × 51.2 × 8 = 4096
• Then (as there are three numbers) take the cube root: ∛4096 = 16
• For n numbers: multiply them all together and then take the nth root (written ⁿ√ ).
• Geometric Mean = ∛(10 × 51.2 × 8) = 16
• It is like the volume is the same: a 10 × 51.2 × 8 box has the same volume as a 16 × 16 × 16 cube.

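The "multiply, then take the nth root" rule might be sketched as follows. The function name `geometric_mean` is an assumption made for illustration, and the sketch assumes all values are positive. Floating-point roots are approximate, so the results are rounded for display.

```python
import math


def geometric_mean(values):
    """Multiply the n values together, then take the nth root."""
    if not values:
        raise ValueError("the geometric mean of an empty data set is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    return math.prod(values) ** (1 / len(values))


print(round(geometric_mean([2, 18]), 6))        # 6.0
print(round(geometric_mean([10, 51.2, 8]), 6))  # 16.0
```
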
Characteristics of mean
• The value of the arithmetic mean is determined by every
item in the series.
• It is greatly affected by extreme values.
Advantages
• It is based on all values given in the distribution.
• It is most easily understood.
• It is most amenable to algebraic treatment.

Disadvantages
• It may be greatly affected by extreme items, and its usefulness as a "summary of the whole" may be considerably reduced.
• When the distribution has open-ended classes, its computation must be based on an assumption about those classes, and therefore may not be valid.

Median
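• Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:
• The sample median is
   • the ((n + 1)/2)th largest observation if n is odd
   • the average of the (n/2)th and (n/2 + 1)th largest observations if n is even
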
Example 2
2.1. Compute the sample median for the birth weight data from the arithmetic mean example.
2.2. Consider the following data, which consist of white blood counts taken on admission of all patients entering a small hospital on a given day. Compute the median white-blood count (×10³).
7, 35, 5, 9, 8, 3, 10, 12, 8

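A minimal sketch of the usual odd/even median rule, applied to the white-blood-count data of item 2.2. The function name `sample_median` is an illustrative assumption.

```python
def sample_median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # n odd: the ((n + 1)/2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: average the middle pair


white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]  # units of 10^3
print(sorted(white_blood_counts))         # [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(sample_median(white_blood_counts))  # 8
```

With nine observations the median is the fifth ordered value, 8 (×10³). The mean of the same data is 97 / 9 ≈ 10.8, pulled upward by the single extreme count of 35.
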
i) Characteristics of Median
• It is an average of position/location.
• It is affected more by the number of items than by extreme values.

ii) Advantages
• It is easily calculated and is not much disturbed by extreme values.
• It is more typical of the series.
• The median may be located even when the data are incomplete, e.g., when the class intervals are irregular and the final classes have open ends.

iii) Disadvantages
• It is determined mainly by the middle points in a sample and is less sensitive to the actual numerical values of the remaining data points.
• It is not as generally familiar as the arithmetic mean.

Mode
• It is the value of the observation that occurs with the greatest
frequency.
• A particular disadvantage is that, with a small number of
observations, there may be no mode.
• In addition, sometimes, there may be more than one mode
such as when dealing with a bimodal (two-peak) distribution.
• Find the modal values for the following data
a) 22, 66, 69, 70, 73. (No modal value)
b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
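One way to check these two answers in Python is sketched below. The function name `modes` is an assumption, and the sketch follows the slide's convention that a data set in which no value repeats has no mode (so it returns an empty list in that case).

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values(), default=0)
    if highest < 2:
        return []  # every value occurs once: no modal value
    return [value for value, count in counts.items() if count == highest]


print(modes([22, 66, 69, 70, 73]))                                # []
print(modes([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]))  # [3.0]
```

Returning a list also covers the bimodal case, where two values share the highest frequency.
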

Mode
Characteristics
• It is an average of position
• It is not affected by extreme values
• It is the most typical value of the distribution
Advantages
• Since it is the most typical value it is the most descriptive
average
• Since the mode is usually an “actual value”, it indicates the
precise value of an important part of the series.
Disadvantages
• Unless the number of items is fairly large and the distribution reveals a distinct central tendency, the mode has no significance.
• It is not capable of mathematical treatment.
• In a small number of items the mode may not exist.

Skewness
• If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores. Based on the type of skewness, distributions can be:
• Negatively skewed distribution: occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.
• Positively skewed distribution: occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.
• Symmetrical distribution: neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.

Skewness…
• Data can be "skewed", meaning it tends to have a long tail on one side or the other.
Negative Skew
• Why is it called negative skew? Because the long "tail" is on the negative side of the peak.
• The mean is also on the left of the peak.

Skewness…
The Normal Distribution has No Skew
A Normal Distribution is not skewed.
It is perfectly symmetrical.
And the Mean is exactly at the peak.

Skewness…
Positive Skew
And positive skew is when the long tail is on the
positive side of the peak, and some people say it
is "skewed to the right".
The mean is on the right of the peak value.
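Because extreme values pull the mean toward the long tail, a quick and rough check of skew direction is to compare the mean with the median. The sketch below uses the median as a stand-in for the peak, which is an assumption and only a rule of thumb (it can mislead for unusual distributions). The function name `skew_direction` is also an assumption.

```python
from statistics import mean, median


def skew_direction(values):
    """Rough direction of skew, judged by whether the mean sits above or below the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive (right) skew"
    if m < med:
        return "negative (left) skew"
    return "no skew indicated"


# White-blood-count data from the median example: one extreme high value (35)
print(skew_direction([7, 35, 5, 9, 8, 3, 10, 12, 8]))  # positive (right) skew
```
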

Measures of Dispersion
• Which of the distributions of scores has the larger dispersion?
[Figure: two bar charts of score distributions, horizontal axis 1 to 10, vertical axis 0 to 125]
• The upper distribution has more dispersion because the scores are more spread out.

Measures of Dispersion

• How "spread out" are the numbers about the centre?
• Consider the following data sets:
         Values                      Mean
Set 1:   60 40 30 50 60 40 70 50     50
Set 2:   50 49 49 51 48 50 53 50     50
• The two data sets above both have a mean of 50, but set 1 is obviously more "spread out" than set 2. How do we express this numerically?
• Some of the commonly used measures of dispersion (variation) are: range, interquartile range, quartiles, percentiles, variance, standard deviation and coefficient of variation.

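To make the comparison concrete, the sketch below computes two of the listed measures, the range and the sample standard deviation, for both sets. The helper `value_range` is an illustrative name; `mean` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, stdev

set_1 = [60, 40, 30, 50, 60, 40, 70, 50]
set_2 = [50, 49, 49, 51, 48, 50, 53, 50]


def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    print(name, "mean:", mean(data),
          "range:", value_range(data),
          "SD:", round(stdev(data), 2))
```

Both sets have a mean of 50, but set 1 has a range of 40 and a standard deviation of about 13.09, against a range of 5 and a standard deviation of about 1.51 for set 2.
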
Range and Interquartile Range
• Range
   • Simplest and crudest measure of variation
   • Difference between the largest and the smallest observations: Range = Xlargest – Xsmallest
   • Ignores the way in which data are distributed
   • Wastes information, because it takes no account of the values between the two extremes
   • Sensitive to outliers
• Interquartile Range
   • Eliminates some high- and low-valued observations and calculates the range from the remaining values
   • Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1

Quartiles and Percentiles

• Quartiles: the quartiles divide the ordered distribution into four equal parts.
• Deciles: if data are ordered and divided into 10 equal parts, the cut points are called deciles.
• Percentiles: if data are ordered and divided into 100 equal parts, the cut points are called percentiles.

Quartiles
• The 25th percentile is often referred to as the first quartile and denoted Q1.
• The 50th percentile (the median) is referred to as the second or middle quartile and written Q2.
• The 75th percentile is referred to as the third quartile, Q3.
• When we wish to find the quartiles for a set of data, the following formulas are used:

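Several conventions exist for locating quartiles, and the slide's own formulas are not shown here. As an assumption, the sketch below uses one common convention: Qk sits at position k(n + 1)/4 in the ordered data, with linear interpolation between neighbours. This is what `statistics.quantiles` does with `method="exclusive"`. The function name `quartiles` is illustrative.

```python
from statistics import quantiles


def quartiles(values):
    """Return (Q1, Q2, Q3), assuming positions k(n + 1)/4 in the ordered data."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return q1, q2, q3


white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]  # units of 10^3
q1, q2, q3 = quartiles(white_blood_counts)
print(q1, q2, q3)  # 6.0 8.0 11.0
print(q3 - q1)     # interquartile range: 5.0
```

Other conventions can give slightly different quartiles for small samples, so results should be compared only when the same method is used.
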
Using the Five-Number Summary to Explore the Shape
• Box-and-Whisker Plot: a graphical display of data using the 5-number summary:

Minimum, Q1, Median, Q3, Maximum

• The box and central line are centered between the endpoints if data are symmetric around the median.

[Figure: box-and-whisker plot marking Min, Q1, Median, Q3, Max]

Distribution Shape and Box-and-Whisker Plot

[Figure: box-and-whisker plots for left-skewed, symmetric and right-skewed distributions, each marking Q1, Q2, Q3]

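A sketch of the five numbers behind a box-and-whisker plot, using a small symmetric data set (ten family sizes). The function name `five_number_summary` is an assumption, and the quartile positions k(n + 1)/4 (`method="exclusive"` in `statistics.quantiles`) are one convention among several.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, med, q3, max(values)


family_sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
low, q1, med, q3, high = five_number_summary(family_sizes)
print(low, q1, med, q3, high)  # 3 3.75 5.0 6.25 7

# For symmetric data the median sits midway between Q1 and Q3,
# and the two whiskers have the same length.
print(med - q1 == q3 - med)    # True
print(q1 - low == high - q3)   # True
```

For a right-skewed data set, the upper half of the box and the upper whisker would be longer than the lower ones; for a left-skewed data set, the reverse.
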
Standard Deviation and Variance
• The standard deviation and variance show the scatter of the individual measurements around the mean of all the measurements in a given distribution.
• The variance represents squared units and, therefore, is not an appropriate measure of dispersion when we wish to express this concept in terms of the original units.
• To obtain a measure of dispersion in original units, we merely take the square root of the variance. The result is called the standard deviation.
• Variance is the average of the squared differences from the mean.
• Standard deviation is the square root of the variance.

Variance and Standard Deviation
Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
SD $= \sqrt{\text{variance}}$

To calculate standard deviation
1. Calculate the mean: $\bar{x}$
2. Calculate the residual for each x: $x - \bar{x}$
3. Square the residuals: $(x - \bar{x})^2$
4. Calculate the sum of the squares: $\sum (x - \bar{x})^2$
5. Divide the sum in Step 4 by (n-1): $\frac{\sum (x - \bar{x})^2}{n - 1}$
6. Take the square root of the quantity in Step 5: $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Example: Find the Standard Deviation of Ungrouped Data

Family No.   1   2   3   4   5   6   7   8   9   10
Size (xi)    3   3   4   4   5   5   6   6   7   7

Here, $\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5$

Family No.     1    2    3    4    5    6    7    8    9    10   Total
xi             3    3    4    4    5    5    6    6    7    7    50
xi - x̄        -2   -2   -1   -1    0    0    1    1    2    2    0
(xi - x̄)²      4    4    1    1    0    0    1    1    4    4    20

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} \approx 2.22$, so $s = \sqrt{2.22} \approx 1.49$

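The six-step procedure for the sample standard deviation can be sketched directly in Python and checked against the family-size example. The function name `sample_standard_deviation` is an illustrative assumption.

```python
import math


def sample_standard_deviation(values):
    """Follow the six steps: mean, residuals, squares, sum, divide by n - 1, root."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n                      # Step 1: the mean
    residuals = [x - mean for x in values]      # Step 2: residual for each x
    squared = [r ** 2 for r in residuals]       # Step 3: square the residuals
    sum_of_squares = sum(squared)               # Step 4: sum of the squares
    variance = sum_of_squares / (n - 1)         # Step 5: divide by n - 1
    return math.sqrt(variance)                  # Step 6: take the square root


family_sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
print(round(sample_standard_deviation(family_sizes), 2))  # 1.49
```

Carrying full precision gives a variance of 20 / 9 ≈ 2.22 and a standard deviation of about 1.49. Rounding the variance to 2.2 before taking the square root would give 1.48 instead.
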
Example
• The lengths of five newborn babies are: 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.
• Find the mean, the variance, and the standard deviation.
• Your first step is to find the mean:
• Answer:
• Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
• So the mean (average) length is 394 mm.

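To calculate the variance, take each difference from the mean, square it, and then average the results:
Variance: $\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108{,}520}{5} = 21{,}704$

Standard Deviation

σ = √21,704 = 147.32... ≈ 147 (to the nearest mm)
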
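This example averages the squared differences over all 5 values, which is the population formula (divide by N). As a hedged comparison, the sketch below uses Python's standard `statistics` module to compute both the population version and the sample version (divide by n - 1) for the same five lengths. The variable names are illustrative.

```python
from statistics import pstdev, pvariance, stdev, variance

lengths_mm = [600, 470, 170, 430, 300]

pop_var = pvariance(lengths_mm)   # 108520 / 5 = 21704 (divide by N)
pop_sd = pstdev(lengths_mm)       # square root of 21704, about 147.32
samp_var = variance(lengths_mm)   # 108520 / 4 = 27130 (divide by n - 1)
samp_sd = stdev(lengths_mm)       # square root of 27130, about 164.71

print(pop_var, round(pop_sd, 2))
print(samp_var, round(samp_sd, 2))
```

Which divisor is appropriate depends on whether the five values are treated as the whole population or as a sample from a larger one.
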
Coefficient of Variation

• Measures relative variation
• Always expressed as a percentage (%)
• Shows variation relative to the mean
• Can be used to compare two or more sets of data measured in different units
$CV = \left( \frac{S}{\bar{X}} \right) \times 100\%$

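A minimal sketch of the coefficient of variation, assuming S is the sample standard deviation (n - 1 divisor). The function name `coefficient_of_variation` is illustrative; the two data sets are the family sizes and the newborn lengths used in the earlier examples.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return stdev(values) / m * 100


family_sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]   # persons
lengths_mm = [600, 470, 170, 430, 300]           # millimetres

print(round(coefficient_of_variation(family_sizes), 1))  # 29.8
print(round(coefficient_of_variation(lengths_mm), 1))    # 41.8
```

Because the result is unit-free, the two figures can be compared directly even though one data set counts persons and the other measures millimetres: relative to its mean, the length data is the more variable of the two.
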
Thank you!
