Analyzing Distributions
Percentiles
Quartiles
z-Scores
Empirical Rule
Identifying Outliers
Box Plots
© 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part, except for use as permitted in a license distributed with a certain
product or service or otherwise on a password-protected website for classroom use.
Analyzing Distributions
Percentiles
• Value of a variable at which a specified (approximate) percentage of
observations are below that value
• The pth percentile tells us the point in the data where:
• Approximately p percent of the observations have values less than the pth
percentile
• Approximately (100 – p) percent of the observations have values greater than
the pth percentile
Analyzing Distributions
• Steps to calculate the pth percentile:
• Arrange the data in ascending order (smallest to largest value)
• Compute k = (n + 1) × p, with p written as a proportion (for example, p = 0.85 for the 85th percentile)
• Divide k into its integer component, i, and its decimal component, d
• If d = 0, the value in position k of the sorted data is the pth percentile
• If d > 0, the percentile is between the values in positions i and i + 1 in the sorted data;
to find this percentile, we must interpolate between these two values:
i. Calculate the difference between the values in positions i and i + 1 in the sorted data
set; we define this difference between the two values as m
ii. Multiply this difference by d: t = m × d
iii. To find the pth percentile, add t to the value in position i of the sorted data
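The interpolation steps above can be condensed into a single expression. As a sketch, write $x_{(i)}$ for the value in position i of the sorted data; this order-statistic notation is an assumption introduced here and is not used in the slides.

```latex
k = (n + 1)\,p = i + d,
\qquad
\text{$p$th percentile} = x_{(i)} + d\,\bigl(x_{(i+1)} - x_{(i)}\bigr)
```

When d = 0 the second term vanishes and the expression reduces to the value in position k, so both cases are covered by the same formula.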
Analyzing Distributions
• Illustration
• To determine the 85th percentile for the home sales data in Table 2.9.
1. Arrange the data in ascending order
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05
3. Dividing 11.05 into its integer and decimal components gives i = 11 and d = 0.05
Because d > 0, interpolate between the values in the 11th and 12th positions in the sorted data
Analyzing Distributions
Illustration (contd.)
• To determine the 85th percentile for the home sales data in Table 2.9
• The value in the 11th position is 298,000
• The value in the 12th position is 456,250
m = 456,250 – 298,000 = 158,250
t = m × d = 158,250 × 0.05 = 7,912.5
85th percentile = 298,000 + 7,912.5 = 305,912.5
$305,912.50 is the 85th percentile of the home sales data
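The worked example can be checked with a short script. This is a minimal sketch of the (n + 1) × p procedure, not a definitive implementation: the function name `percentile` is hypothetical, p is passed as a proportion as in the illustration, and the clamping of positions outside 1..n is an assumption the slides do not address.

```python
import math

def percentile(values, p):
    """Percentile by the (n + 1) * p position rule, with p as a proportion (0 < p < 1)."""
    data = sorted(values)              # step 1: ascending order
    n = len(data)
    k = (n + 1) * p                    # step 2: position in the sorted data
    i = math.floor(k)                  # integer component
    d = k - i                          # decimal component
    # Assumption (not in the slides): clamp positions that fall outside 1..n.
    if i < 1:
        return data[0]
    if i >= n:
        return data[-1]
    lower, upper = data[i - 1], data[i]    # positions i and i + 1 (1-based)
    return lower + d * (upper - lower)     # when d = 0 this is just position i

# Home sales data from the illustration
home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]

p85 = percentile(home_sales, 0.85)
print(round(p85, 2))
```

The result should agree with the hand calculation of 305,912.5, up to floating-point rounding.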
Analyzing Distributions
Quartiles
• When the data is divided into four equal parts:
• Each part contains approximately 25% of the observations
• Division points are referred to as quartiles
• $Q_1$ = first quartile, or 25th percentile
• $Q_2$ = second quartile, or 50th percentile (also the median)
• $Q_3$ = third quartile, or 75th percentile
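As a sketch, the three quartiles of the home sales data can be computed with Python's standard library. The choice of `statistics.quantiles` with `method="exclusive"` is an assumption made here because that method places each cut point at position (n + 1) × p, the same rule used for percentiles in these slides; other tools may use a different convention and give slightly different values.

```python
from statistics import quantiles

# Home sales data from the percentile illustration
home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(home_sales, n=4, method="exclusive")

print("Q1:", q1)   # 25th percentile
print("Q2:", q2)   # 50th percentile (median)
print("Q3:", q3)   # 75th percentile
```

Working the same rule by hand for Q1 gives k = 13 × 0.25 = 3.25, so Q1 = 138,000 + 0.25 × (142,000 – 138,000) = 139,000.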
Analyzing Distributions
z-score
• Measures the relative location of a value in the data set
• Helps to determine how far a particular value is from the mean
relative to the data set’s standard deviation
• Standardized value
• If $x_1, x_2, \ldots, x_n$ is a sample of n observations:
$z_i = \frac{x_i - \bar{x}}{s}$
• $z_i$ = z-score for $x_i$
• $\bar{x}$ = sample mean
• s = sample standard deviation
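The definition above translates directly into code. The following is a minimal sketch, assuming the home sales data from the percentile illustration as input; the helper name `z_scores` is hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return the z-score of every observation: (x - sample mean) / sample std dev."""
    x_bar = mean(values)
    s = stdev(values)          # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in values]

home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]

for price, z in zip(home_sales, z_scores(home_sales)):
    print(f"{price:>9,}  z = {z:+.4f}")
```

Values above the sample mean of 219,937.50 get positive z-scores and values below it get negative ones.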
Table 2.13: z-Scores for the Class Size Data
• For class size data, $\bar{x}$ = 44 and s = 8
• For observations with a value > mean, z-score > 0
• For observations with a value < mean, z-score < 0
Figure 2.20: Calculating z-Scores for the Home Sales Data in Excel
Why standard scores? For fair comparison
                     Class 1                  Class 2
                     Exam score   z-score     Exam score   z-score
                     72            1.2903     80           -1.1795
                     55            0.2701     85           -0.4423
                     40           -0.6301     92            0.5898
                     35           -0.9302     95            1.0321
Mean                 50.5                     88
Standard Deviation   16.6633                  6.7823
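On raw scores, 72 < 80.
After conversion to z-scores the order reverses: the z-score of 72 (1.2903) exceeds the z-score of 80 (-1.1795), because 72 is well above the Class 1 mean while 80 is below the Class 2 mean.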
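The comparison can be reproduced with a few lines of code. This is a small sketch using the exam scores from the table; the helper name `z_score` is hypothetical.

```python
from statistics import mean, stdev

class_1 = [72, 55, 40, 35]
class_2 = [80, 85, 92, 95]

def z_score(x, values):
    """z-score of x relative to the sample it belongs to."""
    return (x - mean(values)) / stdev(values)

z_72 = z_score(72, class_1)   # top score in Class 1
z_80 = z_score(80, class_2)   # bottom score in Class 2

print(f"72 in Class 1: z = {z_72:.4f}")
print(f"80 in Class 2: z = {z_80:.4f}")
print("72 ranks higher within its class:", z_72 > z_80)
```

Each score is standardized against its own class, which is what makes scores from two differently graded exams comparable.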
Analyzing Distributions
Empirical Rule
• For data having a bell-shaped distribution:
• Approximately 68% of the data values lie within 1 standard deviation of the mean
• Approximately 95% of the data values lie within 2 standard deviations of the mean
• Almost all the data values lie within 3 standard deviations of the mean
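The percentages in the empirical rule can be traced back to the normal distribution. The sketch below assumes the bell-shaped data are exactly normally distributed, which the slides do not require, and uses the standard identity that the share within k standard deviations equals erf(k / √2); the function name `share_within` is hypothetical.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The three values come out close to 0.68, 0.95, and 0.997, which is where the rule's "68%, 95%, almost all" wording comes from. Real data that are only roughly bell-shaped will match these figures only approximately.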
Identifying Outliers
• Outliers: Extreme values in a data set
• They can be identified using standardized values (z-scores)
• Any data value with a z-score less than –3 or greater than +3 is treated as an outlier
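A minimal sketch of the z-score outlier check, applied to the home sales data from the percentile illustration. The function name `z_outliers` and its `threshold` parameter are hypothetical; the default of 3 follows the rule stated above.

```python
from statistics import mean, stdev

def z_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)          # sample standard deviation
    return [x for x in values if abs((x - x_bar) / s) > threshold]

home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]

print(z_outliers(home_sales))        # values flagged under the +/-3 rule
print(z_outliers(home_sales, 2.0))   # a stricter, purely illustrative cutoff
```

Under the ±3 rule no home sale is flagged: the largest price, 456,250, has a z-score of about 2.49. It is flagged only when the cutoff is lowered to 2, which is shown purely for illustration.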
Analyzing Distributions
Box Plots
• Graphical summary of the distribution of data
• Developed from the quartiles for a data set
Figure 2.22: Box Plot for the Home Sales Data
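The quantities behind a box plot can be computed from the quartiles. The sketch below rests on two assumptions that do not appear in these slides: the common convention that whiskers extend at most 1.5 × IQR beyond the box, with points outside drawn as outliers, and quartiles computed at position (n + 1) × p via `statistics.quantiles` with `method="exclusive"`.

```python
from statistics import quantiles

home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]

q1, q2, q3 = quantiles(home_sales, n=4, method="exclusive")
iqr = q3 - q1                         # interquartile range: width of the box

# Assumed convention: limits sit 1.5 * IQR beyond the box edges.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [x for x in home_sales if lower_limit <= x <= upper_limit]
outliers = [x for x in home_sales if x < lower_limit or x > upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, "to", q3, "median:", q2)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```

Under this convention the 456,250 sale falls above the upper limit, even though its z-score is below 3. The two outlier rules need not agree.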
Figure 2.23: Box Plots Comparing Home Sale Prices in Different
Communities
Measures of Association
Between Two Variables
Scatter Charts
Covariance
Correlation Coefficient
Measures of Association Between Two Variables
• Scatter Charts: Useful graph for analyzing the relationship between
two variables
• Covariance: Descriptive measure of the linear association between
two variables
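• Sample covariance for a sample of size n with the observations $(x_1, y_1)$, $(x_2, y_2)$, and so on: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• Population covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$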
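The two covariance formulas differ only in the divisor, as the following sketch shows. The temperature and sales figures are hypothetical values made up for illustration, not the Queensland Amusement Park data, and the function names are likewise hypothetical.

```python
def sample_covariance(xs, ys):
    """Sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def population_covariance(xs, ys):
    """Same sum of cross-deviations divided by N (the data are the whole population)."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n

# Hypothetical illustrative data: daily high temperature and units sold
temps = [78, 82, 85, 90]
sales = [20, 24, 25, 31]

print(sample_covariance(temps, sales))
print(population_covariance(temps, sales))
```

Both results are positive here because higher temperatures go with higher sales in the made-up data. The magnitude depends on the units of x and y, so it is hard to interpret on its own.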
Table 2.14: Data for Bottled Water Sales at Queensland
Amusement Park for a Sample of 14 Summer Days
Figure 2.26: Chart Showing the Positive Linear
Relation Between Sales and High Temperatures
Table 2.15: Sample Covariance Calculations for Daily High
Temperature and Bottled Water Sales at Queensland Amusement Park
Figure 2.27: Calculating Covariance and Correlation
Coefficient for Bottled Water Sales Using Excel
Measures of Association Between Two Variables
• Correlation coefficient: Measures the strength of the linear relationship between two variables
• Not affected by the units of measurement for x and y
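• Sample correlation coefficient denoted by $r_{xy}$
• $r_{xy} = \frac{s_{xy}}{s_x s_y}$
• $s_{xy}$ = sample covariance = $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• $s_x$ = sample standard deviation of x = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• $s_y$ = sample standard deviation of y = $\sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$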
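A short sketch of the sample correlation coefficient, including a check that it is not affected by the units of measurement. The data are hypothetical values made up for illustration, not the Queensland Amusement Park sample, and the function names are hypothetical.

```python
from statistics import mean, stdev

def sample_covariance(xs, ys):
    """Sum of cross-deviations divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical illustrative data
temps_f = [78, 82, 85, 90]                      # degrees Fahrenheit
sales = [20, 24, 25, 31]                        # units sold
temps_c = [(f - 32) * 5 / 9 for f in temps_f]   # same temperatures in Celsius

r_f = sample_correlation(temps_f, sales)
r_c = sample_correlation(temps_c, sales)

print(f"r with Fahrenheit: {r_f:.4f}")
print(f"r with Celsius:    {r_c:.4f}")
```

Changing the temperature scale changes the covariance but leaves r the same, because the rescaling cancels between the numerator and the standard deviation in the denominator.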
Interpretation of Correlation Coefficient
–1 ≤ r ≤ +1
r value    Relationship between the x and y variables
< 0        Negative linear
Near 0     No linear relationship
> 0        Positive linear
Figure 2.28: Scatter Diagrams and Associated
Covariance Values for Different Variable Relationships
(a) $s_{xy}$ positive: x and y are positively linearly related
(b) $s_{xy}$ approximately 0: x and y are not linearly related
(c) $s_{xy}$ negative: x and y are negatively linearly related
Computation of Correlation Coefficient
Illustration
• To determine the sample correlation coefficient for bottled water
sales at Queensland Amusement Park:
$r_{xy} = \frac{s_{xy}}{s_x s_y} = 0.93$
• There is a very strong positive linear relationship between high temperature and sales
Figure 2.29: Example of Nonlinear Relationship
Producing a Correlation Coefficient Near Zero
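A correlation coefficient near zero rules out a linear relationship only, not a relationship of any kind. The sketch below uses a made-up example, y = x² on values of x symmetric around zero, which is an assumption chosen for illustration and not the data behind Figure 2.29. Here y is completely determined by x, yet r is zero.

```python
from statistics import mean, stdev

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return s_xy / (stdev(xs) * stdev(ys))

# Hypothetical data: a perfect but nonlinear (quadratic) relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = sample_correlation(xs, ys)
print(f"r = {r:.4f}")
```

The positive and negative cross-deviations cancel exactly, so r comes out as zero. A scatter chart of the same data would show the U-shaped pattern immediately, which is why the chart and the coefficient are best read together.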