Exploratory Data Analysis / Descriptive Statistics
Exploratory Data Analysis
• In statistics, exploratory data analysis is an
approach to analyzing data sets to
summarize their main characteristics, often
with visual methods.
• It focuses on exploring data to understand
the data’s underlying structure and
variables, to develop intuition about the
data set, to consider how that data set came
into existence, and to decide how it can be
investigated with more formal statistical
methods.
Exploratory Data
Analysis
• Perform Exploratory
Data Analysis (EDA) to
understand the distribution
of a variable and to check
for anomalies and outliers.
• Create histograms and
boxplots, transform variables,
and examine trade-offs in
visualizations
Probability & Statistics
Descriptive Statistics
▪ In statistics scenarios, customers ask
questions about numerical variables.
▪ When summarizing and describing numerical
variables, you have to do more than just prepare tables.
▪ You need to consider the central tendency,
variation, shape, etc., of each numerical
variable.
Numerical Descriptive Measures
Central Tendency
Mean
The arithmetic mean (mean) is the most common measure of
central tendency.
For a population of size N: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
For a sample of size n: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
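As a minimal sketch of the mean, the function below adds the values and divides by their count. The function name and the data are illustrative assumptions, not taken from the slides; the same arithmetic applies to a population of size N and a sample of size n.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

# Made-up illustrative data: (2 + 4 + 6 + 8) / 4
print(arithmetic_mean([2, 4, 6, 8]))  # 5.0
```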
Median
The location of the median
If the number of values is odd, the median is the
middle number. If the number of values is even, the
median is the average of the two middle numbers.
Median position = (n+1)/2 position in the ordered data
Note that (n+1)/2 is not the value of the median, only
the position of the median in the ranked data.
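The position rule can be sketched in Python as follows. The function names are hypothetical. A fractional position such as 2.5 is handled by averaging the two neighbouring ranked values, which is the even-count case described above.

```python
import math

def median_position(n):
    """1-based position of the median in the ranked data."""
    return (n + 1) / 2

def median(values):
    ordered = sorted(values)
    if not ordered:
        raise ValueError("the median of an empty data set is undefined")
    pos = median_position(len(ordered))
    lower = ordered[math.floor(pos) - 1]  # value at or just below the position
    upper = ordered[math.ceil(pos) - 1]   # value at or just above the position
    return (lower + upper) / 2            # equals the middle value when n is odd

print(median_position(9))    # 5.0 (a position, not a value)
print(median([3, 1, 2]))     # 2.0
print(median([4, 1, 3, 2]))  # 2.5
```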
MODE
• Value with the highest frequency
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical
data
• There may be no mode
• There may be several modes
Mode
[Figure: two dot plots. Left, on an axis from 0 to 14: Mode = 9. Right, on an axis from 0 to 6: No Mode.]
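A small sketch of the mode that covers the "no mode" and "several modes" cases. The function name is hypothetical, and it assumes one common convention: when every value occurs exactly once, the data set has no mode.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when no value repeats (assumed 'no mode' convention).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)

print(modes([1, 2, 2, 3]))            # [2]
print(modes([1, 2, 3]))               # []
print(modes([1, 1, 2, 2]))            # [1, 2]
print(modes(["red", "blue", "red"]))  # ['red'] (categorical data)
```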
QUARTILES
Quartiles split the ranked data into 4 segments
with an equal number of values per segment
25% 25% 25% 25%
Q1 Q2 Q3
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50%
are larger)
Only 25% of the observations are greater than the third
quartile
Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile (median) position: Q2 = (n+1)/2
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values.
Quartiles
Example: Find the first quartile
Sample data in ordered array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:
Q1 = 12.5
Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency
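The quartile position formulas can be sketched as below. The function names are hypothetical. The slides only show the halfway case (a position ending in .5); using linear interpolation for other fractional positions is an assumption of this sketch, and some textbooks round to the nearest position instead.

```python
import math

def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lower = math.floor(pos)
    fraction = pos - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 values for these position formulas")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartiles(sample))  # (12.5, 16, 19.5)
```

For this data the positions are 2.5, 5 and 7.5, so Q1 falls halfway between 12 and 13 and Q3 halfway between 18 and 21.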
EXERCISE
Consider the following stem-and-leaf display.
Find the Range, Median, Mode, Q1, Q3 and Interquartile Range.
-2 | 2
-1 | 20
-0 | 5320
 0 | 01146688
 1 | 3357
 2 | 23346889999
 3 | 056789
 4 | 235799
 5 | 48
 6 | 38
 7 |
 8 | 6
EXERCISE (solution, n = 47)
Range = 86 – (–22) = 108
Median = (47+1)/2 th value = 24th value = 26
Mode = 29
Q1 = (47+1)/4 th value = 12th value = 6
Q3 = 3(47+1)/4 th value = 36th value = 39
IQR = 39 – 6 = 33
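The exercise answers can be checked with a short script. This is a sketch that assumes stems are tens and leaves are units, so the row "-1 20" expands to -12 and -10. The variable and function names are illustrative.

```python
from collections import Counter

# (stem, leaves) rows copied from the stem-and-leaf display.
rows = [("-2", "2"), ("-1", "20"), ("-0", "5320"), ("0", "01146688"),
        ("1", "3357"), ("2", "23346889999"), ("3", "056789"),
        ("4", "235799"), ("5", "48"), ("6", "38"), ("7", ""), ("8", "6")]

def expand(stem, leaves):
    """Turn one display row into its data values."""
    sign = -1 if stem.startswith("-") else 1
    tens = abs(int(stem))
    return [sign * (10 * tens + int(leaf)) for leaf in leaves]

data = sorted(v for stem, leaves in rows for v in expand(stem, leaves))
n = len(data)  # 47

# n + 1 = 48 is divisible by 4, so every position here is a whole number.
data_range = data[-1] - data[0]
median = data[(n + 1) // 2 - 1]   # 24th value
q1 = data[(n + 1) // 4 - 1]       # 12th value
q3 = data[3 * (n + 1) // 4 - 1]   # 36th value
mode = Counter(data).most_common(1)[0][0]
iqr = q3 - q1

print(n, data_range, median, mode, q1, q3, iqr)  # 47 108 26 29 6 39 33
```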
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
◼ Measures of variation give
information on the spread or
variability of the data values.
Same center,
different variation
Range
• Simplest measure of variation
• Difference between the largest and the
smallest observations:
Range = Xlargest – Xsmallest
Example:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Range = 14 - 1 = 13
Saturday, January 2, 2016
Interquartile Range
• Can eliminate some outlier problems by using the
interquartile range
• Eliminate some high- and low-valued observations and
calculate the range from the remaining values
Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1
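To illustrate why the interquartile range is less sensitive to outliers than the range, here is a small sketch. The function names and the extreme value 220 are hypothetical. It relies on `statistics.quantiles` with `method="exclusive"`, which places the quartiles at the (n+1)-based positions.

```python
import statistics

def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

def interquartile_range(values):
    """Q3 - Q1, with quartiles at the (n+1)/4 and 3(n+1)/4 positions."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

clean = [11, 12, 13, 16, 16, 17, 18, 21, 22]
with_outlier = [11, 12, 13, 16, 16, 17, 18, 21, 220]  # hypothetical extreme value

print(value_range(clean), interquartile_range(clean))                # 11 7.0
print(value_range(with_outlier), interquartile_range(with_outlier))  # 209 7.0
```

The single extreme value inflates the range from 11 to 209, while the interquartile range stays at 7.0 because it only uses the middle 50% of the ranked data.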
Variance
• Average (approximately) of squared deviations
of values from the mean
– Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Where
$\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = i-th value of the variable X
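A direct translation of the sample variance formula, as a sketch with illustrative data (the function name and the values are assumptions). The result is compared against `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [2, 4, 6, 8]  # mean 5, squared deviations 9 + 1 + 1 + 9 = 20
s2 = sample_variance(data)
print(s2)  # 20 / 3, about 6.67
print(math.isclose(s2, statistics.variance(data)))  # True
```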
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
– Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Population Standard Deviation
Here we use the formula
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}}$$
That is, replace the n – 1 in the denominator of the sample standard
deviation formula with n.
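The only difference between the two standard deviations is the denominator, which the sketch below makes explicit with a flag. The function name, the `population` parameter and the data are illustrative assumptions.

```python
import math
import statistics

def standard_deviation(values, population=False):
    """Square root of the variance; divides by n (population) or n - 1 (sample)."""
    n = len(values)
    if n < (1 if population else 2):
        raise ValueError("not enough values")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    denominator = n if population else n - 1
    return math.sqrt(squared_deviations / denominator)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared deviations sum to 32
print(standard_deviation(data, population=True))  # 2.0, i.e. sqrt(32 / 8)
print(standard_deviation(data))                   # sqrt(32 / 7), about 2.14
```

The sample value is always at least as large as the population value for the same data, since it divides the same sum by a smaller number.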
[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]
Shape of a Distribution
• Describes how data is distributed
• Measures of shape
– Symmetric or skewed
Saturday, February 4, 2017
Exploratory Data Analysis
• Box-and-Whisker Plot: A graphical display of
data using the 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
Example:
25% 25% 25% 25%
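The five numbers behind a box-and-whisker plot can be computed as in this sketch. The function name is hypothetical. It assumes quartiles at the (n+1)-based positions, which is what `statistics.quantiles` does with `method="exclusive"`. Plotting libraries may use a different quartile rule, so their boxes can differ slightly.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return (ordered[0], q1, med, q3, ordered[-1])

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(five_number_summary(sample))  # (11, 12.5, 16.0, 19.5, 22)
```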
Shape of Box-and-Whisker Plots
• The Box and central line are centered between
the endpoints if data are symmetric around
the median
Min Q1 Median Q3 Max
• A Box-and-Whisker plot can be shown in either
vertical or horizontal format
Distribution Shape and
Box-and-Whisker Plot
Left-Skewed Symmetric Right-Skewed
Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3
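One rough way to read the shape off the box is to compare the distance from Q1 to the median with the distance from the median to Q3. The sketch below is a heuristic under that assumption, not a formal skewness measure: it ignores the whiskers, and the function name, the `tolerance` parameter and the data sets are all illustrative.

```python
import statistics

def box_shape(values, tolerance=0.0):
    """Classify shape from the position of the median inside the box."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    left_half = q2 - q1   # width of the box below the median
    right_half = q3 - q2  # width of the box above the median
    if abs(left_half - right_half) <= tolerance:
        return "symmetric"
    return "right-skewed" if right_half > left_half else "left-skewed"

print(box_shape([1, 2, 3, 4, 5, 6, 7]))         # symmetric
print(box_shape([1, 2, 3, 4, 10, 20, 40]))      # right-skewed
print(box_shape([1, 21, 31, 37, 38, 39, 40]))   # left-skewed
```

For real data a small positive `tolerance` is usually needed, since the two halves of the box are rarely exactly equal.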