School of Computing Science and Engineering
Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis and Data Visualization
Credits: Chris Volinsky - Columbia University
Outline
• EDA
• Visualization
– One variable
– Two variables
– More than two variables
– Other types of data
– Dimension reduction
EDA and Visualization
• Exploratory Data Analysis (EDA) and
Visualization are very important steps in any
analysis task.
• get to know your data!
– distributions (symmetric, normal, skewed)
– data quality problems
– outliers
– correlations and inter-relationships
– subsets of interest
– suggest functional relationships
• Sometimes EDA or viz might be the goal!
Data Visualization – cake bakery
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
– means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn
something!
• data-driven (model-free)
• Think interactive and visual
– Humans are the best pattern recognizers
– You can use more than 2 dimensions!
• x,y,z, space, color, time….
• Especially useful in early stages of data mining
– detect outliers (e.g. assess data quality)
– test assumptions (e.g. normal distributions or skewed?)
– identify useful raw data & transforms (e.g. log(x))
• Bottom line: it is always well worth looking at your data!
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\sum_i (X_i - \mu)^3 / [\sum_i (X_i - \mu)^2]^{3/2}$
• zero if symmetric; right-skewed more common (what kind of
data is right skewed?)
– number of distinct values for a variable (see unique() in
R)
– Don’t need to report all of these. Bottom line: do these numbers make sense?
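The statistics listed above can be computed in a few lines of R. This is a minimal sketch on a made-up vector `x` (not course data). Note that the variance here divides by n, as on the slide, whereas R's `var()` divides by n - 1.

```r
# Hypothetical sample; any numeric vector works.
x <- c(2, 3, 3, 5, 7, 8, 13, 40)
n <- length(x)

x_mean   <- mean(x)
x_median <- median(x)
x_mode   <- as.numeric(names(which.max(table(x))))  # most common value
x_quart  <- quantile(x, c(0.25, 0.75))              # Q1 and Q3
x_iqr    <- IQR(x)                                  # Q3 - Q1
x_range  <- max(x) - min(x)
x_var    <- sum((x - x_mean)^2) / n                 # divides by n, as on the slide

# Skewness as written on the slide; the usual moment coefficient
# is sqrt(n) times this value. Positive means right-skewed.
x_skew   <- sum((x - x_mean)^3) / sum((x - x_mean)^2)^(3/2)

n_distinct <- length(unique(x))
```

The single large value (40) pulls the mean well above the median, which is the quickest numeric hint of right skew.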
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality, outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros
Issues with Histograms
• For small data sets, histograms can be
misleading.
– Small changes in the data, bins, or anchor can deceive
• For large data sets, histograms can be quite
effective at illustrating general properties of the
distribution.
• Histograms effectively only work with 1 variable
at a time
– But ‘small multiples’ can be effective
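To see how much bins and anchor matter on a small data set, the sketch below bins the same 30 simulated values three ways. The data and the break points are assumptions chosen for illustration.

```r
set.seed(1)
x <- rnorm(30)   # small simulated sample (an assumption, not course data)

# Same data, two bin widths and a shifted anchor; plot = FALSE returns counts only.
h_wide    <- hist(x, breaks = seq(-4, 4, by = 1.0), plot = FALSE)
h_narrow  <- hist(x, breaks = seq(-4, 4, by = 0.25), plot = FALSE)
h_shifted <- hist(x, breaks = seq(-4.5, 4.5, by = 1.0), plot = FALSE)

h_wide$counts      # bin width 1, bin edges at whole numbers
h_shifted$counts   # bin width 1, bin edges shifted by 0.5
h_narrow$counts    # bin width 0.25
```

Comparing `h_wide$counts` with `h_shifted$counts` shows how a half-bin shift alone can change the apparent shape.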
But be careful with axes and scales!
Smoothed Histograms - Density Estimates
• Kernel estimates smooth out the
contribution of each datapoint over a local
neighborhood of that point.
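$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$
h is the kernel width
• Gaussian kernel is common:
$K\left(\frac{x - x_i}{h}\right) = C\, e^{-\frac{1}{2}\left(\frac{x - x_i}{h}\right)^2}$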
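The kernel estimate can be written out directly from its definition. The sketch below does so with a Gaussian kernel (`dnorm`, so C = 1/sqrt(2*pi)) and cross-checks against R's built-in `density()`. The function name `kde`, the simulated data, and the bandwidth 0.4 are all assumptions for illustration.

```r
# Kernel density estimate written out from the formula, Gaussian kernel.
kde <- function(x, data, h) {
  n <- length(data)
  sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (n * h))
}

set.seed(1)
data  <- rnorm(100)                     # simulated sample (assumption)
grid  <- seq(-4, 4, length.out = 200)   # points at which to evaluate the estimate
f_hat <- kde(grid, data, h = 0.4)

# R's built-in estimator at the same bandwidth, for comparison.
builtin <- density(data, bw = 0.4, kernel = "gaussian")

plot(grid, f_hat, type = "l", xlab = "x", ylab = "estimated density")
lines(builtin, col = "red", lty = 2)
```

The two curves should lie almost on top of each other; `density()` uses a binned approximation, so tiny differences are expected.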
Bandwidth choice is an art
Usually want to try several
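Trying several bandwidths is a one-liner with `density()`. A minimal sketch on a simulated two-peak sample; the data and the three bandwidth values are assumptions.

```r
set.seed(2)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))  # simulated bimodal sample (assumption)

bandwidths <- c(0.1, 0.5, 2)
fits <- lapply(bandwidths, function(bw) density(x, bw = bw))

# Overlay the three estimates on one plot.
plot(fits[[1]], main = "Same data, three bandwidths")
lines(fits[[2]], col = "blue")
lines(fits[[3]], col = "red")
```

A very small bandwidth tends to give a wiggly curve that follows individual points, while a very large one smooths away real structure; something in between is usually the most informative.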
Boxplots
• Shows a lot of information about a variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– no standard implementation in software (many options for whiskers, outliers)
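The point about whisker and outlier conventions is easy to check with `boxplot.stats()`, which returns the numbers a boxplot draws. A small sketch on made-up data; the `coef` values are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up data with one extreme value

# Default rule: points beyond 1.5 * (hinge spread) from the box are outliers.
b <- boxplot.stats(x)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points drawn individually as outliers

# A more permissive rule (coef = 5) no longer flags the extreme value.
b_loose <- boxplot.stats(x, coef = 5)
b_loose$out
```

The same data yields a different picture depending on the rule, which is why it is worth checking what convention a given package uses.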
Time Series
If your data has a temporal component, be sure to exploit it
[Figure: air-travel time series, annotated with a steady growth trend, summer peaks, summer bifurcations (favor early/late), and New Year bumps]
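Trend and seasonal peaks like those annotated on the air-travel plot can also be separated numerically. A minimal sketch, assuming R's built-in `AirPassengers` series as a stand-in; the slides do not say which data set the figure uses.

```r
# AirPassengers ships with base R: monthly airline passenger totals, 1949-1960.
ap <- AirPassengers
parts <- decompose(ap, type = "multiplicative")

plot(parts)   # observed, trend, seasonal, and random panels

# One seasonal factor per calendar month; values above 1 are busier than trend.
seasonal <- parts$figure
names(seasonal) <- month.abb
peak_month <- names(which.max(seasonal))
```

The trend panel shows the steady growth, and `seasonal` makes the summer peak explicit as a per-month multiplier.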
Time-Series Example 3
Scotland experiment: does “milk in kid diet” lead to “better health”?
• 20,000 kids: 5k raw milk, 5k pasteurized milk, 10k control (no supplement)
• Plot: mean weight vs mean age for the 10k control group
• Would expect a smooth weight growth plot
• Plot visually reveals an unexpected pattern (steps), not apparent from the raw data table
• Possible explanations:
– Grow less early in the year than later?
– No steps in the height plots; so why does height grow uniformly while weight grows in spurts?
– Kids weighed in clothes: summer garb lighter than winter?
Spatial Data
• If your data has a geographic component, be sure to exploit it
• Data from cities/states/zip codes – easy to get lat/long
• Can plot as scatterplot
Spatial Data: Choropleth Maps
• Maps using color shadings to represent numerical values are called choropleth maps
• http://elections.nytimes.com/2008/results/president/map.html
Two Continuous Variables
• For two numeric variables, the scatterplot is the obvious choice
[Figure: two example scatterplots, each annotated “interesting?”]
2D Scatterplots
• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter plot: Homoscedastic
Why is this important in classical statistical modelling?
Scatter plot: Heteroscedastic
variation in Y differs depending on the value of X
e.g., Y = annual tax paid, X = income
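The difference between the two cases shows up clearly in the residual spread. The sketch below simulates one homoscedastic and one heteroscedastic response and compares residual standard deviations for small and large x. All data, the cut point at x = 5, and the helper name `spread` are assumptions.

```r
set.seed(3)
x <- runif(500, 1, 10)
y_homo   <- 2 * x + rnorm(500, sd = 1)        # constant spread around the line
y_hetero <- 2 * x + rnorm(500, sd = 0.5 * x)  # spread grows with x

# Residual standard deviation for small x and for large x after a linear fit.
spread <- function(y) {
  r <- resid(lm(y ~ x))
  c(low_x = sd(r[x < 5]), high_x = sd(r[x >= 5]))
}

spread(y_homo)     # the two values should be similar
spread(y_hetero)   # high_x should be clearly larger than low_x
```

Classical linear regression assumes the first situation, so a fan-shaped scatterplot is a warning that its standard errors may not be trustworthy.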
Two variables - continuous
• Scatterplots
– But can be bad with lots of data
Two variables - continuous
• What to do for large data sets
– Contour plots
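One way to produce such a contour plot is to estimate a two-dimensional density and draw its level curves. A minimal sketch using `kde2d()` from the MASS package on simulated data; the sample size, grid size, and variable names are assumptions.

```r
library(MASS)   # for kde2d(); MASS is a recommended package bundled with most R installs

set.seed(4)
n <- 50000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)   # simulated correlated pair (assumption)

# 2-D kernel density estimate on a 60 x 60 grid.
dens <- kde2d(x, y, n = 60)

# Contour lines replace 50,000 overlapping dots.
contour(dens, xlab = "x", ylab = "y")
```

A plain `plot(x, y)` of the same data would be a solid blob; the contours keep the shape and orientation of the cloud visible.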
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)
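The colour string in that call carries the transparency: its last two hex digits are the alpha channel. A short sketch that decodes it and builds the same colour with `rgb()` and `adjustcolor()`; the variable names are illustrative.

```r
# "#0000ff22": red = 00, green = 00, blue = ff, alpha = 22 (hex 22 = 34 out of 255).
col2rgb("#0000ff22", alpha = TRUE)

# Two other ways to build a semi-transparent blue.
blue_a <- rgb(0, 0, 1, alpha = 34 / 255)
blue_b <- adjustcolor("blue", alpha.f = 34 / 255)

# Dense regions appear darker because overlapping points add up.
plot(rnorm(1000), rnorm(1000), col = blue_a, pch = 16, cex = 3)
```

Lower alpha suits larger data sets; with only a few points the plot may look too faint.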
Jittering
• Jittering points helps too
• plot(age, TimesPregnant)
• plot(jitter(age), jitter(TimesPregnant))
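Jittering matters most when both variables take few distinct values, so many points land on the same spot. The sketch below uses simulated integer data as a stand-in for `age` and `TimesPregnant`; the data and the `amount` value are assumptions.

```r
set.seed(5)
# Simulated integer-valued stand-ins (assumption, not the course data).
age <- sample(21:50, 500, replace = TRUE)
times_pregnant <- pmin(rpois(500, lambda = 3), 12)

# Rows that repeat an earlier (age, count) pair are hidden in the raw plot.
n_overplotted <- sum(duplicated(cbind(age, times_pregnant)))

# jitter() adds uniform noise; 'amount' is the largest shift in either direction.
age_j   <- jitter(age, amount = 0.4)
count_j <- jitter(times_pregnant, amount = 0.4)
plot(age_j, count_j, pch = 16, cex = 0.6)
```

Keeping `amount` below 0.5 means each jittered point still sits visibly closer to its own integer value than to a neighbouring one.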
Displaying Two Variables
• If one variable is categorical, use small multiples
• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages
library('lattice')
histogram(~DiastolicBP | TimesPregnant==0)
Two Variables - one categorical
• Side by side boxplots are very effective in showing
differences in a quantitative variable across factor
levels
– tips data
• do men or women tip better?
– orchard sprays
• measuring potency of various orchard sprays in repelling
honeybees
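Side-by-side boxplots use R's formula interface. A minimal sketch, assuming the built-in `OrchardSprays` data set corresponds to the orchard sprays example; the log scale is a choice made here, not something stated in the slides.

```r
# OrchardSprays is built into R: 'decrease' is the response,
# 'treatment' is the spray (levels A to H).
boxplot(decrease ~ treatment, data = OrchardSprays,
        log = "y", xlab = "treatment", ylab = "decrease (log scale)")

# The same comparison as numbers: median response per treatment.
med <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, median)
med
```

The formula `decrease ~ treatment` reads as "the quantitative variable split by the factor", and gives one box per factor level.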
Barcharts and Spineplots
stacked barcharts can be used to compare continuous values across two or more categorical ones (in the figure, orange = M, blue = F).
spineplots show proportions well, but can be hard to interpret
More than two variables
Pairwise scatterplots
Can be somewhat ineffective for categorical data
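A pairwise scatterplot matrix is a single call to `pairs()`. A short sketch on the built-in iris data, coloured by species; the colour choices are arbitrary.

```r
# Pairwise scatterplots of the four numeric iris columns, coloured by species.
species_col <- c("red", "green3", "blue")[as.integer(iris$Species)]
pairs(iris[, 1:4], col = species_col, pch = 16)

# The matching numeric summary: pairwise correlations.
cors <- cor(iris[, 1:4])
round(cors, 2)
```

The categorical variable works here only as a colour; plotted as its own panel it would collapse into a few stripes, which is the weakness noted above.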
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
– trellis or lattice plots
– Cleveland models on human perception, all based on conditioning
– Infinite possibilities
• Earthquake data:
– locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964
– Data collected on the severity of the earthquake
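Conditioning can be tried on the earthquake data directly, since R ships it as the `quakes` data set. A minimal sketch with lattice; cutting depth into four equal-width bands is an assumption made for illustration.

```r
library(lattice)

# 'quakes' is built into R: lat, long, depth, mag, stations for 1000 events near Fiji.
# Condition the location scatterplot on depth, cut into four equal-width bands.
quakes$depth_band <- cut(quakes$depth, breaks = 4)

p <- xyplot(lat ~ long | depth_band, data = quakes, pch = ".", cex = 2)
print(p)   # lattice objects are drawn when printed
```

Each panel shows the same lat/long axes for one depth band, so changes in the spatial pattern with depth can be read off by comparing panels.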
How many dimensions are represented here?
Andrew Gelman blog 7/15/2009
Multivariate Vis: Parallel Coordinates
Petal, a non-reproductive part of the flower
Sepal, a non-reproductive part of the flower
The famous iris data!
Parallel Coordinates
[Figure: a single vertical axis, Sepal Length, with the value 5.1 marked]
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
Parallel Coordinates: 2 D
[Figure: two vertical axes, Sepal Length and Sepal Width, with the values 5.1 and 3.5 joined by a line]
Parallel Coordinates: 4 D
[Figure: four vertical axes, Sepal Length, Sepal Width, Petal Length, Petal Width, with the values 5.1, 3.5, 1.4, 0.2 joined by a line]
Parallel Visualization of Iris data
[Figure: parallel-coordinates plot of the iris data, with the axis values 5.1, 3.5, 1.4, 0.2 marked]
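A parallel-coordinates plot of the iris data can be drawn with `parcoord()` from the MASS package. A minimal sketch; the semi-transparent species colours are an assumption chosen so that overlapping lines stay readable.

```r
library(MASS)   # for parcoord(); MASS is a recommended package bundled with most R installs

# First flower in the iris data: sepal length 5.1, sepal width 3.5,
# petal length 1.4, petal width 0.2.
iris[1, 1:4]

# One line per flower, one vertical axis per measurement.
species_cols <- c(rgb(1, 0, 0, 0.4), rgb(0, 0.6, 0, 0.4), rgb(0, 0, 1, 0.4))
parcoord(iris[, 1:4], col = species_cols[as.integer(iris$Species)])
```

`parcoord()` rescales every axis to its own minimum and maximum, so line heights are comparable within an axis but not across axes.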
Multivariate: Parallel coordinates
Alpha blending can be effective
Courtesy Unwin, Theus, Hofmann
Parallel coordinates
• Useful in an interactive setting
Networks and Graphs
• Visualizing networks is helpful, even if it is not obvious that a network exists
Network Visualization
• Graphviz (open source software) is a nice layout tool for big and small graphs
What’s missing?
• pie charts
– very popular
– good for showing simple relations of proportions
– Human perception not good at comparing arcs
– barplots, histograms usually better (but less pretty)
• 3D
– nice to be able to show three dimensions
– hard to do well
– often done poorly
– 3d best shown through “spinning” in 2D
• uses various types of projecting into 2D
• http://www.stat.tamu.edu/~west/bradley/
Worst graphic in the world?
Dimension Reduction
• One way to visualize high dimensional
data is to reduce it to 2 or 3 dimensions
– Variable selection
• e.g. stepwise
– Principal Components
• find linear projection onto p-space with maximal
variance
– Multi-dimensional scaling
• takes a matrix of (dis)similarities and embeds the
points in p-dimensional space to retain those
similarities
More on this in next Topic
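Both projection methods above are available in base R. The sketch below reduces the four iris measurements to two dimensions with principal components (`prcomp`) and with classical multidimensional scaling (`cmdscale`). Standardising the columns first is a choice made here, not something the slides specify.

```r
X <- scale(iris[, 1:4])   # standardise the four measurements

# Principal components: linear projection with maximal variance.
pca <- prcomp(X)
pc_scores <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Classical multidimensional scaling on Euclidean distances, embedded in 2-D.
mds <- cmdscale(dist(X), k = 2)

plot(pc_scores, col = as.integer(iris$Species), pch = 16,
     main = "iris: first two principal components")
```

With Euclidean distances, classical MDS recovers the same configuration as the principal component scores, up to a sign flip of each axis. The two methods differ once other (dis)similarities are used.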
Visualization done right
• Hans Rosling @ TED
• http://www.youtube.com/watch?v=jbkSRLYSojo