GET1030
Computers and the humanities
Lecture 3
Visualizing data
Dr Miguel Escobar Varela
Learning Objectives
1. To describe the main types of visualizations
used today
2. To identify different approaches to data
visualization
3. To identify the potential for bias in
visualizations
4. To offer critical perspectives on data
visualization
Lecture 3
Visualizing data
Part 1: Most common scientific data visualizations today
To describe the main types of
visualizations used today
What is a visualization?
Representing numerical and categorical data with graphical elements (color, shape, position, size)
What is it useful for?
- To give an overview of the data
- As a first step for further research
Charts in this session
Boxplots
Barplots
Lineplots
Scatterplots
Histograms
KDE plots
Violinplots
Joint plots
Boxplot
Description of a univariate distribution (one variable)
For this toy example: number of actors required for a theatre play
1st quartile =
Median =
3rd quartile =
Minimum =
Maximum =
Boxplot (Outliers)
1.5 x IQR
Calculating outliers using Tukey’s rule
IQR = 75th percentile - 25th percentile
1st quartile = 8.5
Median = 10
3rd quartile = 11.5
Minimum = 3
Maximum = 17
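Tukey's rule can be sketched in a few lines of Python. This is only an illustration: the function name `tukey_fences` and the list of cast sizes are assumptions made up for the example, while the quartiles are the ones shown on this slide.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences used by Tukey's rule."""
    iqr = q3 - q1                       # IQR = 75th percentile - 25th percentile
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the slide's toy example (actors per play).
lower, upper = tukey_fences(8.5, 11.5)

# Hypothetical cast sizes that include the slide's minimum (3) and maximum (17).
cast_sizes = [3, 8, 9, 10, 10, 11, 12, 17]

# Any value outside the fences is flagged as an outlier.
outliers = [n for n in cast_sizes if n < lower or n > upper]
print(lower, upper, outliers)
```

With these quartiles the IQR is 3, so any play needing fewer than 4 or more than 16 actors would be drawn as an outlier point rather than as part of the whiskers.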
Categories in boxplots
How many variables are
represented in this graph?
variables
1. days of the week
2. bill size
categories
1. smoke or not
Standard deviation in bar charts
Measure of the dispersion of the data
Square root of the variance
Variance is the average of the squared differences from the
mean
Mean =
Variance =
Standard Deviation =
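The three quantities can be computed step by step as follows. This is a minimal sketch with made-up toy data; it uses the definition given on this slide (variance as the plain average of squared differences), whereas some software divides by n - 1 instead.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Average of the squared differences from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical toy data
print(mean(data), variance(data), standard_deviation(data))
```

In a bar chart, the standard deviation is typically drawn as an error bar extending above and below the top of each bar.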
Standard deviation in lineplots
Consider this example of
actors in plays over time
Scatterplot
Shows the numerical relationship
between two variables. Color can
be used to indicate categorical
variables.
Scatterplot
Shows the relationship between two variables.
Scatterplot
Shows the numerical relationship between two variables. Color can be used to indicate categorical variables.
Histogram
Each bar groups numbers into
ranges. Taller bars show that more
data falls in that range. It displays
the shape and spread of the data.
Histogram
Consider this example of stars given to films in a group of reviews.
Histogram (relative frequency)
Consider this example of stars given to films in a group of reviews.
Histogram (relative frequency)
Consider this example of stars given to films in a group of reviews.
Histogram (bin size)
Consider this example of stars given to films in a group of reviews.
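The histogram variants on these slides (raw counts, relative frequency, and a different bin size) can be sketched as follows. The ratings list and the function name `histogram` are assumptions made up for illustration; they are not the data behind the slides.

```python
def histogram(values, start, bin_width, n_bins):
    """Count values per bin; bin i covers [start + i*w, start + (i+1)*w)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // bin_width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

stars = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5]   # hypothetical film ratings

# One bin per star value: the height of each bar is a count.
counts = histogram(stars, start=1, bin_width=1, n_bins=5)

# Relative frequency: each count divided by the total, so the bars sum to 1.
relative = [c / len(stars) for c in counts]

# Wider bins (two star values per bin) give a coarser picture of the same data.
wide = histogram(stars, start=1, bin_width=2, n_bins=3)

print(counts, relative, wide)
```

The same ratings look quite different with one-star and two-star bins, which is why the bin size is a design decision rather than a neutral default.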
Histogram
https://tinlizzie.org/histograms/
Kernel Density Estimation (KDE) plots
Closely related to histograms. They show a
smoothed representation of the data
distribution.
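One common way to obtain the smoothed curve is to place a small Gaussian "bump" on every data point and average the bumps. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth of 0.5; the function name `make_kde` and the ratings are hypothetical, and plotting libraries choose the bandwidth with their own rules.

```python
import math

def make_kde(samples, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each sample contributes a bell curve centred on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

stars = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5]   # hypothetical film ratings
density = make_kde(stars, bandwidth=0.5)

# Sanity check: the area under a density curve should be close to 1.
step = 0.01
area = sum(density(-2 + i * step) * step for i in range(1000))
print(round(area, 3), density(4) > density(1))
```

A larger bandwidth gives a smoother curve and a smaller one gives a bumpier curve, much like the choice of bin size in a histogram.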
Violinplot
Another visual representation of
the 5-number summary but also
represents the distribution of
values, in a way similar to a
KDE.
Jointplots
Combined scatterplot with
regression line, confidence
interval, histogram and KDE.
Pairplots
Scatterplots and KDE of
multiple variables for the
same samples.
Lecture 3
Visualizing data
Part 2: Different approaches to Data Visualization
William Playfair (1759-1823)
From The Commercial and Political Atlas; Representing, by Means of Stained Copper-Plate
Charts, the Exports, Imports, and General Trade of England, at a Single View (1785)
John Snow (1813-1858)
John Snow’s map of cholera outbreaks and wells (1854)
Digital version by Robin Wilson (2013)
http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/
Florence Nightingale (1820-1910)
‘Coxcomb’ (polar area graph) visualization of the causes of death in the army (1850s Crimean War)
*preventable causes, wounds, accidents
Charles Joseph Minard (1781-1870)
Napoleon’s March to Moscow (published 1869)
Beginning at the Polish-Russian border, the thick band shows the size of the army at each position.
The path of Napoleon's retreat from Moscow is depicted by the dark lower band, which is tied to
temperature and time scales.
The scatterplot
First graphic representation of interquartile range boxes.
*This is a reconstruction by Friendly
and Denis (2005), based on data from
Herschel’s 1833 paper, “On the
Investigation of the Orbits of
Revolving Double Stars”.
Sir John Frederick William
Herschel (1792-1871)
The scatterplot
Francis Galton’s (1886)
smoothed correlation
diagram for the data on
heights of parents and
children, showing one ellipse
of equal frequency.
Francis Galton (1822-1911)
The ‘Infoviz’ Approach
Statisticians often disagree with a visualization aesthetic common in the news, developed most
influentially by Nigel Holmes while working for TIME magazine in the 1970s.
*as we’ll see later, this is the kind of graph that Tufte would classify as ‘chartjunk’
Additional resources
https://exhibits.stanford.edu/dataviz
Exploratory data analysis (EDA)
Theorized by John Tukey (1915-2000).
Visualizations are often central to EDA.
Often includes statistical information (error bars, confidence intervals, standard deviation)
Many different visualizations with shared axes.
Modern scientific approach to dataviz
Edward Tufte:
Against “chartjunk”
Popularized sparklines.
Famous for several concepts:
lie factor, the data-ink ratio (against decoration), small multiples and
the data density of a graphic.
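Tufte's lie factor is usually stated as the size of the effect shown in the graphic divided by the size of the effect in the data, with a value near 1 indicating a faithful graphic. The sketch below applies it to a bar chart with a cropped axis; the numbers, the baseline, and the function names are all hypothetical.

```python
def relative_change(first, second):
    """Size of an effect, as a fraction of the starting value."""
    return (second - first) / first

def lie_factor(data_first, data_second, drawn_first, drawn_second):
    """Effect shown in the graphic divided by the effect in the data."""
    return relative_change(drawn_first, drawn_second) / relative_change(
        data_first, data_second
    )

values = (50, 55)   # hypothetical data: a 10% increase

# Bars drawn on an axis that starts at 45 have heights 5 and 10.
cropped = tuple(v - 45 for v in values)
# Bars drawn on an axis that starts at 0 have heights equal to the data.
full = tuple(v - 0 for v in values)

print(lie_factor(*values, *cropped), lie_factor(*values, *full))
```

On the cropped axis the second bar is drawn twice as tall as the first although the data only grew by 10%, so the graphic exaggerates the effect roughly tenfold.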
What’s the purpose?
To grab attention?
To show trends?
To help scientists analyze data?
Lecture 3
Visualizing data
Part 3: Sources of bias in data visualization
Axis Cropping
Kendall Fortney, “5 Ways Data Visualizations can Lie”, Towards data science,
https://towardsdatascience.com/5-ways-data-visualizations-can-lie-46e54f41de37
Axis Scaling
Ravi Parikh, ‘How to lie with data visualization’,
Gizmodo, 2014,
https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
Axis Scaling
Kendall Fortney, “5 Ways Data Visualizations can Lie”, Towards data science,
https://towardsdatascience.com/5-ways-data-visualizations-can-lie-46e54f41de37
The problem of pie charts
The green slice is actually equal to a quarter of the yellow one, and pink is a third of the
value of the purple slice.
Maryland is bigger than the others (by 3%). The 3D effect visually adds more volume to NH,
tricking your eyes. Without labels telling the percentages there would be little to no chance
of accurately guessing it.
Kendall Fortney, “5 Ways Data Visualizations can Lie”, Towards data science,
https://towardsdatascience.com/5-ways-data-visualizations-can-lie-46e54f41de37
The problem of pie charts
The green slice is actually equal to a quarter of the
yellow one, and pink is a third of the value of the purple
slice.
Which is bigger? If you chose New Hampshire you would be wrong; it
was actually Maryland. The 3D effect visually adds more volume to that
slice, tricking your eyes. Without labels telling the percentages there
would be little to no chance of accurately guessing it.
People tend to overestimate the size of acute
angles (<90°) and underestimate the size of obtuse
ones (>90°) (Nundy et al. 2000, text at
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC25873/)
The problem of pie charts
Walt Hickey, “The Worst Chart In
The World”, Business Insider,
2013,
https://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Multiple dimensions
Nathan Yau, “How to Spot Visualization
Lies”, FlowingData, 2017,
https://flowingdata.com/2017/02/09/how-to-spot-visualization-lies/
Values not normalized
Chiqui Esteban, ‘A Quick Guide to
Spotting Graphics That Lie’, National
Geographic, May 2015,
https://www.nationalgeographic.com/news/2015/06/150619-data-points-five-ways-to-lie-with-charts/
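Normalizing usually means dividing a raw count by the size of the group it comes from, so that groups of different sizes can be compared. The sketch below uses invented regions and numbers purely for illustration; they do not come from the cited article.

```python
# Hypothetical raw counts and populations for two regions.
regions = {
    "Region A": {"cases": 5000, "population": 10_000_000},
    "Region B": {"cases": 1000, "population": 500_000},
}

def per_100k(cases, population):
    """Rate per 100,000 inhabitants."""
    return cases * 100_000 / population

rates = {
    name: per_100k(r["cases"], r["population"]) for name, r in regions.items()
}

# Which region looks "worst" depends on whether the values are normalized.
highest_raw = max(regions, key=lambda name: regions[name]["cases"])
highest_rate = max(rates, key=rates.get)
print(rates, highest_raw, highest_rate)
```

A map colored by raw counts would highlight Region A, while a map colored by rate would highlight Region B, even though the underlying data are identical.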
Spurious correlations
Linecharts tend to suggest correlation
http://www.tylervigen.com/spurious-correlations
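Any two series that both happen to rise over time will show a high correlation coefficient, whether or not they have anything to do with each other. The sketch below computes Pearson's r for two invented yearly series; the numbers and names are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two hypothetical, unrelated quantities measured over ten years.
# Both simply drift upward, with a little noise added.
series_a = [100 + 5 * i + (1 if i % 2 else -1) for i in range(10)]
series_b = [20 + 2 * i + (-1 if i % 3 == 0 else 1) for i in range(10)]

r = pearson(series_a, series_b)
print(round(r, 2))
```

Plotting both series as lines on the same chart invites the reader to see a relationship, but the shared upward trend is the only thing they have in common.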
Even good graphics tell different stories
Cabanski, C., Gilbert, H., & Mosesova, S. (2018). Can Graphics Tell Lies? A Tutorial on How
To Visualize Your Data. Clinical and translational science, 11(4), 371–377.
doi:10.1111/cts.12554
Raw data, not just summary statistics
Nine data sets with equivalent
summary statistics. Each data set
has the same x mean (54.26), y
mean (47.83), x SD (16.76), y SD
(26.93), and Pearson correlation
coefficient ( −0.06). The nine distinct
patterns show the importance of
plotting the raw data rather than only
displaying summary statistics or
models.
Cabanski, C., Gilbert, H., &
Mosesova, S. (2018). Can
Graphics Tell Lies? A Tutorial on
How To Visualize Your Data.
Clinical and translational science,
11(4), 371–377.
doi:10.1111/cts.12554
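The same point can be shown with a much smaller, one-variable analogue. The two lists below are invented for this sketch and are not the data sets from the paper: they share the same mean and standard deviation but have different shapes.

```python
import math
import statistics

# Two hypothetical data sets with identical summary statistics.
two_clusters = [-1.0, -1.0, 1.0, 1.0]                  # values at the extremes only
centered = [-math.sqrt(2), 0.0, 0.0, math.sqrt(2)]     # half of the values in the middle

for name, data in [("two_clusters", two_clusters), ("centered", centered)]:
    # pstdev is the population standard deviation (divides by n).
    print(name, statistics.mean(data), round(statistics.pstdev(data), 6), sorted(data))
```

Reporting only the mean and standard deviation would make these two lists indistinguishable, while a plot of the raw values shows the difference at once.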
List of caveats
https://www.data-to-viz.com/caveats.html
Lecture 3
Visualizing data
Part 4: critical perspectives on data visualization
Alternative modes
A basic bar chart compares the
number of men (top bar) and
the number of women (bottom
bar) in seven different nations,
A through F, at the present time
(2010). The assumptions are
that quantities (number),
entities (nations), identities
(gender) and temporality (now)
are all self-evident. Graphic
credit Xárene Eskandar.
Drucker (2011)
Alternative modes
In this chart gendered identity is
modified. In nation A, the top bar
contains a changing gradient,
indicating that “man” is a continuum
from male infant to adult, or in
countries E and D, gender ambiguity
is displayed.
In country F women only register as
individuals after coming of
reproductive age, thus showing that
quantity is an effect of cultural
conditions, not a self-evident fact.
The movement of men back and forth
across the border of nations B and C
makes the “nations” unstable entities.
Graphic credit Xárene Eskandar.
Drucker (2011)
Two geographical visualizations
Hepworth and Church (2019)
Two geographical visualizations
Hepworth and Church (2019)
Two geographical visualizations
Hepworth and Church (2019)
Two geographical visualizations
Each visualization tells a story
By Nathan Yau
“Let the data speak.” It’s a common saying
for chart design. The premise — strip out the
bits that don’t help patterns in your data
emerge — is fine, but people often
misinterpret the mantra to mean that they
should make a stripped down chart and let
the data take it from there.
You have to guide the conversation though.
You must help the data focus and get to the
point. Otherwise, it just ends up rambling
about what it had for breakfast this morning
and how the coffee wasn’t hot enough.
https://flowingdata.com/2017/01/24/one-dataset-visualized-25-ways/
Jer Thorp
Collection → Computation → Representation
This formula is too simplistic
Whenever you look at data — as a spreadsheet or database view or a visualization, you are looking at [a
system]. What this diagram doesn’t capture is the immense branching of choice that happens at each
step along the way. As you make each decision — to omit a row of data, or to implement a particular
database structure or to use a specific color palette you are treading down a path through this wild, tall
grass of possibility. It will be tempting to look back and see your trail as the only one that you could have
taken, but in reality a slightly divergent you who’d made slightly divergent choices might have ended up
somewhere altogether different. To think in data systems is to consider all three of these stages at once
[...] (Thorp 2017)
https://medium.com/@blprnt/you-say-data-i-say-system-54e84aa7a421
Ways of Seeing Data
● Book chapter: Gray et al (2016) “Ways of Seeing Data: Toward a Critical
Literacy for Data Visualizations as Research Objects and Research Devices” [in
LumiNUS].
● Checklist of questions for analyzing visualizations
● Influenced by John Berger’s Ways of Seeing (1972), which emphasized political
and cultural histories of how images are produced and consumed.
● Objective: developing a critical literacy of visualization.
● There is no ‘neutral’ option, every type of visualization reflects ‘narrative
regimes’ (ideas of what stories should be told, and how they should be told)
● 3 forms of mediation:
○ World ⇢ data
○ Data ⇢ image
○ Image ⇢ eye
● Culturally and historically specific ‘ways of seeing’ that reflect values, cultures,
aesthetics and prejudices.
World ⇢ data
• What information and data are being represented in the visualization?
• What are the sources for this information? Where do the data come from?
(is it cited?)
• How are the data generated? What are the rationales, methods, and
standards inscribed in the data infrastructures through which the
data are generated?
• How are the data transformed or prepared?
• Which data sources are combined and how?
• How do the data selectively prioritize certain things over others?
Data ⇢ image
• How are the data mediated into graphical form?
• What kinds of graphical techniques, methods, and technologies have
been used?
• What are their affordances? (What does the data enable you to “do”? What
stories can you tell with it?) How do they guide our attention toward
different aspects of the data?
• What design decisions have been taken? What are their consequences?
Try to reverse-engineer the visualizations: what steps were taken to
produce them?
Image ⇢ eye
• What kinds of visual cultures and practices are implicated or reflected
in the data visualization? Where do these come from?
• What forms of usage are inscribed in the visualization? Who are the
publics of the data visualization? How is it circulated, cited, and
shared?
Visual cultures - what are they and where do they come from?
How are visualizations used, circulated, cited and shared?
Denaturalize - historicize (every visual “idea” comes from somewhere)
Find the genealogy of the visualization styles
Other takeaways from this chapter
Data infrastructures
Visualization devices, rather than ‘tools’ (they emphasize world-making)
‘Communicative objectivity’ (Halpern)
Tufte advocates: championing efficiency and parsimony, maximizing the
‘data–ink’ ratio, and eliminating ‘chartjunk’
For the individual assignment
Choose a data visualization and analyze it using the perspectives seen in
this chapter. It is strongly recommended that you use the checklist in Gray’s
chapter.
Individual assignment (20%)
A 1000 word analysis of a data visualization (it doesn’t need to be a
humanities data visualization, but you must use critical data perspectives,
as explained in Lectures #2 and #3).
You must explain the computational thinking process that led to this
project: how data was modeled, which aspects of the analysis were
automated, and what output was produced. You should also explain the
potential blind spots and limitations of the project.
File format: one file with your name clearly stated in the body of the text
(5 marks will be deducted if this is missing). For citations, please use the
Chicago Manual of Style 17th Edition, Author-Date format. Do not cite
academic journals as if they were websites.
Where to find graphs to analyze?
1) The media. You can analyze examples that you consider to be
successful or problematic from web pages, social media and news
sources.
2) Digital Scholarship in the Humanities https://academic.oup.com/dsh
(Access provided via NUS libraries).
3) Ideally you shouldn’t choose a scientific data visualization from
natural science publications, where visualization practices are more
standardized. You want a visualization of a ‘less settled’ area, in
connection to the social and cultural world.
That’s it for today
https://xkcd.com/688/
References
Drucker, Johanna. 2011. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5 (1).
Hepworth, Katherine, and Christopher Church. 2019. “Racism in the Machine: Visualization Ethics in Digital Humanities Projects.” Digital Humanities Quarterly 12 (4).
Gray, Jonathan, Liliana Bounegru, Stefania Milan, and Paolo Ciuccarelli. 2016. “Ways of Seeing Data: Toward a Critical Literacy for Data Visualizations as Research Objects and Research Devices.” In Innovative Methods in Media and Communication Research, edited by Sebastian Kubitschko and Anne Kaun, 227–51. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-40700-5_12.
Thorp, Jer. 2017. “You Say Data, I Say System.” Hacker Noon. July 13, 2017. https://hackernoon.com/you-say-data-i-say-system-54e84aa7a421.
Friendly, Michael, and Daniel Denis. 2005. “The Early Origins and Development of the Scatterplot.” Journal of the History of the Behavioral Sciences 41 (2): 103–130.
Nundy, Surajit, et al. 2000. “Why Are Angles Misperceived?” Proceedings of the National Academy of Sciences 97 (10): 5592–5597.