Introduction to Data Science and Engineering
- Data visualization
RB Luo / 羅銳邦
Department of Computer Science
University of Hong Kong
Some materials courtesy of
Rafael A. Irizarry, and are modified
from the original version.
dplyr was updated from 1.0.10 to 1.1.0 on Jan 31st,
2023, which caused some behavioral differences from
the course materials
Question contributed by classmate Nayoung Lim (class 2023)
"""
As I was trying the examples from your lecture note, I spotted an unexpected error.
The error message says that multiple summaries (returning more than 1 row) are
deprecated in dplyr version 1.1.0. I would like you to confirm (1) whether multiple summaries
are completely replaced and no longer usable, and if so, (2) how to use reframe()
instead to generate the same result as the example, as suggested by the error message.
P.S) The code I was trying is from ‘Multiple summaries’ chapter, pg.18 from Lec04-
tidyverse.
"""
My reply
"""
Very good question!
To give you the answer first, please use the following code to install the previous version
of tidyverse, which contains the previous version (1.0.10) of dplyr. When installing, it may ask
you a few questions; just select "update all" should there be such a question.
install.packages("remotes")   # remotes provides install_version(); devtools also works, since it depends on remotes
require(remotes)
install_version("tidyverse", version = "1.3.2", repos = "http://cran.r-project.org")
tidyverse has a very active community that is not conservative about obsoleting old
ways when there are better ones. But as a coder, and especially if you wish the analysis
pipeline you wrote to be usable by others with the same expectations you had on your
side, an important skill is to pin the versions of the libraries that you depend on.
In your case, you got a warning, not an error. And after the warning, check temp: it contains
what you need. The warning message only tells you that returning multiple rows from
summarize is deprecated in the new version. It has not been removed, and the behavior will
last for many subsequent versions until the developers can make sure that removing it
won't cause serious problems to the libraries that depend on it.
"""
Why visualize?
Look at the table for 10 seconds, what do you see?
Look at a sensible visualization of the table for 10 seconds, what do you see?
How quickly can you determine which states have the largest populations? Which states have the smallest?
How large is a typical state? Is there a relationship between population size and total murders? How do
murder rates vary across regions of the country?
Visualization in the early days
John Snow Cholera map
https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples
Influential visualization nowadays
https://coronavirus.jhu.edu/map.html
This is an example of how data
visualization can lead to discoveries
which would otherwise be missed if we
simply subjected the data to a battery
of data analysis tools or procedures.
Data visualization is the strongest tool
of what we call exploratory data
analysis (EDA). John W. Tukey,
considered the father of EDA, once
said,
“The greatest value of a picture is
when it forces us to notice what we
never expected to see.”
Many widely used data analysis tools
were initiated by discoveries made via
EDA. EDA is perhaps the most
important part of data analysis, yet it is
one that is often overlooked.
https://en.wikipedia.org/wiki/Demographics_of_Hong_Kong
Animation
(Moving targets are more visible)
Hans Rosling, The best stats you've ever seen, TED2006
https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen
Visualization prevents garbage in,
garbage out
It is also important to note that mistakes, biases, systematic errors
and other unexpected problems often lead to data that should be
handled with care. Failure to discover these problems can give rise
to flawed analyses and false discoveries. As an example, consider
that measurement devices sometimes fail and that most data
analysis procedures are not designed to detect these. Yet these data
analysis procedures will still give you an answer. The fact that it can
be difficult or impossible to notice an error just from the reported
results makes data visualization particularly important.
Data visualization
• We will use the ggplot2 package to code
Load ggplot2 on its own, or simply load the whole tidyverse, which includes it
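A minimal sketch of the two loading options (assuming the packages are installed):
library(ggplot2)     # load ggplot2 only
# or simply
library(tidyverse)   # loads ggplot2 together with dplyr, tidyr, readr, ...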
• Facts
• gg means "grammar of graphics"
• By learning a handful of ggplot2 building blocks and its grammar,
you will be able to create hundreds of different plots
• Assuming that our data is tidy, ggplot2 simplifies plotting code and
the learning of grammar for a variety of plots
• There are other libraries in R that also create graphics
• Such as grid and lattice
• ggplot2 generates a plot with multiple layers, allowing adding
components or elements to a figure one at a time
• ggplot2 functions and arguments are hard to memorize, so we recommend
you keep the ggplot2 cheat sheet handy
• https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
• Search for “ggplot2 cheat sheet”.
The components of a graph
• The first step in learning ggplot2 is to be able to break a graph
apart into components. Let’s break down the plot example and
introduce some of the ggplot2 terminology. The main three
components to note are
• Data: The US murders data table is being summarized. We refer to
this as the data component
• Geometry: The plot above is a scatterplot. This is referred to as the
geometry component. Other possible geometries are barplot,
histogram, smooth densities, qqplot, and boxplot. We will learn
more about these in the Data Visualization part of the book
• Aesthetic mapping: The plot uses several visual cues to represent
the information provided by the dataset. The two most important
cues in this plot are the point positions on the x-axis and y-axis,
which represent population size and the total number of murders,
respectively. Each point represents a different observation, and we
map data about these observations to visual cues like x- and y-
scale. Color is another visual cue that we map to region. We refer
to this as the aesthetic mapping component. How we define the
mapping depends on what geometry we are using
• Other components include
• Coordinate system, position adjustment, labels and legend,
theme, facets
The components of a graph
• We also note that
• The points are labeled with the state abbreviations
• The range of the x-axis and y-axis appears to be defined by
the range of the data. They are both on log-scales
• There are labels, a title, a legend, and we use the style of
The Economist magazine
• Now we construct the plot piece by piece
• We start by loading the dataset
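A minimal sketch of this setup step, assuming the US murders data comes from the dslabs package as in earlier lectures:
library(tidyverse)
library(dslabs)
data(murders)   # the dataset summarized in the plot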
ggplot objects
• The first step in creating a ggplot2 graph is
to define a ggplot object. We do this with
the function ggplot, which initializes the
graph. If we read the help file for this
function, we see that the first argument is
used to specify what data is associated
with this object, thus the pipe can be used
• It renders a plot, in this case a blank slate
since no geometry has been defined. The
only style choice we see is a grey
background
• If we assign the ggplot object to a variable,
evaluate or print that variable to show the plot
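A minimal sketch of the two equivalent ways to define the object (murders loaded as above):
p <- ggplot(data = murders)
# or, using the pipe
p <- murders |> ggplot()
p   # evaluating/printing p renders a blank slate with a grey background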
Geometries
• In ggplot2 we create graphs by adding layers. Layers can define geometries,
compute summary statistics, define what scales to use, or even change styles. To
add layers, we use the symbol +. In general, a line of code will look like this:
DATA |> ggplot() + LAYER 1 + LAYER 2 + … + LAYER N
Available geometries
See these examples? Give them a try
https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
Aesthetic mappings
Aesthetic mappings describe how properties of the data connect with features of the graph,
such as distance along an axis, size, or color. The aes function connects data with what we see on
the graph by defining aesthetic mappings and will be one of the functions you use most often
when plotting. The outcome of the aes function is often used as the argument of a geometry
function. This example produces a scatterplot of total murders versus population in millions
We can drop the x = and y = if we wanted to since these are the first and second expected arguments.
Instead of defining our plot from scratch, we can also add a layer to the p object that was defined before
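A minimal sketch of the scatterplot described above (murders and p as defined earlier):
murders |> ggplot() +
  geom_point(aes(x = population/10^6, y = total))
# x = and y = can be dropped; we can also add the layer to the p object defined before
p + geom_point(aes(population/10^6, total))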
Aesthetic mappings
• The scale and labels are defined by default
• aes also uses the variable names from the
object component: we can use population and
total without having to call them as
murders$population and murders$total
• The behavior of recognizing the variables
from the data component is quite specific
to aes. With most functions, if you try to
access the values of population or total
outside of aes you receive an error
Layers
• A second layer in the plot we wish to make
involves adding a label to each point to
identify the state. The geom_label and
geom_text functions permit us to add text to
the plot with and without a rectangle behind
the text, respectively
• Because each point (each state in this case)
has a label, we need an aesthetic mapping to
make the connection between points and
labels. The code looks like this
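A minimal sketch of the label layer, assuming the abb column holds the state abbreviations:
p + geom_point(aes(population/10^6, total)) +
  geom_text(aes(population/10^6, total, label = abb))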
Tinkering with arguments
Each geometry function has many
arguments other than aes and data.
They tend to be specific to the
function. For example, in the plot we
wish to make, the points are larger
than the default size. In the help file
we see that size is an aesthetic and we
can do this
size is not associated with individual observations; it applies
to all data points
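A minimal sketch with a larger point size (the value 3 is illustrative):
p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb))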
Nudge the label
Now because the points are larger it is
hard to see the labels. We can use the
nudge_x argument, which moves the
text slightly to the right or to the left
Like size, nudge_x is not associated with individual observations; it
applies to all labels
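A minimal sketch of the nudge (the value 1.5 is illustrative):
p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb), nudge_x = 1.5)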
Use global instead of local aesthetic mappings
In the previous example, we defined the mapping aes(population/10^6, total) twice,
once in each geometry. We can avoid this by using a global aesthetic mapping
The global mapping applies to both geom_point and geom_text;
geom_point does not use the label aesthetic, so the label is
simply ignored there
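A minimal sketch of the global mapping described above (note that p is redefined here):
p <- murders |> ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) +
  geom_text(nudge_x = 1.5)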
Use both global and local aesthetic mappings
If necessary, we can override the global mapping
by defining a new mapping within each layer.
These local definitions override the global
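A sketch of a local override, following the style of the textbook example (the text and its position are illustrative):
p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))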
Scales
Our desired scales are in log-scale. This is not the
default, so this change needs to be added through
a scales layer. The scale_x_continuous function
lets us control the behavior of scales
Because we are in the log-scale now, the nudge
must be made smaller
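A minimal sketch of the scale layers (the smaller nudge value is illustrative):
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")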
Reference: other available controls are in the cheatsheet
https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
Scales
This particular transformation is so common that
ggplot2 provides the specialized functions
scale_x_log10 and scale_y_log10
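A sketch of the same layers using the specialized functions:
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10()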
Labels and titles
Plots without labels and titles are like babies
without names; it shouldn't happen
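A minimal sketch using labs (the label wording is illustrative):
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Populations in millions (log scale)",
       y = "Total number of murders (log scale)",
       title = "US Gun Murders in 2010")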
Labels and titles alternatives
Some prefer using xlab, ylab, ggtitle instead of a
single labs for better "layerization"
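A sketch of the same labels expressed as separate layers:
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")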
Categories as colors
We can change the color of the points using the
col argument in the geom_point function
This is not what we want, but just a
demonstration of how to change a layer to a
single color
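A sketch following the textbook's flow: redefine p as everything except the points layer, then add points in a single color (the color is illustrative):
p <- murders |> ggplot(aes(population/10^6, total, label = abb)) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")
p + geom_point(size = 3, color = "blue")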
Categories as colors
A nice default behavior of ggplot2 is that if we
assign a categorical variable to color, it
automatically assigns a different color to each
category and also adds a legend
Since the choice of color is determined by a
feature of each observation, this is an aesthetic
mapping. To map each point to a color, we need
to use aes
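A minimal sketch of the mapping (p as redefined above):
p + geom_point(aes(col = region), size = 3)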
Categories as colors
These global aes are used in all following geometries.
While col = region is specific to this geometry,
aes needs to be the first argument
ggplot2 automatically adds a
legend that maps color to region.
To avoid adding this legend we set
the geom_point argument
show.legend = FALSE
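A sketch of the same layer with the legend suppressed:
p + geom_point(aes(col = region), size = 3, show.legend = FALSE)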
Annotation, shapes, and adjustments
We often want to add shapes or annotation to
figures that are not derived directly from the
aesthetic mapping like labels, boxes, shaded
areas, and lines
Here we want to add a line that represents the
average murder rate r for the entire country.
When using normal scale, the line is defined as y =
rx, where r is the slope. In the log-scale, the line
turns into log(y) = log(r) + log(x), where slope is 1
and intercept is log(r)
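A minimal sketch: compute the country-wide rate r and add the line (population on the x-axis is in millions, hence the 10^6; the slope defaults to 1):
r <- murders |>
  summarize(rate = sum(total) / sum(population) * 10^6) |>
  pull(rate)
p + geom_point(aes(col = region), size = 3) +
  geom_abline(intercept = log10(r))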
Use a dashed line
Use a dashed line or another line type (lty), and a
lighter color, for a trend line or other
annotations
geom_abline (ab stands for y=ax+b) is not in the
cheatsheet, so make your own cheatsheet
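A sketch of the tweaked version; the line is added first so the points are drawn on top, and p is updated to keep the change (line type and color are illustrative):
p <- p + geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3)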
Small tweaks
"region" is from the name of the column in the
original table, how to change it back to "Region"?
Well I myself like lowercasing everything (you
know what I am talking about if you like Taylor
Swift), but please make sure your style adds value
to you instead of causing you trouble. In a formal
working environment, we should still change
"region" to "Region"
Small tweaks
It's working
Asking ChatGPT gives an alternative that also works
Using theme
The power of ggplot2 is augmented further due to
the availability of add-on packages
The style of a ggplot2 graph can be changed using
the theme functions. Several themes are included
as part of the ggplot2 package
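A minimal sketch, assuming the ggthemes add-on package, which provides the Economist style referred to earlier:
library(ggthemes)
p + theme_economist()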
Using theme
A complete list of themes in ggthemes:
https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
When there is a will, there is a way,
in R at least
Problem:
The labels are overlapping
Question:
How to rearrange the labels
in R so they don't overlap?
Use geom_text_repel instead of geom_text
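A minimal sketch, assuming the ggrepel add-on package:
library(ggrepel)
# swap geom_text(nudge_x = ...) for geom_text_repel(), which pushes overlapping labels apart
murders |> ggplot(aes(population/10^6, total, label = abb)) +
  geom_point(aes(col = region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10()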
Putting it all together
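A sketch of the full plot assembled from the pieces above (label wording, sizes, and styles are illustrative):
library(ggthemes)
library(ggrepel)

r <- murders |>
  summarize(rate = sum(total) / sum(population) * 10^6) |>
  pull(rate)

murders |>
  ggplot(aes(population/10^6, total, label = abb)) +
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_economist()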
Grids of plots
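One common approach to arranging several plots side by side, assuming the gridExtra add-on package (the plots themselves are illustrative):
library(gridExtra)
p1 <- murders |> ggplot(aes(population/10^6)) + geom_histogram(binwidth = 5)
p2 <- murders |> ggplot(aes(population/10^6, total)) + geom_point()
grid.arrange(p1, p2, ncol = 2)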
Brief summary
• In the previous slides, we learned to use ggplot2, but with
just one example
• The functions and usages are not for you to memorize,
you will rely on the cheatsheet and Google heavily. Later
with more plots generated, you will have a collection of
working code for many types and styles of plots, which
speeds you up
• The final exam won’t demand a perfectly working copy of
code. Writing code from scratch is minimized in the final
exam. I care about correct understanding and good ideas
• Next, let’s learn visualizing data distributions
Visualizing data distributions
The median salary in Hong Kong in 2020 was HKD 20k for
males and HKD 16k for females
Compared to the information above, what can you
additionally obtain from the plot (split by M/F) on the
left?
The most basic statistical summary of a list of
objects or numbers is its distribution. Once the
data has been summarized as a distribution,
there are several data visualization
techniques to effectively relay this
information
In the following slides, we will discuss
properties of a variety of distributions and
how to visualize distributions using a
motivating example of student heights
(Source: Census and Statistics Department of Hong Kong;
* foreign domestic workers excluded)
Variable types
• The two main variable types are categorical/numeric
• A categorical variable can be ordinal or nominal
• Ordinal: {very bad, bad, neutral, good, very good}
• Nominal: {Northeast, South, North Central, West}
• A numerical variable can be discrete or continuous
• Discrete: takes whole-number (countable) values, e.g., population size
• Continuous: needs decimals for precision, e.g., heights
We will use both the murders and heights datasets for following examples
Here are some details about the heights dataset
In case you want to work on metric units
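A minimal sketch of inspecting the heights data and converting to metric units, assuming the dslabs heights dataset with height in inches:
library(dslabs)
data(heights)
head(heights)                                  # columns: sex, height (inches)
heights |> mutate(height_cm = height * 2.54)   # in case you want metric units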
A simple count of categorical data using barplot
Distribution of categorical data
With categorical data, the proportion of each unique category
describes the distribution
n is produced by count()
Use geom_col instead of geom_bar, check
https://stackoverflow.com/a/59009108
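A minimal sketch of both versions (counts with geom_bar, precomputed proportions with geom_col):
# simple count of a categorical variable
heights |> ggplot(aes(sex)) + geom_bar()

# distribution as proportions: count() produces the n column
heights |>
  count(sex) |>
  mutate(proportion = n / sum(n)) |>
  ggplot(aes(sex, proportion)) +
  geom_col()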
Easy?
The two examples show the number or
proportion of each category. We
usually use barplots to display a few
numbers. Although they do not
provide much more insight than a
frequency table itself, it is a first
example of how we convert a vector
into a plot that succinctly summarizes
all the information in the vector. When
the data is numerical, the task of
displaying distributions is more
challenging
Histogram
• Numerical data that are not categorical also have distributions.
However, in general, when data is not categorical, reporting the
frequency of each entry, as we did for categorical data using barplot,
is not an effective summary since most entries are unique. For
example, in our case study, while several students reported a height of
68 inches, only one student reported a height of 68.50393 inches and
only one student reported a height 68.89763 inches
• A histogram divides the span of the data into non-overlapping bins of the
same size. Then, for each bin, we count the number of values that fall
in that interval. The histogram plots these counts as bars with the
base of each bar defined by its interval
Histogram
Here is the histogram for the height data splitting the range of values
into one-inch intervals:
(49.5,50.5],(50.5,51.5],(51.5,52.5],(52.5,53.5],...,(82.5,83.5]
As you can see in the figure above, a histogram is similar to a barplot, but it differs in that
the x-axis is numerical, not categorical
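A minimal sketch of the one-inch-bin histogram (the colors are illustrative):
heights |>
  filter(sex == "Male") |>
  ggplot(aes(height)) +
  geom_histogram(binwidth = 1, fill = "blue", col = "black")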
There is a reason that we don’t see the world in B/W
Color is another language
https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/
What’s the best binwidth?
Histogram
From the histogram, we can immediately learn some
important properties about our data
• The range of the data is from 50 to 84 with the majority
(more than 95%) between 63 and 75 inches
• The heights are close to symmetric around 69 inches
What information do we lose? Note that all values in each
interval are treated the same when computing bin heights.
So, for example, the histogram does not distinguish
between 64, 64.1, and 64.2 inches. Given that these
differences are almost unnoticeable to the eye, the
practical implications are negligible and we were able to
summarize the data to just 23 numbers
Smoothed density
Smooth density plots relay the same information as a histogram but are
aesthetically more appealing
In this plot, we no longer have sharp edges at the interval boundaries and many of the local peaks have been
removed. Also, the scale of the y-axis changed from counts to density. To fully understand smooth densities, we
have to understand estimates, a topic we don't cover until later. Here we simply describe them as making the
histograms prettier by drawing a curve that goes through the top of the histogram bars and then removing the
bars. The values shown on the y-axis are chosen so that the area under the curve adds up to 1. This implies that
for any interval, the area under the curve for that interval gives us an approximation of what proportion of the
data is in the interval
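A minimal sketch of the smooth density version (the fill color is illustrative):
heights |>
  filter(sex == "Male") |>
  ggplot(aes(height)) +
  geom_density(fill = "blue")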
Smoothed density
An advantage of smooth densities over histograms for visualization
purposes is that densities make it easier to compare two distributions. This
is in large part because the jagged edges of the histogram add clutter. Here
is an example comparing male and female heights
Line color, fill color, and transparency
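A minimal sketch of the comparison, with line color (col) and fill color (fill) mapped to sex and the transparency set by alpha:
heights |>
  ggplot(aes(height, col = sex, fill = sex)) +
  geom_density(alpha = 0.2)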
Smoothed density
Normal distribution
Histograms and density plots provide excellent summaries of a distribution. But can we
summarize even further? We often see the average and standard deviation (just two
numbers) used as summary statistics. To understand what these summaries are and why
they are so widely used, we need to understand the normal distribution
The normal distribution is also known as the bell curve or the Gaussian distribution.
Here is what the normal distribution looks like
The plot shows a normal distribution with
average 0 and SD 1
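A minimal sketch of drawing the standard normal curve (average 0, SD 1) with ggplot2:
data.frame(x = seq(-4, 4, length.out = 200)) |>
  ggplot(aes(x = x, y = dnorm(x))) +   # dnorm() gives the normal density
  geom_line()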
Normal distribution
The normal distribution is one of the most famous mathematical concepts in history. A reason for this is
that the distribution of many datasets can be approximated with normal distributions. These include
gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental
measurement errors. Later in "Introduction to statistics with R", we will learn why. But how can the same
distribution approximate datasets with completely different ranges of values, for example heights and
weights?
The normal distribution can be adapted to different datasets by adjusting just two numbers, referred to as
the average (or mean) and the standard deviation (SD)
Note that because only two numbers are needed to adapt the normal distribution to a dataset, if our data
distribution can be approximated by a normal distribution, all the information needed to describe the
distribution can be encoded in just those two numbers
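A minimal sketch of computing the two numbers for male heights (dslabs heights assumed):
heights |>
  filter(sex == "Male") |>
  summarize(average = mean(height), SD = sd(height))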
Normal distribution
To investigate whether the heights dataset follows a normal distribution, what can we do with
visualization? Overlay the heights distribution with a normal distribution
The distribution of heights seems to be a normal
distribution, so what?
We will learn more later in the statistics section
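A sketch of the overlay, using the normal density with the sample's own average and SD as the dashed curve (a hypothetical construction, not necessarily the slide's exact code):
params <- heights |>
  filter(sex == "Male") |>
  summarize(mean = mean(height), sd = sd(height))

heights |>
  filter(sex == "Male") |>
  ggplot(aes(height)) +
  geom_density(fill = "blue", alpha = 0.3) +
  stat_function(fun = dnorm, lty = 2,
                args = list(mean = params$mean, sd = params$sd))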
Boxplot
To understand boxplots, we need to define some terms that are
commonly used in exploratory data analysis
The percentiles are the values for which p = 0.01, 0.02, ..., 0.99 of
the data are less than or equal to that value, respectively. We call,
for example, the case of p = 0.10 the 10th percentile, which gives us
a number below which 10% of the data falls. The most famous
percentile is the 50th, also known as the median. Other special
cases that receive names are the quartiles, obtained when setting
p = 0.25, 0.50, and 0.75, which are used by the boxplot
To motivate boxplots, we will go back to the US murder data.
Suppose we want to summarize the murder rate distribution. Using
the data visualization skills we have learned, we can quickly see that
the normal approximation does not apply here; more information is
needed to describe the distribution of the data
Boxplot
The boxplot provides a five-number summary composed of the range along with the quartiles (the 25th,
50th, and 75th percentiles). The R implementation of boxplots ignores outliers when computing the
range and instead plots them as independent points. The boxplot shows these numbers as a "box" with
"whiskers"
How to define outliers?
Interquartile range (IQR), also called the middle 50%, is a measure of statistical dispersion, equal
to the difference between the 75th and 25th percentiles.
Outliers here are defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR.
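A minimal sketch of a boxplot of the murder rate (the empty x value is just a placeholder label):
murders |>
  mutate(rate = total / population * 10^5) |>
  ggplot(aes(x = "", y = rate)) +
  geom_boxplot()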
Boxplot
From the boxplot, we know that the
distribution is not symmetric, and
there are a few outliers
Boxplot
Q-Q plot
A Q-Q plot, short for “quantile-quantile” plot, is used to
assess whether or not a set of data potentially came from
some theoretical distribution
In most cases, this type of plot is used to determine
whether or not a set of data follows a normal distribution
If the data is normally distributed, the points in a Q-Q plot
will lie on a straight diagonal line
Q-Q Plots Explained
https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0
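A minimal sketch of a Q-Q plot against the standard normal, using scaled (standard-unit) heights:
heights |>
  filter(sex == "Male") |>
  ggplot(aes(sample = scale(height))) +
  geom_qq() +
  geom_abline()   # identity line; points near it suggest approximate normality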
Stratification
In the previous examples, the murders dataset was grouped by regions, the heights
dataset was grouped by gender
In data analysis we often divide observations into groups based on the values of
one or more variables associated with those observations. We call this procedure
stratification and refer to the resulting groups as strata. Stratification is common in
data visualization because we are often interested in how the distribution of
variables differs across different subgroups
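A minimal sketch of a stratified comparison, grouping murder rates by region:
murders |>
  mutate(rate = total / population * 10^5) |>
  ggplot(aes(region, rate, fill = region)) +
  geom_boxplot()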
So much for visualizing data distributions
Next, we will work on some more examples, and use more
ggplot geometries and tricks