Lesson Guide: Creating EDA Reports using ggplot2 in R Markdown
Course: Analytics Techniques and Tools using R
Learning Objectives
By the end of this lesson, students will be able to:
1. Explain why ggplot2 is well suited to building EDA reports.
2. Apply the grammar of graphics to create effective visualizations.
3. Use R Markdown to generate structured and reproducible EDA reports with ggplot2.
4. Perform univariate and bivariate analysis using ggplot2.
5. Conduct statistical tests and visualize their results using ggplot2.
Lesson Content
1. Introduction to ggplot2 for EDA Reports
● Why use ggplot2?
● Grammar of Graphics: A structured approach to visualization
● Basic syntax of ggplot2
● Installing and loading ggplot2
# Install ggplot2 if not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", dependencies = TRUE)
}
# Load the package
library(ggplot2)
2. Grammar of Graphics in ggplot2
ggplot2 follows a layered approach:
● Data Layer: The dataset used for visualization.
● Aesthetics (aes()) Layer: Mapping variables to visual properties.
● Geometry Layer (geom_*()): Defines the type of visualization.
● Faceting (facet_wrap() or facet_grid()): Splitting plots by categorical variables.
● Theme and Labels (theme(), labs()): Customizing appearance.
# Basic ggplot structure (data, variable1, and variable2 are placeholders
# for a data frame and two of its columns)
ggplot(data, aes(x = variable1, y = variable2)) +
  geom_point()
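As an illustration of how the layers listed above stack, the following sketch builds one plot from the built-in mtcars dataset. The choice of columns (wt, mpg, cyl) and of theme_minimal() is an assumption made for the example, not something the lesson prescribes.

```r
library(ggplot2)

# Data layer and aesthetic mappings: weight on x, fuel economy on y
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                        # geometry layer: one point per car
  facet_wrap(~ cyl) +                   # faceting: one panel per cylinder count
  labs(title = "MPG vs Weight by Cylinder Count",
       x = "Weight (1000 lbs)", y = "Miles per gallon") +  # labels
  theme_minimal()                       # theme layer

p  # printing the object draws the plot
```

Each `+` adds one layer, so a plot can be built up and inspected step by step while exploring.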
3. Data Structure Analysis
3.1. Understanding the Dataset
# Load dataset (Example: mtcars)
data <- mtcars
# Check dataset structure
str(data)
# Summary statistics
summary(data)
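In mtcars, str() reports every column as numeric, including columns such as cyl, gear, and am that are really category codes. One possible follow-up, sketched here on a copy named cars (a name chosen only for this example), is to convert those columns to factors so that summaries and plots treat them as groups.

```r
# Work on a copy so the original data frame stays untouched
cars <- mtcars

# Convert category codes to factors
cars$cyl  <- factor(cars$cyl)
cars$gear <- factor(cars$gear)
# In mtcars, am is coded 0 = automatic, 1 = manual
cars$am   <- factor(cars$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))

# Factors now show level counts instead of numeric summaries
str(cars[, c("cyl", "gear", "am")])
summary(cars[, c("cyl", "gear", "am")])
```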
3.2. Data Quality and Handling Missing Values
# Check for missing values
sum(is.na(data))
# Handling missing values
data <- na.omit(data) # Remove rows with missing values
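A single total can hide where the gaps are, so a per-column count is often more useful before deciding how to handle them. The sketch below assumes the analysis only needs mpg, hp, and wt (an assumption for illustration) and drops rows that are incomplete in those columns only, rather than in any column.

```r
data <- mtcars

# Missing values per column rather than one overall total
missing_per_column <- colSums(is.na(data))
missing_per_column

# Keep rows that are complete in the columns the analysis actually uses
cols_needed <- c("mpg", "hp", "wt")
data_complete <- data[complete.cases(data[, cols_needed]), ]

# Compare row counts before and after filtering
nrow(data)
nrow(data_complete)
```

mtcars contains no missing values, so both row counts match here. On real data the difference shows how many observations the filter costs.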
4. Univariate Analysis
4.1. Understanding Distribution and Normality
# Shapiro-Wilk test for normality
shapiro.test(data$mpg)
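In a report it helps to store the test result and state the conclusion explicitly. The following is a minimal sketch, assuming a conventional significance level of 0.05. The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution.

```r
# Store the result so individual pieces can be reported
sw <- shapiro.test(mtcars$mpg)

sw$statistic  # W statistic
sw$p.value    # p-value of the test

alpha <- 0.05  # assumed significance level
if (sw$p.value < alpha) {
  message("Normality is rejected at the ", alpha, " level.")
} else {
  message("No evidence against normality at the ", alpha, " level.")
}
```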
4.2. Visualizing Data Distribution with ggplot2
# Histogram
ggplot(data, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
  labs(title = "Histogram of MPG", x = "MPG", y = "Count")
# Boxplot
ggplot(data, aes(y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot of MPG", y = "MPG")
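Two further views are commonly paired with a normality test: a Q-Q plot and a density curve. The sketch below uses ggplot2's stat_qq(), stat_qq_line(), and geom_density(). The colors and titles are arbitrary choices for the example.

```r
library(ggplot2)

# Q-Q plot: points close to the reference line suggest approximate normality
qq_plot <- ggplot(mtcars, aes(sample = mpg)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot of MPG",
       x = "Theoretical quantiles", y = "Sample quantiles")

# Density curve: a smoothed alternative to the histogram
density_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "lightblue", alpha = 0.6) +
  labs(title = "Density of MPG", x = "MPG", y = "Density")

qq_plot
density_plot
```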
4.3. Outlier Detection
# Boxplot-based outliers
Q1 <- quantile(data$mpg, 0.25)
Q3 <- quantile(data$mpg, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
outliers <- data$mpg[data$mpg < lower_bound | data$mpg > upper_bound]
# 3-SD Rule Outliers
data_mean <- mean(data$mpg)
data_sd <- sd(data$mpg)
lower_sd_bound <- data_mean - 3 * data_sd
upper_sd_bound <- data_mean + 3 * data_sd
outliers_sd <- data$mpg[data$mpg < lower_sd_bound | data$mpg > upper_sd_bound]
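The IQR rule above can be wrapped in a small helper so it applies to any numeric column, and the flagged points can then be highlighted in a plot. The function name flag_iqr_outliers and the column name mpg_outlier are hypothetical names introduced for this sketch.

```r
library(ggplot2)

# Hypothetical helper: TRUE where a value lies beyond k * IQR from the quartiles
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr_value <- q[2] - q[1]
  x < q[1] - k * iqr_value | x > q[2] + k * iqr_value
}

cars <- mtcars
cars$mpg_outlier <- flag_iqr_outliers(cars$mpg)

# Show every observation on top of the boxplot, colored by the flag
outlier_plot <- ggplot(cars, aes(x = "", y = mpg)) +
  geom_boxplot(outlier.shape = NA, fill = "lightblue") +
  geom_jitter(aes(color = mpg_outlier), width = 0.1, height = 0) +
  labs(title = "MPG with IQR-Rule Outliers Highlighted",
       x = NULL, y = "MPG", color = "Outlier")

outlier_plot
```

Setting height = 0 keeps the jitter horizontal only, so the plotted MPG values stay exact.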
5. Bivariate Analysis
5.1. Categorical vs Categorical (Chi-Square Test & Stacked Bar Plots)
# Stacked bar plot: share of each gear count within each cylinder group
ggplot(data, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of Gear Counts by Cylinder Count",
       x = "Cylinders", y = "Proportion", fill = "Gears")
# Chi-Square Test
chisq.test(table(data$cyl, data$gear))
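The chi-square approximation is usually considered unreliable when expected cell counts are small, and mtcars has only 32 rows spread over nine cells. The sketch below inspects the expected counts and, as one commonly used alternative, runs Fisher's exact test. The threshold of 5 is a conventional rule of thumb, not a requirement stated in this lesson.

```r
# Contingency table with labeled dimensions
tab <- table(Cylinders = mtcars$cyl, Gears = mtcars$gear)

# Suppress the approximation warning while inspecting the expected counts
chi <- suppressWarnings(chisq.test(tab))
chi$expected

# Rule of thumb: be cautious when any expected count is below 5
small_cells <- any(chi$expected < 5)
small_cells

# Fisher's exact test does not rely on the large-sample approximation
fish <- fisher.test(tab)
fish$p.value
```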
5.2. Categorical vs Numerical (T-Test, ANOVA, Wilcoxon, Kruskal-Wallis)
# Boxplot comparison
ggplot(data, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "MPG by Cylinder Count", x = "Cylinders", y = "MPG")
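# T-test for two groups (am: 0 = automatic, 1 = manual); R runs Welch's t-test by default
t.test(mpg ~ am, data = data)
# One-way ANOVA for more than two groups; cyl must be a factor,
# otherwise lm() fits a single slope instead of comparing group means
anova(lm(mpg ~ factor(cyl), data = data))
# Wilcoxon rank-sum test: nonparametric alternative to the t-test
wilcox.test(mpg ~ am, data = data)
# Kruskal-Wallis test: nonparametric alternative to one-way ANOVA
kruskal.test(mpg ~ factor(cyl), data = data)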
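One simple way to show a test result alongside its plot is to place the statistic and p-value in the subtitle. This is a minimal sketch using the Kruskal-Wallis test. The label format and the variable names kw and kw_label are choices made for the example.

```r
library(ggplot2)

cars <- mtcars
cars$cyl <- factor(cars$cyl)

# Run the test once and reuse its pieces in the plot annotation
kw <- kruskal.test(mpg ~ cyl, data = cars)
kw_label <- sprintf("Kruskal-Wallis: chi-squared = %.2f, df = %d, p = %.3g",
                    kw$statistic, as.integer(kw$parameter), kw$p.value)

test_plot <- ggplot(cars, aes(x = cyl, y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "MPG by Cylinder Count", subtitle = kw_label,
       x = "Cylinders", y = "MPG")

test_plot
```

Because the label is computed from the test object, the subtitle updates automatically when the report is re-knit on new data.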
5.3. Numerical vs Numerical (Correlation & Regression)
# Scatterplot with trend line
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "HP vs MPG", x = "Horsepower", y = "MPG")
# Correlation matrix
cor(data[, c("mpg", "hp", "wt")])
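The trend line drawn by geom_smooth(method = "lm") corresponds to a simple linear regression, which can also be fitted and inspected directly. The sketch below fits mpg on hp and runs a correlation test on the same pair. Using Pearson's correlation (the cor.test() default) is an assumption of the example.

```r
# Simple linear regression of fuel economy on horsepower
fit <- lm(mpg ~ hp, data = mtcars)
coef(fit)                       # intercept and slope
r_squared <- summary(fit)$r.squared
r_squared

# Pearson correlation test for the same two variables
ct <- cor.test(mtcars$mpg, mtcars$hp)
r <- unname(ct$estimate)
r
ct$p.value

# In simple linear regression, R-squared equals the squared correlation
all.equal(r_squared, r^2)
```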
6. Summary & Next Steps
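● Key Takeaways: ggplot2 builds every EDA plot from the same layers (data, aesthetics, geometries, facets, themes), and each univariate or bivariate plot in this lesson pairs with a statistical test that checks what the plot suggests.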
● Next Lesson: Advanced Data Visualization Techniques with ggplot2.