Data Analysis and Data Visualization Basics (Pt2)
Overview
Visualization Tools
Statistics
Statistical Summaries
Correlation and Trends
Visualization Tools
Statistics on Visualization
Our brains value visuals over any other type of information.
90% of the information transmitted to the brain is visual (MIT).
The human brain can process an image in just 13 milliseconds (MIT).
50% of the brain is active in visual processing (Piktochart).
Human brains process visuals 60,000 times faster than they do text (University of Minnesota).
Visualization Tools
Tool | Features | Best For
Matplotlib | Basic charts and plots | Foundational plotting in Python
Seaborn | Advanced statistical plots | Visualizing relationships and trends
ggplot | Grammar of Graphics framework | Data visualization in R
Tableau | Interactive dashboards | Business intelligence and analytics
Power BI | Real-time data analytics | Dynamic reporting in enterprises
Power BI Visualization: Install Power BI Desktop and get started
Tableau Visualization: Install Tableau Public and get started
Seaborn, Matplotlib, ggplot
Statistics
What is Statistics?
Statistics is the science of collecting, organizing,
analyzing, interpreting, and presenting data to make
informed decisions.
It is a foundational tool in many fields, including
business, healthcare, engineering, social sciences,
and natural sciences.
It is a foundation for data analysis and data science.
Types of Statistics
Descriptive Statistics
Summarizes and describes the main features of a dataset.
Does not make predictions or infer conclusions beyond the data.
Common techniques include:
• Measures of central tendency
• Measures of dispersion
• Data visualizations
Inferential Statistics
Makes predictions, inferences, and generalizations about a population based on a sample.
Involves probability theory and hypothesis testing.
Key concepts include:
• Estimation
• Hypothesis Testing
• Regression and Correlation Analysis
Key Concepts in Statistics
Population
The entire group of individuals or items being studied
Sample
A subset of the population used for analysis.
Probability
A measure of how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain).
Outliers
Data points significantly different from others. Can distort statistical measures like mean and
variance.
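The population and sample concepts above can be sketched in a few lines of Python. The population here is synthetic (randomly generated ages), and the population size and sample size are assumptions chosen only for illustration.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (synthetic data for illustration).
random.seed(42)  # fixed seed so repeated runs give the same numbers
population = [random.randint(15, 25) for _ in range(1000)]

# A sample is a subset of the population, drawn here without replacement.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population size: {len(population)}, mean: {population_mean:.2f}")
print(f"Sample size: {len(sample)}, mean: {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean; that gap is what inferential statistics reasons about.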
Statistical Methods
Data Collection
Surveys, experiments, observational studies, and simulations.
Data Analysis
Techniques for summarizing and exploring data.
Hypothesis Testing
• A systematic method to test assumptions about a dataset.
• Includes null and alternative hypotheses, evaluated with statistical tests.
Regression Analysis
• Models the relationship between variables
• Simple Linear Regression and Multiple Regression.
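As a rough illustration of simple linear regression, the sketch below fits a straight line by ordinary least squares using only built-in Python. The function name `fit_line` and the hours/score data are hypothetical and not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 6  # predicted score for 6 hours of study

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted score for 6 hours = {predicted:.1f}")
```

Multiple regression extends the same idea to several input variables and is normally done with a library rather than by hand.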
Applications of Statistics
Healthcare
Social Sciences
Engineering
Education
Business
Challenges in Statistics
01 Data Quality
02 Bias
03 Interpretation
Statistical Summaries
Measures of Central Tendency
Measures of central tendency are statistical metrics that represent the center or typical value of a dataset.
They provide a summary of the data by identifying a single value that reflects the overall distribution.
Mean
Median
Mode
Mean
The sum of all data points divided by the number of data points.
Affected by all values in the dataset, including outliers.
$\text{Mean} = \bar{x} = \frac{\sum x_i}{n}$
Where:
$x_i$ : Each data point
$n$ : Total number of data points
Mean (Example 1)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
$\text{Mean} = \frac{\sum x_i}{n}$
$\text{Mean} = \frac{15 + 16 + 16 + 17 + 18 + 16 + 17 + 16 + 25 + 17}{10} = \frac{173}{10}$
Mean = 17.3
Mean (Example 2)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 50, 17
$\text{Mean} = \frac{\sum x_i}{n}$
$\text{Mean} = \frac{15 + 16 + 16 + 17 + 18 + 16 + 17 + 16 + 50 + 17}{10} = \frac{198}{10}$
Mean = 19.8
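The two mean examples can be checked with a short Python sketch. The helper name `mean` is just an illustrative choice; the data are the two age lists from the examples.

```python
def mean(values):
    """Sum of all data points divided by the number of data points."""
    return sum(values) / len(values)


ages_example_1 = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
ages_example_2 = [15, 16, 16, 17, 18, 16, 17, 16, 50, 17]  # 25 replaced by 50

print(mean(ages_example_1))  # 17.3
print(mean(ages_example_2))  # 19.8
```

Changing a single value from 25 to 50 moves the mean by 2.5 years, which shows how strongly one outlier can pull it.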
Mode
The value(s) that occur most frequently in a dataset.
Applicable for both numerical and categorical data.
Not influenced by extreme values.
A dataset can be:
Unimodal: One mode.
Bimodal: Two modes.
Multimodal: More than two modes.
Mode (Example)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Value | Frequency
15 | 1
16 | 4
17 | 3
18 | 1
25 | 1
Mode = 16
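The frequency count in the mode example can be reproduced with `collections.Counter` from the Python standard library. The helper name `modes` and the second, bimodal list are assumptions added for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
print(Counter(ages)[16])  # 4
print(modes(ages))        # [16]

# Hypothetical bimodal data: two values tie for the highest frequency.
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```

Returning a list rather than a single value keeps the bimodal and multimodal cases visible.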
Median
The middle value in a sorted dataset.
Robust to outliers and skewed data.
Best used for datasets with extreme
values or non-symmetrical distributions.
If the dataset has an even number of
observations, the median is the average of
the two middle values.
Median (Example)
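Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Sorted (even number of values): 15, 16, 16, 16, 16, 17, 17, 17, 18, 25
$\text{Median} = \frac{16 + 17}{2} = 16.5$
Sorted (odd number of values, 25 removed): 15, 16, 16, 16, 16, 17, 17, 17, 18
Median = 16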
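A minimal check of the median example, using `statistics.median` from the Python standard library. The third list, where 25 is replaced by 50, is an added assumption to show the robustness to outliers.

```python
import statistics

ages_even = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]  # 10 values
ages_odd = [15, 16, 16, 17, 18, 16, 17, 16, 17]       # 9 values (25 removed)
ages_outlier = [15, 16, 16, 17, 18, 16, 17, 16, 50, 17]  # 25 replaced by 50

# statistics.median sorts the data internally.
print(statistics.median(ages_even))     # 16.5 (average of the two middle values)
print(statistics.median(ages_odd))      # 16
print(statistics.median(ages_outlier))  # 16.5 (unchanged by the larger outlier)
```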
Advantages and Disadvantages
Measure | Advantages | Disadvantages
Mean | Easy to calculate; uses all data. | Sensitive to outliers and skewed distributions.
Median | Not affected by extreme values. | Ignores some data points; less informative.
Mode | Easy to understand; works for any data. | May not exist or may not be unique.
Choosing the Right Measure
Mean: Use when the data is symmetrically distributed without outliers.
Median: Use when the data is skewed or contains outliers.
Mode: Use for categorical data or to identify the most common value(s) in numerical data.
Measures of Dispersion
• Measures of dispersion quantify the spread or variability of data in a dataset.
• They indicate how much the data points differ from each other and the central tendency
(mean, median, mode).
Types of Measures
Range
Variance
Standard Deviation
Interquartile Range
Range
The difference between the maximum and minimum values in a dataset.
Only considers the extremes, ignoring the distribution of the data.
Range = Maximum Value − Minimum Value
Range (Example)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Sorted: 15 (Min), 16, 16, 16, 16, 17, 17, 17, 18, 25 (Max)
Range = Maximum Value − Minimum Value
Range = 25 − 15 = 10
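The range calculation is a one-liner in Python. The second list, with 25 replaced by 50, is an assumption added to show how a single extreme value changes the result.

```python
def value_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)


ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
ages_with_outlier = [15, 16, 16, 17, 18, 16, 17, 16, 50, 17]

print(value_range(ages))               # 10
print(value_range(ages_with_outlier))  # 35
```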
Variance
The average of the squared differences from the mean.
Measures how far data points are spread around the mean.
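$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}$
Where $\bar{x}$ is the mean and $n$ is the total number of data points.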
Variance (Example)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Mean = 17.3
$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}$
$\text{Variance} = \frac{(15 - 17.3)^2 + (16 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (18 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (16 - 17.3)^2 + (25 - 17.3)^2 + (17 - 17.3)^2}{10} = \frac{72.1}{10}$
Variance = 7.21
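The variance example divides by n, which is the population variance. The sketch below repeats that calculation step by step and, as an added comparison not covered in the slides, also shows the sample variance (dividing by n − 1) that `statistics.variance` computes.

```python
import statistics

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

n = len(ages)
mean = sum(ages) / n

# Squared difference of each data point from the mean.
squared_deviations = [(x - mean) ** 2 for x in ages]

# Population variance: divide by n, as in the example above.
population_variance = sum(squared_deviations) / n

# Sample variance: divides by n - 1 instead of n.
sample_variance = statistics.variance(ages)

print(f"population variance = {population_variance:.2f}")  # 7.21
print(f"sample variance     = {sample_variance:.2f}")      # 8.01
```

Which divisor is appropriate depends on whether the data are the whole population or a sample drawn from it.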
Standard Deviation
The square root of the variance, providing a measure of dispersion in the same
units as the data.
Indicates the typical distance of data points from the mean.
Preferred over variance for interpretability.
$\text{Standard Deviation} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
Standard Deviation (Example 1)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Mean = 17.3
$\text{Standard Deviation} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
$\text{Standard Deviation} = \sqrt{\frac{(15 - 17.3)^2 + (16 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (18 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (16 - 17.3)^2 + (25 - 17.3)^2 + (17 - 17.3)^2}{10}} = \sqrt{7.21}$
Standard Deviation ≈ 2.69
Standard Deviation (Example 2)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Mean = 17.3
Variance = 7.21
$\text{Standard Deviation} = \sqrt{\text{Variance}} = \sqrt{7.21}$
Standard Deviation ≈ 2.69
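Both routes to the standard deviation (from the raw data, or from an already computed variance) can be checked in Python. This sketch uses the population versions, `statistics.pvariance` and `statistics.pstdev`, because the examples divide by n.

```python
import math
import statistics

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

# Route 1: take the square root of the population variance.
variance = statistics.pvariance(ages)
std_from_variance = math.sqrt(variance)

# Route 2: compute the population standard deviation directly.
std_direct = statistics.pstdev(ages)

print(f"variance           = {variance:.2f}")
print(f"standard deviation = {std_from_variance:.3f}")
print(f"pstdev             = {std_direct:.3f}")
```

The square root of 7.21 is about 2.685, in the same units (years) as the ages.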
Interquartile Range
The range of the middle 50% of the data.
Robust to outliers.
Useful for understanding data spread in non-symmetrical distributions.
Calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
IQR = Q3 – Q1
Interquartile Range (Example)
Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Sorted: 15, 16, 16, 16, 16, 17, 17, 17, 18, 25
Q1 = 16, Q2 (median) = 16.5, Q3 = 17, Q4 (maximum) = 25
IQR = Q3 – Q1
IQR = 17 – 16
IQR = 1
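The quartiles in this example can be reproduced by splitting the sorted data into a lower and an upper half and taking the median of each. This "median of halves" convention is an assumption about how the example was computed; other quartile conventions exist and can give slightly different values on small datasets. The function name `quartiles` is hypothetical.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values, leave the middle value out of both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return (
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
    )


ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
q1, q2, q3 = quartiles(ages)
iqr = q3 - q1

print(q1, q2, q3)  # 16 16.5 17
print(iqr)         # 1
```

The outlier 25 sits above Q3, so it has no effect on the IQR.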
Advantages and Disadvantages
Measure | Advantages | Disadvantages
Range | Simple to compute. | Affected by outliers; ignores data distribution.
Variance | Accounts for all data points. | Units are squared, making interpretation harder.
Standard Deviation | Easy to interpret; uses same units as data. | Sensitive to outliers.
IQR | Robust to outliers. | Ignores data outside Q1 and Q3.
Choosing the Right Measure
Range: Quick and simple but sensitive to outliers.
Variance/Standard Deviation: Best for understanding variability around the mean.
IQR: Effective for skewed data or datasets with outliers.
Applications
Finance
Quality Control
Healthcare
Education
Correlation and Trends
Correlation
Correlation measures the strength and direction of the relationship between two variables.
It indicates whether and how strongly pairs of variables are related.
Helps identify relationships and dependencies between variables, which is crucial for predictive modeling.
Measured with a coefficient ranging from -1 to +1.
Limitations
Does not imply causation.
Sensitive to outliers, which can distort 𝑟.
Assumes linear relationships; does not capture non-linear ones.
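Two of these limitations can be illustrated with a small sketch. It assumes the Pearson correlation coefficient and implements it by hand in a helper named `pearson_r`; all data values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Non-linear relationship: y is fully determined by x (y = x squared),
# yet the linear correlation is zero.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Outlier sensitivity: a perfectly negative relationship...
x_clean = [1, 2, 3, 4, 5]
y_clean = [5, 4, 3, 2, 1]
r_clean = pearson_r(x_clean, y_clean)

# ...turns strongly positive once a single extreme point is added.
r_outlier = pearson_r(x_clean + [20], y_clean + [20])

print(f"curve:   r = {r_curve:.2f}")
print(f"clean:   r = {r_clean:.2f}")
print(f"outlier: r = {r_outlier:.2f}")
```

Plotting the data as a scatterplot before trusting r catches both problems.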
Types of Correlation
Positive Correlation
Both variables move in the same direction.
Negative Correlation
One variable increases while the other decreases.
No Correlation
No discernible relationship between the variables.
Scatterplot
• A visual tool to represent correlation.
[Figure: example scatter plots. Positive correlation: wage vs. hours. Negative correlation: energy vs. hours. A second row shows age vs. temperature, wage vs. hours, and price vs. quantity, with the caption No Correlation.]
Correlation Coefficient (r)
• A numerical measure that quantifies correlation
• Range: −1 ≤ r ≤ 1
• Interpretation:
r = 1: Perfect positive correlation.
r = −1: Perfect negative correlation.
r = 0: No linear correlation.
[Figure: correlation strength scale from -1.0 to 1.0 in steps of 0.1. Negative side: strong, moderate, weak. Center (0.0): no correlation. Positive side: weak, moderate, strong.]
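The slides do not state which coefficient is meant. Assuming it is the Pearson correlation coefficient, which is the usual choice for r, it can be written with the same symbols as the variance formula, where $x_i$ and $y_i$ are the paired data points and $\bar{x}$ and $\bar{y}$ are their means:

```latex
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum (x_i - \bar{x})^2 \; \sum (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to be above or below their means together and negative when they move in opposite directions; the denominator scales the result into the range from −1 to 1.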
Applications of Correlation
Business
Education
Healthcare
Trends
Trends indicate the general direction in which data values change over time or another
independent variable.
Understanding past patterns and anticipating future behavior helps guide decision-making.
Limitations
Can be influenced by random fluctuations or external factors.
Long-term trends may mask short-term variations.
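One common way to reduce the effect of random fluctuations is a simple moving average, sketched below. This technique is not named in the slides; the function name `moving_average`, the window size, and the monthly sales figures are all hypothetical.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical monthly sales: noisy, but rising overall.
monthly_sales = [100, 120, 90, 130, 110, 150, 125, 160]

smoothed = moving_average(monthly_sales, window=3)
print([round(value, 1) for value in smoothed])
```

A larger window gives a smoother line but hides more of the short-term variation, which is the trade-off the second limitation describes.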
Types of Trends
Upward Trend
Values increase over time. Example: Increasing adoption of AI
Downward Trend
Values decrease over time. Example: Decline in open defecation
Sideways Trend
A variable remains relatively stable over time.
No Trend
Data shows no consistent pattern.
Applications of Trends
Business
Climate Science
Healthcare
Differences between Correlation and Trends
Feature | Correlation | Trends
Focus | Relationship between two variables. | Direction of change in one variable.
Visual Tool | Scatter plot. | Line chart, time-series plot.
Key Metric | Correlation coefficient (r). | Slope or pattern direction.
Application | Studies dependencies (e.g., height vs. weight). | Studies changes over time (e.g., sales growth).
Conclusion
Statistics is a foundation of data analysis and data science.
Measures of central tendency describe the center of the data.
Measures of dispersion describe the spread.
Correlation shows a relationship, not causation.
Trends show the direction of change over time.
Assignment
1. Research all the visualization tools discussed today. Which do you think is best, and why?
2. When should you not visualize data?
3. Which type of correlation best describes each of the following, and why?
i. Age and height
ii. Salary and years in an organization