Data Analysis and Data Visualization Basics 2

The document provides an overview of data analysis and visualization basics, focusing on visualization tools, statistics, and correlation trends. It highlights the importance of visual data representation, the types of statistics, and measures of central tendency and dispersion. Additionally, it discusses correlation and trends, emphasizing their applications and limitations in various fields.

Uploaded by

ezepraise080

Data Analysis and Data Visualization Basics (Pt2)

Overview
• Visualization Tools
• Statistics
• Statistical Summaries
• Correlation and Trends


Visualization Tools



Statistics on Visualization

Our brains value visuals over any other type of information.

90% of the information transmitted to the brain is visual – (MIT)

The human brain can process an image in just 13 milliseconds – (MIT)

50% of the brain is active in visual processing - (Piktochart)

Human brains process visuals 60,000 times faster than they do text – (University of Minnesota)

Visualization Tools

Tool        Features                       Best For
Matplotlib  Basic charts and plots         Foundational plotting in Python
Seaborn     Advanced statistical plots     Visualizing relationships and trends
ggplot      Grammar of Graphics framework  Data visualization in R
Tableau     Interactive dashboards         Business intelligence and analytics
Power BI    Real-time data analytics       Dynamic reporting in enterprises


Power BI Visualization
Install Power BI Desktop and get started.

Tableau Visualization
Install Tableau Public and get started.

Seaborn, Matplotlib, ggplot


Statistics



What is Statistics?

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data to make informed decisions.

It is a foundational tool in many fields, including business, healthcare, engineering, social sciences, and natural sciences.

It is a foundation for data analysis and data science.


Types of Statistics

Descriptive Statistics
• Summarizes and describes the main features of a dataset.
• Does not make predictions or infer conclusions beyond the data.
• Common techniques include:
  • Measures of central tendency
  • Measures of dispersion
  • Data visualizations

Inferential Statistics
• Makes predictions, inferences, and generalizations about a population based on a sample.
• Involves probability theory and hypothesis testing.
• Key concepts include:
  • Estimation
  • Hypothesis testing
  • Regression and correlation analysis

Key Concepts in Statistics

Population
The entire group of individuals or items being studied.

Sample
A subset of the population used for analysis.

Probability
A measure of how likely an event is to occur, expressed as a number between 0 and 1.

Outliers
Data points that differ significantly from the others. They can distort statistical measures such as the mean and variance.

Statistical Methods

Data Collection
Surveys, experiments, observational studies, and simulations.

Data Analysis
Techniques for summarizing and exploring data.

Hypothesis Testing
• A systematic method for testing assumptions about a dataset.
• States a null and an alternative hypothesis and evaluates them with statistical tests.

Regression Analysis
• Models the relationship between variables.
• Includes simple linear regression and multiple regression.

Applications of Statistics

• Healthcare
• Social Sciences
• Engineering
• Education
• Business

Challenges in Statistics

01 Data Quality
02 Bias
03 Interpretation

Statistical Summaries



Measures of Central Tendency

Measures of central tendency are statistical metrics that represent the center or typical value of a dataset.

They provide a summary of the data by identifying a single value that reflects the overall distribution.

• Mean
• Median
• Mode

Mean

The sum of all data points divided by the number of data points.

Affected by all values in the dataset, including outliers.

$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where:
$x_i$: each data point
$n$: total number of data points

Mean (Example 1)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\text{Mean} = \frac{15 + 16 + 16 + 17 + 18 + 16 + 17 + 16 + 25 + 17}{10} = \frac{173}{10}$$

Mean = 17.3

Mean (Example 2)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 50, 17

$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\text{Mean} = \frac{15 + 16 + 16 + 17 + 18 + 16 + 17 + 16 + 50 + 17}{10} = \frac{198}{10}$$

Mean = 19.8
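
The two mean examples can be checked in a few lines of Python. This is a minimal sketch: the function name `arithmetic_mean` and the variable names are illustrative and do not come from the slides.

```python
# Ages from Example 1 and Example 2; only the ninth value differs (25 vs. 50).
ages_example_1 = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
ages_example_2 = [15, 16, 16, 17, 18, 16, 17, 16, 50, 17]


def arithmetic_mean(values):
    """Return the sum of all data points divided by the number of data points."""
    return sum(values) / len(values)


mean_1 = arithmetic_mean(ages_example_1)
mean_2 = arithmetic_mean(ages_example_2)

print(mean_1)  # 17.3
print(mean_2)  # 19.8
```

Changing a single value from 25 to 50 raises the mean by 2.5, which is the sensitivity to outliers noted on the Mean slide.
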
Mode

The value(s) that occur most frequently in a dataset.

Applicable to both numerical and categorical data.

Not influenced by extreme values.

A dataset can be:
• Unimodal: one mode.
• Bimodal: two modes.
• Multimodal: more than two modes.

Mode (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Age  Frequency
15   1
16   4
17   3
18   1
25   1

Mode = 16
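
The frequency table above can be built with `collections.Counter` from the Python standard library. The sketch below keeps every value that reaches the highest count, so it would also report two or more modes for a bimodal or multimodal dataset. Variable names are illustrative.

```python
from collections import Counter

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

# Count how often each value occurs (the frequency table).
frequencies = Counter(ages)

# Keep every value that reaches the highest count, so datasets with
# several modes return all of them.
highest_count = max(frequencies.values())
modes = sorted(value for value, count in frequencies.items() if count == highest_count)

print(dict(sorted(frequencies.items())))  # {15: 1, 16: 4, 17: 3, 18: 1, 25: 1}
print(modes)  # [16]
```

Because `Counter` works on any hashable values, the same code applies to categorical data such as a list of strings.
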
Median

The middle value in a sorted dataset.

Robust to outliers and skewed data.

Best used for datasets with extreme values or non-symmetrical distributions.

If the dataset has an even number of observations, the median is the average of the two middle values.

Median (Example)

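Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Sorted (10 values): 15, 16, 16, 16, 16, 17, 17, 17, 18, 25

$$\text{Median} = \frac{16 + 17}{2} = 16.5$$

With an odd number of values (the same ages without 25), the median is the single middle value:

Sorted (9 values): 15, 16, 16, 16, 16, 17, 17, 17, 18

Median = 16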
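
Both cases of the median rule, an even and an odd number of observations, can be sketched as follows. The function name `median_of` is illustrative; the second list is the same set of ages with 25 removed.

```python
def median_of(values):
    """Return the middle value of the sorted data.

    For an even number of observations, return the average of the
    two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


ages_even = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]  # 10 values
ages_odd = [15, 16, 16, 17, 18, 16, 17, 16, 17]       # 9 values (25 removed)

print(median_of(ages_even))  # 16.5
print(median_of(ages_odd))   # 16
```
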
Advantages and Disadvantages

Measure  Advantages                               Disadvantages
Mean     Easy to calculate; uses all data.        Sensitive to outliers and skewed distributions.
Median   Not affected by extreme values.          Ignores some data points; less informative.
Mode     Easy to understand; works for any data.  May not exist or may not be unique.

Choosing the Right Measure

• Mean: Use when the data is symmetrically distributed without outliers.

• Median: Use when the data is skewed or contains outliers.

• Mode: Use for categorical data or to identify the most common value(s) in numerical data.
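
To see why the choice of measure matters, the sketch below computes all three measures for the two age lists used in the mean examples, using Python's standard `statistics` module. The labels and the `summary` dictionary are illustrative names.

```python
from statistics import mean, median, mode

# The two age lists differ only in the oldest person (25 vs. 50).
datasets = {
    "oldest = 25": [15, 16, 16, 17, 18, 16, 17, 16, 25, 17],
    "oldest = 50": [15, 16, 16, 17, 18, 16, 17, 16, 50, 17],
}

summary = {}
for label, ages in datasets.items():
    summary[label] = (mean(ages), median(ages), mode(ages))
    print(label, summary[label])

# oldest = 25 (17.3, 16.5, 16)
# oldest = 50 (19.8, 16.5, 16)
```

Only the mean moves when the extreme value changes; the median and the mode stay the same, which is why they are preferred for skewed data or data with outliers.
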
Measures of Dispersion

• Measures of dispersion quantify the spread or variability of data in a dataset.

• They indicate how much the data points differ from each other and from the central tendency (mean, median, mode).

Types of Measures

• Range
• Variance
• Standard Deviation
• Interquartile Range

Range

• The difference between the maximum and minimum values in a dataset.

• Only considers the extremes, ignoring the distribution of the data.

Range = Maximum Value − Minimum Value


Range (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Sorted: 15 (Min), 16, 16, 16, 16, 17, 17, 17, 18, 25 (Max)

Range = Maximum Value − Minimum Value

Range = 25 − 15 = 10
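
The range calculation is a one-liner in Python. The sketch below also applies it to the list from Mean (Example 2), where the oldest age is 50, to show how strongly a single extreme value affects the result. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Return the maximum value minus the minimum value."""
    return max(values) - min(values)


ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
ages_with_50 = [15, 16, 16, 17, 18, 16, 17, 16, 50, 17]

print(value_range(ages))          # 10
print(value_range(ages_with_50))  # 35
```
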
Variance

The average of the squared differences from the mean.

Measures how far data points are spread around the mean.

$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

Variance (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Mean = 17.3

$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

$$\text{Variance} = \frac{(15 - 17.3)^2 + (16 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (18 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (16 - 17.3)^2 + (25 - 17.3)^2 + (17 - 17.3)^2}{10}$$

Variance = 7.21
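
The worked example divides by n, which is the population variance. The sketch below reproduces it step by step and compares it with `statistics.pvariance` from the Python standard library. It also prints `statistics.variance`, which divides by n − 1 (the sample variance) and therefore gives a different number; that comparison is an addition for context, not something stated on the slides.

```python
from statistics import pvariance, variance

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

# Step 1: the mean.
mean_age = sum(ages) / len(ages)

# Step 2: squared difference of each data point from the mean.
squared_differences = [(x - mean_age) ** 2 for x in ages]

# Step 3: average of the squared differences (divide by n).
population_variance = sum(squared_differences) / len(ages)

print(round(population_variance, 2))  # 7.21
print(round(pvariance(ages), 2))      # 7.21, same formula from the standard library
print(round(variance(ages), 2))       # 8.01, sample variance (divides by n - 1)
```
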
Standard Deviation

The square root of the variance, providing a measure of dispersion in the same units as the data.

Indicates the typical distance of data points from the mean.

Preferred over variance for interpretability.

$$\text{Standard Deviation} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$

Standard Deviation (Example 1)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Mean = 17.3

$$\text{Standard Deviation} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

$$\text{Standard Deviation} = \sqrt{\frac{(15 - 17.3)^2 + (16 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (18 - 17.3)^2 + (16 - 17.3)^2 + (17 - 17.3)^2 + (16 - 17.3)^2 + (25 - 17.3)^2 + (17 - 17.3)^2}{10}}$$

Standard Deviation ≈ 2.69


Standard Deviation (Example 2)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Mean = 17.3
Variance = 7.21

$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$

$$\text{Standard Deviation} = \sqrt{7.21}$$

Standard Deviation ≈ 2.69
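
Both routes to the standard deviation, from the raw data and from the variance, can be checked with the sketch below. It uses `math.sqrt` and `statistics.pstdev` (the population standard deviation, which divides by n as in the examples). Variable names are illustrative.

```python
import math
from statistics import pstdev

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

# Route 1: straight from the data, as in Example 1.
mean_age = sum(ages) / len(ages)
population_variance = sum((x - mean_age) ** 2 for x in ages) / len(ages)
population_sd = math.sqrt(population_variance)

# Route 2: from the already computed variance, as in Example 2.
sd_from_variance = math.sqrt(7.21)

print(round(population_sd, 2))     # 2.69 (2.685... rounded to two decimals)
print(round(sd_from_variance, 2))  # 2.69
print(round(pstdev(ages), 2))      # 2.69, same result from the standard library
```
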


Interquartile Range

The range of the middle 50% of the data.

Robust to outliers.

Useful for understanding data spread in non-symmetrical distributions.

Calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

IQR = Q3 − Q1

Interquartile Range (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Sorted: 15, 16, 16, 16, 16, 17, 17, 17, 18, 25

Q1 = 16    Q2 = 16.5    Q3 = 17    Q4 = 25 (the maximum)

IQR = Q3 − Q1
IQR = 17 − 16
IQR = 1
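
The quartiles in this example match the "median of each half" method: split the sorted data in two, then take the median of the lower half as Q1 and of the upper half as Q3. The sketch below assumes that method; other quartile conventions exist and can give slightly different values on small datasets. The function name `quartiles_by_halves` is illustrative.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves.

    For an odd number of observations the middle value is left out of
    both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(ordered), median(upper_half)


ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

q1, q2, q3 = quartiles_by_halves(ages)
iqr = q3 - q1

print(q1, q2, q3)  # 16 16.5 17
print(iqr)         # 1
```

The age of 25 has no effect on the result, which illustrates why the IQR is described as robust to outliers.
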
Advantages and Disadvantages

Measure             Advantages                                   Disadvantages
Range               Simple to compute.                           Affected by outliers; ignores data distribution.
Variance            Accounts for all data points.                Units are squared, making interpretation harder.
Standard Deviation  Easy to interpret; uses same units as data.  Sensitive to outliers.
IQR                 Robust to outliers.                          Ignores data outside Q1 and Q3.

Choosing the Right Measure

• Range: Quick and simple but sensitive to outliers.

• Variance/Standard Deviation: Best for understanding variability around the mean.

• IQR: Effective for skewed data or datasets with outliers.


Applications

• Finance
• Quality Control
• Healthcare
• Education

Correlation and Trends



Correlation

• Correlation measures the strength and direction of the relationship between two variables.

• It indicates whether and how strongly pairs of variables are related.

• Helps identify relationships and dependencies between variables, which is crucial for predictive modeling.

• Measured with numbers ranging from −1 to +1.

Limitations

Does not imply causation.

Sensitive to outliers, which can distort 𝑟.

Assumes linear relationships; does not capture non-linear ones.


Types of Correlation

Positive Correlation
Both variables move in the same direction.

Negative Correlation
One variable increases while the other decreases.

No Correlation
No discernible relationship between the variables.
Scatterplot

• A visual tool to represent correlation.


[Figure: scatterplots. Positive correlation: wage vs. hours. Negative correlation: energy vs. hours. No correlation: age vs. temperature, wage vs. hours, and price vs. quantity.]

Correlation Coefficient (𝑟)

• A numerical measure that quantifies correlation.

• Range: −1 ≤ 𝑟 ≤ 1

• Interpretation:
𝑟 = 1: Perfect positive correlation.
𝑟 = −1: Perfect negative correlation.
𝑟 = 0: No linear correlation.

[Figure: correlation strength scale from −1.0 to 1.0 in steps of 0.1, labeled from left to right: strong negative, moderate negative, weak negative, no correlation, weak positive, moderate positive, strong positive.]
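
The slides do not say which coefficient 𝑟 refers to. The sketch below assumes the Pearson correlation coefficient, the usual choice for linear relationships: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. All data values are hypothetical and chosen only to produce clear-cut results.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return co_deviation / math.sqrt(ss_x * ss_y)


# Hypothetical data for illustration only.
hours = [2, 4, 6, 8, 10]
wage = [1000, 2000, 3000, 4000, 5000]  # rises with hours
energy = [50, 40, 30, 20, 10]          # falls with hours

print(round(pearson_r(hours, wage), 2))    # 1.0
print(round(pearson_r(hours, energy), 2))  # -1.0

# A perfectly related but non-linear pair (y = x squared) still gives r = 0.
xs = [-2, -1, 0, 1, 2]
ys = [4, 1, 0, 1, 4]
print(round(pearson_r(xs, ys), 2))  # 0.0
```

The last case shows the limitation that correlation assumes a linear relationship: 𝑟 = 0 rules out a linear relationship, not every relationship.
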
Applications of Correlation

• Business
• Education
• Healthcare

Trends

Trends indicate the general direction in which data values change over time or with another independent variable.

Trend analysis guides decision-making by revealing past patterns and helping to anticipate future behavior.

Limitations
Can be influenced by random fluctuations or external factors.

Long-term trends may mask short-term variations.


Types of Trends

Upward Trend
Values increase over time. Example: increasing adoption of AI.

Downward Trend
Values decrease over time. Example: decline in open defecation.

Sideways Trend
Values remain relatively stable over time.

No Trend
Data shows no consistent pattern.
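
One simple way to put a number on the direction of a trend is the slope of a least-squares line fitted to the values in time order. The sketch below is one possible approach, not a method given in the slides: the function names, the `tolerance` threshold, and all data values are hypothetical. A slope alone cannot separate a stable series from one with no consistent pattern, so the sketch reports both as "sideways".

```python
def trend_slope(values):
    """Return the least-squares slope of values against time steps 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    denominator = sum((t - mean_t) ** 2 for t in range(n))
    return numerator / denominator


def classify_trend(values, tolerance=0.1):
    """Label the trend by the sign of the slope; `tolerance` is an assumed cut-off."""
    slope = trend_slope(values)
    if slope > tolerance:
        return "upward"
    if slope < -tolerance:
        return "downward"
    return "sideways"


# Hypothetical yearly values for illustration only.
rising = [10, 12, 15, 19, 24]
falling = [30, 26, 21, 18, 12]
stable = [20, 21, 20, 21, 20]

print(classify_trend(rising))   # upward
print(classify_trend(falling))  # downward
print(classify_trend(stable))   # sideways
```
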
Applications of Trends

• Business
• Climate Science
• Healthcare

Differences between Correlation and Trends

Feature      Correlation                                      Trends
Focus        Relationship between two variables.              Direction of change in one variable.
Visual Tool  Scatter plot.                                    Line chart, time-series plot.
Key Metric   Correlation coefficient (𝑟).                     Slope or pattern direction.
Application  Studies dependencies (e.g., height vs. weight).  Studies changes over time (e.g., sales growth).

Conclusion

Statistics is a foundation of data analysis and data science.

Measures of central tendency describe the center of the data.

Measures of dispersion describe the spread of the data.

Correlation shows a relationship, not causation.

Trends show how values change over time.


Assignment

1. Research all the visualization tools discussed today. Which do you think is best, and why?

2. When should you not visualize data?

3. Which type of correlation best describes each of the following, and why?

i. Age and height
ii. Salary and years in an organization
