What Is Exploratory Data Analysis? Steps and Market Analysis

Lesson 3 of 11
By Avijeet Biswal
Last updated on Jul 13, 2025

Exploratory Data Analysis (EDA) examines and visualizes data to understand its main
characteristics, identify patterns, spot anomalies, and test hypotheses. It helps
summarize the data and uncover insights before applying more advanced data analysis
techniques.

What Is Exploratory Data Analysis?


Exploratory Data Analysis is a data analytics process that aims to understand the data
in depth and learn its different characteristics, often using visual means. This allows one
to get a better feel for the data and find useful patterns.

Figure 1: Exploratory Data Analysis

It is crucial to understand your data in depth before you perform data analysis and run it
through an algorithm. You need to know the patterns in your data and determine which
variables are important and which do not play a significant role in the output. Further, some
variables may be correlated with other variables. You also need to recognize errors
in your data.

Exploratory data analysis can do all of this. It helps you gather insights, get a better sense of the
data, and remove irregularities and unnecessary values. In particular, EDA:

• Helps you prepare your dataset for analysis.
• Helps a machine learning model make better predictions on your dataset.
• Gives you more accurate results.
• Helps you choose a better machine learning model.


Figure 2: Exploratory Data Analysis uses

Steps Involved in Exploratory Data Analysis

1. Understand the Data

Familiarize yourself with the data set, understand the domain, and identify the
objectives of the analysis.

2. Data Collection

Collect the required data from various sources such as databases, web scraping, or
APIs.

3. Data Cleaning

• Handle missing values: Impute or remove missing data.
• Remove duplicates: Ensure there are no duplicate records.
• Correct data types: Convert data types to appropriate formats.
• Fix errors: Address any inconsistencies or errors in the data.
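The cleaning steps above can be sketched with pandas. This is only an illustration under assumptions: the `raw` frame and its `age` and `city` columns are made up for the example, and removing rows is just one of the two options (impute or remove) for missing data.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing age,
# and a numeric column that was read in as text.
raw = pd.DataFrame({
    "age": ["34", "51", "51", None, "28"],
    "city": ["Pune", "Delhi", "Delhi", "Agra", "Pune"],
})

clean = raw.drop_duplicates()                           # remove duplicate records
clean = clean.dropna(subset=["age"])                    # remove rows with a missing age
clean = clean.assign(age=pd.to_numeric(clean["age"]))   # correct the data type
clean = clean.reset_index(drop=True)

print(clean)
```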


4. Data Transformation

• Normalize or standardize the data if necessary.
• Create new features through feature engineering.
• Aggregate or disaggregate data based on analysis needs.
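As a rough sketch of these transformations, the example below applies min-max normalization, z-score standardization, one engineered feature, and a group-level aggregation. The column names and values are assumptions chosen for illustration, not part of the survey data used later.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "salary": [20000.0, 40000.0, 60000.0, 80000.0],
    "balance": [1000.0, 3000.0, 2000.0, 6000.0],
})

# Min-max normalization rescales a column to the [0, 1] range.
salary_range = df["salary"].max() - df["salary"].min()
df["salary_minmax"] = (df["salary"] - df["salary"].min()) / salary_range

# Standardization (z-score): mean 0 and unit sample standard deviation.
df["salary_z"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()

# Feature engineering: derive a new column from existing ones.
df["balance_to_salary"] = df["balance"] / df["salary"]

# Aggregation: summarize rows per group.
by_region = df.groupby("region")["balance"].mean()

print(df)
print(by_region)
```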

5. Data Integration

Integrate data from various sources to create a complete data set.

6. Data Exploration

• Univariate Analysis: Analyze individual variables using summary statistics and visualizations (e.g., histograms, box plots).
• Bivariate Analysis: Analyze the relationship between two variables with scatter plots, correlation coefficients, and cross-tabulations.
• Multivariate Analysis: Investigate interactions between multiple variables using pair plots and correlation matrices.

7. Data Visualization

Visualize data distributions and relationships using visual tools such as bar charts, line
charts, scatter plots, heatmaps, and box plots.

8. Descriptive Statistics

Calculate central tendency measures (mean, median, mode) and dispersion measures
(range, variance, standard deviation).
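A minimal sketch of these measures with pandas, using a small made-up series. Note that `var()` and `std()` in pandas compute the sample versions (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical values for illustration.
values = pd.Series([2, 4, 4, 5, 7, 8])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),   # a series can have more than one mode
    "range": values.max() - values.min(),
    "variance": values.var(),         # sample variance (ddof=1)
    "std": values.std(),              # sample standard deviation (ddof=1)
}

print(summary)
```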

9. Identify Patterns and Outliers

Detect patterns, trends, and outliers in the data using visualizations and statistical
methods.
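One common statistical method for flagging outliers is the interquartile range (IQR) rule, which is also what a box plot's whiskers are usually based on. The sketch below is one possible approach rather than the only one, and the `balances` values are invented for the example.

```python
import pandas as pd

# Hypothetical balances with one obviously extreme value.
balances = pd.Series([120, 150, 160, 170, 180, 200, 210, 5000])

q1 = balances.quantile(0.25)
q3 = balances.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = balances[(balances < lower) | (balances > upper)]

print(outliers)
```

Whether a flagged point is an error or a genuine extreme still has to be judged from the context of the data.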

10. Hypothesis Testing


Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to
validate assumptions or relationships in the data.
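To make the chi-square test concrete, here is a sketch that computes the test statistic by hand for two categorical columns. The `loan` and `response` columns and their counts are assumptions made up for illustration. In practice a library routine such as `scipy.stats.chi2_contingency` would also return the p-value; by default it applies a continuity correction to 2x2 tables, so its statistic would differ slightly from this uncorrected one.

```python
import numpy as np
import pandas as pd

# Hypothetical survey rows as (loan, response) pairs.
rows = (
    [("no", "no")] * 30 + [("no", "yes")] * 20
    + [("yes", "no")] * 40 + [("yes", "yes")] * 10
)
df = pd.DataFrame(rows, columns=["loan", "response"])

# Observed counts for every combination of the two categories.
observed = pd.crosstab(df["loan"], df["response"]).to_numpy()

# Expected counts if the two variables were independent.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# 3.841 is the standard 5% critical value for 1 degree of freedom.
reject_independence = chi2 > 3.841

print(chi2, dof, reject_independence)
```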

11. Data Summarization

Summarize findings with descriptive statistics, visualizations, and key insights.

12. Documentation and Reporting

• Document the EDA process, findings, and insights in a clear and structured way.
• Create reports and presentations to convey results to stakeholders.

13. Iterate and Refine

Continuously refine the analysis based on feedback and additional questions during the
process.

Importance of Exploratory Data Analysis in Data Science

Exploratory Data Analysis is a critical step in the data science process. It is the
foundation for understanding and interpreting complex data sets. EDA helps data
scientists identify patterns, spot anomalies, test hypotheses, and check assumptions
through various statistical and graphical techniques. By thoroughly exploring the data,
practitioners can uncover underlying structures, detect outliers, and determine the
relationships between variables, all of which is essential for developing accurate
predictive models.

Furthermore, Exploratory Data Analysis allows the identification of data quality issues,
such as missing values or errors, which can be addressed before proceeding to more
advanced analysis. This preliminary analysis enhances the reliability and accuracy of
the subsequent modeling and ensures that the insights derived are valid and actionable.
EDA allows data scientists to make informed decisions and derive meaningful insights
that drive business strategies and solutions.

Types of Exploratory Data Analysis (EDA)

1. Univariate Analysis

• Definition: Focuses on analyzing a single variable at a time.
• Purpose: To understand the variable's distribution, central tendency, and spread.
• Techniques:
    - Descriptive statistics (mean, median, mode, variance, standard deviation).
    - Visualizations (histograms, box plots, bar charts, pie charts).

2. Bivariate Analysis

• Definition: Examines the relationship between two variables.
• Purpose: To understand how one variable affects or is associated with another.
• Techniques:
    - Scatter plots.
    - Correlation coefficients (Pearson, Spearman).
    - Cross-tabulations and contingency tables.
    - Visualizations (line plots, scatter plots, pair plots).

3. Multivariate Analysis

• Definition: Investigates interactions between three or more variables.
• Purpose: To understand the complex relationships and interactions in the data.
• Techniques:
    - Multivariate plots (pair plots, parallel coordinates plots).
    - Dimensionality reduction techniques (PCA, t-SNE).
    - Cluster analysis.
    - Heatmaps and correlation matrices.


4. Descriptive Statistics

• Definition: Summarizes the main features of a data set.
• Purpose: To provide a quick overview of the data.
• Techniques:
    - Measures of central tendency (mean, median, mode).
    - Measures of dispersion (range, variance, standard deviation).
    - Frequency distributions.

5. Graphical Analysis

• Definition: Uses visual tools to explore data.
• Purpose: To identify patterns, trends, and data anomalies through visualization.
• Techniques:
    - Charts (bar charts, histograms, pie charts).
    - Plots (scatter plots, line plots, box plots).
    - Advanced visualizations (heatmaps, violin plots, pair plots).

6. Dimensionality Reduction

• Definition: Reduces the number of variables under consideration.
• Purpose: To simplify models, reduce computation time, and mitigate the curse of dimensionality.
• Techniques:
    - Principal Component Analysis (PCA).
    - t-Distributed Stochastic Neighbor Embedding (t-SNE).
    - Linear Discriminant Analysis (LDA).
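To illustrate the first technique, the sketch below performs PCA directly with NumPy's singular value decomposition. The data is synthetic and generated only for this example: three features, where the third is almost a linear combination of the first two, so two components should capture nearly all the variance. A library implementation such as scikit-learn's `PCA` would normally be used instead of hand-written code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features; the third feature is nearly
# a linear combination of the first two.
base = rng.normal(size=(200, 2))
third = base @ np.array([0.5, -1.0]) + 0.01 * rng.normal(size=200)
X = np.column_stack([base, third])

# PCA works on mean-centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first k principal components.
k = 2
X_reduced = X_centered @ Vt[:k].T

print(explained_variance_ratio)
print(X_reduced.shape)
```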

Exploratory Data Analysis Tools

Using the following tools for exploratory data analysis, data scientists can effectively
gain deeper insights and prepare data for advanced analytics and modeling.

1. Python Libraries

• Pandas: Provides data structures and functions needed to manipulate structured data seamlessly.
    - Use: Data cleaning, manipulation, and summary statistics.
• NumPy: Supports large, multi-dimensional arrays and matrices and a collection of mathematical functions.
    - Use: Numerical computations and data manipulation.
• Matplotlib: A plotting library that produces static, animated, and interactive visualizations.
    - Use: Basic plots like line charts, scatter plots, and bar charts.
• Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
    - Use: Advanced visualizations like heatmaps, violin plots, and pair plots.
• SciPy: Builds on NumPy and provides many higher-level scientific algorithms.
    - Use: Statistical analysis and additional mathematical functions.
• Plotly: A graphing library that makes interactive, publication-quality graphs online.
    - Use: Interactive and dynamic visualizations.

2. R Libraries

• ggplot2: A framework for creating graphics using the principles of the Grammar of Graphics.
    - Use: Complex and multi-layered visualizations.
• dplyr: A set of tools for data manipulation, offering consistent verbs to address common data manipulation tasks.
    - Use: Data wrangling and manipulation.
• tidyr: Provides functions to help you organize your data in a tidy way.
    - Use: Data cleaning and tidying.
• shiny: An R package that makes building interactive web apps straight from R easy.
    - Use: Interactive data analysis applications.
• plotly: Also available in R for creating interactive visualizations.
    - Use: Interactive visualizations.

3. Integrated Development Environments (IDEs)

• Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
    - Use: Combining code execution, rich text, and visualizations.
• RStudio: An integrated development environment for R that offers tools for writing and debugging code, building software, and analyzing data.
    - Use: R development and analysis.

4. Data Visualization Tools

• Tableau: A top data visualization tool that facilitates the creation of diverse charts and dashboards.
    - Use: Interactive and shareable dashboards.
• Power BI: A Microsoft business analytics service offering interactive visualizations and business intelligence features.
    - Use: Interactive reports and dashboards.

5. Statistical Analysis Tools

• SPSS: A comprehensive statistics package from IBM.
    - Use: Complex statistical data analysis.
• SAS: A software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics.
    - Use: Statistical analysis and data management.

6. Data Cleaning Tools

• OpenRefine: A powerful tool for cleaning messy data, transforming formats, and enhancing it with web services and external data.
    - Use: Data cleaning and transformation.
• SQL Databases: Tools like MySQL, PostgreSQL, and SQLite are used to manage and query relational databases.
    - Use: Data extraction, transformation, and basic analysis.

Market Analysis With Exploratory Data Analysis

Now, perform Exploratory Data Analysis on market analysis data. You start by importing
all necessary modules.

Figure 3: Importing necessary modules

Then, you read in the data as a pandas data frame.


Figure 4: Market Analysis Data

The dataset is not formatted correctly. The first two rows do not contain the actual column
names, just arbitrary values.

Importing Data

When importing your data, skip the first two rows so that the arbitrary values are ignored. This
ensures that your column names are populated correctly.
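Since the original code appears only as a figure, here is a hedged sketch of what skipping the first two rows could look like with `pandas.read_csv` and its `skiprows` parameter. The file contents below are invented, and an in-memory string stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical file contents: two junk rows, then the real header row.
raw_text = (
    "banking marketing,,\n"
    "customer details,,\n"
    "customerid,age,salary\n"
    "1,58,100000\n"
    "2,44,60000\n"
)

# skiprows=2 drops the first two lines, so the third line becomes the header.
df = pd.read_csv(io.StringIO(raw_text), skiprows=2)

print(df.columns.tolist())
print(df)
```

With a real file, the path to the CSV would replace the `io.StringIO(...)` argument.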

Figure 5: Importing Market Analysis Data

The dataset is imported correctly now. The column names are in the correct row, and
you’ve dropped the arbitrary data.

The above data was collected while taking a survey. Information about the survey
takers, like their occupation, salary, whether they have taken a loan, age, etc., is given.
You will use exploratory data analysis to find patterns in this data and correlations
between columns. You will also perform basic data-cleaning steps.

Data Cleaning

The next step is data cleaning. Drop the customer ID column, as it is just the row
number, starting at 1. Also, split the ‘jobedu’ column into two: one for the job and one
for the education field. After splitting it, you can drop the ‘jobedu’ column, as it
is no longer needed.
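A sketch of this cleaning step follows. Only the `jobedu` column name comes from the text; the `customerid` name, the comma separator inside `jobedu`, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; 'jobedu' is assumed to hold "job,education".
df = pd.DataFrame({
    "customerid": [1, 2, 3],
    "jobedu": ["management,tertiary", "technician,secondary", "retired,primary"],
    "age": [58, 44, 61],
})

# The ID column only repeats the row number, so drop it.
df = df.drop(columns=["customerid"])

# Split 'jobedu' once on the separator into two new columns.
df[["job", "education"]] = df["jobedu"].str.split(",", n=1, expand=True)

# The combined column is no longer needed.
df = df.drop(columns=["jobedu"])

print(df)
```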

Figure 6: Cleaning Market Analysis Data

This is what the dataset looks like now.


Figure 7: Market Analysis Data

Missing Values

The data has some missing values in its columns. There are three major categories of
missing values:

1. MCAR (Missing completely at random): The chance that a value is missing does not
depend on any other values, observed or unobserved.

2. MAR (Missing at random): The chance that a value is missing depends on other observed
features, but not on the missing value itself.

3. MNAR (Missing not at random): The chance that a value is missing depends on the
missing value itself, so there is a systematic reason why it is absent.

Let’s check the columns which have missing values.


Figure 8: Missing values

There is no reliable way to reconstruct the missing age values. So, drop all rows without age
values.

Figure 9: Missing age values

Now, in the month column, you can fill in the missing values with the most commonly
occurring month. Take the mode of the month column to get the most commonly occurring
value, and use a fill function (fillna in pandas) to put it in place of the missing
values.
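The missing-value steps described so far can be sketched as follows. The `age` and `month` column names follow the text, but the rows are invented, so the counts here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing age and one missing month.
df = pd.DataFrame({
    "age": [58.0, np.nan, 33.0, 47.0, 35.0],
    "month": ["may", "may", None, "jun", "may"],
})

# Count the missing values in every column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop the rows that have no age value.
df = df.dropna(subset=["age"])

# Fill missing months with the most common month (the mode).
most_common_month = df["month"].mode()[0]
df = df.assign(month=df["month"].fillna(most_common_month))

print(df.isnull().sum())
```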

Figure 10: Filling in missing month values

Check to see the number of missing values left in your data.


Figure 11: Missing values

Now, only the response column has missing values. You cannot fill these in: if the
user hasn't given a response, you cannot auto-generate one, so you drop these
rows.

Figure 12: Dropping Missing response values

Finally, the data is clean. You can now start finding the outliers.

Handling Outliers

There are two types of outliers in data:

1. Univariate outliers: Univariate outliers are the data points whose values lie outside
the expected range. Here, only a single variable is being considered.

2. Multivariate outliers: These outliers depend on the relationship between two or more
variables. When you look at one variable on its own, a value may lie within the expected
range. Still, when you plot the same variable against another variable, that point may
lie far from where the rest of the data suggests it should be.

Univariate Analysis

Now, consider the different jobs on which you have data. Plotting the job column as a
bar graph, in ascending order of the number of people who work in each job, shows the
most popular jobs in the market. Normalize the counts so that each bar shows the share of
respondents in that job, which keeps the values in the same range and makes them comparable.
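The data behind such a bar graph can be computed with `value_counts(normalize=True)`. This is a sketch with invented job labels; the plotting call is left as a comment because it needs Matplotlib and a display.

```python
import pandas as pd

# Hypothetical job column for ten respondents.
jobs = pd.Series(
    ["blue-collar"] * 4 + ["technician"] * 3 + ["management"] * 2 + ["retired"],
    name="job",
)

# Share of respondents per job, sorted from least to most common.
job_share = jobs.value_counts(normalize=True).sort_values()
print(job_share)

# job_share.plot.barh() would draw the horizontal bar graph (requires Matplotlib).
```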

Figure 13: Plotting the number of people performing a certain job

Moving on, plot a pie chart to compare the education qualifications of the people in the
survey. Almost half of the people have only secondary school education, and one-fourth
have a tertiary education.

Figure 14: Plotting the education qualification of people

Bivariate Analysis

Bivariate analysis is of three main types:

1. Numeric-Numeric Analysis

When both variables being compared have numeric data, the analysis is said to
be a numeric-numeric analysis. You can use scatter plots, pair plots, and correlation
matrices to compare two numeric columns.

Scatter Plot

A scatter plot represents every data point in the graph. It shows how the data in one
column varies with the corresponding data points in another column. For
example, plot one scatter plot of individuals' salaries against their bank balances
and another of their balances against their ages.

Figure 15: Plotting a scatter plot of Salary vs. Balance

Looking at the above plot, you can say that, regardless of an individual's salary, the
average bank balance ranges from 0 to 25,000. The majority of people have a bank
balance below 40k.

Figure 16: Plotting a scatter plot of Balance vs Age


From the above graph, you can conclude that the average balance of people,
regardless of age, is around 25,000. This is the average balance, irrespective of age
and salary.

Pair Plot
Pair plots are used to compare multiple variables simultaneously. They plot a scatter
plot of all input variables against each other, which helps save space and allows us to
compare various variables simultaneously. Let's plot the pair plot for salary, balance,
and age.

Figure 17: Plotting a pairplot

The figure below shows the pair plot for salary, balance, and age. Each variable is
plotted against the others on both the x- and y-axes.

Figure 18: Pairplots of salary, balance, and age

Correlation Matrix

A correlation matrix is used to see the correlation between different variables. The
correlation coefficient measures how strongly two variables are linearly related, on a
scale from -1 to 1. The table below shows the correlation between salary, age, and
balance. Correlation tells you how closely one variable moves with another; it does not
by itself show that a change in one causes a change in the other.

Figure 19: Correlation matrix between salary, balance, and age
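A correlation matrix like this one can be computed with `DataFrame.corr()`, which uses the Pearson coefficient by default. The numbers below are invented for illustration and will not match the figure.

```python
import pandas as pd

# Hypothetical values for the three numeric columns.
df = pd.DataFrame({
    "salary": [20000, 50000, 60000, 100000, 120000],
    "balance": [500, 1500, 1200, 4000, 3500],
    "age": [25, 41, 33, 52, 38],
})

# Pairwise Pearson correlation between the selected columns.
corr = df[["salary", "balance", "age"]].corr()
print(corr)
```

The result is symmetric with ones on the diagonal, since every variable is perfectly correlated with itself.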

The above matrix tells us that balance has a relatively high correlation coefficient with
both age and salary, while age and salary have a lower correlation coefficient with each
other.

2. Numeric - Categorical Analysis

When one variable is of numeric type, and another is a categorical variable, you perform
numeric-categorical analysis.

You can use the groupby function to arrange the data into groups. Rows that
have the same value in a particular column are grouped together. This way, you can count
how often each category occurs, or compute a statistic such as the mean of a numeric
column within each group.

Figure 20: Groupby of response with respect to salary


The above values tell you the average salary of the people who have responded yes or
no in the response column.

You can also find the middle value of salary or the median value of the people who have
responded with yes and no in our survey.
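Both group-wise summaries can be sketched with `groupby`. The `response` and `salary` columns follow the text, while the rows are invented and do not reproduce the figures.

```python
import pandas as pd

# Hypothetical survey rows.
df = pd.DataFrame({
    "response": ["yes", "no", "no", "yes", "no", "yes"],
    "salary": [100000, 20000, 60000, 60000, 70000, 50000],
})

# Mean and median salary within each response group.
mean_salary = df.groupby("response")["salary"].mean()
median_salary = df.groupby("response")["salary"].median()

print(mean_salary)
print(median_salary)
```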

Figure 21: Median of groupby of response with respect to salary

You can also draw a box plot of response vs. salary. A box plot shows the median, the
quartiles, and the overall range of the values that fall under each category.

Figure 22: Boxplot of response with respect to salary

The above plot tells you that the salary range of people who said no on the survey is
between 20k and 70k, with a median salary of 60k, while the salary range of people who
replied yes was between 50k and 100k, also with a median salary of 60k.

3. Categorical - Categorical Analysis

When both the variables contain categorical data, you perform categorical-categorical
analysis. First, convert the categorical response column into a numerical column with 1
corresponding to a positive response and 0 corresponding to a negative response.
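One way to do this conversion is with `Series.map`. The sketch assumes the responses are stored as the strings "yes" and "no", and the new column name `response_rate` is made up for the example.

```python
import pandas as pd

# Hypothetical responses, assumed to be stored as "yes" / "no".
df = pd.DataFrame({"response": ["yes", "no", "no", "yes"]})

# Map the positive response to 1 and the negative response to 0.
df["response_rate"] = df["response"].map({"yes": 1, "no": 0})

print(df)
```

Any label that is not in the mapping would become NaN, which makes unexpected values easy to spot.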

Figure 23: Changing categorical to numerical values

Now, plot the response rate against marital status. The figure below shows, for each
marital status, the mean of the numeric response column, which is the share of people who
responded yes to the survey.

Figure 24: Response rate by marital status

Also, plot the mean response rate against loan status.

Figure 25: Response rate by loan status

You can conclude that people who have taken a loan are likelier to respond with a no on
the survey.
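Because the response column is now 0 or 1, its mean within a group is the share of yes responses. The sketch below computes that rate per marital status and per loan status. The column names `marital`, `loan`, and `response_rate` and all the rows are assumptions for illustration, so the rates shown are not the survey's actual results.

```python
import pandas as pd

# Hypothetical rows with the response already encoded as 1 (yes) or 0 (no).
df = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced", "single", "married"],
    "loan": ["no", "no", "yes", "yes", "no", "yes"],
    "response_rate": [0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of ones in each group.
rate_by_marital = df.groupby("marital")["response_rate"].mean()
rate_by_loan = df.groupby("loan")["response_rate"].mean()

print(rate_by_marital)
print(rate_by_loan)

# rate_by_marital.plot.bar() would draw the corresponding bar chart (requires Matplotlib).
```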

Conclusion

Exploratory Data Analysis provides valuable insights through data exploration, cleaning,
and visualization. By understanding the fundamental steps of EDA and applying them to
market analysis, professionals can make data-driven decisions and uncover hidden
trends. Mastering EDA techniques is essential for anyone looking to excel in data
science.

FAQs

1. What Are the Benefits of EDA?

Exploratory Data Analysis helps identify patterns, detect outliers, understand
relationships between variables, and improve data quality, leading to more accurate and
reliable models.

2. How Does EDA Differ From Data Cleaning?

Exploratory Data Analysis involves analyzing and visualizing data to understand its
characteristics, while data cleaning focuses on correcting errors, handling missing
values, and ensuring data consistency.

3. Can EDA Be Performed on Any Type of Data?

Yes, Exploratory Data Analysis can be performed on any type of data, including
structured, unstructured, and semi-structured data, though the techniques and tools
may vary.

4. What Are Some Common Visualizations Used in EDA?

Common visualizations in Exploratory Data Analysis include histograms, scatter plots,
box plots, bar charts, line charts, heatmaps, and pair plots.

5. What Should Be Done if Outliers Are Found During EDA?


Investigate the cause of outliers to determine if they are errors, natural variations, or
significant insights. Based on the context of the analysis, decide whether to retain,
transform, or remove them.
