Data Visualization
1. What is Data Visualization?
2. Importance of Data Visualization in Data Science
3. Different Types of Data Visualization
4. Data Visualization Process/Workflow
5. Tools and Software for Data Visualization
6. Data Visualization Techniques in Data Science
7. Advantages and Disadvantages of Data Visualization
8. Examples of Data Visualization in Data Science
9. Data Visualization Best Practices
10. Essential Skills for Data Visualization
11. Conclusion
12. Frequently Asked Questions (FAQs)
A picture is worth a thousand words. People generally prefer looking at
pictures to reading text. That’s why visualization matters at every step
of the data science project lifecycle. From data understanding to model
validation, data visualization plays an important role.
Modern tools make data visualization much easier and more effective. We
also need to follow a standard workflow to create good visualizations
that everyone can understand. Both the tools and the workflow are
discussed here. You will also see different data visualization graphs
with their relevant use cases.
What is Data Visualization?
Importance of Data Visualization in Data Science
Earlier, I mentioned the importance of data visualization in data science.
Here are some more details.
1. Data cleaning
Data visualization plays an important role in data cleaning. Good
examples are detecting outliers and checking for multicollinearity. We
can create scatterplots to detect outliers and generate heatmaps to check
for multicollinearity.
2. Data Exploration
Before building any model, we need to do some exploratory data
analysis to identify dataset characteristics. For example, we can create
histograms for continuous variables to check for normality in the data.
We can create scatterplots between two features to check whether they
are correlated. Likewise, we can create a bar chart for the label column
with two or more classes to identify class imbalance.
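The class imbalance check described above can be sketched as follows.
This is a minimal example, assuming matplotlib is installed; the label
values ("spam" and "ham") and their counts are made up for illustration.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical label column of a binary classification dataset.
labels = ["spam"] * 15 + ["ham"] * 85

# Count how many rows fall into each class.
class_counts = Counter(labels)

fig, ax = plt.subplots()
ax.bar(list(class_counts.keys()), list(class_counts.values()))
ax.set_title("Class distribution of the label column")
ax.set_xlabel("Class")
ax.set_ylabel("Number of rows")

# Share of the smallest class; a small value signals class imbalance.
minority_share = min(class_counts.values()) / len(labels)
```

Here the minority class makes up only 15% of the rows, which the bar
chart makes visible at a glance.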
3. Evaluation of modeling outputs
We can create a confusion matrix and learning curve to measure the
performance of a model during training. Plots are also useful in
validating model assumptions. For example, we can create a residuals
plot and histogram for the distribution of residuals to validate the
assumptions of a linear regression model.
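As a rough sketch of the residual checks mentioned above, the example
below fits a straight line to synthetic data with NumPy and draws a
residuals plot next to a histogram of the residuals. The data and the
true coefficients (slope 2, intercept 1) are assumptions chosen only for
illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Synthetic data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

fig, (ax_res, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: look for random scatter around zero.
ax_res.scatter(fitted, residuals, s=10)
ax_res.axhline(0, color="black", linewidth=1)
ax_res.set_xlabel("Fitted values")
ax_res.set_ylabel("Residuals")
ax_res.set_title("Residuals plot")

# Histogram of residuals: look for a roughly bell-shaped distribution.
ax_hist.hist(residuals, bins=20)
ax_hist.set_xlabel("Residual")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Distribution of residuals")
```

A visible pattern in the left panel (a curve or a funnel shape) or a
strongly skewed histogram on the right would suggest that the linear
regression assumptions do not hold.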
4. Identifying trends
Time plots and seasonal plots are useful in time series analysis to
identify trends and seasonal patterns over time.
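One common way to make a trend visible in a time plot is to overlay a
moving average. The sketch below assumes a synthetic monthly series with
an upward trend and a 12-month seasonal cycle; the 12-month window is an
illustrative choice, not a rule.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Synthetic monthly series: linear trend + yearly seasonality + noise.
months = np.arange(48)
rng = np.random.default_rng(1)
series = (0.5 * months
          + 5 * np.sin(2 * np.pi * months / 12)
          + rng.normal(scale=1.0, size=months.size))

# 12-month moving average; averaging over a full cycle smooths out
# the seasonal component and leaves the trend.
window = 12
kernel = np.ones(window) / window
trend = np.convolve(series, kernel, mode="valid")
trend_months = months[window - 1:]  # align with the last month of each window

fig, ax = plt.subplots()
ax.plot(months, series, label="Observed")
ax.plot(trend_months, trend, label="12-month moving average")
ax.set_xlabel("Month")
ax.set_ylabel("Value")
ax.set_title("Time plot with trend")
ax.legend()
```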
5. Presenting results
As a data scientist, you need to present your findings to the company or
other stakeholders who may not have much knowledge of the subject
domain. So, you need to explain everything in plain English. You can
use informative plots that summarize your findings.
Different Types of Data Visualization
There are many data visualization types. The following are the
commonly used data visualization charts.
1. Distribution plot
A distribution plot is used to visualize data distribution. Example:
Probability distribution plot or density curve.
Source: seaborn.pydata.org
2. Box and whisker plot
This plot shows the variation in the values of a numerical feature. It
displays the minimum, maximum, median, and lower and upper quartiles of
the values.
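The quantities shown in a box and whisker plot can also be computed
directly. The sketch below uses a small made-up sample. Note that
matplotlib's boxplot, by default, draws the whiskers to the most extreme
points within 1.5 times the interquartile range and marks points beyond
that as outliers, so the whiskers are not always the raw minimum and
maximum.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical values of one numerical feature; 58 is an unusually large value.
values = np.array([12, 15, 17, 19, 21, 22, 24, 27, 30, 58], dtype=float)

# Lower quartile, median and upper quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn as outliers by default.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_ylabel("Value")
ax.set_title("Box and whisker plot")
```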
3. Violin plot
Similar to the box and whisker plot, the violin plot is used to plot the
variation of a numerical feature. But it contains a kernel density curve in
addition to the box plot. The kernel density curve estimates the
underlying distribution of data.
Source: seaborn.pydata
4. Line plot
A line plot is created by connecting a series of data points with straight
lines. For time series data, the time periods are placed on the x-axis.
5. Bar plot
A bar plot is used to plot how often each category occurs in categorical
data.
Each category is represented by a bar. The bars can be created
vertically or horizontally. Their heights or lengths are proportional to the
values they represent.
6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or
non-linear and positive or negative) between two numerical variables.
They are commonly used in regression analysis.
7. Histogram
A histogram represents the distribution of numerical data. Looking at a
histogram, we can decide whether the values are normally distributed (a
bell-shaped curve), skewed to the right or skewed left. A histogram of
residuals is useful to validate important assumptions in regression
analysis.
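To complement the visual check, the sketch below draws a histogram of a
synthetic right-skewed sample and computes the sample skewness by hand.
The exponential sample is an assumption for illustration; a positive
skewness corresponds to a long right tail, a negative one to a long left
tail, and a value near zero to a roughly symmetric shape.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Synthetic right-skewed sample.
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=1000)

# Bin counts behind the histogram.
counts, bin_edges = np.histogram(sample, bins=30)

# Sample skewness: third central moment divided by variance^(3/2).
centered = sample - sample.mean()
skewness = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

fig, ax = plt.subplots()
ax.hist(sample, bins=30)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of a right-skewed sample")
```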
8. Pie chart
A pie chart of a categorical variable shows each category as a slice
whose size is proportional to the quantity it represents. It is a
circular graph in which the number of slices equals the number of
categories.
9. Area plot
The area plot is based on the line chart. We get the area plot when we
fill the area between the line and the x-axis.
Source: python-graph-gallery.com
10. Hexbin plot
Similar to the scatter plot, a hexbin plot represents the relationship
between two numerical variables. It is useful when there are a lot of data
points, because many points overlap when they are drawn in a scatter
plot. A hexbin plot instead groups the points into hexagonal bins and
colors each bin by the number of points it contains.
Source: python-graph-gallery.com
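A minimal hexbin sketch with matplotlib is shown below. The two
correlated variables are synthetic, and the grid size of 40 is an
arbitrary choice that you would tune for your own data.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Two synthetic, positively related variables with many data points.
rng = np.random.default_rng(7)
x = rng.normal(size=20000)
y = 0.6 * x + rng.normal(scale=0.8, size=20000)

fig, ax = plt.subplots()

# Each hexagon is colored by the number of points that fall inside it.
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=ax, label="Points per hexagon")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Hexbin plot")

# Per-hexagon counts, useful for checking how dense the busiest bin is.
counts = hb.get_array()
```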
11. Heatmap
A heatmap visualizes the correlation coefficients of numerical features
with a color map. Which colors indicate a high or a low correlation
depends on the color map you choose, so include a color bar. The heatmap
is extremely useful for identifying multicollinearity, which occurs when
input features are highly correlated with one or more of the other
features in the dataset.
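The multicollinearity check can be sketched as follows. The three
features are synthetic (feature "b" is deliberately built as a noisy
copy of "a"), and the 0.9 threshold for flagging a pair is an assumed
value rather than a fixed rule.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Synthetic features: b nearly duplicates a, c is independent.
rng = np.random.default_rng(3)
n = 500
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)
features = np.column_stack([a, b, c])
names = ["a", "b", "c"]

# Pearson correlation matrix (columns are the variables).
corr = np.corrcoef(features, rowvar=False)

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation heatmap")

# Flag feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.9
high_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
```

With this data, only the pair ("a", "b") is flagged, which matches the
strongly colored off-diagonal cell in the heatmap.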
Data Visualization Process/Workflow
The data visualization process or workflow includes the following key
steps.
1. Develop your research question
This may be a business problem or any other related problem that could
be solved with a data-driven approach. You should note all the
objectives and outcomes plus required resources such as datasets,
open-source software libraries, etc.
2. Get or create your data
The next step is collecting data. You can use existing datasets if they’re
relevant to your research question. Alternatively, you can
download open-source datasets from the internet or do web scraping to
collect data.
3. Clean your data
Real-world data are messy. So, you need to clean them before using
them for visualization. You can identify missing values and outliers and
treat them accordingly. You can perform feature selection and remove
unnecessary features from the data. You can create a new set of
features based on the original features.
4. Choose a chart type
The chart type depends on many factors. For example, it depends on
the feature type (numerical or categorical). It also depends on the type
of visualization you need. Let’s say you have two numerical features. If
you want to find their distributions, you can create a histogram for
each feature. If you want to plot their variations, you can create a box
and whisker plot for each feature. You can create a scatterplot if you
want to find a relationship (linear or non-linear, positive or negative)
between the two features.
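The selection logic above can be written down as a small lookup table.
The function name suggest_chart, the goal names and the rule table below
are all hypothetical and cover only the cases mentioned in this article;
treat it as a sketch of the idea, not a complete decision procedure.

```python
def suggest_chart(feature_types, goal):
    """Suggest a chart type for the given feature types and goal.

    feature_types: sequence of "numerical" / "categorical" strings.
    goal: one of "distribution", "variation", "frequency", "relationship".
    Both the argument names and the rules are illustrative assumptions.
    """
    rules = {
        (("numerical",), "distribution"): "histogram",
        (("numerical",), "variation"): "box and whisker plot",
        (("categorical",), "frequency"): "bar plot",
        (("numerical", "numerical"), "relationship"): "scatter plot",
    }
    key = (tuple(feature_types), goal)
    # Fall back to a manual decision when no rule matches.
    return rules.get(key, "no rule defined; choose manually")


# Two numerical features: one call per question we want to answer.
relationship_chart = suggest_chart(["numerical", "numerical"], "relationship")
distribution_chart = suggest_chart(["numerical"], "distribution")
```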
5. Choose your tool
You can use open-source data visualization tools such as matplotlib,
seaborn, Plotly and ggplot2. You can also use commercial software such
as MATLAB, Minitab, SPSS, etc.
6. Prepare data
You can extract relevant features. You can do feature standardization if
the values of the features are not on the same scale. You can apply
dimensionality reduction techniques such as PCA to reduce the
dimensionality of the data. That will allow you to visualize
high-dimensional data in 2D and 3D plots!
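The preparation step can be sketched as standardization followed by PCA.
To keep the example dependency-free, PCA is computed here with NumPy's
singular value decomposition rather than with a dedicated library such
as scikit-learn. The five-feature dataset and its scales are synthetic.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Synthetic dataset: 300 rows, 5 features on very different scales.
rng = np.random.default_rng(5)
scales = np.array([1.0, 10.0, 100.0, 0.1, 5.0])
X = rng.normal(size=(300, 5)) * scales

# Standardize: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
X_2d = X_std @ Vt[:2].T  # project onto the first two components

# Fraction of the total variance captured by each component.
explained = S ** 2 / (S ** 2).sum()

fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=10)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("5-dimensional data projected to 2D")
```

Checking the first two entries of explained tells you how much of the
original variance the 2D plot actually shows.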
7. Create a chart
This is the final step. Here, you define the title and the axis labels.
You should also choose a proper chart background to ensure the
content is easily readable.
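A minimal sketch of this final step with matplotlib is shown below. The
quarterly revenue figures are made up, and the chart is written to an
in-memory buffer here only so the example does not touch the file
system; passing a file name to savefig would write it to disk instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data to plot.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 170]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(quarters, revenue, marker="o")

# Title and axis labels tell the reader what the chart shows.
ax.set_title("Quarterly revenue")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue (thousand USD)")

# A plain background with a light grid keeps the content readable.
ax.set_facecolor("white")
ax.grid(True, alpha=0.3)

# Render the chart as a PNG image.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
png_bytes = buffer.getvalue()
```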
Tools and Software for Data Visualization
There are multiple tools and software available for data visualization.
1. Python provides open-source libraries such as
• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair
2. R provides open-source libraries such as
• ggplot2
• Lattice
3. Other data visualization software
• IBM SPSS
• Minitab
• MATLAB
• Tableau
• Microsoft Power BI
Tableau and Microsoft Power BI are popular among data scientists.
Data Visualization Techniques in Data Science
Some of the main data visualization techniques in data science are
univariate analysis, bivariate analysis and multivariate analysis.
1. Univariate Analysis
In univariate analysis, as the name suggests, we analyze only one
variable at a time. In other words, we analyze each variable separately.
Bar charts, pie charts, box plots and histograms are common examples
of univariate data visualization. Bar charts and pie charts are created for
categorical variables, while box plots and histograms are created for
numerical variables.
2. Bivariate Analysis
In bivariate analysis, we analyze two variables at a time. Often, we see
whether there is a relationship between the two variables. The scatter
plot is a classic example of bivariate data visualization.
3. Multivariate Analysis
In multivariate analysis, we analyze more than two variables
simultaneously. The heatmap is a classic example of multivariate data
visualization. Other examples are plots of the results of cluster
analysis and principal component analysis (PCA).
Advantages and Disadvantages of Data Visualization
Advantages
There are many advantages of data visualization. Data visualization is
used to:
• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions
Disadvantages
There are also some disadvantages of data visualization.
• We need to download, install and configure software and
open-source libraries. The process will be difficult and time-consuming
for beginners.
• Some data visualization tools are not available for free. We need to
pay for those.
• When we summarize the data in a plot, we lose some of the exact
detail.
Examples of Data Visualization in Data Science
Here are some popular data visualization examples.
1. Weather reports: Maps and other plot types are commonly used in
weather reports.
2. Internet websites: Web and social media analytics services such as
Social Blade and Google Analytics use data visualization techniques to
analyze and compare the performance of websites and channels.
3. Astronomy: NASA uses advanced data visualization techniques in
its reports and presentations.
4. Geography
5. Gaming industry
Data Visualization Best Practices
1. Set the context
We need to develop a research question that could be solved with a
data-driven approach.
2. Know your audience
This is very important as the visualizations depend on the type of
audience you have. To present your findings to a business audience,
you need to create visualizations closely related to money, profits and
revenue, the terms that business people are familiar with!
3. Choose an effective visual
You need to create the right plot that addresses your requirement. To
see the correlations between multiple variables, you can create
scatterplots for each pair of variables. But that is not very effective
when there are many variables. Instead, you can create a heatmap, which
is an effective way of visualizing correlations. When you have many
categories, the pie chart is not suitable. Instead, you can create a bar
chart. These are some examples of choosing an effective visual for your
requirements.
4. Keep it simple
Simple plots are easily readable. We can remove unnecessary
backgrounds to make things stand out. We should not include too much
content in the plot. A title, axis labels, scales and a legend are
enough.
Essential Skills for Data Visualization
You should have the following data visualization skills for effective data
visualization.
1. Programming
You should know R or Python. R wins, hands down, when it
comes to data visualization. Its ggplot2 library provides high-level
functions to make complex plots with less code. Data visualization in
Python can be done using libraries like matplotlib, Plotly, Bokeh and
seaborn. Plotly and Bokeh can be used for interactive data
visualizations.
2. Software Expertise
In addition to using R or Python, you can also use software such as
MATLAB, Minitab and SPSS for data visualization. Data visualization in
Excel is also popular. However, these tools provide limited
customization for your plots. In addition, you cannot automate the plot
creation process as easily as you can with Python or R.
3. Data Science Skills
Data visualization is one of the data science skills. But, for effective data
visualization, you need other data science skills such as statistical
analysis, data cleaning, processing large data sets, data mining, etc.
Data visualization cannot be done in isolation. It draws on all of these
skills.
4. Public Speaking and Presentation
When it comes to presenting your findings to the company or other
stakeholders, you need to have excellent presentation skills. You
need confidence when explaining things to a larger audience. For that,
you should be familiar with the given problem domain.
5. Machine Learning
Machine learning is the ability of computers to learn from data without
being explicitly programmed. It is different from traditional
programming. We can use machine learning algorithms to find important
patterns and features in the data. Then, we can visualize those patterns
and features. There are also machine learning algorithms that can be
used to perform data cleaning before data visualization. In this way,
machine learning can be part of the data visualization process.
Conclusion
Data visualization is important in every aspect of data science. We
should clean our data before making any visualization. We should
choose the right tool or software that addresses our needs, such as
affordability, ease of use, etc. The main challenge in data visualization
is choosing the right plot type. It depends on many factors. Finally, you
need excellent public speaking and presentation skills to present your
findings.
Today, we discussed data visualization applications and methods in
detail with examples. Learning data visualization is not straightforward.
You should master many skills for that.
Frequently Asked Questions (FAQs)
1. What are the three main goals of data visualization?
• Communicating your results or findings with your audience
• Exploring (knowing) your data
• Identifying trends, patterns and correlations between variables
2. How is data visualization used in data science?
Data visualization is used in every aspect of data science:
• Tuning hyperparameters
• Monitoring the model’s performance
• Cleaning data
• Validating the model’s assumptions
3. What are the major challenges of data visualization?
• Choosing the right plot type
• Identifying the needs of your audience
• Developing the research question and converting it to a data science
question
• Collecting data
4. What are the benefits of data visualization?
Common benefits of data visualization include the ability to:
• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions