Data Visualization with Different Charts in Python
- Sometimes data does not make sense until you can look at it in a visual form, such as with charts and plots.
- Being able to quickly visualize your data samples for yourself and others is an important skill both in applied
statistics and in applied machine learning.
Data visualization is the presentation of data in a graphical format. It helps people understand
the significance of data by summarizing large amounts of it in a simple,
easy-to-understand format, and it helps communicate information clearly and effectively.
In this section, we are going to learn how to create the following 11 plots:
1. Histogram
2. Box Plot
3. Bar Plot
4. Column Chart
5. Pie Chart
6. Scatter Plot
7. Line Chart
8. Violin Plot
9. Density Plot
10. WordCloud
11. Heat Map
1- Histogram
Definition
A histogram is a graphical representation of the distribution of a numeric variable. It takes only numeric variables as input. The
variable is cut into several bins, and the number of observations per bin is represented by the height of the bar.
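To make the binning step concrete, here is a minimal sketch that counts the observations per bin by hand. The values are the Sales column of the dataset used in this section. The helper name `bin_counts` and the choice of 3 equal-width bins are assumptions made for illustration; they are not the defaults that `hist()` uses.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Illustrative helper (hypothetical name); assumes the values are not all equal.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin (right edge inclusive).
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Sales column of the example dataset
sales = [123, 114, 135, 139, 117, 121, 133, 140, 133, 133]
counts = bin_counts(sales, 3)
print(counts)
```

Each entry of `counts` is the height of one bar of the histogram.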
In [1]:
# import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
data = [['E001', 'M', 34, 123, 'Normal', 350],
['E002', 'F', 40, 114, 'Overweight', 450],
['E003', 'F', 37, 135, 'Obesity', 169],
['E004', 'M', 30, 139, 'Underweight', 189],
['E005', 'F', 44, 117, 'Underweight', 183],
['E006', 'M', 36, 121, 'Normal', 80],
['E007', 'M', 32, 133, 'Obesity', 166],
['E008', 'F', 26, 140, 'Normal', 120],
['E009', 'M', 32, 133, 'Normal', 75],
['E010', 'M', 36, 133, 'Underweight', 40] ]
# dataframe created with
# the above data array
df = pd.DataFrame(data, columns = ['EMPID', 'Gender','Age', 'Sales','BMI', 'Income'] )
In [5]:
# create histogram for numeric data
df.hist()
# show plot
plt.show()
In [13]:
# create histogram for numeric data
df['Sales'].hist()
# show plot
plt.show()
In [ ]:
# Install seaborn if needed (use one of the following):
# conda install seaborn
# !pip install seaborn
Seaborn
Seaborn is a graphics library built on top of Matplotlib.
It makes charts look nicer and simplifies some common data visualization tasks.
In [12]:
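# Import library (the dataframe df created above is reused)
import seaborn as sns
# histplot with kde=True draws a histogram with a density curve on top
sns.histplot(x=df["Sales"], kde=True)
plt.show()
# Pass the variable as y to draw the histogram horizontally
sns.histplot(y=df["Sales"], kde=True)
plt.show()
# bins:
# number of bins or sequence of bin edges, optional
# Specification of hist bins.
sns.histplot(x=df["Sales"], bins=20, kde=True)
plt.show()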
2- Boxplot
Definition
- The boxplot is probably one of the most common types of graphic. It gives a nice summary of one or several numeric variables. The line that
divides the box into 2 parts represents the median of the data.
- The ends of the box show the upper and lower quartiles. The extreme lines (whiskers) show the highest and lowest values excluding outliers.
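The numbers behind a boxplot can be computed directly. The sketch below uses the Income column of the dataset from this section and assumes the common 1.5 × IQR rule for deciding which points count as outliers; other conventions exist, so treat the fences here as one possible choice.

```python
import statistics

# Income column of the example dataset
income = [350, 450, 169, 189, 183, 80, 166, 120, 75, 40]

# Quartiles, using linear interpolation between data points
q1, median, q3 = statistics.quantiles(income, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR away from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in income if lower_fence <= v <= upper_fence]
outliers = sorted(v for v in income if v < lower_fence or v > upper_fence)

# The whiskers reach the most extreme values that are not outliers
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```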
In [27]:
# For each numeric attribute of dataframe
df.plot.box()
plt.show()
# individual attribute box plot
plt.boxplot(df['Income'])
plt.show()
In [29]:
# Make boxplot for one group only
sns.boxplot( y=df["Income"] )
plt.show()
3- Barplot
Definition
A barplot (or bar chart) is one of the most common types of graphic. It shows the relationship between a numeric and a categorical
variable. Each category of the categorical variable is represented as a bar. The size of the bar represents its numeric value.
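As a small sketch of where bar heights can come from, the code below counts how often each category occurs in the Gender column of the dataset from this section. Using the count as the numeric value is an assumption for illustration; any other per-category number (a mean or a sum, for example) would work the same way.

```python
from collections import Counter

# Gender column of the example dataset
gender = ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M']

# One bar per category; the bar height is the number of occurrences
counts = Counter(gender)
bars = list(counts.keys())
heights = list(counts.values())

print(bars, heights)
```

The two lists can then be passed to `plt.bar(bars, heights)`.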
In [55]:
# Make a fake dataset:
frequency = [3, 12]
bars = ('Male', 'Female')
# Create bars
plt.bar(bars, frequency)
# Create names on the x-axis
plt.xticks(bars)
# Show graphic
plt.show()
# Create horizontal bars
plt.barh(bars, frequency)
# Create names on the y-axis
plt.yticks(bars)
# Show graphic
plt.show()
4- Column Chart
Definition
A column chart is used to show a comparison among different attributes, or it can show a comparison of items over time.
In [56]:
# The dataframe from the previous code is used here
# Plot a bar chart of the numeric columns
# to compare all three:
# age, income, sales
df.plot.bar()
plt.show()
# plot between 2 attributes
plt.bar(df['Age'], df['Sales'])
plt.xlabel("Age")
plt.ylabel("Sales")
plt.show()
5- Pie Chart
Definition
A pie chart shows how categories make up parts of a whole, that is, the composition of something. A pie chart represents
numbers as percentages, and the sum of all segments needs to equal 100%.
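The percentages of a pie chart are each category's share of the total. The following sketch works this out for the BMI column of the dataset from this section; picking that column is an assumption made for illustration.

```python
from collections import Counter

# BMI column of the example dataset
bmi = ['Normal', 'Overweight', 'Obesity', 'Underweight', 'Underweight',
       'Normal', 'Obesity', 'Normal', 'Normal', 'Underweight']

counts = Counter(bmi)
total = sum(counts.values())

# Share of each category as a percentage of the whole
shares = {category: 100 * n / total for category, n in counts.items()}

for category, share in shares.items():
    print(f"{category}: {share:.1f}%")
```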
In [82]:
# make the plot
# To add a percentage to each slice of the pie chart, pass autopct='%1.1f%%' to plt.pie
# This formats the percentage to one decimal place.
# To format the percentage to two decimal places, use autopct='%1.2f%%'
# To format the percentage to three decimal places, use autopct='%1.3f%%'
# labels must be an ordered sequence (a list), one label per slice
plt.pie(df['Age'], labels = ["A", "B", "C",
"D", "E", "F",
"G", "H", "I", "J"],
autopct ='%1.1f%%', shadow = True)
plt.show()
6- Scatter Plot
Definition
A scatter plot shows the relationship between two different variables and can reveal trends in how they are distributed. It should be used when
there are many different data points and you want to highlight similarities in the data set. This is useful when looking for outliers and for
understanding the distribution of your data.
In [86]:
# use the function regplot to make a scatterplot
sns.regplot(x=df["Age"], y=df["Sales"])
plt.show()
sns.regplot(x=df["Age"], y=df["Sales"], fit_reg=False)
plt.show()
In [85]:
# scatter plot between sales and age
plt.scatter(df['Age'], df['Sales'])
plt.show()
7- Line Chart
Definition
A line chart or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line
segments. A line chart is often used to visualize a trend in data over intervals of time.
In [139…
# To customize the line color, use the 'color' argument.
plt.plot( 'Age','Sales', data=df[['Age','Sales']], color='skyblue')
plt.show()
plt.plot( 'Age','Sales', data=df[['Age','Sales']], color='skyblue', alpha=0.3 , linestyle='--' , linewidth=2)
plt.show()
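A line chart connects the points in the order in which they are given, so unsorted x values produce a line that jumps back and forth. As a minimal sketch, assuming we want the line to run from left to right, the Age and Sales values of the dataset can be sorted by age before plotting:

```python
# Age and Sales columns of the example dataset
age = [34, 40, 37, 30, 44, 36, 32, 26, 32, 36]
sales = [123, 114, 135, 139, 117, 121, 133, 140, 133, 133]

# Sort the (age, sales) pairs by age so the line moves from left to right
pairs = sorted(zip(age, sales))
age_sorted = [a for a, _ in pairs]
sales_sorted = [s for _, s in pairs]

print(age_sorted)
print(sales_sorted)
```

The sorted lists can then be passed to `plt.plot(age_sorted, sales_sorted)`.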
8- Violin Plot
Definition
A violin plot can be used to display the distribution of the data and its probability density. Furthermore, we get a visualization of the median
of the data (the white marker in the center of the inner box plot).
In [109…
df0 = pd.read_csv('mtcars.csv', index_col=0)
sns.violinplot(x="vs", y='wt', data=df0)
plt.show()
9- Density Plot
Definition
A density plot shows the distribution of a numerical variable. It takes only a set of numeric values as input. It is closely related to a histogram.
In [110…
# A density plot is made using the kdeplot function. As input, a density plot needs only one numerical variable.
# A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram.
# KDE represents the data using a continuous probability density curve in one or more dimensions.
sns.kdeplot(df['Sales'])
plt.show()
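To see what a kernel density estimate computes, the sketch below builds one by hand: a small Gaussian bump is placed on every observation and the bumps are averaged. The function name `kde_at`, the bandwidth of 5.0 and the evaluation grid are assumptions chosen for illustration; seaborn selects its own bandwidth by default, so its curve will not match this one exactly.

```python
import math


def kde_at(samples, x, bandwidth):
    """Gaussian kernel density estimate at x (illustrative helper, hypothetical name)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


# Sales column of the example dataset
sales = [123, 114, 135, 139, 117, 121, 133, 140, 133, 133]

bandwidth = 5.0                              # assumed value, for illustration only
step = 0.5
grid = [80 + i * step for i in range(201)]   # evaluation points from 80 to 180
density = [kde_at(sales, x, bandwidth) for x in grid]

# A density curve encloses an area of (approximately) 1
area = sum(density) * step
peak_x = grid[density.index(max(density))]
print(round(area, 3), peak_x)
```

Plotting `grid` against `density` with `plt.plot` gives a smooth curve of the same kind that `kdeplot` draws.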
In [113…
# plot of 2 variables
# fill=True shades the area under each density curve
sns.kdeplot(df['Age'], fill=True, color="r")
sns.kdeplot(df['Sales'], fill=True, color="b")
plt.show()
10- WordCloud
Definition
A word cloud (or tag cloud) is a visual representation of text data. It displays a list of words, with the importance of each shown by its font
size or color. This format is useful for quickly perceiving the most prominent terms.
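A common measure of a word's importance is simply how often it occurs in the text. The minimal sketch below counts word frequencies with the standard library; the short sample string is an assumption made for illustration and is not the text used in the cells that follow.

```python
from collections import Counter

# Short sample text (assumed for illustration)
text = "Python Python Python Matplotlib Matplotlib Seaborn Pandas"

# Count how often each word occurs; more frequent words would be drawn larger
frequencies = Counter(text.split())
top = frequencies.most_common(2)

print(top)
```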
In [114…
# !pip install wordcloud
# conda install wordcloud
from wordcloud import WordCloud
In [117…
# Create a string of words
text=("Python Python Python Matplotlib Matplotlib Seaborn Network Plot Violin Chart Pandas Datascience Wordcloud Spider R
# Create the wordcloud object
wordcloud = WordCloud(width=480, height=480, margin=0).generate(text)
# Display the generated image:
plt.imshow(wordcloud,interpolation="gaussian")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
In [118…
import numpy as np
from PIL import Image
# to import the image
text=("Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communicat
wave_mask = np.array(Image.open( "twitter_mask.png"))
# Make the figure
wordcloud = WordCloud(mask=wave_mask,colormap="Blues").generate(text)
plt.figure()
plt.imshow(wordcloud,interpolation="gaussian")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
11- Heat Map
Definition
A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.
In [128…
# width, height in inches
fig = plt.figure(figsize=(12, 8))
sns.heatmap(df[['Age','Income', 'Sales']])
plt.show()