Tung Wah College
GEN3005 / GED3005 Big Data and Data Sciences
Lecture 5: Data Exploration Using Python
Learning Outcomes:
By completing Lecture 5, you should be able to
1) use Python to obtain simple statistics of numeric and categorical variables in a
dataframe;
2) draw boxplots and scatter plots of variables for data exploration; and
3) perform binning on variables to visualize data distribution.
1. Data Exploration in Python
A. Obtaining Simple Statistics
By looking at the simple statistics of a dataset, data scientists can identify whether the
dataset has numerical issues, such as the presence of outliers or large data
dispersion.
import pandas as pd
# To display all columns when printing the dataframe
pd.set_option('display.max_columns', None)
# To display all rows when printing the dataframe
pd.set_option('display.max_rows', None)
i) Descriptive Statistics
To get the simple statistics of the dataset, we can use the describe() method.
print(df.describe())
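The simple statistics, namely the count, the mean (average), the sample standard deviation,
and the five-number summary (minimum, 25th, 50th and 75th percentiles, and maximum), are
displayed. By default, the describe() method skips columns that do not contain numbers
and excludes missing values from the calculations.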
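As a minimal sketch of what describe() reports, the example below builds a small dataframe and compares the summary with statistics computed one at a time. The dataframe toy and its values are hypothetical and are used only for illustration; they are not taken from the lecture dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
})

# Only the numeric column "Age" is summarised by default
summary = toy.describe()
print(summary)

# The same figures can be obtained one at a time
print(toy["Age"].mean())          # the "mean" row
print(toy["Age"].std())           # the "std" row: sample standard deviation (divides by n - 1)
print(toy["Age"].quantile(0.25))  # the "25%" row
```

The non-numeric column "Gender" does not appear in the summary, which matches the default behaviour described above.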
To display the summary of all columns, we can add the include argument to the describe()
method as below.
print(df.describe(include = "all"))
It shows the summary of all columns, including those with the pandas object type. For the
object-type columns, a different set of statistics is obtained. The “unique” field shows the
number of distinct values in the column. The “top” field shows the value with the highest
frequency. The “freq” field is the frequency of that most frequently occurring value.
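The sketch below illustrates the “unique”, “top” and “freq” fields on a small dataframe. The dataframe toy and its values are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
})

# Summarise every column, numeric or not
full = toy.describe(include="all")
print(full)

# Fields that apply to the non-numeric column only
print(full.loc[["unique", "top", "freq"], "Gender"])
```

Here “Gender” has 2 distinct values, the most frequent one is “Female”, and it occurs 3 times. For the numeric column “Age”, these three fields are shown as missing (NaN) because they do not apply.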
ii) Frequency Distribution of categorical variable
Apart from showing the “unique”, “top”, and “freq” fields of categorical variables, we can
show the number of observations in each category (i.e., the frequency distribution) of the
categorical variables.
For example, to examine the frequency distribution of “occupation”, the following code
can be used.
# Count the number of people in each occupation
# (the column name can be changed to any categorical variable)
o_counts = df["Occupation"].value_counts().to_frame()
# Relative frequency (RF) of each category, in %
# (the column name "count" assumes pandas 2.0 or later)
o_counts["RF (in %)"] = o_counts["count"]/o_counts["count"].sum()
o_counts["RF (in %)"] = (o_counts["RF (in %)"]*100).round(2)
print(o_counts)
From the output, we can see that “Nurse” has the highest frequency (73 or 19.84%),
followed by “Doctor” (65 or 17.66%) and “Engineer” (63 or 17.12%).
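As an alternative sketch, the relative frequency can also be obtained directly with the normalize argument of value_counts(), which avoids relying on the name of the count column. The series occupation and its values are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical occupations, used only for illustration
occupation = pd.Series(
    ["Nurse", "Nurse", "Doctor", "Engineer", "Nurse", "Doctor"],
    name="Occupation",
)

# Frequency of each category
counts = occupation.value_counts()
# Relative frequency of each category, in %
rf = (occupation.value_counts(normalize=True) * 100).round(2)

# Put both into one table
table = pd.DataFrame({"Frequency": counts, "RF (in %)": rf})
print(table)
```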
B. Data Visualization
We will see how to use Python to draw a boxplot, which shows the summary statistics of a
dataset and its distribution. We will also draw scatter plots to show the relationship between
two numeric (and continuous) variables.
i) By Boxplot
A boxplot is a graphical representation of a numeric variable. The main features that the
boxplot shows are
- the median of the data, which represents where the middle data point is;
- the upper quartile, which shows where the 75th percentile is;
- the lower quartile, which shows where the 25th percentile is; the distance between the
upper and lower quartiles is the interquartile range (IQR);
- the upper and lower extremes, which are calculated as 1.5 times the IQR above the 75th
percentile and 1.5 times the IQR below the 25th percentile, respectively (the whiskers are
drawn to the most extreme data points within these limits);
- outliers, shown as individual dots outside the upper and lower extremes.
With box plots, you can easily spot outliers, and also see the distribution and skewness of
the data.
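The quantities shown in a boxplot can be sketched numerically as below. The array values is hypothetical and is used only for illustration, with one deliberately extreme value. The names lower_fence and upper_fence are introduced here for the lower and upper extremes.

```python
import numpy as np

# Hypothetical BMI-like values; 40.0 is deliberately extreme
values = np.array([18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 40.0])

# Lower quartile, median and upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Limits beyond which a point is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

For these values the quartiles are 21, 23 and 25, so the IQR is 4 and the limits are 15 and 31. Only the value 40 lies outside the limits and would be drawn as an outlier.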
To show the boxplot of the variable “BMI” for all subjects, and the boxplots for male and
female subjects, the following code can be used.
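import numpy as np
import matplotlib.pyplot as plt
# Keep only the two columns needed for the boxplots
df2 = df[["Gender", "BMI"]]
male_data = df2.loc[df2["Gender"]=="Male", "BMI"]
female_data = df2.loc[df2["Gender"]=="Female", "BMI"]
# Boxplot of BMI for all subjects
# labels names each box; without it, the boxes are numbered 1, 2, ... on the x-axis
# (from matplotlib 3.9, the labels parameter is named tick_labels)
plt.boxplot([df2["BMI"]], labels = ["All Subjects"])
plt.ylabel("BMI")
plt.title("Boxplot of BMI")
plt.show()  # display the graph
# Boxplots of BMI by gender
plt.boxplot([male_data, female_data], labels =["Male", "Female"])
plt.xlabel("Gender")
plt.ylabel("BMI")
plt.title("Boxplots of BMI by Gender")
plt.show()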
The pyplot module in the matplotlib package is useful for plotting graphs. In this
module, the function boxplot() is used to draw a boxplot. The first parameter is the
data to be plotted (here, a list with one series of values per box), and the second
parameter, labels, gives the label shown in the graph for each box.
The xlabel(), ylabel() and title() functions are used to set the x-axis label,
y-axis label, and the title of the graph being plotted. The last function, show(), displays
the graph as specified in the previous statements.
ii) By Scatter plot
A scatter plot is used to analyse the relationship between two continuous variables. One
variable is the dependent variable (y, or the target variable), and the other is the
independent variable (x, or the predictor variable). The independent variable is used to
predict the dependent variable.
The following code can be used to draw a scatter plot to show the relationship between
“Weight” and “BMI”.
x = df["Weight (in pounds)"]
y = df["BMI"]
plt.scatter(x,y)
plt.title("Scatter Plot of Weight vs. BMI")
plt.xlabel("Weight")
plt.ylabel("BMI")
plt.show()
The scatter plot shows that there is a positive (direct) linear relationship between “Weight”
and “BMI”: subjects with a higher weight tend to have a higher BMI.
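One way to put a number on how strong a linear relationship is, offered here as a supplementary sketch and not as part of the lecture code, is the Pearson correlation coefficient, which NumPy computes with corrcoef(). The arrays weight and bmi are hypothetical and are used only for illustration.

```python
import numpy as np

# Hypothetical weights (in pounds) and BMI values, used only for illustration
weight = np.array([120.0, 140.0, 160.0, 180.0, 200.0])
bmi = np.array([20.5, 23.0, 26.5, 29.0, 33.0])

# corrcoef returns a 2 x 2 matrix; the off-diagonal entry is the
# correlation between the two variables
r = np.corrcoef(weight, bmi)[0, 1]
print(round(r, 3))
```

A value close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.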
2. The idea of binning for data visualization
Binning is a method for data pre-processing. It is performed to group values into bins. For
example, you can bin “age” into [0 to 5], [6 to 10], [11 to 15] and so on. Sometimes, binning
can improve the accuracy of a predictive model. In addition, we use data binning to group
a set of numerical values into a smaller number of bins to have a better understanding of
the data distribution.
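The age example above can be sketched with the pandas function cut(). The series ages and the label strings are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical ages, used only for illustration
ages = pd.Series([3, 5, 6, 10, 11, 14])

# The edges 0, 5, 10, 15 give the intervals [0, 5], (5, 10] and (10, 15]
age_group = pd.cut(
    ages,
    bins=[0, 5, 10, 15],
    labels=["0 to 5", "6 to 10", "11 to 15"],
    include_lowest=True,
)

# Frequency of each bin, kept in bin order
print(age_group.value_counts(sort=False))
```

Each interval includes its right edge, so an age of 5 falls into “0 to 5” and an age of 10 falls into “6 to 10”.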
For example, suppose the stress level ranges from 3 to 8. We can categorize the values into 3
bins. The following code performs binning and draws a bar chart to show the frequency
of each bin.
# To perform binning for separating stress level (numeric) into 3
# groups (categorical); the variable to be binned must be numeric
# The number of bin edges is the number of categories + 1
bins = np.linspace(df["Stress Level"].min(), df["Stress Level"].max(), 4)
group_names = ["Low", "Medium", "High"]
# By default the first bin excludes its lower edge, so the lowest value
# would be left out; include_lowest=True makes the first bin include it
df["Stress Level (3 levels)"] = pd.cut(df["Stress Level"], bins,
    labels=group_names, include_lowest=True)
# Draw a bar chart for the categorical variable - "Stress Level (3 levels)"
# sort=False keeps the bins in order (Low, Medium, High); by default,
# value_counts() sorts by frequency in descending order
freq_table = df["Stress Level (3 levels)"].value_counts(sort=False)
plt.bar(freq_table.index.astype(str), freq_table.values)
plt.xlabel("Stress level")
plt.ylabel("Frequency")
plt.title("Bar chart of Stress Level")
plt.show()
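To see what the bin edges look like and what include_lowest changes, the sketch below repeats the binning on a small series. The series stress and its values are hypothetical and are used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stress levels from 3 to 8, used only for illustration
stress = pd.Series([3, 4, 5, 6, 7, 8])

# 4 equally spaced edges from the minimum to the maximum give 3 bins
edges = np.linspace(stress.min(), stress.max(), 4)
print(edges)

names = ["Low", "Medium", "High"]
# Default: the first bin excludes its lower edge
without_lowest = pd.cut(stress, edges, labels=names)
# include_lowest=True: the first bin includes its lower edge
with_lowest = pd.cut(stress, edges, labels=names, include_lowest=True)

print(without_lowest.tolist())
print(with_lowest.tolist())
```

With the default setting, the lowest value 3 does not fall into any bin and becomes a missing value (NaN). With include_lowest=True it is placed in the “Low” bin.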
The idea of binning is also used for drawing histograms that represent the distribution of
numeric variables. To draw a histogram, we can specify the “bins” parameter of the
function hist(); if it is omitted, matplotlib uses 10 bins. The following code draws a
histogram to visualize the distribution of the “Age” variable, with 7 bins.
# To draw a histogram of age with 7 classes (7 bins)
plt.hist(df["Age"], bins=7)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of Age")
plt.show()
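The binning behind a histogram can be inspected without drawing it by using the NumPy function histogram(), which matplotlib's hist() uses to bin the data. The array ages is hypothetical and is used only for illustration.

```python
import numpy as np

# Hypothetical ages, used only for illustration
ages = np.array([27, 29, 31, 33, 35, 38, 41, 44, 48, 52, 55, 59])

# 7 equal-width bins between the minimum and the maximum age
counts, edges = np.histogram(ages, bins=7)

print(edges)   # 8 bin edges, from the minimum to the maximum
print(counts)  # number of observations in each of the 7 bins
```

Asking for 7 bins produces 8 edges, and the counts add up to the number of observations.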