Tung Wah College

GEN3005 / GED3005 Big Data and Data Sciences

Lecture 5: Data Exploration Using Python

Learning Outcomes:

After completing Lecture 5, you should be able to

1) use Python to obtain simple statistics of numeric and categorical variables in a
dataframe;
2) draw boxplots and scatter plots of variables for data exploration; and
3) perform binning on variables to visualize data distribution.

1. Data Exploration in Python

A. Obtaining Simple Statistics

By looking at the simple statistics of a dataset, data scientists can identify potential
issues in the data, such as the presence of outliers and large data dispersion.

i) Descriptive Statistics

# Set the options to display all columns and all rows when printing the dataframe
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

To get the simple statistics of the dataset, we can use the describe() method.

print(df.describe())

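The simple statistics displayed are the “count”, the “mean” (average), the sample standard
deviation (“std”), and the five-number summary (minimum, 25th percentile, median, 75th
percentile, and maximum). By default, the describe() method summarizes only the numeric
columns; columns that do not contain numbers are skipped.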
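
As a minimal sketch of this default behaviour, the example below calls describe() on a small, hypothetical dataframe (the name toy and its values are made up for illustration and are not the lecture's dataset).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate describe()
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
})

# By default only numeric columns are summarized
summary = toy.describe()
print(summary)
# The summary has a single column, "Age"; the text column "Gender" is skipped
```

The rows of the result are count, mean, std, min, 25%, 50%, 75% and max.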

To display the summary of all columns, we can add an argument to the describe
function as below.

print(df.describe(include = "all"))

It shows the summary of all columns, including those with the pandas object type. For the
object type columns, a different set of statistics is obtained. The “unique” field shows the
number of distinct values in the column. The “top” field shows the value with the highest
frequency. The “freq” field is the frequency of that most frequently occurring value.
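
The sketch below shows these three fields on a small, hypothetical dataframe (toy and its values are assumptions made for illustration only).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate describe(include="all")
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
})

full = toy.describe(include="all")
print(full)
# For "Gender": unique = 2, top = "Female", freq = 3
# Statistics that do not apply to a column (e.g. the mean of "Gender") are NaN
```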

ii) Frequency Distribution of categorical variable

Apart from showing the “unique”, “top”, and “freq” fields of categorical variables, we can
show the number of observations in each category (i.e., the frequency distribution) of the
categorical variables.

For example, to examine the frequency distribution of “Occupation”, the following code
can be used.

# Count the number of people in each occupation
# (the column name can be changed to any categorical variable)
# Note: the counts column is named "count" in pandas 2.0 or later
o_counts = df["Occupation"].value_counts().to_frame()
# Relative frequency (RF) of each occupation, in %, rounded to 2 decimal places
o_counts["RF (in %)"] = o_counts["count"]/o_counts["count"].sum()
o_counts["RF (in %)"] = (o_counts["RF (in %)"]*100).round(2)
print(o_counts)

From the output, we can see that “Nurse” has the highest frequency (73 or 19.84%),
followed by “Doctor” (65 or 17.66%) and “Engineer” (63 or 17.12%).
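
As an alternative sketch, value_counts() can return the relative frequencies directly through its normalize argument. The series jobs below is hypothetical toy data, not the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate relative frequencies
jobs = pd.Series(
    ["Nurse", "Doctor", "Nurse", "Engineer", "Nurse", "Doctor"],
    name="Occupation",
)

counts = jobs.value_counts()                             # absolute frequencies
rf = (jobs.value_counts(normalize=True) * 100).round(2)  # relative frequencies in %

# Combine both into one frequency table
table = pd.DataFrame({"count": counts, "RF (in %)": rf})
print(table)
```

With this toy data, “Nurse” has a count of 3, which is half of the six observations.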

B. Data Visualization

We will see how to use Python to draw a boxplot, which shows the summary statistics of a
variable and its distribution. We will also draw scatter plots to show the relationship between
two numeric (and continuous) variables.

i) By Boxplot

A boxplot is a graphical representation of a numeric variable. The main features that the
boxplot shows are
- the median of the data, which represents where the middle data point is;
- the upper quartile, which shows where the 75th percentile is;
- the lower quartile, which shows where the 25th percentile is; the distance between the
upper and lower quartiles is the interquartile range (IQR);
- the lower and upper extremes (the whiskers), which by default extend to the farthest data
points lying within 1.5 times the IQR below the 25th percentile and within 1.5 times the IQR
above the 75th percentile;
- outliers, drawn as individual dots, which lie outside the lower and upper extremes.

With box plots, you can easily spot outliers, and also see the distribution and skewness of
the data.
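
The quantities listed above can also be computed by hand. The following is a minimal sketch using NumPy on hypothetical values (the array values and the names lower_limit and upper_limit are made up for illustration).

```python
import numpy as np

# Hypothetical measurements, used only to illustrate the boxplot statistics
values = np.array([18, 20, 21, 22, 23, 24, 25, 26, 40])

# Lower quartile, median, and upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Limits at 1.5 times the IQR below Q1 and above Q3
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are treated as outliers
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, median, q3, iqr, lower_limit, upper_limit, outliers)
```

Here the quartiles are 21, 23 and 25, so the IQR is 4 and the limits are 15 and 31; the value 40 is the only outlier.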
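To show the boxplot of the variable “BMI” for all subjects, and the boxplots for male and
female subjects, the following code can be used.

import numpy as np
import matplotlib.pyplot as plt

# Keep only the columns needed for the boxplots
df2 = df[["Gender", "BMI"]]

# BMI values of the male and the female subjects
male_data = df2.loc[df2["Gender"]=="Male", "BMI"]
female_data = df2.loc[df2["Gender"]=="Female", "BMI"]

# Boxplot of BMI for all subjects (the plotted variable must be numeric)
# 1st parameter: the data; 2nd parameter: the label of each box
# (Matplotlib 3.9+ names this parameter tick_labels; labels is deprecated there)
plt.boxplot([df2["BMI"]], labels = ["All Subjects"])
plt.ylabel("BMI")
plt.title("Boxplot of BMI")
plt.show()  # display the graph

# Boxplots of BMI by gender
plt.boxplot([male_data, female_data], labels = ["Male", "Female"])
plt.xlabel("Gender")
plt.ylabel("BMI")
plt.title("Boxplots of BMI by Gender")
plt.show()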

The pyplot module in the matplotlib package is useful for graph plotting. In this
module, the function boxplot() is used to draw a boxplot. The first parameter is a list of
the data series to be plotted, and the second parameter is the list of labels shown in the
graph, one for each data series.

The xlabel(), ylabel() and title() functions are used to set the x-axis label, the
y-axis label, and the title of the graph being plotted. The last function, show(), displays
the graph as specified in the previous statements.

ii) By Scatter plot

A scatter plot is used to analyse the relationship between two continuous variables. One
variable is the dependent variable (y, or the target variable), and the other variable is the
independent variable (x, or the predictor variable). The independent variable is used to
predict the dependent variable.

The following code can be used to draw a scatter plot to show the relationship between
“Weight” and “BMI”.

# Both variables must be numeric
x = df["Weight (in pounds)"]  # independent variable
y = df["BMI"]                 # dependent variable
plt.scatter(x,y)
plt.title("Scatter Plot of Weight vs. BMI")
plt.xlabel("Weight")
plt.ylabel("BMI")
plt.show()

The scatter plot shows a direct (positive) linear relationship between “Weight” and
“BMI”: subjects with a higher weight tend to have a higher BMI.
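
A relationship seen in a scatter plot can be checked numerically with the Pearson correlation coefficient. The sketch below is an illustration only: the weight and bmi values are hypothetical and are not taken from the lecture's dataset.

```python
import numpy as np

# Hypothetical paired observations, used only for illustration
weight = np.array([120, 140, 160, 180, 200])   # in pounds
bmi = np.array([19.5, 22.0, 25.1, 27.9, 31.0])

# np.corrcoef returns the 2 x 2 correlation matrix;
# the off-diagonal entry is the correlation between weight and bmi
r = np.corrcoef(weight, bmi)[0, 1]
print(round(r, 4))
```

A value of r close to +1 indicates a strong direct linear relationship, a value close to -1 a strong inverse one, and a value near 0 little or no linear relationship.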

2. The idea of binning for data visualization

Binning is a method for data pre-processing. It groups numeric values into bins. For
example, you can bin “age” into [0 to 5], [6 to 10], [11 to 15] and so on. Sometimes, binning
can improve the accuracy of a predictive model. In addition, we use data binning to group
a set of numerical values into a smaller number of bins to get a better understanding of
the data distribution.

For example, suppose the stress level ranges from 3 to 8. We can categorize the values into 3
bins. The following code performs binning and draws a bar chart to show the frequency
of each bin.

# To perform binning for separating stress level (numeric) into 3
# groups (categorical); the variable to be binned must be numeric
# The number of bin edges = number of categories + 1
bins = np.linspace(df["Stress Level"].min(), df["Stress Level"].max(), 4)
group_names = ["Low", "Medium", "High"]
# By default the lowest value is left out of the first bin;
# include_lowest=True makes the first bin include it
df["Stress Level (3 levels)"] = pd.cut(df["Stress Level"], bins,
                                       labels=group_names, include_lowest=True)

# Draw a bar chart for the categorical variable - "Stress Level (3 levels)"
# sort=False keeps the bins in the order Low, Medium, High;
# otherwise they are sorted by frequency
freq_table = df["Stress Level (3 levels)"].value_counts(sort=False)
plt.bar(freq_table.index.astype(str), freq_table.values)
plt.xlabel("Stress level")
plt.ylabel("Frequency")
plt.title("Bar chart of Stress Level")
plt.show()
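
To see what the binning step does to individual values, the sketch below applies the same idea to a small, hypothetical series (stress and its values are made up for illustration) and compares the result with and without include_lowest.

```python
import numpy as np
import pandas as pd

# Hypothetical stress levels ranging from 3 to 8
stress = pd.Series([3, 4, 5, 6, 7, 8])

# 4 equally spaced edges give 3 bins: about 3 to 4.67, 4.67 to 6.33, 6.33 to 8
edges = np.linspace(stress.min(), stress.max(), 4)
names = ["Low", "Medium", "High"]

levels = pd.cut(stress, edges, labels=names, include_lowest=True)
no_lowest = pd.cut(stress, edges, labels=names)  # default: include_lowest=False

print(levels.tolist())     # the minimum value 3 falls in "Low"
print(no_lowest.tolist())  # the minimum value 3 is left unbinned (NaN)
print(levels.value_counts(sort=False))
```

Each bin is open on the left and closed on the right, which is why the minimum value is excluded unless include_lowest=True is given.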

The idea of binning is also used for drawing histograms, which represent the distribution of
numeric variables. To draw a histogram, we can specify the number of bins with the “bins”
parameter of the method hist(). The following code draws a histogram to visualize the
distribution of the “Age” variable, with 7 bins.

# To draw a histogram of age with 7 classes (7 bins)

plt.hist(df["Age"], bins=7)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of Age")
plt.show()
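
The bin edges and frequencies behind a histogram can be inspected with NumPy's histogram() function, which splits the range of the data into equal-width bins. The ages below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical ages, used only to illustrate equal-width binning
ages = np.array([27, 29, 30, 32, 35, 36, 38, 41,
                 43, 44, 45, 49, 50, 53, 59, 62])

# 7 equal-width bins between the minimum (27) and the maximum (62)
counts, edges = np.histogram(ages, bins=7)

print(edges)   # 8 edges, each bin is 5 years wide
print(counts)  # number of observations in each bin
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both, so the maximum value is always counted.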
