STQS2223
INTRODUCTION TO
DATA SCIENCE
CH4: DATA EXPLORATION
DATA EXPLORATION
• Data exploration is a critical step in any data analysis project. It helps us
understand the dataset, identify patterns, and gain insights.
• It includes viewing summary statistics, checking for missing values, and
understanding the data types of each column in the dataset.
RECALL: BASIC STATISTICS MEASURES
• The mean (technically the arithmetic mean), a measure of central
tendency that is calculated by adding together all of the
observations and dividing by the number of observations.
• The median, another measure of central tendency, but one that
cannot be directly calculated. Instead, you make a sorted list of all of
the observations in the sample, then go halfway up that list.
Whatever the value of the observation is at the halfway point, that is
the median.
• The range, which is a measure of "dispersion" - how spread out a
bunch of numbers in a sample are - calculated by subtracting the
lowest value from the highest value
RECALL: BASIC STATISTICS MEASURES
• The mode, another measure of central tendency. The mode is the value that occurs most
often in a sample of data. Like the median, the mode cannot be directly calculated. You just
have to count up how many of each number there are and then pick the value that occurs
the most.
• The variance, a measure of dispersion. Like the range, the variance describes how spread
out a sample of numbers is. Unlike the range, though, which just uses two numbers to
calculate dispersion, the variance is obtained from all of the numbers through a simple
calculation that compares each number to the mean.
• The standard deviation, another measure of dispersion, and a cousin to the variance. The
standard deviation is simply the square root of the variance, which puts us back in regular
units like "years."
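As a quick illustration, all of these measures can be computed with Python's built-in statistics module; the ages below are a made-up sample, not data from the course:

```python
import statistics

# Hypothetical sample: ages of nine people, in years
ages = [23, 25, 25, 29, 31, 34, 35, 40, 62]

mean_age = statistics.mean(ages)        # sum of observations / number of observations
median_age = statistics.median(ages)    # middle value of the sorted list
mode_age = statistics.mode(ages)        # value that occurs most often
range_age = max(ages) - min(ages)       # highest value minus lowest value
var_age = statistics.variance(ages)     # sample variance: spread around the mean
sd_age = statistics.stdev(ages)         # square root of the variance, in "years"

print(mean_age, median_age, mode_age, range_age, sd_age)
```

Note how the standard deviation is back in the same units ("years") as the raw observations, while the variance is in squared units.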
DATA EXPLORATION
• By viewing summary statistics, checking for missing values, and
understanding the data types of each column, we can gain valuable
insights into our dataset.
• Additionally, visualizing the data helps us uncover patterns and
trends that may not be apparent from the raw numbers alone. Data
exploration is an essential step in any data analysis project, as it
forms the foundation for further analysis and modeling.
DATA EXPLORATION USING PYTHON
Step 1: Import the necessary libraries
• To get started, let’s import the required libraries into our Python environment.
We will be using the Pandas library for data manipulation and analysis.
import pandas as pd
Step 2: Load the dataset
• Next, we need to load our dataset into a Pandas DataFrame. Replace dataset.csv
with the path or URL of your dataset.
df = pd.read_csv('dataset.csv')
Step 3: View the first few rows of the dataset
• Let’s start by taking a peek at our data. By viewing the first few rows, we can get
a sense of the dataset’s structure and the information it contains.
print(df.head())
DATA EXPLORATION USING PYTHON
Step 4: Check the summary statistics of the dataset
• Summary statistics provide valuable insights into the distribution, central
tendency, and variability of our data.
print(df.describe())
Step 5: Check for missing values in the dataset
• Missing values can affect the accuracy and reliability of our analysis. Let’s
check if there are any missing values in our dataset.
print(df.isnull().sum())
Step 6: View the data types of each column
• Understanding the data types of our variables is crucial for data
manipulation and analysis. Let’s examine the data types of each column.
print(df.dtypes)
DATA EXPLORATION USING PYTHON
Step 7: Visualize the data using plots or graphs
• Data visualization provides a visual representation of our data, making it
easier to identify patterns and trends. Let’s create a histogram of a
numerical column as an example.
import matplotlib.pyplot as plt
# Replace 'column_name' with the name of the numerical column
plt.hist(df['column_name'])
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Column')
plt.show()
plt.hist(df['column_name'], bins = 25) # break the data into 25 bars
PYTHON: DATAFRAME BUILT-IN
FUNCTIONS
• .head() shows the first 5 records.
• .tail() prints the last 5 records.
• .shape (an attribute, not a method) gives the shape of the dataset as
(number of rows, number of columns).
• .describe() gives summary statistics (count, mean, std, min, quartiles,
max) of the numeric columns.
• df[10:15] selects row index 10 to 14 (rows 11 to 15). Remember that
indexing in Python starts from 0: the slice 10:15 starts at index 10 (row 11)
and stops just before index 15, so the last row selected is index 14
(row 15).
• df.iloc[10:15, 1:3] selects row index 10 to 14 (rows 11 to 15) and column
index 1 to 2 (columns 2 to 3).
• df.iloc[:, 1:3] selects all rows for column index 1 to 2
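A minimal sketch of these functions on a small made-up DataFrame (the column names A, B, C are arbitrary):

```python
import pandas as pd

# Made-up DataFrame for illustration: 20 rows, 3 columns
df = pd.DataFrame({
    "A": range(20),
    "B": [x * 2 for x in range(20)],
    "C": [x ** 2 for x in range(20)],
})

print(df.head())            # first 5 rows
print(df.tail())            # last 5 rows
print(df.shape)             # (20, 3) -- attribute, so no parentheses
print(df.describe())        # summary statistics of the numeric columns
print(df[10:15])            # row index 10 to 14 (rows 11 to 15)
print(df.iloc[10:15, 1:3])  # rows 11 to 15, columns "B" and "C"
print(df.iloc[:, 1:3])      # all rows, columns "B" and "C"
```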
PYTHON: DATA CLEANING – EMPTY
VALUES
• Remove rows
➢One way to deal with empty cells is to remove rows that
contain empty cells using the dropna() function.
➢This is OK if the data set is very big, and removing a few
rows will not have a big impact on the result.
• Replace empty values
➢This way you do not have to delete entire rows just
because of some empty cells.
➢The fillna() method allows us to replace empty cells with a
value.
• Replace only specified columns
➢To only replace empty values for one column, specify the
column name for the DataFrame.
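A short sketch of the three approaches, using a tiny made-up dataset with NaN (empty) cells:

```python
import pandas as pd
import numpy as np

# Toy dataset with empty (NaN) cells -- made up for illustration
df = pd.DataFrame({
    "Duration": [60, 60, np.nan, 45],
    "Calories": [409.1, np.nan, 340.0, 282.4],
})

dropped = df.dropna()   # remove every row that contains an empty cell
filled = df.fillna(0)   # replace every empty cell with the value 0

# Replace empty values in one specified column only, here with the column mean
df["Calories"] = df["Calories"].fillna(df["Calories"].mean())
```

Note that dropna() and fillna() return new DataFrames by default, which is why the results are assigned to new names above.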
PYTHON: DATA CLEANING – REMOVE
DUPLICATES
• Duplicate rows are rows that have been registered more
than once.
• To discover duplicates, we can use the duplicated()
method.
• To remove duplicates, use the drop_duplicates() method.
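For example (toy data; the second and third rows are identical):

```python
import pandas as pd

# Toy dataset in which one row has been registered twice
df = pd.DataFrame({"Name": ["Ali", "Siti", "Siti", "Mei"],
                   "Age": [23, 31, 31, 27]})

print(df.duplicated())     # True for every row that repeats an earlier row
df = df.drop_duplicates()  # keep only the first occurrence of each row
```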
PYTHON: DATA ANALYSIS
• We can use pandas to calculate the following statistics
from an imported CSV file:
➢Mean
➢Total sum
➢Maximum salary
➢Minimum salary
➢Count
➢Median
➢Standard deviation
➢Variance
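For example, on a hypothetical Salary column (in practice the DataFrame would come from pd.read_csv, as in Step 2 above):

```python
import pandas as pd

# Made-up salary data standing in for an imported CSV file
df = pd.DataFrame({"Salary": [2500, 3200, 4100, 3800, 2900]})

mean_salary = df["Salary"].mean()      # mean
total_salary = df["Salary"].sum()      # total sum
max_salary = df["Salary"].max()        # maximum salary
min_salary = df["Salary"].min()        # minimum salary
count_salary = df["Salary"].count()    # count of non-missing values
median_salary = df["Salary"].median()  # median
sd_salary = df["Salary"].std()         # standard deviation (sample)
var_salary = df["Salary"].var()        # variance (sample)

print(mean_salary, median_salary, sd_salary)
```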
PYTHON: DATA ANALYSIS
DATA CORRELATION
• The corr() method calculates the relationship between
each column in your data set.
➢ The result of the corr() method is a table of numbers that represent
how strong the relationship is between each pair of columns
• The corr() method ignores "not numeric" columns.
• The number varies from -1 to 1.
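A small sketch with made-up Duration/Calories data plus a non-numeric Name column. Note that recent pandas versions (1.5 and later) want the skipping of non-numeric columns stated explicitly with numeric_only=True; older versions skipped them by default:

```python
import pandas as pd

# Made-up data: two numeric columns plus one non-numeric column
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60, 45],
    "Calories": [409, 340, 250, 420, 330],
    "Name": ["a", "b", "c", "d", "e"],  # non-numeric, ignored by corr()
})

corr = df.corr(numeric_only=True)  # pairwise correlations, each from -1 to 1
print(corr)
```

Each diagonal entry is 1.0 (a column is perfectly correlated with itself); here Duration and Calories are strongly positively correlated.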
PYTHON: DATA VISUALIZATION
BASIC PLOTTING
• Pandas uses the plot() method to create diagrams.
• We can use Pyplot, a submodule of the Matplotlib library,
to display the diagram on the screen.
PYTHON: DATA VISUALIZATION
SCATTER PLOT
• Specify that you want a scatter plot with the kind
argument:
kind = 'scatter'
• A scatter plot needs an x- and a y-axis.
• In the example, we will use "Duration" for the x-axis and
"Calories" for the y-axis. Hence, we include the x and y
arguments like this:
x = 'Duration', y = 'Calories'
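Putting the arguments together on made-up workout data (the Agg backend is set here so the script also runs without a display; interactively you would call plt.show() instead of savefig):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; use plt.show() interactively
import matplotlib.pyplot as plt

# Made-up workout data mirroring the "Duration"/"Calories" example
df = pd.DataFrame({"Duration": [60, 45, 30, 60, 45, 20],
                   "Calories": [409, 340, 250, 420, 330, 150]})

# Scatter plot: Duration on the x-axis, Calories on the y-axis
ax = df.plot(kind="scatter", x="Duration", y="Calories")
plt.savefig("scatter.png")
```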
HISTOGRAM
• Specify that you want a hist plot with the kind argument:
kind = 'hist'
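For example, on a made-up Calories column:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; use plt.show() interactively
import matplotlib.pyplot as plt

# Made-up data for illustration
df = pd.DataFrame({"Calories": [409, 340, 250, 420, 330, 150, 300, 380]})

# Histogram of one column, broken into 5 bars
ax = df["Calories"].plot(kind="hist", bins=5)
plt.savefig("hist.png")
```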
EXAMPLE IN R: DATAFRAME BUILT-IN
FUNCTIONS
• head(data) shows the first 6 records.
• tail(data) prints the last 6 records.
• data[5:10, ] see row 5 to 10
• data[5:10, 2:3 ] see row 5 to 10 and column 2 to 3
• data[, 2:3 ] see all row for column 2 to 3
DATA EXPLORATION USING R
Step 1: Import the necessary libraries
• To get started, let’s import the required libraries. Some of the related libraries:
library(Hmisc) – used for missing value imputation
library(dplyr) – for data manipulation tasks
library(tidyverse) – helps to transform and better present data. It assists with
data import, tidying, manipulation, and data visualization.
Step 2: Load the dataset
• Next, we need to load our dataset. Replace data.csv with the path or URL of
your dataset if your data is placed outside the working directory.
data = read.csv("data.csv") - or paste the path
Step 3: View the first few rows of the dataset
• Let’s start by taking a peek at our data. By viewing the first few rows, we can get
a sense of the dataset’s structure and the information it contains.
head(data)
DATA EXPLORATION USING R
Step 4: Check the summary statistics of the dataset
• Summary statistics provide valuable insights into the distribution, central
tendency, and variability of our data.
summary(data)
Step 5: Check for missing values in the dataset
• Missing values can affect the accuracy and reliability of our analysis. Let’s
check if there are any missing values in our dataset.
colSums(is.na(data))
Step 6: View the data types of each column
• Understanding the data types of our variables is crucial for data
manipulation and analysis. Let’s examine the data types of each column.
str(data)
DATA EXPLORATION USING R
Step 7: Visualize the data using plots or graphs
• Data visualization provides a visual representation of our data, making it
easier to identify patterns and trends.
library(ggplot2)
plot(data)
plot(data$Calories) #to call a specific column
plot(data$Calories, type = "o") #add a line to the plot
hist(data$Calories)
hist(data$Calories, breaks = 25) #break the data into 25 bars
RANDOM NUMBER GENERATION
• Random number generation using a normal distribution is commonly used
when you want to model or simulate data that follows a bell-shaped curve,
also known as Gaussian distribution.
• This is important in many real-world scenarios because many natural
phenomena and measurement errors tend to follow a normal distribution.
• The key reason for using the normal distribution is that it is mathematically
tractable and many real-world processes naturally exhibit normality due to
the Central Limit Theorem, which states that the sum of many independent,
identically distributed random variables tends to follow a normal
distribution regardless of the original distribution of the variables.
RANDOM NUMBER GENERATION
• The normal distribution, also known as the Gaussian distribution, is a
symmetric, bell-shaped probability distribution that is defined by two
parameters:
➢ Mean (μ): This is the average or center of the distribution.
➢ Standard Deviation (σ): This controls the width or spread of the
distribution. A smaller standard deviation means the data points are
closer to the mean, while a larger standard deviation means the data
points are more spread out.
WHY GENERATE RANDOM NUMBERS
USING A NORMAL DISTRIBUTION
• When you generate random numbers using a normal distribution, you're
essentially simulating values that follow this bell curve pattern.
• This is useful because many real-world phenomena, especially those in
natural and social sciences, tend to have characteristics that resemble a
normal distribution. For example:
➢ Heights of people
➢ Measurement errors in experiments
➢ Daily temperature variations
➢ Stock market returns
EXAMPLE OF GENERATING RANDOM
NUMBERS
• Suppose we want to simulate the heights of a group of people. Let's assume
the average height (mean) is 170 cm, with a standard deviation of 10 cm. If
we assume that the heights follow a normal distribution, we can generate
random heights that fit this distribution.
• Example in Python:
We can use Python to generate random numbers that follow a normal
distribution using the numpy library:
import numpy as np
import matplotlib.pyplot as plt
EXAMPLE OF GENERATING RANDOM
NUMBERS
# Parameters for the normal distribution
mu = 170 # Mean (average height in cm)
sigma = 10 # Standard deviation (height spread in cm)
# Generate 1000 random heights that follow a normal distribution
random_heights = np.random.normal(mu, sigma, 1000)
# Plotting the histogram of generated heights to visualize the distribution
plt.hist(random_heights, bins=30, edgecolor='black', alpha=0.7)
plt.title('Random Heights Distribution')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.show()
EXAMPLE OF GENERATING RANDOM
NUMBERS
• What happens in this example:
➢ np.random.normal(mu, sigma, 1000): This generates 1,000 random
numbers from a normal distribution with a mean of 170 cm and a
standard deviation of 10 cm. These are the "heights" we are simulating.
➢ Plotting: The histogram will show how these random heights are
distributed. You'll see that the most common values are around 170 cm,
and fewer people have heights that are much smaller or larger. The
distribution will be symmetric around the mean (170 cm).
EXAMPLE OF GENERATING RANDOM
NUMBERS
• Example in R:
We can use R to generate random numbers that follow a normal
distribution using the rnorm function:
# Parameters for the normal distribution
mu <- 170 # Mean (average height in cm)
sigma <- 10 # Standard deviation (height spread in cm)
# Generate 1000 random heights that follow a normal distribution
random_heights <- rnorm(1000, mean = mu, sd = sigma)
# Plotting the histogram of generated heights to visualize the distribution
hist(random_heights, breaks = 25, main = "Random Heights Distribution",
xlab = "Height (cm)", ylab = "Frequency", col = "lightblue", border = "black")
EXAMPLE OF GENERATING RANDOM
NUMBERS
• Explanation of the Code:
➢ rnorm(1000, mean = mu, sd = sigma):
This function generates 1000 random values from a normal distribution.
The first argument 1000 specifies how many random numbers you want.
The mean = mu sets the mean to 170 cm.
The sd = sigma sets the standard deviation to 10 cm.
➢ hist():
The hist() function plots a histogram to visualize the distribution of the random
heights.
The breaks = 25 argument controls the number of bins in the histogram.
We also customize the plot with a title, axis labels, and colors.
• If you run this code in R, you’ll get a histogram that looks like a bell curve,
centered around 170 cm, with most values between 150 cm and 190 cm
(because of the standard deviation of 10 cm). The shape of the histogram
will reflect the characteristics of the normal distribution.
RECAP
• Data cleaning includes handling missing values and
removing duplicates
• Summary statistics: to describe the data
• Understand the relationship between variables:
correlation, covariance
• Histogram: gives an idea of the data distribution
(visually)
• Random number generation: simulated data helps us
understand what the distribution looks like