STQS2223
INTRODUCTION TO
DATA SCIENCE
CH4: DATA EXPLORATION
DATA EXPLORATION
• Data exploration is a critical step in any data analysis project. It helps us
understand the dataset, identify patterns, and gain insights.
• It includes viewing summary statistics, checking for missing values, and
understanding the data types of each column in the dataset.
RECALL: BASIC STATISTICS MEASURES
• The mean (technically the arithmetic mean), a measure of central
tendency that is calculated by adding together all of the
observations and dividing by the number of observations.
• The median, another measure of central tendency, but one that
cannot be directly calculated. Instead, you make a sorted list of all of
the observations in the sample, then go halfway up that list.
Whatever the value of the observation is at the halfway point, that is
the median.
• The range, which is a measure of "dispersion" - how spread out a
bunch of numbers in a sample are - calculated by subtracting the
lowest value from the highest value
RECALL: BASIC STATISTICS MEASURES
• The mode, another measure of central tendency. The mode is the value that occurs most
often in a sample of data. Like the median, the mode cannot be directly calculated. You just
have to count up how many of each number there are and then pick the value that occurs
the most.
• The variance, a measure of dispersion. Like the range, the variance describes how spread
out a sample of numbers is. Unlike the range, though, which just uses two numbers to
calculate dispersion, the variance is obtained from all of the numbers through a simple
calculation that compares each number to the mean.
• The standard deviation, another measure of dispersion, and a cousin to the variance. The
standard deviation is simply the square root of the variance, which puts us back in regular
units like "years."
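As a quick illustration, all of these measures can be computed with Python's built-in statistics module; the ages below are a made-up sample, not data from the course:

```python
import statistics

# Hypothetical sample: ages of nine people, in years
ages = [23, 25, 25, 29, 31, 34, 35, 40, 62]

mean_age = statistics.mean(ages)        # sum of observations / number of observations
median_age = statistics.median(ages)    # middle value of the sorted list
mode_age = statistics.mode(ages)        # value that occurs most often
range_age = max(ages) - min(ages)       # highest value minus lowest value
var_age = statistics.variance(ages)     # sample variance: spread around the mean
sd_age = statistics.stdev(ages)         # square root of the variance, in "years"

print(mean_age, median_age, mode_age, range_age, sd_age)
```

Note how the standard deviation is back in the same units ("years") as the raw observations, while the variance is in squared units.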
DATA EXPLORATION
• By viewing summary statistics, checking for missing values, and
understanding the data types of each column, we can gain valuable
insights into our dataset.
• Additionally, visualizing the data helps us uncover patterns and
trends that may not be apparent from the raw numbers alone. Data
exploration is an essential step in any data analysis project, as it
forms the foundation for further analysis and modeling.
DATA EXPLORATION USING PYTHON
Step 1: Import the necessary libraries
• To get started, let’s import the required libraries into our Python environment.
We will be using the Pandas library for data manipulation and analysis.
import pandas as pd
Step 2: Load the dataset
• Next, we need to load our dataset into a Pandas DataFrame. Replace dataset.csv
with the path or URL of your dataset.
df = pd.read_csv('dataset.csv')
Step 3: View the first few rows of the dataset
• Let’s start by taking a peek at our data. By viewing the first few rows, we can get
a sense of the dataset’s structure and the information it contains.
print(df.head())
DATA EXPLORATION USING PYTHON
Step 4: Check the summary statistics of the dataset
• Summary statistics provide valuable insights into the distribution, central
tendency, and variability of our data.
print(df.describe())
Step 5: Check for missing values in the dataset
• Missing values can affect the accuracy and reliability of our analysis. Let’s
check if there are any missing values in our dataset.
print(df.isnull().sum())
Step 6: View the data types of each column
• Understanding the data types of our variables is crucial for data
manipulation and analysis. Let’s examine the data types of each column.
print(df.dtypes)
DATA EXPLORATION USING PYTHON
Step 7: Visualize the data using plots or graphs
• Data visualization provides a visual representation of our data, making it
easier to identify patterns and trends. Let’s create a histogram of a
numerical column as an example.
import matplotlib.pyplot as plt
# Replace 'column_name' with the name of the numerical column
plt.hist(df['column_name'])
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Column')
plt.show()
plt.hist(df['column_name'], bins = 25) # break the data into 25 bars
PYTHON: DATAFRAME BUILT-IN
FUNCTIONS
• .head() shows the first 5 records.
• .tail() prints the last 5 records.
• .shape (an attribute, not a method) gives the shape of the dataset as
(number of rows, number of columns).
• .describe() gives summary statistics (count, mean, std, min, quartiles,
max) of the numeric columns.
• df[10:15] selects row index 10 to 14 (rows 11 to 15). Remember that
indexing in Python starts from 0: the slice 10:15 starts at index 10 (row 11)
and stops just before index 15, so the last row selected is index 14
(row 15).
• df.iloc[10:15, 1:3] selects row index 10 to 14 (rows 11 to 15) and column
index 1 to 2 (columns 2 to 3).
• df.iloc[:, 1:3] selects all rows for column index 1 to 2
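A minimal sketch of these functions on a small made-up DataFrame (the column names A, B, C are arbitrary):

```python
import pandas as pd

# Made-up DataFrame for illustration: 20 rows, 3 columns
df = pd.DataFrame({
    "A": range(20),
    "B": [x * 2 for x in range(20)],
    "C": [x ** 2 for x in range(20)],
})

print(df.head())            # first 5 rows
print(df.tail())            # last 5 rows
print(df.shape)             # (20, 3) -- attribute, so no parentheses
print(df.describe())        # summary statistics of the numeric columns
print(df[10:15])            # row index 10 to 14 (rows 11 to 15)
print(df.iloc[10:15, 1:3])  # rows 11 to 15, columns "B" and "C"
print(df.iloc[:, 1:3])      # all rows, columns "B" and "C"
```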
PYTHON: DATA CLEANING – EMPTY
VALUES
• Remove rows
➢One way to deal with empty cells is to remove rows that
contain empty cells using the dropna() function.
➢This is OK if the data set is very big, and removing a few
rows will not have a big impact on the result.
• Replace empty values
➢This way you do not have to delete entire rows just
because of some empty cells.
➢The fillna() method allows us to replace empty cells with a
value.
• Replace only specified columns
➢To only replace empty values for one column, specify the
column name for the DataFrame.
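A short sketch of the three approaches, using a tiny made-up dataset with NaN (empty) cells:

```python
import pandas as pd
import numpy as np

# Toy dataset with empty (NaN) cells -- made up for illustration
df = pd.DataFrame({
    "Duration": [60, 60, np.nan, 45],
    "Calories": [409.1, np.nan, 340.0, 282.4],
})

dropped = df.dropna()   # remove every row that contains an empty cell
filled = df.fillna(0)   # replace every empty cell with the value 0

# Replace empty values in one specified column only, here with the column mean
df["Calories"] = df["Calories"].fillna(df["Calories"].mean())
```

Note that dropna() and fillna() return new DataFrames by default, which is why the results are assigned to new names above.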
PYTHON: DATA CLEANING – REMOVE
DUPLICATES
• Duplicate rows are rows that have been registered more
than once.
• To discover duplicates, we can use the duplicated()
method.
• To remove duplicates, use the drop_duplicates() method.
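For example (toy data; the second and third rows are identical):

```python
import pandas as pd

# Toy dataset in which one row has been registered twice
df = pd.DataFrame({"Name": ["Ali", "Siti", "Siti", "Mei"],
                   "Age": [23, 31, 31, 27]})

print(df.duplicated())     # True for every row that repeats an earlier row
df = df.drop_duplicates()  # keep only the first occurrence of each row
```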
PYTHON: DATA ANALYSIS
• We can use pandas to calculate the following statistics
from an imported CSV file:
➢Mean
➢Total sum
➢Maximum salary
➢Minimum salary
➢Count
➢Median
➢Standard deviation
➢Variance
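For example, on a hypothetical Salary column (in practice the DataFrame would come from pd.read_csv, as in Step 2 above):

```python
import pandas as pd

# Made-up salary data standing in for an imported CSV file
df = pd.DataFrame({"Salary": [2500, 3200, 4100, 3800, 2900]})

mean_salary = df["Salary"].mean()      # mean
total_salary = df["Salary"].sum()      # total sum
max_salary = df["Salary"].max()        # maximum salary
min_salary = df["Salary"].min()        # minimum salary
count_salary = df["Salary"].count()    # count of non-missing values
median_salary = df["Salary"].median()  # median
sd_salary = df["Salary"].std()         # standard deviation (sample)
var_salary = df["Salary"].var()        # variance (sample)

print(mean_salary, median_salary, sd_salary)
```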
PYTHON: DATA ANALYSIS
DATA CORRELATION
• The corr() method calculates the relationship between
each column in your data set.
➢ The result of the corr() method is a table of numbers that represent
how strong the relationship is between each pair of columns
• The corr() method ignores "not numeric" columns.
• The number varies from -1 to 1.
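A small sketch with made-up Duration/Calories data plus a non-numeric Name column. Note that recent pandas versions (1.5 and later) want the skipping of non-numeric columns stated explicitly with numeric_only=True; older versions skipped them by default:

```python
import pandas as pd

# Made-up data: two numeric columns plus one non-numeric column
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60, 45],
    "Calories": [409, 340, 250, 420, 330],
    "Name": ["a", "b", "c", "d", "e"],  # non-numeric, ignored by corr()
})

corr = df.corr(numeric_only=True)  # pairwise correlations, each from -1 to 1
print(corr)
```

Each diagonal entry is 1.0 (a column is perfectly correlated with itself); here Duration and Calories are strongly positively correlated.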
PYTHON: DATA VISUALIZATION
BASIC PLOTTING
• Pandas uses the plot() method to create diagrams.
• We can use Pyplot, a submodule of the Matplotlib library,
to display the diagram on the screen.
PYTHON: DATA VISUALIZATION
SCATTER PLOT
• Specify that you want a scatter plot with the kind
argument:
kind = 'scatter'
• A scatter plot needs an x- and a y-axis.
• In the example, we will use "Duration" for the x-axis and
"Calories" for the y-axis. Hence, we include the x and y
arguments like this:
x = 'Duration', y = 'Calories'
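Putting the arguments together on made-up workout data (the Agg backend is set here so the script also runs without a display; interactively you would call plt.show() instead of savefig):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; use plt.show() interactively
import matplotlib.pyplot as plt

# Made-up workout data mirroring the "Duration"/"Calories" example
df = pd.DataFrame({"Duration": [60, 45, 30, 60, 45, 20],
                   "Calories": [409, 340, 250, 420, 330, 150]})

# Scatter plot: Duration on the x-axis, Calories on the y-axis
ax = df.plot(kind="scatter", x="Duration", y="Calories")
plt.savefig("scatter.png")
```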
HISTOGRAM
• Specify that you want a hist plot with the kind argument:
kind = 'hist'
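For example, on a made-up Calories column:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; use plt.show() interactively
import matplotlib.pyplot as plt

# Made-up data for illustration
df = pd.DataFrame({"Calories": [409, 340, 250, 420, 330, 150, 300, 380]})

# Histogram of one column, broken into 5 bars
ax = df["Calories"].plot(kind="hist", bins=5)
plt.savefig("hist.png")
```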
EXAMPLE IN R: DATAFRAME BUILT-IN
FUNCTIONS
• head(data) shows the first 6 records.
• tail(data) prints the last 6 records.
• data[5:10, ] see row 5 to 10
• data[5:10, 2:3 ] see row 5 to 10 and column 2 to 3
• data[, 2:3 ] see all row for column 2 to 3
DATA EXPLORATION USING R
Step 1: Import the necessary libraries
• To get started, let’s import the required libraries. Some of the related libraries:
library(Hmisc) – used for missing value imputation
library(dplyr) – for data manipulation tasks
library(tidyverse) – helps to transform and better present data. It assists with
data import, tidying, manipulation, and data visualization.
Step 2: Load the dataset
• Next, we need to load our dataset. Replace data.csv with the path or URL of
your dataset if your data is placed outside the working directory.
data = read.csv("data.csv") - or paste the path
Step 3: View the first few rows of the dataset
• Let’s start by taking a peek at our data. By viewing the first few rows, we can get
a sense of the dataset’s structure and the information it contains.
head(data)
DATA EXPLORATION USING R
Step 4: Check the summary statistics of the dataset
• Summary statistics provide valuable insights into the distribution, central
tendency, and variability of our data.
summary(data)
Step 5: Check for missing values in the dataset
• Missing values can affect the accuracy and reliability of our analysis. Let’s
check if there are any missing values in our dataset.
colSums(is.na(data))
Step 6: View the data types of each column
• Understanding the data types of our variables is crucial for data
manipulation and analysis. Let’s examine the data types of each column.
str(data)
DATA EXPLORATION USING R
Step 7: Visualize the data using plots or graphs
• Data visualization provides a visual representation of our data, making it
easier to identify patterns and trends.
library(ggplot2)
plot(data)
plot(data$Calories) #to call a specific column
plot(data$Calories, type = "o") #add a line to the plot
hist(data$Calories)
hist(data$Calories, breaks = 25) #break the data into 25 bars
RANDOM NUMBER GENERATION
• Random number generation using a normal distribution is commonly used
when you want to model or simulate data that follows a bell-shaped curve,
also known as Gaussian distribution.
• This is important in many real-world scenarios because many natural
phenomena and measurement errors tend to follow a normal distribution.
• The key reason for using the normal distribution is that it is mathematically
tractable and many real-world processes naturally exhibit normality due to
the Central Limit Theorem, which states that the sum of many independent,
identically distributed random variables tends to follow a normal
distribution regardless of the original distribution of the variables.
RANDOM NUMBER GENERATION
• The normal distribution, also known as the Gaussian distribution, is a
symmetric, bell-shaped probability distribution that is defined by two
parameters:
➢ Mean (μ): This is the average or center of the distribution.
➢ Standard Deviation (σ): This controls the width or spread of the
distribution. A smaller standard deviation means the data points are
closer to the mean, while a larger standard deviation means the data
points are more spread out.
WHY GENERATE RANDOM NUMBERS
USING A NORMAL DISTRIBUTION
• When you generate random numbers using a normal distribution, you're
essentially simulating values that follow this bell curve pattern.
• This is useful because many real-world phenomena, especially those in
natural and social sciences, tend to have characteristics that resemble a
normal distribution. For example:
➢ Heights of people
➢ Measurement errors in experiments
➢ Daily temperature variations
➢ Stock market returns
EXAMPLE OF GENERATING RANDOM
NUMBERS
• Suppose we want to simulate the heights of a group of people. Let's assume
the average height (mean) is 170 cm, with a standard deviation of 10 cm. If
we assume that the heights follow a normal distribution, we can generate
random heights that fit this distribution.
• Example in Python:
We can use Python to generate random numbers that follow a normal
distribution using the numpy library:
import numpy as np
import matplotlib.pyplot as plt
EXAMPLE OF GENERATING RANDOM
NUMBERS
# Parameters for the normal distribution
mu = 170 # Mean (average height in cm)
sigma = 10 # Standard deviation (height spread in cm)
# Generate 1000 random heights that follow a normal distribution
random_heights = np.random.normal(mu, sigma, 1000)
# Plotting the histogram of generated heights to visualize the distribution
plt.hist(random_heights, bins=30, edgecolor='black', alpha=0.7)
plt.title('Random Heights Distribution')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.show()
EXAMPLE OF GENERATING RANDOM
NUMBERS
• What happens in this example:
➢ np.random.normal(mu, sigma, 1000): This generates 1,000 random
numbers from a normal distribution with a mean of 170 cm and a
standard deviation of 10 cm. These are the "heights" we are simulating.
➢ Plotting: The histogram will show how these random heights are
distributed. You'll see that the most common values are around 170 cm,
and fewer people have heights that are much smaller or larger. The
distribution will be symmetric around the mean (170 cm).
EXAMPLE OF GENERATING RANDOM
NUMBERS
• Example in R:
We can use R to generate random numbers that follow a normal
distribution using the rnorm function:
# Parameters for the normal distribution
mu <- 170 # Mean (average height in cm)
sigma <- 10 # Standard deviation (height spread in cm)
# Generate 1000 random heights that follow a normal distribution
random_heights <- rnorm(1000, mean = mu, sd = sigma)
# Plotting the histogram of generated heights to visualize the distribution
hist(random_heights, breaks = 25, main = "Random Heights Distribution",
xlab = "Height (cm)", ylab = "Frequency", col = "lightblue", border = "black")
EXAMPLE OF GENERATING RANDOM
NUMBERS
• Explanation of the Code:
➢ rnorm(1000, mean = mu, sd = sigma):
This function generates 1000 random values from a normal distribution.
The first argument 1000 specifies how many random numbers you want.
The mean = mu sets the mean to 170 cm.
The sd = sigma sets the standard deviation to 10 cm.
➢ hist():
The hist() function plots a histogram to visualize the distribution of the random
heights.
The breaks = 25 argument controls the number of bins in the histogram.
We also customize the plot with a title, axis labels, and colors.
• If you run this code in R, you’ll get a histogram that looks like a bell curve,
centered around 170 cm, with most values between 150 cm and 190 cm
(because of the standard deviation of 10 cm). The shape of the histogram
will reflect the characteristics of the normal distribution.
RECAP
• Data cleaning includes handling missing values and
removing duplicates
• Summary statistics: to describe the data
• Understand the relationship between variables:
correlation, covariance
• Histogram: gives an idea of the data distribution
(visually)
• Random number generation: simulated data helps us
understand what the distribution looks like