Exploratory Data Analysis
CSDS3202
5.1 STATISTICS
The science of statistics involves the:
Collection of data
Analysis of data
Interpretation of data
Presentation of data
5.2 DATA
Data are the actual values of variables
Qualitative Data
Qualitative data can be described in words or symbols rather than in numbers
Example: colors, blood types, etc.
Quantitative Data
Quantitative data is described by numbers
Two categories:
Discrete Data: Discrete data are quantitative data that are counted.
Continuous Data: Continuous data are quantitative data that are measured.
5.3 FREQUENCY TABLE
Frequency is the number of times that a particular result occurs.
Frequency tables are used to organize data. A basic frequency table consists of a
column of data followed by a column of frequencies.
Example:
Look at the table below. It shows the three different ages represented in a pre-school class. The table columns show the ages (3-5) and how many students there are of those three ages in that class. Notice that the data is sorted in order from smallest to largest.
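Since the table itself is not reproduced here, a minimal sketch of building such a frequency table with pandas, using hypothetical ages for the class, is:
import pandas as pd

# Hypothetical ages of the pre-school students (illustrative data only)
ages = [3, 3, 4, 4, 4, 4, 5, 5, 5]

# Count how often each age occurs and sort the ages from smallest to largest
freq_table = pd.Series(ages).value_counts().sort_index()
freq_table = freq_table.rename_axis('Age').reset_index(name='Frequency')
print(freq_table)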
5.4 DESCRIPTIVE STATISTICS
Quartiles
• Quartiles divide an ordered set (smallest to largest) of data into quarters.
• Consider the following ordered set of 17 data values: {2, 2, 3, 3.5, 4, 4, 4, 6, 7.5, 8, 8, 10, 10,
11.5, 12, 12, 12}
• The value that divides the set in halves is called the second quartile (Q2). The second quartile, Q2,
is equal to 7.5. The second quartile is also called the median and the 50th percentile.
• The lower half of the data is 2, 2, 3, 3.5, 4, 4, 4, 6. The value that divides the lower half into halves is called the first quartile (Q1). The first quartile, Q1, is between the two middle values, 3.5 and 4.
Q1 = (3.5 + 4)/2 = 3.75 [ Notice that 3.75 is not part of the data ]
• The upper half of the data is: 8, 8, 10, 10, 11.5, 12, 12, 12. The value that divides the upper half
into halves is called the third quartile (Q3). The third quartile, Q3, is between the two middle
values 10 and 11.5.
Q3 = (10 + 11.5)/2 = 10.75 [ Notice that 10.75 is not part of the data ]
DESCRIPTIVE STATISTICS…
Quartiles…
• The data that falls below Q1= 3.75 is (2, 2, 3, 3.5) and is 25% of the data. We say that
25% of the data falls below Q1 = 3.75.
• The data that is more than Q1 = 3.75 but less than Q2 = 7.5 is (4, 4, 4, 6) and is 25% of
the data. We say that 25% of the data falls between Q1 =3.75 and Q2 = 7.5.
• The data that is more than Q2 = 7.5 but less than Q3 = 10.75 (8, 8, 10, 10) is 25% of the
data. We say that 25% of the data falls between Q2 = 7.5 and Q3 = 10.75.
• The data that falls above Q3 = 10.75 (11.5,12, 12, 12) is 25% of the data. We say that
25% of the data falls above Q3 = 10.75.
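The quartile values above can be reproduced with Python's built-in statistics module, whose default "exclusive" quantile method matches this hand calculation (a minimal sketch):
import statistics

data = [2, 2, 3, 3.5, 4, 4, 4, 6, 7.5, 8, 8, 10, 10, 11.5, 12, 12, 12]

# quantiles() with n=4 returns the three cut points Q1, Q2 and Q3
q1, q2, q3 = statistics.quantiles(data, n=4)  # default method='exclusive'
print(q1, q2, q3)  # 3.75 7.5 10.75
Note that NumPy's np.percentile uses a different interpolation convention by default and may return slightly different quartile values for the same data.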
DESCRIPTIVE STATISTICS…
Percentiles
Percentiles divide an ordered set (smallest to largest) of data into
hundredths.
Consider the ordered set of the 100 numbers 1, 2, 3, 4, 5, ..., 99, 100. Ten
percent of 100 numbers is 10 numbers. The 10 numbers 1, 2, 3, 4, 5, 6, 7, 8, 9,
10 fall below the 10th percentile. This means that the 10th percentile is
between 10 and 11. The 10th percentile (10th %ile) is equal to 10.5. Similarly,
the 90th percentile (90th %ile) is equal to 90.5.
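The midpoint convention used here can be reproduced with NumPy's percentile function (a minimal sketch; the method keyword assumes NumPy 1.22 or later, where it replaced interpolation):
import numpy as np

data = np.arange(1, 101)  # the numbers 1, 2, ..., 100

# 'midpoint' averages the two neighbouring data values, matching the convention above
p10 = np.percentile(data, 10, method='midpoint')
p90 = np.percentile(data, 90, method='midpoint')
print(p10, p90)  # 10.5 90.5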
DESCRIPTIVE STATISTICS…
Mean
The mean is the same as the average. To find the mean, add all the values and divide by
the total number of values.
Example: {2, 3, 5, 6}
The mean is (2 + 3 + 5 + 6)/4 = 16/4 = 4.
The letter x with a bar over it (x̄) represents the sample mean.
Mode
The mode is the most frequent value in the set of numbers.
Example: In the data set 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95, the most
frequent value is 78. The mode = 78.
Example: In the data set 52, 53, 53, 53, 60, 67, 72,72,72, 90, both 53 and 72 occur the most
number of times (3 times each) so there are two modes, 53 and 72. We call this set of data
bimodal meaning it has two modes.
DESCRIPTIVE STATISTICS…
Median
The median is the middle value of a set of numbers that has been ordered from smallest
to largest. The upper case letter M is used for the median.
Example: A sample of statistics exam scores for 14 students is (in order from smallest
to largest) as follows: 53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93
Notice that 14 is an even number. The median is the average of the 7th and 8th values (the middle two
values): M = (76 + 78)/2 = 77.
Example: A second sample of statistics exam scores for 15 students is (in order from
smallest to largest) as follows: 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95
Notice that 15 is an odd number. The median is the 8th value (the middle value). The 8th value is
76, so the median M = 76.
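These examples can be checked with Python's built-in statistics module (a minimal sketch; multimode requires Python 3.8 or later):
import statistics

scores_14 = [53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93]
scores_15 = [52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95]

print(statistics.median(scores_14))  # 77.0, the average of the 7th and 8th values (76 and 78)
print(statistics.median(scores_15))  # 76, the middle (8th) value
print(statistics.mode(scores_15))    # 78, the most frequent value
print(statistics.multimode([52, 53, 53, 53, 60, 67, 72, 72, 72, 90]))  # [53, 72], a bimodal set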
DESCRIPTIVE STATISTICS…
Variance
The variance is the average of the squares of the deviations. A deviation is the difference
between a value and the mean and is written as x - x̄ (the value minus the sample mean).
Example: {2, 3, 5, 6} is a set of data. The sample mean is 4. The deviations are:
2 - 4 = -2
3 - 4 = -1
5 - 4 = 1
6 - 4 = 2
The deviations squared are:
(-2)² = 4
(-1)² = 1
(1)² = 1
(2)² = 4
The average of the squared deviations, dividing by n - 1 = 3 because this is a sample, is s² = (4 + 1 + 1 + 4)/3 = 10/3 ≈ 3.33
DESCRIPTIVE STATISTICS…
Standard Deviation
The standard deviation is a special average of the deviations. It measures how the data is
spread out from its mean.
The standard deviation is the square root of the variance and has the same units as the
mean. The letter s represents the sample standard deviation and the Greek
letter σ represents the population standard deviation.
Example: In the variance example above, the sample variance was s² = 3.33 (to 2 decimal
places). The sample standard deviation is s = √3.33 ≈ 1.8, rounded to one decimal place.
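A minimal sketch verifying the variance and standard deviation example with the statistics module:
import statistics

data = [2, 3, 5, 6]

print(statistics.mean(data))      # 4
print(statistics.variance(data))  # 3.33... (sample variance, divides by n - 1)
print(statistics.stdev(data))     # 1.83, i.e. 1.8 to one decimal place (square root of the variance)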
5.5.1 THE STANDARD NORMAL PROBABILITY DISTRIBUTION
Standard Normal
The standard normal distribution is a normal probability distribution of standardized
values called z-scores.
The standard normal has a mean of 0 and a standard deviation of 1. Z is commonly used
as the random variable.
Notation: Z ~ N(0, 1)
Z-Scores
The formula for a z-score is z = (x - μ)/σ,
where x is the value being standardized, μ is the mean, and σ is the standard deviation.
A z-score is measured in terms of the standard deviation.
So, if z = 2, then 2 is the standardized score for a value of X that is 2 standard deviations above
the mean (a positive z-score).
If z = -1, then -1 is the standardized score for a value of X that is 1 standard deviation below
the mean (a negative z-score).
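A minimal sketch of standardizing a set of values with NumPy (the data here are purely illustrative):
import numpy as np

# Hypothetical exam scores (illustrative data only)
x = np.array([53, 59, 63, 72, 76, 78, 84, 90])

# Standardize: subtract the mean and divide by the standard deviation
z = (x - x.mean()) / x.std()
print(z.round(2))                            # each score expressed in standard deviations from the mean
print(z.mean().round(2), z.std().round(2))   # approximately 0 and 1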
5.5.2 TYPES OF DISTRIBUTIONS
Uniform Distribution
Normal Distribution
Binomial Distribution
Bernoulli Distribution
Poisson Distribution
Exponential Distribution
UNIFORM DISTRIBUTION
NORMAL DISTRIBUTION
EXPONENTIAL DISTRIBUTION
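The figures originally shown for these three distributions are not reproduced here; as an illustrative sketch (with arbitrarily chosen parameters), their density curves can be plotted with SciPy and Matplotlib:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, expon

x = np.linspace(-4, 6, 500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, uniform.pdf(x, loc=0, scale=2))
axes[0].set_title('Uniform(0, 2)')
axes[1].plot(x, norm.pdf(x, loc=0, scale=1))
axes[1].set_title('Normal(0, 1)')
axes[2].plot(x, expon.pdf(x, scale=1))
axes[2].set_title('Exponential(rate = 1)')
plt.show()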
5.6 CORRELATION COEFFICIENT
If a scatter plot shows a possible linear relationship, then the correlation coefficient indicates how
strong the relationship is between x and y. We use the letter r for the correlation coefficient.
If r = 1 or r = -1, there is "perfect correlation." This means that the points are already in a straight
line. In the real world, perfect correlation is very unlikely to happen.
The closer r is to 1 or -1, the better the correlation between x and y because the data points are
closer to the line of best fit.
There is positive correlation if y increases as x increases, or y decreases as x decreases. If
there is positive correlation, then the line of best fit has a positive slope.
There is negative correlation if y decreases as x increases, or y increases as x decreases. If
there is negative correlation, then the line of best fit has a negative slope.
There is no correlation if the correlation coefficient is 0 (r = 0). This means there is no linear relationship
between x and y. If there is no correlation, then the slope of the line of best fit is 0.
High correlation does not necessarily mean that x causes y or y causes x.
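A minimal sketch of computing r for two hypothetical variables with NumPy (np.corrcoef returns the correlation matrix; the off-diagonal entry is r):
import numpy as np

# Hypothetical paired observations (illustrative data only)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 1, indicating a strong positive linear relationship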
CORRELATION COEFFICIENT…
Examples of scatter diagrams with different values of correlation coefficient (ρ)
5.7 DIMENSIONALITY REDUCTION
Dimensionality reduction techniques reduce the number of features in the dataset
without losing much information, while keeping (or even improving) the model's
performance
Benefits of applying dimensionality reduction to a dataset:
Space required to store the data is reduced as the number of dimensions comes
down.
Less dimensions lead to less computation/training time.
Some algorithms do not perform well when there is a large number of dimensions, so
these dimensions need to be reduced for the algorithm to be useful.
It takes care of multicollinearity by removing redundant features.
It helps in visualizing data.
DIMENSIONALITY REDUCTION…
Dimensionality reduction can be done in two different ways:
Feature Selection
Dimensionality Reduction
   Components/Factor Based
      Factor Analysis
      Principal Component Analysis (PCA)
      Singular Value Decomposition (SVD)
      Independent Component Analysis (ICA)
   Projections Based
      ISOMAP
      t-Distributed Stochastic Neighbor Embedding (t-SNE)
      Uniform Manifold Approximation and Projection (UMAP)
5.7.1 FACTOR ANALYSIS
Suppose we have two variables: Income and Education. These variables will
potentially have a high correlation as people with a higher education level tend to
have significantly higher income, and vice versa.
In the Factor Analysis technique, variables are grouped by their correlations, i.e.,
all variables in a particular group will have a high correlation among themselves,
but a low correlation with variables of other group(s). Here, each group is known
as a factor. These factors are small in number as compared to the original
dimensions of the data. However, these factors are difficult to observe.
FACTOR ANALYSIS…
Read in the training images from the Fashion-MNIST train.csv file (each row stores the pixel values of one image):
import numpy as np
import pandas as pd

train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')
Convert these images into a NumPy array format:
train_data = np.array(train, dtype='float32')

image = []
for i in range(0, 60000):
    img = train_data[i].flatten()
    image.append(img)
image = np.array(image)
FACTOR ANALYSIS…
Create a dataframe containing the pixel values of every individual pixel present in each image, and also their corresponding labels:
train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')  # Give the complete path of your train.csv file
feat_cols = ['pixel' + str(i) for i in range(image.shape[1])]
df = pd.DataFrame(image, columns=feat_cols)
df['label'] = train['label']
Decompose the dataset using Factor Analysis:
from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=3).fit_transform(df[feat_cols].values)
FACTOR ANALYSIS…
Visualize the results:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 10))
plt.title('Factor Analysis Components')
plt.scatter(fa[:, 0], fa[:, 1], c='r', s=10)
plt.scatter(fa[:, 1], fa[:, 2], c='b', s=10)
plt.scatter(fa[:, 2], fa[:, 0], c='g', s=10)
plt.legend(("First Factor", "Second Factor", "Third Factor"))
5.7.2 UNIFORM MANIFOLD APPROXIMATION AND PROJECTION (UMAP)
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction
technique that preserves much of the local structure of the data while also capturing
more of its global structure
Key advantages of UMAP are:
It can handle large datasets and high dimensional data without too much difficulty
It combines the power of visualization with the ability to reduce the dimensions of the
data
Along with preserving the local structure, it also preserves the global structure of the
data. UMAP maps nearby points on the manifold to nearby points in the low
dimensional representation, and does the same for far away points.
This method uses the concept of k-nearest neighbor and optimizes the results
using stochastic gradient descent.
UMAP…
import umap

umap_data = umap.UMAP(n_neighbors=5, min_dist=0.3, n_components=3).fit_transform(df[feat_cols][:6000].values)
Here,
n_neighbors determines the number of neighboring points used
min_dist controls how tightly UMAP is allowed to pack points together in the embedding. Larger values ensure embedded points are more evenly distributed
UMAP…
Visualize the transformation:
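The plotting code itself is not included on the slide; a minimal sketch of one way to visualize the first two UMAP components, colored by class label (assuming df and umap_data from the previous steps), is:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.title('Decomposition using UMAP')
# Color each embedded point by its Fashion-MNIST class label
plt.scatter(umap_data[:, 0], umap_data[:, 1], c=df['label'][:6000], s=10, cmap='Spectral')
plt.colorbar()
plt.show()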
5.8 FEATURE SELECTION
The process of choosing a subset of input features that contribute the most to
the output feature for use in model construction.
Important if we have datasets with high dimensionality (i.e., a large number of
features).
Helps to mitigate the problems caused by high dimensionality by selecting features that
have high importance to the model, so that the data dimensionality can be reduced
without much loss of the total information.
Benefits of feature selection are:
Reduce training time
Reduce the risk of overfitting
Potentially increase the model's performance
Reduce the model's complexity so that interpretation becomes easier
5.8.1 METHODS OF FEATURE SELECTION
Filter Methods
ANOVA F-value
Variance Threshold
Mutual Information
Wrapper Methods
Exhaustive feature selection (EFS)
Sequential forward selection (SFS)
Sequential backward selection (SBS)
Embedded Methods
Random forest
5.8.1.1 NECESSARY PYTHON LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
IRIS FLOWER DATASET FROM SCIKIT-LEARN
# Load Iris dataset from Scikit-learn
from sklearn.datasets import load_iris
# Create input and output features
feature_names = load_iris().feature_names
X_data = pd.DataFrame(load_iris().data, columns=feature_names)
y_data = load_iris().target
# Show the first five rows of the dataset
X_data.head()
5.8.1.2 ANOVA F-VALUE
ANOVA F-value method estimates the degree of linearity between the input
feature (i.e., predictor) and the output feature.
A high F-value indicates high degree of linearity and a low F-value indicates low
degree of linearity.
The main disadvantage of using ANOVA F-value is it only captures linear
relationships between input and output feature.
In other words, any non-linear relationships cannot be detected by F-value.
Scikit-learn has two functions to calculate the F-value:
f_classif, which calculates the F-value between input and output features for a classification
task
f_regression, which calculates the F-value between input and output features for a regression
task
ANOVA F-VALUE…
Use f_classif because the Iris dataset entails a classification task
# Import f_classif from Scikit-learn
from sklearn.feature_selection import f_classif
# Create f_classif object to calculate F-value
f_value = f_classif(X_data, y_data)
# Print the name and F-value of each feature
for feature in zip(feature_names, f_value[0]):
    print(feature)
ANOVA F-VALUE…
Visualize the results by creating a bar chart:
# Create a bar chart for visualizing the F-values
plt.figure(figsize=(4,4))
plt.bar(x=feature_names, height=f_value[0], color='tomato')
plt.xticks(rotation='vertical')
plt.ylabel('F-value')
plt.title('F-value Comparison')
plt.show()
5.8.1.3 EXHAUSTIVE FEATURE SELECTION (EFS)
EFS finds the best subset of features by evaluating all feature combinations.
Suppose we have a dataset with three features. EFS will evaluate the
following feature combinations:
feature_1
feature_2
feature_3
feature_1 and feature_2
feature_1 and feature_3
feature_2 and feature_3
feature_1, feature_2, and feature_3
EFS selects a subset that generates the best performance (e.g., accuracy,
precision, recall, etc.) of the model being considered.
Mlxtend provides the ExhaustiveFeatureSelector class to perform EFS.
EXHAUSTIVE FEATURE SELECTION (EFS)…
EFS has five important parameters:
estimator: the classifier that we intend to train
min_features: the minimum number of features to select
max_features: the maximum number of features to select
scoring: the metric to use to evaluate the classifier
cv: the number of cross-validations to perform
EXHAUSTIVE FEATURE SELECTION (EFS)…
# Import ExhaustiveFeatureSelector from Mlxtend
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
# Import logistic regression from Scikit-learn
from sklearn.linear_model import LogisticRegression
# Create a logistic regression classifier
lr = LogisticRegression()
# Create an EFS object
efs = EFS(estimator=lr,        # Use logistic regression as the classifier/estimator
          min_features=1,      # The minimum number of features to consider is 1
          max_features=4,      # The maximum number of features to consider is 4
          scoring='accuracy',  # The metric to use to evaluate the classifier is accuracy
          cv=5)                # The number of cross-validations to perform is 5
EXHAUSTIVE FEATURE SELECTION (EFS)…
# Train EFS with our dataset
efs = efs.fit(X_data, y_data)

# Print the results
print('Best accuracy score: %.2f' % efs.best_score_)  # best_score_ shows the best score
print('Best subset (indices):', efs.best_idx_)  # best_idx_ shows the indices of the features that yield the best score
print('Best subset (corresponding names):', efs.best_feature_names_)  # best_feature_names_ shows the feature names that yield the best score
EXHAUSTIVE FEATURE SELECTION (EFS)…
Transform the dataset into a new dataset containing only the subset of features that
generates the best score by using the transform method.
# Transform the dataset
X_data_new = efs.transform(X_data)

# Print the results
print('Number of features before transformation: {}'.format(X_data.shape[1]))
print('Number of features after transformation: {}'.format(X_data_new.shape[1]))

# Show the performance of each subset of features
efs_results = pd.DataFrame.from_dict(efs.get_metric_dict()).T
efs_results.sort_values(by='avg_score', ascending=True, inplace=True)
efs_results
EXHAUSTIVE FEATURE SELECTION (EFS)…
Visualize the performance of each subset of features by creating a horizontal bar chart:
# Create a horizontal bar chart for visualizing
# the performance of each subset of features
fig, ax = plt.subplots(figsize=(12,9))
y_pos = np.arange(len(efs_results))
ax.barh(y_pos, efs_results['avg_score'],
xerr=efs_results['std_dev'], color='tomato')
ax.set_yticks(y_pos)
ax.set_yticklabels(efs_results['feature_names'])
ax.set_xlabel('Accuracy')
plt.show()
5.8.1.4 FEATURE SELECTION USING RANDOM FOREST
Random forest is one of the most popular learning algorithms used for feature
selection in a data science workflow.
Split the dataset into training and test sets, because feature selection is part of the
training process.
Use the Gini criterion to define feature importance.
FEATURE SELECTION USING RANDOM FOREST…
# Import RandomForestClassifier from Scikit-learn
from sklearn.ensemble import RandomForestClassifier
# Import train_test_split from Scikit-learn
from sklearn.model_selection import train_test_split
# Split the dataset into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)
FEATURE SELECTION USING RANDOM FOREST…
# Create a random forest classifier
rfc = RandomForestClassifier(random_state=0, criterion='gini')  # Use gini criterion to define feature importance
# Train the classifier
rfc.fit(X_train, y_train)
# Print the name and gini importance of each feature
for feature in zip(feature_names, rfc.feature_importances_):
    print(feature)
If we add up all the importance scores, the result is 100%. As we can see, petal length and petal
width correspond to 83% of the total importance score. They are clearly the most important features!
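As a possible next step (not shown on the slide), scikit-learn's SelectFromModel can use these importance scores to keep only the most important features; the threshold below is an illustrative choice:
# Import SelectFromModel from Scikit-learn
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds an illustrative threshold of 0.15
sfm = SelectFromModel(rfc, threshold=0.15, prefit=True)
X_train_selected = sfm.transform(X_train)
print('Selected features:', [name for name, keep in zip(feature_names, sfm.get_support()) if keep])
print('Shape after selection:', X_train_selected.shape)
The reduced feature set can then be used to retrain and evaluate the classifier.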