Exploratory Data Analysis
CSDS3202
5.1 STATISTICS
The science of statistics involves the:
Collection of data
Analysis of data
Interpretation of data
Presentation of data
5.2 DATA
Data are the actual values of variables
Qualitative Data
Qualitative data can be described in words or symbols rather than in numbers
Example: colors, blood types, etc.
Quantitative Data
Quantitative data is described by numbers
Two categories:
Discrete Data: Discrete data are quantitative data that are counted.
Continuous Data: Continuous data are quantitative data that are measured.
5.3 FREQUENCY TABLE
Frequency is the number of times that a particular result occurs.
Frequency tables are used to organize data. A basic frequency table consists of a
column of data followed by a column of frequencies.
Example:
Look at the table below. It shows the three different ages represented in a pre-school class. The table columns show the ages (3-5) and how many students there are of those three ages in that class. Notice that the data is sorted in order from smallest to largest.
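Since the table itself is not reproduced here, a minimal sketch of building such a frequency table with pandas, using hypothetical ages for the class, is:
import pandas as pd

# Hypothetical ages of the pre-school students (illustrative data only)
ages = [3, 3, 4, 4, 4, 4, 5, 5, 5]

# Count how often each age occurs and sort the ages from smallest to largest
freq_table = pd.Series(ages).value_counts().sort_index()
freq_table = freq_table.rename_axis('Age').reset_index(name='Frequency')
print(freq_table)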
5.4 DESCRIPTIVE STATISTICS
Quartiles
• Quartiles divide an ordered set (smallest to largest) of data into quarters.
• Consider the following ordered set of 17 data values: {2, 2, 3, 3.5, 4, 4, 4, 6, 7.5, 8, 8, 10, 10,
11.5, 12, 12, 12}
• The value that divides the set in halves is called the second quartile (Q2). The second quartile, Q2,
is equal to 7.5. The second quartile is also called the median and the 50th percentile.
• The lower half of the data is 2, 2, 3, 3.5, 4, 4, 4, 6. The value that divides the lower half into halves is called the first quartile (Q1). The first quartile, Q1, is between the two middle values, 3.5 and 4.
Q1 = (3.5 + 4)/2 = 3.75 [ Notice that 3.75 is not part of the data ]
• The upper half of the data is: 8, 8, 10, 10, 11.5, 12, 12, 12. The value that divides the upper half
into halves is called the third quartile (Q3). The third quartile, Q3, is between the two middle
values 10 and 11.5.
Q3 = (10 + 11.5)/2 = 10.75 [ Notice that 10.75 is not part of the data ]
DESCRIPTIVE STATISTICS…
Quartiles…
• The data that falls below Q1= 3.75 is (2, 2, 3, 3.5) and is 25% of the data. We say that
25% of the data falls below Q1 = 3.75.
• The data that is more than Q1 = 3.75 but less than Q2 = 7.5 is (4, 4, 4, 6) and is 25% of
the data. We say that 25% of the data falls between Q1 =3.75 and Q2 = 7.5.
• The data that is more than Q2 = 7.5 but less than Q3 = 10.75 (8, 8, 10, 10) is 25% of the
data. We say that 25% of the data falls between Q2 = 7.5 and Q3 = 10.75.
• The data that falls above Q3 = 10.75 (11.5,12, 12, 12) is 25% of the data. We say that
25% of the data falls above Q3 = 10.75.
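The quartile values above can be reproduced with Python's built-in statistics module, whose default "exclusive" quantile method matches this hand calculation (a minimal sketch):
import statistics

data = [2, 2, 3, 3.5, 4, 4, 4, 6, 7.5, 8, 8, 10, 10, 11.5, 12, 12, 12]

# quantiles() with n=4 returns the three cut points Q1, Q2 and Q3
q1, q2, q3 = statistics.quantiles(data, n=4)  # default method='exclusive'
print(q1, q2, q3)  # 3.75 7.5 10.75
Note that NumPy's np.percentile uses a different interpolation convention by default and may return slightly different quartile values for the same data.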
DESCRIPTIVE STATISTICS…
Percentiles
Percentiles divide an ordered set (smallest to largest) of data into
hundredths.
Consider the ordered set of the 100 numbers 1, 2, 3, 4, 5, ..., 99, 100. Ten
percent of 100 numbers is 10 numbers. The 10 numbers 1, 2, 3, 4, 5, 6, 7, 8, 9,
10 fall below the 10th percentile. This means that the 10th percentile is
between 10 and 11. The 10th percentile (10th %ile) is equal to 10.5. Similarly,
the 90th percentile (90th %ile) is equal to 90.5.
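The midpoint convention used here can be reproduced with NumPy's percentile function (a minimal sketch; the method keyword assumes NumPy 1.22 or later, where it replaced interpolation):
import numpy as np

data = np.arange(1, 101)  # the numbers 1, 2, ..., 100

# 'midpoint' averages the two neighbouring data values, matching the convention above
p10 = np.percentile(data, 10, method='midpoint')
p90 = np.percentile(data, 90, method='midpoint')
print(p10, p90)  # 10.5 90.5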
DESCRIPTIVE STATISTICS…
Mean
The mean is the same as the average. To find the mean, add all the values and divide by
the total number of values.
Example: {2, 3, 5, 6}
The mean is (2 + 3 + 5 + 6)/4 = 16/4 = 4.
The letter x with a bar over it (x̄) represents the sample mean.
Mode
The mode is the most frequent value in the set of numbers.
Example: In the data set 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95, the most
frequent value is 78. The mode = 78.
Example: In the data set 52, 53, 53, 53, 60, 67, 72,72,72, 90, both 53 and 72 occur the most
number of times (3 times each) so there are two modes, 53 and 72. We call this set of data
bimodal meaning it has two modes.
DESCRIPTIVE STATISTICS…
Median
The median is the middle value of a set of numbers that has been ordered from smallest
to largest. The upper case letter M is used for the median.
Example: A sample of statistics exam scores for 14 students is (in order from smallest
to largest) as follows: 53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93
Notice that 14 is an even number. The median is the average of the 7th and 8th values (the middle two
values): M = (76 + 78)/2 = 77.
Example: A second sample of statistics exam scores for 15 students is (in order from
smallest to largest) as follows: 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95
Notice that 15 is an odd number. The median is the 8th value (the middle value). The 8th value is
76, so the median M = 76.
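These examples can be checked with Python's built-in statistics module (a minimal sketch; multimode requires Python 3.8 or later):
import statistics

scores_14 = [53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93]
scores_15 = [52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95]

print(statistics.median(scores_14))  # 77.0, the average of the 7th and 8th values (76 and 78)
print(statistics.median(scores_15))  # 76, the middle (8th) value
print(statistics.mode(scores_15))    # 78, the most frequent value
print(statistics.multimode([52, 53, 53, 53, 60, 67, 72, 72, 72, 90]))  # [53, 72], a bimodal set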
DESCRIPTIVE STATISTICS…
Variance
The variance is the average of the squares of the deviations. A deviation is the difference
between a value and the mean and is written as x - x̄ (the value minus the sample mean).
Example: {2, 3, 5, 6} is a set of data. The sample mean is 4. The deviations are:
2 - 4 = -2
3 - 4 = -1
5 - 4 = 1
6 - 4 = 2
The deviations squared are:
(-2)² = 4
(-1)² = 1
(1)² = 1
(2)² = 4
The average of the squared deviations, dividing by n - 1 = 3 because this is a sample, is s² = (4 + 1 + 1 + 4)/3 = 10/3 ≈ 3.33
DESCRIPTIVE STATISTICS…
Standard Deviation
The standard deviation is a special average of the deviations. It measures how the data is
spread out from its mean.
The standard deviation is the square root of the variance and has the same units as the
mean. The letter s represents the sample standard deviation and the Greek
letter σ represents the population standard deviation.
Example: In the variance example above, the sample variance was s² = 3.33 (to 2 decimal
places). The sample standard deviation is s = √3.33 ≈ 1.8, rounded to one decimal place.
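A minimal sketch verifying the variance and standard deviation example with the statistics module:
import statistics

data = [2, 3, 5, 6]

print(statistics.mean(data))      # 4
print(statistics.variance(data))  # 3.33... (sample variance, divides by n - 1)
print(statistics.stdev(data))     # 1.83, i.e. 1.8 to one decimal place (square root of the variance)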
5.5.1 THE STANDARD NORMAL PROBABILITY DISTRIBUTION
Standard Normal
The standard normal distribution is a normal probability distribution of standardized
values called z-scores.
The standard normal has a mean of 0 and a standard deviation of 1. Z is commonly used
as the random variable.
Notation: Z ~ N(0, 1)
Z-Scores
The formula for a z-score is z = (x - μ)/σ,
where x is the value being standardized, μ is the mean, and σ is the standard deviation.
A z-score is measured in terms of the standard deviation.
So, if z = 2, then 2 is the standardized score for a value of X that is 2 standard deviations above
the mean (a positive z-score).
If z = -1, then -1 is the standardized score for a value of X that is 1 standard deviation below
the mean (a negative z-score).
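A minimal sketch of standardizing a set of values with NumPy (the data here are purely illustrative):
import numpy as np

# Hypothetical exam scores (illustrative data only)
x = np.array([53, 59, 63, 72, 76, 78, 84, 90])

# Standardize: subtract the mean and divide by the standard deviation
z = (x - x.mean()) / x.std()
print(z.round(2))                            # each score expressed in standard deviations from the mean
print(z.mean().round(2), z.std().round(2))   # approximately 0 and 1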
5.5.2 TYPES OF DISTRIBUTIONS
Uniform Distribution
Normal Distribution
Binomial Distribution
Bernoulli Distribution
Poisson Distribution
Exponential Distribution
UNIFORM DISTRIBUTION
NORMAL DISTRIBUTION
EXPONENTIAL DISTRIBUTION
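The figures originally shown for these three distributions are not reproduced here; as an illustrative sketch (with arbitrarily chosen parameters), their density curves can be plotted with SciPy and Matplotlib:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, expon

x = np.linspace(-4, 6, 500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, uniform.pdf(x, loc=0, scale=2))
axes[0].set_title('Uniform(0, 2)')
axes[1].plot(x, norm.pdf(x, loc=0, scale=1))
axes[1].set_title('Normal(0, 1)')
axes[2].plot(x, expon.pdf(x, scale=1))
axes[2].set_title('Exponential(rate = 1)')
plt.show()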
5.6 CORRELATION COEFFICIENT
If a scatter plot shows a possible linear relationship, then the correlation coefficient indicates how
strong the relationship is between x and y. We use the letter r for the correlation coefficient.
If r = 1 or r = -1, there is "perfect correlation." This means that the points are already in a straight
line. In the real world, perfect correlation is very unlikely to happen.
The closer r is to 1 or -1, the better the correlation between x and y because the data points are
closer to the line of best fit.
There is positive correlation if y increases as x increases, or y decreases as x decreases. If
there is positive correlation, then the line of best fit has a positive slope.
There is negative correlation if y decreases as x increases, or y increases as x decreases. If
there is negative correlation, then the line of best fit has a negative slope.
There is no correlation if the correlation coefficient is 0 (r = 0). This means there is no linear relationship
between x and y. If there is no correlation, then the slope of the line of best fit is 0.
High correlation does not necessarily mean that x causes y or y causes x.
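A minimal sketch of computing r for two hypothetical variables with NumPy (np.corrcoef returns the correlation matrix; the off-diagonal entry is r):
import numpy as np

# Hypothetical paired observations (illustrative data only)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 1, indicating a strong positive linear relationship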
CORRELATION COEFFICIENT…
Examples of scatter diagrams with different values of correlation coefficient (ρ)
5.7 DIMENSIONALITY REDUCTION
Dimensionality reduction techniques reduce the number of features in the dataset
without losing much information, while keeping (or even improving) the model's
performance
Benefits of applying dimensionality reduction to a dataset:
Space required to store the data is reduced as the number of dimensions comes
down.
Less dimensions lead to less computation/training time.
Some algorithms do not perform well when there is a large number of dimensions, so
these dimensions need to be reduced for the algorithm to be useful.
It takes care of multicollinearity by removing redundant features.
It helps in visualizing data.
DIMENSIONALITY REDUCTION…
Dimensionality reduction can be done in two different ways:
Feature Selection
Dimensionality Reduction
   Components/Factor Based
      Factor Analysis
      Principal Component Analysis (PCA)
      Singular Value Decomposition (SVD)
      Independent Component Analysis (ICA)
   Projections Based
      ISOMAP
      t-Distributed Stochastic Neighbor Embedding (t-SNE)
      Uniform Manifold Approximation and Projection (UMAP)
5.7.1 FACTOR ANALYSIS
Suppose we have two variables: Income and Education. These variables will
potentially have a high correlation as people with a higher education level tend to
have significantly higher income, and vice versa.
In the Factor Analysis technique, variables are grouped by their correlations, i.e.,
all variables in a particular group will have a high correlation among themselves,
but a low correlation with variables of other group(s). Here, each group is known
as a factor. These factors are small in number as compared to the original
dimensions of the data. However, these factors are difficult to observe.
FACTOR ANALYSIS…
Read in the training images from the Fashion-MNIST train.csv file (each row stores the pixel values of one image):
import numpy as np
import pandas as pd

train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')
Convert these images into a NumPy array format:
train_data = np.array(train, dtype='float32')

image = []
for i in range(0, 60000):
    img = train_data[i].flatten()
    image.append(img)
image = np.array(image)
FACTOR ANALYSIS…
Create a dataframe containing the pixel values of every individual pixel present in each image, and also their corresponding labels:
train = pd.read_csv("../input/fashionmnist/fashion-mnist_train.csv", sep=',')  # Give the complete path of your train.csv file
feat_cols = ['pixel' + str(i) for i in range(image.shape[1])]
df = pd.DataFrame(image, columns=feat_cols)
df['label'] = train['label']
Decompose the dataset using Factor Analysis:
from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=3).fit_transform(df[feat_cols].values)
FACTOR ANALYSIS…
Visualize the results:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 10))
plt.title('Factor Analysis Components')
plt.scatter(fa[:, 0], fa[:, 1], c='r', s=10)
plt.scatter(fa[:, 1], fa[:, 2], c='b', s=10)
plt.scatter(fa[:, 2], fa[:, 0], c='g', s=10)
plt.legend(("First Factor", "Second Factor", "Third Factor"))
5.7.2 UNIFORM MANIFOLD APPROXIMATION AND PROJECTION (UMAP)
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction
technique that preserves much of the local structure of the data while also capturing
more of its global structure
Key advantages of UMAP are:
It can handle large datasets and high dimensional data without too much difficulty
It combines the power of visualization with the ability to reduce the dimensions of the
data
Along with preserving the local structure, it also preserves the global structure of the
data. UMAP maps nearby points on the manifold to nearby points in the low
dimensional representation, and does the same for far away points.
This method uses the concept of k-nearest neighbor and optimizes the results
using stochastic gradient descent.
UMAP…
import umap

umap_data = umap.UMAP(n_neighbors=5, min_dist=0.3, n_components=3).fit_transform(df[feat_cols][:6000].values)
Here,
n_neighbors determines the number of neighboring points used
min_dist controls how tightly UMAP is allowed to pack points together in the embedding. Larger values ensure embedded points are more evenly distributed
UMAP…
Visualize the transformation:
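The plotting code itself is not included on the slide; a minimal sketch of one way to visualize the first two UMAP components, colored by class label (assuming df and umap_data from the previous steps), is:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.title('Decomposition using UMAP')
# Color each embedded point by its Fashion-MNIST class label
plt.scatter(umap_data[:, 0], umap_data[:, 1], c=df['label'][:6000], s=10, cmap='Spectral')
plt.colorbar()
plt.show()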
5.8 FEATURE SELECTION
The process of choosing a subset of input features that contribute the most to
the output feature for use in model construction.
Important if we have datasets with high dimensionality (i.e., a large number of
features).
Helps to mitigate the problems caused by high dimensionality by selecting features that
have high importance to the model, so that the data dimensionality can be reduced
without much loss of the total information.
Benefits of feature selection are:
Reduce training time
Reduce the risk of overfitting
Potentially increase the model's performance
Reduce the model's complexity so that interpretation becomes easier
5.8.1 METHODS OF FEATURE SELECTION
Filter Methods
ANOVA F-value
Variance Threshold
Mutual Information
Wrapper Methods
Exhaustive feature selection (EFS)
Sequential forward selection (SFS)
Sequential backward selection (SBS)
Embedded Methods
Random forest
5.8.1.1 NECESSARY PYTHON LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
IRIS FLOWER DATASET FROM SCIKIT-LEARN
# Load Iris dataset from Scikit-learn
from sklearn.datasets import load_iris
# Create input and output features
feature_names = load_iris().feature_names
X_data = pd.DataFrame(load_iris().data, columns=feature_names)
y_data = load_iris().target
# Show the first five rows of the dataset
X_data.head()
5.8.1.2 ANOVA F-VALUE
ANOVA F-value method estimates the degree of linearity between the input
feature (i.e., predictor) and the output feature.
A high F-value indicates high degree of linearity and a low F-value indicates low
degree of linearity.
The main disadvantage of using ANOVA F-value is it only captures linear
relationships between input and output feature.
In other words, any non-linear relationships cannot be detected by F-value.
Scikit-learn has two functions to calculate the F-value:
f_classif, which calculates the F-value between input and output features for a classification
task
f_regression, which calculates the F-value between input and output features for a regression
task
ANOVA F-VALUE…
Use f_classif because the Iris dataset entails a classification task
# Import f_classif from Scikit-learn
from sklearn.feature_selection import f_classif
# Create f_classif object to calculate F-value
f_value = f_classif(X_data, y_data)
# Print the name and F-value of each feature
for feature in zip(feature_names, f_value[0]):
    print(feature)
ANOVA F-VALUE…
Visualize the results by creating a bar chart:
# Create a bar chart for visualizing the F-values
plt.figure(figsize=(4,4))
plt.bar(x=feature_names, height=f_value[0], color='tomato')
plt.xticks(rotation='vertical')
plt.ylabel('F-value')
plt.title('F-value Comparison')
plt.show()
5.8.1.3 EXHAUSTIVE FEATURE SELECTION (EFS)
EFS finds the best subset of features by evaluating all feature combinations.
Suppose we have a dataset with three features. EFS will evaluate the
following feature combinations:
feature_1
feature_2
feature_3
feature_1 and feature_2
feature_1 and feature_3
feature_2 and feature_3
feature_1, feature_2, and feature_3
EFS selects a subset that generates the best performance (e.g., accuracy,
precision, recall, etc.) of the model being considered.
Mlxtend provides the ExhaustiveFeatureSelector class to perform EFS.
EXHAUSTIVE FEATURE SELECTION (EFS)…
EFS has five important parameters:
estimator: the classifier that we intend to train
min_features: the minimum number of features to select
max_features: the maximum number of features to select
scoring: the metric to use to evaluate the classifier
cv: the number of cross-validations to perform
EXHAUSTIVE FEATURE SELECTION (EFS)…
# Import ExhaustiveFeatureSelector from Mlxtend
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
# Import logistic regression from Scikit-learn
from sklearn.linear_model import LogisticRegression
# Create a logistic regression classifier
lr = LogisticRegression()
# Create an EFS object
efs = EFS(estimator=lr,        # Use logistic regression as the classifier/estimator
          min_features=1,      # The minimum number of features to consider is 1
          max_features=4,      # The maximum number of features to consider is 4
          scoring='accuracy',  # The metric to use to evaluate the classifier is accuracy
          cv=5)                # The number of cross-validations to perform is 5
EXHAUSTIVE FEATURE SELECTION (EFS)…
# Train EFS with our dataset
efs = efs.fit(X_data, y_data)

# Print the results
print('Best accuracy score: %.2f' % efs.best_score_)  # best_score_ shows the best score
print('Best subset (indices):', efs.best_idx_)  # best_idx_ shows the indices of the features that yield the best score
print('Best subset (corresponding names):', efs.best_feature_names_)  # best_feature_names_ shows the feature names that yield the best score
EXHAUSTIVE FEATURE SELECTION (EFS)…
Transform the dataset into a new dataset containing only the subset of features that
generates the best score by using the transform method.
# Transform the dataset
X_data_new = efs.transform(X_data)

# Print the results
print('Number of features before transformation: {}'.format(X_data.shape[1]))
print('Number of features after transformation: {}'.format(X_data_new.shape[1]))

# Show the performance of each subset of features
efs_results = pd.DataFrame.from_dict(efs.get_metric_dict()).T
efs_results.sort_values(by='avg_score', ascending=True, inplace=True)
efs_results
EXHAUSTIVE FEATURE SELECTION (EFS)…
Visualize the performance of each subset of features by creating a horizontal bar chart:
# Create a horizontal bar chart for visualizing
# the performance of each subset of features
fig, ax = plt.subplots(figsize=(12,9))
y_pos = np.arange(len(efs_results))
ax.barh(y_pos, efs_results['avg_score'],
xerr=efs_results['std_dev'], color='tomato')
ax.set_yticks(y_pos)
ax.set_yticklabels(efs_results['feature_names'])
ax.set_xlabel('Accuracy')
plt.show()
5.8.1.4 FEATURE SELECTION USING RANDOM FOREST
Random forest is one of the most popular learning algorithms used for feature
selection in a data science workflow.
Split the dataset into training and test sets, because feature selection is part of the
training process.
Use the Gini criterion to define feature importance.
FEATURE SELECTION USING RANDOM FOREST…
# Import RandomForestClassifier from Scikit-learn
from sklearn.ensemble import RandomForestClassifier
# Import train_test_split from Scikit-learn
from sklearn.model_selection import train_test_split
# Split the dataset into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)
FEATURE SELECTION USING RANDOM FOREST…
# Create a random forest classifier
rfc = RandomForestClassifier(random_state=0, criterion='gini')  # Use gini criterion to define feature importance
# Train the classifier
rfc.fit(X_train, y_train)
# Print the name and gini importance of each feature
for feature in zip(feature_names, rfc.feature_importances_):
    print(feature)
If we add up all the importance scores, the result is 100%. As we can see, petal length and petal
width correspond to 83% of the total importance score. They are clearly the most important features!
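As a possible next step (not shown on the slide), scikit-learn's SelectFromModel can use these importance scores to keep only the most important features; the threshold below is an illustrative choice:
# Import SelectFromModel from Scikit-learn
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds an illustrative threshold of 0.15
sfm = SelectFromModel(rfc, threshold=0.15, prefit=True)
X_train_selected = sfm.transform(X_train)
print('Selected features:', [name for name, keep in zip(feature_names, sfm.get_support()) if keep])
print('Shape after selection:', X_train_selected.shape)
The reduced feature set can then be used to retrain and evaluate the classifier.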