MACHINE LEARNING
23AIE233M INTRODUCTION TO MACHINE LEARNING L-T-P-C: 2 1 3 4
Course Objectives
Course Outcomes
• Apply pre-processing techniques to prepare the data for machine learning applications
• Implement supervised machine learning algorithms for different datasets
• Implement unsupervised machine learning algorithms for different datasets
• Identify the appropriate machine learning algorithms for different applications
Syllabus
Unit 1
• Introduction to Machine Learning – Data and Features – Machine Learning Pipeline: Data Preprocessing:
Standardization, Normalization, Missing data problem, Data imbalance problem – Data visualization - Setting
up training, development and test sets – Cross validation – Problem of Overfitting, Bias vs Variance -
Evaluation measures – Different types of machine learning: Supervised learning, Unsupervised learning.
Unit 2
• Supervised learning - Regression: Linear regression, logistic regression – Classification: K-Nearest Neighbor,
Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine, Perceptron.
Unit 3
• Unsupervised learning – Clustering: K-means, Hierarchical, Spectral, subspace clustering, Dimensionality
Reduction Techniques, Principal component analysis, Linear Discriminant Analysis.
Text Books:
• Andrew Ng. Machine Learning Yearning. URL: http://www.mlyearning.org/, 2017.
• Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts,
2012.
• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2010.
Reference Books:
• Richard O. Duda, Peter E. Hart, David G. Stork. Pattern Classification. Wiley, Second Edition, 2007.
• Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
Unit 1
• Introduction to Machine Learning – Data and Features – Machine Learning Pipeline:
Data Preprocessing: Standardization, Normalization, Missing data problem, Data
imbalance problem – Data visualization - Setting up training, development and test sets
– Cross validation – Problem of Overfitting, Bias vs Variance - Evaluation measures –
Different types of machine learning: Supervised learning, Unsupervised learning.
What is AI?
"The simulation of human intelligence in machines that are programmed
to think like humans and mimic their actions." (Techopedia)
Machine learning algorithms enable computers to learn from data, and even
improve themselves, without being explicitly programmed. – Arthur Lee Samuel
What is Machine Learning?
AI Vs. ML
Machine Learning Evolution
Traditional vs ML
(Figure: input data flows into an ML model, which produces predictions on unseen data.)
Machine Learning Types
Applications of Machine Learning
Components of Machine Learning
Why Machine Learning?
Voluminous data, computational power, and powerful algorithms.
Getting to Know Your Data
• Transaction data:
  TID   Items
  1     Bread, Coke, Milk
  2     Beer, Bread
  3     Beer, Coke, Diaper, Milk
  4     Beer, Bread, Diaper, Milk
  5     Coke, Diaper, Milk
• Document data:
• Term-frequency vector (matrix) of text documents:
               timeout  season  coach  game  score  team  ball  lost  play  win
  Document 1      3        0      5      0     2      6     0     2     0     2
  Document 2      0        7      0      2     1      0     0     3     0     0
  Document 3      0        1      0      0     1      2     2     0     3     0
Types of Data Sets: (2) Graphs and Networks
• Transportation networks
• Molecular structures
• Social or information networks
Types of Data Sets: (3) Ordered Data
• Video data: sequence of images
• Temporal data: time-series
• Sequential data: transaction sequences
• Genetic sequence data
Types of Data Sets: (4) Spatial, image and multimedia Data
• Image data
• Video data
Important Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion
Data Types
Data types are broadly categorical or numerical.
• Interval: equal spaces between values; no meaningful zero value; the mean makes sense.
  Examples: Temperature (Celsius/Fahrenheit), IQ, Credit Score.
• Nominal: variables that are non-numeric, or where the numbers carry no quantitative
  value; the mode makes sense. Examples: Gender, Ethnicity, Eye Color, Blood Type.
Nominal/Ordinal Examples
(Figure: examples of nominal and ordinal variables.)
A More Detailed Taxonomy
Types of Data
• Quantitative: Interval, Ratio
• Qualitative: Nominal, Ordinal
Quantitative Vs. Qualitative
● Quantitative data are the easiest to explain; they answer questions such as
  ○ “how many”, “how much” and “how often”
● They can be expressed as numbers, so they can be quantified
Quantitative Vs. Qualitative
● Qualitative data can’t be expressed as a number, so it can’t be measured
  ○ It mainly consists of words, pictures, and symbols, not numbers
● It can answer questions like:
  ○ “how did this happen”, or “why did this happen”
Categorical Data
● Categorical data represents characteristics.
○ Therefore it can represent things like a person’s gender,
language etc.
○ Categorical data can also take on numerical values (Example: 1
for female and 0 for male)
● Two types of categorical data
○ Nominal
○ Ordinal
Numerical - Discrete
● We speak of discrete data if its values are distinct and
separate
○ In other words: We speak of discrete data if the data can only
take on certain values
○ This type of data can’t be measured but it can be counted
○ It basically represents information that can be categorized into a
classification
○ Example:
■ The number of students in a class
■ The number of workers in a company
■ The number of test questions you answered correctly
Good data
preparation is key
to producing valid
and reliable models
Data Preprocessing
Many methods have been developed, but this is still an active
area of research
Traditional ML Workflow
Why is Data Preprocessing Important?
■ Without quality data, results are less accurate!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
■ Data preparation, cleaning, and transformation comprise the majority of
the work (roughly 90%) in a data mining or machine learning
application.
Data Preprocessing
■ Data cleaning
■ Data reduction
■ Discretization
■ Summary
Why Data Preprocessing?
■ Data Integration
• Integration of multiple databases, or files
■ Data Transformation
• Normalization and aggregation
■ Data Reduction
• Obtains reduced representation in volume but produces the same or similar analytical
results
■ Data Discretization
• Automatic generation of concept hierarchies from numerical data
Data Cleaning
How to Handle Missing Data? (These are generally bad practice.)
● Ignore the tuple: usually done when the class label is missing (assuming a
classification task); not effective when the percentage of missing values per
attribute varies considerably.
● Fill in the missing value manually: tedious and often infeasible.
● Use a global constant to fill in the missing value, e.g., “unknown”; this risks
creating an artificial class.
How to Handle Missing Data? (Better practice.)
● Use the attribute mean for all samples belonging to the same class to fill in the
missing value.
● Use the most probable value to fill in the missing value: inference-based methods,
such as regression.
(Figure: linear regression, e.g., y = x + 1, finds the best line to fit two variables; the
regression function is then used to smooth the data.)
Data Integration
(combines data from multiple sources into a coherent store )
● Schema integration
○ integrate metadata from different sources
○ Entity identification problem: identify real-world entities from multiple data sources, e.g.,
A.cust-id ≡ B.cust-#
○ Possible reasons for conflicting values: different representations, different scales, e.g., metric vs. British units
Web Information Integration
■ Many integration tasks,
• Integrating Web query interfaces (search forms)
• Integrating ontologies (taxonomy)
• Integrating extracted data
• Integrating textual information
• E.g., entity linking, paraphrasing, etc.
• Attribute/feature construction
• New attributes constructed from the given ones to help in the data mining
process
Data Transformation: Normalization
• Min-max normalization:
  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
  v' = (v − mean_A) / stand_dev_A
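A minimal sketch of the two normalization formulas above, assuming NumPy is available (not part of the original slides):

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values into [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """z-score normalization: center at the mean, scale by the std. deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

salaries = [30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70]  # sample values
print(min_max_normalize(salaries))  # all values now in [0, 1]
print(z_score_normalize(salaries))  # mean 0, unit standard deviation
```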
Data Reduction Strategies
■ Data is too big to work with
■ Data reduction
■obtain a reduced representation of the data set that is much smaller in
volume, yet closely maintains the integrity of the original data
■ Data reduction strategies
• (Data Cube) Aggregation
• Attribute (Subset) Selection
• Dimensionality Reduction
• Numerosity Reduction
• Data Discretization
• Concept Hierarchy Generation
Data Cube Aggregation
■Summarize (aggregate) data based on dimensions
■The resulting data set is smaller in volume, without loss of
information necessary for analysis task
■Concept hierarchies may exist for each attribute, allowing
the analysis of data at multiple levels of abstraction
• Data Aggregation
• Data Cube
■ Provide fast access to pre‐computed,
summarized data, thereby benefiting on‐line
analytical processing as well as data mining
Attribute Subset Selection
■ Attribute selection can help in the phases of data mining (knowledge discovery) process
■ By attribute selection,
■ we can improve data mining performance (speed of learning, predictive accuracy, or
simplicity of rules)
■ we can visualize the data for model selected
■ we reduce dimensionality and remove noise.
■ Attribute (Feature) selection is a search problem
■ Search directions
■ (Sequential) Forward selection
■ (Sequential) Backward selection (elimination)
■ Bidirectional selection
■ Decision tree algorithm (induction)
Attribute Subset Selection
■ Search strategies
■ Exhaustive search
■ Heuristic search
■ Selection criteria
■ Statistical significance
■ Information gain
■ etc.
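As an illustrative sketch, sequential forward selection can be run with scikit-learn's SequentialFeatureSelector; the dataset and estimator here are arbitrary choices, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Sequential forward selection: greedily add the feature that most
# improves cross-validated accuracy until 2 features are selected.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",  # use "backward" for backward elimination
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```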
Data Compression
• String compression
  • There are extensive theories and well-tuned algorithms
  • Typically lossless
  • But only limited manipulation is possible without expansion
• Audio/video compression
  • Typically lossy compression, with progressive refinement
  • Sometimes small fragments of the signal can be reconstructed without
    reconstructing the whole
• Time sequences (not audio)
  • Typically short, and vary slowly with time
(Figure: lossy compression of original data into an approximation.)
Wavelet Transforms (Haar-2, Daubechies-4)
• Discrete wavelet transform (DWT): linear signal processing
• Compressed approximation: store only a small fraction of the strongest
  wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but better lossy compression,
  localized in space
• Method:
  • Length, L, must be an integer power of 2 (pad with 0s when necessary)
  • Each transform has 2 functions: smoothing, difference
  • Applies to pairs of data, resulting in two sets of data of length L/2
  • Applies the two functions recursively, until reaching the desired length
Principal Component Analysis
• Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that can
  best be used to represent the data
• The original data set is reduced to one consisting of N data vectors on c
  principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Works for numeric data only
• Used when the number of dimensions is large
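A minimal PCA sketch via SVD on centered data, assuming NumPy (the toy data is hypothetical):

```python
import numpy as np

def pca(X, c):
    """Project an N x k data matrix X onto its top-c principal components."""
    X_centered = X - X.mean(axis=0)           # center each dimension
    # SVD of the centered data: right singular vectors are the principal axes
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:c]                        # c orthogonal direction vectors
    return X_centered @ components.T           # N x c reduced representation

X = np.random.rand(100, 5)   # toy data: N=100 vectors, k=5 dimensions
X_reduced = pca(X, c=2)      # reduce to 2 principal components
print(X_reduced.shape)       # (100, 2)
```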
Principal Component Analysis
(Figure: PCA rotates the original axes X1, X2 to new orthogonal axes Y1, Y2
aligned with the directions of greatest variance.)
Numerosity Reduction
Parametric methods
Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
Log-linear models: obtain the value at a point in m-D space as the
product of values on appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line
  Often uses the least-squares method to fit the line
Clustering
  Partition the data set into clusters, and store only the cluster
  representation
  Can be very effective if data is clustered, but not if data is
  “smeared”
  Can have hierarchical clustering, stored in multi-dimensional
  index tree structures
  There are many choices of clustering definitions and
  clustering algorithms
Sampling
Allows a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
  Simple random sampling may have very poor performance in the
  presence of skew
Develop adaptive sampling methods
  Stratified sampling: approximate the percentage of each class (or
  subpopulation of interest) in the overall data set
(Figure: SRSWOR, simple random sampling without replacement, and SRSWR,
simple random sampling with replacement, both drawn from the raw data.)
(Figure: raw data vs. a cluster/stratified sample.)
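A small sketch of these sampling schemes, assuming NumPy and pandas; the data and the 80/20 class split are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = np.arange(100)                              # toy "raw data"

srswor = rng.choice(data, size=10, replace=False)  # SRSWOR: without replacement
srswr = rng.choice(data, size=10, replace=True)    # SRSWR: with replacement

# Stratified sampling: preserve each class's proportion in the sample.
df = pd.DataFrame({"value": data, "cls": np.where(data < 80, "A", "B")})
stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["cls"].value_counts())            # ~8 from A, ~2 from B
```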
Hierarchical Reduction
Use multi-resolution structure with different
degrees of reduction
Hierarchical clustering is often performed but tends
to define partitions of data sets rather than
“clusters”
Parametric methods are usually not amenable to
hierarchical representation
Hierarchical aggregation
An index tree hierarchically divides a data set into
partitions by value range of some attributes
Each partition can be considered as a bucket
Thus an index tree with aggregates stored at each
node is a hierarchical histogram
Discretization
Three types of attributes:
  Nominal — values from an unordered set
  Ordinal — values from an ordered set
  Continuous — real numbers
Discretization: divide the range of a continuous attribute into intervals
  Some classification algorithms only accept categorical attributes
  Reduce data size by discretization
Methods (a code sketch follows below): histogram analysis, clustering analysis,
entropy-based discretization, segmentation by natural partitioning
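A brief sketch of equal-width and equal-frequency binning, two simple histogram-style discretizations, using pandas; the ages and labels are made-up values:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 47, 52, 58, 63, 70])

# Equal-width discretization: split the value range into 3 same-width intervals
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency discretization: each bin gets roughly the same count
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "width": equal_width, "freq": equal_freq}))
```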
Entropy-Based Discretization
Segmentation by Natural Partitioning
3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width
intervals
* If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most significant
digit, partition the range into 5 intervals
Concept Hierarchy Generation for Categorical Data
Specification of a partial ordering of attributes explicitly at
the schema level by users or experts
Specification of a portion of a hierarchy by explicit data
grouping
Specification of a set of attributes, but not of their partial
ordering
Specification of only a partial set of attributes
Specification of a set of attributes
Concept hierarchy can be automatically generated based on the number of
distinct values per attribute in the given attribute set. The attribute with
the most distinct values is placed at the lowest level of the hierarchy.
Basic Statistical Descriptions of Data
• Measures of central tendency
  • Mean, median and mode
  • Locate the centre of a data distribution: where do most of the attribute values fall?
• Dispersion measures
  • Range, quartiles, interquartile range, five-number summary and box plots,
    variance and standard deviation
  • Describe how the data are spread out
Descriptive Statistics
Basic Statistical Descriptions of Data
• Motivation: to better understand the data (central tendency, variation and spread)
• Data dispersion characteristics: median, max, min, quantiles, outliers, variance, ...
• Numerical dimensions correspond to sorted intervals
  • Data dispersion: analyzed with multiple granularities of precision
  • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency: (1) Mean
• Mean (algebraic measure), sample vs. population, where n is the sample size
  and N is the population size:
    sample mean:      x̄ = (1/n) Σ xᵢ
    population mean:  μ = (Σ x) / N
• Weighted arithmetic mean:
    x̄ = (Σ wᵢxᵢ) / (Σ wᵢ)
• Trimmed mean: the mean obtained after chopping off values at the high and
  low extremes
Measuring the Central Tendency: (1) Mean
• Suppose we have the following values for salary (in thousands of dollars),
  shown in increasing order:
• 30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70
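Working it out: the values sum to 581 and n = 11, so the mean is 581/11 ≈ 52.8, i.e., about $52,800.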
Measuring the Central Tendency: (2) Median
• Median: middle value if there is an odd number of values, or the average of
  the middle two values otherwise
• Estimated by interpolation (for grouped data)
Measuring the Central Tendency: (3) Mode
• Mode: value that occurs most frequently in the data
• A distribution may be unimodal, bimodal or multimodal (e.g., trimodal)
• Empirical formula (for moderately skewed data):
    mean − mode ≈ 3 × (mean − median)
• Mode for grouped data (interpolating within the modal class):
    mode = L + ((f₁ − f₀) / (2f₁ − f₀ − f₂)) × h
  where L is the lower boundary of the modal class, f₁ its frequency, f₀ and f₂
  the frequencies of the neighboring classes, and h the class width
Symmetric vs. Skewed Data
• Data can be "skewed", meaning it tends to have a long tail on one side or the other
• (Figure: median, mean and mode of symmetric, positively skewed and negatively
  skewed data.)
When to use what measurement of central tendency ??
Practice Questions: Mean
• The grade 10 math class recently had a mathematics test and the
grades were as follows: 78 66 82 89 75 74
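Working it out: 78 + 66 + 82 + 89 + 75 + 74 = 464, and 464/6 ≈ 77.3, so the mean grade is about 77.3.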
Practice Questions: Mean
• The following table shows the number of plants in 20 houses in a group.
  Find the mean number of plants per house.

  Number of Plants:  0-2  2-4  4-6  6-8  8-10  10-12  12-14
  Number of Houses:   1    2    2    4    6     2      3

Solution: take the class midpoints xᵢ = 1, 3, 5, 7, 9, 11, 13.
∑fᵢ = 1 + 2 + 2 + 4 + 6 + 2 + 3 = 20
∑fᵢxᵢ = 1 + 6 + 10 + 28 + 54 + 22 + 39 = 160
Therefore, mean = ∑(fᵢxᵢ)/∑fᵢ = 160/20 = 8 plants
Practice Questions: Median
• The grade 10 math class recently had a mathematics test and the grades
  were as follows: 78 66 82 89 75 74
• Arrange in order: 66 74 75 78 82 89
• With an even number of values, the median is the average of the middle two:
  (75 + 78)/2 = 153/2 = 76.5
Mode
• Find the mode of the following data:
  78 56 68 92 84 76 74 56 68 66 78 72 66
  65 53 61 62 78 84 61 90 87 77 62 88 81
• Counting frequencies, 78 occurs three times, more often than any other value
• Mode = 78
Measure of Dispersion
• Statistics are very important for observation, analysis and mathematical
  prediction models. With the help of statistics we can know what happened in
  the past and what may occur in the future.
• Central tendency measures do not reveal the variability present in the data.
• Dispersion is the scatteredness of the data series around its average.
• Dispersion is the extent to which values in a distribution differ from the
  average of the distribution.
• A measure of statistical dispersion is a nonnegative real number that is zero
  if all the data are the same and increases as the data become more diverse.
Range
• Range = largest value − smallest value
Range for grouped data
• The range of a sample of data organized in a frequency distribution is
computed by the following formula:
• Range = upper limit of the last class − lower limit of the first class
Measures of Data Dispersion: Variance and Standard Deviation
• Variance of N observations x₁, x₂, ..., x_N:
    σ² = (1/N) Σ (xᵢ − μ)²
• The standard deviation σ is the square root of the variance σ²
Variance/Standard Deviation for Grouped Data
• σ² = Σ fᵢ(xᵢ − x̄)² / Σ fᵢ, where xᵢ is the class midpoint and fᵢ the class frequency
Properties of Normal Distribution Curve
(Figure: normal curve; the spread about the mean represents data dispersion.)
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of the five-number summary
• Histogram: the x-axis shows values, the y-axis represents frequencies
• Quantile plot: each value xᵢ is paired with fᵢ, indicating that approximately
  100·fᵢ% of the data are ≤ xᵢ
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
  distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates, plotted as points
  in the plane
Measuring the Dispersion of Data: Quartiles & Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Five-number summary: min, Q1, median, Q3, max
• Boxplot: data is represented with a box
  • Q1, Q3, IQR: the ends of the box are at the first and third quartiles,
    i.e., the height of the box is the IQR
  • The median (Q2) is marked by a line within the box
  • Whiskers: two lines outside the box, extended to the minimum and maximum
• Outliers: points beyond a specified outlier threshold, plotted individually
  • Usually, a value more than 1.5 × IQR below Q1 or above Q3
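A sketch of the five-number summary and the 1.5 × IQR outlier rule in NumPy, using the data from a later example; note that np.percentile interpolates, so its quartiles can differ slightly from the median-of-halves method used in these slides:

```python
import numpy as np

data = np.array([3, 7, 11, 11, 15, 21, 23, 39, 41, 45, 50, 61, 87, 99, 220])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# 1.5 * IQR rule: anything outside the fences is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Five-number summary:", data.min(), q1, median, q3, data.max())
print("Outliers:", outliers)
```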
The 5-Number Summary
• The five-number summary is the set of values displayed by the box and
  whisker plot: the lowest value, lower quartile, median, upper quartile,
  and highest value.
(Figure: a box plot above a number line from 4 to 12; the whiskers run from
the lowest value to the lower quartile and from the upper quartile to the
highest value, with the box between the quartiles.)
Box Plots
Graphing The Data
• Notice, the Box includes the lower quartile, median, and upper quartile.
• The Whiskers extend from the Box to the max and min.
Measuring the Dispersion of Data: Quartiles & Boxplots
A sample of 10 boxes of raisins has these weights (in grams):
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Make a box plot of the data.
Step 1: Order the data from smallest to largest. Our data is already in order:
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Step 2: Find the median. The median is the mean of the middle two numbers:
(30 + 34)/2 = 32, so median = 32
Step 3: Find the quartiles.
The first quartile is the median of the data points to the left of the median: Q1 = 29
The third quartile is the median of the data points to the right of the median: Q3 = 35
Step 4: Complete the five-number summary by finding the min and the max.
The min is the smallest data point, 25; the max is the largest data point, 38.
The five-number summary is 25, 29, 32, 35, 38.
Constructing a box and whisker plot: Example 2
• Step 1 – Take the set of numbers given: 34, 18, 100, 27, 54, 52, 93, 59, 61, 87, 68, 85, 78, 82, 91.
Place the numbers in order from least to greatest: 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87,
91, 93, 100
• Step 2 – Find the median, the middle value in the data set:
68 is the median of this data set.
• Step 3 – Find the lower quartile, the median of the data set to the left of 68:
(18, 27, 34, 52, 54, 59, 61) 68, 78, 82, 85, 87, 91, 93, 100
52 is the lower quartile
• Step 4 – Find the upper quartile, the median of the data set to the right of 68:
18, 27, 34, 52, 54, 59, 61, 68, (78, 82, 85, 87, 91, 93, 100)
87 is the upper quartile
• Step 5 – Find the maximum and minimum values in the set, the greatest and least values:
18 is the minimum and 100 is the maximum.
Constructing a box and whisker plot: another example
• Median = 7.5
• Lower Quartile = 5
• Upper Quartile = 11
• Upper Extreme = 20
• Lower Extreme = 2
Draw the boxplot now. Is the data skewed?
Interpreting the Box Plot:
Symmetric: If a box and whisker plot is symmetric, the median is equidistant from
the minimum and the maximum.
Negatively Skewed: If a box and whisker plot is negatively skewed, the distance from
the median to the minimum is greater than the distance from the median to the
maximum.
Positively Skewed: If a box and whisker plot is positively skewed, the distance from
the median to the maximum is greater than the distance from the median to the
minimum.
Analyzing The Graph
• The data values found inside the box represent the middle half (50%)
of the data.
• The line segment inside the box represents the median
Compute the 5-Number Summary and detect outliers
Data: 3, 7, 11, 11, 15, 21, 23, 39, 41, 45, 50, 61, 87, 99, 220
• Median = 39
• Lower Quartile (Q1) = 11
• Upper Quartile (Q3) = 61
• Max = 220
• Min = 3
• IQR = Q3 − Q1 = 50
• Lower end of possible data = Q1 − (1.5 × IQR) = 11 − (1.5 × 50) = −64
• Upper end of possible data = Q3 + (1.5 × IQR) = 61 + (1.5 × 50) = 136
• The data value 220 is an outlier. Draw the boxplot now. Is the data skewed?
Practice Questions: Quartiles & Boxplots
The five-number summary for the number of accounts managed by each sales
manager at ABC Inc. is shown in the following table.

  Min  Q1  Median  Q3  Max
  35   45  50      65  85

The five-number summary suggests that about 50% of sales managers at
ABC Inc. manage fewer than what number of accounts?
Practice Questions: Quartiles & Boxplots
Jason saves a portion of his salary from his part-time job in the hope of
buying a used car. He recorded the number of dollars he was able to save
over the past 15 weeks.
Dollars saved: 19, 12, 9, 7, 17, 10, 6, 18, 9, 14, 19, 8, 5, 17, 9
Draw the box plot.
Practice Questions: Quartiles & Boxplots
The distribution of daily average wind speed on an island over a period of 120 days is
displayed on this box-and-whisker diagram.
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
• Plots quantile information
• For data xᵢ sorted in increasing order, fᵢ indicates that approximately
100·fᵢ% of the data are below or equal to the value xᵢ
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• View: is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of
items sold at Branch 1 tend to be lower than those at Branch 2.
Exploring Bivariate Data
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points
  in the plane
Uncorrelated Data
(Figure: scatter plots of uncorrelated data.)
Statistical Descriptions of Data: A More Detailed Look
● They help us measure some very special properties of
the data
● One such property is the central tendency
○ Measuring the central tendency helps us know, where most of
the data lies taking into account the whole set of data
Central Tendency - Mean
● Mathematically, the mean of n values can be defined as:
    x̄ = (x₁ + x₂ + ... + xₙ) / n
Central Tendency - Median
● When our dataset has skewness, calculating the Median
could prove to be more beneficial than Mean
○ Median is defined as the centermost value of an ordered
numerical dataset
Central Tendency - Mode
● The mode for a set of data is the value that occurs most
frequently in the set
○ Hence, it can be calculated for both qualitative and quantitative
attributes
○ A dataset with two modes is known as bimodal
○ In general, a dataset with two or more modes is known as multimodal
Central Tendency – Mid-Range
● This is defined as the average of the largest and smallest values in the
  set of values:
    midrange = (max + min) / 2
Dispersion of the Data
● The dispersion of data means the spread of data
● Measuring the dispersion of data
○ Let x1, x2, x3…xn be a set of observations for some numeric
attribute, X
○ The following terms are used for measuring the dispersion of data:
■ Range
■ Quantile
■ Interquartile Range (IQR)
■ Variance and Standard Deviation
Range
● It is defined as the difference between the largest and
smallest values in the set
  5 8 9 4 3 2 7 12 15 6
  Range = 15 − 2 = 13
Quantiles
● These are points taken at regular intervals of data distribution,
dividing it into essentially equal-size consecutive sets.
2 3 4 5 7 9 11 13 15 22 24 27 30 31 35
The kth q-quantile for a given data distribution is the value x such that at most
k/q of the data values are less than x and at most (q−k)/q of the data values are
more than x, where k is an integer such that 0 < k < q. There are (q−1)
q-quantiles in total.
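For instance, for the 15 sorted values above, using the median-of-halves convention from the earlier examples, the 4-quantiles (quartiles) fall at Q1 = 5, Q2 = 13 (the median) and Q3 = 27.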
Quartile – 4 Quantiles
Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order
• Then cut the list into four equal parts
• The Quartiles are at the "cuts"
Interquartile Range (IQR)
● The distance between the first and third quartiles is a simple measure of
  spread that gives the range covered by the middle half of the data:
    IQR = Q3 − Q1
Variance & Standard Deviation
● The Standard Deviation is a measure of how spread out numbers are
● The Variance is defined as the average of the squared differences from the Mean
  ○ The variance of N observations, x₁, x₂, x₃, ..., x_N, for a numeric
    attribute X is:
      σ² = (1/N) Σ (xᵢ − x̄)²
  ○ The standard deviation σ is the square root of the variance
Standard Deviation – a Look
Outliers
● An Outlier is a data object that deviates significantly from the
rest of the objects as if it were generated by a different
mechanism
Outlier Example
What if we remove the outlier?
Outlier Detection using Box Plot
● A box and whisker plot — also called a box plot — displays five-
number summary of a set of data
● Five number summary
○ Minimum
○ First quartile (Q1)
○ Median
○ Third quartile (Q3)
○ Maximum
Handling Missing Values in the Dataset
● The data in the real world will obviously have a lot of missing
values
● Handling missing values:
○ Ignore the tuple with missing values
○ Use a measure of central tendency for the attribute to fill in the
missing value
○ Use prediction techniques to fill in the missing value
● Handling missing data is important as many machine learning
algorithms do not support data with missing values
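An illustrative pandas sketch of these options (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls": ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, 32.0, np.nan],
})

# Option 1: ignore (drop) tuples with missing values
dropped = df.dropna()

# Option 2: fill with a measure of central tendency (overall mean)
overall = df["income"].fillna(df["income"].mean())

# Option 2b: fill with the mean of the same class (per-class central tendency)
per_class = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
print(per_class)
```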
Removing noise from the data using the
Binning Technique
● What is defined as a noise in data?
○ Suppose that we have a dataset in which we have some
measured attributes
○ Now, these attributes might carry some random error or variance
○ Such errors in attribute values are called noise in the data
● If such errors persist in our data, our analysis will return inaccurate results
Binning Vs. Encoding
● For a machine learning model, the dataset needs to be
processed in the form of numerical vectors to train it using an
ML algorithm
○ Feature Binning: Conversion of a continuous variable to
categorical
○ Feature Encoding: Conversion of a categorical variable to
numerical features
Binning Technique
● The set of data values is sorted, grouped into “buckets” or “bins”, and
  then each value in a particular bin is smoothed using its neighbors
  ○ The binning method is said to do local smoothing because it consults
    nearby values to smooth the values of the attribute
Smoothing by bin means
● In this method, all the values of a particular bin are replaced by the
mean of the values of that particular bin
○ Mean of 4, 8, 15 = 9
○ Mean of 21, 21, 24 = 22
○ Mean of 25, 28, 34 = 29
Smoothing by bin medians
● In this method, all the values of a particular bin are replaced by the
median of the values of that particular bin
○ Median of 4, 8, 15 = 8
○ Median of 21, 21, 24 = 21
○ Median of 25, 28, 34 = 28
Smoothing by bin boundaries
● In this method, all the values of a particular bin are replaced by the
closest boundary of the values of that particular bin
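A minimal sketch of smoothing by bin means on the values used above, with equal-frequency bins of size 3, assuming NumPy:

```python
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted

bin_size = 3
smoothed = data.astype(float).copy()
for start in range(0, len(data), bin_size):
    b = slice(start, start + bin_size)
    smoothed[b] = data[b].mean()   # replace each value by its bin's mean

print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```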
Encoding
● Most ML algorithms cannot handle categorical variables, so feature
  encoding is important
● Common techniques: Label Encoding (for the target variable), Ordinal
  Encoding, Frequency Encoding, Binary Encoding and One-hot Encoding
  (for features)
Ordinal Encoding
● An ordinal encoding involves mapping each unique label to an integer value
  ○ This type of encoding is really only appropriate if there is a known
    ordering relationship between the categories
Frequency Encoding
● It transforms an original categorical variable to a numerical
variable by considering the frequency distribution of the data
○ It can be useful for nominal features
Binary Encoding
● Binary encoding first maps each category to an integer, then writes that
  integer in binary, producing one 0/1 column per binary digit
One hot encoding
● The one-hot encoding technique splits each category into its own column
  ○ It creates n different columns, one per category, placing a 1 in the
    column of the observed category and 0 in all the others
Target Encoding
● Target encoding is the process of replacing a categorical value with the
mean of the target variable
○ Any non-categorical columns are automatically dropped by the target
encoder model
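A compact pandas sketch of several of these encodings on a made-up table:

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M", "S"],
                   "color": ["red", "blue", "red", "green", "blue"],
                   "price": [10, 15, 20, 14, 9]})

# Ordinal encoding: explicit integer mapping that respects the known order
df["size_ord"] = df["size"].map({"S": 0, "M": 1, "L": 2})

# Frequency encoding: replace each category by how often it occurs
df["color_freq"] = df["color"].map(df["color"].value_counts())

# One-hot encoding: one 0/1 column per category value
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target encoding: replace a category by the mean of the target ("price")
df["color_target"] = df.groupby("color")["price"].transform("mean")

print(pd.concat([df, one_hot], axis=1))
```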
Feature Scaling
● Feature scaling means adjusting data that has different
scales so as to avoid biases from big outliers
○ It standardizes the independent features present in the data in
a fixed range
Why Feature Scaling?
● Machine learning algorithm works on numbers and has
no knowledge of what that number represents
○ Many ML algorithms perform better when numerical input variables are
  scaled to a standard range, which makes feature scaling a crucial part
  of the data preprocessing stage
Will Feature Scaling Work for all ML Algorithms?
Why Feature Scaling?
● Tree-Based Algorithms
○ They are fairly insensitive to the scale of the features
○ Think about it, a decision tree is only splitting a node based on
a single feature
○ This split on a feature is not influenced by other features
Feature Scaling Categories
● Two categories: Normalization and Standardization
Normalization
● A scaling technique in which values are shifted and rescaled so that they
  end up ranging between 0 and 1
  ○ It is also known as Min-Max scaling
  ○ Here’s the formula for normalization:
      X′ = (X − X_min) / (X_max − X_min)
Standardization
● Standardization is another scaling technique where the
values are centered around the mean with a unit
standard deviation
○ This means that the mean of the attribute becomes zero and
the resultant distribution has a unit standard deviation
  ○ Here’s the formula for standardization:
      X′ = (X − μ) / σ
Normalization or Standardization?
● Normalization is good to use when you know that the distribution
of your data does not follow a Gaussian distribution
● Standardization, on the other hand, can be helpful in cases where
the data follows a Gaussian distribution
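As a small sketch of both techniques with scikit-learn (the toy feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50.0], [20.0], [30.0], [40.0], [100.0]])  # toy feature

# Normalization (min-max scaling): values end up in [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# Standardization (z-score): mean 0, unit standard deviation
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel())  # [0.375 0.    0.125 0.25  1.   ]
print(X_std.ravel())   # centered around 0
```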
Covariance
● Variables may change in relation to each other
● Covariance measures how much the movement in one
variable predicts the movement in a corresponding variable
Smoking v Lung Capacity Data
Calculating Covariance
    cov(X, Y) = (1/n) Σ (xᵢ − x̄)(yᵢ − ȳ)
Calculating Correlation
    r = cov(X, Y) / (σ_X · σ_Y), which always lies between −1 and +1
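A short NumPy sketch; the smoking/lung-capacity numbers here are made-up illustrative values, not the data from the slide:

```python
import numpy as np

# Hypothetical smoking vs. lung-capacity style data (illustrative only)
x = np.array([0, 5, 10, 15, 20])      # cigarettes per day
y = np.array([45, 42, 33, 31, 29])    # lung capacity

cov = np.cov(x, y, bias=True)[0, 1]   # population covariance
r = np.corrcoef(x, y)[0, 1]           # Pearson correlation

print(cov)  # negative: more smoking, less capacity
print(r)    # close to -1: strong negative linear relationship
```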
Correlation is not Causation
Correlation Is Not Good at Curves