
Introduction to

MACHINE LEARNING
23AIE233M INTRODUCTION TO MACHINE LEARNING L-T-P-C: 2 1 3 4

Course Objectives

• This course covers the basics of machine learning.
• This course will enable students to work with various types of data and their preprocessing techniques.
• Students will learn about supervised and unsupervised learning.
• Students will gain hands-on experience implementing various machine learning algorithms.

Course Outcomes

• Apply pre-processing techniques to prepare the data for machine learning applications
• Implement supervised machine learning algorithms for different datasets
• Implement unsupervised machine learning algorithms for different datasets
• Identify the appropriate machine learning algorithms for different applications
Syllabus
Unit 1
• Introduction to Machine Learning – Data and Features – Machine Learning Pipeline: Data Preprocessing:
Standardization, Normalization, Missing data problem, Data imbalance problem – Data visualization - Setting
up training, development and test sets – Cross validation – Problem of Overfitting, Bias vs Variance -
Evaluation measures – Different types of machine learning: Supervised learning, Unsupervised learning.

Unit 2
• Supervised learning - Regression: Linear regression, logistic regression – Classification: K-Nearest Neighbor,
Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine, Perceptron.

Unit 3
• Unsupervised learning – Clustering: K-means, Hierarchical, Spectral, subspace clustering, Dimensionality
Reduction Techniques, Principal component analysis, Linear Discriminant Analysis.
Text Books:
• Andrew Ng. Machine Learning Yearning. 2017. URL: https://www.mlyearning.org/
• Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts, 2012.
• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2010.

Reference Books:
• Richard O. Duda, Peter E. Hart, David G. Stork. Pattern Classification. Second Edition, Wiley, 2007.
• Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
Unit 1
• Introduction to Machine Learning – Data and Features – Machine Learning Pipeline:
Data Preprocessing: Standardization, Normalization, Missing data problem, Data
imbalance problem – Data visualization - Setting up training, development and test sets
– Cross validation – Problem of Overfitting, Bias vs Variance - Evaluation measures –
Different types of machine learning: Supervised learning, Unsupervised learning.

Slides from Dr. M. Anbazhagan, CSE and others


Man Vs. Machine

What is AI?
"The simulation of human intelligence in machines that are programmed
to think like humans and mimic their actions." (Techopedia)

What is Machine Learning?


Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to imitate the way that humans
learn, gradually improving its accuracy. - IBM

Machine Learning algorithms enable computers to learn from data, and even
improve themselves, without being explicitly programmed. – Arthur Lee Samuel
What is Machine Learning?

AI Vs. ML

Machine Learning Evolution

Traditional vs ML

Traditional programming:
• Data → Information
• Handles definitive cases
• E.g., matrix multiplication

Machine learning:
• Data → Knowledge
• Deals with uncertainty
• E.g., next-word prediction
Machine Learning Terminologies
● ML Model
○ The learned program that maps inputs to predictions
○ Alternate Name: Predictor/Classifier/Regression Model

(Diagram: unseen input → ML model → predictions)
Machine Learning Types

Applications of Machine Learning

Components of Machine Learning
Why Machine Learning?

• Voluminous data
• Computational power
• Powerful algorithms
Getting to Know Your Data

• Types of Data Sets

• Important Characteristics of Structured Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data


Types of Data Sets: (1) Record Data
• Relational records
• Relational tables, highly structured
• Data matrix, e.g., numerical matrix, crosstabs

• Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

• Document data:
• Term-frequency vector (matrix) of text documents

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6    0     2      0        2
Document 2     0     7      0     2     1      0    0     3      0        0
Document 3     0     1      0     0     1      2    2     0      3        0
Types of Data Sets: (2) Graphs and Networks

• Transportation network

• World Wide Web

• Molecular structures

• Social or information networks
Types of Data Sets: (3) Ordered Data
• Video data: sequence of images

• Temporal data: time-series

• Sequential data: transaction sequences

• Genetic sequence data
Types of Data Sets: (4) Spatial, image and multimedia Data

• Spatial data: maps

• Image data

• Video data
Important Characteristics of
Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale

• Distribution
• Centrality and dispersion
Data Types

Data types in the world of Machine Learning: Categorical and Numerical

• Categorical data generally means everything else, in particular discrete labeled groups
• Numerical data is used to mean anything represented by numbers
Stevens’ typology of measurement scales

• Ratio: equal spaces between values; a meaningful zero value; the mean makes sense. Examples: age, height, weight.
• Interval: equal spaces between values; no meaningful zero value; the mean makes sense. Examples: temperature (Celsius/Fahrenheit), IQ, credit score.
• Nominal: variables that are non-numeric or where the numbers have no value; mean and median are meaningless. Examples: gender, ethnicity, eye color, blood type.
• Ordinal: no numerical relationship between the different categories, but allows for rank order; the median makes sense. Examples: income level, level of agreement.
Nominal/Ordinal Examples

Nominal

Ordinal

A More Detailed Taxonomy
Types of Data

• Quantitative (numerical): Discrete or Continuous – measured on Interval or Ratio scales
• Qualitative (nonnumerical): Nominal or Ordinal scales
Quantitative Vs. Qualitative
● Quantitative data seem to be the easiest to explain; they try to
find the answers to questions such as
○ “how many”, “how much” and “how often”
● It can be expressed as a number, so it can be quantified
Quantitative Vs. Qualitative
● Qualitative data can’t be expressed as a number, so it can’t be
measured
○ It mainly consists of words, pictures, and symbols, but not
numbers
● These can answer questions like:
○ “how did this happen” or “why did this happen”
Categorical Data
● Categorical data represents characteristics.
○ Therefore it can represent things like a person’s gender,
language etc.
○ Categorical data can also take on numerical values (Example: 1
for female and 0 for male)
● Two types of categorical data
○ Nominal
○ Ordinal

Numerical - Discrete
● We speak of discrete data if its values are distinct and
separate
○ In other words: We speak of discrete data if the data can only
take on certain values
○ This type of data can’t be measured but it can be counted
○ It basically represents information that can be categorized into a
classification
○ Example:
■ The number of students in a class
■ The number of workers in a company
■ The number of test questions you answered correctly

Good data
preparation is key
to producing valid
and reliable models
Data Preprocessing
Many methods have been developed, but this is still an active area
of research
Traditional ML
Workflow
Why is Data Preprocessing
Important?
■ Without quality data, results are less accurate!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
■ Data preparation, cleaning, and transformation comprise the
majority of the work (roughly 90%) in a data mining or machine
learning application.
Data Preprocessing

■ Why preprocess the data?

■ Data cleaning

■ Data integration and transformation

■ Data reduction

■ Discretization

■ Summary
Why Data Preprocessing?

■ Data in the real world is dirty


• Incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“”
• Noisy: containing errors or outliers (deviates from actual data)
• e.g., Salary=“-10”
• Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Major Tasks in Data Preprocessing
■ Data Cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers and noisy data,
and resolve inconsistencies

■ Data Integration
• Integration of multiple databases, or files

■ Data Transformation
• Normalization and aggregation

■ Data Reduction
• Obtains reduced representation in volume but produces the same or similar analytical
results

■ Data Discretization
• Automatic generation of concept hierarchies from numerical data
Data Cleaning

■ Data cleaning tasks


• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data (Data is not always
available )
■ Missing data may be due to
• Equipment malfunction
• Inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• History or changes of the data not registered
Noisy Data (Random error or variance in a measured variable)
● Incorrect attribute values may be due to
○ faulty data collection instruments
○ data entry problems
○ data transmission problems
○ technology limitation
○ inconsistency in naming convention

● Other data problems which require data cleaning


○ duplicate records
○ incomplete data
○ inconsistent data
Missing Data Example
Bank Acct Totals – Historical

Name         SSN          Address                 Phone #        Date        Acct Total
John Doe     111-22-3333  1 Main St, Bedford, Ma  111-222-3333   2/12/1999   2200.12
John W. Doe               Bedford, Ma                            7/15/2000   12000.54
John Doe     111-22-3333                                         8/22/2001   2000.33
James Smith  222-33-4444  2 Oak St, Boston, Ma    222-333-4444   12/22/2002  15333.22
Jim Smith    222-33-4444  2 Oak St, Boston, Ma    222-333-4444               12333.66
Jim Smith    222-33-4444  2 Oak St, Boston, Ma    222-333-4444
How to Handle Missing Data?

● Ignore the tuple: usually done when the class label is missing (assuming the task
is classification); not effective when the percentage of missing values per
attribute varies considerably.
● Fill in the missing value manually: tedious + infeasible?
● Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
(The three approaches above are generally considered BAD PRACTICE.)
How to Handle Missing Data?

● Use the attribute mean/median to fill in the missing value

● Use the attribute mean for all samples belonging to the same class to fill in the

missing value: smarter

● Use the most probable value to fill in the missing value: inference-based such as

Bayesian formula or decision tree
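A minimal sketch of these strategies with pandas (the column names "age", "income", and "class" are illustrative, not from the slides):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 47, 52],
                   "income": [50.0, 64.0, None, 80.0],
                   "class": ["A", "A", "B", "B"]})

df_drop = df.dropna()                             # ignore (drop) incomplete tuples
df_fill = df.fillna({"age": df["age"].mean()})    # fill with the attribute mean
# smarter: fill using the mean of samples belonging to the same class
df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))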


Simple Discretization Methods: Binning
● Equal-width (distance) partitioning:
○ It divides the range into N intervals of equal size: uniform grid
○ If A and B are the lowest and highest values of the attribute, the width of intervals will be:
W = (B − A)/N.
○ The most straightforward approach
○ But outliers may dominate the presentation
○ Skewed data is not handled well.

● Equal-depth (frequency) partitioning:

○ It divides the range into N intervals, each containing approximately the same number of
samples
○ Good data scaling
○ Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
Equal-width
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins:
- Bin 1 (4-14): 4, 8, 9
- Bin 2(15-24): 15, 21, 21, 24
- Bin 3(25-34): 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data Smoothing
Equal-depth
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
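A minimal sketch of both partitionings with numpy, mirroring the price example above (bin edges are treated as inclusive on the right, as in the slide):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# equal-width edges: W = (max - min) / N
n_bins = 3
edges = np.linspace(prices.min(), prices.max(), n_bins + 1)   # [4, 14, 24, 34]
bin_idx = np.clip(np.digitize(prices, edges, right=True) - 1, 0, n_bins - 1)

smoothed = prices.astype(float)
for b in range(n_bins):
    mask = bin_idx == b
    smoothed[mask] = prices[mask].mean()   # replace each value by its bin mean
print(smoothed.round(1))

# equal-depth: split the sorted array into chunks of (approximately) equal size
for chunk in np.array_split(np.sort(prices), n_bins):
    print(chunk, "-> mean", round(chunk.mean(), 1))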
Cluster Analysis
Allows detection and removal of outliers
Regression
(Figure: data point (X1, Y1) smoothed to (X1, Y1′) on the regression line y = x + 1)

Linear regression – find the best line to fit two variables and use the regression function to smooth data
Data Integration
(combines data from multiple sources into a coherent store )

● Schema integration
○ integrate metadata from different sources

○ Entity identification problem: identify real world entities from multiple data sources, e.g.,
A.cust-id ≡ B.cust-#

● Detecting and resolving data value conflicts


○ for the same real world entity, attribute values from different sources are different

○ possible reasons: different representations, different scales, e.g., metric vs. British units
Web Information Integration
■ Many integration tasks,
• Integrating Web query interfaces (search forms)
• Integrating ontologies (taxonomy)
• Integrating extracted data
• Integrating textual information
• E.g., entity linking, paraphrasing, etc.

■ E.g., integration of query interfaces.


• Many web sites provide forms to query deep web
• Applications: meta-search and meta-query
Redundant Data in Data Integration

● Redundant data occurs often when multiple databases are integrated


○ The same attribute may have different names in different databases
○ One attribute may be a “derived” attribute in another table, e.g., annual revenue
Handling Redundant Data
● Redundant data may be able to be detected by correlation analysis
○ Correlation coefficient for numeric data
○ Chi ‐square test for categorical data

● Careful integration of the data from multiple sources may help


reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
Chi-square Test for Categorical Data

The chi-square (χ²) statistic tests whether two categorical attributes are independent:

χ² = Σ (Observed − Expected)² / Expected

where the expected counts are derived from the row and column totals under the independence
assumption; a large χ² (small p-value) suggests the attributes are correlated.
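A hedged sketch of the test with scipy (the 2×2 contingency counts are illustrative):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # e.g., rows = one categorical attribute,
                     [ 50, 1000]])   # columns = another
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p={p_value:.3g}, dof={dof}")
# a small p-value means the two attributes are unlikely to be independent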
Correlation Coefficient for Numeric Data

• r_{A,B} = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) · sd_A · sd_B)

where Ā = (Σ aᵢ)/n is the mean value of A, and sd_A = sqrt(Σ (aᵢ − Ā)² / (n − 1)) is the
standard deviation of A (likewise for B).

• r < 0: negatively correlated; r = 0: no correlation; r > 0: positively correlated – if A and B
are strongly correlated, consider removal of A or B.
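A minimal sketch computing r with numpy (the arrays a and b are illustrative):

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.1, 2.3, 2.9, 4.2, 4.8])

r = np.corrcoef(a, b)[0, 1]                          # library version
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (len(a) - 1) * a.std(ddof=1) * b.std(ddof=1))    # the formula above
print(round(r, 4), round(r_manual, 4))               # the two agree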
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score (zero mean) normalization
• normalization by decimal scaling

• Attribute/feature construction
• New attributes constructed from the given ones to help in the data mining
process
Data Transformation: Normalization
• min-max normalization

  v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

• Example – income: min $55,000, max $150,000 – map to 0.0 – 1.0
• $73,600 is transformed to:
  (73,600 − 55,000) / (150,000 − 55,000) × (1.0 − 0) + 0 ≈ 0.196
Data Transformation: Normalization
• z-score normalization

  v′ = (v − mean_A) / stand_dev_A

• Example – income: mean $33,000, sd $11,000
• $73,600 is transformed to:
  (73,600 − 33,000) / 11,000 ≈ 3.69
Data Transformation: Normalization
• normalization by decimal scaling

  v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1

• Example – recorded values: −722 to 821
• Divide each value by 1,000
• −722 normalizes to −0.722
• 821 normalizes to 0.821
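A minimal sketch of all three methods with numpy (the income values reuse the slide's examples):

import numpy as np

v = np.array([55_000.0, 73_600.0, 150_000.0])

# min-max normalization to [0.0, 1.0]
minmax = (v - v.min()) / (v.max() - v.min())   # 73,600 -> ~0.196

# z-score normalization (mean and sd from the slide's example)
zscore = (v - 33_000.0) / 11_000.0             # 73,600 -> ~3.69

# decimal scaling: divide by 10^j so that max(|v'|) < 1
x = np.array([-722.0, 821.0])
j = int(np.ceil(np.log10(np.abs(x).max())))    # j = 3 here
scaled = x / 10**j                             # -0.722, 0.821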
Attribute Construction
• New attributes are constructed from given attributes
and added in order to help improve accuracy and
understanding of structure in high ‐dimension data
• Example – Add the attribute area based on the
attributes height and width
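A tiny sketch of attribute construction with pandas (illustrative column names):

import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.0], "width": [4.0, 5.0]})
df["area"] = df["height"] * df["width"]   # new attribute derived from existing ones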
Data Preprocessing

■ Why preprocess the data?

■ Data cleaning

■ Data integration and transformation

■ Data reduction

■ Discretization

■ Summary
Data Reduction Strategies
■ Data is too big to work with
■ Data reduction
■obtain a reduced representation of the data set that is much smaller in
volume, yet closely maintains the integrity of the original data
■ Data reduction strategies
• (Data Cube)Aggregation
• Attribute (Subset) Selection
• Dimensionality Reduction
• Numerosity Reduction
• Data Discretization
• Concept Hierarchy Generation
Data Cube Aggregation
■Summarize (aggregate) data based on dimensions
■The resulting data set is smaller in volume, without loss of
information necessary for analysis task
■Concept hierarchies may exist for each attribute, allowing
the analysis of data at multiple levels of abstraction
• Data Aggregation

• Data Cube
■ Provide fast access to pre‐computed,
summarized data, thereby benefiting on‐line
analytical processing as well as data mining
Attribute Subset Selection
■ Attribute selection can help in the phases of data mining (knowledge discovery) process
■ By attribute selection,
■ we can improve data mining performance (speed of learning, predictive accuracy, or
simplicity of rules)
■ we can visualize the data for model selected
■ we reduce dimensionality and remove noise.
■ Attribute (Feature) selection is a search problem
■ Search directions
■ (Sequential) Forward selection
■ (Sequential) Backward selection (elimination)
■ Bidirectional selection
■ Decision tree algorithm (induction)
Attribute Subset Selection
■ Search strategies
■ Exhaustive search
■ Heuristic search
■ Selection criteria
■ Statistic significance
■ Information gain
■ etc.
Data Compression
 String compression
 There are extensive theories and well-tuned
algorithms
 Typically lossless
 But only limited manipulation is possible without
expansion
 Audio/video compression
 Typically lossy compression, with progressive
refinement
 Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
 Time sequence is not audio
 Typically short and vary slowly with time
Data Compression

(Figure: lossless compression reproduces the original data exactly; lossy compression yields an approximation)

Wavelet Transforms (e.g., Haar-2, Daubechies-4)
 Discrete wavelet transform (DWT): linear signal processing
 Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (pad with 0s when necessary)
 Each transform has 2 functions: smoothing, difference
 Applied to pairs of data, resulting in two sets of data of length L/2
 The two functions are applied recursively until the desired length is reached
Principal Component Analysis
 Given N data vectors from k dimensions, find c ≤ k orthogonal vectors that can best
be used to represent the data
 The original data set is reduced to one consisting of N data vectors on c principal
components (reduced dimensions)
 Each data vector is a linear combination of the c principal component vectors
 Works for numeric data only
 Used when the number of dimensions is large
Principal Component Analysis

(Figure: data in the X1–X2 plane with principal directions Y1 and Y2)
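A minimal sketch of PCA with scikit-learn on random numeric data (N=100 vectors, k=4 dimensions, reduced to c=2 components):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # N data vectors from k dimensions

pca = PCA(n_components=2)               # find c <= k orthogonal components
X_reduced = pca.fit_transform(X)        # N vectors on c principal components
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured per component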
Numerosity Reduction
 Parametric methods
 Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
 Log-linear models: obtain value at a point in m-D space as the
product on appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
Regression and Log-Linear
Models
 Linear regression: Data are modeled to fit a straight line
 Often uses the least-square method to fit the line

 Multiple regression: allows a response variable Y to be


modeled as a linear function of multidimensional feature vector
 Log-linear model: approximates discrete multidimensional
probability distributions
Regression Analysis and Log-Linear Models
Histograms
 A popular data reduction technique
 Divide data into buckets and store the average (sum) for each bucket
 Can be constructed optimally in one dimension using dynamic programming
 Related to quantization problems.
Clustering

 Partition data set into clusters, and one can store cluster
representation only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
Sampling
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the
presence of skew
 Develop adaptive sampling methods
 Stratified sampling:
 Approximate the percentage of each class (or subpopulation of

interest) in the overall database


 Used in conjunction with skewed data

 Sampling may not reduce database I/Os (page at a time).
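A minimal sketch contrasting simple random and stratified sampling with scikit-learn (the skewed 90/10 class labels are made up):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)     # skewed subpopulations

# simple random 10% sample: class mix may drift under skew
Xr, _, yr, _ = train_test_split(X, y, train_size=0.1, random_state=0)

# stratified 10% sample: approximates each class's overall percentage
Xs, _, ys, _ = train_test_split(X, y, train_size=0.1, stratify=y, random_state=0)
print(yr.mean(), ys.mean())             # the stratified sample stays near 0.10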


Sampling

(Figure: SRSWOR – simple random sampling without replacement – and SRSWR – simple random sampling with replacement – drawn from the raw data)

(Figure: raw data vs. a cluster/stratified sample)
Hierarchical Reduction
 Use multi-resolution structure with different
degrees of reduction
 Hierarchical clustering is often performed but tends
to define partitions of data sets rather than
“clusters”
 Parametric methods are usually not amenable to
hierarchical representation
 Hierarchical aggregation
 An index tree hierarchically divides a data set into
partitions by value range of some attributes
 Each partition can be considered as a bucket
 Thus an index tree with aggregates stored at each
node is a hierarchical histogram
Discretization
 Three types of attributes:
 Nominal — values from an unordered set
 Ordinal — values from an ordered set
 Continuous — real numbers
 Discretization:
 divide the range of a continuous attribute into
intervals
 Some classification algorithms only accept

categorical attributes.
 Reduce data size by discretization

 Prepare for further analysis


Discretization and Concept Hierarchy
 Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values.
 Concept hierarchies
 reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy
generation for numeric data
 Binning

 Histogram analysis

 Clustering analysis

 Entropy-based discretization

 Segmentation by natural
partitioning
Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T,
the entropy after partitioning is E(S, T) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2); the
boundary that minimizes the entropy over all candidate boundaries is selected for binary
discretization, and the process may be applied recursively.
Segmentation by natural
partitioning
3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width
intervals
* If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most significant
digit, partition the range into 5 intervals
Concept hierarchy generation for
categorical data
 Specification of a partial ordering of attributes explicitly at
the schema level by users or experts
 Specification of a portion of a hierarchy by explicit data
grouping
 Specification of a set of attributes, but not of their partial
ordering
 Specification of only a partial set of attributes
Specification of a set of
attributes
Concept hierarchy can be automatically generated based on the
number of distinct values per attribute in the given attribute
set. The attribute with the most distinct values is placed at the
lowest level of the hierarchy.

country 15 distinct values

province_or_ state 65 distinct values

city 3567 distinct values

street 674,339 distinct values


Basic Statistical Descriptions of Data
• Why
• To get an overall picture of the data, basic statistical descriptions are used in data
analysis
• Statistical metrics can tell us whether issues exist, such as extreme outliers or large
deviations in the values of attributes
• Outliers:
• Data values that differ significantly from other values
• An outlier affects the mean value of the data but has little effect on the median and
mode.
Basic Statistical Descriptions of Data
• What
• Measure of central tendency
• Mean, Median and mode
• Location of the centre of a data distribution
• Where do most of the attribute values fall?
• Dispersion Measure
• Range, quartiles, inter quartile range, five number summary and box plots ,
variance and standard deviation
• It describes how are the data spread out.

104
Descriptive Statistics

Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
 Median, max, min, quantiles, outliers, variance, ...
 Numerical dimensions correspond to sorted intervals
 Data dispersion:
 Analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency: (1)
Mean
• Mean (algebraic measure) (sample vs. population):

  x̄ = (1/n) Σᵢ xᵢ          μ = (Σ x) / N

  Note: n is sample size and N is population size.

• Weighted arithmetic mean:

  x̄ = (Σᵢ wᵢ xᵢ) / (Σᵢ wᵢ)

• Trimmed mean:
• Chopping extreme values (e.g., Olympics gymnastics score computation)
Measuring the Central Tendency:
(1) Mean
• Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order:
• 30, 31, 47, 50, 52, 52, 56, 60, 63, 70, 70

Estimated mean = (30 + 31 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70) / 11
               = 581 / 11
               ≈ 52.81
Measuring the Central Tendency: (2)
Median
• Median

• The median is nothing more than the middle value of your observations
when they are ordered from the smallest to the largest.

It involves two steps:

 Order your cases from smallest to largest

 Find the middle value
Measuring the Central Tendency: (2)
Median
• Median:
• Middle value if odd number of values, or average of the middle two values otherwise
• Estimated by interpolation (for grouped data):

Estimated Median = L + ((n/2 − B) / G) × w

where:
L is the lower class boundary of the group containing the median
n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width (difference between the upper and lower class boundaries of the median class)
Measuring the Central Tendency: (2) Median

• For our example:


• The median is the middle value, which in our case is the 11th one, which is in
the 61 - 65 group:
• We can say "the median group is 61 - 65“
• L = 60.5
• n = 21
• B = 2 + 7 = 9
• G = 8
• w = 5
• Estimated median = 60.5 + ((21/2 − 9) / 8) × 5 ≈ 61.44
Measuring the Central Tendency: (3) Mode

• Mode
• Value that occurs most frequently in the
data
• Unimodal, bimodal, multimodal

Measuring the Central Tendency: (3)
Mode
• Mode: value that occurs most frequently in the data

• Unimodal
• Empirical formula: mean − mode ≈ 3 × (mean − median)

• Multi-modal
• Bimodal
• Trimodal
Measuring the Central Tendency: (3)
Mode
• Mode for grouped data:

  Mode = L + ((f1 − f0) / (2·f1 − f0 − f2)) × w

where L is the lower boundary of the modal class, f1 its frequency, f0 and f2 the frequencies
of the classes preceding and following it, and w the class width.
Symmetric vs. Skewed Data
• Data can be "skewed", meaning it tends to
have a long tail on one side or the other
• Median, mean and mode of symmetric,
positively and negatively skewed data

(Figure: symmetric/no-skew, positively skewed, and negatively skewed distributions)
When to use what measurement of central tendency ??

• If data is nominal, it is impossible to calculate the mean or median, so go
for the mode (for ordinal data, the median is also meaningful).

• If your data is quantitative, then go for the mean or median.

• Basically, if your data has some influential outliers or is highly skewed,
then the median is the best measure of central tendency;
otherwise go for the mean.
Practice Questions: Mean
• The grade 10 math class recently had a mathematics test and the
grades were as follows: 78, 66, 82, 89, 75, 74

Mean = 464 / 6 ≈ 77.3
Practice Questions: Mean
• The following table shows the number of plants in 20 houses in a
group. Find the mean number of plants per house.

Number of Plants | 0-2 | 2-4 | 4-6 | 6-8 | 8-10 | 10-12 | 12-14
Number of Houses |  1  |  2  |  2  |  4  |  6   |   2   |   3

Number of Plants | Number of Houses (fi) | Class Mark (xi) | fi·xi
0-2    | 1 | 1  | 1 × 1 = 1
2-4    | 2 | 3  | 2 × 3 = 6
4-6    | 2 | 5  | 2 × 5 = 10
6-8    | 4 | 7  | 4 × 7 = 28
8-10   | 6 | 9  | 6 × 9 = 54
10-12  | 2 | 11 | 2 × 11 = 22
12-14  | 3 | 13 | 3 × 13 = 39
∑fi = 1 + 2 + 2 + 4 + 6 + 2 + 3 = 20
∑fi·xi = 1 + 6 + 10 + 28 + 54 + 22 + 39 = 160
Therefore, mean = ∑(fi·xi) / ∑fi = 160/20 = 8 plants
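A minimal sketch computing the grouped mean from the table above in Python:

bins = [(0, 2, 1), (2, 4, 2), (4, 6, 2), (6, 8, 4),
        (8, 10, 6), (10, 12, 2), (12, 14, 3)]            # (lower, upper, frequency)

total_f = sum(f for _, _, f in bins)
total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in bins)  # class mark = bin midpoint
print(total_fx / total_f)                                # 160/20 = 8.0 plants per house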
Practice Questions: Median
• The grade 10 math class recently had a mathematics test and the grades
were as follows: 78, 66, 82, 89, 75, 74

• Arrange in order
66 74 75 78 82 89

75 + 78 = 153

• Median = 153 / 2 = 76.5


Practice Questions: Mode
• Find the mode of the following data:

78 56 68 92 84 76 74 56 68 66 78 72 66
65 53 61 62 78 84 61 90 87 77 62 88 81

Mode
• Find the mode of the following data:

78 56 68 92 84 76 74 56 68 66 78 72 66
65 53 61 62 78 84 61 90 87 77 62 88 81

• Counting frequencies: 78 occurs three times, more often than any other value
• Mode = 78
Measure of Dispersion
• Statistics are very important for observations, analysis and
mathematical prediction models. With the help of statistics we can
know what happened in the past and what may occur in the future.
• Central tendency measures do not reveal the variability present in the
data.
• Dispersion is the scatteredness of the data series around its average.
• Dispersion is the extent to which values in a distribution differ from
the average of the distribution.
• A measure of statistical dispersion is a nonnegative real number that is
zero if all the data are the same and increases as the data become
more diverse.
Range
• Range = maximum value − minimum value
Range for grouped data
• The range of a sample of data organized in a frequency distribution is
computed by the following formula:
• Range = upper limit of the last class - lower limit of the first class

Measures Data Dispersion: Variance and
Standard Deviation

• Variance (sample): s² = (1/(n − 1)) Σᵢ (xᵢ − x̄)²;  (population): σ² = (1/N) Σᵢ (xᵢ − μ)²

• Standard deviation s (or σ) is the square root of the variance s² (or σ²)
Conceptual and Computational formula

Variance/Standard Deviation for Grouped Data

Properties of Normal Distribution Curve
(Figure: the width of the curve represents data dispersion/spread; the center represents central tendency)


Properties of Normal Distribution Curve

• The normal (distribution) curve


• From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of it
• From μ–3σ to μ+3σ: contains about 99.7% of it

Graphic Displays of Basic Statistical
Descriptions
• Boxplot: graphic display of the five-number summary
• Histogram: x-axis shows the values, y-axis represents frequencies
• Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of
the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the
plane
Measuring the Dispersion of Data: Quartiles &
Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th
percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3,
max
• Boxplot: Data is represented with a box
• Q1, Q3, IQR: The ends of the box are at the first and
third quartiles, i.e., the height of the box is IQR
• Median (Q2) is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
 Outliers: points beyond a specified outlier threshold, plotted individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
The 5 Number Summary
• The five number summary is another name for the visual
representation of the box and whisker plot.

• The five number summary consists of:

• The minimum value in the data set
• The 1st quartile
• The median (2nd quartile)
• The 3rd quartile
• The maximum value in the data set
Box and Whisker Diagrams.

Anatomy of a Box and Whisker Diagram.

(Figure: whisker from the lowest value to the lower quartile; box from the lower quartile through the median to the upper quartile; whisker from the upper quartile to the highest value; number line from 4 to 12)
Graphing The Data
• Notice, the Box includes the lower quartile, median, and upper quartile.
• The Whiskers extend from the Box to the max and min.

Measuring the Dispersion of Data: Quartiles &
Boxplots
A sample of 10 boxes of raisins has these weights (in grams):
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Make a box plot of the data.
Step 1: Order the data from smallest to largest.
Our data is already in order. 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Step 2
Find the median.
The median is the mean of the middle two numbers:
(30 + 34) / 2 = 32, so median = 32
Step 3: Find the quartiles.
The first quartile is the median of the data points to the left of the median.
Q1=29
The third quartile is the median of the data points to the right of the median.
Q3=35
Step 4: Complete the five-number summary by finding the min and the max.
The min is the smallest data point, which is 25.
The max is the largest data point, which is 38.
The five-number summary is 25, 29, 32, 35, 38.
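A minimal sketch with numpy (np.percentile interpolates, so results can differ slightly from the by-hand median-of-halves method, though not on this data):

import numpy as np

w = np.array([25, 28, 29, 29, 30, 34, 35, 35, 37, 38])
five_num = [w.min(), *np.percentile(w, [25, 50, 75]), w.max()]
print(five_num)   # [25, 29.0, 32.0, 35.0, 38]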

Measuring the Dispersion of Data: Quartiles &
Boxplots
A sample of 10 boxes of raisins has these weights (in grams):
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Make a box plot of the data.

About 75 percent of the boxes of raisins weighed more than 29 grams
(29 g is the first quartile, Q1).
Constructing a box and whisker plot : Example 2
• Step 1 - take the set of numbers given…34, 18, 100, 27, 54, 52, 93, 59, 61, 87, 68, 85, 78, 82, 91
Place the numbers in order from least to greatest:18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87,
91, 93, 100
• Step 2 - Find the median. Remember, the median is the middle value in a data set. 18, 27, 34, 52,
54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100
68 is the median of this data set.
• Step 3 – Find the lower quartile. The lower quartile is the median of the data set to the left of 68.
(18, 27, 34, 52, 54, 59, 61,) 68, 78, 82, 85, 87, 91, 93, 100
52 is the lower quartile
• Step 4 – Find the upper quartile.The upper quartile is the median of the data set to the right of 68.
18, 27, 34, 52, 54, 59, 61, 68, (78, 82, 85, 87, 91, 93, 100)
87 is the upper quartile
• Step 5 – Find the maximum and minimum values in the set. The maximum is the greatest value in the
data set. The minimum is the least value in the data set. 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87,
91, 93, 100. 18 is the minimum and 100 is the maximum.
Constructing a box and whisker plot : Example 2

• Step 6 – Find the inter-quartile range (IQR).


The inter-quartile (IQR) range is the difference between the upper and lower quartiles.
 Upper Quartile = 87
 Lower Quartile = 52
 87 – 52 = 35
 35 = IQR
• Organize the 5 number summary
• Median – 68
• Lower Quartile – 52
• Upper Quartile – 87
• Max – 100
• Min – 18
Draw the Boxplot now !!!
Constructing a box and whisker plot : Example 3

2, 4, 5, 6, 7 ,8, 9, 11, 19, 20

• Median = 7.5
• Lower Quartile = 5
• Upper Quartile = 11
• Upper Extreme = 20
• Lower Extreme = 2
Draw the Boxplot now !!!
Is the data skewed???

Interpreting the Box Plot:

Interpreting the Box Plot:

Symmetric: If a box and whisker plot is symmetric, the median is equidistant from
the minimum and the maximum.

Negatively Skewed: If a box and whisker plot is negatively skewed, the distance from
the median to the minimum is greater than the distance from the median to the
maximum.

Positively Skewed: If a box and whisker plot is positively skewed, the distance from
the median to the maximum is greater than the distance from the median to the
minimum.
You should include the following in your interpretation:

• Range or spread of the data and what it means to your graph


• Quartiles—compare them. What are they telling you about the data?
• Median- this is an important part of the graph, and should be an
important part of the interpretation.
• Percentages should be used to interpret the data, where relevant.

Analyzing The Graph
• The data values found inside the box represent the middle half ( 50%)
of the data.
• The line segment inside the box represents the median

Compute 5 Number Summary and outlier detection

Data: 3, 7, 11, 11, 15, 21, 23, 39, 41, 45, 50, 61, 87, 99, 220
• Median - 39
• Lower Quartile - 11
• Upper Quartile - 61
• Max - 220
• Min – 3
• Lower end of data possible = Q1- (1.5 * IQR) = 11 – ( 1.5 * 50) = -64
• Upper end of data possible = Q3 + (1.5 * IQR) = 61 + (1.5 * 50) = 136
• Outlier is the data value 220 Draw the Boxplot now !!!
Is the data skewed???
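A minimal sketch of the same 1.5 × IQR rule with numpy (note: np.percentile's interpolated quartiles differ slightly from the median-of-halves values above, but 220 is flagged either way):

import numpy as np

data = np.array([3, 7, 11, 11, 15, 21, 23, 39, 41, 45, 50, 61, 87, 99, 220])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # -> [220]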

Practice Questions: Quartiles & Boxplots
The five-number summary for the number of accounts managed by
each sales manager at
ABC Inc. is shown in the following table.
The five-number summary suggests that about 50 percent
of sales managers at ABC Inc. manage fewer than what
number of accounts?

Min | Q1 | Median | Q3 | Max
 35 | 45 |   50   | 65 | 85

Draw the Boxplot now !!!

Practice Questions: Quartiles & Boxplots

Jason saves a portion of his salary from his part-time job in the hope of
buying a used car. He recorded the number of dollars he was able to save
over the past 15 weeks.
Dollars saved: 19, 12, 9, 7, 17, 10, 6, 18, 9, 14, 19, 8, 5, 17, 9
Draw box plot

Draw the Boxplot now !!!


Is the data skewed???

Practice Questions: Quartiles & Boxplots

The distribution of daily average wind speed on an island over a period of 120 days is
displayed on this box-and-whisker diagram.

(a) Write down the median wind speed.


(b) Write down the minimum wind speed.
(c) Find the interquartile range.
(d) Write down the number of days the wind speed was between 20 kmh-1 and 51 kmh-1.
(e) Write down the number of days the wind speed was between 9 kmh-1 and 68 kmh-1.

Draw the Boxplot now !!!


Is the data skewed???
Visualization of Data Dispersion: 3-D
Boxplots

Histogram Analysis

• Histogram: graph display of tabulated frequencies, shown as bars

• Differences between histograms and bar charts:
• Histograms are used to show distributions of variables, while bar charts are used to compare variables
• Histograms plot binned quantitative data, while bar charts plot categorical data
• Bars can be reordered in bar charts but not in histograms
• A histogram differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts – a crucial distinction when the categories are not of uniform width

(Figure: a histogram over values 10,000–90,000 with frequencies up to 40, alongside a bar chart)
Histograms Often Tell More than Boxplots

 The two histograms shown on the left may have the same boxplot representation
 The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi indicates that approximately
100·fi % of the data are below or equal to the value xi
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of
items sold at Branch 1 tend to be lower than those at Branch 2.
Exploring Bivariate Data

Scatter plot

• A scatter plot - effective graphical methods for determining if there appears to be a


relationship, pattern, or trend between two numeric attributes.
• Provides a first look at bivariate data to see clusters of points, outliers, etc
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Uncorrelated Data

Need In Detail?
Statistical Descriptions of Data
● They help us measure some very special properties of
the data
● One such property is the central tendency
○ Measuring the central tendency helps us know, where most of
the data lies taking into account the whole set of data

Central Tendency - Mean
● Mathematically, the mean of n values can be defined as:

  x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

○ Suppose we have a dataset with an attribute “age” for, say, 100 people
○ The mean of ages is equivalent to answering, “what age do most of the
people belong to?”
○ Not a good choice when there are some extreme values in the data
Central Tendency - Median
● When our dataset has skewness, calculating the Median
could prove to be more beneficial than Mean
○ Median is defined as the centermost value of an ordered
numerical dataset

Central Tendency - Mode
● The mode for a set of data is the value that occurs most
frequently in the set
○ Hence, it can be calculated for both qualitative and quantitative
attributes
○ A dataset with two modes is known as bimodal
○ In general, a dataset with two or more modes is known as multimodal
Central Tendency – Mid Range
● This is defined as the average of the largest and smallest
values in the set of values: midrange = (min + max) / 2
Dispersion of the Data
● The dispersion of data means the spread of data
● Measuring the dispersion of data
○ Let x1, x2, x3…xn be a set of observations for some numeric
attribute, X
○ The following terms for measuring the dispersion of data:
■ Range
■ Quantile
■ Interquartile Range (IQR)
■ Variance and Standard Deviation

Range
● It is defined as the difference between the largest and
smallest values in the set

5 8 9 4 3 2 7 12 15 6

Range = 15 – 2
= 13

Quantiles
● These are points taken at regular intervals of data distribution,
dividing it into essentially equal-size consecutive sets.

(Example: the sorted values 2, 3, 4, 5, 7, 9, 11, 13, 15, 22, 24, 27, 30, 31, 35 divided into four
equal-size consecutive sets)

The kth q-quantile for a given data distribution is the value x such that at most k/q of the data
values are less than x and at most (q−k)/q of the data values are more than x, where k is an
integer such that 0 < k < q. There are in total (q−1) q-quantiles.
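A minimal sketch computing q-quantiles with numpy for the example values:

import numpy as np

x = np.array([2, 3, 4, 5, 7, 9, 11, 13, 15, 22, 24, 27, 30, 31, 35])
print(np.quantile(x, [0.25, 0.5, 0.75]))       # quartiles (4-quantiles)
print(np.quantile(x, np.arange(1, 10) / 10))   # deciles (10-quantiles)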
Quartile – 4 Quantiles
Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order
• Then cut the list into four equal parts
• The Quartiles are at the "cuts"

Quartiles

Interquartile Range (IQR)
● The distance between the first and third quartiles is a simple
measure of the spread that gives the range covered by the
middle half of the data: IQR = Q3 − Q1
Variance & Standard Deviation
● The Standard Deviation is a measure of how spread out
numbers are
● The Variance is defined as the average of the squared
differences from the Mean
○ The variance of N observations, x1, x2, x3, …, xN, for a numeric
attribute X is:

  σ² = (1/N) Σᵢ (xᵢ − x̄)²

○ Mathematically, the standard deviation is defined as the square
root of the variance
Standard Deviation – a Look

Outliers
● An Outlier is a data object that deviates significantly from the
rest of the objects as if it were generated by a different
mechanism

Outlier Example

What if we remove outlier?

Outlier Detection using Box Plot
● A box and whisker plot — also called a box plot — displays five-
number summary of a set of data
● Five number summary
○ Minimum
○ First quartile (Q1)
○ Median
○ Third quartile (Q3)
○ Maximum

Outlier Detection using Box Plot

Handling missing values in the
dataset
● The data in the real world will obviously have a lot of missing
values
● Handling missing values:
○ Ignore the tuple with missing values
○ Use a measure of central tendency for the attribute to fill in the
missing value
○ Use prediction techniques to fill in the missing value
● Handling missing data is important as many machine learning
algorithms do not support data with missing values

Removing noise from the data using the
Binning Technique
● What is defined as a noise in data?
○ Suppose that we have a dataset in which we have some
measured attributes
○ Now, these attributes might carry some random error or variance
○ Such errors in attribute values are called noise in the data
● If such errors persist in our data, it will return inaccurate results
Binning Vs. Encoding
● For a machine learning model, the dataset needs to be
processed in the form of numerical vectors to train it using an
ML algorithm
○ Feature Binning: Conversion of a continuous variable to
categorical
○ Feature Encoding: Conversion of a categorical variable to
numerical features

(Binning: continuous data → categorical; Encoding: categorical data → numerical)
Binning Technique
● The set of data values are sorted in an order, grouped into
“buckets” or “bins” and then each value in a particular bin is
smoothed using its neighbor
○ It is also said that the binning method does local smoothing
because it consults its nearby values to smooth the values of the
attribute

[4, 8, 15, 21, 21, 24, 25, 28, 34]

4, 8, 15 21, 21, 24 25, 28, 34


(Bin 1) (Bin 2) (Bin 3)

Smoothing by bin means
● In this method, all the values of a particular bin are replaced by the
mean of the values of that particular bin
○ Mean of 4, 8, 15 = 9
○ Mean of 21, 21, 24 = 22
○ Mean of 25, 28, 34 = 29

9, 9, 9 22, 22, 22 29, 29, 29


(Bin 1) (Bin 2) (Bin 3)

Smoothing by bin medians
● In this method, all the values of a particular bin are replaced by the
median of the values of that particular bin
○ Median of 4, 8, 15 = 8
○ Median of 21, 21, 24 = 21
○ Median of 25, 28, 34 = 28

8, 8, 8 21, 21, 21 28, 28, 28


(Bin 1) (Bin 2) (Bin 3)

Smoothing by bin boundaries
● In this method, all the values of a particular bin are replaced by the
closest boundary of the values of that particular bin

4, 4, 15 21, 21, 24 25, 25, 34


(Bin 1) (Bin 2) (Bin 3)

Encoding
● Most of the ML algorithms cannot handle categorical variables
and hence it is important to do feature encoding

Label Encoding

Ordinal Encoding

Frequency Encoding

Binary Encoding

One-hot Encoding

Target mean Encoding


Label Encoding
● Label Encoding is a popular encoding technique for
handling categorical variables
○ In this technique, each label is assigned a unique integer
based on alphabetical ordering

(Typically used for the target variable)
Ordinal Encoding
● An ordinal encoding involves mapping each unique label
to an integer value
○ This type of encoding is really only appropriate if there is a
known relationship between the categories

(Typically used for features with a known ordering)
Frequency Encoding
● It transforms an original categorical variable to a numerical
variable by considering the frequency distribution of the data
○ It can be useful for nominal features

Binary encoding
● Binary encoding first maps each value to an integer, then takes the
binary representation of that integer and makes a binary table (one
column per bit) to encode the data
One hot encoding
● One hot encoding splits the categories into separate columns
○ It creates n different columns, one per category, and for each
row sets the matching column to 1 and the rest of the columns to 0
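A minimal sketch of label, one-hot, and frequency encoding with pandas (the "color" column is illustrative):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# label encoding: unique integer per label (alphabetical category order)
df["color_label"] = df["color"].astype("category").cat.codes

# one-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(df["color"], prefix="color")

# frequency encoding: replace each category by its relative frequency
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))
print(pd.concat([df, onehot], axis=1))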
Target Encoding
● Target encoding is the process of replacing a categorical value with the
mean of the target variable
○ Any non-categorical columns are automatically dropped by the target
encoder model

Feature Scaling
● Feature scaling means adjusting data that has different
scales so as to avoid biases from big outliers
○ It standardizes the independent features present in the data in
a fixed range

Why Feature Scaling?
● Machine learning algorithm works on numbers and has
no knowledge of what that number represents
○ Many ML algorithms perform better when numerical input
variables are scaled to a standard range
○ Feature scaling is a crucial part of the data preprocessing stage
Will Feature Scaling Work for all ML
Algorithms?

It improves the performance of some machine


learning algorithms and does not work at all for others

Why Feature Scaling?
● Tree-Based Algorithms
○ They are fairly insensitive to the scale of the features
○ Think about it, a decision tree is only splitting a node based on
a single feature
○ This split on a feature is not influenced by other features

Feature Scaling Categories

Feature Scaling

Normalization Standardization

Normalization
● A scaling technique in which values are shifted and
rescaled so that they end up ranging between 0 and 1
○ It is also known as Min-Max scaling
○ Here’s the formula for normalization:

  X′ = (X − X_min) / (X_max − X_min)
Standardization
● Standardization is another scaling technique where the
values are centered around the mean with a unit
standard deviation
○ This means that the mean of the attribute becomes zero and
the resultant distribution has a unit standard deviation
○ Here’s the formula for standardization:

  X′ = (X − μ) / σ
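A minimal sketch of both techniques with scikit-learn (the single income feature is illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[55_000.0], [73_600.0], [150_000.0]])

X_norm = MinMaxScaler().fit_transform(X)    # normalization: values in [0, 1]
X_std = StandardScaler().fit_transform(X)   # standardization: mean 0, sd 1
print(X_norm.ravel())                       # [0.    0.196 1.   ] (approx.)
print(X_std.mean(), X_std.std())            # ~0.0 and ~1.0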
Normalization or Standardization?
● Normalization is good to use when you know that the distribution
of your data does not follow a Gaussian distribution
● Standardization, on the other hand, can be helpful in cases where
the data follows a Gaussian distribution

○ However, this does not have to be necessarily true


● At the end of the day, the choice of using normalization or
standardization will depend on your problem and the machine
learning algorithm you are using

Covariance
● Variables may change in relation to each other
● Covariance measures how much the movement in one
variable predicts the movement in a corresponding variable
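A minimal sketch of covariance and correlation with numpy, using made-up smoking-exposure vs. lung-capacity style values:

import numpy as np

cigarettes = np.array([0, 5, 10, 15, 20, 30])
lung_capacity = np.array([45, 42, 33, 31, 29, 25])

cov = np.cov(cigarettes, lung_capacity)[0, 1]        # negative: inverse covariation
corr = np.corrcoef(cigarettes, lung_capacity)[0, 1]  # scaled to [-1, 1]
print(cov, corr)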

Smoking vs. Lung Capacity Data

• The variables Cigarettes and Lung Capacity covary inversely

• When smoking is above its group mean, lung
capacity tends to be below its group mean.
Calculating Covariance

Calculating Correlation

• Greater smoking exposure implies a greater likelihood of lung damage
Different Correlation Values

Correlation is not Causation

Correlation Is Not Good at Curves

