BIG DATA ANALYTICS
Lecture 9 --- Week 10
Content
Classification versus Regression
Supervised vs. Unsupervised Learning
Evaluating Predictive Models
Supervised Learning Algorithms
Model Evaluation
Classification vs Regression
Classification:
predicts categorical class labels
constructs a model from the training set and the class labels of a
classifying attribute, and uses that model to classify new data.
Regression:
models continuous-valued functions, i.e., predicts unknown or missing
values.
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification – A Motivating
Application
Credit approval
A bank wants to classify its customers based on whether they
are expected to pay back their approved loans
The history of past customers is used to train the classifier
The classifier provides rules, which identify potentially reliable
future customers
Classification rule:
If age = “31...40” and income = high then credit_rating = excellent
Future customers
Paul: age = 35, income = high → excellent credit rating
John: age = 20, income = medium → fair credit rating
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test samples is compared with the classified result
from the model
Accuracy rate is the percentage of test set samples that are correctly
classified by the model
Test set is independent of training set, otherwise over-fitting will occur
Classification Process (1): Model
Construction
Training Data → Classification Algorithm → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
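The learned rule can be sketched as a simple classifier in Python. This is a minimal illustration; the function name and data layout are assumptions, while the table and rule come from the slide.

```python
# A sketch of the learned model as a Python function; the rule is
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
# The rule reproduces every label in the training set
predictions = [predict_tenured(rank, years) for _, rank, years, _ in training]
```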
Classification Process (2): Use the
Model in Prediction
Classifier → Accuracy = ?

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
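Estimating accuracy on the test set can be sketched as follows, assuming the rule learned in the model-construction step (the function name and data layout are assumptions):

```python
# Accuracy estimate on the test set, assuming the learned rule
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Mellisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing)
accuracy_rate = correct / len(testing)  # Mellisa is misclassified -> 0.75
```

Note that the rule misclassifies Mellisa (years > 6 yet not tenured), which is exactly why accuracy must be measured on a test set independent of the training data.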
Classification (Training Phase)
In the first step, a classifier is built describing a predetermined set of data
classes or concepts.
This is the learning step (or training phase), where a classification
algorithm builds the classifier by analyzing or “learning from” a training set
made up of database tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1,
x2, …, xn), depicting n measurements made on the tuple from n database
attributes, respectively, A1, A2, …, An.
Each tuple, X, is assumed to belong to a predefined class as determined by
another database attribute called the class label attribute.
The class label attribute is discrete-valued and unordered.
It is categorical (or nominal) in that each value serves as a category or class.
The individual tuples making up the training set are referred to as training
tuples and are randomly sampled from the database under analysis. In the
context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects.
This first step of the classification process can also be viewed as the learning of a
mapping or function, y = f (X), that can predict the associated class label y of a
given tuple X.
In this view, we wish to learn a mapping or function that separates the data
classes.
This mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.
For example, the mapping may be represented as classification rules that
identify loan applications as being either safe or risky.
The rules can be used to categorize future data tuples, as well as
provide deeper insight into the data contents.
They also provide a compressed data representation.
Classification (Testing Phase)
In the second step, the model is used for classification.
First, the predictive accuracy of the classifier is estimated.
If we were to use the training set to measure the classifier’s accuracy,
this estimate would likely be optimistic, because the classifier tends
to overfit the data (i.e., during learning it may incorporate some
particular anomalies of the training data that are not present in the
general data set overall).
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
Evaluating Predictive Models
Predictive Accuracy
Speed
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules (quality)
True Positives, True Negatives, False Negatives, False Positives
compactness of classification rules
Supervised Learning Algorithms
Artificial Neural Network
Linear Regression
Support Vector Machine
Artificial Neural Networks
Perceptron
Developed by Frank Rosenblatt using the McCulloch and Pitts model, the
perceptron is the basic operational unit of artificial neural networks. It
employs a supervised learning rule and is able to classify the data into
two classes.
Operational characteristics of the perceptron: it consists of a single
neuron with an arbitrary number of inputs and adjustable
weights, and the output of the neuron is 1 or 0 depending on a
threshold. It also has a bias input whose value is always 1.
Following figure gives a schematic representation of the perceptron.
Perceptron thus has the following three basic elements −
Links − A set of connection links, each carrying a weight; this includes
a bias link whose input value is always 1.
Adder − It adds the input after they are multiplied with their
respective weights.
Activation function − It limits the output of neuron. The most basic
activation function is a Heaviside step function that has two possible
outputs. This function returns 1, if the input is positive, and 0 for any
negative input.
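The three elements above can be sketched in pure Python. This is a minimal illustration; the AND-gate data, learning rate, and epoch count are assumptions, not part of the lecture:

```python
# A minimal perceptron sketch: a single neuron with adjustable weights,
# a bias input fixed at 1, and a Heaviside step activation, trained with
# the perceptron learning rule on the (linearly separable) AND function.
def step(x):
    # Heaviside step activation: 1 for non-negative input, 0 otherwise
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]  # adjustable weights on the input links
    b = 0.0         # weight on the bias link (the bias input itself is 1)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            y = step(w[0] * x1 + w[1] * x2 + b * 1)  # adder + activation
            err = target - y
            w[0] += lr * err * x1  # perceptron learning rule
            w[1] += lr * err * x2
            b += lr * err
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
```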
Linear Regression
Linear regression may be defined as the statistical model that
analyzes the linear relationship between a dependent variable and a
given set of independent variables. A linear relationship between
variables means that when the value of one or more independent
variables changes (increases or decreases), the value of the dependent
variable changes accordingly (increases or decreases).
Mathematically the relationship can be represented with the help of
following equation −
Y = mX + b
Here, Y is the dependent variable we are trying to predict,
X is the independent variable we are using to make predictions,
m is the slope of the regression line, which represents the effect X
has on Y,
b is a constant, known as the Y-intercept. If X = 0, Y would be
equal to b.
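Fitting m and b by ordinary least squares can be sketched as follows (the data points are hypothetical; the closed-form slope is m = cov(X, Y) / var(X) and the intercept is b = mean(Y) − m · mean(X)):

```python
# A minimal least-squares sketch for Y = mX + b.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of X and Y divided by variance of X
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x  # intercept
    return m, b

# points lying exactly on Y = 2X + 1
m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```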
Positive Linear Relationship
A linear relationship is called positive if the dependent variable
increases as the independent variable increases. It can be understood
with the help of the following graph −
Negative Linear relationship
A linear relationship is called negative if the dependent variable
decreases as the independent variable increases. It can be understood
with the help of the following graph −
Support Vector Machines
An SVM model is basically a representation of different classes in a
hyper-plane in multidimensional space. The hyper-plane will be
generated in an iterative manner by SVM so that the error can be
minimized. The goal of SVM is to divide the datasets into classes to
find a maximum marginal hyper-plane (MMH).
The followings are important concepts in SVM −
Support Vectors − Data points that are closest to the hyper-plane are
called support vectors. The separating line is defined with the help of
these data points.
Hyper-plane − A decision plane or space that divides a set of objects
belonging to different classes.
Margin − The gap between the two lines through the
closest data points of different classes. It can be calculated as the
perpendicular distance from the line to the support vectors. A large
margin is considered a good margin and a small margin is
considered a bad margin.
The main goal of SVM is to divide the datasets into classes to find a
maximum marginal hyper-plane (MMH) and it can be done in the
following two steps −
First, SVM will generate hyper-planes iteratively that segregate
the classes in the best way.
Then, it will choose the hyper-plane that separates the classes
with the maximum margin.
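The two steps above can be sketched as a rough linear SVM in pure Python. This is not a production implementation: it trains by stochastic sub-gradient descent on the hinge loss with L2 regularization (a Pegasos-style scheme), and the 2-D dataset, learning rate, and epoch count are all assumptions.

```python
# A rough linear-SVM sketch: iteratively adjust the hyper-plane w.x + b = 0
# so that points are pushed outside a margin of 1 on the correct side.
def train_linear_svm(points, labels, epochs=500, lr=0.01, lam=0.01):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # inside the margin: hinge loss is active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # outside the margin: only regularization shrinks w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

pts = [(0, 0), (1, 0), (3, 3), (4, 3)]  # two linearly separable groups
ys = [-1, -1, 1, 1]                     # labels in {-1, +1}
w, b = train_linear_svm(pts, ys)
```

The points closest to the resulting hyper-plane play the role of the support vectors; the regularization term is what drives the margin toward its maximum.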
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
Metrics for Performance Evaluation…

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
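Computing accuracy from confusion-matrix counts is a one-liner; the counts below are hypothetical:

```python
# Accuracy from confusion-matrix counts: correct predictions (TP + TN)
# over all predictions (TP + TN + FP + FN).
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=50, fn=10, fp=5, tn=35)  # (50 + 35) / 100 = 0.85
```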
Methods of Estimation
Holdout
Reserve 2/3 for training and 1/3 for testing
Random subsampling
One sample may be biased -- Repeated holdout
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the
remaining one
Leave-one-out: k=n
Guarantees that each record is used the same number of times for training
and testing
Bootstrap
Sampling with replacement
~63% of records used for training, ~37% for testing
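The k-fold scheme above can be sketched as follows (a minimal illustration; the index-based partitioning is an assumption about how one might implement it):

```python
# k-fold cross-validation sketch: partition record indices into k disjoint
# folds; each split trains on k-1 folds and tests on the remaining one.
def k_fold_splits(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]  # k disjoint index sets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n=9, k=3))
```

Each record appears in exactly one test fold and in k−1 training sets, which is the guarantee stated above; setting k = n gives leave-one-out.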
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to
analyze noisy signals
Characterize the trade-off between positive hits and false
alarms
ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)

TPR = TP / (TP + FN)
Fraction of positive instances predicted as positive

FPR = FP / (FP + TN)
Fraction of negative instances predicted as positive
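A single ROC point can be computed directly from confusion-matrix counts; the counts below are hypothetical:

```python
# One point on the ROC curve from confusion-matrix counts.
def roc_point(tp, fn, fp, tn):
    tpr = tp / (tp + fn)  # fraction of positive instances predicted positive
    fpr = fp / (fp + tn)  # fraction of negative instances predicted positive
    return fpr, tpr

fpr, tpr = roc_point(tp=40, fn=10, fp=5, tn=45)
```

Sweeping the classifier's decision threshold and plotting one such (FPR, TPR) point per threshold traces out the full ROC curve.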