Machine Learning Crash Course: Part I
Ariel Kleiner
August 21, 2012
Machine learning exists at the intersection of computer science and statistics.
Examples
• Spam filters
• Search ranking
• Click (and clickthrough rate) prediction
• Recommendations (e.g., Netflix, Facebook friends)
• Speech recognition
• Machine translation
• Fraud detection
• Sentiment analysis
• Face detection, image classification
• Many more
A Variety of Capabilities
• Classification
• Collaborative filtering
• Regression
• Ranking
• Clustering
• Dimensionality reduction
• Feature selection
• Structured prediction
• Structured probabilistic modeling
• Active learning and experimental design
• Reinforcement learning
• Time series analysis
• Hypothesis testing
For Today
Classification
Clustering
(with emphasis on implementability and scalability)
Typical Data Analysis Workflow
Obtain and load raw data → Data exploration → Preprocessing and featurization → Learning → Diagnostics and evaluation
Classification
• Goal: Learn a mapping from entities to discrete labels.
  – Refer to entities as x and labels as y.
• Example: spam classification
  – Entities are emails.
  – Labels are {spam, not-spam}.
  – Given past labeled emails, we want to predict whether a new email is spam or not-spam.
Classification
• Examples
  – Spam filters
  – Click (and clickthrough rate) prediction
  – Sentiment analysis
  – Fraud detection
  – Face detection, image classification
Classification
Given a labeled dataset (x1, y1), ..., (xN, yN):
1. Randomly split the full dataset into two disjoint parts:
   – A larger training set (e.g., 75%)
   – A smaller test set (e.g., 25%)
2. Preprocess and featurize the data.
3. Use the training set to learn a classifier.
4. Evaluate the classifier on the test set.
5. Use the classifier to predict in the wild.
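The random split in step 1 can be sketched in a few lines of Python (a minimal illustration; function name and fixed seed are my own choices, not from the slides):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Randomly split labeled (x, y) pairs into disjoint training/test sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

# 8 labeled examples, split 75% / 25%.
dataset = [([float(i)], i % 2) for i in range(8)]
training_set, test_set = train_test_split(dataset)
```

The two parts are disjoint and together cover the full dataset, which is exactly what step 1 requires.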
Classification
[Diagram: full dataset → training set → classifier → prediction for a new entity; test set → accuracy]
Example: Spam Classification
From: [email protected]
"Eliminate your debt by giving us your money..." → spam

From: [email protected]
"Hi, it's been a while! How are you? ..." → not-spam
Featurization
• Most classifiers require numeric descriptions of entities as input.
• Featurization: Transform each entity into a vector of real numbers.
  – Straightforward if the data are already numeric (e.g., patient height, blood pressure, etc.)
  – Otherwise, some effort is required. But this provides an opportunity to incorporate domain knowledge.
Featurization: Text
• Often use "bag of words" features for text.
  – Entities are documents (i.e., strings).
  – Build vocabulary: determine the set of unique words in the training set. Let V be the vocabulary size.
  – Featurization of a document:
    • Generate a V-dimensional feature vector.
    • Cell i in the feature vector has value 1 if the document contains word i, and 0 otherwise.
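In Python, this binary bag-of-words featurization might look as follows (a minimal sketch; function names and whitespace tokenization are simplifying choices of mine, not from the slides):

```python
def build_vocabulary(documents):
    """Collect the sorted set of unique words across the training documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def featurize(document, vocab):
    """Binary bag of words: cell i is 1 iff the document contains word i."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

docs = ["eliminate your debt", "hi how are you"]
vocab = build_vocabulary(docs)
x = featurize("eliminate debt now", vocab)
```

Note that words outside the training vocabulary (like "now" above) simply contribute nothing to the feature vector.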
Example: Spam Classification
From: [email protected]
"Eliminate your debt by giving us your money..."

From: [email protected]
"Hi, it's been a while! How are you? ..."

Vocabulary: been, debt, eliminate, giving, how, it's, money, while
Example: Spam Classification
From: [email protected]
"Eliminate your debt by giving us your money..."

Feature vector:
  been       0
  debt       1
  eliminate  1
  giving     1
  how        0
  it's       0
  money      1
  while      0
Example: Spam Classification
• How might we construct a classifier?
• Using the training data, build a model that will tell us the likelihood of observing any given (x, y) pair.
  – x is an email's feature vector
  – y is a label, one of {spam, not-spam}
• Given such a model, to predict the label for an email:
  – Compute the likelihoods of (x, spam) and (x, not-spam).
  – Predict the label which gives the highest likelihood.
Example: Spam Classification
• What is a reasonable probabilistic model for (x, y) pairs?
• A baseline:
  – Before we observe an email's content, can we say anything about its likelihood of being spam?
  – Yes: p(spam) can be estimated as the fraction of training emails which are spam.
  – p(not-spam) = 1 - p(spam)
  – Call this the "class prior." Written as p(y).
Example: Spam Classification
• How do we incorporate an email's content?
• Suppose that the email were spam. Then, what would be the probability of observing its content?
Example: Spam Classification
• Example: "Eliminate your debt by giving us your money" with feature vector (0, 1, 1, 1, 0, 0, 1, 0)
• Ignoring word sequence, the probability of the email is
  p(seeing "debt" AND seeing "eliminate" AND seeing "giving" AND seeing "money" AND not seeing any other vocabulary words | the email is spam)
• In feature vector notation:
  p(x1=0, x2=1, x3=1, x4=1, x5=0, x6=0, x7=1, x8=0 | the email is spam)
Example: Spam Classification
• Now, to simplify, model each word in the vocabulary independently:
  – Assume that (given knowledge of the class label) the probability of seeing word i (e.g., eliminate) is independent of the probability of seeing word j (e.g., money).
  – As a result, the probability of the email content becomes
    p(x1=0 | spam) p(x2=1 | spam) ... p(x8=0 | spam)
    rather than
    p(x1=0, x2=1, x3=1, x4=1, x5=0, x6=0, x7=1, x8=0 | spam)
Example: Spam Classification
• Now, we only need to model the probability of seeing (or not seeing) a particular word i, assuming that we knew the email's class y (spam or not-spam).
  – But this is easy!
  – To estimate p(xi = 1 | y), simply compute the fraction of emails in the set {emails in the training set with label y} which contain word i.
Example: Spam Classification
• Putting it all together:
  – Based on the training data, estimate the class prior p(y).
    • i.e., estimate p(spam) and p(not-spam).
  – Also estimate the (conditional) probability of seeing any individual word i, given knowledge of the class label y.
    • i.e., estimate p(xi = 1 | y) for each i and y.
  – The (conditional) probability p(x | y) of seeing an entire email, given knowledge of the class label y, is then simply the product of the conditional word probabilities.
    • e.g., p(x=(0, 1, 1, 1, 0, 0, 1, 0) | y) = p(x1=0 | y) p(x2=1 | y) ... p(x8=0 | y)
Example: Spam Classification
• Recall: we want a model that will tell us the likelihood p(x, y) of observing any given (x, y) pair.
• The probability of observing (x, y) is the probability of observing y, and then observing x given that value of y:
  p(x, y) = p(y) p(x | y)
• Example: p("Eliminate your debt...", spam) = p(spam) p("Eliminate your debt..." | spam)
Example: Spam Classification
• To predict the label for a new email:
  – Compute log[p(x, spam)] and log[p(x, not-spam)].
  – Choose the label which gives the higher value.
  – We use logs above to avoid the underflow which otherwise arises in computing the p(x | y), which are products of individual p(xi | y) < 1:
    log[p(x, y)] = log[p(y) p(x | y)]
                 = log[p(y) p(x1 | y) p(x2 | y) ...]
                 = log[p(y)] + log[p(x1 | y)] + log[p(x2 | y)] + ...
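The whole procedure above (estimate the prior, estimate the word conditionals, predict by comparing log joint probabilities) fits in a short Python sketch. Function names are mine, and I add one refinement the slides omit: add-one smoothing, which keeps estimated probabilities strictly between 0 and 1 so the logs are always defined.

```python
import math

def train_naive_bayes(X, y):
    """Estimate the class prior p(y) and the conditionals p(x_i = 1 | y)."""
    labels = sorted(set(y))
    prior = {c: y.count(c) / len(y) for c in labels}
    cond = {}
    for c in labels:
        rows = [x for x, label in zip(X, y) if label == c]
        # Fraction of class-c training emails containing word i
        # (add-one smoothing avoids estimates of exactly 0 or 1).
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(len(X[0]))]
    return prior, cond

def predict(x, prior, cond):
    """Pick the label maximizing log p(y) + sum_i log p(x_i | y)."""
    def log_joint(c):
        total = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            total += math.log(p if xi == 1 else 1.0 - p)
        return total
    return max(prior, key=log_joint)

# Toy training set over a 2-word vocabulary.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["spam", "spam", "not-spam", "not-spam"]
prior, cond = train_naive_bayes(X, y)
```

Summing logs instead of multiplying probabilities is exactly the underflow fix described above.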
Classification: Beyond Text
• You have just seen an instance of the Naive Bayes classifier.
• It applies as shown to any classification problem with binary feature vectors.
• What if the features are real-valued?
  – Still model each element of the feature vector independently.
  – But change the form of the model for p(xi | y).
Classification: Beyond Text
• If xi is a real number, often model p(xi | y) as Gaussian with mean μ_iy and variance σ²_iy:

  p(xi | y) = (1 / (σ_iy √(2π))) exp(-(1/2) ((xi - μ_iy) / σ_iy)²)

• Estimate the mean and variance for a given i, y as the mean and variance of the xi in the training set which have corresponding class label y.
• Other, non-Gaussian distributions can be used if we know more about the xi.
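The per-class parameter estimates and the Gaussian density above translate directly into code (a minimal sketch with my own function names; this uses the population variance, matching "the variance of the xi in the training set"):

```python
import math

def gaussian_params(xs):
    """Mean and variance of the x_i values for one feature and one class."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density p(x_i | y) under the per-class Gaussian model."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Example: heights (one real-valued feature) of training patients in class y.
heights = [150.0, 160.0, 170.0]
mu, sigma2 = gaussian_params(heights)
```

In the Naive Bayes prediction step, `gaussian_pdf` simply replaces the binary estimate of p(xi | y); everything else is unchanged.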
Naive Bayes: Benefits
• Can easily handle more than two classes and different data types
• Simple and easy to implement
• Scalable
Naive Bayes: Shortcomings
• Generally not as accurate as more sophisticated methods (but still generally reasonable).
• Independence assumption on the feature vector elements
  – Can instead directly model p(x | y) without this independence assumption.
• Requires us to specify a full model for p(x, y)
  – In fact, this is not necessary!
  – To do classification, we actually only require p(y | x), the probability that the label is y, given that we have observed entity features x.
Logistic Regression
• Recall: Naive Bayes models the full (joint) probability p(x, y).
• But Naive Bayes actually only uses the conditional probability p(y | x) to predict.
• Instead, why not just directly model p(y | x)?
  – Logistic regression does exactly that.
  – No need to first model p(y) and then separately p(x | y).
Logistic Regression
• Assume that the class labels are {0, 1}.
• Given an entity's feature vector x, the probability that the label is 1 is taken to be

  p(y = 1 | x) = 1 / (1 + e^(-b^T x))

  where b is a parameter vector and b^T x denotes a dot product.
• The probability that the label is 1, given features x, is determined by a weighted sum of the features.
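The formula is a one-liner in Python (function name mine; given a parameter vector b, this is the entire prediction rule):

```python
import math

def logistic_probability(b, x):
    """p(y = 1 | x) = 1 / (1 + exp(-b^T x))."""
    bt_x = sum(bi * xi for bi, xi in zip(b, x))  # the weighted sum b^T x
    return 1.0 / (1.0 + math.exp(-bt_x))

# With b = 0 the weighted sum is 0, so the model is indifferent: p = 0.5.
p = logistic_probability([0.0, 0.0], [3.1, -2.0])
```

A positive weighted sum pushes the probability above 0.5 (predict label 1), a negative one below 0.5 (predict label 0).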
Logistic Regression
• This is liberating:
  – Simply featurize the data and go.
  – No need to find a distribution for p(xi | y) which is particularly well suited to your setting.
  – Can just as easily use binary-valued (e.g., bag of words) or real-valued features without any changes to the classification method.
  – Can often improve performance simply by adding new features (which might be derived from old features).
Logistic Regression
• Can be trained efficiently at large scale, but not as easy to implement as Naive Bayes.
  – Trained via maximum likelihood.
  – Requires the use of iterative numerical optimization (e.g., gradient descent, most basically).
  – However, implementing this effectively, robustly, and at large scale is non-trivial and would require more time than we have today.
• Can be generalized to the multiclass setting.
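To make "iterative numerical optimization" concrete, here is a toy batch gradient ascent on the log-likelihood (a sketch under strong simplifying assumptions of mine: fixed step size and iteration count, no regularization, no convergence check — none of which would survive in a robust large-scale implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, steps=1000, lr=0.1):
    """Maximum likelihood for logistic regression; labels must be in {0, 1}."""
    d = len(X[0])
    b = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, yi in zip(X, y):
            # Gradient of the log-likelihood: (y - p(y=1|x)) * x, summed over data.
            err = yi - sigmoid(sum(bj * xj for bj, xj in zip(b, x)))
            for j in range(d):
                grad[j] += err * x[j]
        b = [bj + lr * gj for bj, gj in zip(b, grad)]
    return b

# Two points (first feature is a constant-1 intercept term).
b = train_logistic([[1.0, 0.0], [1.0, 4.0]], [0, 1])
```

Each iteration nudges b in the direction that makes the training labels more probable under the model.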
Other Classification Techniques
• Support Vector Machines (SVMs)
• Kernelized logistic regression and SVMs
• Boosted decision trees
• Random Forests
• Nearest neighbors
• Neural networks
• Ensembles

See The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman for more information.
Featurization: Final Comments
• Featurization affords the opportunity to
  – Incorporate domain knowledge
  – Overcome some classifier limitations
  – Improve performance
• Incorporating domain knowledge:
  – Example: in spam classification, we might suspect that the sender is important, in addition to the email body.
  – So, try adding features based on the sender's email address.
Featurization: Final Comments
• Overcoming classifier limitations:
  – Naive Bayes and logistic regression do not model multiplicative interactions between features.
  – For example, the presence of the pair of words [eliminate, debt] might indicate spam, while the presence of either one individually might not.
  – Can overcome this by adding features which explicitly encode such interactions.
  – For example, can add features which are products of all pairs of bag of words features.
  – Can also include nonlinear effects in this manner.
  – This is actually what kernel methods do.
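Adding all pairwise product features can be sketched as follows (function name mine; note that the augmented vector has d + d(d-1)/2 entries, so this gets expensive for large vocabularies):

```python
def add_pairwise_products(x):
    """Append the products of all pairs of features to the vector x."""
    d = len(x)
    return x + [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]

# For binary bag-of-words features, x[i] * x[j] = 1 exactly when
# both word i and word j appear in the document.
x = [1, 1, 0]  # e.g., presence of [eliminate, debt, money]
augmented = add_pairwise_products(x)
```

The classifier itself is unchanged; it just sees a longer feature vector in which word co-occurrence is now explicit.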
Classification: Recap
Given a labeled dataset (x1, y1), ..., (xN, yN): randomly split it into disjoint training (e.g., 75%) and test (e.g., 25%) sets; preprocess and featurize the data; learn a classifier on the training set; evaluate it on the test set; then use the classifier to predict in the wild.
Classifier Evaluation
• How do we determine the quality of a trained classifier?
• There are various metrics for quality; the most common is accuracy.
• How do we determine the probability that a trained classifier will correctly classify a new entity?
Classifier Evaluation
• Cannot simply evaluate a classifier on the same dataset used to train it.
  – This will be overly optimistic!
• This is why we set aside a disjoint test set before training.
Classifier Evaluation
• To evaluate accuracy:
  – Train on the training set without exposing the test set to the classifier.
  – Ignoring the (known) labels of the data points in the test set, use the trained classifier to generate label predictions for the test points.
  – Compute the fraction of predicted labels which are identical to the test set's known labels.
• Other, more sophisticated evaluation methods are available which make more efficient use of data (e.g., cross-validation).
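The accuracy computation is a one-liner given any classifier-as-a-function (a sketch; the toy classifier below is purely illustrative):

```python
def accuracy(classifier, test_set):
    """Fraction of test points whose predicted label matches the known label."""
    correct = sum(1 for x, y in test_set if classifier(x) == y)
    return correct / len(test_set)

# Toy classifier: predict "spam" whenever any feature is present.
toy = lambda x: "spam" if sum(x) > 0 else "not-spam"
test_set = [([1, 1], "spam"), ([0, 0], "not-spam"), ([1, 0], "not-spam")]
acc = accuracy(toy, test_set)  # 2 of 3 predictions are correct
```

Note that `classifier` is only ever called on the features x; the known labels y are used solely for scoring, exactly as described above.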