Machine Learning
23AI4PCMLG
UNIT -1
Machine Learning
Sem IV
Course Title: Machine Learning
Course Code: 23AI4PCMLG Total Contact Hours: 40 hours
L-T-P: 3-0-1 Total Credits: 4
UNIT WISE DETAILS
Unit 1 (8 hours): Machine Learning Landscape: Introduction, Types of Machine Learning, Challenges of Machine Learning, Testing and Validating. Supervised Learning. Decision Tree Learning: Decision tree representation, Appropriate problems for decision tree learning, Basic decision tree learning algorithm, Issues in Decision tree learning, CART Training algorithm.
Unit 2 (8 hours): Support Vector Machines: Linear SVM, Non Linear SVM, SVM Regression, Under the Hood. Instance Based Learning: Introduction, k-Nearest Neighbor learning.
Unit 3 (8 hours): Probabilistic Learning. Bayesian Learning: Bayes Theorem and Concept Learning, Maximum Likelihood, Minimum Description Length Principle, Bayes Optimal Classifier, Gibbs Algorithm, Naïve Bayes Classifier, Bayesian Belief Network, EM Algorithm.
Unit 4 (8 hours): Ensemble Learning and Random Forests: Voting Classifiers, Bagging and Pasting, Random Patches and Random Subspaces, Random Forests, Boosting, Stacking.
Unit 5 (8 hours): Unsupervised Learning Techniques: Clustering – Kmeans, DBSCAN, Other Clustering Algorithms, Gaussian Mixtures – Anomaly Detection, Selecting Clustering, Bayesian Gaussian Mixture Models, Other algorithms for anomaly and novelty detection. Reinforcement Learning: Markov Decision Process, Introduction, Learning Task, Q Learning.
TEXT BOOK DETAILS
Prescribed Text Book
1. Machine Learning, Tom M. Mitchell, First Edition, McGraw Hill Education, 2013.
2. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, Aurelien Geron, Second Edition, O’Reilly, 2020.
TEXT BOOK DETAILS
Reference Text Book
1. Introduction to Machine Learning with Python, Andreas C. Muller & Sarah Guido, First Edition, Shroff Publishers, 2019.
2. Thoughtful Machine Learning, Matthew Kirk, First Edition, Shroff Publishers, 2019.
Course Outcomes
CO1 Apply different learning algorithms for various complex problems.
CO2 Analyze the learning techniques for a given dataset.
CO3 Design a model using machine learning to solve a problem.
CO4 Ability to conduct practical experiments to solve problems using appropriate machine learning techniques.
SEE Exam Question paper format
Unit-1: Mandatory. One question to be asked for 20 marks.
Unit-2: Mandatory. One question to be asked for 20 marks.
Unit-3: Internal choice. Two questions to be asked for 20 marks each.
Unit-4: Internal choice. Two questions to be asked for 20 marks each.
Unit-5: Mandatory. One question to be asked for 20 marks.
Lab Program
Lab Program 1 (Unit 1): Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
Lab Program 2 (Unit 2): Develop a program to construct a Support Vector Machine considering a sample dataset.
Lab Program 3 (Unit 2): Write a program to implement the k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions.
Lab Program 4 (Unit 3): Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
Lab Program 5 (Unit 3): Write a program to construct a Bayesian network considering training data. Use this model to make predictions.
Lab Program 6 (Unit 3): Apply the EM algorithm to cluster a set of data stored in a .CSV file. Compare the results of the k-Means algorithm and the EM algorithm.
Lab Program 7 (Unit 4): Implement the Boosting ensemble method on a given dataset.
Lab Program 8 (Unit 4): Write a program to construct a random forest for sample training data. Display model accuracy using various metrics.
Lab Program 9 (Unit 5): Implement tic-tac-toe using reinforcement learning.
Lab Program 10 (Unit 5): Consider a sample application. Deploy a machine learning model as a web service and make it available for users to predict a given instance.
Key Motivations for Machine Learning
Systems that support humans by either improving upon existing human capabilities or providing
new capabilities
Problems Solved by Machine Learning Today
Spam Detection
Information Retrieval
Recognition
Robotics
Recommendation Systems
(Figures: examples of recognition, information retrieval, robotics, recommendation systems, computer vision systems, and home virtual assistants.)
Machine Learning - Definition
Machine Learning is the science (and art) of programming computers so they can learn from
data.
Machine Learning is the field of study that gives computers the ability to learn without being
explicitly programmed. —Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some
performance measure P, if its performance on T, as measured by P, improves with experience E.
—Tom Mitchell, 1997
Machine Learning
▪ The examples that the system uses to learn are called the training set.
▪ Each training example is called a training instance (or sample).
▪ In this case, the task T is to flag spam for new emails, the experience E is the training data,
and the performance measure P needs to be defined; for example, you can use the ratio of
correctly classified emails.
▪ This particular performance measure is called accuracy and it is often used in classification
tasks.
Why Use Machine Learning?
Consider how you would write a spam filter using traditional programming techniques: you would first notice that some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in spam, and then write rules to flag emails containing them.
Why Use Machine Learning?
a spam filter based on Machine Learning techniques automatically learns which words and phrases are good
predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the
ham examples
If spammers notice that all their emails
containing “4U” are blocked, they might
start writing “For U” instead.
A spam filter using traditional programming
techniques would need to be updated to flag
“For U” emails.
If spammers keep working around your spam
filter, you will need to keep writing new rules
forever
Why Use Machine Learning?
In contrast, a spam filter based on Machine Learning techniques automatically notices that “For
U” has become unusually frequent in spam flagged by users, and it starts flagging them without
your intervention
Why Use Machine Learning?
Machine Learning can help humans learn
Applying ML techniques to dig into large amounts of data can help discover patterns that were
not immediately apparent. This is called data mining.
Types of Machine Learning Systems
Machine Learning systems can be classified according to the amount and type of supervision
they get during training.
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement Learning
5. Batch and Online Learning
6. Instance-Based Versus Model-Based Learning
Supervised
The training data fed to the algorithm includes the desired solutions, called labels.
Classification
▪ Spam filter
▪ Trained with many example emails along with their class (spam or ham)
▪ Model must learn how to classify new emails
Supervised
Regression
▪ Predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors.
▪ To train the system, you need to give it many examples of cars, including both their predictors
and their labels (i.e., their prices).
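A minimal sketch of this idea, assuming scikit-learn is available; the tiny car table (mileage, age, price) below is invented purely for illustration:

```python
# Minimal regression sketch (assumes scikit-learn; the tiny car dataset is invented).
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictors: [mileage in km, age in years]; labels: price (made-up values).
X = np.array([[20000, 1], [45000, 2], [80000, 4], [120000, 6], [160000, 8]])
y = np.array([8.5, 7.2, 5.9, 4.1, 3.0])

model = LinearRegression().fit(X, y)   # learn from predictors and their labels
print(model.predict([[60000, 3]]))     # predict the price of an unseen car
```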
Supervised algorithms - Types
Supervised learning algorithms
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Supervised – An Example for Classification
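The original slide shows a figure; as a stand-in, here is a minimal classification sketch, assuming scikit-learn and its bundled Iris dataset:

```python
# Minimal supervised-classification sketch (assumes scikit-learn and its Iris dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # labeled examples: features + class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # learn from labeled examples
print("test accuracy:", clf.score(X_test, y_test))              # classify new (held-out) instances
```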
Unsupervised learning
▪ Data that is not associated with labels is called unlabelled data.
▪ The training data is unlabeled
▪ The system tries to learn without a teacher.
Unsupervised learning
• Clustering
— K-Means
— DBSCAN
— Hierarchical Cluster Analysis (HCA)
• Anomaly detection and novelty detection
— One-class SVM
— Isolation Forest
• Visualization and dimensionality reduction
— Principal Component Analysis (PCA)
— Kernel PCA
— Locally-Linear Embedding (LLE)
— t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
— Apriori
— Eclat
Unsupervised learning
▪ Clustering - detect groups of similar instances (e.g., groups of similar visitors).
▪ Example - a hierarchical clustering algorithm may also subdivide each group into smaller groups.
▪ This may help you target your posts for each group.
▪ Visualization algorithms are also good examples of unsupervised learning algorithms.
▪ The algorithm is fed a lot of complex, unlabeled data, and the output is a 2D or 3D representation of the data that can easily be plotted.
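A minimal clustering sketch along these lines, assuming scikit-learn; synthetic blob data stands in for real visitor data:

```python
# Minimal clustering sketch (assumes scikit-learn; blob data stands in for real, unlabeled data).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)     # unlabeled instances
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])   # group assigned to each of the first 10 instances
```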
Unsupervised learning
▪ Dimensionality reduction
▪ Simplify the data without losing too much information.
▪ One way to do this is to merge several correlated features into one.
▪ For example,
▪ a car’s mileage may be very correlated with its age
▪ Humidity and temperature
▪ Dimensionality reduction algorithm will merge them into one feature that represents the car’s
wear and tear. This is called feature extraction.
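A minimal feature-extraction sketch, assuming scikit-learn; the two correlated columns (mileage and age) are invented for illustration:

```python
# Minimal feature-extraction sketch: merge two correlated features into one with PCA
# (assumes scikit-learn; the small mileage/age array is invented for illustration).
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[20000, 1], [45000, 2], [80000, 4], [120000, 6], [160000, 8]], dtype=float)
X_wear = PCA(n_components=1).fit_transform(X)   # a single combined "wear and tear" feature
print(X_wear.ravel())
```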
Unsupervised learning
Anomaly detection
▪ For example, detecting unusual credit card transactions to prevent fraud
▪ Catching manufacturing defects
▪ Automatically removing outliers from a dataset before feeding it to another learning algorithm
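A minimal anomaly-detection sketch, assuming scikit-learn; Isolation Forest (one of the algorithms listed earlier) is used on synthetic data:

```python
# Minimal anomaly-detection sketch with Isolation Forest (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # "normal" transactions
               [[8.0, 8.0]]])                     # one obvious outlier
detector = IsolationForest(random_state=42).fit(X)
print(detector.predict([[0.1, 0.2], [8.0, 8.0]]))  # +1 = normal, -1 = anomaly
```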
(Figure: supervised learning vs. unsupervised learning.)
Semisupervised learning
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data
and a little bit of labeled data.
▪ Photo-hosting services, such as Google Photos
▪ Once you upload all your family photos to the service, it automatically recognizes that the same
person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and
7.
▪ This is the unsupervised part of the algorithm
(clustering).
▪ Now all the system needs is for you to tell it
who these people are.
▪ Just one label per person, and it is able to name everyone in every photo, which is useful for searching photos.
Reinforcement Learning
The learning system, called an agent, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards).
It must then learn by itself what is the best strategy, called a policy, to get the most reward over
time.
A policy defines what action the agent should choose when it is in a given situation.
Reinforcement Learning
DeepMind’s AlphaGo program
Made the headlines in May 2017 when it beat the world champion Ke Jie at the game of Go.
It learned its winning policy by analyzing millions of games, and then playing many games against itself
Batch and Online Learning
Batch Learning
▪ The system is incapable of learning incrementally from a stream of incoming data.
▪ It must be trained using all the available data.
▪ This will generally take a lot of time and computing resources, so it is typically done offline.
▪ Also called Offline learning
▪ System is trained and launched into production
▪ It requires no more learning; it just applies what it has learned.
▪ What happens when there is New data ??? (such as a new type of spam)
▪ Train a new version of the system from scratch on the full dataset (not just the new data, but also the
old data)
▪ Stop the old system and replace it with the new one.
Batch Learning
▪ The whole process can be automated fairly easily.
▪ Training on the full set of data requires a lot of computing resources (CPU, memory space, disk
space, disk I/O, network I/O, etc.).
▪ The cost of training is huge.
▪ This is a bad option for a system with limited resources.
Online learning
▪ Train the system incrementally by feeding it data instances
▪ Sequentially
▪ Individually
▪ By small groups called mini-batches.
▪ The algorithm loads part of the data, runs a training step on that data, and repeats the process
until it has run on all of the data
▪ Each learning step is fast and cheap, so the
system can learn about new data on the fly
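A minimal online-learning sketch, assuming scikit-learn; SGDClassifier supports incremental training on mini-batches through partial_fit, and the Iris data is split into chunks to imitate a stream:

```python
# Minimal online-learning sketch: incremental training on mini-batches
# (assumes scikit-learn; Iris split into chunks imitates a stream of incoming data).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X, y = shuffle(X, y, random_state=42)   # simulate instances arriving in random order
classes = np.unique(y)                  # all classes must be declared up front for partial_fit

clf = SGDClassifier(random_state=42)
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    clf.partial_fit(X_batch, y_batch, classes=classes)   # one fast, cheap learning step per mini-batch

print("accuracy on the data seen so far:", clf.score(X, y))
```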
Online learning
▪ Receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or
autonomously.
▪ It is also a good option if you have limited computing resources.
▪ Online learning can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning).
Online learning
Learning rate
▪ How fast should the system adapt to changing data?
▪ If you set a high learning rate, the system will learn fast; it will rapidly adapt to new data, but it will also tend to quickly forget the old data.
▪ If you set a low learning rate, the system will learn more slowly.
▪ A big challenge with online learning is that if bad data is fed to the system, the system’s
performance will gradually decline.
Instance-based learning
▪ Flagging emails that are identical to known spam emails
▪ Alternative approach - flag emails that are very similar to known spam emails.
▪ Measure of similarity between two emails.
▪ Count the number of words they have in common.
▪ Steps
▪ Learns
▪ Generalizes to new cases by comparing them to the learned examples
▪ The new instance would be classified as a triangle because the majority of the most similar
instances belong to that class.
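A minimal instance-based learning sketch, assuming scikit-learn; k-Nearest Neighbors classifies a new case by the majority class among the most similar stored examples (it also mirrors Lab Program 3):

```python
# Minimal instance-based (k-NN) sketch on the Iris dataset (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # "learning" = storing the examples
pred = knn.predict(X_test)                                        # majority vote among 5 most similar
print("correct:", (pred == y_test).sum(), "wrong:", (pred != y_test).sum())
```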
Model-based learning
▪ Steps
▪ Study the data
▪ Select a model
▪ Train it on the training data (i.e., the learning algorithm searches for the model parameter values that minimize a cost function).
▪ Finally, apply the model to make predictions on new cases (this is called inference), hoping that the model will generalize well.
Main Challenges of Machine Learning
Insufficient Quantity of Training Data
❑ Very simple problems need thousands of examples.
❑ Complex problems such as image or speech recognition need millions of examples.
Nonrepresentative Training Data
❑ In order to generalize well, it is crucial that your training data be representative of the new
cases you want to generalize to
Poor-Quality Data
❑ If the training data is full of errors, outliers, and noise, the system finds it difficult to detect the underlying patterns.
❑ Hence it is less likely to perform well. Solution - clean up the training data.
Main Challenges of Machine Learning
Irrelevant Features
System will only be capable of learning if the training data contains enough relevant features
and not too many irrelevant ones
Feature engineering
Identify good set of features to train on
• Feature selection: selecting the most useful features to train on among existing features.
• Feature extraction: combining existing features to produce a more useful one (for example, via dimensionality reduction).
• Creating new features by gathering new data.
Main Challenges of Machine Learning
Overfitting the Training Data
Overfitting means that the model performs well on the training data, but it does not generalize well.
Solutions
▪ To simplify the model by selecting one with fewer parameters
▪ Reduce the number of attributes in the training data or by constraining the model
▪ To gather more training data
▪ To reduce the noise in the training data (e.g., fix data errors and remove outliers)
Main Challenges of Machine Learning
▪ Underfitting the Training Data
▪ Occurs when a model is too simple
▪ Model needs more training time, more input features
▪ The main options to fix this problem are:
• Selecting a more powerful model, with more parameters
• Feeding better features to the learning algorithm (feature engineering)
• Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
Testing and Validating
The only way to know how well a model will generalize to new cases is to try it out on new cases.
▪ One option is to put your model in production and monitor how well it performs.
▪ This is not the best idea.
Solution
▪ Split your data into two sets: the training set and the test set.
▪ The error rate on new cases is called the generalization error
▪ Evaluate model on the test set, get an estimate of this error.
▪ This value tells you how well your model will perform on instances it has never seen before.
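A minimal sketch of this split, assuming scikit-learn; the error on the held-out test set estimates the generalization error:

```python
# Minimal train/test split sketch (assumes scikit-learn; Iris stands in for a real dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)   # estimate of the generalization error
print("estimated generalization error:", test_error)
```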
Hyperparameter Tuning and Model
Selection
▪ How can you decide between two models (say a linear model and a polynomial model)
▪ One option is to train both and compare how well they generalize using the test set.
▪ The problem: by measuring the generalization error multiple times on the test set, you adapt the model and hyperparameters to produce the best model for that particular set.
▪ This means that the model is unlikely to perform as well on new data.
▪ A common solution to this problem is called holdout validation
(Figure: holdout validation.)
Hyperparameter Tuning and Model
Selection - holdout validation
However, if the validation set is too small, then model evaluations will be imprecise
We end up selecting a suboptimal model by mistake.
Conversely, if the validation set is too large, then the remaining training set will be much
smaller than the full training set.
Solution
Perform repeated cross-validation, using many small validation sets
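A minimal cross-validation sketch, assuming scikit-learn; cross_val_score repeatedly trains on part of the data and validates on the remaining fold, then the scores are averaged:

```python
# Minimal cross-validation sketch (assumes scikit-learn; Iris stands in for a real dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print("validation accuracy per fold:", scores)
print("mean validation accuracy:", scores.mean())
```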
(Figure: cross-validation.)
Data Mismatch
▪ One solution is to hold out part of the training set in yet another set, called the train-dev set.
▪ Dataset is split into
o Training set is used for training.
o The Dev set (often called “validation set”) is used for adjusting the model
o The Test set is a final check of your completed model.
▪ After the model is trained, evaluate it on the train-dev set: if performance is good, the model is not overfitting the training set.
▪ In that case, if performance is poor on the validation set, the problem must come from the data mismatch.
Data Mismatch
Here is a way to split the data into three sets: 80% train, 10% dev and 10% test. OR 60% train,
20% dev and 20% test
Decision Trees
Decision Trees
The initial node is called the root node (colored in blue), the final nodes are called the leaf
nodes (colored in green) and the rest of the nodes are called intermediate or internal nodes.
Decision tree that is used to classify whether a person is Fit or Unfit.
Decision Trees
ID3 stands for Iterative Dichotomiser 3: it iteratively (repeatedly) dichotomizes (divides) the features into two or more groups at each step.
Decision Trees
Example instance: (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
The tree corresponds to the expression: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
APPROPRIATE PROBLEMS FOR
DECISION TREE LEARNING
Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs.
▪ Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).
▪ Each attribute takes on a small number of disjoint possible values e.g., Hot, Mild, Cold
2. The target function has discrete output values.
▪ Assigns a boolean classification (e.g., yes or no) to each example
3. Disjunctive descriptions may be required - represent disjunctive expressions.
4. Decision tree learning methods are robust to errors
▪ Errors in classifications of the training examples
▪ Errors in the attribute values that describe these examples
THE BASIC DECISION TREE LEARNING
ALGORITHM
▪ Our basic algorithm, ID3, learns decision trees by constructing them top-down
▪ Which attribute should be tested at the root of the tree?
▪ Each instance attribute is evaluated using a statistical test to determine how well it alone classifies
the training examples.
▪ The best attribute is selected and used as the test at the root node of the tree.
▪ A descendant of the root node is then created for each possible value of this attribute, and the
training examples are sorted to the appropriate descendant node
THE BASIC DECISION TREE LEARNING
ALGORITHM
Which Attribute Is the Best Classifier?
What is a good quantitative measure of an attribute?
Statistical property - information gain and Gini Index
Measures how well a given attribute separates the training examples according to their target
classification.
ID3 uses this information gain measure to select among the candidate attributes at each step
while growing the tree.
THE BASIC DECISION TREE LEARNING
ALGORITHM
Information gain and Entropy
Entropy: Entropy is a measure of any sort of uncertainty that is present in data.
Information gain: Suggests how much information a particular feature or a particular
variable gives us about final outcomes.
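The formulas themselves appear only as images in the original slides; the sketch below is a minimal pure-Python version of entropy and information gain, checked against the standard 14-example PlayTennis data that these slides use:

```python
# Minimal sketch of entropy and information gain (pure Python; the 14 PlayTennis
# labels and Outlook values below follow the standard example used in these slides).
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum over values v of A of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
print(round(entropy(play), 3))                    # about 0.940 for a 9+/5- sample
print(round(information_gain(play, outlook), 3))  # about 0.247 (the slides quote 0.246)
```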
THE BASIC DECISION TREE LEARNING
ALGORITHM
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
According to the information gain measure, the Outlook attribute provides the best prediction of the target attribute, PlayTennis, over the training examples.
Therefore, Outlook is selected as the decision attribute for the root node, and branches are
created below the root for each of its possible values (i.e.,Sunny, Overcast, and Rain).
Decision Trees
Building a Decision Tree
1. First test all attributes and select the one that would function as the best root;
2. Break-up the training set into subsets based on the branches of the root node;
3. Test the remaining attributes to see which ones fit best underneath the branches of the root
node;
4. Continue this process for all other branches until
a. all examples of a subset are of one type
b. there are no examples left (return majority classification of the parent)
c. there are no more attributes left (default value should be majority classification)
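A minimal sketch of growing such a tree with a library implementation, assuming scikit-learn; criterion="entropy" chooses splits by information gain, in the spirit of ID3, although scikit-learn itself builds binary CART trees:

```python
# Minimal decision-tree training sketch (assumes scikit-learn; Iris stands in for a real dataset).
# criterion="entropy" selects splits by information gain, in the spirit of ID3,
# although scikit-learn's implementation grows binary CART trees.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))   # inspect the learned tree
print(tree.predict([X[0]]))                                         # classify a new sample
```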
Decision Trees
Possible ways to partition tuples based on a splitting attribute A:
A is discrete-valued
A is continuous-valued
A is discrete-valued and a binary tree must be produced
Issues of Decision Trees - Overfitting
Overfitting occurs when a decision tree model tries to cover all the data points and, in doing so, also captures noise and irrelevant patterns in the training data instead of the underlying true patterns.
The model becomes specialized in the training data and fails to generalize well to
unseen data.
◦ This phenomenon is called Variance
◦ Variance: If the machine learning model performs well with the training dataset, but does
not perform well with the test dataset, then variance occurs.
Issues of Decision Trees
Classification Model must have
◦ Low Training Errors
◦ Low Generalisation Errors
Issues of Decision Trees- Underfitting
Model fails to capture the underlying patterns in the training data
Hence reduced accuracy and produces unreliable predictions.
How to avoid underfitting:
• By increasing the training time of the model.
• By increasing the number of features.
Issues of Decision Trees- Overfitting
What is pruning ?
In general, pruning is the process of removing selected parts of a plant, such as buds, branches, and roots.
In decision trees, pruning removes branches of the tree.
This is done to overcome the overfitting condition of the decision tree.
Types
Post Pruning
Pre-Pruning
Decision Trees
Since there is an outlier, the partition of the tree is not pure, so the algorithm adds more layers.
But on the test data the three blue dots get misclassified as orange (the tree overfits the training data, since it is not capable of generalizing to new orange and blue dots).
Solution – prune the tree.
Prepruning approach
A tree is “pruned” by halting its construction early
◦ Decide not to further split or partition the subset of training tuples at a given node.
◦ The tree should not get too deep → specify a maximum depth.
Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples or the probability distribution
of those tuples.
Prepruning approach
Setting max depth = 3 (pre-pruning).
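A minimal pre-pruning sketch, assuming scikit-learn; max_depth=3 halts construction early, as in the figure this slide described:

```python
# Minimal pre-pruning sketch: halt construction early with max_depth=3
# (assumes scikit-learn; Iris stands in for the two-class dot example in the slides).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)                  # unpruned
pruned = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)   # pre-pruned
print("full tree depth:", full.get_depth(), "test accuracy:", full.score(X_test, y_test))
print("pre-pruned depth:", pruned.get_depth(), "test accuracy:", pruned.score(X_test, y_test))
```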
Post Pruning
Removes subtrees from a “fully grown” tree.
A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
The leaf is labeled with the most frequent class among the subtree being replaced.
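A minimal post-pruning sketch, assuming scikit-learn; cost-complexity pruning (ccp_alpha) removes subtrees from a fully grown tree, which is one concrete way to realize this idea (it is not the reduced-error procedure discussed later):

```python
# Minimal post-pruning sketch via cost-complexity pruning (assumes scikit-learn;
# this is one concrete form of post-pruning, not the exact reduced-error procedure).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)   # grow the full tree
path = full.cost_complexity_pruning_path(X_train, y_train)             # candidate pruning strengths

# Refit with a moderate alpha: a larger alpha means more aggressive pruning (a smaller tree).
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
print("full leaves:", full.get_n_leaves(), "pruned leaves:", pruned.get_n_leaves())
print("full test acc:", full.score(X_test, y_test), "pruned test acc:", pruned.score(X_test, y_test))
```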
The tree has too many layers. Start at the deepest layer. Since the training data
has 3 blue and 1 orange the leaf would be blue
Post Pruning
Here the Most common class within this subtree is “class B.”
In the pruned version of the tree, the subtree in question is pruned by replacing it with the leaf “class B.”
Decision Trees
▪ During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser)
▪ Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning algorithms are often compared.
▪ In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published
the book Classification and Regression Trees (CART), which described the generation of
binary decision trees
CART (Classification And Regression Tree)
in Machine Learning
▪ Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees (also called “growing” trees).
▪ It works by recursively partitioning the data into smaller and smaller subsets based on
certain criteria.
▪ The goal is to create a tree structure that can accurately predict the target variable for new
data points.
CART
Gini Index
◦ The Gini index measures the impurity of D, a data partition or set of training tuples, as
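The formula itself appears on the slide only as an image; its standard form, with p_i the probability that a tuple in D belongs to class C_i (estimated as |C_{i,D}|/|D|) and m the number of classes, is

Gini(D) = 1 − \sum_{i=1}^{m} p_i^2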
CART
If A has v possible values, then there are 2^v possible subsets.
For example, if income has three possible values, namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
Excluding the full set and the empty set (which do not represent a split), there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary split on A.
CART
If a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is
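The formula here is likewise an image in the original; its standard form is

Gini_A(D) = (|D_1|/|D|) Gini(D_1) + (|D_2|/|D|) Gini(D_2)

and the reduction in impurity offered by a binary split on A is ΔGini(A) = Gini(D) − Gini_A(D).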
CART
Buys computer = YES/NO
To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute.
CART
Income
◦ {low, high} and {medium}) → 0.458
◦ {medium, high} and {low} → 0.450
◦ {low, medium} (or {high}) → 0.443
◦ The best binary split for attribute income is {low, medium} (or {high}) because it minimizes the Gini
index.
Evaluating age
◦ {youth, senior} (or {middle aged}) → Gini index of 0.357
Student → Gini index 0.367
Credit rating → Gini index 0.429
The attribute age with splitting subset {youth, senior} gives the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.
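A short sketch that reproduces these numbers, assuming the standard 14-tuple buys_computer example (9 yes / 5 no) that these slides follow:

```python
# Sketch reproducing the Gini values above, assuming the standard 14-tuple
# buys_computer example (9 yes / 5 no) that these slides follow.
def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_split(part1, part2):
    """Weighted Gini of a binary split; each part is a (yes, no) count pair."""
    n1, n2 = sum(part1), sum(part2)
    n = n1 + n2
    return (n1 / n) * gini(*part1) + (n2 / n) * gini(*part2)

print(round(gini(9, 5), 3))                               # Gini(D) = 0.459
print(round(gini_split((7, 3), (2, 2)), 3))               # income {low, medium} vs {high}  -> 0.443
print(round(gini_split((5, 5), (4, 0)), 3))               # age {youth, senior} vs {middle} -> 0.357
print(round(gini(9, 5) - gini_split((5, 5), (4, 0)), 3))  # reduction in impurity -> 0.102
```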
ISSUES IN DECISION TREE LEARNING
1. Avoiding Overfitting the Data
◦ REDUCED ERROR PRUNING
◦ RULE POST-PRUNING
2. Incorporating Continuous-Valued Attributes
3. Alternative Measures for Selecting Attributes
4.Handling Training Examples with Missing Attribute Values
5.Handling Attributes with Differing Costs
Avoiding Overfitting the Data
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
1. Avoiding Overfitting the Data
Figure - Illustrates the impact of overfitting in a typical application of decision tree learning.
1. Avoiding Overfitting the Data
▪ ID3 algorithm is applied to the task of learning which medical patients have a form of diabetes.
▪ The horizontal axis of this plot indicates the total number of nodes in the decision tree, as the
tree is being constructed.
▪ The vertical axis indicates the accuracy of predictions made by the tree.
▪ The solid line shows the accuracy of the decision tree over the training examples, whereas the
broken line shows accuracy measured over an independent set of test examples (not included in
the training set).
▪ Predictably, the accuracy of the tree over the training examples increases monotonically as the
tree is grown.
▪ However, the accuracy measured over the independent test examples first increases, then
decreases.
▪ Once the tree size exceeds approximately 25 nodes, further elaboration of the tree decreases its
accuracy over the test examples despite increasing its accuracy on the training examples.
1. Avoiding Overfitting the Data
▪ There are several approaches to avoiding overfitting in decision tree learning.
▪ These can be grouped into two classes :
▪ Approaches that stop growing the tree earlier, before it reaches the point where it perfectly
classifies the training data
▪ Approaches that allow the tree to overfit the data, and then post-prune the tree.
1. Avoiding Overfitting the Data
Whether the correct tree size is found by stopping early or by post-pruning, what criterion is used to determine the correct final tree size?
1. Use a separate set of examples- distinct from the training examples, to evaluate the utility
of post-pruning nodes from the tree.
2. Apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement (for example, a chi-square test).
3. Minimum Description Length principle - explicit measure of the complexity for encoding the
training examples and the decision tree, halting growth of the tree when this encoding size
is minimized.
1.1 REDUCED ERROR PRUNING
How exactly can we use a validation set to prevent overfitting?
reduced-error pruning
◦ Consider each of the decision nodes in the tree to be candidates for pruning.
◦ Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf
node,
◦ Assign it the most common classification of the training examples affiliated with that node.
◦ A node is removed only if the resulting pruned tree performs no worse than the original over the validation set.
1.1 REDUCED ERROR PRUNING
Figure - Shows the impact of reduced error pruning of the tree produced by ID3.
There is an increase in accuracy over the test set as nodes are pruned from the tree.
Here, the validation set used for pruning is distinct from both the training and test sets.
1.2 RULE POST-PRUNING
Rule post-pruning involves the following steps:
1. Infer the decision tree from the training set, growing the tree until the training
data is fit as well as possible and allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule
for each path from the root node to a leaf node.
3. Prune (generalize) each rule by removing any preconditions that result in
improving its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
1.2 RULE POST-PRUNING
Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition), and the classification at the leaf node becomes the rule consequent (postcondition).
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
1.2 RULE POST-PRUNING
Rule post-pruning would consider removing the preconditions
(Outlook = Sunny) and (Humidity = High).
It would select whichever of these pruning steps produces the greatest improvement in estimated rule accuracy.
Then consider pruning the second precondition as a further pruning step.
No pruning step is performed if it reduces the estimated rule accuracy.
1.2 RULE POST-PRUNING
One method to estimate rule accuracy is to use a validation set of examples disjoint from the training set.
Another method, used by C4.5, is to evaluate performance based on the training set itself,
using a pessimistic estimate
◦ calculating the rule accuracy
◦ calculating the standard deviation
1.2 RULE POST-PRUNING
Why convert the decision tree to rules before pruning? There are three main advantages.
◦ Converting to rules allows distinguishing among the different contexts in which a decision node is
used.
◦ Each distinct path through the decision tree node produces a distinct rule,
◦ The pruning decision about that attribute test can be made differently for each path.
◦ In contrast, if the tree itself were pruned, the only two choices would be to remove the decision node completely or to retain it in its original form.
Converting to rules removes the distinction between attribute tests that occur near the root
of the tree and those that occur near the leaves.
Converting to rules improves readability - rules are often easier for people to understand.
2.Incorporating Continuous-Valued
Attributes
▪ ID3 is restricted to attributes that take on a discrete set of values.
▪ First, the target attribute whose value is predicted by the learned tree must be discrete
valued.
▪ Second, the attributes tested in the decision nodes of the tree must also be discrete valued.
▪ Second restriction can be removed by incorporating continuous-valued decision attributes
▪ Define new discrete valued attributes that partition the continuous attribute value into a
discrete set of intervals.
▪ In particular, for an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute A_c that is true if A < c and false otherwise.
▪ The only question is how to select the best value for the threshold c.
2. Incorporating Continuous-Valued
Attributes
▪ Generate a set of candidate thresholds midway between the corresponding values of A.
▪ computing the information gain for the Candidate thresholds
▪ In the example, there are two candidate thresholds, corresponding to the values of
Temperature at which the value of PlayTennis changes:
▪ (48 + 60)/2, and (80 + 90)/2.
▪ The information gain can then be computed for each of the candidate attributes, Temperature > 54 and Temperature > 85, and the best one selected.
3.Alternative Measures for Selecting
Attributes
Consider adding an attribute such as Date (e.g., March 4, 1979) to the data in Table 3.2.
Such an attribute would separate the training examples very well, yet it is a very poor predictor of the target function over unseen instances.
One fix is to select decision attributes based on some measure other than information gain (for example, the gain ratio).
4. Handling Training Examples with
Missing Attribute Values
▪ Assign it the value that is most common among training examples at node n.
▪ Assign a probability to each of the possible values of A rather than simply assigning the most
common value to A(x)
If node n contains six known examples with A = 1
four known examples with A = 0,
▪ probability that A(x) = 1 is 0.6, and
▪ probability that A(x) = 0 is 0.4.
▪ Fractional 0.6 of instance x is now distributed down the branch for A = 1 and
▪ Fractional 0.4 of x down the other tree branch.
5 Handling Attributes with Differing
Costs
Patients in terms of attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults,
etc.
These attributes vary significantly in their costs
A cost term into the attribute selection measure
Tan and Schlimmer (1990) and Tan (1993) describe - robot perception task
Attribute cost is measured by the number of seconds required to obtain the attribute value
by positioning and operating the sonar.
5 Handling Attributes with Differing
Costs
Nunez (1988) describes an application to learning medical diagnosis rules,
where w ∈ [0, 1] is a constant that determines the relative importance of cost versus information gain.
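The measure itself appears only as an image in the original slide; in Mitchell's presentation, Nunez's cost-sensitive measure replaces information gain with

(2^{Gain(S, A)} − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] plays the role described above.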