Pks Machine Learning Module 2 3

The document outlines the course structure and key concepts in Machine Learning, specifically focusing on the design of learning systems and the concept of learning. It discusses the steps involved in creating a learning system, including choosing training experiences, determining target functions, and selecting approximation algorithms. Additionally, it covers concept learning, hypothesis representation, and various algorithms such as Find-S and Candidate Elimination for hypothesis generation and refinement.

Uploaded by

c.shopper.05

MACHINE LEARNING

Course Code- BCS602


CIE Marks 50, SEE Marks 50
Teaching Hours/Week (L: T:P: S) 4:0:0:0
Total Hours of Pedagogy 50
Total Marks 100
Module-2 continued…

• Basic Learning Theory: Design of Learning System, Introduction to Concept of Learning, Modelling in Machine Learning.
• Chapter-3 (3.3, 3.4, 3.6)


3.3 DESIGN OF A LEARNING SYSTEM
• A system that is built around a learning algorithm is called a learning system.
• The design of systems focuses on these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation
Training Experience
• Consider designing a chess-playing program. With direct experience, individual board states and
the correct move for each are given directly. With indirect experience, only the move sequences and
the final results are given. The training experience also depends on the presence of a supervisor who
can label all valid moves for a board state. In the absence of a supervisor, the game agent plays
against itself and learns the good moves. This works only if the training samples cover all scenarios,
in other words, if they are distributed widely enough for performance to be measured. If the training
samples and the test samples follow the same distribution, the results will be good.
Determine the Target Function
• The next step is the determination of a target function. In this step, the type of knowledge that
needs to be learnt is determined. In direct experience, a board move is selected and it is determined
whether it is a good move or not against all other moves. If it is the best move, then it is chosen
as: B -> M, where B is the set of legal board states and M is the set of legal moves. In indirect
experience, all legal moves are accepted and a score is generated for each. The move with the
largest score is then chosen and executed.
Determine the Target Function Representation
• The representation of knowledge may be a table, a collection of rules or a neural network. A
linear combination of board features can be written as:
$\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$
• where x1, x2 and x3 represent different board features and w0, w1, w2 and w3 represent
weights.
Choosing an Approximation Algorithm for the Target Function
• The focus is to choose weights that fit the given training samples effectively. The aim is to
reduce the error given as:
$E = \sum_{b} \left( V_{train}(b) - \hat{V}(b) \right)^2$
• Here, b is the sample, $V_{train}(b)$ is its training value and $\hat{V}(b)$ is the predicted hypothesis. The approximation is carried out as follows:
• Compute the error as the difference between the training value and the predicted hypothesis. Let this error be
error(b) $= V_{train}(b) - \hat{V}(b)$.
• Then, for every board feature $x_i$, the weights are updated as:
$w_i \leftarrow w_i + \mu \cdot error(b) \cdot x_i$
• Here, µ is the constant that moderates the size of the weight update.
• Thus, the learning system has the following components:
1. A Performance system that allows the game to play against itself.
2. A Critic system that generates the training samples.
3. A Generalizer system that generates a hypothesis based on the samples.
4. An Experimenter system that generates a new problem based on the currently learnt function.
This problem is sent as input to the performance system.
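The weight-fitting step of Section 3.3 can be sketched in a few lines. This is a minimal illustration, assuming the standard least-mean-squares rule w_i ← w_i + µ · error(b) · x_i with error(b) = V_train(b) − V̂(b); the feature values, training values and the choice µ = 0.01 are made up for the example and do not come from the text.

```python
def predict(weights, features):
    """V_hat(b) = w0 + w1*x1 + ... + wn*xn for one board's feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, features, v_train, mu=0.01):
    """One update step: w_i <- w_i + mu * error(b) * x_i (x_0 = 1 for w0)."""
    error = v_train - predict(weights, features)
    updated = [weights[0] + mu * error]
    updated += [w + mu * error * x for w, x in zip(weights[1:], features)]
    return updated


def squared_error(weights, samples):
    """Sum of squared differences between training and predicted values."""
    return sum((v - predict(weights, x)) ** 2 for x, v in samples)


# Hypothetical samples: (board features x1..x3, training value V_train(b)).
samples = [
    ([3.0, 1.0, 0.0], 10.0),
    ([1.0, 2.0, 1.0], -5.0),
    ([2.0, 2.0, 2.0], 0.0),
]

weights = [0.0, 0.0, 0.0, 0.0]
initial_error = squared_error(weights, samples)
for _ in range(2000):                      # repeated passes over the samples
    for features, v_train in samples:
        weights = lms_update(weights, features, v_train, mu=0.01)
final_error = squared_error(weights, samples)
```

A small µ keeps each step stable; a larger value learns faster but can overshoot and diverge.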
3.4 INTRODUCTION TO CONCEPT LEARNING
• Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general concept or
deriving a category from the given training samples. It is a process of abstraction and generalization from
the data.
• Concept learning helps to classify an object that has a set of common, relevant features. Thus, it helps a
learner compare and contrast categories based on the similarity and association of positive and negative
instances in the training data to classify an object.
• The learner tries to simplify by observing the common features in the training samples and then applies
this simplified model to future samples. This task is also known as learning from experience.
• Each concept or category obtained by learning is a Boolean-valued function, which takes a true or false
value. For example, humans can identify different kinds of animals based on common relevant features
and categorize all animals based on specific sets of features.
• The special features that distinguish one animal from another can be called a concept. This way of
learning categories for objects and recognizing new instances of those categories is called concept
learning.
• It is formally defined as inferring a Boolean valued function by processing training instances.
Concept learning requires three things:
1. Input- Training dataset which is a set of training instances, each labeled with the
name of a concept or category to which it belongs. Use this past experience to
train and build the model.
2. Output - Target concept or target function f. It is a mapping function f(x) from
input x to output y. Its purpose is to determine the specific or common features that
identify an object. In other words, the goal is to find the hypothesis that determines the
target concept, for example, the specific set of features that identifies an elephant among
all animals.
3. Test- New instances to test the learned model.
• Formally, Concept learning is defined as "Given a set of hypotheses, the learner searches through
the hypothesis space to identify the best hypothesis that matches the target concept".
• Consider the following set of training instances shown in Table 3.1.
• Table 3.1: Sample Training Instances

• Here, in this set of training instances, the independent attributes considered are ‘Horns’, ‘Tail’,
‘Tusks’, ‘Paws’, ‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’. The dependent attribute is ‘Elephant’.
• The target concept is to identify whether the animal is an elephant.
• Let us now take this example and understand further the concept of hypothesis.
• Target Concept: Predict the type of animal, for example, ‘Elephant’.
3.4.1 Representation of a Hypothesis
• A hypothesis ‘h’ approximates a target function ‘f’ to represent the relationship
between the independent attributes and the dependent attribute of the training
instances. The hypothesis is the predicted approximate model that best maps the
inputs to outputs. Each hypothesis is represented as a conjunction of attribute
conditions in the antecedent part.
• For example, (Tail = Short) ^ (Color = Black)....
• The set of all hypotheses in the search space is generally denoted by ‘H’, and ‘h’ is used to
represent a single candidate hypothesis. (‘Hypotheses’ is the plural form of ‘hypothesis’.)
• Each attribute condition is the constraint on the attribute which is represented as
attribute-value pair.
• In the antecedent of an attribute condition of a hypothesis, each attribute can take
value as either ‘?’ or ‘φ’ or can hold a single value.
• “?” denotes that the attribute can take any value [e.g., Color = ?]
• “φ” denotes that the attribute cannot take any value, i.e., it represents a null
value [e.g., Horns = φ]
• Single value denotes a specific single value from acceptable values of the
attribute, i.e., the attribute ‘Tail’ can take a value as ‘short’ [e.g., Tail = Short]
• For example, a hypothesis ‘h’ will look like,

• Given a test instance x, we say h(x)=1, if the test instance x satisfies this
hypothesis h.
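The test h(x) = 1 can be sketched as a small function. This is an illustrative sketch only: the hypothesis and the two instances below are hypothetical values over the eight attributes named above, not rows of Table 3.1, and the tuple encoding is an assumption made for the example.

```python
ANY = "?"    # the attribute may take any value
NONE = "φ"   # the attribute may take no value


def satisfies(hypothesis, instance):
    """Return 1 if every attribute condition of h accepts x, else 0."""
    for condition, value in zip(hypothesis, instance):
        if condition == NONE:
            return 0                     # φ rejects every instance
        if condition != ANY and condition != value:
            return 0                     # a specific value must match exactly
    return 1


# Attribute order: Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size
h = ("No", "Short", "Yes", ANY, ANY, "Black", ANY, ANY)          # hypothetical
x1 = ("No", "Short", "Yes", "No", "No", "Black", "No", "Big")
x2 = ("Yes", "Short", "No", "No", "Yes", "Brown", "Yes", "Medium")

most_general = (ANY,) * 8     # accepts every instance
most_specific = (NONE,) * 8   # accepts no instance
```

Here `satisfies(h, x1)` is 1 and `satisfies(h, x2)` is 0, because x2 fails the condition Horns = No.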
• The training dataset given above has 5 training instances with 8 independent
attributes and one dependent attribute.
• Here, the different hypotheses that can be predicted for the target concept are,

• The task is to predict the best hypothesis for the target concept (an elephant).
• The most general hypothesis allows any value for each attribute.
• It is represented as: <?, ?, ?, ?, ?, ?, ?, ?>
• This hypothesis indicates that any animal can be an elephant.
• The most specific hypothesis does not allow any value for any attribute:
<φ, φ, φ, φ, φ, φ, φ, φ>. This hypothesis indicates that no animal can be an elephant.
• The target concept mentioned in this example is to identify the conjunction of
specific features from the training instances to correctly identify an elephant.
• Thus, concept learning can also be called Inductive Learning, which tries to induce
a general function from specific training instances.
• The assumption that a hypothesis which approximates the target function well over
a sufficiently large set of training instances will also approximately classify
other unobserved instances is called the inductive learning hypothesis.
• We can only determine an approximate target function because it is very difficult
to find an exact target function with the observed training instances.
• That is why a hypothesis is an approximate target function that best maps the
inputs to outputs.
3.4.2 Hypothesis Space
• Hypothesis space is the set of all possible hypotheses that approximates the target function f.
• In other words, the set of all possible approximations of the target function can be defined as
hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning
algorithm would determine the best possible hypothesis that would best describe the target
function or best fit the outputs.
• Generally, a hypothesis representation language represents a larger hypothesis space. Every
machine learning algorithm would represent the hypothesis space in a different manner about the
function that maps the input variables to output variables.
• For example, a regression algorithm represents the hypothesis space as a linear function whereas
a decision tree algorithm represents the hypothesis space as a tree.
• The set of hypotheses that can be generated by a learning algorithm can be further reduced by
specifying a language bias.
• The subset of the hypothesis space that is consistent with all observed training instances is called the
Version Space. The version space contains the only hypotheses that are used for classification.
• For example, each of the attribute given in the Table 3.1 has the following possible set of values.
•Horns - Yes, No
•Tail - Long, Short
•Tusks - Yes, No
•Paws - Yes, No
•Fur - Yes, No
•Color - Brown, Black, White
•Hooves - Yes, No
•Size - Medium, Big
• Considering these values for each attribute, there are (2×2×2×2×2×3×2×2) = 384 distinct instances,
which cover all 5 instances in the training dataset.
• So, we can generate (4×4×4×4×4×5×4×4) = 81,920 distinct hypotheses when two more values
[?, φ] are included for each attribute. However, any hypothesis containing one or more φ symbols represents
the empty set of instances; that is, it classifies every instance as a negative instance. Therefore, there are
(3×3×3×3×3×4×3×3 + 1) = 8,749 semantically distinct hypotheses: those that add only ‘?’ to the values of each
attribute, plus one hypothesis representing the empty set of instances. Thus, the hypothesis space is much larger
than the instance space, and hence we need efficient learning algorithms to search for the best hypothesis from
the set of hypotheses.
• Hypothesis ordering is also important, wherein the hypotheses are ordered from the most specific one to the most general one.
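The three counts for the elephant example can be checked directly from the number of values per attribute. The short sketch below only redoes that arithmetic; the dictionary name is an illustrative choice.

```python
from math import prod

# Number of possible values for each attribute listed above.
values_per_attribute = {
    "Horns": 2, "Tail": 2, "Tusks": 2, "Paws": 2,
    "Fur": 2, "Color": 3, "Hooves": 2, "Size": 2,
}
counts = list(values_per_attribute.values())

# Every combination of attribute values is one distinct instance.
distinct_instances = prod(counts)

# Each attribute may additionally be '?' or 'φ' (two extra symbols).
syntactic_hypotheses = prod(n + 2 for n in counts)

# All hypotheses containing a 'φ' denote the same empty set, so allow only
# '?' as the extra symbol and add one hypothesis for the empty set.
semantic_hypotheses = prod(n + 1 for n in counts) + 1
```

This gives 384 instances, 81,920 syntactically distinct hypotheses and 8,749 semantically distinct hypotheses.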
3.4.3 Heuristic Space Search
• Heuristic search is a search strategy that finds an optimized hypothesis/solution to a
problem by iteratively improving the hypothesis/solution based on a given heuristic
function or a cost measure.
• Heuristic search methods will generate a possible hypothesis that can be a solution in the
hypothesis space or a path from the initial state. This hypothesis will be tested with the
target function or the goal state to see if it is a real solution.
• If the tested hypothesis is a real solution, then it is selected. This method generally
increases efficiency because it finds a good hypothesis quickly, although that hypothesis may not
be the best one. It is useful for solving tough problems that cannot be solved efficiently by
exact methods. A typical example of a problem solved by heuristic search is the
travelling salesman problem.
• Several commonly used heuristic search methods are hill climbing methods, constraint
satisfaction problems, best-first search, simulated-annealing, A* algorithm, and genetic
algorithms.
3.4.4 Generalization and Specialization
• To understand how this concept hierarchy is constructed, let us apply
the general principle of the generalization/specialization relation. By generalizing
the most specific hypothesis and by specializing the most general
hypothesis, the hypothesis space can be searched for an approximate hypothesis
that matches all positive instances but does not match any negative instance.
Searching the Hypothesis Space
• There are two ways of learning the hypothesis, consistent with all training
instances from the large hypothesis space.
1. Specialization - General to Specific learning
2. Generalization -Specific to General learning
Generalization - Specific to General Learning

• This learning methodology will search through the hypothesis space for an
approximate hypothesis by generalizing the most specific hypothesis.
Specialization - General to Specific Learning
This learning methodology will search through the hypothesis space for an
approximate hypothesis by specializing the most general hypothesis.
3.4.5 Hypothesis Space Search by Find-S Algorithm
• The Find-S algorithm is guaranteed to converge to the most specific hypothesis in H
that is consistent with the positive instances in the training dataset.
• Provided the target concept is in H and the training data are correct, this hypothesis
will also be consistent with the negative instances. Thus, this
algorithm considers only the positive instances and ignores the negative instances
while generating the hypothesis. It starts with the most specific
hypothesis.
• Example 3.4: Consider the training dataset of 4 instances shown in Table 3.2. It
contains the details of the performance of students and their likelihood of getting a
job offer or not in their final semester. Apply the Find-S algorithm.
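Find-S itself can be sketched as follows. Table 3.2 is not reproduced here, so the sketch uses a small made-up dataset with three hypothetical attributes (CGPA, Interactiveness, Communication); it illustrates the procedure under those assumptions and is not the worked solution to Example 3.4.

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses.

    examples: list of (attribute tuple, label), label True for a positive
    instance. Returns the most specific hypothesis covering all positives.
    """
    n = len(examples[0][0])
    h = ["φ"] * n                      # start with the most specific hypothesis
    for instance, positive in examples:
        if not positive:
            continue                   # negative instances are ignored
        for i, value in enumerate(instance):
            if h[i] == "φ":
                h[i] = value           # first positive: copy its values
            elif h[i] != value:
                h[i] = "?"             # disagreement: generalize to '?'
    return tuple(h)


# Hypothetical data: (CGPA, Interactiveness, Communication) -> job offer
training = [
    (("High", "Yes", "Good"), True),
    (("High", "No", "Good"), True),
    (("Low", "No", "Poor"), False),
    (("High", "Yes", "Good"), True),
]
hypothesis = find_s(training)
```

For this toy data the result is `("High", "?", "Good")`: only Interactiveness differs among the positive instances, so only that field is generalized.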
Limitations of Find-S Algorithm
1. Find-S algorithm tries to find a hypothesis that is consistent with positive
instances, ignoring all negative instances. As long as the training dataset is
consistent, the hypothesis found by this algorithm may be consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many
other hypotheses that are consistent with the training dataset.
3. Many times, the training dataset may contain some errors; hence such
inconsistent data instances can mislead this algorithm in determining the consistent
hypothesis since it ignores negative instances.
• Hence, it is necessary to find the set of hypotheses that are consistent with the
training data including the negative examples.
• To overcome the limitations of Find-S algorithm, Candidate Elimination algorithm
was proposed to output the set of all hypotheses consistent with the training
dataset.
3.4.6 Version Spaces
• The version space contains the subset of hypotheses from the hypothesis space that is consistent
with all training instances in the training dataset.
List-Then-Eliminate Algorithm
• The principal idea of this learning algorithm is to initialize the version space to contain all
hypotheses and then eliminate any hypothesis that is found to be inconsistent with any training
instance. The algorithm starts with a version space containing all hypotheses and then scans
each training instance in turn. The hypotheses that are inconsistent with the training instance are
eliminated. Finally, the algorithm outputs the list of remaining hypotheses, which are all consistent.

• This algorithm works if the hypothesis space is finite, but in practice it is difficult to deploy
because every hypothesis must be enumerated. Hence, a variation of this idea is introduced in the
Candidate Elimination algorithm.
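For a very small hypothesis space, List-Then-Eliminate can be written out by brute force. The sketch below assumes conjunctive hypotheses in which each attribute is either a specific value or ‘?’; the three attributes and four instances are hypothetical. The all-φ hypothesis is left out of the enumeration because any positive instance would eliminate it.

```python
from itertools import product


def matches(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def list_then_eliminate(domains, examples):
    """Enumerate every hypothesis, then drop those inconsistent with any example."""
    version_space = list(product(*[values + ["?"] for values in domains]))
    for instance, positive in examples:
        version_space = [h for h in version_space
                         if matches(h, instance) == positive]
    return version_space


# Hypothetical attributes: CGPA, Interactiveness, Communication
domains = [["High", "Low"], ["Yes", "No"], ["Good", "Poor"]]
training = [
    (("High", "Yes", "Good"), True),
    (("High", "No", "Good"), True),
    (("Low", "No", "Poor"), False),
    (("High", "Yes", "Good"), True),
]
version_space = list_then_eliminate(domains, training)
```

Here only 27 hypotheses are enumerated and 3 survive. With the eight elephant attributes the same loop would already have to scan several thousand hypotheses, which is why enumeration does not scale.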
Version Spaces and the Candidate Elimination Algorithm
• Version space learning aims to generate all hypotheses that are consistent with the training data.
• This algorithm computes the version space by the combination of the two cases
namely,
• Specific to General learning -Generalize S to include the positive example
• General to Specific learning -Specialize G to exclude the negative example
• Using the Candidate Elimination algorithm, we can compute the version space
containing all (and only those) hypotheses from H that are consistent with the
given observed sequence of training instances.
• The algorithm defines two boundaries called ‘general boundary’ which is a set of
all hypotheses that are the most general and ‘specific boundary’ which is a set of
all hypotheses that are the most specific.
• Thus, the algorithm represents the version space by only those hypotheses that
are most general and most specific. It therefore provides a compact representation of
the version space that the List-Then-Eliminate algorithm would list explicitly.
Generating Positive Hypothesis ‘S’
• If the instance is positive, refine S to include it. We need to
generalize S to include the positive instance. The hypothesis is the conjunction of
‘S’ and the positive instance. When generalizing, for the first positive instance, add to
S all minimal generalizations such that S is filled with the attribute values of the
positive instance. For each subsequent positive instance scanned, compare the
attribute values of the positive instance with S obtained in the previous iteration.
• If the attribute values of the positive instance and S are different, fill that field
with a ‘?’.
• If the attribute values of the positive instance and S are the same, no change is required.
• If the instance is negative, S is left unchanged.
Generating Negative Hypothesis ‘G’
• If the instance is negative, refine G to exclude it. Then, prune G to
exclude all hypotheses in G that are inconsistent with the positive instances. The idea is to add to
G all minimal specializations that exclude the negative instance and remain consistent with the
positive instances. The negative hypothesis is the general hypothesis.
• If the attribute values of the positive and negative instances are different, then fill that field
with the positive instance value so that the hypothesis does not classify that negative
instance as true.
• If the attribute values of the positive and negative instances are the same, then there is no need to
update ‘G’; fill that attribute value with a ‘?’.
• Generating Version Space - [Consistent Hypothesis] We need to take the combination
of sets in ‘G’ and check it against ‘S’. Only when the fields of the combined set match the
fields in ‘S’ is it included in the version space as a consistent hypothesis.
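The S and G updates can be put together in a compact sketch. It assumes conjunctive hypotheses, noise-free data and a single hypothesis in S, and it omits the step of the full algorithm that discards S when S itself covers a negative instance. The attribute domains and instances are hypothetical.

```python
def matches(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def covers(general, specific):
    """True if `general` is at least as general as `specific` (no φ in either)."""
    return all(g == "?" or g == s for g, s in zip(general, specific))


def candidate_elimination(domains, examples):
    n = len(domains)
    S = None                                  # None stands for <φ, ..., φ>
    G = [("?",) * n]                          # most general boundary
    for x, positive in examples:
        if positive:
            # Minimally generalize S so that it includes x.
            S = tuple(x) if S is None else tuple(
                s if s == v else "?" for s, v in zip(S, x))
            G = [g for g in G if matches(g, x)]    # prune inconsistent g
        else:
            refined = []
            for g in G:
                if not matches(g, x):
                    refined.append(g)              # already excludes x
                    continue
                # Minimal specializations: fix one '?' to a value other than x's.
                for i in range(n):
                    if g[i] != "?":
                        continue
                    for value in domains[i]:
                        spec = g[:i] + (value,) + g[i + 1:]
                        if value != x[i] and (S is None or covers(spec, S)):
                            refined.append(spec)
            refined = list(dict.fromkeys(refined))  # drop duplicates, keep order
            # Keep only the maximally general members.
            G = [g for g in refined
                 if not any(o != g and covers(o, g) for o in refined)]
    return S, G


domains = [["High", "Low"], ["Yes", "No"], ["Good", "Poor"]]   # hypothetical
training = [
    (("High", "Yes", "Good"), True),
    (("High", "No", "Good"), True),
    (("Low", "No", "Poor"), False),
    (("High", "Yes", "Good"), True),
]
S, G = candidate_elimination(domains, training)
```

For this toy data S is `("High", "?", "Good")` and G contains `("High", "?", "?")` and `("?", "?", "Good")`; every hypothesis lying between the two boundaries belongs to the version space.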
3.6 MODELLING IN MACHINE LEARNING
• A machine learning model is an abstraction of the training dataset that can perform
a prediction on new data.
• Training the model means feeding instances to the machine learning algorithm.
Training datasets are used to fit and tune the model. After training a machine
learning algorithm with the training data, a predictive model is generated as output
to which a new data is fed to make predictions.
• The process of modelling means training a machine learning algorithm with the
training dataset, tuning it to increase performance, validating it and making
predictions for a new unseen data.
• The major concern in machine learning is what model to select, how to train the
model, time required to train, the dataset to be used, what performance to expect,
and so on.
• Learning the parameters is the main goal in machine learning algorithms.
• There are two types of parameters - model parameters and hyperparameters.
1. Certain parameters can be learnt directly from training data and are called model
parameters. For example, the coefficients used in regression model, split attributes
in decision tree model, weights and biases in neural networks and so on.
2. Hyperparameters are higher-level parameters which cannot be learnt directly. For
example, regularization lambda λ used in regularized regression, number of decision
trees to include in a random forest, and so on.
• Evaluating the selected machine learning model is as important as
training the model. Hence, the dataset is split into two subsets called the training
dataset and the test dataset, wherein the training dataset is used to train the model and
the test dataset is used to evaluate the model.
• During training, the test dataset is unseen by the model so that the model can be
tested properly on its ability to predict new data. If the training and test datasets
are the same, an overfitted model can appear to perform well and yet perform poorly when given
new unseen data.
• During prediction, an error occurs when the estimated output does not match with
the true output.
• Training error, also called in-sample error, results when applying the predicted
model on the training data, while test error, also called out-of-sample error, is
the average error when predicting on unseen observations.
• The error function or the loss function is the aggregation of the differences
between the true values and the predicted values. A common loss function is
the Mean Squared Error (MSE), which is the average of the squared differences
between the true values Yi and the predicted values f(Xi) for an input value ‘Xi’.
• A smaller value of MSE denotes that the error is less and, therefore, the prediction
is more accurate.
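Written out, the definition above is MSE = (1/n) Σ (Yi − f(Xi))². A minimal sketch of that computation follows; the true and predicted values are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true values Yi and predictions f(Xi).
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3
```

Computing the same quantity on the training data gives the training error, and computing it on held-out data gives an estimate of the test error.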
Machine Learning Process
The four basic steps in the machine learning process are:
1. Choose a machine learning algorithm to suit the training data and the problem
domain
2. Input the training dataset and train the machine learning algorithm to learn from
the data and capture the patterns in the data
3. Tune the parameters of the model to improve the accuracy of learning of the
algorithm
4. Evaluate the learned model once the model is built
3.6.1 Model Selection and Model Evaluation
• The biggest challenge in machine learning is choosing an algorithm that suits the
problem. Hence, model selection and assessment are very important and deal with
two types of complexities.
1. Model Performance -How well the model performs on the training dataset?
2. Model Complexity- How much complexity the model possesses after the training
phase is over?
• Model Selection is a process of selecting one good enough model among different
machine learning models for the dataset or selecting different sets of features or
hyperparameters for the same machine learning model.
• It is difficult to find the best model because all models exhibit some predictive
error for the problem, so at least a good enough model should be selected that
performs fairly well with the dataset.
Some of the approaches used for selecting a machine learning model are listed
below:
1. Use resample methods and split the dataset as training, testing and validation data
sets and observe the performance of the model over all the phases. This approach is
suitable for smaller datasets.
2. The simplest approach is to fit a model on the training dataset and to compute
measures like error or accuracy.
3. The use of probabilistic framework and quantification of the performance of the
model as a score is the third approach.
• These methods are discussed in the following sections.
3.6.2 Re-sampling Methods
• Re-sampling is a technique to select a model by reconstructing the training dataset and
test dataset by randomly choosing instances by some method from the given dataset.
• This method involves selecting different instances repeatedly from a training dataset to
tune a model. It is done to improve the accuracy of a model.
• The common re-sampling model selection methods are Random train/test splits,
Cross-Validation (K-fold, LOOCV, etc.) and Bootstrap.
Cross-Validation
• Cross-Validation is a method by which we can tune the model with only training dataset.
It is a model evaluation approach by which we can set aside some data of the training
dataset for validation and fit the rest of the data to train the model.
• The best model is found by estimating the average of errors on different test data. The
popular cross-validation family of methods includes Holdout method, K-fold
cross-validation, Stratified cross-validation and Leave-One-Out Cross-Validation
(LOOCV).
Holdout Method
• This is the simplest method of cross-validation. The dataset is split into two subsets called
training dataset and test dataset. The model is trained using the training dataset and then
evaluated using the test dataset.
• This holdout method can be applied for a single time which is called as single holdout method or
it can be repeated for more than once which is called as repeated hold out method.
• The average performance on the test dataset is estimated to evaluate the model. Even though this
method is very simple, it can exhibit high variance, and the performance largely depends on how
the dataset is split.
K-fold Cross-Validation
• Another way of cross-validating is using a k-fold cross-validation, which will split the training
dataset into k equal folds/parts creating k-1 subsets of training set and one test subset.
• Out of the k folds, k-1 folds are used for training and one fold is used for testing the model.
• This has to be performed for k iterations and during each iteration a different fold is selected for
testing. The average performance of the model on k iterations is the final estimate of the model
performance.
The illustration of this re-sampling is shown in Figure 3.4.
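The k-fold splitting procedure can be sketched without any library. The example below assumes contiguous, unshuffled folds and uses a hypothetical "predict the training mean" model purely to have something to score; in practice the instances are usually shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(n, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


# Hypothetical target values; the "model" just predicts the training mean.
targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]


def mean_predictor_mse(train, test):
    mean = sum(targets[i] for i in train) / len(train)
    return sum((targets[i] - mean) ** 2 for i in test) / len(test)


folds = list(k_fold_indices(len(targets), 3))
average_error = cross_validate(len(targets), 3, mean_predictor_mse)
```

Each instance appears in exactly one test fold, and setting k equal to the number of instances turns the same function into leave-one-out cross-validation.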
Stratified K-fold Cross-Validation
• This method is similar to k-fold cross-validation but with a slight difference. Here, it is
ensured that while splitting the dataset into k folds, each fold should contain the same
proportion of instances with a given categorical value. This is called stratified
cross-validation.
Leave-One-Out Cross-Validation (LOOCV)
• This method repeatedly splits the n data instances of the dataset into training dataset
containing n-1 data instances and leaving one data instance for evaluating the model.
This process is repeated n times and average test error is then estimated for the model.
• Even though this method is expensive and time consuming because it has to run n
times (i.e., once for each of the n data instances in the dataset), it has less bias.
• For example, if the training dataset contains 100 data instances, then 99 instances are
used for training and one instance to test or evaluate the model. This process is repeated
100 times selecting a different instance as holdout instance for testing in each iteration.
The illustration of this re-sampling is shown in Figure 3.5.
Model Performance
• The focus of this section is the evaluation of classifier models.
• Classifiers are unstable as a small change in the input can change the output.
• A solid framework is needed for proper evaluation. There are several metrics that can be used to
describe the quality and usefulness of a classifier.
• One way to compute the metrics is to form a table called contingency table.
• For example, consider a test for detecting a disease, say cancer. Table 3.4 shows a contingency
table for this scenario.

• In this table, True Positive (TP) = number of cancer patients who are correctly classified by the test,
and True Negative (TN) = number of normal patients (who do not have cancer) who are correctly
detected as normal.
• The two errors involved in this process are False Positive (FP), a false alarm in which
the test shows positive when the patient has no disease, and False Negative (FN),
the error in which the test says negative or normal when the patient has cancer.
• FP and FN are costly errors in this classification process.
• The metrics that can be derived from this contingency table are listed below:
1. Sensitivity - The sensitivity of a test is the probability that it will produce a
positive result for an instance that is actually positive. It is also known as the true positive rate. The
sensitivity of a test can be determined by calculating:
$\text{Sensitivity} = \frac{TP}{TP + FN}$
• The harmonic mean of precision and recall is called the F-measure or F1 score. This is
useful in identifying the model skill for a specific threshold.
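The metrics derived from a contingency table can be computed directly from the four counts. Only sensitivity and the F-measure are described above; specificity, precision and accuracy are given here with their standard definitions as assumptions. The counts are hypothetical, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common metrics from the four cells of a contingency table."""
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical test results for 100 patients.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts the precision is 0.8, the accuracy is 0.85 and the sensitivity is 40/45, so the F1 score (80/95) lies between precision and sensitivity.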
Classifier Performance as Distance Measures
• The classifier performance can also be computed as a distance measure. The classifier accuracy
can be plotted as a point. A point towards the north-west (top-left) corner is a better classifier. The Euclidean
distance between the points of two classifiers can give a performance measure. The value ranges from 0 to 1.
Visual Classifier Performance
• Receiver Operating Characteristic (ROC) curves and Precision-Recall curves indicate the
performance of classifiers visually. ROC curves are a visual means of checking the accuracy of classifiers and
comparing them. ROC is a plot of sensitivity (True Positive Rate) against 1 − specificity
(False Positive Rate) for a given model.
• A sample ROC curve is shown in Figure 3.6, where results of five classifiers are given.
• A is the ROC of an average classifier. The ideal classifier is E where the area under curve is 1.0.
Theoretically, it can range from 0.9 to 1. The rest of the classifiers B, C and D are categorized, based
on their area under curve values, as good, better and still better.
The classifier prediction is based on the threshold value. A threshold value of 0.5 maps the
probability to class 0 or class 1. The threshold value can be tuned to obtain higher or lower
FP and FN. This is useful when an experiment focuses on one type of error, FP or FN.

We start from the bottom left-hand corner initially. If we have any true positive case, we move up
and plot a point. If it is a false positive case, we move right and plot. This process is repeated until
the complete curve is drawn.
In ROC, the diagonal of the plot indicates the model has no skill or random classifier, and skillful
models show the curve above the diagonal. In short, if the ROC curve is closer to the diagonal
line, then it shows the classifier to be less accurate.
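The construction described above (start at the bottom-left corner, move up for a true positive, move right for a false positive) can be sketched as follows. The labels and scores are hypothetical, the scores are assumed to be distinct, and the area is computed with the trapezoidal rule as one common choice.

```python
def roc_points(labels, scores):
    """Return the ROC curve as a list of (false positive rate, true positive rate).

    labels: 1 for positive, 0 for negative. scores: higher means more
    likely positive. Assumes both classes are present and scores are distinct.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]                 # bottom-left corner
    for _, label in ranked:
        if label == 1:
            tp += 1                       # true positive: move up
        else:
            fp += 1                       # false positive: move right
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier outputs for six instances.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
auc = area_under_curve(curve)
```

For this data the AUC is 8/9: of the nine positive/negative pairs, eight are ranked in the correct order. A classifier that ranks every positive above every negative reaches 1.0.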
• Instead of predicting a class label, a classifier can predict class probabilities.
Probabilities allow better evaluation by functions that are called scoring
functions or scoring rules. The area under curve (AUC) is one such score that can be
used for classifier model evaluation. The integrated AUC is a measure of the model
across threshold values.
• AUC indicates the accuracy of the model. A model is perfect if its area under the ROC
curve is one. An AUC score of 0 indicates a model whose predictions are all wrong. The approximate
area under the precision-recall curve also indicates the power of the model across thresholds.
• A precision-recall curve is a plot of precision and recall for different threshold values.
This curve is useful if there is an imbalance in the classes, where one class has many more
samples than the other classes.
• ROC is used when there is no class imbalance and precision-recall curves are used when
there is a moderate-to-large class imbalance.
Scoring Methods
• Another alternative for model selection is to combine the complexity of the model and performance of the model
as a score. Then, model selection is done by selecting the model that maximizes or minimizes the score.
• Minimum Description Length (MDL) is one such method. The aim is to describe the target variable and the model in
terms of bits. MDL is the principle of using the minimum number of bits to represent the data and the model. It is a
variant of Occam's razor, which states that the model with the simplest explanation is the best model.
• MDL too recommends the selection of the hypothesis that minimizes the sum of the two descriptions, of the data and
of the model.
• Let h be a learning model. Let L(h) be the number of bits used to represent the model and let D be the
predictions. Then the MDL is given as:
• L(h) + L(D|h)
• where L(D|h) is the number of bits used to represent the predictions D based on the training set.
• MDL can also be expressed in terms of negative log-likelihood as:
• MDL = -log(p(θ)) - log(p(y | x, θ))
• where y is the target variable, x is the input and θ denotes the model parameters.
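The negative log-likelihood form of MDL can be illustrated with a toy comparison, writing θ for the model parameters. The sketch assumes base-2 logarithms (so the score is in bits) and independent predictions, so that the data term is a sum over samples. The priors and per-sample likelihoods are made-up numbers for two hypothetical models.

```python
from math import log2


def description_length(prior, likelihoods):
    """-log2 p(theta) - sum_i log2 p(y_i | x_i, theta), measured in bits."""
    model_bits = -log2(prior)                         # cost of the model
    data_bits = -sum(log2(p) for p in likelihoods)    # cost of the data given it
    return model_bits + data_bits


# Hypothetical simple model: likely a priori, fits each sample moderately well.
simple_bits = description_length(prior=0.5, likelihoods=[0.5, 0.5, 0.5, 0.5])

# Hypothetical complex model: unlikely a priori, fits most samples perfectly.
complex_bits = description_length(prior=1 / 64, likelihoods=[1.0, 1.0, 0.5, 1.0])

preferred = "simple" if simple_bits < complex_bits else "complex"
```

Here the simple model costs 5 bits and the complex one 7 bits, so MDL prefers the simple model even though the complex one fits the data better.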
End Of Module 2
Thank You
