
Chapter 3: ML Common Issues

Datasets: Training, Validation, and Test Sets

Training Set: This subset is used to train the machine learning model.

Validation Set: After training the model on the training set, the validation set is used to fine-tune the model's hyperparameters or to evaluate its performance during training. This set helps prevent overfitting by providing an independent dataset for model evaluation during training.

Test Set: Sometimes, the holdout strategy also includes a test set, which is used to evaluate the final performance of the trained model.

Datasets: Training, Validation, and Test Sets (cont.)

It is good practice to use a separate dataset to test the performance of your algorithm, as testing the algorithm on the training set may not reveal its true generalization power.

To avoid the problem of an information leak and improve generalization, it is common practice to split the dataset into three parts, namely a training, a validation, and a test dataset.
Training, Validation, and Test Split

The best approach for using the holdout dataset is to:

1. Train the algorithm on the training dataset

2. Perform hyperparameter tuning based on the validation dataset

3. Perform the first two steps iteratively until the expected performance is
achieved

Avoid splitting the data into only two parts, as it may lead to an information leak. Training and testing on the same dataset is a clear no-no, as it does not guarantee algorithm generalization.
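Below is a minimal sketch of this three-way split using scikit-learn's train_test_split applied twice; the 60/20/20 ratios, the Iris dataset, and random_state are illustrative choices, not prescribed by the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 20% of the data as the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Split the remainder again: 0.25 of the remaining 80% gives a 20% validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```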

Popular Holdout Strategies for Splitting
● There are three popular holdout strategies that can be used to split the
data into training and validation sets. They are as follows:
 Simple holdout Validation
 K-fold Validation
 Iterated k-fold Validation
● Simple holdout validation sets apart a fraction of the data as your test dataset. What fraction to keep may be very problem-specific and largely depends on the amount of data available.
● This is one of the simplest holdout strategies and is commonly used as a starting point.
K-fold Cross-Validation
● K-fold: keep a fraction of the dataset for the test split, then divide the entire dataset into k folds, where k can be any number, generally varying from two or three to ten.
● At any given iteration, we hold out one fold for validation and train the algorithm on the rest of the folds. The final score is generally the average of all the scores obtained across the k folds.
● Expensive, because the algorithm needs to be trained and evaluated k times.
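A minimal k-fold sketch with scikit-learn's cross_val_score, which trains the model k times, holding out one fold per iteration, and returns one score per fold; the model, dataset, and k=5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores)          # one accuracy score per fold
print(scores.mean())   # final score: the average across the k folds
```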

K-fold Validation with Shuffling

● It is the k-fold cross-validation method where the dataset is randomly shuffled before splitting it into k folds.
● This approach is particularly useful when the dataset has some inherent
order or structure, such as when instances are grouped or sorted in a
specific way.
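A short sketch of the shuffled variant: passing shuffle=True to scikit-learn's KFold randomizes the row order before the folds are cut, which matters when the data is sorted or grouped; the toy data and random_state are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # toy data, sorted by value

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: validate on rows {val_idx}")
```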
● When splitting up the data, consider the following:
 Data representativeness
 Time sensitivity
 Data redundancy
II. Feature Engineering and Data Preprocessing

➔The ML learning process depends on feature engineering, which mainly comprises two processes: Feature Selection and Feature Extraction.

➔Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features.

➔Irrelevant features may negatively impact and reduce the overall performance and accuracy of the model.

➔ML models follow the Garbage In, Garbage Out principle.
Benefits of Feature Selection

➔The following are some benefits of using feature selection in machine learning:

 It helps in avoiding the curse of dimensionality.

 It helps in the simplification of the model so that it can be easily interpreted by the researchers.

 It reduces the training time.

 It reduces overfitting, hence enhancing generalization.
Feature Selection Techniques

How to choose a Feature Selection Method?


Based on the data types of the input and output variables.
Feature Selection Methods
1. Numerical Input, Numerical Output:

 This is the case for regression predictive modelling. Common methods:

 Pearson's correlation coefficient (for linear correlation).

 Spearman's rank coefficient (for non-linear correlation).

2. Numerical Input, Categorical Output:

 Numerical input with categorical output is the case for classification predictive modelling problems. Common methods:

 ANOVA correlation coefficient (linear).

 Kendall's rank coefficient (non-linear).
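A small sketch of correlation-based feature scoring for the numerical/numerical case, using SciPy's pearsonr and spearmanr; the synthetic data is illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)                     # candidate feature
y = 3 * x + rng.normal(scale=0.5, size=200)  # roughly linear target

r, _ = pearsonr(x, y)      # linear correlation
rho, _ = spearmanr(x, y)   # rank (monotonic) correlation
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
# Keep the features whose correlation with the target is strongest.
```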

Feature Selection Methods cont.

3. Categorical Input, Numerical Output:

 This is the case of regression predictive modelling.

 Use ANOVA and Kendall's.

4. Categorical Input, Categorical Output:

 This is a case of classification predictive modelling.

 The most commonly used techniques are the Chi-Squared test and Information Gain.
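A sketch of chi-squared feature scoring with scikit-learn's SelectKBest; chi2 requires non-negative features, so categorical inputs are assumed to be encoded as counts or one-hot columns. Iris is used here only as a convenient non-negative stand-in dataset.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # features are non-negative

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best features
X_new = selector.fit_transform(X, y)
print(selector.scores_)   # chi-squared score per feature
print(X_new.shape)        # (150, 2)
```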

Data Preprocessing Fundamentals

 In most cases, the data that we receive may not be in a format that can be readily used.

 Most feature engineering techniques are domain-specific.

 Commonly used fundamental processes in data preprocessing:

 Vectorization

 Normalization

 Handling missing values

 Feature extraction
Vectorization

 Involves converting non-numeric data into a numerical format suitable for machine learning algorithms.

 For text data, this often involves techniques that represent words or documents as numerical vectors.

 For categorical variables, techniques like label encoding are used to convert categories into numerical representations.
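A minimal vectorization sketch: LabelEncoder for a categorical column and CountVectorizer for raw text, both from scikit-learn; the toy values are illustrative.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

colors = ["red", "green", "blue", "green"]
print(LabelEncoder().fit_transform(colors))   # [2 1 0 1]

docs = ["the cat sat", "the dog sat"]
vec = CountVectorizer()
X = vec.fit_transform(docs)                   # sparse document-term matrix
print(vec.get_feature_names_out())
print(X.toarray())
```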

Normalization / Value Normalization
 It is common practice to normalize features before passing the data to any algorithm.

 Normalization is the process in which you represent the data belonging to a particular feature in such a way that its mean is zero and its standard deviation is one.

 Consider a house-price prediction problem, where the features are in different ranges.

 Check that your data:

 Take small values: typically in a range between zero and one.

 Share the same range: ensure all the features are in the same range. Ex: MinMaxScaler, StandardScaler
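A sketch of the two scalers named above: StandardScaler gives each feature zero mean and unit standard deviation, while MinMaxScaler maps each feature into [0, 1]; the toy house data is illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1400.0, 3], [1600.0, 4], [3000.0, 5]])  # sqft, bedrooms

print(StandardScaler().fit_transform(X))  # mean 0, std 1 per column
print(MinMaxScaler().fit_transform(X))    # each column scaled to [0, 1]
```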

Handling Missing Values and Feature Engineering
 Handling missing values: missing values are quite common in real-world machine learning problems.

 Example techniques: domain-specific handling, deletion, imputation (mean, median, mode). Ex: SimpleImputer

 Feature engineering: involves creating new features from existing ones, or transforming existing features, to improve the performance of machine learning models.

 Assume house-price prediction given the size (in square feet) and the number of bedrooms and bathrooms; we can extract price_per_sqft to get a new feature.
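A sketch combining mean imputation with SimpleImputer (named above) and the derived price_per_sqft feature; the column names and values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"sqft": [1400, np.nan, 3000],
                   "bedrooms": [3, 4, np.nan],
                   "price": [210000, 340000, 450000]})

# Impute missing numeric values with the column mean.
cols = ["sqft", "bedrooms"]
df[cols] = SimpleImputer(strategy="mean").fit_transform(df[cols])

# Feature engineering: derive a new feature from existing ones.
df["price_per_sqft"] = df["price"] / df["sqft"]
print(df)
```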

Feature Engineering Example
 Assume a sales prediction problem. Say we have promotion dates, holidays, the competitor's start date, the distance from the competitor, and the sales for a particular day. Many more features can be extracted, such as the days until the next promotion, the days left before the next holiday, the number of days the competitor's business has been open, and so on.
Overfitting and Underfitting
 Noise: noise is meaningless or irrelevant data; it affects the performance of the model.

 Bias: bias is a prediction error introduced in the model by oversimplifying the machine learning algorithm; in other words, it is the difference between the predicted and actual values.

 Variance: if the machine learning model performs well on the training dataset but does not perform well on the test dataset, then variance occurs.

 Generalization: it shows how well a model is trained to predict unseen data.
Overfitting and Underfitting cont.
 Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms.

 An overfitted model is said to have low bias (because it works well on the training data) and high variance (it doesn't work well on unseen data).
How to Detect Overfitting

Overfitting in the model can only be detected once you test the model on held-out data:

● just make a train/test split and check the accuracy on each part.

For example, if the model shows 85% accuracy on the training data but only 50% accuracy on the test dataset, it is overfitting.
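A sketch of this check: comparing train versus test accuracy, here with an unconstrained decision tree because it overfits easily; the dataset and random_state are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", model.score(X_tr, y_tr))  # typically 1.00
print("test accuracy: ", model.score(X_te, y_te))  # noticeably lower
```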

Methods to Solve Overfitting

 Early stopping

 Train with more data

 Feature selection

 Cross-validation

 Data augmentation

 Regularization
Techniques to Solve Overfitting
1) Early stopping: the training is paused before the model starts learning the noise within the data.

 In this process, while training the model iteratively, measure the performance of the model after each iteration.

 It may lead to the underfitting problem if training is paused too early; stop at the sweet spot.
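One way to sketch early stopping is with scikit-learn's built-in support in gradient boosting: training stops once the validation score has not improved for n_iter_no_change rounds. The dataset and parameter values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting iterations
    validation_fraction=0.1,   # held-out split used to monitor progress
    n_iter_no_change=10,       # patience before stopping
    random_state=0).fit(X, y)

print("stopped after", model.n_estimators_, "of 500 iterations")
```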

2) Train with more data: provides more chances to discover the relationship between the input and output variables.
Techniques cont..

3) Feature selection: identify the most important features within the training data; the other features are removed.

4) Cross-validation: one of the most powerful techniques to prevent overfitting. We divide the dataset into k equal-sized subsets of data; these subsets are known as folds. This helps improve learning.

5) Data augmentation: instead of adding more training data, slightly modified copies of already existing data are added to the dataset. This makes each data sample appear slightly different every time it is processed by the model.
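A toy sketch of data augmentation for numeric data: appending noise-jittered copies of existing rows; for images, the analogous operations are flips, crops, and rotations. The noise scale is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[5.1, 3.5], [4.9, 3.0]])            # original samples
X_aug = X + rng.normal(scale=0.05, size=X.shape)  # slightly modified copies

X_bigger = np.vstack([X, X_aug])  # dataset now contains both versions
print(X_bigger)
```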
Techniques cont.
6) Regularization: a group of methods that forces the learning algorithm to make the model simpler.
 Apply L1 regularization or L2 regularization (discussed so far).
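A sketch of the two penalties named above, using scikit-learn: Lasso applies an L1 penalty (it drives some coefficients to exactly zero) and Ridge an L2 penalty (it shrinks coefficients smoothly). The dataset and alpha values are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

print("L1:", Lasso(alpha=1.0).fit(X, y).coef_)  # sparse coefficients
print("L2:", Ridge(alpha=1.0).fit(X, y).coef_)  # shrunk coefficients
```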

Underfitting Problem

 Occurs when the model is trained with too little data.

 Underfitting occurs when our model is too simple to capture the underlying structure of the data.

Methods to Reduce Underfitting:

 Increase model complexity

 Remove noise from the data

 Train on more and better features

 Reduce the constraints

 Increase the number of epochs to get better results
Errors in Machine Learning

 In ML, an error is a measure of how accurately an algorithm can make predictions for a previously unseen dataset.

 Reducible errors: these errors can be reduced to improve the model accuracy. Such errors can further be classified into bias and variance.

 Irreducible errors: these errors will always be present in the model, regardless of which algorithm is used.

 The cause of these errors is unknown variables whose influence can't be reduced.
Bias and Variance in ML
 These prediction errors will always be present, as there is always a slight difference between the model's predictions and the actual values.

 The main goal is to reduce these errors in order to get more accurate results.
High and Low Bias Error
 Low bias: a low-bias model makes fewer assumptions about the underlying data distribution.

 Algorithms that typically have low bias: Decision Trees, k-Nearest Neighbors, and Support Vector Machines.

 High bias: a model with high bias makes more assumptions, and the model becomes unable to capture the important features of the dataset. A high-bias model also cannot perform well on new data.

 Algorithms that are sometimes prone to high bias: Linear Regression and Logistic Regression.

 To solve these problems, use the previously discussed methods for reducing overfitting and underfitting.
High and Low Variance Error
 Variance, in the context of machine learning, refers to the sensitivity of a model's predictions to variations in the training dataset.

 Low variance means there is a small variation in the prediction of the target function with changes in the training dataset.

 High variance shows a large variation in the prediction of the target function with changes in the training dataset; it leads to overfitting and increased model complexity.

 Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance. Linear Regression and Logistic Regression have low variance.
Combinations of Bias-Variance

 Low Bias, Low Variance: the ideal ML algorithm.

 Low Bias, High Variance: this case occurs when the model learns with a large number of parameters, and hence leads to overfitting.

 High Bias, Low Variance: this case occurs when a model does not learn well from the training dataset or uses few parameters. It leads to underfitting problems.

 High Bias, High Variance: with high bias and high variance, predictions are inconsistent and also inaccurate.
Bias-Variance Trade-Off
It is necessary to strike a balance between the bias and variance errors; this balance between the bias error and the variance error is known as the Bias-Variance trade-off.
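A sketch of the trade-off in code: as polynomial degree (model complexity) grows, training fit keeps improving while validation fit first improves (less bias) and then degrades (more variance). The synthetic data and degrees are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, balanced, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_va, y_va))
```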

Confusion Matrix in Machine Learning
The confusion matrix provides us a matrix/table as output and describes the performance of the model. It is also known as the error matrix.

The matrix presents the prediction results in a summarized form, with the total numbers of correct predictions and incorrect predictions.

For binary classification, it is a 2x2 table of predicted versus truth labels.


If the predicted and truth labels match, then the prediction is said to be correct; when the predicted and truth labels are mismatched, the prediction is said to be incorrect.
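A sketch of computing the confusion matrix with scikit-learn; for a binary problem, ravel() unpacks the 2x2 table into TN, FP, FN, TP. The toy labels are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: truth, columns: prediction
print(cm)
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```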

Confusion Matrix cont..

 True Positive (TP): the number of positive samples correctly classified as positive.

 False Negative (FN): the number of positive samples incorrectly classified as negative.

 False Positive (FP): the number of negative samples incorrectly classified as positive.

 True Negative (TN): the number of negative samples correctly classified as negative.

 Some models in machine learning require more precision, and some models require more recall.

 So, it is important to know the balance between precision and recall, or simply the precision-recall trade-off.

 Precision is defined as the ratio of correctly classified positive samples (true positives) to the total number of samples classified as positive (either correctly or incorrectly).
Precision, Recall, and F1-Score

 These metrics are particularly useful in classification.

 Precision: the ratio of correctly predicted positive observations to the total predicted positives. Precision = TP / (TP + FP)

 Recall: measures the model's ability to detect positive samples; the higher the recall, the more positive samples are detected. It is the ratio of correctly predicted positive observations to all actual positives. Recall = TP / (TP + FN)

 F1-Score: the harmonic mean of precision and recall. F1 = 2 x (Precision x Recall) / (Precision + Recall)
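A sketch of the three metrics computed with scikit-learn, on the same toy labels as the confusion-matrix example above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4
print("recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean = 0.75
```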


