Chapter Three

Chapter 3 discusses common issues in machine learning, focusing on the importance of properly splitting datasets into training, validation, and test sets to prevent overfitting and ensure generalization. It also covers feature engineering, data preprocessing techniques, and the bias-variance trade-off, emphasizing the need to balance model complexity and performance. Finally, it introduces evaluation metrics such as confusion matrix, precision, recall, and F1-score for assessing model performance.

Uploaded by

hiluf


Chapter 3: ML Common Issues

Data Set, Training and Test Data Sets

● Training Set: This subset is used to train the machine learning model.

● Validation Set: After the model is trained on the training set, the validation
set is used to tune the model's hyperparameters or to evaluate its
performance during training. This set helps to prevent overfitting by
providing an independent dataset for model evaluation during training.

● Test Set: Sometimes, the holdout strategy includes a test set, which is
used to evaluate the final performance of the trained model.

Data Set, Training and Test Data Sets (cont.)

● It is good practice to use a separate dataset to test the performance of
your algorithm, as testing the algorithm on the training set may not reveal
its true generalization power.

● To avoid the problem of an information leak and improve generalization,
it is common practice to split the dataset into three different parts,
namely a training, a validation, and a test dataset.

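
The three-way split described above can be sketched in a few lines of plain Python. This is only an illustration: the function name, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example, not values given in the slides.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and cut it into three disjoint parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The test part is set aside first and is not touched again until the final evaluation.
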
Training, Validation and Test Split

The best approach for using the holdout dataset is to:

1. Train the algorithm on the training dataset

2. Perform hyperparameter tuning based on the validation dataset

3. Repeat the first two steps iteratively until the expected performance is
achieved

4. Once the algorithm and the hyperparameters are frozen, evaluate the model
on the test dataset

Avoid splitting the data into only two parts (training and test): tuning on
the test set leaks information about it into the model. Training and testing
on the same dataset is a clear no-no, as it says nothing about how well the
algorithm generalizes.

Popular Holdout Strategies for Splitting
● There are three popular holdout strategies that can be used to split the
data into training and validation sets. They are as follows:
○ Simple holdout validation
○ K-fold validation
○ Iterated k-fold validation
● Simple holdout validation sets apart a fraction of the data as the test
dataset. What fraction to keep is problem-specific and depends largely
on the amount of data available.
● This is one of the simplest holdout strategies and is commonly used as a
starting point.
K-fold Cross Validation
● Keep a fraction of the dataset for the test split, then divide the
remaining data into k folds, where k can be any number, generally
from two to ten.
● At any given iteration, we hold out one fold for validation and train the
algorithm on the remaining folds. The final score is generally the
average of the scores obtained across the k folds.
● It is computationally expensive, because the algorithm must be trained
once per fold.

K-fold Validation with Shuffling

● This is the k-fold cross-validation method in which the dataset is randomly
shuffled before it is split into k folds.
● This approach is particularly useful when the dataset has some inherent
order or structure, such as when instances are grouped or sorted in a
specific way.
● When splitting up the data, consider the following:
○ Data representativeness
○ Time sensitivity
○ Data redundancy

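
A minimal sketch of k-fold splitting with optional shuffling, in plain Python. The helper names `k_fold_indices` and `cross_val_score`, the seed, and the user-supplied `score_fn` are hypothetical and introduced only for illustration; `score_fn` stands in for "train on these indices, evaluate on those".

```python
import random

def k_fold_indices(n_samples, k, shuffle=False, seed=0):
    """Yield (train_indices, val_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # breaks any inherent ordering
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_val_score(n_samples, k, score_fn, shuffle=False):
    """Average a user-supplied score over all k folds."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k, shuffle)]
    return sum(scores) / len(scores)

print([val for _, val in k_fold_indices(10, 3)])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

For time-sensitive data, shuffling is usually left off so that later records are not used to predict earlier ones.
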
II. Feature Engineering and Data Preprocessing

➔ The ML learning process depends on feature engineering, which mainly comprises
two processes: feature selection and feature extraction.

➔ Feature selection is a way of selecting the subset of the most relevant features
from the original feature set by removing the redundant, irrelevant, or noisy
features.

➔ Irrelevant features may reduce the overall performance
and accuracy of the model.

➔ ML models follow the "garbage in, garbage out" principle: poor input data
leads to poor predictions.

Benefits of Feature Selection

➔ The following are some benefits of using feature selection in machine learning:

● It helps to avoid the curse of dimensionality.

● It simplifies the model so that it can be easily interpreted by researchers.

● It reduces the training time.

● It reduces overfitting and hence enhances generalization.

Feature Selection Techniques

[Figure: feature selection techniques]

How to Choose a Feature Selection Method?

● The choice is based on the data types of the input and output variables.

Feature Selection Method

1. Numerical Input, Numerical Output:

● This is the case for regression predictive modelling. The common measures are:

○ Pearson's correlation coefficient (for linear correlation).

○ Spearman's rank coefficient (for monotonic, possibly non-linear correlation).

2. Numerical Input, Categorical Output:

● This is the case for classification predictive modelling problems.

○ ANOVA correlation coefficient (linear).

○ Kendall's rank coefficient (nonlinear).

Feature Selection Method (cont.)

3. Categorical Input, Numerical Output:

● This is the case for regression predictive modelling with categorical inputs.

○ Use ANOVA and Kendall's rank coefficient, as in case 2 but with the roles of input and output reversed.

4. Categorical Input, Categorical Output:

● This is the case for classification predictive modelling with categorical inputs.

○ The most commonly used techniques are the chi-squared test and information gain.

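
For case 1 above (numerical input, numerical output), the two coefficients can be computed by hand as a rough illustration. The function names and the toy `size` and `price` values are assumptions made for this sketch, and the rank helper deliberately ignores ties to stay short.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks from 1 to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's coefficient is Pearson's coefficient computed on the ranks."""
    return pearson(ranks(x), ranks(y))

size = [1, 2, 3, 4, 5]
price = [1, 8, 27, 64, 125]  # monotonic in size, but not linear
print(round(pearson(size, price), 3), round(spearman(size, price), 3))  # 0.943 1.0
```

Features whose coefficient with the target is close to zero are candidates for removal. The rank-based measure picks up the monotonic relationship fully, while the linear one only does so partly.
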
Data Preprocessing Fundamentals

● In most cases, the data that we receive is not in a format that can be readily used.

● Most feature engineering techniques are domain-specific.

● Commonly used fundamental processes in data preprocessing:

○ Vectorization

○ Normalization

○ Handling missing values

○ Feature extraction

Vectorization

● Vectorization involves converting non-numeric data into a numerical format
suitable for machine learning algorithms.

● For text data, this often involves techniques that represent words or
documents as numerical vectors.

● For categorical variables, techniques like label encoding are used
to convert categories into numerical representations.

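
Both kinds of vectorization mentioned above can be sketched in plain Python. The slides do not name a specific text technique, so the bag-of-words word count used here is an assumption, as are the function names and sample data.

```python
def label_encode(categories):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(categories)))}
    return [mapping[cat] for cat in categories], mapping

def bag_of_words(documents):
    """Represent each document as word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vectors, vocabulary

codes, mapping = label_encode(["red", "green", "red", "blue"])
print(codes)  # [2, 1, 2, 0]

vectors, vocabulary = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Label encoding imposes an arbitrary order on the categories, which some algorithms may interpret as meaningful.
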
Normalization / Value Normalization

● It is common practice to normalize features before passing the data to any algorithm.

● Normalization here means representing the data of a particular feature in
such a way that its mean is zero and its standard deviation is one. This form
is also called standardization; min-max scaling instead maps each feature to
a fixed range such as zero to one.

● Consider a house price prediction problem, where the features are in different ranges.

● Check that your data:

○ Takes small values: typically in a range between zero and one.

○ Is in the same range: ensure that all features are in roughly the same range. Ex: MinMaxScaler, StandardScaler

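
The two scaling schemes can be written out by hand to show what they compute. This sketch uses only the standard library rather than the scikit-learn classes named above; the `sqft` values are made up for illustration.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

sqft = [1000, 1500, 2000, 2500, 3000]  # hypothetical house sizes
print(min_max_scale(sqft))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(sqft))
```

In practice the minimum, maximum, mean, and standard deviation are computed on the training set only and then reused for the validation and test sets, so that no information leaks from them.
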
Handling Missing Values and Feature Engineering

● Handling missing values: Missing values are quite common in real-world machine
learning problems.

○ Examples: domain-specific techniques, deletion, and imputation (mean, median, mode). Ex: SimpleImputer

● Feature engineering: involves creating new features from existing ones or transforming
existing features to improve the performance of machine learning models.

○ Assume we predict house prices given their size (in square feet) and the number of bedrooms and
bathrooms. We can derive a new feature such as price_per_sqft.

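
The imputation strategies listed above (mean, median, mode) can be sketched with the standard library. The `impute` helper, the strategy names, and the `bedrooms` column are hypothetical; missing entries are represented as `None` for this example.

```python
import statistics

STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "most_frequent": statistics.mode,
}

def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]

bedrooms = [3, None, 2, 3, None, 5]  # hypothetical column with two missing entries
for strategy in STRATEGIES:
    print(strategy, impute(bedrooms, strategy))
```

With these values the mean fills in 3.25, while the median and the mode both fill in 3, which is why the choice of strategy depends on the feature.
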
Feature Engineering Example
● Assume a sales prediction problem. Say we have promotion dates, holidays,
the competitor's start date, the distance from the competitor, and the sales for a particular
day. Many further features can be extracted from these, such as the days until the next promotion, the days left
before the next holiday, the number of days the competitor's business has been
open, and so on.

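
Two of the derived features named above can be computed with the standard `datetime` module. All dates below are invented for illustration, and `days_until_next` is a hypothetical helper.

```python
from datetime import date

def days_until_next(day, event_dates):
    """Days from `day` to the nearest event on or after it, or None if there is none."""
    upcoming = [event for event in event_dates if event >= day]
    return (min(upcoming) - day).days if upcoming else None

promotions = [date(2024, 3, 10), date(2024, 4, 1)]  # hypothetical promotion dates
competitor_opened = date(2023, 12, 1)               # hypothetical opening date

day = date(2024, 3, 4)
features = {
    "days_until_next_promotion": days_until_next(day, promotions),
    "days_competitor_open": (day - competitor_opened).days,
}
print(features)  # {'days_until_next_promotion': 6, 'days_competitor_open': 94}
```
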
Overfitting and Underfitting
● Noise: Noise is meaningless or irrelevant data that degrades the performance of the model.

● Bias: Bias is a prediction error that is introduced in the model by oversimplifying the
machine learning algorithm. It is the difference between the model's average prediction and the
actual values.

● Variance: Variance is the error due to the model's sensitivity to the training data.
High variance shows up when the model performs well on the training dataset but
does not perform well on the test dataset.

● Generalization: It shows how well a trained model predicts unseen data.

Overfitting and Underfitting (cont.)
● Overfitting and underfitting are the two biggest causes of poor performance of
machine learning algorithms.

● An overfitted model is said to have low bias (because it works well on the training data) and
high variance (because it does not work well on unseen data).

How to Detect Overfitting

● Overfitting can only be detected by evaluating the model on data it was not trained on.

● Make a train/test split and compare the accuracy on both parts.

● For example, if the model shows 85% accuracy on the training data and 50%
accuracy on the test dataset, the large gap indicates that the model is overfitting.

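
The check described above amounts to comparing two accuracies. In this sketch the 10-point gap threshold `max_gap` is an arbitrary assumption, not a rule from the slides; a sensible threshold depends on the problem.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def looks_overfitted(train_accuracy, test_accuracy, max_gap=0.10):
    """Flag a model whose training accuracy exceeds its test accuracy by more than max_gap."""
    return (train_accuracy - test_accuracy) > max_gap

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
print(looks_overfitted(0.85, 0.50))          # True
print(looks_overfitted(0.85, 0.82))          # False
```
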
Methods to Solve Overfitting

● Early Stopping

● Train with more data

● Feature Selection

● Cross-Validation

● Data Augmentation

● Regularization

Techniques to Solve Overfitting
1) Early Stopping: Training is paused before the model starts learning the noise
in the training data.

● While training the model iteratively, measure the performance of the
model after each iteration, and stop at the sweet spot where performance on held-out data stops improving.

● This may lead to underfitting if training is paused too early.

2) Train with More Data: More data provides more chances to discover the relationship
between the input and output variables.

Techniques (cont.)

3) Feature Selection: Identify the most important features within the training data
and remove the others.

4) Cross-Validation: One of the most powerful techniques to prevent overfitting.
The dataset is divided into k equal-sized subsets, known as folds; each fold is
used once for validation while the model is trained on the rest.

5) Data Augmentation: Instead of collecting more training data, slightly
modified copies of the existing data are added to the dataset, so that
each data sample appears slightly different every time it is processed by
the model.
Techniques (cont.)
6) Regularization: A group of methods that force the learning algorithm
to produce a simpler model.
Examples are L1 regularization and L2 regularization (discussed earlier).
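
Early stopping (technique 1 above) can be sketched as a scan over per-epoch validation losses. The `patience` parameter, which sets how many epochs without improvement are tolerated, is a common convention assumed here rather than something the slides specify, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1

losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # hypothetical validation losses
print(early_stopping_epoch(losses))  # (3, 5)
```

Training stops at epoch 5, and the model saved at epoch 3 is the one that would be kept. A patience that is too small risks stopping too early and underfitting.
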

Underfitting Problem

● Underfitting can occur when the model is trained with too little data.

● Underfitting occurs when the model is too simple to capture the underlying
structure of the data.

Methods to Reduce Underfitting:

● Increase model complexity

● Remove noise from the data

● Train on more and better features

● Reduce the constraints (for example, the regularization strength)

● Increase the number of epochs to get better results

Errors in Machine Learning

● In ML, an error is a measure of how accurately an algorithm can make
predictions on a previously unseen dataset.

● Reducible errors: These errors can be reduced to improve the model
accuracy. They can be further classified into bias and variance.
● Irreducible errors: These errors will always be present in the model,
regardless of which algorithm is used.
● The cause of irreducible errors is unknown variables whose influence cannot be
removed.

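
The split into reducible and irreducible error is often written as a decomposition of the expected squared error. The formula below is the standard textbook form and assumes squared-error loss and data generated as y = f(x) + ε with zero-mean noise of variance σ²; these symbols are not used in the slides and are introduced here only for illustration. The expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first two terms are the reducible part; the last term does not depend on the model at all.
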
Bias and Variance in ML
● Bias and variance are prediction errors that are always present, as there is always
a slight difference between the model's predictions and the actual values.

● The main goal is to reduce these errors in order to get more accurate
results.

High and Low Bias Error
● Low Bias: A low-bias model makes fewer assumptions about the
underlying data distribution.

○ Examples of low-bias algorithms: Decision Trees, k-Nearest Neighbors, and Support Vector
Machines.

● High Bias: A model with high bias makes more assumptions, and the model
becomes unable to capture the important features of the dataset. A high-bias model
also cannot perform well on new data.

○ Examples of high-bias algorithms: Linear Regression and Logistic Regression.

○ Because high bias corresponds to underfitting, it is reduced with the methods for reducing underfitting described earlier.

High and Low Variance Error
● Variance in the context of machine learning refers to the sensitivity of a model's
predictions to variations in the training dataset.

● Low variance means there is a small variation in the prediction of the target function
with changes in the training dataset.

● High variance means a large variation in the prediction of the target function with
changes in the training dataset. It leads to overfitting and is associated with increased model complexity.

● Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high
variance. Linear Regression and Logistic Regression have low variance.

Combinations of Bias-Variance

● Low-Bias, Low-Variance: The ideal ML model.
● Low-Bias, High-Variance: This case occurs when the
model learns with a large number of parameters and
hence leads to overfitting.
● High-Bias, Low-Variance: This case occurs when a model
does not learn well from the training dataset or
uses too few parameters. It leads to
underfitting.
● High-Bias, High-Variance: With high bias and high
variance, predictions are both inconsistent and
inaccurate.

Bias-Variance Trade-Off
● A balance must be struck between bias and variance errors, because reducing one tends to increase the other. This balance
between the bias error and the variance error is known as the bias-variance trade-off.

Confusion Matrix in Machine Learning
● The confusion matrix is a table that describes the
performance of a classification model. It is also known as the error matrix.

● The matrix presents the prediction results in summarized form, giving the
total numbers of correct and incorrect predictions.

● The matrix looks like the table below:

[Table: confusion matrix of predicted versus true labels]

● If the predicted and true labels
match, the prediction is said to
be correct; when the
predicted and true labels are
mismatched, the prediction is
said to be incorrect.

Confusion Matrix (cont.)

● True Positive: the number of positive samples correctly classified as Positive

● False Negative: the number of positive samples incorrectly classified as Negative

● False Positive: the number of negative samples incorrectly classified as Positive

● True Negative: the number of negative samples correctly classified as Negative

● Some machine learning applications require higher precision and some require higher
recall.

● So it is important to know the balance between precision and recall or, simply, the precision-recall
trade-off.

● Precision is defined as the ratio of correctly classified positive samples (True Positives) to the total
number of samples classified as positive (either correctly or incorrectly).

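
The four counts defined above can be tallied directly from the true and predicted labels. The function name and the toy labels are assumptions made for this sketch; the class encoded as 1 is treated as positive.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, and TN for a binary classification problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(actual, predicted))  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 5}
```
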
Precision, Recall, and F1-Score

● These metrics are particularly useful in classification.

● Precision: The ratio of correctly predicted positive observations to the total
predicted positives.

● Recall: The ratio of correctly predicted positive observations
to all actual positives. Recall measures the model's ability to detect positive samples: the
higher the recall, the more positive samples are detected.

● F1-Score: The harmonic mean of precision and recall.
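
Written in terms of the counts TP, FP, and FN, the three metrics are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). A minimal sketch follows; the counts passed in are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=3, fp=1, fn=1))  # (0.75, 0.75, 0.75)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.
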


