Chapter 3: ML Common Issues
Datasets: Training, Validation, and Test Sets
Training Set: This subset is used to train the machine learning model.
Validation Set: While the model is trained on the training set, the validation
set is used to tune the model's hyperparameters and to monitor its
performance. Because it is independent of the training data, it helps to
detect and prevent overfitting.
Test Set: The holdout strategy may also include a test set, which is
used only to evaluate the final performance of the trained model.
Datasets: Training, Validation, and Test Sets (cont.)
It is good practice to use a separate dataset to test the performance of
your algorithm, as testing the algorithm on the training set may not show
its true generalization power.
To avoid information leaks and improve generalization, it is common
practice to split the dataset into three parts: a training, a validation,
and a test dataset.
Training, Validation, and Test Split
The best approach for using the holdout datasets is to:
1. Train the algorithm on the training dataset
2. Perform hyperparameter tuning based on the validation dataset
3. Repeat the first two steps until the expected performance is
achieved
4. Once the algorithm and hyperparameters are frozen, evaluate the model
once on the test dataset
Avoid splitting the data into only two parts: if hyperparameters are tuned
against the test set, information about it leaks into the model. Training
and testing on the same dataset is a clear no-no, as it does not guarantee
that the algorithm generalizes.
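A three-way split can be sketched as follows. This is a minimal illustration using NumPy only; the function name `train_val_test_split` and the 60/20/20 proportions are assumptions chosen for the example, not values prescribed by the slides.

```python
import numpy as np

def train_val_test_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return three disjoint index arrays: train, validation, test."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)       # shuffle once, then slice
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]                # touched only for the final evaluation
    val_idx = indices[n_test:n_test + n_val]   # used for hyperparameter tuning
    train_idx = indices[n_test + n_val:]       # used to fit the model
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(n_samples=100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three index sets are disjoint, tuning on the validation indices cannot leak information from the test indices.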
Popular Holdout Strategies for Splitting
● There are three popular holdout strategies that can be used to split the
data into training and validation sets:
Simple holdout validation
K-fold validation
Iterated k-fold validation
● Simple holdout validation sets apart a fraction of the data as the test
dataset. What fraction to keep is problem-specific and depends
largely on the amount of data available.
● It is one of the simplest holdout strategies and is a common starting
point.
K-fold Cross-Validation
● Keep a fraction of the dataset for the test split, then divide the
remaining data into k folds, where k can be any number, generally
from two or three up to ten.
● At any given iteration, we hold out one fold for validation and train the
algorithm on the remaining folds. The final score is generally the
average of the scores obtained across the k folds.
● It is expensive, because the algorithm has to be trained k times.
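The procedure above could look like this with scikit-learn. The synthetic dataset, the logistic regression model, and k = 5 are assumptions made only for the sake of the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data stands in for a real dataset (assumption for this sketch).
X, y = make_classification(n_samples=250, n_features=10, random_state=0)

# Keep a fraction aside as the test split; it is not used during cross-validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remaining data into k = 5 folds; each fold is the validation set once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_rest, y_rest, cv=kfold)

mean_score = scores.mean()   # the final score is the average across folds
print(scores, mean_score)
```

The model is fitted five times here, which is where the extra cost of k-fold validation comes from.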
K-fold Validation with Shuffling
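● A k-fold cross-validation method in which the dataset is randomly
shuffled before it is split into k folds.
● This approach is particularly useful when the dataset has some inherent
order or structure, such as when instances are grouped or sorted in a
specific way.
● When splitting the data, consider the following:
Data representativeness: each split should reflect the overall data (for example, contain all classes)
Time sensitivity: for time-ordered data, do not shuffle; validate on data that comes after the training data
Data redundancy: make sure duplicated samples do not end up in both the training and validation sets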
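The effect of shuffling on sorted data can be illustrated with a small NumPy sketch. The label array (60 samples of class 0 followed by 40 of class 1) and the helper `fold_class_sets` are hypothetical and exist only for this example.

```python
import numpy as np

labels = np.array([0] * 60 + [1] * 40)   # dataset sorted by class
k = 5

def fold_class_sets(order):
    """Split the index array `order` into k folds; return the classes seen in each fold."""
    return [set(labels[fold].tolist()) for fold in np.array_split(order, k)]

# Without shuffling, every fold of 20 consecutive samples holds a single class.
unshuffled = fold_class_sets(np.arange(labels.size))

# With shuffling, the folds draw from the whole dataset.
rng = np.random.default_rng(0)
shuffled = fold_class_sets(rng.permutation(labels.size))

print(unshuffled)
print(shuffled)
```

A fold that contains only one class is not representative, so a score computed on it says little about generalization.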
II. Feature Engineering and Data Preprocessing
➔ The ML learning process depends on feature engineering, which mainly consists of
two processes: feature selection and feature extraction.
➔ Feature selection is a way of selecting the subset of the most relevant features
from the original feature set by removing redundant, irrelevant, or noisy
features.
➔ Irrelevant features may reduce the overall performance
and accuracy of the model.
➔ ML models follow the Garbage In, Garbage Out principle: poor input data yields poor predictions.
Benefits of Feature Selection
➔ The following are some benefits of using feature selection in machine learning:
It helps to avoid the curse of dimensionality.
It simplifies the model so that it can be easily
interpreted by researchers.
It reduces the training time.
It reduces overfitting and hence enhances generalization.
Feature Selection Techniques
How to choose a Feature Selection Method?
The choice is based on the data types of the input and output variables.
Feature Selection Methods
1. Numerical Input, Numerical Output:
This is the case for regression predictive modelling. The common methods are:
Pearson's correlation coefficient (for linear correlation).
Spearman's rank coefficient (for non-linear, monotonic correlation).
2. Numerical Input, Categorical Output:
This is the case for classification predictive modelling problems. The common methods are:
ANOVA correlation coefficient (linear).
Kendall's rank coefficient (non-linear).
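The difference between the two coefficients in case 1 can be seen on a relationship that is monotonic but not linear. This is a NumPy-only sketch; the helpers `pearson` and `spearman` and the exponential test data are assumptions for illustration, and the rank computation assumes there are no tied values.

```python
import numpy as np

def pearson(a, b):
    """Pearson's correlation coefficient: strength of the linear relationship."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman's rank coefficient: Pearson's coefficient computed on the ranks.

    argsort of argsort yields ranks when the values contain no ties.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)          # monotonic in x, but strongly non-linear

r = pearson(x, y)      # noticeably below 1: the relationship is not a straight line
rho = spearman(x, y)   # 1 up to rounding: the ordering is perfectly preserved
print(r, rho)
```

A feature with a high rank correlation but a modest Pearson coefficient may still be informative for a non-linear model.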
Feature Selection Methods (cont.)
3. Categorical Input, Numerical Output:
This is the case for regression predictive modelling with categorical inputs.
Use the same measures as in case 2 (ANOVA and Kendall's), with the roles of input and output reversed.
4. Categorical Input, Categorical Output:
This is the case for classification predictive modelling with categorical inputs.
The most commonly used techniques are the Chi-Squared test and information gain.
Data Preprocessing Fundamentals
In most cases, the data that we receive is not in a format that can be readily used.
Most feature engineering techniques are domain-specific.
Commonly used fundamental processes in data preprocessing:
Vectorization
Normalization
Handling missing values
Feature extraction
Vectorization
Involves converting non-numeric data into a numerical format
suitable for machine learning algorithms.
For text data, this often involves techniques that represent words or
documents as numerical vectors.
For categorical variables, techniques like label encoding are used
to convert categories into numerical representations.
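Both kinds of vectorization can be sketched in plain Python. Bag-of-words counting is used here as one possible text technique (the slides do not name a specific one), and the helper names `label_encode` and `bag_of_words` are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {category: index for index, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping

def bag_of_words(documents):
    """Represent each document as a vector of word counts over a shared vocabulary."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vectors, vocabulary

codes, mapping = label_encode(["red", "green", "red", "blue"])
print(codes)        # [2, 1, 2, 0]
print(mapping)      # {'blue': 0, 'green': 1, 'red': 2}

vectors, vocabulary = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Label encoding imposes an arbitrary order on the categories, which some algorithms may misread as a ranking.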
Normalization / Value Normalization
It is common practice to normalize features before passing the data to any algorithm.
Normalization rescales the data belonging to a particular feature. In its most common form
(standardization), the feature ends up with a mean of zero and a standard deviation of one.
Consider a house price prediction problem, where the features are in very different ranges.
Check that your data:
Takes small values: typically in a range between zero and one.
Is in the same range: ensure all the features are in roughly the same range. Ex: MinMaxScaler, StandardScaler
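The two scalers named above could be applied as in the following sketch. The house data (size in square feet and number of bedrooms) is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical house data: [size in square feet, number of bedrooms]
X = np.array([
    [850.0, 2.0],
    [1200.0, 3.0],
    [1500.0, 3.0],
    [2400.0, 5.0],
])

# StandardScaler: each feature gets mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

# MinMaxScaler: each feature is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standard.mean(axis=0), X_standard.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

In practice the scaler is fitted on the training set only and then applied to the validation and test sets, so that no information leaks from them.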
Missing Values and Feature Engineering
Handling missing values: Missing values are quite common in real-world machine
learning problems.
Examples: domain-specific techniques, deletion, imputation (mean, median, mode). Ex:
SimpleImputer
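Mean imputation with `SimpleImputer` might look like this. The small array with two missing entries is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features; np.nan marks the missing entries (made-up data).
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# Replace each missing value with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Passing `strategy="median"` or `strategy="most_frequent"` gives median or mode imputation instead.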
Feature engineering: involves creating new features from existing ones or transforming
existing features to improve the performance of machine learning models.
Assume house price prediction given the size (in square feet) and the number of bedrooms and
bathrooms. From these we can derive a new feature such as price_per_sqft.
Feature Engineering Example
Assume a sales prediction problem. Say we have promotion dates, holidays, the
competitor's start date, the distance from the competitor, and the sales for a particular
day. Many further features can be extracted from these, such as the days until the
next promotion, the days left before the next holiday, the number of days the
competitor's business has been open, and so on.
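Extracting such date-based features can be sketched with the standard library. All dates and the helper `days_until_next` are hypothetical values chosen for this example.

```python
from bisect import bisect_left
from datetime import date

def days_until_next(day, sorted_event_dates):
    """Days from `day` to the next event on or after it, or None if there is none."""
    position = bisect_left(sorted_event_dates, day)
    if position == len(sorted_event_dates):
        return None
    return (sorted_event_dates[position] - day).days

# Hypothetical raw data for one store.
promotions = [date(2024, 3, 1), date(2024, 6, 15)]
holidays = [date(2024, 1, 1), date(2024, 12, 25)]
competitor_open = date(2023, 11, 20)

sales_day = date(2024, 2, 20)
features = {
    "days_until_next_promotion": days_until_next(sales_day, promotions),
    "days_until_next_holiday": days_until_next(sales_day, holidays),
    "competitor_days_open": (sales_day - competitor_open).days,
}
print(features)
```

Each derived value becomes a new numeric column that a model can use directly, which raw dates usually are not.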
Overfitting and Underfitting
Noise: Noise is meaningless or irrelevant data that degrades the performance of the model.
Bias: Bias is a prediction error that is introduced in the model by oversimplifying the
machine learning algorithm. It is the difference between the model's average prediction and the
actual values.
Variance: Variance is the amount by which the model's predictions change when the training
data changes. A model with high variance performs well on the training dataset but
does not perform well on the test dataset.
Generalization: It shows how well a trained model predicts unseen data.
Overfitting and Underfitting (cont.)
Overfitting and underfitting are the two biggest causes of poor performance of
machine learning algorithms.
An overfitted model is said to have low bias (because it works well on the training data) and
high variance (because it does not work well on unseen data).
How to Detect Overfitting
Overfitting can only be detected by evaluating the model on data it has not seen during training.
● Make a train/test split and compare the accuracy on both sets.
● For example, if the model shows 85% accuracy on the training data but only 50%
accuracy on the test dataset, the model is overfitting.
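The check described above can be sketched with scikit-learn. The noisy synthetic dataset and the unrestricted decision tree are assumptions picked because such a tree tends to memorize its training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (assumption for this sketch).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can fit the training set, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
gap = train_accuracy - test_accuracy   # a large gap signals overfitting
print(train_accuracy, test_accuracy, gap)
```

Limiting the tree (for example with the `max_depth` parameter) would typically narrow the gap at the cost of a lower training accuracy.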
Methods to Solve Overfitting
Early Stopping
Train with more data
Feature Selection
Cross-Validation
Data Augmentation
Regularization
Techniques to Solve Overfitting
1) Early Stopping: training is stopped before the model starts learning the noise
in the data.
While training the model iteratively, measure the performance of the
model on validation data after each iteration, and stop when it no longer improves.
If training is stopped too early, this may lead to underfitting, so the aim is
to stop at the sweet spot.
2) Train with More Data: more data provides more chances to discover the relationship
between the input and output variables.
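The early stopping rule from item 1 can be sketched independently of any framework. The `patience` parameter (how many non-improving iterations to tolerate) and the list of validation losses are assumptions made for this example.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch with the best validation loss.

    Training is considered stopped once the loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation loss stopped improving: stop training
    return best_epoch

# Hypothetical validation loss per epoch: it improves, then starts to rise.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(losses)
print(best)   # 3
```

A patience that is too small risks stopping on a temporary plateau, which is the underfitting risk mentioned above.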
Techniques (cont.)
3) Feature Selection: the most important features in the training data are identified,
and the other features are removed.
4) Cross-Validation: one of the most powerful techniques to prevent overfitting.
The dataset is divided into k equal-sized subsets, known as
folds, and the model is trained and validated on different folds in turn.
5) Data Augmentation: instead of collecting more training data, slightly
modified copies of already existing data are added to the dataset. This makes
a data sample appear slightly different every time it is processed by
the model.
Techniques (cont.)
6) Regularization: a group of methods that force the learning algorithm
to produce a simpler model.
Apply L1 regularization or L2 regularization (discussed earlier).
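The simplifying effect of L2 regularization can be seen in a small NumPy sketch. It uses the closed-form solution for linear least squares with an L2 penalty (ridge regression); the data, the penalty value of 10, and the helper `fit_linear` are assumptions for illustration, and L1 regularization is not shown.

```python
import numpy as np

def fit_linear(X, y, l2_penalty=0.0):
    """Least-squares weights with an optional L2 penalty, in closed form.

    Solves (X^T X + l2_penalty * I) w = X^T y; no intercept term is fitted.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + l2_penalty * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=30)   # only the first feature matters

w_plain = fit_linear(X, y)                     # no regularization
w_ridge = fit_linear(X, y, l2_penalty=10.0)    # weights are shrunk towards zero

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

Smaller weights mean the model reacts less to noise in the irrelevant features, which is the sense in which the model becomes simpler.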
Underfitting Problem
Can occur when the model is trained on too little data.
Underfitting occurs when the model is too simple to capture the underlying
structure of the data.
Methods to Reduce Underfitting:
Increase model complexity
Remove noise from the data
Train on more and better features
Reduce the constraints (for example, use less regularization)
Increase the number of epochs to get better results.
Errors in Machine Learning
In ML, an error is a measure of how accurately an algorithm can make
predictions on a previously unseen dataset.
Reducible errors: These errors can be reduced to improve the model
accuracy. They are further classified into bias and variance.
Irreducible errors: These errors will always be present in the model,
regardless of which algorithm is used.
They are caused by unknown variables (noise in the data) whose effect cannot be
reduced.
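For squared-error loss, the split into reducible and irreducible error can be written as a standard decomposition. The notation is an assumption introduced here: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first two terms depend on the chosen model and training procedure and are therefore reducible; the last term depends only on the noise in the data.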
Bias and Variance in ML
Prediction errors will always be present, as there is always
a slight difference between the model's predictions and the actual values.
The main goal is to reduce these errors in order to get more accurate
results.
High and Low Bias Error
Low Bias: A low-bias model makes fewer assumptions about the
underlying data distribution.
Examples of low-bias algorithms: Decision Trees, k-Nearest Neighbors, and Support Vector
Machines.
High Bias: A model with high bias makes more assumptions, and the model
becomes unable to capture the important features of the dataset. A high-bias model
also cannot perform well on new data.
Examples of high-bias algorithms: Linear Regression and Logistic Regression.
To reduce high bias, use the methods for reducing underfitting described earlier.
High and Low Variance Error
Variance in the context of machine learning refers to the sensitivity of a model's
predictions to the variations in the training dataset.
Low variance means there is a small variation in the prediction of the target function
with changes in the training dataset.
High variance means there is a large variation in the prediction of the target function with
changes in the training dataset. It leads to overfitting and is typical of overly complex models.
Nonlinear algorithms, which have a lot of flexibility to fit the data, usually have high
variance. Linear Regression and Logistic Regression have low variance.
Combinations of Bias-Variance
Low-Bias, Low-Variance: the ideal ML algorithm.
Low-Bias, High-Variance: This case occurs when the model learns with a large
number of parameters, and it leads to overfitting.
High-Bias, Low-Variance: This case occurs when a model does not learn the
training dataset well or uses too few parameters. It leads to underfitting.
High-Bias, High-Variance: With high bias and high variance, predictions are
inconsistent and also inaccurate.
Bias-Variance Trade-Off
Reducing bias tends to increase variance, and vice versa, so a balance between the two
errors is required. This balance
between the bias error and the variance error is known as the bias-variance trade-off.
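The trade-off can be made visible with a small simulation: fit a simple and a flexible model on many random training sets and compare how far the average prediction is from the truth (bias) and how much the predictions scatter (variance). The target function sin(2πx), the noise level, and the polynomial degrees 1 and 5 are all assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 50)        # points where predictions are compared
true_f = np.sin(2 * np.pi * x_grid)       # assumed true function

def bias2_and_variance(degree, n_trials=200, n_points=20, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    predictions = np.empty((n_trials, x_grid.size))
    for trial in range(n_trials):
        # Draw a fresh noisy training set and fit the polynomial to it.
        x = rng.uniform(0.0, 1.0, n_points)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, n_points)
        coefficients = np.polyfit(x, y, degree)
        predictions[trial] = np.polyval(coefficients, x_grid)
    mean_prediction = predictions.mean(axis=0)
    bias2 = float(np.mean((mean_prediction - true_f) ** 2))
    variance = float(np.mean(predictions.var(axis=0)))
    return bias2, variance

simple_bias2, simple_var = bias2_and_variance(degree=1)      # straight line
flexible_bias2, flexible_var = bias2_and_variance(degree=5)  # flexible curve

print(simple_bias2, simple_var)
print(flexible_bias2, flexible_var)
```

In this setup the straight line should show the larger bias and the degree-5 polynomial the larger variance, which is the pattern the trade-off describes.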
Confusion Matrix in Machine Learning
The confusion matrix is a table that describes the
performance of a classification model. It is also known as the error matrix.
The matrix summarizes the prediction results as the
total number of correct and incorrect predictions.
For a binary classifier, the matrix has the following layout:

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

If the predicted and true labels match, the prediction is said to
be correct; if they do not match, the prediction is
said to be incorrect.
Confusion Matrix (cont.)
True Positive (TP): number of positive samples correctly classified as Positive
False Negative (FN): number of positive samples incorrectly classified as Negative
False Positive (FP): number of negative samples incorrectly classified as Positive
True Negative (TN): number of negative samples correctly classified as Negative
Some machine learning models require higher precision and some require higher
recall.
So it is important to understand the balance between precision and recall or, simply, the
precision-recall trade-off.
Precision is defined as the ratio of correctly classified positive samples (True Positives) to the total
number of samples classified as positive (either correctly or incorrectly).
Precision, Recall, and F1-Score
These metrics are particularly useful in classification.
Precision: The ratio of correctly predicted positive observations to the total
predicted positives: TP / (TP + FP).
Recall: Measures the model's ability to detect positive samples. It is
the ratio of correctly predicted positive observations
to all actual positives: TP / (TP + FN). The higher the recall, the more
positive samples are detected.
F1-Score: The harmonic mean of precision and recall: 2 * Precision * Recall / (Precision + Recall).
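The confusion matrix counts and the three metrics can be computed by hand, as in this plain-Python sketch. The label lists and the helper names `confusion_counts` and `precision_recall_f1` are made up for the example; label 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, false positives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn

def precision_recall_f1(tp, fn, fp):
    """Compute precision, recall, and F1; a metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical true labels and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fn, fp)
print(tp, fn, fp, tn)          # 3 1 1 5
print(precision, recall, f1)   # 0.75 0.75 0.75
```

Accuracy here would be (TP + TN) / 10 = 0.8, which shows that accuracy and the precision and recall figures can differ on the same predictions.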