Program 6

The document outlines preprocessing steps and their importance for various machine learning algorithms, highlighting the necessity of handling missing values, encoding categorical variables, and normalizing data. It emphasizes that normalization is crucial for distance and gradient-based algorithms, while tree-based models are more robust to outliers and do not require scaling. Additionally, it notes the significance of correlation checks for linear models to prevent multicollinearity issues.

Preprocessing Steps vs ML Algorithms

| Algorithm / Model | Missing Values Handling | Outlier Removal | Label/One-Hot Encoding | Normalization / Standardization | Correlation Check / Feature Selection |
|---|---|---|---|---|---|
| Linear Regression | ✅ Required | ✅ Important | ✅ One-Hot for categorical | ✅ Important (scaling improves convergence) | ✅ Very useful (multicollinearity issue) |
| Logistic Regression | ✅ Required | ✅ Important | ✅ One-Hot for categorical | ✅ Important (scaling improves convergence) | ✅ Useful |
| KNN (K-Nearest Neighbors) | ✅ Required | ✅ Very important | ✅ Needed | ✅ Required (distance-based) | ✅ Optional |
| SVM (Support Vector Machine) | ✅ Required | ✅ Important | ✅ Needed | ✅ Required (distance-based) | ✅ Useful |
| Decision Tree | ✅ Required | ❌ Not critical | ✅ Label Encoding enough | ❌ Not required | ✅ Optional |
| Random Forest | ✅ Required | ❌ Not critical | ✅ Label Encoding enough | ❌ Not required | ✅ Optional |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | ✅ Required | ❌ Less critical | ✅ CatBoost handles directly, else Label/One-Hot | ❌ Not required | ✅ Optional |
| Naive Bayes | ✅ Required | ❌ Not critical | ✅ One-Hot Encoding preferred | ❌ Not required | ✅ Optional |
| K-Means Clustering | ✅ Required | ✅ Important | ✅ Needed | ✅ Required (distance-based) | ✅ Useful |
| PCA (Dimensionality Reduction) | ✅ Required | ✅ Important | ✅ Needed | ✅ Required (scale before PCA) | ✅ Core step |
| Neural Networks (Deep Learning) | ✅ Required | ✅ Important (helps stability) | ✅ One-Hot Encoding for categorical | ✅ Required (speeds up training) | ✅ Optional |
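
To make the table's columns concrete, below is a minimal scikit-learn preprocessing sketch: impute missing values, one-hot encode a categorical column, and standardize numeric columns before fitting a logistic regression. The DataFrame and its column names (`age`, `income`, `city`) are hypothetical placeholders, not data from the original program.

```python
# Minimal preprocessing sketch; the dataset and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with missing entries in numeric and categorical columns.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 55],
    "income": [40e3, 52e3, 61e3, np.nan, 45e3, 80e3],
    "city":   ["Delhi", "Mumbai", "Delhi", np.nan, "Chennai", "Mumbai"],
    "target": [0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="target"), df["target"]

# Numeric columns: median imputation, then standardization (helps convergence).
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
print(model.predict(X))
```

Wrapping every step in a single Pipeline keeps the imputation and scaling statistics fitted on training data only, which avoids leakage during cross-validation.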

Key Takeaways:

- Always handle missing values, encode categorical variables, and separate features from the target.
- Normalization/standardization is crucial for distance- and gradient-based algorithms (KNN, SVM, K-Means, PCA, neural networks); the first sketch after this list shows the effect on KNN.
- Outliers hurt linear/logistic regression, SVM, and KNN, while tree-based models are more robust to them.
- Tree models don't need scaling, and Label Encoding is usually enough for them.
- Correlation checks matter most for linear models, to avoid multicollinearity; a minimal check is sketched below.
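
The scaling takeaway can be checked empirically. The sketch below, assuming synthetic data from `make_classification` with one artificially inflated feature scale, compares KNN with and without standardization against a random forest; on typical runs the scaled KNN scores noticeably higher, while the forest is essentially unchanged.

```python
# Sketch: effect of feature scaling on KNN vs. a tree model (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1000  # inflate one feature's scale so it dominates raw distances

candidates = [
    ("KNN raw",    KNeighborsClassifier()),
    ("KNN scaled", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("RF raw",     RandomForestClassifier(random_state=0)),
    ("RF scaled",  make_pipeline(StandardScaler(),
                                 RandomForestClassifier(random_state=0))),
]
for name, model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:10s} accuracy: {score:.3f}")
```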

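For the multicollinearity takeaway, here is a minimal correlation check one might run before fitting a linear model; the 0.9 threshold and the synthetic features are illustrative assumptions.

```python
# Sketch: flag highly correlated feature pairs before fitting a linear model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.05, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                       # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [
    (row, col, round(upper.loc[row, col], 3))
    for row in upper.index for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > 0.9
]
print(pairs)  # e.g. [('x1', 'x2', 0.999)] -> drop one feature of the pair
```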