Preprocessing Steps vs ML Algorithms
| Algorithm / Model | Missing Values Handling | Outlier Removal | Label/One-Hot Encoding | Normalization / Standardization | Correlation Check / Feature Selection |
|---|---|---|---|---|---|
| Linear Regression | ✅ Required | ✅ Important | ✅ One-Hot for categorical | ✅ Important (scale improves convergence) | ✅ Very useful (multicollinearity issue) |
| Logistic Regression | ✅ Required | ✅ Important | ✅ One-Hot for categorical | ✅ Important (scale improves convergence) | ✅ Useful |
| KNN (K-Nearest Neighbors) | ✅ Required | ✅ Very important | ✅ Needed | ✅ Required (distance-based) | ✅ Optional |
| SVM (Support Vector Machine) | ✅ Required | ✅ Important | ✅ Needed | ✅ Required (distance-based) | ✅ Useful |
| Decision Tree | ✅ Required | ❌ Not critical | ✅ Label Encoding enough | ❌ Not required | ✅ Optional |
| Random Forest | ✅ Required | ❌ Not critical | ✅ Label Encoding enough | ❌ Not required | ✅ Optional |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | ✅ Required | ❌ Less critical | ✅ CatBoost handles directly, else Label/One-Hot | ❌ Not required | ✅ Optional |
| Naive Bayes | ✅ Required | ❌ Not critical | ✅ One-Hot Encoding preferred | ❌ Not required | ✅ Optional |
| K-Means Clustering | ✅ Required | ✅ Important | ✅ Needed | ✅ Required (distance-based) | ✅ Useful |
| PCA (Dimensionality Reduction) | ✅ Required | ✅ Important | ✅ Needed | ✅ Required (scale before PCA) | ✅ Core step |
| Neural Networks (Deep Learning) | ✅ Required | ✅ Important (helps stability) | ✅ One-Hot Encoding for categorical | ✅ Important (speeds up training) | ✅ Optional |
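To make the scale-sensitive rows above concrete, here is a minimal preprocessing sketch in scikit-learn that chains missing-value imputation, one-hot encoding, and standardization in front of a logistic regression. The file name, column names, and the choice of logistic regression are illustrative assumptions, not something the table prescribes.

```python
# Minimal preprocessing sketch (assumed columns: "age", "income", "city", "target").
# Imputation -> encoding -> scaling, wired into a single scikit-learn Pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("data.csv")             # hypothetical dataset
X = df.drop(columns=["target"])          # separate features and target
y = df["target"]

numeric_cols = ["age", "income"]         # assumed numeric features
categorical_cols = ["city"]              # assumed categorical feature

# Numeric: fill missing values with the median, then standardize (mean 0, std 1).
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical: fill missing values with the most frequent level, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Scale-sensitive model: per the table, scaling helps logistic regression converge.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

Fitting the whole pipeline only on the training split keeps the imputation and scaling statistics from leaking information out of the test data.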
Key Takeaways:
- Always handle missing values, encode categorical variables, and separate features from the target.
- Normalization/standardization is crucial for distance- and gradient-based algorithms (KNN, SVM, K-Means, PCA, neural networks).
- Outliers hurt linear/logistic regression, SVM, and KNN, while tree-based models are more robust to them (see the outlier sketch below).
- Tree-based models don't need scaling, and label encoding is usually enough (see the decision-tree sketch below).
- A correlation check is especially important for linear models to avoid multicollinearity (see the correlation sketch below).
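For the outlier takeaway, a minimal IQR-based filter; the 1.5 × IQR rule, the toy data, and the income column are conventional or illustrative choices, and tree-based models would typically skip this step.

```python
# Sketch: simple IQR rule for dropping outlier rows before a linear/SVM/KNN model.
import pandas as pd

df = pd.DataFrame({"income": [32_000, 41_000, 39_500, 45_000, 2_000_000, 37_000]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]            # drops the 2,000,000 row
print(df_clean)
```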
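For the tree-model takeaway, a sketch showing that a decision tree can be trained on unscaled features with integer-coded categories; the toy data and column names are made up for illustration, and scikit-learn's OrdinalEncoder plays the role of label encoding for feature columns (LabelEncoder itself is intended for the target).

```python
# Sketch: tree-based model on unscaled features with integer-coded categories.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size":  ["S", "M", "L", "M", "S", "L"],
    "price": [10.0, 250.0, 13.5, 800.0, 9.0, 120.0],   # very different scale, left as-is
    "label": [0, 1, 0, 1, 0, 1],
})

X = df.drop(columns=["label"])
y = df["label"]

# Integer-code the categorical columns; no StandardScaler anywhere.
X[["color", "size"]] = OrdinalEncoder().fit_transform(X[["color", "size"]])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```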
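For the multicollinearity takeaway, a pandas-only correlation check; the synthetic data and the 0.9 threshold are illustrative assumptions, and a variance inflation factor (VIF) check would be a common alternative.

```python
# Sketch: flag highly correlated feature pairs before fitting a linear model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.98 + rng.normal(scale=0.05, size=200),  # nearly a copy of x1
    "x3": rng.normal(size=200),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("highly correlated, candidates to drop:", to_drop)   # expected: ['x2']
```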