50 ML Interview Questions

Tajamul Khan

@Tajamulkhann
1. What is the difference between
supervised and unsupervised
learning?
Supervised Learning: Uses labeled data to train models for predictive tasks.
- Regression: Predicts continuous values, e.g., house price, temperature.
- Classification: Predicts categories, e.g., spam/not spam, disease diagnosis.

Unsupervised Learning: Uses unlabeled data to discover patterns or groupings.
- Clustering: Groups similar data, e.g., customer segmentation.
- Dimensionality Reduction: Reduces features, e.g., PCA for visualization.

Supervised learning predicts outcomes; unsupervised learning uncovers hidden structures.

2. What is the difference between
classification and regression?
Classification predicts a discrete label or category. The model learns from labeled data and assigns new instances to predefined classes. Examples include identifying whether an email is spam or not. Classification is evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

Regression predicts a continuous value. The model learns the relationship between input features and a real-valued target variable, aiming to estimate quantities such as house prices, temperature, or sales figures. Common evaluation metrics for regression include mean squared error (MSE), mean absolute error (MAE), and R-squared.

3. What is the bias-variance
tradeoff?
The bias-variance tradeoff is a fundamental
concept in ML:
- Bias: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause underfitting.
- Variance: Error due to sensitivity to small fluctuations in the training set. High variance can cause overfitting.

A model with high bias pays little attention to the training data and oversimplifies the problem. A model with high variance pays too much attention to the training data and captures noise. The goal is to find a balance where the total error (bias + variance) is minimized.

4. How to deal with overfitting and
underfitting?
Overfitting: The model learns noise and details from the training data, performing well on training but poorly on new data. This is caused by excessive model complexity.

Underfitting: The model is too simple to capture underlying patterns, performing poorly on both training and test data. This is caused by insufficient model complexity.

Combat Overfitting: Use regularization, cross-validation, reduce model complexity, apply early stopping, and collect more data.

Combat Underfitting: Increase model complexity, add more features, reduce regularization.

5. What is cross-validation and why
is it important?
Cross-validation is a resampling technique to assess model performance:

k-Fold Cross-Validation: The data is split into k subsets. The model is trained on k-1 folds and validated on the remaining fold, repeated k times.

Importance: Provides a more reliable estimate of model generalization, reduces overfitting risk, and helps in hyperparameter tuning.

Cross-validation is essential for robust model evaluation, especially when data is limited.
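As a quick illustration, here is a minimal sketch of 5-fold cross-validation with scikit-learn; the dataset and model are placeholders:

```python
# A minimal sketch of 5-fold cross-validation with scikit-learn
# (the dataset and model choice are illustrative).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on 4 folds, validate on the 5th, repeated 5 times.
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```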

6. What is precision, recall, and
F1-score, and what is the trade-off
between them?
Precision: The proportion of predicted positives that are actual positives (true positives / (true positives + false positives)).

Recall: The proportion of actual positives that are predicted as positives (true positives / (true positives + false negatives)).

F1-score: The harmonic mean of precision and recall, balancing both metrics.

There is a trade-off: increasing precision often decreases recall and vice versa. F1-score is used when you want to balance both metrics, especially in imbalanced datasets.

7. How do you choose the right
evaluation metric for your problem?
To choose the right evaluation metric, consider:

Problem Type
- Classification: Accuracy, Precision, Recall, F1, AUC-ROC
- Regression: MAE, MSE, RMSE, R²

Data Imbalance
- Use Precision, Recall, or F1 instead of Accuracy if classes are imbalanced.

Business Goal
- High cost of false negatives → use Recall
- High cost of false positives → use Precision

Interpretability
- Pick a metric that's easy to explain to stakeholders.

Example: For fraud detection, use Recall or AUC-ROC, not Accuracy.

8. What is the difference between
accuracy, precision, and recall?
Accuracy: Out of all predictions, how many were correct?
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: Out of all predicted positives, how many were actually positive?
Formula: Precision = TP / (TP + FP)

Recall (Sensitivity): Out of all actual positives, how many did the model correctly identify?
Formula: Recall = TP / (TP + FN)

When to use which
- Accuracy: Best when classes are balanced.
- Precision: Use when false positives are costly (e.g., spam detection).
- Recall: Use when false negatives are costly (e.g., disease detection).
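A minimal sketch of computing these metrics with scikit-learn (the label arrays are made-up examples):

```python
# A minimal sketch of the three metrics via scikit-learn
# (y_true and y_pred are illustrative).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")   # (TP + TN) / total
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # TP / (TP + FN)
```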

9. What is a Confusion Matrix and
how is it used?
It is used to evaluate the performance of a classification model. It compares the model's predicted values with the actual values.

                    Predicted Positive                        Predicted Negative
Actual Positive     True Positive (TP)                        False Negative (FN), Type II Error (β)
Actual Negative     False Positive (FP), Type I Error (α)     True Negative (TN)

Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
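A minimal sketch of extracting the four cells with scikit-learn and deriving the metrics from them (the labels are illustrative):

```python
# A minimal sketch of building a confusion matrix and deriving metrics.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```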

10. How do you handle missing or
corrupted data in a dataset?
Handling missing/corrupted data is crucial for robust modeling:
- Deletion: Remove rows or columns with missing values if the loss is acceptable and does not introduce bias.
- Imputation: Replace missing values with the mean, median, mode, or more sophisticated methods like k-NN imputation or predictive modeling.
- Flagging: Add a binary indicator for missing values to signal the model about missingness.
- Advanced Methods: Use algorithms that natively handle missing data (e.g., XGBoost, LightGBM).

The choice depends on the amount and nature of missing data and the impact on downstream analysis.
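A minimal pandas sketch of flagging and imputing missing values (the toy DataFrame is illustrative):

```python
# A minimal sketch of simple imputation strategies with pandas
# (the toy DataFrame is illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["NY", "SF", None, "NY"]})

df["age_missing"] = df["age"].isna().astype(int)      # flag missingness
df["age"] = df["age"].fillna(df["age"].median())      # numeric: median imputation
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: mode imputation
print(df)
```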

11. How do you handle categorical
variables in machine learning?
Categorical variables can be handled by:
- One-hot encoding: Creates binary columns for each category.
- Ordinal encoding: Assigns integers to categories if order is meaningful.
- Target encoding: Replaces categories with the mean target value for that category, useful for high-cardinality features.

The method depends on the nature of the data and the algorithm being used.
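A minimal pandas sketch of one-hot and ordinal encoding (the data and the category order are illustrative):

```python
# A minimal sketch of one-hot and ordinal encoding with pandas
# (the data and the size ordering are illustrative).
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "L", "M"]})

one_hot = pd.get_dummies(df["color"], prefix="color")  # one binary column per category
size_order = {"S": 0, "M": 1, "L": 2}                  # meaningful order -> ordinal
df["size_encoded"] = df["size"].map(size_order)
print(pd.concat([df, one_hot], axis=1))
```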

12. What is feature engineering and
why is it crucial?
Feature engineering is the process of creating,
transforming, or selecting features to improve
model performance:

Examples: Creating interaction terms, encoding categorical variables, scaling, and generating polynomial features.

Importance: Well-engineered features can significantly boost model accuracy, interpretability, and robustness. Poor features can lead to poor model performance regardless of the algorithm used.
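As one concrete example, a minimal scikit-learn sketch of generating polynomial and interaction features (the input array is illustrative):

```python
# A minimal sketch of polynomial/interaction feature generation
# with scikit-learn (the array is illustrative).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [1.0, 4.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
# Produces [x1, x2, x1^2, x1*x2, x2^2] for each row.
print(poly.fit_transform(X))
```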

13. What is the difference between
parametric and non-parametric
models?
Parametric Models: Assume a fixed number of parameters.
- Examples: Linear Regression, Logistic Regression
- Simple, fast, works well with less data
- Less flexible, may underfit complex patterns

Non-Parametric Models: No fixed number of parameters; the model grows with the data.
- Examples: Decision Trees, k-NN
- More flexible, can capture complex relationships
- Slower, needs more data, risk of overfitting

Use parametric models for speed and simplicity, non-parametric models for flexibility and rich patterns.

14. What is the curse of
dimensionality and how does it
affect machine learning models?
The curse of dimensionality refers to problems that arise when data has too many features.

Effect on ML Models
- Data becomes sparse → harder to learn patterns.
- Distances lose meaning → impacts models like k-NN.
- Increases overfitting risk and computation time.

Solution
- Use dimensionality reduction (e.g., PCA) or feature selection.
- Collect more data to match feature complexity.

15. What is the purpose of
regularization in machine learning?
Regularization prevents overfitting by adding a penalty to the loss function, discouraging overly complex models.

Effect: Shrinks large coefficients, improving generalization.

Types
- L1 (Lasso): Can reduce some coefficients to zero (feature selection).
- L2 (Ridge): Distributes shrinkage across coefficients.

Use Case: Helpful when dealing with limited data or high model complexity.

16. What are the core assumptions of
linear regression?
- Linearity: The relationship between the independent and dependent variables is linear.
- Independence: Observations are independent of each other (no autocorrelation).
- Homoscedasticity: The variance of residuals is constant across all levels of the independent variables.
- Normality of Residuals: The residuals are normally distributed, especially important for small samples.
- No or Little Multicollinearity: Independent variables are not highly correlated with each other.

Violations can lead to biased, inefficient, or invalid results. For example, if residuals are not normally distributed, confidence intervals and hypothesis tests may be unreliable.

17. What is the role of activation
functions in logistic regression?
Logistic regression uses the sigmoid activation function, σ(z) = 1 / (1 + e^(−z)), to convert the linear output (z = w·x + b) into a probability between 0 and 1. This probability helps in making binary classification decisions (e.g., class 0 or 1).

The sigmoid function introduces a non-linear transformation at the output layer, allowing the model to map any real-valued input into a probability space. Without this activation, logistic regression would behave like linear regression and couldn't handle classification tasks.
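A minimal sketch of the sigmoid mapping (the weights and input are illustrative):

```python
# A minimal sketch of the sigmoid mapping used by logistic regression
# (weights, bias, and input are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1
x = np.array([2.0, 1.5])
z = np.dot(w, x) + b          # linear output
p = sigmoid(z)                # squashed into (0, 1)
print(f"P(y=1|x) = {p:.3f}")  # classify as 1 if p >= 0.5
```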

18. How do you interpret the
coefficients in logistic regression?
Each coefficient represents the change in the log-odds of the positive class for a one-unit increase in that feature, holding the other features constant.

- Positive coefficient: Increases the log-odds, and therefore the predicted probability, of the positive class.
- Negative coefficient: Decreases it.
- Exponentiating a coefficient (e^β) gives the odds ratio: the factor by which the odds of the positive class are multiplied per unit increase in that feature.

19. How do decision trees work in
machine learning?
A decision tree is a supervised learning model used for classification and regression.
- Nodes: Represent features or conditions
- Branches: Represent decision outcomes (yes/no or value ranges)
- Leaves: Represent final predictions (class or value)

The tree is built by recursively splitting the data using the feature that gives the best separation (using metrics like Gini impurity or information gain).

Features
- Simple, visual, and easy to interpret
- Can overfit the training data; this is handled by pruning or using ensemble methods like Random Forest
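A minimal scikit-learn sketch of training a small decision tree (dataset choice is illustrative):

```python
# A minimal sketch of training a decision tree with scikit-learn
# (the dataset is illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# max_depth limits tree growth, a simple guard against overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X, y)
print(f"Training accuracy: {tree.score(X, y):.3f}")
```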

20. What is the motivation behind
random forests?
Random forests are an ensemble of decision trees:

How it works: Each tree is trained on a random subset of the data and features. Predictions are averaged or voted.

Motivation
- Single decision trees are prone to overfitting and sensitive to noise.
- Random Forest builds many trees on random subsets of data and features, then combines their predictions.

This reduces variance, improves generalization, and gives better performance on unseen data.

21. Explain the difference between
bagging and boosting in ensemble
methods.
Bagging: Multiple models are trained in parallel on different subsets of the data, and predictions are averaged or voted. Reduces variance. Example: Random Forest.

Boosting: Models are trained sequentially, each correcting the errors of its predecessor. Reduces bias. Examples: AdaBoost, Gradient Boosting.

Bagging is robust to overfitting, while boosting can achieve higher accuracy but is more sensitive to noisy data.

22. What is the difference between
hard and soft voting in ensemble
methods?
Voting is used in ensemble models like Random Forests or Voting Classifiers to combine predictions from multiple models.

Hard Voting: Each model makes a final class prediction, and the class with the majority of votes wins. (No probabilities.)

Soft Voting: Each model outputs class probabilities. The final class is the one with the highest average probability.

Soft voting is generally more accurate, as it accounts for prediction confidence.
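A minimal scikit-learn sketch contrasting the two voting modes (the base estimators are illustrative):

```python
# A minimal sketch of hard vs soft voting with scikit-learn
# (the base estimators and dataset are illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("nb", GaussianNB()),
]
hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority class vote
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average probabilities
print(hard.score(X, y), soft.score(X, y))
```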

23. What is k-nearest neighbors
(k-NN) and how does it work?
k-NN is a simple, instance-based machine learning algorithm used for classification and regression.

How it works
- To predict a new point, it finds the k closest points in the training data using a distance metric (like Euclidean distance).
- For classification, it picks the majority class among the neighbors.
- For regression, it averages the neighbors' values.

Key points
- No training phase; it just stores the data.
- The choice of k affects accuracy (small k = sensitive to noise, large k = smoother results).
- Requires feature scaling for accurate distance calculations.
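A minimal scikit-learn sketch of k-NN with feature scaling (the dataset and k are illustrative):

```python
# A minimal sketch of k-NN with feature scaling in scikit-learn
# (the dataset and k=5 are illustrative).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Scaling matters because k-NN relies on raw distances between points.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.predict(X[:3]))
```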

24. What is k-Means and How Does It
Work?
k-Means is an unsupervised clustering algorithm used to group data into k clusters based on similarity.

How it works
1. Choose k initial cluster centroids randomly.
2. Assign each data point to the nearest centroid (based on Euclidean distance).
3. Recalculate the centroids by taking the mean of all points assigned to each cluster.
4. Repeat steps 2–3 until the centroids no longer move or a set number of iterations is reached.

Key points
- It tries to minimize the distance between points and their cluster centroids.
- Sensitive to initial centroid placement.
- Works best on spherical, evenly sized clusters.
- Requires specifying k beforehand.
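A minimal scikit-learn sketch of k-Means (the synthetic blobs are illustrative):

```python
# A minimal sketch of k-Means clustering with scikit-learn
# (the synthetic blobs are illustrative).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# n_init repeats clustering from different random centroids to offset
# sensitivity to initialization.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)
print(f"Inertia (sum of squared distances): {km.inertia_:.1f}")
```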

25. How do you select the best value
of k in k-means clustering?
- Elbow Method: Plot the sum of squared distances (inertia) for a range of k values. The "elbow" point, where the rate of decrease sharply slows, indicates a good value for k.
- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1; higher values indicate better clustering.
- Gap Statistic: Compares the total intra-cluster variation for different k with their expected values under null reference distributions. The k with the largest gap is optimal.

Each method has trade-offs, but silhouette analysis is often preferred for its interpretability and robustness.
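A minimal sketch comparing inertia and silhouette scores across k (synthetic data; plotting the inertias reveals the elbow):

```python
# A minimal sketch of the elbow method and silhouette scores
# (the synthetic blobs are illustrative).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.0f}, silhouette={sil:.3f}")
```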

26. What is DBSCAN clustering and
how is it better than K-Means?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points based on density. It works by connecting points that have a minimum number of neighbors within a given radius (eps). Dense regions form clusters, while isolated points are labeled as noise (outliers).

Advantages: Unlike K-Means, DBSCAN doesn't require specifying the number of clusters and can detect clusters of arbitrary shapes, making it robust to noise and well-suited for real-world, non-linear data.
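A minimal scikit-learn sketch of DBSCAN on non-spherical data (the two-moons dataset, eps, and min_samples are illustrative choices):

```python
# A minimal sketch of DBSCAN on data K-Means handles poorly
# (dataset, eps, and min_samples are illustrative).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks points DBSCAN considers noise.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {(db.labels_ == -1).sum()}")
```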

27. What is the difference between
feature selection and feature
extraction?
Feature Selection involves choosing a subset of the most relevant features from the original dataset, often based on statistical tests or model-based importance.

Feature Extraction transforms or combines existing features to create new ones (e.g., PCA, autoencoders).

Both techniques aim to reduce dimensionality, remove noise, and enhance model performance and interpretability.

28. How do you interpret feature
importance in tree-based models?
Feature importance in tree-based models (like decision trees and random forests) is typically measured by:

- Reduction in Impurity: Total decrease in metrics like Gini or entropy when a feature is used for splits.
- Split Frequency: How often a feature is used to split nodes across all trees.

Features with higher importance scores contribute more to predictions, helping with model interpretability and feature selection.
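A minimal scikit-learn sketch of reading impurity-based importances from a random forest (dataset choice is illustrative):

```python
# A minimal sketch of impurity-based feature importances
# from a random forest (the dataset is illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# feature_importances_ holds the mean impurity reduction per feature.
for name, score in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {score:.3f}")
```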

29. What is principal component
analysis (PCA) and when should it
be used?
PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components, ordered by the amount of variance they capture.

Use
It's useful when dealing with high-dimensional data, correlated features, or the curse of dimensionality.

PCA helps reduce noise, eliminate redundancy, improve visualization, and speed up model training while preserving as much information as possible.
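A minimal scikit-learn sketch of PCA (dataset choice is illustrative; features are standardized first because PCA is scale-sensitive):

```python
# A minimal sketch of PCA with scikit-learn
# (the dataset and n_components=2 are illustrative).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(f"Variance explained: {pca.explained_variance_ratio_}")
```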

30. What is Linear Discriminant
Analysis (LDA) and when should it be
used?
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique used primarily for classification tasks. It projects data onto a lower-dimensional space by maximizing class separability.

Unlike PCA, which focuses on capturing the most variance regardless of class labels, LDA leverages class information to maximize between-class variance and minimize within-class variance.

Use LDA when your goal is to improve classification performance and your data meets assumptions like normally distributed features and equal class covariances.

31. How do you handle
multicollinearity in regression
models?
To address multicollinearity, you can:
- Remove correlated predictors to reduce redundancy.
- Apply regularization (L1 or L2) to penalize large or unstable coefficients.
- Use dimensionality reduction techniques like PCA to transform features into uncorrelated components.

Multicollinearity can inflate the variance of coefficient estimates, making them unreliable and sensitive to small changes in the data.
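A minimal sketch of diagnosing multicollinearity with variance inflation factors (VIF) via statsmodels; the synthetic data is illustrative, and a VIF above roughly 5-10 is a common warning sign:

```python
# A minimal sketch of VIF diagnostics with statsmodels
# (the synthetic, deliberately collinear data is illustrative).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

X = sm.add_constant(df)  # include an intercept for the auxiliary regressions
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.1f}")
```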

32. How do you make your model
robust to outliers?
- Robust Algorithms: Use algorithms less sensitive to outliers (e.g., tree-based methods).
- Outlier Removal: Detect and remove outliers using statistical methods (e.g., IQR, Z-score).
- Robust Loss Functions: Use loss functions like Huber loss that are less sensitive to outliers.
- Data Transformation: Normalize or standardize data to reduce the impact of outliers.
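A minimal numpy sketch of IQR-based outlier detection (the data is illustrative; the 1.5×IQR fence is the standard textbook choice):

```python
# A minimal sketch of IQR-based outlier detection with numpy
# (the data array is illustrative).
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, -40])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the common 1.5 * IQR fences

outliers = data[(data < lower) | (data > upper)]
print(f"Fences: [{lower:.1f}, {upper:.1f}], outliers: {outliers}")
```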

33. What is the difference between
generative and discriminative
models?
Generative Models learn the joint probability distribution P(X, Y): they model how the data is generated by learning both the input features and output labels.
Examples: Naive Bayes, Gaussian Mixture Models, GANs.

Discriminative Models learn the conditional probability P(Y | X): they focus directly on finding boundaries between classes.
Examples: Logistic Regression, SVM, Decision Trees.

Key Difference: Generative models can generate new data; discriminative models are better for classification.

34. How do you choose which
algorithm to use for a dataset?
Algorithm selection depends on:
- Problem Type: Classification, regression, clustering, etc.
- Data Size and Quality: Some algorithms scale better with large data.
- Interpretability Needs: Some models are more interpretable than others.
- Computational Resources: Some algorithms are more resource-intensive.
- Domain Knowledge: Prior knowledge can guide algorithm choice.

Exploratory data analysis (EDA) and experimentation are key to selecting the best algorithm.

35. Explain L1 and L2 regularization
and their differences.
Regularization adds a penalty to the loss function to prevent overfitting. It discourages large coefficients, reducing model complexity.

- L1 (Lasso): Adds the absolute value of the coefficients as a penalty. This can shrink some coefficients to zero, effectively performing feature selection and producing sparse models.
- L2 (Ridge): Adds the squared value of the coefficients as a penalty. This shrinks coefficients but does not set them to zero, reducing model complexity without eliminating features.

L1 is useful when you suspect many features are irrelevant, while L2 is better when all features have some relevance but you want to control overfitting.
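A minimal scikit-learn sketch contrasting the two penalties (synthetic data; alpha controls penalty strength):

```python
# A minimal sketch comparing Lasso (L1) and Ridge (L2) coefficients
# (the synthetic regression problem and alpha=1.0 are illustrative).
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso typically zeroes out uninformative features; Ridge only shrinks them.
print(f"Lasso zero coefficients: {(lasso.coef_ == 0).sum()} of 10")
print(f"Ridge zero coefficients: {(ridge.coef_ == 0).sum()} of 10")
```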

36. What is the kernel trick in SVM
and why is it useful?
The kernel trick enables Support Vector Machines (SVMs) to handle non-linear classification by implicitly mapping data to a higher-dimensional space without explicitly computing the transformation.

How it works: Uses kernel functions (e.g., polynomial, RBF) to compute inner products in the transformed space.

Why it's useful: Allows SVMs to find a linear decision boundary in a space where the original data becomes separable.

This makes SVMs highly effective for complex, non-linear classification tasks.
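A minimal scikit-learn sketch of a linear vs RBF-kernel SVM on data that is not linearly separable (the two-circles dataset is illustrative):

```python
# A minimal sketch showing the kernel trick's effect
# (the two-circles dataset is illustrative).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)  # implicit feature map via RBF kernel
print(f"Linear kernel accuracy: {linear.score(X, y):.2f}")
print(f"RBF kernel accuracy:    {rbf.score(X, y):.2f}")
```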

37. What are the differences between
batch, mini-batch, and stochastic
gradient descent?
Batch Gradient Descent
- Uses the entire dataset to compute gradients at each step.
- Stable updates, but slow for large datasets.

Mini-Batch Gradient Descent
- Uses small batches (e.g., 32 or 64 samples) to compute gradients.
- Balances speed and stability.
- Most commonly used in practice, especially in deep learning.

Stochastic Gradient Descent (SGD)
- Uses one random data point per step.
- Fast, but updates are noisy and may overshoot.

38. Explain how gradient descent
works in machine learning
It is an optimization algorithm that minimizes a model's loss function by iteratively updating parameters (e.g., weights) in the direction of the negative gradient: w ← w − α·∇L(w).

Key Component – Learning Rate (α)
- Determines the step size for each update.
- Too high: May overshoot or diverge.
- Too low: Leads to slow or stalled convergence.

It's fundamental to training most machine learning and deep learning models.
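A minimal from-scratch sketch minimizing f(w) = (w − 3)², whose gradient is 2(w − 3):

```python
# A minimal sketch of gradient descent on f(w) = (w - 3)^2,
# whose minimum is at w = 3 (starting point and alpha are illustrative).
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0       # initial parameter
alpha = 0.1   # learning rate: step size along the negative gradient
for step in range(50):
    w -= alpha * gradient(w)

print(f"w after 50 steps: {w:.4f} (true minimum at w = 3)")
```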

39. What is the Learning Rate and
How Does It Affect Convergence?
The learning rate is a hyperparameter that controls the size of the steps taken during optimization (e.g., gradient descent) to minimize the loss function.

If the learning rate is too high: The model may overshoot the minimum, causing divergence or oscillations.

If the learning rate is too low: The model converges very slowly, increasing training time and possibly getting stuck in local minima.

In short: A proper learning rate balances fast convergence without missing the optimal solution.

40. Explain hyperparameters and
tuning methods.
Hyperparameters: Settings configured before training (e.g., learning rate, number of trees in a random forest). They control the learning process.
- Random Forest: n_estimators, max_depth
- SVM: C, kernel type, gamma
- XGBoost: learning_rate, max_depth, etc.

Tuning Methods
- Grid Search: Exhaustively tests all combinations in a predefined hyperparameter space.
- Random Search: Samples random combinations; more efficient for high-dimensional spaces.
- Bayesian Optimization: Uses probabilistic models to guide the search (e.g., TPE, Optuna).
- Automated Methods: Tools like Hyperopt, etc.

41. How do you prevent overfitting
during hyperparameter tuning?
- Use cross-validation (like k-fold) to reliably evaluate model performance.
- Avoid over-tuning on the validation set to prevent overfitting to it.
- Apply early stopping in iterative models (e.g., neural networks) to halt training when performance stops improving.
- Use regularization methods (L1/L2) to reduce model complexity and enhance generalization.

42. When would you use grid search
vs. random search?
Grid Search is ideal for small hyperparameter spaces where you can exhaustively try all combinations (e.g., 2–3 parameters with limited values).

Random Search is better for large or high-dimensional spaces, especially in models like neural networks, as it samples combinations randomly.

Trade-off
- Grid Search: More thorough but time-consuming.
- Random Search: Faster and often more efficient, though it may miss the absolute best combination.
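A minimal scikit-learn sketch of both approaches (the model and parameter spaces are illustrative):

```python
# A minimal sketch of grid vs random search with scikit-learn
# (the SVC parameter ranges are illustrative).
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)  # tries all 9 combinations

rand = RandomizedSearchCV(SVC(), {"C": uniform(0.1, 10), "gamma": uniform(0.01, 1)},
                          n_iter=9, cv=5, random_state=42)
rand.fit(X, y)  # samples 9 random combinations from the distributions

print(grid.best_params_, rand.best_params_)
```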

43. How do you evaluate a model’s
performance using an ROC curve?
The ROC curve plots the true positive rate (recall) against the false positive rate at various threshold settings:

AUC (Area Under the Curve): Measures the model's ability to discriminate between classes. AUC = 0.5 is random; AUC = 1 is perfect.

Interpretation: A higher AUC indicates better model performance. The curve helps choose the optimal threshold based on the desired balance between sensitivity and specificity.
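A minimal scikit-learn sketch of computing the ROC curve and AUC (the synthetic problem is illustrative):

```python
# A minimal sketch of ROC/AUC computation with scikit-learn
# (the synthetic classification problem is illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)  # points of the ROC curve
print(f"AUC: {roc_auc_score(y_te, probs):.3f}")
```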

44. What is the silhouette score and
how is it used in clustering?
The silhouette score measures how well a data point fits within its cluster compared to other clusters.

For each point, it calculates:
- a: average distance to points in the same cluster
- b: average distance to points in the nearest different cluster

Silhouette Score = (b − a) / max(a, b)

Scores range from -1 to 1:
- Close to 1 → well clustered
- Around 0 → on the cluster boundary
- Negative → possibly assigned to the wrong cluster

How it's used
- To evaluate clustering quality
- To choose the optimal number of clusters (k) by selecting the k with the highest average silhouette score

45. How to Select Features in High-
Dimensional Data?
Feature selection helps improve accuracy and reduce overfitting by choosing the most important features:
- Filter methods: Use statistical tests (like correlation or chi-square) to select features independently of any model.
- Wrapper methods: Use models to evaluate different feature subsets and pick the best (e.g., Recursive Feature Elimination).
- Embedded methods: Feature selection happens during model training, like with Lasso regression or tree-based models that provide feature importance.
- Dimensionality reduction: Techniques like PCA transform features into fewer components while retaining most information.

Always confirm your choice by checking how the model performs with the selected features.
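A minimal scikit-learn sketch of wrapper-style selection with Recursive Feature Elimination (the setup is illustrative):

```python
# A minimal sketch of Recursive Feature Elimination (RFE)
# (the synthetic problem and n_features_to_select are illustrative).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

# support_ marks the features RFE kept after iteratively dropping the weakest.
selected = [i for i, kept in enumerate(rfe.support_) if kept]
print(f"Selected feature indices: {selected}")
```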

46. What is R² and Adjusted R² in
Regression, and How Do They Differ?
R² (Coefficient of Determination): Measures the proportion of variance in the dependent variable explained by the model. Values range from 0 to 1; higher means a better fit.

Adjusted R²: A modified version of R² that penalizes adding irrelevant predictors. It adjusts for the number of features, preventing overestimation of model performance when more variables are added: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p the number of predictors.

Key difference
- R² always increases or stays the same when more features are added.
- Adjusted R² can decrease if the added features don't improve the model.

47. What is the difference between
feature selection and feature
extraction?
Feature Selection: Selects a subset of the original features based on their importance or relevance to the target variable, helping simplify the model and reduce noise.

Feature Extraction: Generates new features by transforming or combining existing ones (e.g., using techniques like PCA or autoencoders) to capture underlying patterns and structure in the data.

Both methods aim to reduce dimensionality and improve model performance, but feature extraction can capture more complex, non-obvious relationships compared to simple selection.

48. What is the difference between
A/B Testing and Machine Learning
Model Deployment?
A/B Testing: Comparing two or more versions of a feature or design by analyzing user behavior and metrics to determine which performs better.

ML Deployment: Putting a trained machine learning model into a production environment where it can make automated, real-time predictions on new data.

A/B testing helps validate and choose the best option through experimentation; ML deployment operationalizes the model to deliver continuous value in live settings.

49. What is the Purpose of a Test Set,
and How Does It Differ from a
Validation Set?
Training Set: Used to train the model by learning patterns from data.

Validation Set: Used to tune hyperparameters and select the best model during training. Helps prevent overfitting.

Test Set: Used only once after training to evaluate the final model's performance on unseen data.

Key difference
- The validation set guides model building; the test set assesses true generalization.
- Test data must be kept separate until the very end to avoid biased evaluation.
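A minimal scikit-learn sketch of a train/validation/test split (the 60/20/20 proportions are illustrative):

```python
# A minimal sketch of a 60/20/20 train/validation/test split
# (the dataset and proportions are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out 20% as the held-out test set, touched only once at the end.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the remainder into train (60%) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  random_state=42)
print(len(X_train), len(X_val), len(X_test))
```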

50. What are the stages in a Machine
Learning Project?
1. Problem Definition: Clarify the business problem and goals.
2. Data Collection: Gather relevant data.
3. Data Cleaning & Preprocessing: Fix missing values and outliers, and prepare the data.
4. Exploratory Data Analysis: Explore patterns and visualize insights.
5. Feature Engineering & Selection: Create and choose key features.
6. Model Training: Select algorithms and train models.
7. Hyperparameter Tuning & Validation: Optimize model settings.
8. Model Evaluation: Test model performance on unseen data.
9. Deployment: Launch the model in production.
10. Monitoring & Maintenance: Track and update the model as needed.

Follow for more!
