1. Supervised vs Unsupervised Learning
| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Definition | Learns from labeled data (input-output pairs) | Learns from unlabeled data |
| Objective | Predict outcomes or classify data | Discover hidden patterns or groupings |
| Example Algorithms | Linear Regression, SVM, Decision Trees | K-Means, PCA, Hierarchical Clustering |
| Example Use Case | Email spam detection (spam/not spam) | Customer segmentation in marketing |
2. Overfitting and Prevention Techniques
Overfitting occurs when a model learns the noise in the training data instead of the actual patterns,
resulting in poor generalization to new data.
Prevention Techniques:
1. Regularization: Adds a penalty to the loss function to constrain model complexity (e.g.,
L1/Lasso, L2/Ridge regularization).
2. Cross-validation: Helps ensure the model performs well on unseen data by splitting the data
into training and validation sets.
3. Curse of Dimensionality
Definition: As the number of features (dimensions) increases, the volume of the feature space grows
exponentially, making the data sparse and distance measures less meaningful.
Effect on Models:
• Increases computational cost
• Reduces model performance due to overfitting and sparse data
• Makes training harder and generalization poorer
Solution: Dimensionality reduction (e.g., PCA), feature selection.
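A minimal sketch of dimensionality reduction with PCA, assuming scikit-learn and NumPy are available; the random data below is only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative high-dimensional data: 200 samples, 50 features,
# but only 3 underlying directions carry most of the variance.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))                      # 3 "true" factors
X = latent @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(200, 50))

# Project onto the top principal components for a compact representation.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                   # (200, 50) -> (200, 3)
print("variance explained:", pca.explained_variance_ratio_.sum())
```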
4. Cost Function for Linear Regression and Gradient Descent
Cost Function (Mean Squared Error - MSE):
J(w, b) = (1/2m) Σ (ŷᵢ − yᵢ)², where ŷᵢ = w·xᵢ + b is the predicted value and m is the number of training examples.
Gradient Descent Update Rule:
w := w − α · ∂J/∂w and b := b − α · ∂J/∂b, where α is the learning rate; the updates are repeated until the cost converges.
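A small NumPy sketch of the MSE cost and the batch gradient-descent loop it drives for simple linear regression; the toy data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Toy data roughly following y = 2x + 1 with noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0          # parameters to learn
alpha = 0.01             # learning rate

for _ in range(2000):
    y_hat = w * x + b
    error = y_hat - y
    cost = (error ** 2).mean() / 2      # J(w, b) = 1/(2m) * sum of squared errors
    # Gradients of the MSE cost with respect to w and b.
    dw = (error * x).mean()
    db = error.mean()
    w -= alpha * dw
    b -= alpha * db

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}, final cost ≈ {cost:.3f}")
```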
5. Decision Trees vs Random Forests
| Aspect | Decision Tree | Random Forest |
| --- | --- | --- |
| Overfitting | Prone to overfitting | Less prone due to averaging over multiple trees |
| Accuracy | Moderate | Generally higher due to ensemble approach |
| Interpretability | High (easy to visualize) | Low (multiple trees are hard to interpret) |
| Speed | Fast training and prediction | Slower training due to many trees |
6. SVM Classifier (with Diagram)
Concept: SVM finds the hyperplane that best separates the classes with the maximum margin.
Key Points:
• Works well in high-dimensional space
• Supports kernels (e.g., linear, RBF) for non-linear classification
Diagram:
Class A (o), Class B (x)
 o       |       x
 o       |       x
---------|---------  <- Optimal Hyperplane
 o       |       x
 o       |       x
Support vectors are closest points to the hyperplane on either side.
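A hedged sketch of an SVM classifier using scikit-learn's SVC on made-up two-class data; the cluster locations and the choice of a linear kernel are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two separable clusters standing in for Class A (0) and Class B (1).
rng = np.random.default_rng(2)
class_a = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
class_b = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 20 + [1] * 20)

# Linear kernel finds the maximum-margin hyperplane; C controls the soft margin,
# and kernel="rbf" would allow non-linear boundaries instead.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors (points closest to the hyperplane):")
print(clf.support_vectors_)
print("prediction for (0.5, 0.5):", clf.predict([[0.5, 0.5]]))
```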
7. Cross-Validation
Definition: Technique to assess how the model will generalize to an independent dataset.
K-Fold Cross-Validation:
• Data is divided into K subsets.
• Model is trained on K-1 subsets and tested on the remaining.
• Repeated K times; average performance is reported.
Importance:
• Reduces risk of overfitting
• Provides a better estimate of model performance
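A short illustration of K-fold cross-validation with scikit-learn; the synthetic dataset and the logistic regression model are placeholders, not part of the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data stands in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold CV: train on 4 folds, test on the remaining fold, repeat 5 times.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```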
8. Gradient Descent Variants
| Type | Description | Pros | Cons |
| --- | --- | --- | --- |
| Batch Gradient Descent | Uses the entire dataset to compute gradients | Stable convergence | Slow for large datasets |
| Stochastic GD (SGD) | Uses one sample at a time | Fast updates, handles big data | Noisy updates, may overshoot |
| Mini-Batch GD | Uses a small batch of data (e.g., 32 or 64 samples) | Combines speed and stability | Requires batch size tuning |
9. Explain the working of K-means clustering algorithm with an example
K-means is an unsupervised learning algorithm used for clustering data into K distinct
groups based on feature similarity.
Steps:
1. Choose the number of clusters K.
2. Initialize K centroids randomly.
3. Assign each point to the nearest centroid (forming clusters).
4. Update the centroids as the mean of the points in each cluster.
5. Repeat steps 3–4 until centroids stop changing (convergence).
Example:
Given points: A(1,2), B(1,4), C(5,7), D(6,8)
If K=2:
• Initial centroids: A, C
• Cluster 1: A, B → new centroid = (1,3)
• Cluster 2: C, D → new centroid = (5.5, 7.5)
Repeat until centroids stabilize.
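A sketch of these steps in NumPy on the four points from the example, starting from centroids A and C; it should converge to the centroids (1, 3) and (5.5, 7.5) worked out above.

```python
import numpy as np

points = np.array([[1, 2], [1, 4], [5, 7], [6, 8]])     # A, B, C, D
centroids = points[[0, 2]].astype(float)                # start from A and C

for _ in range(10):
    # Step 3: assign each point to its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):           # Step 5: convergence check
        break
    centroids = new_centroids

print("labels:", labels)        # expected: [0 0 1 1] -> clusters {A, B} and {C, D}
print("centroids:", centroids)  # expected: (1, 3) and (5.5, 7.5)
```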
10. What is the role of activation functions in neural networks? Compare
ReLU, Sigmoid, and Tanh functions
Role:
Activation functions introduce non-linearity into the network, allowing it to learn complex
patterns.
Comparison:
Function Equation Range Pros Cons
𝜎(𝑥) = 11 + 𝑒 − 𝑥\𝑠𝑖𝑔𝑚𝑎(𝑥) Smooth,
Vanishing
𝑺𝒊𝒈𝒎𝒐𝒊𝒅 = \𝑓𝑟𝑎𝑐{1}{1 (0, 1) probabilistic
gradient, slow
+ 𝑒^{−𝑥}} output
𝑡𝑎𝑛ℎ (𝑥) = 𝑒𝑥 − 𝑒 − 𝑥𝑒𝑥 + 𝑒
− 𝑥\𝑡𝑎𝑛ℎ(𝑥)
(−1, Zero-centered Still vanishes
𝑻𝒂𝒏𝒉 = \𝑓𝑟𝑎𝑐{𝑒^𝑥
1) output gradient
− 𝑒^{−𝑥}}{𝑒^𝑥
+ 𝑒^{−𝑥}}
Fast Can die
𝑹𝒆𝑳𝑼 𝑓(𝑥) = 𝑚𝑎𝑥 (0, 𝑥)𝑓(𝑥) = \𝑚𝑎𝑥(0, 𝑥) [0, ∞) convergence, (neurons stuck
sparsity at 0)
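A plain-NumPy sketch of the three activation functions from the table; the sample input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # output in (0, 1)

def tanh(x):
    return np.tanh(x)                     # output in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)               # output in [0, inf), zero for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))
```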
11. How does a Naïve Bayes classifier work? What are its advantages and
limitations?
Working:
• Based on Bayes' Theorem:
P(C | X) = P(X | C) · P(C) / P(X)
• Assumes feature independence given the class.
• Selects the class with the highest posterior probability.
Advantages:
• Fast and simple
• Works well with high-dimensional data (e.g., text classification)
Limitations:
• Assumes independence among features (often unrealistic)
• Not ideal for correlated features
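A minimal text-classification sketch with scikit-learn's MultinomialNB, assuming a bag-of-words representation; the four example messages and their labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus; labels: 1 = spam, 0 = not spam.
texts = ["win free money now", "free prize click now",
         "meeting at noon tomorrow", "project report attached"]
labels = [1, 1, 0, 0]

# Bag-of-words features; MultinomialNB applies Bayes' theorem with the
# feature-independence assumption over word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, labels)

test = vectorizer.transform(["free money prize"])
print("predicted class:", clf.predict(test))           # expected: [1] (spam)
print("class probabilities:", clf.predict_proba(test))
```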
12. Discuss the significance of reinforcement learning and its real-world
applications
Reinforcement Learning (RL) is a type of learning where an agent learns by interacting
with an environment to maximize a reward signal.
Significance:
• Models sequential decision-making
• Learns optimal policies without labeled data
Real-world applications:
• Game playing (e.g., AlphaGo)
• Robotics (e.g., robotic arm control)
• Recommendation systems (dynamic content personalization)
• Autonomous vehicles (decision-making)
13. Explain the concept of overfitting and underfitting in machine learning.
How can they be prevented?
| Concept | Description | Prevention Techniques |
| --- | --- | --- |
| Overfitting | Model learns noise; high training accuracy, poor generalization | Regularization, cross-validation, pruning, dropout |
| Underfitting | Model is too simple; fails to learn underlying patterns | Use more complex models, add features, reduce bias |
14. Describe the k-Nearest Neighbors (k-NN) algorithm and its working with
an example
k-NN is a non-parametric algorithm used for classification and regression.
Working:
1. Choose k.
2. Compute the distance (e.g., Euclidean) from the query point to all data points.
3. Select the k nearest neighbors.
4. Majority vote (classification) or average (regression) to make prediction.
Example:
To classify a new point, if its 3 nearest neighbors are {dog, dog, cat}, k-NN predicts dog.
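A small NumPy sketch of these steps; the five training points and their dog/cat labels are made up to mirror the example.

```python
import numpy as np
from collections import Counter

def knn_predict(query, points, labels, k=3):
    # Step 2: Euclidean distance from the query to every training point.
    distances = np.linalg.norm(points - query, axis=1)
    # Step 3: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Illustrative training set matching the dog/dog/cat example.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
labels = ["dog", "dog", "dog", "cat", "cat"]

print(knn_predict(np.array([1.1, 1.0]), points, labels, k=3))   # -> "dog"
```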
15. What is the role of a confusion matrix in model evaluation? Explain its
components
A confusion matrix is a performance metric for classification tasks showing actual vs
predicted values.
Structure (for binary classification):
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Key Metrics:
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1-score = harmonic mean of precision and recall = 2 · (Precision · Recall) / (Precision + Recall)
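A short sketch computing these metrics with scikit-learn; the actual and predicted labels are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Illustrative actual vs predicted labels for a binary classifier.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN =", tp, fp, fn, tn)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))
```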
16. Compare and contrast Reinforcement Learning and Supervised Learning
with examples
| Aspect | Supervised Learning | Reinforcement Learning |
| --- | --- | --- |
| Data | Labeled data (input-output pairs) | No labeled data; a reward signal guides learning |
| Objective | Minimize error between prediction and truth | Maximize cumulative reward |
| Feedback | Instant and direct | Delayed and indirect |
| Example | Email classification | Robot navigating a maze |
17. Describe how Genetic Algorithms work in machine learning. Provide an
example use case
Genetic Algorithms (GAs) are optimization techniques inspired by natural selection.
Working Steps:
1. Initialization: Generate random population (solutions).
2. Selection: Choose best-performing individuals.
3. Crossover: Combine parts of two individuals.
4. Mutation: Randomly alter some genes.
5. Repeat until convergence.
Example Use Case:
• Feature selection: GA can optimize which subset of features yields best model
accuracy.
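A toy genetic-algorithm sketch in plain Python that evolves a bit string toward all ones; the population size, mutation rate, and fitness function are illustrative choices, not a feature-selection setup.

```python
import random

random.seed(0)

# Toy problem: evolve a 10-bit string toward all ones (fitness = number of ones).
GENES, POP, GENERATIONS, MUTATION_RATE = 10, 20, 50, 0.05

def fitness(individual):
    return sum(individual)

# 1. Initialization: random population of candidate solutions (bit strings).
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]

for _ in range(GENERATIONS):
    # 2. Selection: keep the better half of the population as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    children = []
    while len(children) < POP - len(parents):
        p1, p2 = random.sample(parents, 2)
        # 3. Crossover: splice two parents at a random cut point.
        cut = random.randint(1, GENES - 1)
        child = p1[:cut] + p2[cut:]
        # 4. Mutation: randomly flip some genes.
        child = [1 - g if random.random() < MUTATION_RATE else g for g in child]
        children.append(child)
    population = parents + children      # 5. Next generation; repeat until done.

best = max(population, key=fitness)
print("best individual:", best, "fitness:", fitness(best))
```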
18. What is the difference between Hard Margin and Soft Margin in Support
Vector Machines (SVMs)?
• Hard Margin SVM:
o Assumes data is linearly separable.
o Finds a hyperplane that perfectly separates the classes with maximum
margin.
o No tolerance for misclassification.
o May fail if data is not perfectly separable.
• Soft Margin SVM:
o Allows some misclassifications using a penalty parameter C.
o Balances between maximizing margin and minimizing classification error.
o More robust in the presence of noise or overlapping classes.
19. Explain the structure of an Artificial Neural Network (ANN) with a
suitable diagram.
• Structure:
o Input Layer: Accepts the feature set.
o Hidden Layers: Perform transformations using weights, biases, and activation
functions.
o Output Layer: Produces the prediction (classification or regression).
Diagram:
Input Layer → Hidden Layer(s) → Output Layer
   o             o
   o      →      o      →      o
   o             o
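A NumPy sketch of one forward pass through such a network; the 3-4-1 layer sizes, random weights, and sigmoid activations are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One forward pass through a 3-4-1 network (3 inputs, 4 hidden units, 1 output).
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.8])                  # input layer: the feature vector

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer weights and biases

hidden = sigmoid(W1 @ x + b1)                   # hidden layer: weighted sum + activation
output = sigmoid(W2 @ hidden + b2)              # output layer: prediction in (0, 1)

print("hidden activations:", hidden)
print("network output:", output)
```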
20. Describe the concept of backpropagation and how it is used to train neural
networks.
Backpropagation is the algorithm used to train neural networks by updating weights to
minimize the error.
• Steps:
1. Forward pass: Compute output using current weights.
2. Loss computation: Calculate error using a loss function.
3. Backward pass: Compute gradients of the loss with respect to each weight.
4. Update weights: Use gradient descent to adjust weights.
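A minimal NumPy sketch of these four steps, training a single sigmoid neuron on the OR function; the learning rate, loss choice (MSE), and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny dataset: learn the OR function with a single sigmoid neuron.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w, b, alpha = np.zeros(2), 0.0, 0.5

for _ in range(2000):
    # 1. Forward pass: compute predictions with current weights.
    y_hat = sigmoid(X @ w + b)
    # 2. Loss computation: mean squared error between prediction and target.
    loss = ((y_hat - y) ** 2).mean()
    # 3. Backward pass: chain rule gives the gradient of the loss w.r.t. w and b
    #    (per-sample error times the sigmoid derivative, up to a constant factor).
    delta = (y_hat - y) * y_hat * (1 - y_hat)
    grad_w = X.T @ delta / len(y)
    grad_b = delta.mean()
    # 4. Update weights with gradient descent.
    w -= alpha * grad_w
    b -= alpha * grad_b

print("loss:", round(loss, 4), "predictions:", np.round(sigmoid(X @ w + b), 2))
```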
21. Discuss the role of regularization in machine learning. Explain L1 (Lasso)
and L2 (Ridge) regularization with their impact on model performance.
Regularization prevents overfitting by adding a penalty term to the loss function, which constrains the size of the model weights.
• L1 (Lasso) Regularization:
o Adds the penalty λ Σ |wᵢ| to the loss.
o Can shrink some weights to exactly zero (feature selection).
o Leads to sparse models.
• L2 (Ridge) Regularization:
o Adds the penalty λ Σ wᵢ² to the loss.
o Penalizes large weights without eliminating them.
o Promotes weight smoothing.
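A brief scikit-learn comparison of ordinary least squares, Ridge, and Lasso coefficients on synthetic data where only three features actually matter; the alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first 3 of 10 features influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)      # L2: shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)      # L1: drives irrelevant weights to exactly zero

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```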
22. Explain the K-means clustering algorithm. How do you determine the
optimal number of clusters?
• K-means Clustering:
o Clusters data into K groups based on feature similarity.
o Uses distance measures to assign points and update centroids.
• Determining K:
o Use the Elbow Method: Plot the within-cluster sum of squares (WCSS) for
various values of K and choose the point where the curve elbows.
o Silhouette Score: Measures how similar a point is to its cluster compared to
other clusters.
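A short elbow-method sketch using scikit-learn's KMeans, whose inertia_ attribute is the WCSS; the synthetic three-cluster data is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}: WCSS = {km.inertia_:.1f}")
# Plotting WCSS against K, the curve drops sharply up to K=3 and then flattens:
# that "elbow" suggests K=3 clusters.
```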
23. What are the advantages of Hierarchical Clustering over K-means
clustering? Explain Agglomerative Hierarchical Clustering with an example.
Advantages of Hierarchical Clustering:
• No need to predefine number of clusters.
• Dendrogram gives a clear visual representation.
• Can capture nested clusters.
Agglomerative Clustering:
• Bottom-up approach.
• Each point starts as its own cluster.
• Iteratively merge the closest pair of clusters until one cluster remains.
Example:
• Given points: A, B, C, D
• Initially: {A}, {B}, {C}, {D}
• Merge closest: {A, B}, {C, D}
• Continue until one cluster remains.
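A small SciPy sketch of agglomerative clustering on four points standing in for A, B, C, D; the single-linkage criterion and the coordinates are arbitrary choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four points standing in for A, B, C, D; A-B and C-D are close pairs.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])   # A, B, C, D

# Agglomerative (bottom-up) clustering: each merge step is one row of Z.
Z = linkage(points, method="single")
print(Z)   # each row: [cluster_i, cluster_j, merge distance, new cluster size]

# Cutting the tree into 2 clusters recovers {A, B} and {C, D}.
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster labels for A, B, C, D:", labels)
```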
24. What is Backpropagation? Derive the weight update equation in a simple
neural network using gradient descent. Compare CNN and RNN.
• Backpropagation:
o Computes the gradient of the loss function w.r.t weights.
o Uses chain rule to propagate errors from output to input layers.
• Weight Update Equation:
o w := w − α · ∂L/∂w, where α is the learning rate and ∂L/∂w is the gradient of the loss with respect to the weight.
Comparison: CNN vs RNN
| Feature | CNN | RNN |
| --- | --- | --- |
| Used For | Image, spatial data | Sequential data, time series |
| Memory | No memory of past input | Remembers previous input |
| Architecture | Convolutional + pooling layers | Loops through time with hidden state |
| Example | Image classification | Text generation, language modeling |
25. Explain the structure and working of an Artificial Neural Network (ANN)
with a suitable diagram. How is ML used in Disaster Management System?
Discuss with case study.
• ANN Structure:
o Same as explained in Q19.
ML in Disaster Management:
• Use cases:
o Predict floods, earthquakes, forest fires using satellite data and historical
patterns.
o Optimizing resource allocation during disasters.
Case Study:
• Google AI for Flood Forecasting:
o Uses ML + satellite imagery to predict river water levels and warn populations
in flood-prone areas (India, Bangladesh).
o Achieved better accuracy and earlier alerts than traditional systems.
26. What are the key areas where we engage Machine Learning as a tool to get
advantages in our daily life? How is ML used in healthcare for disease
prediction? Discuss with case studies.
Key Areas:
• Voice assistants (e.g., Alexa, Siri)
• Recommendation systems (Netflix, YouTube)
• Smart home devices
• Spam filtering
• Language translation
ML in Healthcare:
• Disease prediction using medical history, lab results, imaging.
Case Study 1:
• IBM Watson: Analyzes medical records and literature to assist in cancer treatment
recommendations.
Case Study 2:
• Diabetes Prediction using logistic regression, decision trees, and neural networks
trained on patient data to forecast onset.
27. Explain the working of the Linear Regression model and derive its cost function using Mean
Squared Error (MSE).
28. Explain the advantages and disadvantages of Naïve Bayes Classifier.
Advantages:
• Simple, fast, and efficient.
• Performs well on text classification problems (e.g., spam detection).
• Works well with high-dimensional data.
Disadvantages:
• Assumes independence among features, which is rarely true.
• Poor estimates for correlated features.
• Zero-frequency problem: if a category is missing in training data, it leads to zero probability.
29. Consider a dataset with the following points and labels:
| Point (X, Y) | Class |
| --- | --- |
| (2, 4) | A |
| (5, 8) | B |
| (1, 3) | A |
| (6, 9) | B |
| (3, 6) | A |
A new data point (4, 7) needs classification using K = 3.
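A sketch of the classification using plain Euclidean k-NN; with K = 3 the nearest neighbors are (5, 8) B, (3, 6) A, and (6, 9) B, so the majority vote is class B.

```python
import numpy as np
from collections import Counter

points = np.array([[2, 4], [5, 8], [1, 3], [6, 9], [3, 6]], dtype=float)
labels = ["A", "B", "A", "B", "A"]
query = np.array([4.0, 7.0])

# Euclidean distance from (4, 7) to each training point.
distances = np.linalg.norm(points - query, axis=1)
for p, lab, d in zip(points, labels, distances):
    print(f"{tuple(p)} class {lab}: distance {d:.2f}")

# Take the K = 3 nearest neighbors and majority-vote their classes.
nearest = np.argsort(distances)[:3]
votes = Counter(labels[i] for i in nearest)
print("3 nearest classes:", [labels[i] for i in nearest], "->", votes.most_common(1)[0][0])
```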
30. State the difference between Random Forest and Decision Tree method.
| Feature | Decision Tree | Random Forest |
| --- | --- | --- |
| Definition | A single tree-based model | An ensemble of multiple decision trees |
| Overfitting | Prone to overfitting | Reduced overfitting due to averaging |
| Accuracy | Lower on unseen data | Higher due to ensemble voting |
| Interpretability | Easy to interpret | Harder to interpret (many trees) |
| Performance | Fast but unstable | Slower but more stable and accurate |
31. K-Means Clustering – One Iteration
Dataset:
| X | Y |
| --- | --- |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 8 | 9 |
| 9 | 10 |
| 10 | 11 |
Initial centroids:
• Centroid 1 (C1) = (2, 3)
• Centroid 2 (C2) = (9, 10)
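A NumPy sketch of one assignment-and-update iteration from these initial centroids; the updated centroids come back as (2, 3) and (9, 10), so the algorithm converges immediately for this data.

```python
import numpy as np

points = np.array([[1, 2], [2, 3], [3, 4], [8, 9], [9, 10], [10, 11]], dtype=float)
centroids = np.array([[2, 3], [9, 10]], dtype=float)   # C1 and C2

# Assignment step: each point goes to its nearest centroid.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)
print("assignments (0 = C1, 1 = C2):", labels)          # first three -> C1, last three -> C2

# Update step: recompute each centroid as the mean of its assigned points.
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
print("updated centroids:", new_centroids)               # (2, 3) and (9, 10) again
```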
32. A bank wants to classify whether a customer should get a loan based on income level and
credit score.
Dataset:
| Income | Credit Score | Loan Approved? |
| --- | --- | --- |
| High | Good | Yes |
| Low | Good | No |
| High | Bad | No |
| Medium | Good | Yes |
| Medium | Bad | No |
33. Explain Linear and Nonlinear Regression Method.
• Linear Regression assumes a linear relationship: y = wx + b
• Nonlinear Regression models more complex curves, e.g., exponential, polynomial, etc.
Example:
• Linear: House price vs square footage
• Nonlinear: Population growth over time
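A NumPy sketch contrasting a straight-line fit with a quadratic (polynomial) fit on data that follows a curve; the generated data is illustrative.

```python
import numpy as np

# Illustrative data following a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - x + 3 + rng.normal(0, 2, size=50)

linear_fit = np.polyfit(x, y, deg=1)    # y = wx + b (linear regression)
quad_fit = np.polyfit(x, y, deg=2)      # y = ax^2 + bx + c (nonlinear in x)

for name, coefs in [("linear", linear_fit), ("quadratic", quad_fit)]:
    residuals = y - np.polyval(coefs, x)
    print(f"{name} fit coefficients: {np.round(coefs, 2)}, MSE = {np.mean(residuals**2):.2f}")
```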
34. Difference between Clustering and Classification in ML
| Feature | Classification | Clustering |
| --- | --- | --- |
| Labels | Labeled data (supervised) | Unlabeled data (unsupervised) |
| Objective | Predict categories | Discover structure |
| Example | Email spam detection | Customer segmentation |
35. Spam Classification Dataset
"Free" "Win" Spam?
Yes No Yes
No Yes No
Yes Yes Yes
No No No
Yes No Yes
36. Key features of Naïve Bayes Classifier
• Assumes feature independence
• Based on Bayes theorem
• Efficient and simple
• Works well on high-dimensional data
• Common in text classification
37. Fraud Detection using Naïve Bayes
| Amount | Suspicious | Fraud (Yes) | Fraud (No) |
| --- | --- | --- | --- |
| High | Yes | 8 | 2 |
| High | No | 5 | 4 |
| Medium | Yes | 6 | 5 |
| Medium | No | 3 | 7 |
| Low | Yes | 3 | 8 |
| Low | No | 2 | 9 |
Total Yes = 27, No = 35
1. Prior Probabilities:
P(Yes) = 27/62, P(No) = 35/62
2. Likelihoods:
• P(Medium | Yes) = 6/27, P(Suspicious = Yes | Yes) = (8 + 6 + 3)/27 = 17/27
• P(Medium | No) = 5/35, P(Suspicious = Yes | No) = (2 + 5 + 8)/35 = 15/35
3. Posterior (proportional):
• Score(Yes) ∝ P(Yes) · P(Medium | Yes) · P(Suspicious = Yes | Yes)
• Score(No) ∝ P(No) · P(Medium | No) · P(Suspicious = Yes | No)
Compare the two scores and assign the class with the larger value, as in the sketch below.
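A short computation of the two scores for a transaction with Amount = Medium and Suspicious = Yes, using the priors and likelihoods above.

```python
# Posterior (proportional) scores for Amount = Medium, Suspicious = Yes.
p_yes, p_no = 27 / 62, 35 / 62                         # priors from the table totals

p_medium_given_yes, p_medium_given_no = 6 / 27, 5 / 35
p_susp_given_yes, p_susp_given_no = 17 / 27, 15 / 35   # (8 + 6 + 3)/27 and (2 + 5 + 8)/35

score_yes = p_yes * p_medium_given_yes * p_susp_given_yes
score_no = p_no * p_medium_given_no * p_susp_given_no

print(f"Fraud=Yes score: {score_yes:.4f}")   # ~0.0609
print(f"Fraud=No  score: {score_no:.4f}")    # ~0.0346
print("classified as:", "Fraud" if score_yes > score_no else "Not Fraud")
```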
38. Define in context of DBSCAN
1. Core points: have at least minPts neighbors within radius eps.
2. Border points: lie within eps of a core point but are not core points themselves.
3. Noise points: not within eps of any core point.
Density reachability helps form clusters by connecting points that are density-reachable via a chain
of core points.
39. Apriori Algorithm & Association Rule Mining
• Apriori Algorithm: Identifies frequent itemsets in transactional data to form rules.
• Support: Proportion of transactions that contain an itemset.
• Confidence: Likelihood that item Y is bought when item X is bought.
• Lift: How much more likely Y is bought with X compared to random chance.
Pruning: Eliminate candidate sets that do not meet minimum support.
Apriori Property: A superset of an infrequent itemset cannot be frequent.
This reduces computational complexity by limiting the number of itemsets considered.
Optimization: Use hash trees, vertical data format, or FP-Growth to reduce candidate generation.
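A plain-Python sketch computing support, confidence, and lift for one rule on a tiny invented transaction set (not a full Apriori implementation).

```python
# Illustrative transactions for association rule mining.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    # Proportion of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

# Rule: bread -> milk
s_bread = support({"bread"})
s_milk = support({"milk"})
s_both = support({"bread", "milk"})

confidence = s_both / s_bread            # P(milk | bread)
lift = confidence / s_milk               # how much bread raises the chance of milk

print(f"support(bread, milk) = {s_both:.2f}")
print(f"confidence(bread -> milk) = {confidence:.2f}")
print(f"lift(bread -> milk) = {lift:.2f}")
```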