Ans:
Formula: IQR = Q3 − Q1
Outlier Determination:
Lower Bound: Q1−1.5×IQR
Upper Bound: Q3+1.5×IQR
Data points outside these bounds are considered outliers.
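As a quick illustration (a minimal Python sketch with hypothetical values), the IQR rule can be applied as follows:
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([12, 15, 14, 10, 18, 16, 13, 95])
q1, q3 = np.percentile(data, [25, 75])          # first and third quartiles
iqr = q3 - q1                                   # IQR = Q3 - Q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier bounds
outliers = data[(data < lower) | (data > upper)]
print("Bounds:", lower, upper, "Outliers:", outliers)   # 95 is flagged as an outlier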
16. Find the probability of throwing two fair dice when the sum is 5 and when the
sum is 8. Show your calculation. (Long Answer)
Ans: Each die has 6 faces, so there are 6 × 6 = 36 equally likely outcomes.
Sum = 5: the favorable outcomes are (1,4), (2,3), (3,2), (4,1), so P(sum = 5) = 4/36 = 1/9 ≈ 0.111.
Sum = 8: the favorable outcomes are (2,6), (3,5), (4,4), (5,3), (6,2), so P(sum = 8) = 5/36 ≈ 0.139.
17. Compare quantitative data and qualitative data. (Long Answer)
Ans:
Definition: Quantitative data can be measured and expressed numerically; qualitative data describes qualities or characteristics.
Type: Quantitative data is numeric (e.g., height, weight, temperature); qualitative data is categorical (e.g., colors, names, labels).
Examples: Quantitative data includes age, salary, distance, and time; qualitative data includes gender, nationality, and type of car.
Measurement Scale: Quantitative data uses interval or ratio scales (e.g., distance, temperature); qualitative data uses nominal or ordinal scales (e.g., yes/no, ratings).
Analysis: Quantitative data supports mathematical operations (e.g., mean, median, variance); qualitative data uses non-numeric analysis (e.g., frequency, mode, grouping).
Graphical Representation: Quantitative data uses histograms, bar graphs, and line charts; qualitative data uses pie charts, bar graphs, and word clouds.
18. Define covariance. Analyze how it helps in understanding the direction and
strength of the relationship between two variables. (Long Answer)
Ans:Covariance is a statistical measure that quantifies the relationship between two
variables, showing how they change together. It indicates whether an increase in one
variable leads to an increase or decrease in another.
Formula: Cov(X, Y) = (1/n) × Σ (Xi − X̄)(Yi − Ȳ), where the sum runs over i = 1 to n
Where:
X and Y are the two variables,
X̄ and Ȳ are the means of X and Y,
n is the number of data points.
Significance of Covariance:Covariance helps in understanding both the direction
and strength of the relationship between two variables.
Direction of Relationship:
Positive Covariance: If the covariance is positive, it indicates that as one variable
increases, the other tends to increase as well (positive relationship).
Negative Covariance: If the covariance is negative, it suggests that as one variable
increases, the other tends to decrease (negative relationship).
Zero Covariance: A covariance close to zero suggests no linear relationship between
the variables.
Strength of Relationship:
Magnitude: The larger the magnitude of the covariance, the stronger the
relationship between the variables. However, covariance alone does not provide a
standardized measure of strength. The value depends on the scale of the variables, so
comparing covariances across datasets with different units can be misleading.
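As a short numeric illustration (a minimal sketch with made-up values), covariance can be computed by hand or with NumPy:
import numpy as np

x = np.array([2, 4, 6, 8])   # hypothetical variable X
y = np.array([1, 3, 5, 7])   # hypothetical variable Y

# Population covariance: average of (Xi - mean of X)(Yi - mean of Y)
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))
# np.cov divides by n-1 by default; bias=True gives the 1/n version in the formula above
cov_np = np.cov(x, y, bias=True)[0, 1]
print(cov_manual, cov_np)   # both equal 5.0, a positive covariance (x and y rise together)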
19. A distribution is skewed to the right and has a median of 20. Will the mean be
greater than or less than 20? Explain briefly. (Long Answer)
Ans:In a right-skewed distribution (positively skewed), the mean is pulled in the direction
of the longer tail on the right side. Since the median (which represents the central value) is
given as 20, the mean will be greater than 20 because the higher values in the tail raise the
average.
This happens because extreme values on the right increase the mean more than they affect the
median.
20. Define one-sample t-test. Explain when it is used in statistical analysis. (Long
Answer)
Ans:A one-sample t-test is a statistical test used to determine whether the mean of a
sample is significantly different from a known or hypothesized population mean. It
compares the sample mean to the population mean to assess if any observed difference
is statistically significant or if it could have occurred due to random chance.
When to Use a One-Sample T-Test:
The one-sample t-test is used in the following scenarios:
When you have a sample and you want to compare its mean to a known
population mean (or hypothesized value).
When the population standard deviation is unknown and the sample size is
small (typically n < 30).
When the data is approximately normally distributed, as the t-test assumes
normality for the sample data, especially when the sample size is small.
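A minimal sketch of running a one-sample t-test in Python with scipy (the scores below are hypothetical):
from scipy import stats
import numpy as np

# Hypothetical sample of exam scores; test whether the mean differs from 70
sample = np.array([72, 68, 75, 71, 69, 74, 73, 70, 76, 67])
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)
print("t =", round(t_stat, 3), "p =", round(p_value, 3))
# If p < 0.05, the sample mean differs significantly from the hypothesized mean of 70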
21. Apply your understanding of outliers to identify situations where retaining
them is appropriate. Support your response with relevant examples. (Long
Answer)
Ans:While outliers are often removed to ensure accurate analysis, there are situations
where retaining them is necessary and informative. Here are some scenarios:
True Extreme Values:
Example: In financial analysis, extreme stock prices or large transactions may be
legitimate and critical to understanding market behavior.
Reason: Removing these outliers could distort trends and risk analysis, as they
represent rare but significant events.
Scientific Research:
Example: In genetics research, some outliers may represent important variations or
mutations that offer insight into rare diseases.
Reason: Outliers could reveal important patterns or rare phenomena crucial to the
study.
Fraud Detection:
Example: In banking, abnormally large withdrawals could be outliers, but they
might indicate fraudulent activity.
Reason: Retaining these outliers can help detect suspicious behaviors or fraud.
Process Monitoring:
Example: In manufacturing, a machine failure that causes a spike in data could be an
outlier but indicates a problem needing attention.
Reason: The outlier could point to a process anomaly that requires corrective
measures.
22. List the key properties of a normal distribution. Analyze any two of these
properties by explaining how they affect the shape and behavior of the distribution.
(Long Answer)
Ans:Key Properties of a Normal Distribution:
Symmetry: The normal distribution is symmetrical around its mean.
Bell-Shaped Curve: It has a bell-shaped curve, with the highest point at the mean,
and tails that extend infinitely in both directions.
Mean, Median, and Mode Equality: In a normal distribution, the mean, median, and
mode are all the same.
68-95-99.7 Rule:
68% of the data lies within one standard deviation of the mean,
95% lies within two standard deviations,
99.7% lies within three standard deviations.
Asymptotic: The tails of the distribution approach the horizontal axis but never touch
it.
Defined by Mean and Standard Deviation: The shape of the normal distribution is
fully defined by its mean (μ) and standard deviation (σ).
Analysis of Two Properties:
1. Symmetry:
Impact on Shape: The symmetry of a normal distribution means that the curve is
identical on both sides of the mean. This property leads to a balanced distribution of
data, where half of the values are less than the mean and the other half are greater.
Behavior: Since the distribution is symmetrical, it implies that the probability of
observing a value above the mean is equal to the probability of observing a value
below the mean. This is crucial in hypothesis testing and confidence interval
estimation.
2. 68-95-99.7 Rule:
Impact on Shape: This rule indicates that the majority of the data is concentrated
around the mean. For a standard normal distribution (mean = 0, standard deviation =
1), 68% of the data falls within one standard deviation, making the curve steepest at
the mean and flatter as you move away from it.
Behavior: This property shows that most values lie close to the mean, and as you
move further from the mean, the probability of observing values decreases rapidly.
This is important in predicting probabilities and understanding the spread of data
within a normal distribution.
23. Explain the different stages of data science. (Long Answer)
Ans: A typical data science workflow moves through the following stages:
Problem Definition: Understanding the business or research question to be answered.
Data Collection: Gathering data from sources such as databases, files, APIs, or surveys.
Data Preprocessing/Cleaning: Handling missing values, duplicates, errors, and noise.
Exploratory Data Analysis (EDA): Summarizing and visualizing the data to discover patterns and relationships.
Modeling: Applying statistical or machine learning models to the prepared data.
Evaluation: Assessing model performance with suitable metrics (e.g., accuracy, RMSE).
Deployment and Monitoring: Putting the model into use and tracking its performance over time.
24. What is machine learning? Justify its role and importance in data science with
a brief explanation. (Long Answer)
Ans:Machine Learning (ML) is a subset of artificial intelligence that enables
systems to learn from data and make predictions or decisions without being explicitly
programmed.
Role and Importance in Data Science:
Core Analytical Tool: ML provides powerful algorithms (like regression,
classification, clustering) to extract patterns from data.
Automates Decision-Making: It helps build models that can automatically predict
outcomes (e.g., customer churn, sales forecasts).
Handles Large Data: ML efficiently processes and learns from massive, complex
datasets that are difficult to analyze manually.
Supports Real-Time Applications: Powers applications like recommendation
systems, fraud detection, and image recognition.
25. What is data preprocessing? Explain its role in data analysis and describe
common methods used for preprocessing data. (Long Answer)
Ans:Data preprocessing is the process of cleaning and transforming raw data into a
suitable format for analysis or modeling.
Role in Data Analysis:
Ensures data quality and consistency.
Removes errors, missing values, and noise.
Enhances the accuracy and performance of models.
Makes data ready for statistical or machine learning tasks.
Common Methods of Data Preprocessing:
Missing Value Treatment – Filling or removing missing data.
Data Cleaning – Removing duplicates, correcting errors.
Normalization/Scaling – Standardizing feature ranges.
Encoding Categorical Data – Converting categories into numerical values (e.g.,
one-hot encoding).
Outlier Detection and Handling – Identifying and treating extreme values.
Data Transformation – Log, square root, or other transformations to reduce
skewness.
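A minimal preprocessing sketch in Python (the toy data and column names are hypothetical) combining several of these steps:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    'age': [25, 32, None, 45],
    'salary': [30000, 52000, 47000, 88000],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']
})

df['age'] = df['age'].fillna(df['age'].median())   # missing value treatment
df = pd.get_dummies(df, columns=['city'])          # encode categorical data (one-hot)
df[['age', 'salary']] = StandardScaler().fit_transform(df[['age', 'salary']])  # scaling
print(df)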
MODULE 2
1. Define a population and a sample in the context of statistical analysis. (Short
Answer)
Ans: Population:A population is the complete set of individuals or items of interest
(e.g., all students in a university).
Sample:A sample is a subset drawn from the population used for analysis (e.g., 200
selected students).
Sampling allows for efficient estimation of population characteristics.
2. List any two types of probability distributions and provide a brief example of each.
(Short Answer)
Ans: Binomial Distribution: This distribution is used when there are two possible
outcomes (success or failure) in a fixed number of independent trials.
Example: Tossing a coin 10 times and counting the number of heads.
Normal Distribution: This is a continuous distribution that is symmetric about the
mean, often called a bell curve.
Example: Heights of adult humans typically follow a normal distribution.
3. Explain the difference between a parameter and a statistic with an example.
(Short Answer)
Ans:
Definition: A parameter is a value that describes a characteristic of a population; a statistic is a value that describes a characteristic of a sample.
Based on: A parameter is based on the entire population; a statistic is based on a subset (sample) of the population.
Example: Parameter: the average height of all students in a school. Statistic: the average height of 50 randomly selected students.
16. Evaluate the roles of descriptive and inferential statistics in data analysis.
Justify the significance of each through practical examples. (Long Answer)
Ans:1. Descriptive Statistics
Role:
Descriptive statistics are used to summarize, organize, and present data in a
meaningful way. They help to describe the basic features of a dataset, offering a clear
picture of what the data shows without making any generalizations or predictions
beyond the data itself.
Example:
Suppose a teacher collects test scores from her class of 30 students. She calculates the
average (mean) score as 75, the highest score as 95, and the standard deviation as 10.
This summary gives her a snapshot of class performance—this is descriptive
statistics in action.
Significance:
Descriptive statistics help in:
Understanding patterns in data
Identifying outliers or trends
Presenting data clearly for reporting or communication
2. Inferential Statistics
Role:
Inferential statistics go a step further by using sample data to make predictions or
inferences about a larger population. This is essential when it's impractical or
impossible to collect data from every member of a population.
Example:
A political analyst surveys 1,000 people out of a population of 1 million to estimate
voting preferences. Based on the sample, the analyst infers that 60% of the population
supports a particular candidate. This is a classic case of inferential statistics.
Significance:
Inferential statistics are vital for:
Making decisions under uncertainty
Predicting future trends or behaviors
Testing theories or assumptions based on limited data
17. Imagine that Jeremy took part in an examination. The test is having a mean score
of 160, and it has a standard deviation of 15. If Jeremy’s z-score is 1.20, Evaluate his
score on test. (Long Answer)
Ans: The z-score formula is z = (x − μ) / σ, so x = μ + z × σ.
Here x = 160 + 1.20 × 15 = 160 + 18 = 178.
Jeremy's score on the test is 178.
18. Analyze a left-skewed distribution with a median of 60 to draw conclusions
about the relative positions of the mean and the mode. Explain the relationship
among these measures of central tendency based on the skewness of the
distribution. (Long Answer)
Ans:In a left-skewed distribution (also called negatively skewed), the tail of the
distribution extends more toward the lower values (left side). Given that the median is
60, here's how the mean and mode relate to it:
Mean: The mean will be less than the median in a left-skewed distribution.
This is because the long left tail pulls the mean down. The mean is sensitive to
extreme values, so outliers on the lower end will reduce the mean value.
Mode: The mode, which is the most frequent value in the dataset, is usually
greater than the median in a left-skewed distribution. This is because the
majority of the data points are clustered toward the higher values on the right
side of the distribution.
Summary of the relationship:
Mode > Median > Mean in a left-skewed distribution.
The mean is pulled to the left by the skewness, making it the smallest measure
of central tendency, while the mode is the largest, as it is typically positioned
near the peak of the distribution.
19. Evaluate the effect of outliers on statistical measures such as mean and
standard deviation. Justify when and why outliers may distort data interpretation
using examples. (Long Answer)
Ans:Outliers can significantly impact statistical measures like the mean and standard
deviation:
Mean: The mean is sensitive to outliers because it's calculated by summing all values.
Outliers can skew the mean toward extreme values, making it unrepresentative of the
majority of the data.
Example: In the dataset (10, 12, 14, 16, 1000), the mean is 210.4, far above the typical
values of 10-16, because the single outlier dominates the sum and makes the mean misleading.
Standard Deviation: Outliers increase the spread of data, leading to a higher standard
deviation. This makes it seem like the data is more variable than it actually is.
Example: In datasets with and without outliers, the dataset with an outlier will show a
much higher standard deviation, which could distort interpretations of data variability.
When and Why Outliers Distort Data Interpretation:
Skewed Conclusions: Outliers can lead to skewed interpretations, especially when
making decisions based on the mean or standard deviation. For instance, if a business
is using the mean income to set salary levels, outliers could lead to unreasonably high
salaries.
Misleading Comparisons: When comparing datasets or groups, outliers can give the
false impression that one group has more variability or a higher average than it
actually does.
Data Cleaning: In many cases, it's important to identify and decide whether outliers
should be removed or treated separately. They may represent important but rare
occurrences (e.g., a $100,000,000 income is a legitimate but rare outlier in an income
dataset), but they should not unduly influence overall statistical analyses.
20. Analyze the structure and components of a confusion matrix in classification
tasks. Interpret its elements (TP, FP, TN, FN) with an example to assess model
performance. (Long Answer)
Ans:A confusion matrix is a table used to evaluate the performance of a classification
model. It compares the predicted labels with the actual labels, showing how many
correct and incorrect predictions were made. The confusion matrix is typically
structured as follows:
Components of a Confusion Matrix:
True Positive (TP): The number of correct predictions where the model correctly
predicts the positive class.
False Positive (FP): The number of incorrect predictions where the model incorrectly
predicts the positive class (type I error).
True Negative (TN): The number of correct predictions where the model correctly
predicts the negative class.
False Negative (FN): The number of incorrect predictions where the model
incorrectly predicts the negative class (type II error).
Example:
Consider a binary classification model used to predict whether a person has a disease
(Positive) or not (Negative), with actual values (True labels) and predicted values
(Predicted labels):
Actual \ Predicted Positive (1) Negative (0)
Positive (1) 80 (TP) 10 (FN)
Negative (0) 15 (FP) 95 (TN)
Interpretation:
True Positive (TP = 80): The model correctly predicted 80 cases of the disease.
False Negative (FN = 10): The model incorrectly predicted that 10 people did not
have the disease when they actually did (missed diagnoses).
False Positive (FP = 15): The model incorrectly predicted that 15 people had the
disease when they did not (false alarms).
True Negative (TN = 95): The model correctly predicted that 95 people did not have
the disease.
Model Performance Metrics from the Confusion Matrix:
From the confusion matrix, we can derive various performance metrics to evaluate the
model:
Accuracy: The proportion of correctly predicted instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision (Positive Predictive Value): The proportion of positive predictions that are
actually correct.
Precision = TP / (TP + FP)
Recall (Sensitivity, True Positive Rate): The proportion of actual positives that are
correctly identified.
Recall = TP / (TP + FN)
F1 Score: The harmonic mean of precision and recall, balancing both metrics.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
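Using the counts from the example above (TP = 80, FN = 10, FP = 15, TN = 95), these metrics can be computed directly in Python:
TP, FN, FP, TN = 80, 10, 15, 95

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
# Accuracy = 0.875, Precision ≈ 0.842, Recall ≈ 0.889, F1 ≈ 0.865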
21. Analyze the role of sampling in statistical investigations by comparing different
sampling techniques. Illustrate each method with examples and assess their
applicability in various scenarios. (Long Answer)
Ans:1. Simple Random Sampling (SRS)
Description: Every individual in the population has an equal chance of being selected.
Example: A researcher randomly selects 100 students from a school’s list of 500
students to study their academic performance.
Applicability: Ideal for situations where the population is homogeneous and every
member has an equal chance of being selected. Works well when there is no clear
subgroup structure in the population.
Pros:
Easy to understand and implement.
Results in an unbiased sample.
Cons:
Can be impractical for large populations.
May not represent smaller subgroups well.
2. Systematic Sampling
Description: A sample is selected by choosing every kth individual from the
population after selecting a random starting point.
Example: A researcher decides to survey every 10th customer in a store’s queue,
starting from a random point.
Applicability: Useful when a list of the population is available, and the researcher
wants a quick and simple sampling method. It works well when there’s no hidden
pattern in the population.
Pros:
Simple and faster than simple random sampling.
Ensures even coverage of the population.
Cons:Can be biased if there's an underlying pattern in the population (e.g., a queue
system where every 10th customer has the same characteristics).
3. Stratified Sampling
Description: The population is divided into subgroups (strata) based on specific
characteristics (e.g., age, gender, income level), and a sample is randomly selected
from each stratum.
Example: A researcher wants to survey employees about job satisfaction in a
company. The company is divided into strata based on departments (HR, marketing,
finance), and a random sample is taken from each department.
Applicability: Ideal when the population has distinct subgroups that may vary
significantly, and the researcher wants to ensure representation from all subgroups.
Pros:Ensures proportional representation of all key subgroups.
Cons:More complex to implement and requires knowledge about the population
structure.
4. Cluster Sampling
Description: The population is divided into clusters (often geographically), and then a
random sample of clusters is selected. All members of the chosen clusters are
surveyed.
Example: A researcher wants to study household energy usage across a country. They
randomly select 10 cities (clusters) and survey all households in those cities.
Applicability: Useful when the population is geographically spread out, or when it’s
difficult to compile a complete list of the population. Often used in large-scale surveys
or government studies.
Pros:
Cost-effective and easier to manage when the population is geographically
dispersed.
Requires fewer resources than surveying a random sample from the entire
population.
Cons:
May introduce bias if the clusters are not homogeneous.
Less precise than stratified sampling.
5. Convenience Sampling
Description: The sample is selected based on what is easiest for the researcher (e.g.,
surveying people nearby or accessible).
Example: A researcher surveys students in a class to understand their opinions about
online education.
Applicability: Often used in exploratory research or when time, cost, or access to the
population is limited.
Pros:
Quick, easy, and inexpensive.
Can be used for preliminary research.
Cons:
Highly prone to bias, as it does not represent the broader population well.
Results may not be generalizable to the entire population.
6. Quota Sampling
Description: The researcher selects participants to fulfill certain quotas, ensuring
representation of key demographic factors. It’s similar to stratified sampling, but the
selection is non-random.
Example: A researcher decides to interview 50 men and 50 women in a city to
understand their shopping habits, ensuring gender balance.
Applicability: Useful when the researcher wants to ensure certain demographic
groups are represented in the sample but doesn't have the resources for random
sampling.
Pros:
Ensures diversity in the sample.
Easier and cheaper than random sampling.
Cons:
Can be biased if the selection process is not random within each quota group.
The sample may not represent the overall population accurately.
7. Snowball Sampling
Description: A non-probability sampling method where existing participants recruit
new participants, often used in hard-to-reach or niche populations.
Example: A researcher studying the experiences of rare disease patients may start by
interviewing a few known patients, who then refer other patients.
Applicability: Ideal for studying hidden populations or when there’s no clear list of
the population (e.g., drug users, homeless individuals).
Pros:
Useful for accessing populations that are difficult to identify or reach.
Can be helpful in exploratory research.
Cons:
Prone to bias, as the sample may not be representative of the larger population.
Limited to social networks or groups.
22. Explain the normal distribution with examples. Describe its properties and
applications in data analysis. (Long Answer)
Ans:The normal distribution is a continuous probability distribution that is
symmetric and bell-shaped, commonly used in statistics. It is defined by its mean (μ)
and standard deviation (σ), and many real-world variables, like heights or exam
scores, follow this distribution.
Properties of the Normal Distribution:
Symmetry: The distribution is symmetrical around the mean.
Bell-shaped: The curve is highest at the mean and tapers off towards the tails.
68-95-99.7 Rule:
68% of data lies within one standard deviation of the mean.
95% lies within two standard deviations.
99.7% lies within three standard deviations.
Mean, Median, Mode: All are equal and located at the center.
Asymptotic: The tails of the distribution approach but never reach the
horizontal axis.
Example:
For a set of student exam scores with a mean of 75 and a standard deviation of 10,
most students would score near 75, with fewer students scoring very high or low.
Applications in Data Analysis:
Hypothesis Testing: Many statistical tests assume normality (e.g., z-tests).
Confidence Intervals: Used to estimate population parameters.
Z-Scores: Standardize data to compare different datasets.
Quality Control: Applied to monitor consistency in manufacturing.
Finance: Models returns on investment and risk assessment.
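The 68-95-99.7 rule can be verified numerically with scipy, using the exam-score example above (mean 75, standard deviation 10):
from scipy.stats import norm

mu, sigma = 75, 10
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print("P(within", k, "standard deviations) =", round(p, 4))
# Prints approximately 0.6827, 0.9545, and 0.9973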
23. Draw and explain the bias-variance tradeoff in machine learning. Interpret its
impact on model performance. (Long Answer)
Ans:The bias-variance tradeoff is a fundamental concept in machine learning,
representing the relationship between the model's complexity and its performance.
Bias: The error introduced by approximating a real-world problem with a simplified
model. A model with high bias makes strong assumptions and underfits the data,
leading to systematic errors. As model complexity decreases, bias increases.
Variance: The error introduced by the model's sensitivity to small fluctuations in the
training data. A model with high variance can overfit, capturing noise rather than the
true patterns. As model complexity increases, variance increases.
Impact on Model Performance:
High bias (underfitting) occurs when the model is too simple, resulting in poor
performance on both training and test data.
High variance (overfitting) happens when the model is too complex, performing well
on training data but poorly on new, unseen data.
The goal is to find a balance where both bias and variance are minimized, leading to
the best model performance. The total error is the sum of bias and variance, and the
optimal model complexity lies where this total error is lowest.
24. Identify underfitting and overfitting in machine learning through model behavior
and apply your understanding to interpret their impact on prediction accuracy, using
diagrams where appropriate. (Long Answer)
Ans: Overfitting in Machine Learning
Overfitting happens when a model learns too much from the training data, including
details that don’t matter (like noise or outliers).
For example, imagine fitting a very complicated curve to a set of points. The
curve will go through every point, but it won’t represent the actual pattern.
As a result, the model works great on training data but fails when tested on new
data.
2. Underfitting in Machine Learning
Underfitting is the opposite of overfitting. It happens when a model is too simple to
capture what’s going on in the data.
For example, imagine drawing a straight line to fit points that actually follow a
curve. The line misses most of the pattern.
In this case, the model doesn’t work well on either the training or testing data.
Underfitting : Straight line trying to fit a curved dataset but cannot capture the
data’s patterns, leading to poor performance on both training and test sets.
Overfitting: A squiggly curve passing through all training points, failing to
generalize performing well on training data but poorly on test data.
Appropriate Fitting: Curve that follows the data trend without overcomplicating
to capture the true patterns in the data
25. Apply the concept of p-values in a hypothesis testing scenario and interpret the
results using a suitable example. (Long Answer)
Ans:In hypothesis testing, the p-value helps determine the strength of evidence
against the null hypothesis (H₀). It represents the probability of obtaining results as
extreme as the observed, assuming H₀ is true.
Example Scenario:
A company claims that their battery lasts at least 10 hours on average. A researcher
believes the true average is less than 10 hours, and tests this by taking a sample of 30
batteries.
H₀ (null hypothesis): μ = 10 hours
H₁ (alternative hypothesis): μ < 10 hours
After testing, the researcher finds the sample mean is 9.6 hours, and the p-value is
0.03.
Interpretation:
If the significance level (α) is 0.05:
Since p = 0.03 < 0.05, we reject H₀.
There is significant evidence that the battery lasts less than 10 hours
on average.
If p > 0.05, we would fail to reject H₀, meaning the data doesn't provide strong
enough evidence to dispute the company's claim.
26. Use Bayes’s theorem to solve a probability-based problem and explain the
steps involved with a real-life example. (Long Answer)
Ans:
27. Analyze the application of Bayes’ Theorem in the Naive Bayes algorithm and
illustrate how the algorithm functions in a classification task using an example.
(Long Answer)
Ans:
28. Apply one-hot encoding and dummy variable techniques to a given
dataset and demonstrate their use with examples. (Long Answer)
Ans:One-Hot Encoding
Definition: One-hot encoding transforms each categorical value into a new binary
column, representing the presence (1) or absence (0) of each category.
Example dataset:
ID Color
1 Red
2 Blue
3 Green
4 Red
After one-hot encoding:
ID Color_Red Color_Blue Color_Green
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
Dummy Variable Encoding
Definition: Dummy encoding is a form of one-hot encoding that drops one column to
avoid the dummy variable trap, a situation where the predictors are perfectly
multicollinear (i.e., one column can be predicted from the others).
Dummy Variable Result (Red dropped):
ID Color_Blue Color_Green
1 0 0
2 1 0
3 0 1
4 0 0
Encoding Type and Use Case:
One-Hot Encoding: Ideal for tree-based models (e.g., Decision Trees, Random Forest,
XGBoost) that are not affected by multicollinearity.
Dummy Encoding: Preferred for linear models (e.g., Linear or Logistic Regression)
where multicollinearity can affect the model.
Real-World Application
In a customer churn prediction model, a feature like Contract Type = [Monthly,
One Year, Two Year] is categorical.
Encoding it using one-hot or dummy variables allows the model to understand
different contract types numerically and associate them with churn risk.
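A minimal pandas sketch of both encodings on the Color example above (note that drop_first drops the first category alphabetically, Color_Blue here, whereas the table above drops Red; either choice avoids the dummy variable trap):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4], 'Color': ['Red', 'Blue', 'Green', 'Red']})

one_hot = pd.get_dummies(df, columns=['Color'])                  # one binary column per category
dummy = pd.get_dummies(df, columns=['Color'], drop_first=True)   # one category dropped
print(one_hot)
print(dummy)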
MODULE 3
1. A company is using two different probability distributions to model customer
purchasing behavior. One follows a normal distribution, while the other follows a
Poisson distribution. Critique which distribution is more appropriate for modeling
daily purchase counts and justify your reasoning with statistical properties. (Short
Answer)
Ans:The Poisson distribution is more appropriate because it models count data—
specifically, the number of events (purchases) in a fixed time interval.
Poisson handles discrete, non-negative values, while Normal assumes continuous,
symmetric data.
Daily purchase counts typically follow Poisson, especially when the counts are low or
vary over time.
2. Define unconstrained multivariate optimization and provide an example. (Short
Answer)
Ans: Unconstrained multivariate optimization involves finding the minimum or
maximum of a function with multiple variables, without any constraints.
Example: Minimize f(x,y)=x^2+y^2
Solution: The minimum occurs at (0, 0), where the gradient is zero.
3. List two key differences between equality and inequality constraints in
optimization. (Short Answer)
Ans:Two key differences between equality and inequality constraints in
optimization:
Equality constraints (e.g., g(x) = 0) require exact satisfaction; inequality
constraints (e.g., h(x) ≤ 0) require a limit not to be exceeded.
Lagrange multipliers are used for equality constraints; Karush-Kuhn-Tucker
(KKT) conditions are used for inequality constraints.
4. Explain the role of the gradient in gradient descent optimization. (Short Answer)
Ans:The gradient points in the direction of steepest ascent of the function.
Gradient descent uses the negative gradient to iteratively update variables to
minimize the function.
It guides the model towards local minima step-by-step.
5. Describe how Lagrange multipliers are used to handle equality
constraints in optimization. (Short Answer)
Ans:Lagrange multipliers help solve constrained optimization problems by converting
them into an unconstrained form. Given a function f(x, y, ...) to maximize or minimize
under an equality constraint g(x, y, ...) = 0, the method introduces a multiplier λ
(Lagrange multiplier) to construct the Lagrangian function:
L(x,y,...,λ)=f(x,y,...)+λg(x,y,...)
By solving the gradient equations ∇L=0, we ensure that the constraint g(x, y, ...) = 0
is satisfied while optimizing f(x, y, ...).
6. Interpret what happens when the learning rate in gradient descent is set too high
or too low. (Short Answer)
Ans: Too high: The model may overshoot minima, fail to converge, or diverge.
Too low: Convergence becomes very slow, increasing computation time.
An optimal learning rate balances speed and stability.
7. A function has the form f(x, y) = x^2 + 2xy + y^2. Use the gradient descent
method to determine the direction of the steepest descent at the point (1,1).
Ans: The gradient is ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2x + 2y, 2x + 2y).
At the point (1, 1), ∇f(1, 1) = (4, 4).
The direction of steepest descent is the negative gradient, −∇f(1, 1) = (−4, −4), or (−1/√2, −1/√2) after normalization.
8. A company wants to minimize the cost function C(x,y)=x^2+y^2, subject to the
constraint that the total resource allocation satisfies x^2 + y^2 = 4. Use the Lagrange
multiplier method to find the optimal values of x and y.
Ans: Form the Lagrangian L(x, y, λ) = x^2 + y^2 + λ(x^2 + y^2 − 4).
Setting the partial derivatives to zero gives 2x(1 + λ) = 0, 2y(1 + λ) = 0, and x^2 + y^2 = 4.
Since x = y = 0 would violate the constraint, λ = −1, and every point on the circle x^2 + y^2 = 4 satisfies the conditions.
Because the objective coincides with the constraint expression, the cost is C = 4 at every feasible point (for example (2, 0), (0, 2), or (√2, √2)), so all such points are optimal.
9. A machine learning model is trained using gradient descent. Illustrate how the
model updates its weights iteratively using the learning rule. (Short Answer)
Ans: Weights w are updated iteratively using:
w_new = w_old − η · ∇L(w)
where η is the learning rate and ∇L(w) is the gradient of the loss function.
This process continues until the loss converges to a minimum.
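A minimal sketch of this update rule on a simple one-parameter loss (the loss function is hypothetical):
# Gradient descent on L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w = 0.0      # initial weight
eta = 0.1    # learning rate
for step in range(50):
    grad = 2 * (w - 3)      # gradient of the loss at the current weight
    w = w - eta * grad      # w_new = w_old - eta * gradient
print(w)   # converges toward the minimizer w = 3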
10. A function f(x, y) has multiple local minima. Analyze how different
initialization points in gradient descent impact convergence to the global minimum.
Ans: If f(x, y) has multiple local minima, gradient descent may converge to
different minima depending on the starting point.
Poor initialization may lead to local, not global, minima.
Multiple runs with different initial points or using stochastic approaches can help
find better minima.
11. Two optimization algorithms, Gradient Descent and Newton’s Method, are used
for unconstrained multivariate optimization. Compare their efficiency in terms of
convergence speed and computational cost, and justify which method would be more
suitable for large scale machine learning problems.
Ans:
Convergence Speed: Gradient Descent is slower (linear, and depends on the learning rate); Newton's Method is faster (quadratic, if near the minimum).
Computational Cost: Gradient Descent is low per iteration; Newton's Method is high (it requires the Hessian and a matrix inverse).
Suitability for Large Scale: Gradient Descent is better (less resource-intensive); Newton's Method is not ideal for large datasets.
Outlier Sensitivity: Gradient Descent is high (outliers skew the scale); Newton's Method is low (robust to outliers).
Overall, Gradient Descent is more suitable for large-scale machine learning, because computing and inverting the Hessian becomes infeasible as the number of parameters grows.
19. Using a Python library such as Seaborn or Matplotlib, generate a histogram for a
numerical feature in a dataset. Analyze the underlying distribution by examining its
shape, central tendency, and spread, and interpret what these characteristics reveal
about the data. (Long Answer)
Ans:import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset (example: Titanic)
df = sns.load_dataset('titanic')
# Plot histogram of 'age' feature
plt.figure(figsize=(8,4))
sns.histplot(data=df, x='age', bins=30, kde=True)
plt.title('Age Distribution of Titanic Passengers')
plt.show()
Distribution Analysis:
Shape:
Right-skewed (long tail to older ages)
Peaks around 20-30 years (young adults)
Central Tendency:
Median (~28) < Mean (~30) due to skew
Mode at 24 (most frequent age)
Spread:
Range: 0.4 to 80 years
IQR ~20-38 years (middle 50%)
Interpretation:
Majority were young adults (20s-30s)
Few children (<10) and elderly (>60)
Positive skew suggests most passengers were younger than the mean age
20. Analyze the concepts of overfitting and underfitting in machine learning
models. Compare their characteristics and examine their impact on model
performance using relevant examples. (Long Answer)
Ans:
Training Error: Overfitting shows very low (near zero) training error; underfitting shows high training error (poor fit even to the training data).
Cause: Overfitting arises from a model that is too complex (e.g., a high-degree polynomial), too many features, or a small dataset; underfitting arises from a model that is too simple (e.g., linear for non-linear data), too few features, or insufficient training.
Visual Sign: An overfit model fits the training data perfectly but is erratic elsewhere; an underfit model fails to fit even the training data (high bias).
21. Analyze how precision and recall contribute to evaluating model performance,
particularly in the context of imbalanced datasets. Examine scenarios where these
metrics provide more meaningful insights than accuracy. (Long Answer)
Ans:Precision and recall are critical metrics for evaluating model performance,
especially with imbalanced datasets.
Precision measures the proportion of true positive predictions out of all positive
predictions made by the model. It tells us how many of the predicted positive cases are
actually positive. This is important when false positives are costly (e.g., incorrectly
predicting a rare disease).
Recall measures the proportion of actual positive cases that the model correctly
identifies. It’s crucial when missing a positive case is costly (e.g., failing to detect a
fraud transaction).
Scenarios:
Medical Testing: Imagine a model that predicts whether a patient has a rare disease.
If the disease occurs in 1% of the population, the model could predict "no disease" for
everyone and still achieve 99% accuracy, missing every actual case of the disease.
Here, recall is more important—it's crucial that the model identifies as many true
cases of the disease as possible, even if it means accepting a few false positives (lower
precision).
Fraud Detection: In fraud detection, where fraudulent transactions make up a small
percentage of the total transactions, a model that predicts "no fraud" most of the time
could still have high accuracy. However, such a model wouldn't help in catching
fraudulent transactions. Here, both precision (to avoid false alarms) and recall (to
catch as many fraudulent transactions as possible) are more meaningful.
MODULE 4
1. Define logistic regression and mention one real-world application.
Ans:Logistic regression is a classification algorithm that models the probability
of a binary outcome using the sigmoid function.
Real-world application: Predicting whether a customer will buy a product
(yes/no) based on features like age and browsing behavior.
2. List the key assumptions of the k-nearest neighbor (k-NN) algorithm.
Ans:Similarity matters: Similar inputs have similar outputs (based on distance).
Feature scaling is important: Distance metrics are sensitive to scale.
No assumptions about data distribution: It's a non-parametric method.
3. Explain how the sigmoid function is used in logistic regression.
Ans:The sigmoid function transforms linear output into a probability between 0 and
1:
σ(z) = 1 / (1 + e^(−z))
where z = w^T x + b. If the probability > 0.5, it predicts class 1; otherwise, class 0.
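A minimal sketch of this computation (the weights, bias, and input below are hypothetical):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.8, -0.4])    # hypothetical weights
b = 0.1                      # hypothetical bias
x = np.array([2.0, 1.0])     # hypothetical feature vector

z = np.dot(w, x) + b         # linear score z = w^T x + b
p = sigmoid(z)               # probability of class 1
print(p, "-> predicted class", int(p > 0.5))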
13. Analyze how Support Vector Machines (SVM) separate data points into distinct
classes by identifying optimal hyperplanes. Examine the role of kernel functions in
transforming non-linearly separable data into a higher-dimensional space for
effective classification. (Long Answer)
Ans:Support Vector Machines (SVM) classify data by finding the optimal
hyperplane that best separates two classes. This hyperplane maximizes the margin
—the distance between the hyperplane and the nearest data points from each class,
called support vectors. A larger margin leads to better generalization and
classification accuracy. In cases where data is not linearly separable, kernel
functions can be used to transform the data into a higher-dimensional space where a
linear hyperplane can separate the classes effectively.
Role of Kernel Functions in SVM:
When data is not linearly separable in its original space, kernel functions enable
Support Vector Machines (SVM) to classify it effectively by transforming the data
into a higher-dimensional space where a linear separation becomes possible.
Key Points:
1. Transformation Without Explicit Computation (Kernel Trick):
Kernel functions compute the dot product of two data points in the transformed
space without explicitly mapping them.
This is known as the kernel trick, which saves computational effort and avoids
working directly in high-dimensional space.
2. How It Helps:
Transforms complex, non-linear boundaries into linearly separable ones in a
higher dimension.
Allows SVM to draw a linear hyperplane in this new space, which corresponds
to a non-linear boundary in the original space.
14. Analyze how Decision Trees determine splits at each node based on feature
selection criteria such as Information Gain or Gini Impurity. Examine how the
choice of splitting feature at each node impacts the tree’s accuracy, complexity, and
potential for overfitting. (Long Answer)
Ans:Decision Trees Splitting Criteria
Feature Selection with Information Gain or Gini Impurity:
Decision Trees choose the best feature to split a node based on criteria like
Information Gain (IG) or Gini Impurity.
Information Gain measures the reduction in entropy after a split.
Higher IG indicates a more informative split.
Gini Impurity measures how often a randomly chosen element would
be incorrectly classified. A lower Gini score is preferred.
Impact on Accuracy and Complexity :
The selected splitting feature directly affects the accuracy of the tree. Good splits lead
to purer nodes, improving classification or prediction performance. However, too
many splits can increase complexity, making the tree deeper and harder to interpret.
Potential for Overfitting :
If the model keeps splitting based on minor differences (high complexity), it may
overfit the training data and perform poorly on unseen data. Using pruning techniques
or setting depth limits helps reduce this risk.
15. Critically evaluate the problem of overfitting in Decision Trees and its impact on
model generalization. Justify how pruning techniques, such as pre-pruning and post-
pruning, can effectively reduce overfitting and improve the model’s performance on
unseen data. (Long Answer)
Ans:Overfitting in Decision Trees and Pruning Techniques :
Problem of Overfitting :
Overfitting occurs when a Decision Tree becomes too complex and captures
noise in the training data rather than the actual patterns. This results in high
training accuracy but poor generalization to new, unseen data.
Impact on Model Generalization:
An overfitted tree fails to perform well on test data, reducing its real-world
effectiveness. It may make overly specific rules that don't apply beyond the
training set.
Pruning Techniques to Reduce Overfitting :
Pre-Pruning (Early Stopping): Stops the tree from growing beyond a
certain depth or requires a minimum number of samples to split a
node. This prevents over-complex trees.
Post-Pruning (Reduced Error or Cost-Complexity Pruning): Allows
the full tree to grow and then removes branches that don't significantly
improve accuracy on validation data.
Both methods simplify the model, reducing overfitting and improving
performance on unseen data.
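A minimal scikit-learn sketch contrasting pre-pruning and post-pruning (the dataset and parameter values are illustrative choices):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: limit depth and minimum samples per split while growing the tree
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0).fit(X_train, y_train)
# Post-pruning: grow the tree, then prune weak branches via cost-complexity (ccp_alpha)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("Pre-pruned test accuracy:", pre.score(X_test, y_test))
print("Post-pruned test accuracy:", post.score(X_test, y_test))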
16. Analyze the role of distance metrics in the K-Nearest Neighbors (KNN)
algorithm and how they influence the classification or regression outcomes.
Examine why feature scaling through normalization or standardization is crucial
when using KNN, particularly in datasets with varying feature ranges. (Long
Answer).
Ans:Distance Metrics and Feature Scaling in KNN:
Role of Distance Metrics in KNN:
KNN relies on distance metrics to identify the 'K' nearest neighbors of a query
point. Common metrics include:
Euclidean Distance (default for continuous features)
Manhattan Distance (suitable for high-dimensional data)
Minkowski or Cosine Distance (for customized similarity)
These metrics determine how "close" other points are, directly influencing
classification or regression results.
Influence on Outcomes :
The choice of distance metric affects which neighbors are selected. Incorrect
or unbalanced metrics can lead to poor predictions by favoring irrelevant
features or distant points.
Importance of Feature Scaling:
When features have different ranges (e.g., age: 0–100 vs. income: 0–100,000),
larger-range features dominate the distance calculation.
Normalization (scales values to [0, 1])
Standardization (scales to mean = 0, std = 1)
These methods ensure all features contribute equally, making KNN
more accurate and fair.
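A minimal scikit-learn sketch showing KNN with and without standardization (the dataset choice is illustrative):
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, large-range features dominate the Euclidean distance
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# With standardization, every feature contributes comparably
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("Unscaled accuracy:", knn_raw.score(X_test, y_test))
print("Scaled accuracy:", knn_scaled.score(X_test, y_test))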
17. Analyze how Logistic Regression models probabilities to predict binary
outcomes. Examine how adjusting the decision threshold influences model
performance, particularly in handling imbalanced datasets. (Long Answer)
Ans: Logistic Regression and Decision Thresholds
Modeling Probabilities:
Logistic Regression predicts binary outcomes by modeling the probability that an
instance belongs to a class using the sigmoid function σ(z) = 1 / (1 + e^(−z)), where z = w^T x + b.
This maps any real-valued input to a value between 0 and 1, interpreted as the
probability of the positive class.
Decision Threshold :
A default threshold of 0.5 is typically used to classify outcomes:
If P ≥ 0.5, predict class 1.
If P < 0.5, predict class 0.
This threshold can be adjusted depending on the problem.
Impact on Imbalanced Datasets :
In imbalanced datasets (e.g., 95% negative, 5% positive), a 0.5 threshold may miss
minority cases (false negatives).
Lowering the threshold (e.g., to 0.3) increases sensitivity (recall) to detect more
positives.
Adjusting the threshold helps balance precision and recall, improving
performance on real-world, skewed data.
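A minimal sketch of adjusting the threshold via predicted probabilities (the synthetic imbalanced data is hypothetical):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical imbalanced data: roughly 10% positive class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X)[:, 1]          # probability of the positive class
pred_default = (probs >= 0.5).astype(int)     # default threshold
pred_lowered = (probs >= 0.3).astype(int)     # lowered threshold to catch more positives

print("Recall at threshold 0.5:", recall_score(y, pred_default))
print("Recall at threshold 0.3:", recall_score(y, pred_lowered))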
MODULE 5
1. A company is using k-means clustering for customer segmentation but is unsure
about the optimal number of clusters. Critique whether the Elbow Method or
Silhouette Score is a better approach to determine the optimum number of clusters
and justify your reasoning.
Ans:The Elbow Method and Silhouette Score are both valid techniques for finding
the optimal number of clusters in K-means.
Elbow Method plots the within-cluster sum of squares (WCSS) vs. number of
clusters. The "elbow" point suggests the best balance between model complexity
and compactness. However, it can be subjective and unclear if no obvious elbow
exists.
Silhouette Score measures how well each point fits within its cluster compared to
others. It ranges from -1 to 1, with higher scores indicating well-separated, dense
clusters.
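A minimal scikit-learn sketch computing both criteria over a range of k (the blob data stands in for real customer features):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # hypothetical data with 4 groups

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                      # within-cluster sum of squares (elbow criterion)
    sil = silhouette_score(X, km.labels_)   # silhouette score
    print("k =", k, " WCSS =", round(wcss, 1), " silhouette =", round(sil, 3))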
4. Explain the role of the R-squared (R^2) value in evaluating a regression model.
Ans:The R-squared (R²) value measures how well a regression model explains the
variance in the dependent variable.
It ranges from 0 to 1:
R² = 1 means perfect prediction.
R² = 0 means the model explains none of the variance.
Role:R² indicates the goodness of fit — the higher the R², the better the model
explains the data. However, it does not indicate causation or model correctness and
can be misleading if used alone, especially in overfitted models.
5. Describe how residual plots can help diagnose model fit issues in linear regression.
Ans:Residual plots show the difference between predicted and actual values
(residuals) vs. predicted values or input features.
How they help:
A random scatter of points suggests a good model fit.
Patterns or curves indicate non-linearity — the model may be missing
relationships.
Increasing or decreasing spread signals heteroscedasticity (non-constant
variance).
Outliers or clustering can reveal data issues or the need for a better model.
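A minimal matplotlib sketch of a residual plot (the data-generating process is hypothetical and deliberately non-linear):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + 0.3 * X.ravel() ** 2 + rng.normal(0, 2, 200)   # mild curvature plus noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residual plot (a curved pattern indicates missing non-linearity)')
plt.show()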
6. Interpret the purpose of cross-validation in regression model selection.
Ans:Cross-validation is used to assess how well a regression model generalizes to
unseen data.
Purpose:
It splits the dataset into training and validation sets multiple times (e.g., in k-fold
cross-validation).
Helps avoid overfitting by testing the model on different data subsets.
Provides a more reliable estimate of model performance compared to a single
train-test split.
Aids in model selection by comparing performance metrics (like RMSE or R²)
across different models.
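A minimal scikit-learn sketch of k-fold cross-validation for a regression model (synthetic data):
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
model = LinearRegression()

# 5-fold cross-validation; the default score for regressors is R^2
scores = cross_val_score(model, X, y, cv=5)
print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))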
7. A simple linear regression model is given by Y=3X+5. Calculate the predicted
value of Y when X=4.
Ans: Y = 3X + 5 = 3(4) + 5 = 12 + 5 = 17.
The predicted value of Y when X = 4 is 17.
8. Illustrate how to implement multiple linear regression in Python using the scikit-
learn library with a sample dataset.
Ans:from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate a synthetic dataset with two predictor features
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# Fit the multiple linear regression model
model = LinearRegression()
model.fit(X, y)
# Show the estimated coefficients and intercept
print(model.coef_, model.intercept_)
9.Given a dataset with multicollinearity, apply an appropriate regression
technique to reduce its impact and describe your approach.
Ans:To reduce multicollinearity in a dataset, we can apply Ridge Regression or
Principal Component Regression (PCR):
Ridge Regression: Adds an L2 regularization penalty to the regression model,
shrinking correlated variable coefficients to stabilize estimates.
Principal Component Regression (PCR): Uses Principal Component Analysis
(PCA) to transform correlated predictors into uncorrelated components, reducing
redundancy.
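A minimal sketch contrasting ordinary least squares and Ridge on two almost-identical predictors (the data is synthetic):
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # x2 is nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=200)

print("OLS coefficients:", LinearRegression().fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
# The L2 penalty shrinks the unstable, highly correlated coefficients toward more balanced values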
10.A researcher builds two multiple linear regression models with different feature
sets. Analyze how cross-validation can help determine which model is more reliable.
Ans:Cross-validation helps compare the reliability and performance of two
multiple linear regression models with different feature sets.
How it helps:
Both models are evaluated on multiple data splits (e.g., k-fold CV), reducing
bias from a single train-test split.
Metrics like Mean Squared Error (MSE) or R² are averaged across folds.
The model with lower average error and less variance in scores is considered
more reliable.
Helps identify if extra features improve performance or just cause overfitting
11.A company is using Akaike Information Criterion (AIC) and Adjusted R-squared
to select the best multiple regression model. Critique which metric is more effective
for model selection and justify your reasoning.
Ans:Both Akaike Information Criterion (AIC) and Adjusted R-squared are useful
for model selection, but they serve slightly different purposes.
AIC balances model fit and complexity. It penalizes the number of predictors to
avoid overfitting. Lower AIC indicates a better model. It is more effective when
comparing non-nested models and focuses on predictive accuracy.
Adjusted R-squared adjusts R² for the number of predictors. It increases only if a
new feature improves the model more than by chance. However, it is less
sensitive to overfitting compared to AIC.
12. Compare and contrast the advantages and disadvantages of ensemble
methods like Bagging, Boosting, and Stacking. (Long Answer)
Ans:
Bagging (e.g., Random Forest): Advantages: reduces variance and overfitting, easy to parallelize, improves stability. Disadvantages: less improvement if base learners are strong, can be computationally expensive with many trees.
Boosting (e.g., AdaBoost, XGBoost): Advantages: reduces bias and variance, focuses on hard-to-predict examples, often achieves high accuracy. Disadvantages: sensitive to noisy data and outliers, sequential training is slower and harder to parallelize.
Stacking: Advantages: combines diverse models for better performance, flexible and can use any base learners. Disadvantages: complex to implement and tune, risk of overfitting if the meta-model is not carefully chosen.
13. Examine the trade-offs between bias and variance in machine learning models
and their effect on overfitting and underfitting. (Long Answer)
Ans:Bias: Bias refers to errors from overly simple models that underfit data, missing
important patterns. High bias leads to poor training and test accuracy.
Variance refers to errors from models that are too complex and overfit training
data, capturing noise. High variance leads to good training but poor test accuracy.
Trade-off:
Reducing bias usually increases variance and vice versa. The goal is to find a balance
that minimizes total error.
Effect on Overfitting/Underfitting:
High bias → Underfitting (model too simple)
High variance → Overfitting (model too complex)
14. Evaluate the performance of different classification models (e.g., Logistic
Regression, Random Forest, and SVM) on a given dataset using precision, recall,
and F1-score. (Long Answer)
Ans:Overview of Metrics:
Precision: Measures the proportion of true positive predictions among all positive
predictions.
Precision = TP / (TP + FP)
High precision means fewer false positives.
Recall (Sensitivity): Measures the proportion of true positives identified out of actual
positives.
Recall = TP / (TP + FN)
High recall means fewer false negatives.
F1-score: The harmonic mean of precision and recall, balancing the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Useful when you want a balance between precision and recall, especially in
imbalanced datasets.
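A minimal scikit-learn sketch comparing the three models with these metrics (the synthetic dataset is a stand-in for a real one):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), digits=3))   # precision, recall, F1 per class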
15. Analyze the impact of hyperparameter tuning in Random Search in deep
learning models. (Long Answer)
Ans: Efficiency: Random Search explores the hyperparameter space by sampling combinations randomly, often finding good configurations faster than exhaustive grid search, especially when only a few hyperparameters matter.
Improved Model Performance: Proper tuning of parameters like learning rate, batch size, and dropout via Random Search can significantly enhance model accuracy and generalization by optimizing training dynamics.
Better Exploration: Random Search can discover non-intuitive hyperparameter values by sampling broadly, increasing the chances of escaping local optima in complex deep learning models.
Scalability: It handles high-dimensional hyperparameter spaces efficiently, making it suitable for deep learning models with many parameters.
Limitations: Despite its efficiency, Random Search requires multiple training runs, which can be computationally expensive, and results depend on the chosen parameter ranges.
16. A regression analysis between apples (y) and oranges (x) resulted in the
following least squares line: y = 100 + 2x. Predict the implication if oranges are
increased by 1 (Long Answer)