Ans:
Formula: IQR = Q3 − Q1
Outlier Determination:
Lower Bound: Q1−1.5×IQR
Upper Bound: Q3+1.5×IQR
Data points outside these bounds are considered outliers.
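As a quick illustration (a minimal Python sketch with hypothetical values), the IQR rule can be applied as follows:
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([12, 15, 14, 10, 18, 16, 13, 95])
q1, q3 = np.percentile(data, [25, 75])          # first and third quartiles
iqr = q3 - q1                                   # IQR = Q3 - Q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier bounds
outliers = data[(data < lower) | (data > upper)]
print("Bounds:", lower, upper, "Outliers:", outliers)   # 95 is flagged as an outlier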
16. Find the probability of throwing two fair dice when the sum is 5 and when the
sum is 8. Show your calculation. (Long Answer)
Ans: Each die has 6 faces, so there are 6 × 6 = 36 equally likely outcomes.
Sum = 5: the favorable outcomes are (1,4), (2,3), (3,2), (4,1), so P(sum = 5) = 4/36 = 1/9 ≈ 0.111.
Sum = 8: the favorable outcomes are (2,6), (3,5), (4,4), (5,3), (6,2), so P(sum = 8) = 5/36 ≈ 0.139.
17. Compare quantitative data and qualitative data. (Long Answer)
Ans:
Definition: Quantitative data can be measured and expressed numerically; qualitative data describes qualities or characteristics.
Type: Quantitative data is numeric (e.g., height, weight, temperature); qualitative data is categorical (e.g., colors, names, labels).
Examples: Quantitative data includes age, salary, distance, and time; qualitative data includes gender, nationality, and type of car.
Measurement Scale: Quantitative data uses interval or ratio scales (e.g., distance, temperature); qualitative data uses nominal or ordinal scales (e.g., yes/no, ratings).
Analysis: Quantitative data supports mathematical operations (e.g., mean, median, variance); qualitative data uses non-numeric analysis (e.g., frequency, mode, grouping).
Graphical Representation: Quantitative data uses histograms, bar graphs, and line charts; qualitative data uses pie charts, bar graphs, and word clouds.
18. Define covariance. Analyze how it helps in understanding the direction and
strength of the relationship between two variables. (Long Answer)
Ans:Covariance is a statistical measure that quantifies the relationship between two
variables, showing how they change together. It indicates whether an increase in one
variable leads to an increase or decrease in another.
Formula: Cov(X, Y) = (1/n) × Σ (Xi − X̄)(Yi − Ȳ), where the sum runs over i = 1 to n
Where:
X and Y are the two variables,
X̄ and Ȳ are the means of X and Y,
n is the number of data points.
Significance of Covariance:Covariance helps in understanding both the direction
and strength of the relationship between two variables.
Direction of Relationship:
Positive Covariance: If the covariance is positive, it indicates that as one variable
increases, the other tends to increase as well (positive relationship).
Negative Covariance: If the covariance is negative, it suggests that as one variable
increases, the other tends to decrease (negative relationship).
Zero Covariance: A covariance close to zero suggests no linear relationship between
the variables.
Strength of Relationship:
Magnitude: The larger the magnitude of the covariance, the stronger the
relationship between the variables. However, covariance alone does not provide a
standardized measure of strength. The value depends on the scale of the variables, so
comparing covariances across datasets with different units can be misleading.
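As a short numeric illustration (a minimal sketch with made-up values), covariance can be computed by hand or with NumPy:
import numpy as np

x = np.array([2, 4, 6, 8])   # hypothetical variable X
y = np.array([1, 3, 5, 7])   # hypothetical variable Y

# Population covariance: average of (Xi - mean of X)(Yi - mean of Y)
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))
# np.cov divides by n-1 by default; bias=True gives the 1/n version in the formula above
cov_np = np.cov(x, y, bias=True)[0, 1]
print(cov_manual, cov_np)   # both equal 5.0, a positive covariance (x and y rise together)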
19. A distribution is skewed to the right and has a median of 20. Will the mean be
greater than or less than 20? Explain briefly. (Long Answer)
Ans:In a right-skewed distribution (positively skewed), the mean is pulled in the direction
of the longer tail on the right side. Since the median (which represents the central value) is
given as 20, the mean will be greater than 20 because the higher values in the tail raise the
average.
This happens because extreme values on the right increase the mean more than they affect the
median.
20. Define one-sample t-test. Explain when it is used in statistical analysis. (Long
Answer)
Ans:A one-sample t-test is a statistical test used to determine whether the mean of a
sample is significantly different from a known or hypothesized population mean. It
compares the sample mean to the population mean to assess if any observed difference
is statistically significant or if it could have occurred due to random chance.
When to Use a One-Sample T-Test:
The one-sample t-test is used in the following scenarios:
When you have a sample and you want to compare its mean to a known
population mean (or hypothesized value).
When the population standard deviation is unknown and the sample size is
small (typically n < 30).
When the data is approximately normally distributed, as the t-test assumes
normality for the sample data, especially when the sample size is small.
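A minimal sketch of running a one-sample t-test in Python with scipy (the scores below are hypothetical):
from scipy import stats
import numpy as np

# Hypothetical sample of exam scores; test whether the mean differs from 70
sample = np.array([72, 68, 75, 71, 69, 74, 73, 70, 76, 67])
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)
print("t =", round(t_stat, 3), "p =", round(p_value, 3))
# If p < 0.05, the sample mean differs significantly from the hypothesized mean of 70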
21. Apply your understanding of outliers to identify situations where retaining
them is appropriate. Support your response with relevant examples. (Long
Answer)
Ans:While outliers are often removed to ensure accurate analysis, there are situations
where retaining them is necessary and informative. Here are some scenarios:
True Extreme Values:
Example: In financial analysis, extreme stock prices or large transactions may be
legitimate and critical to understanding market behavior.
Reason: Removing these outliers could distort trends and risk analysis, as they
represent rare but significant events.
Scientific Research:
Example: In genetics research, some outliers may represent important variations or
mutations that offer insight into rare diseases.
Reason: Outliers could reveal important patterns or rare phenomena crucial to the
study.
Fraud Detection:
Example: In banking, abnormally large withdrawals could be outliers, but they
might indicate fraudulent activity.
Reason: Retaining these outliers can help detect suspicious behaviors or fraud.
Process Monitoring:
Example: In manufacturing, a machine failure that causes a spike in data could be an
outlier but indicates a problem needing attention.
Reason: The outlier could point to a process anomaly that requires corrective
measures.
22. List the key properties of a normal distribution. Analyze any two of these
properties by explaining how they affect the shape and behavior of the distribution.
(Long Answer)
Ans:Key Properties of a Normal Distribution:
Symmetry: The normal distribution is symmetrical around its mean.
Bell-Shaped Curve: It has a bell-shaped curve, with the highest point at the mean,
and tails that extend infinitely in both directions.
Mean, Median, and Mode Equality: In a normal distribution, the mean, median, and
mode are all the same.
68-95-99.7 Rule:
68% of the data lies within one standard deviation of the mean,
95% lies within two standard deviations,
99.7% lies within three standard deviations.
Asymptotic: The tails of the distribution approach the horizontal axis but never touch
it.
Defined by Mean and Standard Deviation: The shape of the normal distribution is
fully defined by its mean (μ) and standard deviation (σ).
Analysis of Two Properties:
1. Symmetry:
Impact on Shape: The symmetry of a normal distribution means that the curve is
identical on both sides of the mean. This property leads to a balanced distribution of
data, where half of the values are less than the mean and the other half are greater.
Behavior: Since the distribution is symmetrical, it implies that the probability of
observing a value above the mean is equal to the probability of observing a value
below the mean. This is crucial in hypothesis testing and confidence interval
estimation.
2. 68-95-99.7 Rule:
Impact on Shape: This rule indicates that the majority of the data is concentrated
around the mean. For a standard normal distribution (mean = 0, standard deviation =
1), 68% of the data falls within one standard deviation, making the curve steepest at
the mean and flatter as you move away from it.
Behavior: This property shows that most values lie close to the mean, and as you
move further from the mean, the probability of observing values decreases rapidly.
This is important in predicting probabilities and understanding the spread of data
within a normal distribution.
23. Explain the different stages of data science. (Long Answer)
Ans: A typical data science workflow moves through the following stages:
Problem Definition: Understanding the business or research question to be answered.
Data Collection: Gathering data from sources such as databases, files, APIs, or surveys.
Data Preprocessing/Cleaning: Handling missing values, duplicates, errors, and noise.
Exploratory Data Analysis (EDA): Summarizing and visualizing the data to discover patterns and relationships.
Modeling: Applying statistical or machine learning models to the prepared data.
Evaluation: Assessing model performance with suitable metrics (e.g., accuracy, RMSE).
Deployment and Monitoring: Putting the model into use and tracking its performance over time.
24. What is machine learning? Justify its role and importance in data science with
a brief explanation. (Long Answer)
Ans:Machine Learning (ML) is a subset of artificial intelligence that enables
systems to learn from data and make predictions or decisions without being explicitly
programmed.
Role and Importance in Data Science:
Core Analytical Tool: ML provides powerful algorithms (like regression,
classification, clustering) to extract patterns from data.
Automates Decision-Making: It helps build models that can automatically predict
outcomes (e.g., customer churn, sales forecasts).
Handles Large Data: ML efficiently processes and learns from massive, complex
datasets that are difficult to analyze manually.
Supports Real-Time Applications: Powers applications like recommendation
systems, fraud detection, and image recognition.
25. What is data preprocessing? Explain its role in data analysis and describe
common methods used for preprocessing data. (Long Answer)
Ans:Data preprocessing is the process of cleaning and transforming raw data into a
suitable format for analysis or modeling.
Role in Data Analysis:
Ensures data quality and consistency.
Removes errors, missing values, and noise.
Enhances the accuracy and performance of models.
Makes data ready for statistical or machine learning tasks.
Common Methods of Data Preprocessing:
Missing Value Treatment – Filling or removing missing data.
Data Cleaning – Removing duplicates, correcting errors.
Normalization/Scaling – Standardizing feature ranges.
Encoding Categorical Data – Converting categories into numerical values (e.g.,
one-hot encoding).
Outlier Detection and Handling – Identifying and treating extreme values.
Data Transformation – Log, square root, or other transformations to reduce
skewness.
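A minimal preprocessing sketch in Python (the toy data and column names are hypothetical) combining several of these steps:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    'age': [25, 32, None, 45],
    'salary': [30000, 52000, 47000, 88000],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']
})

df['age'] = df['age'].fillna(df['age'].median())   # missing value treatment
df = pd.get_dummies(df, columns=['city'])          # encode categorical data (one-hot)
df[['age', 'salary']] = StandardScaler().fit_transform(df[['age', 'salary']])  # scaling
print(df)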
MODULE 2
1. Define a population and a sample in the context of statistical analysis. (Short
Answer)
Ans: Population:A population is the complete set of individuals or items of interest
(e.g., all students in a university).
Sample:A sample is a subset drawn from the population used for analysis (e.g., 200
selected students).
Sampling allows for efficient estimation of population characteristics.
2. List any two types of probability distributions and provide a brief example of each.
(Short Answer)
Ans: Binomial Distribution: This distribution is used when there are two possible
outcomes (success or failure) in a fixed number of independent trials.
Example: Tossing a coin 10 times and counting the number of heads.
Normal Distribution: This is a continuous distribution that is symmetric about the
mean, often called a bell curve.
Example: Heights of adult humans typically follow a normal distribution.
3. Explain the difference between a parameter and a statistic with an example.
(Short Answer)
Ans:
Definition: A parameter is a value that describes a characteristic of a population; a statistic is a value that describes a characteristic of a sample.
Based on: A parameter is based on the entire population; a statistic is based on a subset (sample) of the population.
Example: Parameter: the average height of all students in a school. Statistic: the average height of 50 randomly selected students.
16. Evaluate the roles of descriptive and inferential statistics in data analysis.
Justify the significance of each through practical examples. (Long Answer)
Ans:1. Descriptive Statistics
Role:
Descriptive statistics are used to summarize, organize, and present data in a
meaningful way. They help to describe the basic features of a dataset, offering a clear
picture of what the data shows without making any generalizations or predictions
beyond the data itself.
Example:
Suppose a teacher collects test scores from her class of 30 students. She calculates the
average (mean) score as 75, the highest score as 95, and the standard deviation as 10.
This summary gives her a snapshot of class performance—this is descriptive
statistics in action.
Significance:
Descriptive statistics help in:
Understanding patterns in data
Identifying outliers or trends
Presenting data clearly for reporting or communication
2. Inferential Statistics
Role:
Inferential statistics go a step further by using sample data to make predictions or
inferences about a larger population. This is essential when it's impractical or
impossible to collect data from every member of a population.
Example:
A political analyst surveys 1,000 people out of a population of 1 million to estimate
voting preferences. Based on the sample, the analyst infers that 60% of the population
supports a particular candidate. This is a classic case of inferential statistics.
Significance:
Inferential statistics are vital for:
Making decisions under uncertainty
Predicting future trends or behaviors
Testing theories or assumptions based on limited data
17. Imagine that Jeremy took part in an examination. The test is having a mean score
of 160, and it has a standard deviation of 15. If Jeremy’s z-score is 1.20, Evaluate his
score on test. (Long Answer)
Ans: The z-score formula is z = (x − μ) / σ, so x = μ + z × σ.
Here x = 160 + 1.20 × 15 = 160 + 18 = 178.
Jeremy's score on the test is 178.
18. Analyze a left-skewed distribution with a median of 60 to draw conclusions
about the relative positions of the mean and the mode. Explain the relationship
among these measures of central tendency based on the skewness of the
distribution. (Long Answer)
Ans:In a left-skewed distribution (also called negatively skewed), the tail of the
distribution extends more toward the lower values (left side). Given that the median is
60, here's how the mean and mode relate to it:
Mean: The mean will be less than the median in a left-skewed distribution.
This is because the long left tail pulls the mean down. The mean is sensitive to
extreme values, so outliers on the lower end will reduce the mean value.
Mode: The mode, which is the most frequent value in the dataset, is usually
greater than the median in a left-skewed distribution. This is because the
majority of the data points are clustered toward the higher values on the right
side of the distribution.
Summary of the relationship:
Mode > Median > Mean in a left-skewed distribution.
The mean is pulled to the left by the skewness, making it the smallest measure
of central tendency, while the mode is the largest, as it is typically positioned
near the peak of the distribution.
19. Evaluate the effect of outliers on statistical measures such as mean and
standard deviation. Justify when and why outliers may distort data interpretation
using examples. (Long Answer)
Ans:Outliers can significantly impact statistical measures like the mean and standard
deviation:
Mean: The mean is sensitive to outliers because it's calculated by summing all values.
Outliers can skew the mean toward extreme values, making it unrepresentative of the
majority of the data.
Example: In the dataset (10, 12, 14, 16, 1000), the mean is 210.4, far above the typical
values of 10-16, because the single outlier dominates the sum and makes the mean misleading.
Standard Deviation: Outliers increase the spread of data, leading to a higher standard
deviation. This makes it seem like the data is more variable than it actually is.
Example: In datasets with and without outliers, the dataset with an outlier will show a
much higher standard deviation, which could distort interpretations of data variability.
When and Why Outliers Distort Data Interpretation:
Skewed Conclusions: Outliers can lead to skewed interpretations, especially when
making decisions based on the mean or standard deviation. For instance, if a business
is using the mean income to set salary levels, outliers could lead to unreasonably high
salaries.
Misleading Comparisons: When comparing datasets or groups, outliers can give the
false impression that one group has more variability or a higher average than it
actually does.
Data Cleaning: In many cases, it's important to identify and decide whether outliers
should be removed or treated separately. They may represent important but rare
occurrences (e.g., a $100,000,000 income is a legitimate but rare outlier in an income
dataset), but they should not unduly influence overall statistical analyses.
20. Analyze the structure and components of a confusion matrix in classification
tasks. Interpret its elements (TP, FP, TN, FN) with an example to assess model
performance. (Long Answer)
Ans:A confusion matrix is a table used to evaluate the performance of a classification
model. It compares the predicted labels with the actual labels, showing how many
correct and incorrect predictions were made. The confusion matrix is typically
structured as follows:
Components of a Confusion Matrix:
True Positive (TP): The number of correct predictions where the model correctly
predicts the positive class.
False Positive (FP): The number of incorrect predictions where the model incorrectly
predicts the positive class (type I error).
True Negative (TN): The number of correct predictions where the model correctly
predicts the negative class.
False Negative (FN): The number of incorrect predictions where the model
incorrectly predicts the negative class (type II error).
Example:
Consider a binary classification model used to predict whether a person has a disease
(Positive) or not (Negative), with actual values (True labels) and predicted values
(Predicted labels):
Actual \ Predicted Positive (1) Negative (0)
Positive (1) 80 (TP) 10 (FN)
Negative (0) 15 (FP) 95 (TN)
Interpretation:
True Positive (TP = 80): The model correctly predicted 80 cases of the disease.
False Negative (FN = 10): The model incorrectly predicted that 10 people did not
have the disease when they actually did (missed diagnoses).
False Positive (FP = 15): The model incorrectly predicted that 15 people had the
disease when they did not (false alarms).
True Negative (TN = 95): The model correctly predicted that 95 people did not have
the disease.
Model Performance Metrics from the Confusion Matrix:
From the confusion matrix, we can derive various performance metrics to evaluate the
model:
Accuracy: The proportion of correctly predicted instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision (Positive Predictive Value): The proportion of positive predictions that are
actually correct.
Precision = TP / (TP + FP)
Recall (Sensitivity, True Positive Rate): The proportion of actual positives that are
correctly identified.
Recall = TP / (TP + FN)
F1 Score: The harmonic mean of precision and recall, balancing both metrics.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
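Using the counts from the example above (TP = 80, FN = 10, FP = 15, TN = 95), these metrics can be computed directly in Python:
TP, FN, FP, TN = 80, 10, 15, 95

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
# Accuracy = 0.875, Precision ≈ 0.842, Recall ≈ 0.889, F1 ≈ 0.865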
21. Analyze the role of sampling in statistical investigations by comparing different
sampling techniques. Illustrate each method with examples and assess their
applicability in various scenarios. (Long Answer)
Ans:1. Simple Random Sampling (SRS)
Description: Every individual in the population has an equal chance of being selected.
Example: A researcher randomly selects 100 students from a school’s list of 500
students to study their academic performance.
Applicability: Ideal for situations where the population is homogeneous and every
member has an equal chance of being selected. Works well when there is no clear
subgroup structure in the population.
Pros:
Easy to understand and implement.
Results in an unbiased sample.
Cons:
Can be impractical for large populations.
May not represent smaller subgroups well.
2. Systematic Sampling
Description: A sample is selected by choosing every kth individual from the
population after selecting a random starting point.
Example: A researcher decides to survey every 10th customer in a store’s queue,
starting from a random point.
Applicability: Useful when a list of the population is available, and the researcher
wants a quick and simple sampling method. It works well when there’s no hidden
pattern in the population.
Pros:
Simple and faster than simple random sampling.
Ensures even coverage of the population.
Cons:Can be biased if there's an underlying pattern in the population (e.g., a queue
system where every 10th customer has the same characteristics).
3. Stratified Sampling
Description: The population is divided into subgroups (strata) based on specific
characteristics (e.g., age, gender, income level), and a sample is randomly selected
from each stratum.
Example: A researcher wants to survey employees about job satisfaction in a
company. The company is divided into strata based on departments (HR, marketing,
finance), and a random sample is taken from each department.
Applicability: Ideal when the population has distinct subgroups that may vary
significantly, and the researcher wants to ensure representation from all subgroups.
Pros:Ensures proportional representation of all key subgroups.
Cons:More complex to implement and requires knowledge about the population
structure.
4. Cluster Sampling
Description: The population is divided into clusters (often geographically), and then a
random sample of clusters is selected. All members of the chosen clusters are
surveyed.
Example: A researcher wants to study household energy usage across a country. They
randomly select 10 cities (clusters) and survey all households in those cities.
Applicability: Useful when the population is geographically spread out, or when it’s
difficult to compile a complete list of the population. Often used in large-scale surveys
or government studies.
Pros:
Cost-effective and easier to manage when the population is geographically
dispersed.
Requires fewer resources than surveying a random sample from the entire
population.
Cons:
May introduce bias if the clusters are not homogeneous.
Less precise than stratified sampling.
5. Convenience Sampling
Description: The sample is selected based on what is easiest for the researcher (e.g.,
surveying people nearby or accessible).
Example: A researcher surveys students in a class to understand their opinions about
online education.
Applicability: Often used in exploratory research or when time, cost, or access to the
population is limited.
Pros:
Quick, easy, and inexpensive.
Can be used for preliminary research.
Cons:
Highly prone to bias, as it does not represent the broader population well.
Results may not be generalizable to the entire population.
6. Quota Sampling
Description: The researcher selects participants to fulfill certain quotas, ensuring
representation of key demographic factors. It’s similar to stratified sampling, but the
selection is non-random.
Example: A researcher decides to interview 50 men and 50 women in a city to
understand their shopping habits, ensuring gender balance.
Applicability: Useful when the researcher wants to ensure certain demographic
groups are represented in the sample but doesn't have the resources for random
sampling.
Pros:
Ensures diversity in the sample.
Easier and cheaper than random sampling.
Cons:
Can be biased if the selection process is not random within each quota group.
The sample may not represent the overall population accurately.
7. Snowball Sampling
Description: A non-probability sampling method where existing participants recruit
new participants, often used in hard-to-reach or niche populations.
Example: A researcher studying the experiences of rare disease patients may start by
interviewing a few known patients, who then refer other patients.
Applicability: Ideal for studying hidden populations or when there’s no clear list of
the population (e.g., drug users, homeless individuals).
Pros:
Useful for accessing populations that are difficult to identify or reach.
Can be helpful in exploratory research.
Cons:
Prone to bias, as the sample may not be representative of the larger population.
Limited to social networks or groups.
22. Explain the normal distribution with examples. Describe its properties and
applications in data analysis. (Long Answer)
Ans:The normal distribution is a continuous probability distribution that is
symmetric and bell-shaped, commonly used in statistics. It is defined by its mean (μ)
and standard deviation (σ), and many real-world variables, like heights or exam
scores, follow this distribution.
Properties of the Normal Distribution:
Symmetry: The distribution is symmetrical around the mean.
Bell-shaped: The curve is highest at the mean and tapers off towards the tails.
68-95-99.7 Rule:
68% of data lies within one standard deviation of the mean.
95% lies within two standard deviations.
99.7% lies within three standard deviations.
Mean, Median, Mode: All are equal and located at the center.
Asymptotic: The tails of the distribution approach but never reach the
horizontal axis.
Example:
For a set of student exam scores with a mean of 75 and a standard deviation of 10,
most students would score near 75, with fewer students scoring very high or low.
Applications in Data Analysis:
Hypothesis Testing: Many statistical tests assume normality (e.g., z-tests).
Confidence Intervals: Used to estimate population parameters.
Z-Scores: Standardize data to compare different datasets.
Quality Control: Applied to monitor consistency in manufacturing.
Finance: Models returns on investment and risk assessment.
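The 68-95-99.7 rule can be verified numerically with scipy, using the exam-score example above (mean 75, standard deviation 10):
from scipy.stats import norm

mu, sigma = 75, 10
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print("P(within", k, "standard deviations) =", round(p, 4))
# Prints approximately 0.6827, 0.9545, and 0.9973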
23. Draw and explain the bias-variance tradeoff in machine learning. Interpret its
impact on model performance. (Long Answer)
Ans:The bias-variance tradeoff is a fundamental concept in machine learning,
representing the relationship between the model's complexity and its performance.
Bias: The error introduced by approximating a real-world problem with a simplified
model. A model with high bias makes strong assumptions and underfits the data,
leading to systematic errors. As model complexity decreases, bias increases.
Variance: The error introduced by the model's sensitivity to small fluctuations in the
training data. A model with high variance can overfit, capturing noise rather than the
true patterns. As model complexity increases, variance increases.
Impact on Model Performance:
High bias (underfitting) occurs when the model is too simple, resulting in poor
performance on both training and test data.
High variance (overfitting) happens when the model is too complex, performing well
on training data but poorly on new, unseen data.
The goal is to find a balance where both bias and variance are minimized, leading to
the best model performance. The total error is the sum of bias and variance, and the
optimal model complexity lies where this total error is lowest.
24. Identify underfitting and overfitting in machine learning through model behavior
and apply your understanding to interpret their impact on prediction accuracy, using
diagrams where appropriate. (Long Answer)
Ans: Overfitting in Machine Learning
Overfitting happens when a model learns too much from the training data, including
details that don’t matter (like noise or outliers).
For example, imagine fitting a very complicated curve to a set of points. The
curve will go through every point, but it won’t represent the actual pattern.
As a result, the model works great on training data but fails when tested on new
data.
2. Underfitting in Machine Learning
Underfitting is the opposite of overfitting. It happens when a model is too simple to
capture what’s going on in the data.
For example, imagine drawing a straight line to fit points that actually follow a
curve. The line misses most of the pattern.
In this case, the model doesn’t work well on either the training or testing data.
Underfitting : Straight line trying to fit a curved dataset but cannot capture the
data’s patterns, leading to poor performance on both training and test sets.
Overfitting: A squiggly curve passing through all training points, failing to
generalize performing well on training data but poorly on test data.
Appropriate Fitting: Curve that follows the data trend without overcomplicating
to capture the true patterns in the data
25. Apply the concept of p-values in a hypothesis testing scenario and interpret the
results using a suitable example. (Long Answer)
Ans:In hypothesis testing, the p-value helps determine the strength of evidence
against the null hypothesis (H₀). It represents the probability of obtaining results as
extreme as the observed, assuming H₀ is true.
Example Scenario:
A company claims that their battery lasts at least 10 hours on average. A researcher
believes the true average is less than 10 hours, and tests this by taking a sample of 30
batteries.
H₀ (null hypothesis): μ = 10 hours
H₁ (alternative hypothesis): μ < 10 hours
After testing, the researcher finds the sample mean is 9.6 hours, and the p-value is
0.03.
Interpretation:
If the significance level (α) is 0.05:
Since p = 0.03 < 0.05, we reject H₀.
There is significant evidence that the battery lasts less than 10 hours
on average.
If p > 0.05, we would fail to reject H₀, meaning the data doesn't provide strong
enough evidence to dispute the company's claim.
26. Use Bayes’s theorem to solve a probability-based problem and explain the
steps involved with a real-life example. (Long Answer)
Ans:
27. Analyze the application of Bayes’ Theorem in the Naive Bayes algorithm and
illustrate how the algorithm functions in a classification task using an example.
(Long Answer)
Ans:
28. Apply one-hot encoding and dummy variable techniques to a given
dataset and demonstrate their use with examples. (Long Answer)
Ans:One-Hot Encoding
Definition: One-hot encoding transforms each categorical value into a new binary
column, representing the presence (1) or absence (0) of each category.
Example dataset:
ID Color
1 Red
2 Blue
3 Green
4 Red
After one-hot encoding:
ID Color_Red Color_Blue Color_Green
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
Dummy Variable Encoding
Definition: Dummy encoding is a form of one-hot encoding that drops one column to
avoid the dummy variable trap, a situation where the predictors are perfectly
multicollinear (i.e., one column can be predicted from the others).
Dummy Variable Result (Red dropped):
ID Color_Blue Color_Green
1 0 0
2 1 0
3 0 1
4 0 0
Encoding Type and Use Case:
One-Hot Encoding: Ideal for tree-based models (e.g., Decision Trees, Random Forest,
XGBoost) that are not affected by multicollinearity.
Dummy Encoding: Preferred for linear models (e.g., Linear or Logistic Regression)
where multicollinearity can affect the model.
Real-World Application
In a customer churn prediction model, a feature like Contract Type = [Monthly,
One Year, Two Year] is categorical.
Encoding it using one-hot or dummy variables allows the model to understand
different contract types numerically and associate them with churn risk.
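A minimal pandas sketch of both encodings on the Color example above (note that drop_first drops the first category alphabetically, Color_Blue here, whereas the table above drops Red; either choice avoids the dummy variable trap):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4], 'Color': ['Red', 'Blue', 'Green', 'Red']})

one_hot = pd.get_dummies(df, columns=['Color'])                  # one binary column per category
dummy = pd.get_dummies(df, columns=['Color'], drop_first=True)   # one category dropped
print(one_hot)
print(dummy)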
MODULE 3
1. A company is using two different probability distributions to model customer
purchasing behavior. One follows a normal distribution, while the other follows a
Poisson distribution. Critique which distribution is more appropriate for modeling
daily purchase counts and justify your reasoning with statistical properties. (Short
Answer)
Ans:The Poisson distribution is more appropriate because it models count data—
specifically, the number of events (purchases) in a fixed time interval.
Poisson handles discrete, non-negative values, while Normal assumes continuous,
symmetric data.
Daily purchase counts typically follow Poisson, especially when the counts are low or
vary over time.
2. Define unconstrained multivariate optimization and provide an example. (Short
Answer)
Ans: Unconstrained multivariate optimization involves finding the minimum or
maximum of a function with multiple variables, without any constraints.
Example: Minimize f(x,y)=x^2+y^2
Solution: The minimum occurs at (0, 0), where the gradient is zero.
3. List two key differences between equality and inequality constraints in
optimization. (Short Answer)
Ans:Two key differences between equality and inequality constraints in
optimization:
Equality constraints (e.g., g(x) = 0) require exact satisfaction; inequality
constraints (e.g., h(x) ≤ 0) require a limit not to be exceeded.
Lagrange multipliers are used for equality constraints; Karush-Kuhn-Tucker
(KKT) conditions are used for inequality constraints.
4. Explain the role of the gradient in gradient descent optimization. (Short Answer)
Ans:The gradient points in the direction of steepest ascent of the function.
Gradient descent uses the negative gradient to iteratively update variables to
minimize the function.
It guides the model towards local minima step-by-step.
5. Describe how Lagrange multipliers are used to handle equality
constraints in optimization. (Short Answer)
Ans:Lagrange multipliers help solve constrained optimization problems by converting
them into an unconstrained form. Given a function f(x, y, ...) to maximize or minimize
under an equality constraint g(x, y, ...) = 0, the method introduces a multiplier λ
(Lagrange multiplier) to construct the Lagrangian function:
L(x,y,...,λ)=f(x,y,...)+λg(x,y,...)
By solving the gradient equations ∇L=0, we ensure that the constraint g(x, y, ...) = 0
is satisfied while optimizing f(x, y, ...).
6. Interpret what happens when the learning rate in gradient descent is set too high
or too low. (Short Answer)
Ans: Too high: The model may overshoot minima, fail to converge, or diverge.
Too low: Convergence becomes very slow, increasing computation time.
An optimal learning rate balances speed and stability.
7. A function has the form f(x, y) = x^2 + 2xy + y^2. Use the gradient descent
method to determine the direction of the steepest descent at the point (1,1).
Ans: The gradient is ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2x + 2y, 2x + 2y).
At the point (1, 1), ∇f(1, 1) = (4, 4).
The direction of steepest descent is the negative gradient, −∇f(1, 1) = (−4, −4), or (−1/√2, −1/√2) after normalization.
8. A company wants to minimize the cost function C(x,y)=x^2+y^2, subject to the
constraint that the total resource allocation satisfies x^2 + y^2 = 4. Use the Lagrange
multiplier method to find the optimal values of x and y.
Ans: Form the Lagrangian L(x, y, λ) = x^2 + y^2 + λ(x^2 + y^2 − 4).
Setting the partial derivatives to zero gives 2x(1 + λ) = 0, 2y(1 + λ) = 0, and x^2 + y^2 = 4.
Since x = y = 0 would violate the constraint, λ = −1, and every point on the circle x^2 + y^2 = 4 satisfies the conditions.
Because the objective coincides with the constraint expression, the cost is C = 4 at every feasible point (for example (2, 0), (0, 2), or (√2, √2)), so all such points are optimal.
9. A machine learning model is trained using gradient descent. Illustrate how the
model updates its weights iteratively using the learning rule. (Short Answer)
Ans: Weights w are updated iteratively using:
w_new = w_old − η · ∇L(w)
where η is the learning rate and ∇L(w) is the gradient of the loss function.
This process continues until the loss converges to a minimum.
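A minimal sketch of this update rule on a simple one-parameter loss (the loss function is hypothetical):
# Gradient descent on L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w = 0.0      # initial weight
eta = 0.1    # learning rate
for step in range(50):
    grad = 2 * (w - 3)      # gradient of the loss at the current weight
    w = w - eta * grad      # w_new = w_old - eta * gradient
print(w)   # converges toward the minimizer w = 3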
10. A function f(x, y) has multiple local minima. Analyze how different
initialization points in gradient descent impact convergence to the global minimum.
Ans: If f(x, y) has multiple local minima, gradient descent may converge to
different minima depending on the starting point.
Poor initialization may lead to local, not global, minima.
Multiple runs with different initial points or using stochastic approaches can help
find better minima.
11. Two optimization algorithms, Gradient Descent and Newton’s Method, are used
for unconstrained multivariate optimization. Compare their efficiency in terms of
convergence speed and computational cost, and justify which method would be more
suitable for large scale machine learning problems.
Ans:
Convergence Speed: Gradient Descent is slower (linear, and depends on the learning rate); Newton's Method is faster (quadratic, if near the minimum).
Computational Cost: Gradient Descent is low per iteration; Newton's Method is high (it requires the Hessian and a matrix inverse).
Suitability for Large Scale: Gradient Descent is better (less resource-intensive); Newton's Method is not ideal for large datasets.
Outlier Sensitivity: Gradient Descent is high (outliers skew the scale); Newton's Method is low (robust to outliers).
Overall, Gradient Descent is more suitable for large-scale machine learning, because computing and inverting the Hessian becomes infeasible as the number of parameters grows.
19. Using a Python library such as Seaborn or Matplotlib, generate a histogram for a
numerical feature in a dataset. Analyze the underlying distribution by examining its
shape, central tendency, and spread, and interpret what these characteristics reveal
about the data. (Long Answer)
Ans:import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset (example: Titanic)
df = sns.load_dataset('titanic')
# Plot histogram of 'age' feature
plt.figure(figsize=(8,4))
sns.histplot(data=df, x='age', bins=30, kde=True)
plt.title('Age Distribution of Titanic Passengers')
plt.show()
Distribution Analysis:
Shape:
Right-skewed (long tail to older ages)
Peaks around 20-30 years (young adults)
Central Tendency:
Median (~28) < Mean (~30) due to skew
Mode at 24 (most frequent age)
Spread:
Range: 0.4 to 80 years
IQR ~20-38 years (middle 50%)
Interpretation:
Majority were young adults (20s-30s)
Few children (<10) and elderly (>60)
Positive skew suggests most passengers were younger than the mean age
20. Analyze the concepts of overfitting and underfitting in machine learning
models. Compare their characteristics and examine their impact on model
performance using relevant examples. (Long Answer)
Ans:
Training Error: Overfitting shows very low (near zero) training error; underfitting shows high training error (poor fit even to the training data).
Cause: Overfitting arises from a model that is too complex (e.g., a high-degree polynomial), too many features, or a small dataset; underfitting arises from a model that is too simple (e.g., linear for non-linear data), too few features, or insufficient training.
Visual Sign: An overfit model fits the training data perfectly but is erratic elsewhere; an underfit model fails to fit even the training data (high bias).
21. Analyze how precision and recall contribute to evaluating model performance,
particularly in the context of imbalanced datasets. Examine scenarios where these
metrics provide more meaningful insights than accuracy. (Long Answer)
Ans:Precision and recall are critical metrics for evaluating model performance,
especially with imbalanced datasets.
Precision measures the proportion of true positive predictions out of all positive
predictions made by the model. It tells us how many of the predicted positive cases are
actually positive. This is important when false positives are costly (e.g., incorrectly
predicting a rare disease).
Recall measures the proportion of actual positive cases that the model correctly
identifies. It’s crucial when missing a positive case is costly (e.g., failing to detect a
fraud transaction).
Scenarios:
Medical Testing: Imagine a model that predicts whether a patient has a rare disease.
If the disease occurs in 1% of the population, the model could predict "no disease" for
everyone and still achieve 99% accuracy, missing every actual case of the disease.
Here, recall is more important—it's crucial that the model identifies as many true
cases of the disease as possible, even if it means accepting a few false positives (lower
precision).
Fraud Detection: In fraud detection, where fraudulent transactions make up a small
percentage of the total transactions, a model that predicts "no fraud" most of the time
could still have high accuracy. However, such a model wouldn't help in catching
fraudulent transactions. Here, both precision (to avoid false alarms) and recall (to
catch as many fraudulent transactions as possible) are more meaningful.
MODULE 4
1. Define logistic regression and mention one real-world application.
Ans:Logistic regression is a classification algorithm that models the probability
of a binary outcome using the sigmoid function.
Real-world application: Predicting whether a customer will buy a product
(yes/no) based on features like age and browsing behavior.
2. List the key assumptions of the k-nearest neighbor (k-NN) algorithm.
Ans:Similarity matters: Similar inputs have similar outputs (based on distance).
Feature scaling is important: Distance metrics are sensitive to scale.
No assumptions about data distribution: It's a non-parametric method.
3. Explain how the sigmoid function is used in logistic regression.
Ans:The sigmoid function transforms linear output into a probability between 0 and
1:
σ(z) = 1 / (1 + e^(−z))
where z = w^T x + b. If the probability > 0.5, it predicts class 1; otherwise, class 0.
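A minimal sketch of this computation (the weights, bias, and input below are hypothetical):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.8, -0.4])    # hypothetical weights
b = 0.1                      # hypothetical bias
x = np.array([2.0, 1.0])     # hypothetical feature vector

z = np.dot(w, x) + b         # linear score z = w^T x + b
p = sigmoid(z)               # probability of class 1
print(p, "-> predicted class", int(p > 0.5))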
13. Analyze how Support Vector Machines (SVM) separate data points into distinct
classes by identifying optimal hyperplanes. Examine the role of kernel functions in
transforming non-linearly separable data into a higher-dimensional space for
effective classification. (Long Answer)
Ans:Support Vector Machines (SVM) classify data by finding the optimal
hyperplane that best separates two classes. This hyperplane maximizes the margin
—the distance between the hyperplane and the nearest data points from each class,
called support vectors. A larger margin leads to better generalization and
classification accuracy. In cases where data is not linearly separable, kernel
functions can be used to transform the data into a higher-dimensional space where a
linear hyperplane can separate the classes effectively.
Role of Kernel Functions in SVM:
When data is not linearly separable in its original space, kernel functions enable
Support Vector Machines (SVM) to classify it effectively by transforming the data
into a higher-dimensional space where a linear separation becomes possible.
Key Points:
1. Transformation Without Explicit Computation (Kernel Trick):
Kernel functions compute the dot product of two data points in the transformed
space without explicitly mapping them.
This is known as the kernel trick, which saves computational effort and avoids
working directly in high-dimensional space.
2. How It Helps:
Transforms complex, non-linear boundaries into linearly separable ones in a
higher dimension.
Allows SVM to draw a linear hyperplane in this new space, which corresponds
to a non-linear boundary in the original space.
14. Analyze how Decision Trees determine splits at each node based on feature
selection criteria such as Information Gain or Gini Impurity. Examine how the
choice of splitting feature at each node impacts the tree’s accuracy, complexity, and
potential for overfitting. (Long Answer)
Ans:Decision Trees Splitting Criteria
Feature Selection with Information Gain or Gini Impurity:
Decision Trees choose the best feature to split a node based on criteria like
Information Gain (IG) or Gini Impurity.
Information Gain measures the reduction in entropy after a split.
Higher IG indicates a more informative split.
Gini Impurity measures how often a randomly chosen element would
be incorrectly classified. A lower Gini score is preferred.
Impact on Accuracy and Complexity :
The selected splitting feature directly affects the accuracy of the tree. Good splits lead
to purer nodes, improving classification or prediction performance. However, too
many splits can increase complexity, making the tree deeper and harder to interpret.
Potential for Overfitting :
If the model keeps splitting based on minor differences (high complexity), it may
overfit the training data and perform poorly on unseen data. Using pruning techniques
or setting depth limits helps reduce this risk.
15. Critically evaluate the problem of overfitting in Decision Trees and its impact on
model generalization. Justify how pruning techniques, such as pre-pruning and post-
pruning, can effectively reduce overfitting and improve the model’s performance on
unseen data. (Long Answer)
Ans:Overfitting in Decision Trees and Pruning Techniques :
Problem of Overfitting :
Overfitting occurs when a Decision Tree becomes too complex and captures
noise in the training data rather than the actual patterns. This results in high
training accuracy but poor generalization to new, unseen data.
Impact on Model Generalization:
An overfitted tree fails to perform well on test data, reducing its real-world
effectiveness. It may make overly specific rules that don't apply beyond the
training set.
Pruning Techniques to Reduce Overfitting :
Pre-Pruning (Early Stopping): Stops the tree from growing beyond a
certain depth or requires a minimum number of samples to split a
node. This prevents over-complex trees.
Post-Pruning (Reduced Error or Cost-Complexity Pruning): Allows
the full tree to grow and then removes branches that don't significantly
improve accuracy on validation data.
Both methods simplify the model, reducing overfitting and improving
performance on unseen data.
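A minimal scikit-learn sketch contrasting pre-pruning and post-pruning (the dataset and parameter values are illustrative choices):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: limit depth and minimum samples per split while growing the tree
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0).fit(X_train, y_train)
# Post-pruning: grow the tree, then prune weak branches via cost-complexity (ccp_alpha)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("Pre-pruned test accuracy:", pre.score(X_test, y_test))
print("Post-pruned test accuracy:", post.score(X_test, y_test))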
16. Analyze the role of distance metrics in the K-Nearest Neighbors (KNN)
algorithm and how they influence the classification or regression outcomes.
Examine why feature scaling through normalization or standardization is crucial
when using KNN, particularly in datasets with varying feature ranges. (Long
Answer).
Ans:Distance Metrics and Feature Scaling in KNN:
Role of Distance Metrics in KNN:
KNN relies on distance metrics to identify the 'K' nearest neighbors of a query
point. Common metrics include:
Euclidean Distance (default for continuous features)
Manhattan Distance (suitable for high-dimensional data)
Minkowski or Cosine Distance (for customized similarity)
These metrics determine how "close" other points are, directly influencing
classification or regression results.
Influence on Outcomes :
The choice of distance metric affects which neighbors are selected. Incorrect
or unbalanced metrics can lead to poor predictions by favoring irrelevant
features or distant points.
Importance of Feature Scaling:
When features have different ranges (e.g., age: 0–100 vs. income: 0–100,000),
larger-range features dominate the distance calculation.
Normalization (scales values to [0, 1])
Standardization (scales to mean = 0, std = 1)
These methods ensure all features contribute equally, making KNN
more accurate and fair.
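A minimal scikit-learn sketch showing KNN with and without standardization (the dataset choice is illustrative):
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, large-range features dominate the Euclidean distance
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# With standardization, every feature contributes comparably
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("Unscaled accuracy:", knn_raw.score(X_test, y_test))
print("Scaled accuracy:", knn_scaled.score(X_test, y_test))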
17. Analyze how Logistic Regression models probabilities to predict binary
outcomes. Examine how adjusting the decision threshold influences model
performance, particularly in handling imbalanced datasets. (Long Answer)
Ans: Logistic Regression and Decision Thresholds
Modeling Probabilities:
Logistic Regression predicts binary outcomes by modeling the probability that an
instance belongs to a class using the sigmoid function σ(z) = 1 / (1 + e^(−z)), where z = w^T x + b.
This maps any real-valued input to a value between 0 and 1, interpreted as the
probability of the positive class.
Decision Threshold :
A default threshold of 0.5 is typically used to classify outcomes:
If P ≥ 0.5, predict class 1.
If P < 0.5, predict class 0.
This threshold can be adjusted depending on the problem.
Impact on Imbalanced Datasets :
In imbalanced datasets (e.g., 95% negative, 5% positive), a 0.5 threshold may miss
minority cases (false negatives).
Lowering the threshold (e.g., to 0.3) increases sensitivity (recall) to detect more
positives.
Adjusting the threshold helps balance precision and recall, improving
performance on real-world, skewed data.
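A minimal sketch of adjusting the threshold via predicted probabilities (the synthetic imbalanced data is hypothetical):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical imbalanced data: roughly 10% positive class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X)[:, 1]          # probability of the positive class
pred_default = (probs >= 0.5).astype(int)     # default threshold
pred_lowered = (probs >= 0.3).astype(int)     # lowered threshold to catch more positives

print("Recall at threshold 0.5:", recall_score(y, pred_default))
print("Recall at threshold 0.3:", recall_score(y, pred_lowered))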
MODULE 5
1. A company is using k-means clustering for customer segmentation but is unsure
about the optimal number of clusters. Critique whether the Elbow Method or
Silhouette Score is a better approach to determine the optimum number of clusters
and justify your reasoning.
Ans:The Elbow Method and Silhouette Score are both valid techniques for finding
the optimal number of clusters in K-means.
Elbow Method plots the within-cluster sum of squares (WCSS) vs. number of
clusters. The "elbow" point suggests the best balance between model complexity
and compactness. However, it can be subjective and unclear if no obvious elbow
exists.
Silhouette Score measures how well each point fits within its cluster compared to
others. It ranges from -1 to 1, with higher scores indicating well-separated, dense
clusters.
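A minimal scikit-learn sketch computing both criteria over a range of k (the blob data stands in for real customer features):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # hypothetical data with 4 groups

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                      # within-cluster sum of squares (elbow criterion)
    sil = silhouette_score(X, km.labels_)   # silhouette score
    print("k =", k, " WCSS =", round(wcss, 1), " silhouette =", round(sil, 3))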
4. Explain the role of the R-squared (R^2) value in evaluating a regression model.
Ans:The R-squared (R²) value measures how well a regression model explains the
variance in the dependent variable.
It ranges from 0 to 1:
R² = 1 means perfect prediction.
R² = 0 means the model explains none of the variance.
Role:R² indicates the goodness of fit — the higher the R², the better the model
explains the data. However, it does not indicate causation or model correctness and
can be misleading if used alone, especially in overfitted models.
5. Describe how residual plots can help diagnose model fit issues in linear regression.
Ans:Residual plots show the difference between predicted and actual values
(residuals) vs. predicted values or input features.
How they help:
A random scatter of points suggests a good model fit.
Patterns or curves indicate non-linearity — the model may be missing
relationships.
Increasing or decreasing spread signals heteroscedasticity (non-constant
variance).
Outliers or clustering can reveal data issues or the need for a better model.
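A minimal matplotlib sketch of a residual plot (the data-generating process is hypothetical and deliberately non-linear):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + 0.3 * X.ravel() ** 2 + rng.normal(0, 2, 200)   # mild curvature plus noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residual plot (a curved pattern indicates missing non-linearity)')
plt.show()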
6. Interpret the purpose of cross-validation in regression model selection.
Ans:Cross-validation is used to assess how well a regression model generalizes to
unseen data.
Purpose:
It splits the dataset into training and validation sets multiple times (e.g., in k-fold
cross-validation).
Helps avoid overfitting by testing the model on different data subsets.
Provides a more reliable estimate of model performance compared to a single
train-test split.
Aids in model selection by comparing performance metrics (like RMSE or R²)
across different models.
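A minimal scikit-learn sketch of k-fold cross-validation for a regression model (synthetic data):
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
model = LinearRegression()

# 5-fold cross-validation; the default score for regressors is R^2
scores = cross_val_score(model, X, y, cv=5)
print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))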
7. A simple linear regression model is given by Y=3X+5. Calculate the predicted
value of Y when X=4.
Ans: Y = 3X + 5 = 3(4) + 5 = 12 + 5 = 17.
The predicted value of Y when X = 4 is 17.
8. Illustrate how to implement multiple linear regression in Python using the scikit-
learn library with a sample dataset.
Ans:from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate a synthetic dataset with two predictor features
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# Fit the multiple linear regression model
model = LinearRegression()
model.fit(X, y)
# Show the estimated coefficients and intercept
print(model.coef_, model.intercept_)
9.Given a dataset with multicollinearity, apply an appropriate regression
technique to reduce its impact and describe your approach.
Ans:To reduce multicollinearity in a dataset, we can apply Ridge Regression or
Principal Component Regression (PCR):
Ridge Regression: Adds an L2 regularization penalty to the regression model,
shrinking correlated variable coefficients to stabilize estimates.
Principal Component Regression (PCR): Uses Principal Component Analysis
(PCA) to transform correlated predictors into uncorrelated components, reducing
redundancy.
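A minimal sketch contrasting ordinary least squares and Ridge on two almost-identical predictors (the data is synthetic):
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # x2 is nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=200)

print("OLS coefficients:", LinearRegression().fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
# The L2 penalty shrinks the unstable, highly correlated coefficients toward more balanced values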
10.A researcher builds two multiple linear regression models with different feature
sets. Analyze how cross-validation can help determine which model is more reliable.
Ans:Cross-validation helps compare the reliability and performance of two
multiple linear regression models with different feature sets.
How it helps:
Both models are evaluated on multiple data splits (e.g., k-fold CV), reducing
bias from a single train-test split.
Metrics like Mean Squared Error (MSE) or R² are averaged across folds.
The model with lower average error and less variance in scores is considered
more reliable.
Helps identify if extra features improve performance or just cause overfitting
11.A company is using Akaike Information Criterion (AIC) and Adjusted R-squared
to select the best multiple regression model. Critique which metric is more effective
for model selection and justify your reasoning.
Ans:Both Akaike Information Criterion (AIC) and Adjusted R-squared are useful
for model selection, but they serve slightly different purposes.
AIC balances model fit and complexity. It penalizes the number of predictors to
avoid overfitting. Lower AIC indicates a better model. It is more effective when
comparing non-nested models and focuses on predictive accuracy.
Adjusted R-squared adjusts R² for the number of predictors. It increases only if a
new feature improves the model more than by chance. However, it is less
sensitive to overfitting compared to AIC.
12. Compare and contrast the advantages and disadvantages of ensemble
methods like Bagging, Boosting, and Stacking. (Long Answer)
Ans:
Bagging (e.g., Random Forest): Advantages: reduces variance and overfitting, easy to parallelize, improves stability. Disadvantages: less improvement if base learners are strong, can be computationally expensive with many trees.
Boosting (e.g., AdaBoost, XGBoost): Advantages: reduces bias and variance, focuses on hard-to-predict examples, often achieves high accuracy. Disadvantages: sensitive to noisy data and outliers, sequential training is slower and harder to parallelize.
Stacking: Advantages: combines diverse models for better performance, flexible and can use any base learners. Disadvantages: complex to implement and tune, risk of overfitting if the meta-model is not carefully chosen.
13. Examine the trade-offs between bias and variance in machine learning models
and their effect on overfitting and underfitting. (Long Answer)
Ans:Bias: Bias refers to errors from overly simple models that underfit data, missing
important patterns. High bias leads to poor training and test accuracy.
Variance refers to errors from models that are too complex and overfit training
data, capturing noise. High variance leads to good training but poor test accuracy.
Trade-off:
Reducing bias usually increases variance and vice versa. The goal is to find a balance
that minimizes total error.
Effect on Overfitting/Underfitting:
High bias → Underfitting (model too simple)
High variance → Overfitting (model too complex)
14. Evaluate the performance of different classification models (e.g., Logistic
Regression, Random Forest, and SVM) on a given dataset using precision, recall,
and F1-score. (Long Answer)
Ans:Overview of Metrics:
Precision: Measures the proportion of true positive predictions among all positive
predictions.
Precision = TP / (TP + FP)
High precision means fewer false positives.
Recall (Sensitivity): Measures the proportion of true positives identified out of actual
positives.
Recall = TP / (TP + FN)
High recall means fewer false negatives.
F1-score: The harmonic mean of precision and recall, balancing the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Useful when you want a balance between precision and recall, especially in
imbalanced datasets.
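A minimal scikit-learn sketch comparing the three models with these metrics (the synthetic dataset is a stand-in for a real one):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), digits=3))   # precision, recall, F1 per class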
15. Analyze the impact of hyperparameter tuning in Random Search in deep
learning models. (Long Answer)
Ans: Efficiency: Random Search explores the hyperparameter space by sampling combinations randomly, often finding good configurations faster than exhaustive grid search, especially when only a few hyperparameters matter.
Improved Model Performance: Proper tuning of parameters like learning rate, batch size, and dropout via Random Search can significantly enhance model accuracy and generalization by optimizing training dynamics.
Better Exploration: Random Search can discover non-intuitive hyperparameter values by sampling broadly, increasing the chances of escaping local optima in complex deep learning models.
Scalability: It handles high-dimensional hyperparameter spaces efficiently, making it suitable for deep learning models with many parameters.
Limitations: Despite its efficiency, Random Search requires multiple training runs, which can be computationally expensive, and results depend on the chosen parameter ranges.
16. A regression analysis between apples (y) and oranges (x) resulted in the
following least squares line: y = 100 + 2x. Predict the implication if oranges are
increased by 1 (Long Answer)