Data Science

​ ​ ​ ​ ​ 20 Batch

1 mark question:
a) Data science vs. traditional statistics: Traditional statistics
focuses on developing and applying methods for inference and
explanation on often smaller, structured datasets. Data science is a
broader, multidisciplinary field using statistical and computational
techniques to extract knowledge and build predictive models from
large, diverse data.

b) What is a statistical model? A statistical model is a mathematical formulation that describes the assumed relationships between variables in a dataset, including parameters and a probability distribution for the error or outcome. It is used for description, inference, and prediction.

c) Major types of prediction problems? Prediction problems include classification, where the goal is to assign data points to predefined categories, and regression, where the aim is to predict a continuous numerical value. Other types include time series forecasting for sequential data and ranking for ordering items.

d) What is structured data? Structured data is information organized with a defined schema, typically in rows and columns within databases or spreadsheets. This format allows for efficient querying and analysis due to its clear and consistent organization of data elements and their types.

e) Why is data encoding needed? Data encoding is necessary to transform non-numerical data, such as categorical or textual information, into a numerical format that machine learning algorithms can process. This allows algorithms to learn patterns and relationships from all types of data effectively.

f) What is the irreducible error? Irreducible error is the inherent level of noise or randomness in the system being modeled that cannot be reduced by any model. It arises from factors like unmeasured variables or natural variability and sets a lower bound on the achievable prediction error.

g) What is an API? An Application Programming Interface (API) is a set of protocols and tools that allows different software applications to communicate and exchange data. It defines the methods and data formats that applications can use to request and share information or functionalities.

h) What is exploratory data analysis (EDA)? Exploratory Data Analysis (EDA) is a preliminary approach to analyzing data using visual and statistical techniques to summarize its main characteristics, identify patterns, detect anomalies, and formulate hypotheses. It helps in understanding the data's structure and potential issues before formal modeling.

a) What is unsupervised learning? Unsupervised learning is a type of machine learning where algorithms learn patterns and structures from unlabeled data, without explicit output guidance. The goal is to discover hidden relationships, groupings, or dimensionality reductions within the data itself.

b) What is clustering? Clustering is an unsupervised learning technique that groups similar data points together based on their intrinsic characteristics. The algorithm aims to partition the data into distinct clusters such that data points within a cluster are more similar to each other than to those in other clusters.
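
A minimal clustering sketch using scikit-learn's KMeans on synthetic data; the three-cluster setting and the generated dataset are assumptions for illustration only:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groupings (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means with an assumed K of 3 and inspect the resulting grouping
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", np.bincount(labels))
print("Cluster centers:\n", kmeans.cluster_centers_)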

c) How do you decide K in the KNN algorithm? The value of K (the number of neighbors) in the K-Nearest Neighbors (KNN) algorithm is typically chosen using techniques like cross-validation. Different K values are tested, and the one that yields the best performance (e.g., highest accuracy or lowest error) on unseen validation data is selected. Domain knowledge and the dataset's characteristics can also provide initial guidance.
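
A minimal sketch of choosing K by cross-validation with scikit-learn; the candidate K values and the iris dataset are assumptions for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several candidate values of K and keep the one with the best
# mean cross-validated accuracy on the held-out folds.
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Mean CV accuracy per K:", scores)
print("Chosen K:", best_k)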

d) What is ensemble learning? Ensemble learning is a machine learning technique that combines the predictions of multiple individual models to make a final, often more accurate and robust prediction. This approach leverages the strengths of different models and reduces their individual weaknesses, leading to improved generalization.

e) What is the P-value? In statistical hypothesis testing, the P-value is the probability of observing data as extreme as, or more extreme than, the data obtained, assuming the null hypothesis is true. A small P-value (typically below a chosen significance level, e.g., 0.05) provides evidence against the null hypothesis.

f) What is logistic regression? Logistic regression is a statistical model used for binary classification problems, predicting the probability of a binary outcome (e.g., yes/no, 0/1). It models the relationship between a set of independent variables and the log-odds of the dependent variable, using a sigmoid function to constrain the output probability between 0 and 1.
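
A minimal sketch of fitting a logistic regression classifier with scikit-learn; the breast-cancer dataset and the settings used are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model learns weights on the log-odds scale; predict_proba applies
# the sigmoid to map them back to probabilities between 0 and 1.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
print("P(class=1) for first test sample:", model.predict_proba(X_test[:1])[0, 1])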

g) What is the use of data cleaning? Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. It is crucial for ensuring data quality, improving the reliability of analyses and models, and preventing misleading conclusions that could arise from flawed data.

h) What is imputation? Imputation is a technique used to handle missing values in a dataset by replacing them with estimated or plausible values. Common methods include using the mean, median, or mode of the variable, or employing more sophisticated statistical modeling techniques to predict the missing values.


Question no. 2 answers: 2.5 marks each

b) What are the most common problems with data and how to handle them? Common data problems and their handling strategies include:

● Missing Values: Data points with absent values. Handling: Imputation (replacing with mean, median, mode, or using predictive models), deletion of rows or columns (if missingness is extensive and random), or using algorithms that can inherently handle missing values.
●​ Outliers: Data points that significantly deviate from the rest of
the data. Handling: Identification through visualization (box
plots, scatter plots) and statistical methods (e.g., Z-score).
Treatment involves removal (with caution), transformation
(e.g., log transformation), or using robust models less
sensitive to outliers.
●​ Inconsistent Formatting: Data represented differently across
sources or within the same dataset (e.g., different date
formats, inconsistent capitalization). Handling: Standardizing
formats using programming techniques and data manipulation
tools.
●​ Duplicate Records: Identical or near-identical data entries.
Handling: Identifying and removing duplicate records based
on defined criteria.
●​ Incorrect Data Types: Data stored in the wrong format (e.g.,
numerical data stored as strings). Handling: Converting data
to the appropriate data types.
●​ Biased Data: Data that does not accurately represent the
population of interest. Handling: Requires careful data
collection strategies and awareness of potential biases during
analysis and modeling. Mitigation might involve re-sampling
techniques or using models designed to handle bias.
●​ Noisy Data: Data containing random errors or irrelevant
information. Handling: Smoothing techniques, filtering, or
using models that are robust to noise.
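
A short pandas sketch illustrating a few of these fixes on hypothetical data (the column names, values, and Z-score convention are assumptions for illustration):

import numpy as np
import pandas as pd

# Hypothetical raw data showing several of the problems listed above
df = pd.DataFrame({
    "city":  ["Dhaka", "dhaka", "Dhaka ", "Chittagong"],  # inconsistent capitalization/whitespace
    "price": ["10.5", "11.0", "10.5", "1000.0"],           # numbers stored as strings, one extreme value
})

df["city"] = df["city"].str.strip().str.title()   # standardize text formatting
df["price"] = df["price"].astype(float)           # convert to the correct data type
df = df.drop_duplicates()                         # remove duplicate records

# Inspect Z-scores; values far from the rest suggest possible outliers
z = (df["price"] - df["price"].mean()) / df["price"].std()
print(df.assign(z_score=z.round(2)))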

c) What do you want your visualization to show about your data? Name the commonly used visualization techniques for each type. The purpose of data visualization is to communicate insights effectively. What you want to show depends on the data and the question you're trying to answer:

● Distribution of a single variable: Understanding the spread and central tendency.
○ Techniques: Histograms (numerical), Box plots (numerical), Density plots (numerical), Bar charts (categorical), Pie charts (categorical - use with caution).
●​ Relationship between two or more variables: Identifying
correlations or dependencies.
○​ Techniques: Scatter plots (numerical vs. numerical),
Line plots (numerical over time), Bar charts (categorical
vs. numerical), Stacked bar charts (categorical vs.
numerical showing proportions), Heatmaps (correlation
matrices).
●​ Comparison of values across categories or groups:
Highlighting differences.
○​ Techniques: Bar charts, Grouped bar charts, Box plots,
Violin plots.
●​ Trends over time: Observing changes and patterns over a
period.
○​ Techniques: Line plots, Time series plots, Area charts.
●​ Geographical data: Showing spatial patterns.
○​ Techniques: Choropleth maps, Scatter maps.
●​ Hierarchical data or part-to-whole relationships:
Visualizing structures and proportions.
○​ Techniques: Pie charts (part-to-whole), Treemaps
(hierarchical proportions), Sunburst charts (hierarchical).

The choice of visualization technique depends on the data type(s) involved and the specific insight you want to convey. Effective visualizations are clear, concise, and accurately represent the underlying data.
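
A brief matplotlib/seaborn sketch showing three of these chart types; the "tips" example dataset (fetched by seaborn) and the column choices are assumptions for illustration:

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small example dataset shipped with seaborn

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(tips["total_bill"], ax=axes[0])                      # distribution of one variable
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])   # relationship between two variables
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[2])       # comparison across categories
plt.tight_layout()
plt.show()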

d) What are the benefits and limitations of using crowdsourced data? Crowdsourced data is information gathered from a large, distributed group of individuals, often online.

Benefits:

● Large Volume and Scale: Can often provide massive amounts of data that would be difficult or expensive to collect through traditional methods.
●​ Diversity of Perspectives: Can capture a wide range of
opinions, experiences, and observations from a diverse
population.
●​ Cost-Effective: Can be a relatively inexpensive way to gather
large datasets compared to surveys or experiments.
●​ Real-time Data Collection: In some cases, data can be
collected rapidly and in real-time.
●​ Coverage of Niche Areas: Can provide insights into
specialized topics or reach specific demographics that might
be challenging through other means.

Limitations:

● Data Quality Issues: Can suffer from inconsistencies, inaccuracies, biases, and spam due to the lack of strict control over data collection.
●​ Representativeness and Bias: The participants may not be
representative of the broader population, leading to biased
results. Certain demographics might be over- or
under-represented.
●​ Ethical Concerns: Issues related to privacy, consent, and the
potential for misuse of the collected data need careful
consideration.
●​ Lack of Standardization: Data formats and quality can vary
significantly across different contributors.
●​ Difficulty in Validation: Verifying the accuracy and reliability
of the contributed information can be challenging.
●​ Potential for Manipulation: Can be susceptible to
manipulation or coordinated efforts to skew the results.

e) What is hypothesis testing? Define null and alternative hypotheses. Hypothesis testing is a statistical method used to determine whether there is enough evidence in sample data to infer that a certain condition is true for the entire population. It involves formulating two competing hypotheses and then using the data to decide which hypothesis is more likely to be true.

● Null Hypothesis (H₀): This is a statement of no effect or no difference. It represents the status quo or a commonly accepted belief. The goal of hypothesis testing is often to try to disprove or reject the null hypothesis.
○​ Example: "The average height of men and women is the
same."
● Alternative Hypothesis (H₁ or Hₐ):
This is a statement that contradicts the null hypothesis. It
represents what the researcher is trying to find evidence for. It
suggests that there is a significant effect or difference. The
alternative hypothesis can be one-tailed (directional, e.g.,
"men are taller than women") or two-tailed (non-directional,
e.g., "the average height of men and women is different").
○​ Example (one-tailed): "The average height of men is
greater than the average height of women."
○​ Example (two-tailed): "The average height of men is
different from the average height of women."

The process of hypothesis testing involves calculating a test statistic from the sample data and then determining the probability (P-value) of observing such a statistic (or one more extreme) if the null hypothesis were true. Based on this P-value and a chosen significance level (alpha), a decision is made to either reject or fail to reject the null hypothesis.
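
A minimal sketch of this procedure using SciPy's two-sample t-test; the simulated height data and the 0.05 significance level are assumptions for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
men = rng.normal(loc=175, scale=7, size=100)     # simulated heights (cm)
women = rng.normal(loc=165, scale=6, size=100)

# H0: the mean heights are equal; H1 (two-tailed): they differ
t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the sample means differ significantly.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")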

f) What are the common ways to handle missing values in data? Handling missing values is a crucial step in data preprocessing. Common methods include:

●​ Deletion:
○​ Listwise Deletion (Row Removal): Removing entire
rows (observations) that contain one or more missing
values. This can lead to a significant loss of data if
missingness is widespread.
○​ Column Removal: Removing entire columns (variables)
if they have a very high percentage of missing values
and are deemed less critical.
●​ Imputation (Replacement): Filling in the missing values with
estimated values.
○​ Simple Imputation:
■​ Mean Imputation: Replacing missing numerical
values with the mean of the non-missing values in
that column.
■​ Median Imputation: Replacing missing numerical
values with the median. Less sensitive to outliers
than the mean.
■​ Mode Imputation: Replacing missing categorical
values with the most frequent category.
○​ Statistical Imputation: Using statistical models to
predict the missing values based on other variables.
■​ Regression Imputation: Using regression models
to predict missing numerical values.
■​ Multiple Imputation: Creating multiple plausible
estimates for the missing values, analyzing each
completed dataset, and then pooling the results to
account for the uncertainty of the imputed values.
○​ Nearest Neighbor Imputation: Using the values from
the most similar data points to impute the missing
values.
●​ Using Algorithms that Handle Missing Values: Some
machine learning algorithms (e.g., certain tree-based
methods) can inherently handle missing values without
requiring explicit imputation.
●​ Creating a Missing Value Indicator: Adding a binary
indicator variable to denote whether a value was originally
missing. This can help the model learn if the missingness itself
is informative.

The choice of method depends on the amount and pattern of missing data, the type of variable, and the goals of the analysis. It's important to carefully consider the potential biases introduced by each method.
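
A short pandas/scikit-learn sketch combining median imputation with a missing-value indicator; the column names and values are hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 45_000]})

# Record which values were originally missing (the missingness itself may be informative)
df["age_missing"] = df["age"].isna().astype(int)

# Replace missing numerical values with the column median
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
print(df)
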
a) Explain the role of p-values in hypothesis testing.
In hypothesis testing, the p-value quantifies the evidence against
the null hypothesis. Specifically, it is the probability of observing
data as extreme as, or more extreme than, the data actually
obtained, assuming the null hypothesis is true.

The role of the p-value is to help researchers make a decision about whether to reject or fail to reject the null hypothesis. A small p-value (typically less than a predetermined significance level, denoted as α, often 0.05) suggests that the observed data is unlikely if the null hypothesis were true. This provides evidence to reject the null hypothesis in favor of the alternative hypothesis.

Conversely, a large p-value (greater than α) indicates that the observed data is reasonably likely under the null hypothesis, and therefore there is not enough evidence to reject it. It's crucial to understand that failing to reject the null hypothesis does not mean it is true, only that there isn't sufficient evidence to disprove it based on the current data. The p-value acts as a measure of the strength of the evidence against the null hypothesis.

b) What is ensemble learning? How does it improve predictive accuracy compared to individual models? Ensemble learning is a machine learning technique that combines the predictions of multiple individual "base" models (e.g., decision trees, support vector machines, neural networks) to make a final, often more accurate and robust prediction. The core idea is that a group of diverse models can collectively outperform any of its individual constituents.

Ensemble learning improves predictive accuracy compared to individual models through several mechanisms:
●​ Variance Reduction (Bagging): Techniques like Bagging
(Bootstrap Aggregating) train multiple models on different
subsets of the training data (sampled with replacement).
Averaging (for regression) or majority voting (for classification)
the predictions of these models reduces the variance of the
final prediction, leading to more stable and less overfit models.
●​ Bias Reduction (Boosting): Techniques like Boosting (e.g.,
AdaBoost, Gradient Boosting) train models sequentially, with
each new model focusing on correcting the errors made by the
previous ones. This iterative process reduces the bias of the
ensemble, allowing it to capture more complex relationships in
the data.
●​ Improved Generalization: By combining diverse models,
ensembles are less likely to overfit to the specific nuances of
the training data and tend to generalize better to unseen data.
Different models might capture different aspects of the
underlying data distribution, and their combined prediction can
be more comprehensive.
●​ Increased Robustness: Ensembles are often more robust to
noisy data or outliers, as the errors of individual models may
cancel each other out in the aggregation process.
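
A minimal sketch comparing a single decision tree with a bagging-style ensemble (Random Forest) and a boosting-style ensemble (Gradient Boosting) using scikit-learn; the dataset and default settings are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# The ensembles typically score higher and more consistently than the single tree
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
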
c) Define entropy and information gain. Entropy is a measure of the impurity or uncertainty of a set of outcomes, defined as H(S) = -\sum_i p_i \log_2(p_i), where p_i is the proportion of items belonging to class i. Information gain is the reduction in entropy achieved by splitting a dataset on a particular attribute: the entropy of the parent set minus the weighted average entropy of the resulting subsets. Decision tree algorithms such as ID3 choose the split with the highest information gain.

e) What is bootstrapping? Bootstrapping is a resampling
technique used to estimate the sampling distribution of a statistic
(e.g., mean, median, standard deviation, regression coefficients) by
repeatedly sampling with replacement from the original dataset.

The process involves:

1. Creating many (often thousands of) new datasets (bootstrap samples) by randomly selecting data points from the original dataset with replacement. This means that some data points from the original dataset may appear multiple times in a bootstrap sample, while others may not appear at all. Each bootstrap sample has the same size as the original dataset.
2. Calculating the statistic of interest on each of the bootstrap samples.
3. Estimating the sampling distribution of the statistic by looking at the distribution of the calculated statistics from all the bootstrap samples.

Bootstrapping is a powerful tool for quantifying the uncertainty associated with a statistic, estimating confidence intervals, and assessing the stability of a model without making strong assumptions about the underlying population distribution.
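
A minimal NumPy sketch of a bootstrap estimate of a 95% confidence interval for the mean; the simulated sample and the number of resamples are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)   # original sample (illustrative)

# Draw many bootstrap samples (with replacement, same size as the original)
# and record the statistic of interest on each one
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(5000)])

# A 95% percentile confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.3f}")
print(f"95% bootstrap CI for the mean: ({lower:.3f}, {upper:.3f})")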

f) What is overfitting? What are the causes of overfitting? Overfitting occurs when a machine learning model learns the training data too well, including the noise and random fluctuations present in that specific dataset. As a result, the model performs very well on the training data but fails to generalize effectively to new, unseen data, leading to poor performance on test or real-world data.

Causes of overfitting:
●​ High Model Complexity: Using a model with too many
parameters relative to the size of the training data. Complex
models have the capacity to memorize the training data,
including its noise.
●​ Insufficient Training Data: When the training dataset is too
small, the model might learn spurious correlations or patterns
that are specific to that limited sample and do not generalize.
●​ Learning Noise: The model learns not only the underlying
signal in the data but also the random errors or noise present
in the training set.
●​ Over-training: Training a model for too many epochs or
iterations, allowing it to become overly specialized to the
training data.
●​ Presence of Irrelevant Features: Including features in the
model that do not have a true relationship with the target
variable can introduce noise and lead to overfitting.
●​ Lack of Cross-Validation: Not using proper validation
techniques to assess the model's generalization ability during
training can lead to unknowingly selecting an overfit model.

d) Write the general equation of a polynomial regression model with K features and degree M. In one common additive form (omitting interaction terms between features), the model can be written as

y = \beta_0 + \sum_{j=1}^{K} \sum_{m=1}^{M} \beta_{jm} x_j^m + \epsilon

where \beta_0 is the intercept, \beta_{jm} is the coefficient of the m-th power of feature x_j, and \epsilon is the error term.

3 a) What is cross-validation? Briefly discuss how K-fold cross-validation is done for different purposes.

Cross-validation is a robust technique used in machine learning to assess how well a model will generalize to an independent, unseen dataset. It's a crucial step in evaluating model performance and preventing overfitting. Instead of just splitting the data into a single training and testing set, cross-validation involves partitioning the data into multiple subsets and iteratively training and evaluating the model on different combinations of these subsets. This provides a more reliable estimate of the model's true performance.

K-fold cross-validation is a specific type of cross-validation where the original dataset is randomly divided into K equally sized, non-overlapping partitions or "folds." The process is then repeated K times, with each fold serving as the validation (or test) set for one iteration, while the remaining K−1 folds are used as the training set.

Here's a step-by-step breakdown of how K-fold cross-validation is typically performed:
1.​Partitioning: The entire dataset is randomly shuffled and then
split into K folds.
2.​Iteration: For each of the K iterations (from 1 to K):
○​ One fold is selected as the validation set (or test set for
that iteration).
○​ The remaining K−1 folds are combined to form the
training set.
○​ A model is trained on the training set.
○​ The trained model is evaluated on the validation set
using a chosen performance metric (e.g., accuracy,
precision, recall, F1-score for classification; mean
squared error, R-squared for regression).
3.​Performance Aggregation: After all K iterations, the
performance metrics obtained from each validation set are
averaged to provide a single, more robust estimate of the
model's generalization performance.

How K-fold cross-validation is done for different purposes:


●​ Model Evaluation and Selection: This is the most common
use of K-fold cross-validation. By obtaining an average
performance across multiple train-test splits, it provides a
more reliable estimate of how well the model is likely to
perform on unseen data compared to a single train-test split,
which can be heavily influenced by how the data is partitioned.
This allows for a fairer comparison of different models or
different hyperparameter settings for the same model. For
model selection, we would train and evaluate multiple models
(or hyperparameter configurations) using the same
cross-validation strategy and choose the one with the best
average performance.​

● Hyperparameter Tuning: K-fold cross-validation is often integrated into hyperparameter tuning techniques like GridSearchCV or RandomizedSearchCV. For each combination of hyperparameters being evaluated:

○ The training data (from the initial train-test split or the entire dataset if no final test set is held out) is subjected to K-fold cross-validation.
○​ The model is trained and evaluated on each of the K
folds using the specific hyperparameter settings.
○​ The average performance across the folds is calculated.
○​ The hyperparameters that yield the best average
performance through cross-validation are then selected.
Finally, the chosen model with the optimal
hyperparameters is typically trained on the entire training
set and evaluated on a separate held-out test set.
●​ Assessing Model Robustness: By evaluating the model's
performance across different subsets of the data, K-fold
cross-validation can provide insights into the stability and
robustness of the model. If the performance varies
significantly across the folds, it might indicate that the model is
sensitive to the specific training data it receives, potentially
suggesting overfitting or data issues.​

● Small Datasets: K-fold cross-validation is particularly useful when dealing with small datasets. A single train-test split might
result in a very small training set, leading to poor model
training, or a very small test set, resulting in an unreliable
performance estimate. By using K-fold cross-validation, we
can utilize all the data for both training and evaluation over the
K iterations, providing a more stable performance
assessment.​

In summary, K-fold cross-validation is a powerful and widely used technique for evaluating machine learning models, tuning hyperparameters, and assessing model robustness. Its iterative nature of training and evaluating on different subsets of the data provides a more reliable estimate of a model's ability to generalize to unseen data, which is the ultimate goal in machine learning.
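
A sketch of K-fold cross-validation used inside hyperparameter tuning via scikit-learn's GridSearchCV; the parameter grid, model, and dataset are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold CV on the training data for every hyperparameter combination
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
# Final check of the refit best model on the held-out test set
print("Held-out test accuracy:", round(search.score(X_test, y_test), 3))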

​ ​ ​ ​ 19 batch

1 mark:
a) What is Prior probability and Likelihood?
●​ Prior Probability: The probability of an event before any new
evidence or data is observed. It represents our initial belief
about the event.
●​ Likelihood: The probability of observing the given data
assuming a specific hypothesis or parameter value is true. It
quantifies how well a particular hypothesis explains the
observed data.
b) What is Normal Distribution?

The Normal Distribution (or Gaussian distribution) is a continuous probability distribution characterized by its bell-shaped curve. It's symmetrical around the mean, with the mean, median, and mode being equal. It's defined by its mean (μ) and standard deviation (σ).

c) What is a Confidence Interval?

A Confidence Interval is a range of values, calculated from sample data, that is likely to contain the true value of a population parameter with a certain degree of confidence (e.g., 95%). It provides a measure of the uncertainty around a point estimate.

d) What is Overfitting?

Overfitting occurs when a machine learning model learns the training data too well, including the noise and random fluctuations. As a result, it performs very well on the training data but poorly on new, unseen data because it has memorized the specifics of the training set.

e) What does P-value signify about the statistical data?

The P-value is the probability of observing data as extreme as, or more extreme than, the data obtained, assuming the null hypothesis is true. A small p-value suggests that the observed data is unlikely under the null hypothesis, providing evidence against it.

f) What is Regularization?

Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function. This penalty discourages the model from learning overly complex relationships and reduces the magnitude of the model's coefficients, leading to better generalization. Common types include L1 and L2 regularization.
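
A brief sketch contrasting ordinary least squares with L2 (Ridge) and L1 (Lasso) regularization in scikit-learn; the synthetic data and alpha values are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2), alpha=1.0", Ridge(alpha=1.0)),
                    ("Lasso (L1), alpha=1.0", Lasso(alpha=1.0))]:
    model.fit(X, y)
    coefs = model.coef_
    # Regularization shrinks coefficients; L1 can drive some exactly to zero
    print(f"{name}: max |coef| = {np.abs(coefs).max():.1f}, "
          f"zero coefs = {(coefs == 0).sum()}")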

g) Define Gradient Descent.

Gradient Descent is an iterative optimization algorithm used to find the minimum of a function (often a loss function in machine learning). It works by repeatedly taking steps in the direction of the negative gradient of the function at the current point, which indicates the direction of the steepest decrease. The step size is controlled by the learning rate.
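
A minimal sketch of gradient descent minimizing a simple one-dimensional function; the function, starting point, learning rate, and number of steps are arbitrary illustrative choices:

# Minimize f(w) = (w - 3)^2 with plain gradient descent
def grad(w):
    return 2 * (w - 3)   # derivative of (w - 3)^2

w = 0.0
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * grad(w)   # step in the direction of the negative gradient

print(f"Estimated minimum at w = {w:.4f} (true minimum is w = 3)")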

h) What is a Null Hypothesis?

A Null Hypothesis (H₀) is a statement of no effect or no difference in statistical hypothesis testing. It represents the status quo or a commonly accepted belief that researchers aim to disprove or reject based on the evidence from their data.

a) What is the purpose of using cross-validation?

The purpose of using cross-validation is to evaluate how well a machine learning model will generalize to an independent, unseen dataset. It provides a more reliable estimate of the model's performance than a single train-test split by training and evaluating the model on multiple subsets of the data.

b) What is Ensemble Learning?

Ensemble Learning is a machine learning technique that combines the predictions of multiple individual models to make a final, often more accurate and robust prediction. It leverages the strengths of different models and reduces their individual weaknesses.

c) What is Curse of Dimensionality?

The Curse of Dimensionality refers to the challenges and issues that arise when dealing with high-dimensional data (data with a large number of features). These issues can include increased data sparsity, higher computational costs, and the risk of overfitting.

d) What is an Outlier?

An Outlier is a data point that significantly deviates from the other data points in a dataset. It is an observation that appears to be inconsistent with the general pattern of the data and can arise due to errors, anomalies, or genuine extreme values.

e) What is the Log-Odds ratio of a forged coin that, when tossed, lands on heads 80% of the time?

The probability of heads (P(Heads)) is 0.80.

The probability of tails (P(Tails)) is 1 - 0.80 = 0.20.

The odds of heads are P(Heads) / P(Tails) = 0.80 / 0.20 = 4.

The Log-Odds ratio is the natural logarithm of the odds: ln(4) ≈ 1.386.

f) What is a Scatter plot used for?

A Scatter plot is a type of data visualization used to display the relationship between two numerical variables. Each point on the plot represents a pair of values for the two variables, allowing for the visual identification of correlations, patterns, and potential outliers.
g) Calculate the Entropy of rolling a fair 6-sided die.

For a fair 6-sided die, the probability of each outcome (1, 2, 3, 4, 5, 6) is 1/6. The entropy (H) is calculated as:

H = -\sum_{i=1}^{6} p_i \log_2(p_i) = -\sum_{i=1}^{6} \frac{1}{6} \log_2\left(\frac{1}{6}\right)

H = -6 \times \left(\frac{1}{6} \times (-\log_2 6)\right) = \log_2 6 \approx 2.585 \text{ bits}
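
The same calculation as a short Python check:

import math

# Entropy of a fair six-sided die, matching the calculation above
probs = [1 / 6] * 6
entropy = -sum(p * math.log2(p) for p in probs)
print(f"Entropy = {entropy:.3f} bits")   # log2(6) ≈ 2.585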

h) Define Eigen Value and Eigen Vector.


●​ Eigen Value: For a square matrix A, an eigenvalue (λ) is a
scalar such that for a non-zero vector v, the equation Av=λv
holds. It represents the factor by which the eigenvector is
scaled when transformed by the matrix.
●​ Eigen Vector: For a square matrix A and an eigenvalue λ, an
eigenvector (v) is a non-zero vector that, when multiplied by
the matrix A, results in a vector that is a scalar multiple (λ) of
itself. It represents a direction that remains unchanged (or just
scaled) by the linear transformation represented by the matrix.
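
A short NumPy sketch verifying the defining relation Av = λv; the example matrix is an assumption for illustration:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # example symmetric matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)           # 3 and 1 for this matrix
print("Eigenvectors (columns):\n", eigenvectors)

# Check A v = lambda v for the first eigenpair
v = eigenvectors[:, 0]
print("A @ v:      ", A @ v)
print("lambda * v: ", eigenvalues[0] * v)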

2.5 mark:
b) What do you understand by the term Loss in Data Science?
What are the common ways to calculate loss?

In Data Science, loss (or error) refers to a measure of how well a machine learning model's predictions align with the actual target values in the training data. It quantifies the discrepancy between the predicted output and the true output for a given data point. The goal of training a model is to minimize this loss.

Common ways to calculate loss include:

● Mean Squared Error (MSE): Used for regression problems, it calculates the average of the squared differences between the predicted and actual values. \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
● Mean Absolute Error (MAE): Also for regression, it calculates the average of the absolute differences between predicted and actual values. \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
● Binary Cross-Entropy (Log Loss): Used for binary classification, it measures the performance of a classification model whose output is a probability value between 0 and 1. \text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
● Categorical Cross-Entropy: An extension of binary cross-entropy for multi-class classification problems. \text{CCE} = -\sum_{i=1}^{n} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})
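
A short NumPy sketch computing these losses on a tiny set of hypothetical predictions:

import numpy as np

y_true = np.array([1, 0, 1, 1])            # actual labels / values (illustrative)
y_pred = np.array([0.9, 0.2, 0.6, 0.4])    # model outputs

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(f"MSE = {mse:.3f}, MAE = {mae:.3f}, binary cross-entropy = {bce:.3f}")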

c) What is a 95% confidence interval for linear regression?

A 95% confidence interval for linear regression provides a range of values within which we are 95% confident that the true population parameter (e.g., a regression coefficient or the predicted value for a given input) lies.

For a regression coefficient, the 95% confidence interval is typically calculated as:

\hat{\beta}_i \pm t_{\alpha/2,\, n-p-1} \times SE(\hat{\beta}_i)

where:

● \hat{\beta}_i is the estimated coefficient.
● t_{\alpha/2,\, n-p-1} is the critical t-value from the t-distribution with a significance level of α = 0.05 (for 95% confidence) and n − p − 1 degrees of freedom (n is the number of observations, p is the number of predictor variables).
● SE(\hat{\beta}_i) is the standard error of the estimated coefficient.

For a predicted value \hat{y} at a specific input x, the 95% prediction interval (which is wider than the confidence interval for the mean prediction) would be calculated differently, accounting for both the uncertainty in the coefficient estimates and the inherent variability in the data.

In essence, the 95% confidence interval gives us a sense of the precision of our estimated regression parameters or predictions.
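
A sketch of obtaining these intervals with statsmodels, which reports 95% confidence intervals for the coefficients directly; the simulated data is an assumption for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=100)   # true slope = 1.5

X = sm.add_constant(x)             # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)                # estimated intercept and slope
print(model.conf_int(alpha=0.05))  # 95% confidence interval for each coefficient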

f) Differences between Regression and Classification:

● Target Variable: Regression predicts a continuous numerical output; Classification predicts a categorical output (a finite set of discrete classes).
● Goal: Regression predicts a specific numerical value; Classification assigns each input to a predefined category or class.
● Output: Regression outputs a real number within a range; Classification outputs a class label or the probability of belonging to a class.
● Example Problems: Regression - house price prediction, stock forecasting, temperature estimation; Classification - email spam detection, image classification, disease diagnosis.
● Evaluation Metrics: Regression - MSE, RMSE, MAE, R-squared; Classification - Accuracy, Precision, Recall, F1-score, AUC-ROC.
● Common Algorithms: Regression - Linear Regression, Polynomial Regression, SVR, Decision Tree Regression, Random Forest Regression; Classification - Logistic Regression, SVM, Decision Tree Classification, Random Forest Classification, Naive Bayes, KNN.

5 mark:
a) What are the differences between Supervised and
Unsupervised Learning?

● Data: Supervised learning uses labeled data (input features with corresponding outputs); Unsupervised learning uses unlabeled data (input features only).
● Goal: Supervised learning learns a mapping function to predict the output for new inputs; Unsupervised learning discovers hidden patterns, structures, or relationships in the data.
● Learning Type: Supervised learning is learning by example; Unsupervised learning is learning without explicit guidance.
● Common Tasks: Supervised - classification, regression; Unsupervised - clustering, dimensionality reduction, association rule mining.
● Evaluation: Supervised learning is evaluated by prediction accuracy compared to the true labels; Unsupervised evaluation is often subjective, based on the quality of the discovered structures.
● Examples: Supervised - image classification, spam detection, house price prediction; Unsupervised - customer segmentation, anomaly detection, topic modeling.

b) Differentiate between parametric and non-parametric models with examples.

● Assumptions: Parametric models make strong assumptions about the data distribution; Non-parametric models make fewer or no explicit assumptions about the data distribution.
● Parameters: Parametric models have a fixed number of parameters determined by the model structure; in Non-parametric models the number of parameters grows with the size of the training data.
● Complexity: Parametric models are generally less flexible; Non-parametric models can be more flexible and capture complex relationships.
● Data Requirement: Parametric models can perform well with smaller datasets (if their assumptions hold); Non-parametric models typically require larger datasets to learn complex structures.
● Examples: Parametric - Linear Regression (assumes a linear relationship), Naive Bayes (assumes feature independence); Non-parametric - Decision Trees, Support Vector Machines, K-Nearest Neighbors.

c) Explain Bias-Variance Trade-off.

The Bias-Variance Trade-off is a fundamental concept in machine learning that describes the relationship between a model's ability to fit the training data (bias) and its ability to generalize to unseen data (variance).
●​ Bias is the error resulting from overly simplistic assumptions
in the learning algorithm. A high bias model might underfit the
training data, failing to capture important patterns.
●​ Variance is the error resulting from the model's sensitivity to
fluctuations in the training data. A high variance model might
overfit the training data, learning the noise as well as the
underlying patterns, leading to poor performance on new data.

The goal is to find a balance between bias and variance. A model with low bias and low variance will ideally generalize well.
Increasing model complexity typically reduces bias but increases
variance, while decreasing complexity increases bias but reduces
variance. Techniques like cross-validation and regularization help in
finding this optimal balance.

d) Compare Random Forest and Gradient Boosting.

● Base Learners: Both use multiple decision trees; in Gradient Boosting the trees are typically shallow.
● Tree Independence: In Random Forest, each tree is built independently on a random subset of data and features; in Gradient Boosting, trees are built sequentially, with each tree trying to correct the errors of the previous ones.
● Data Sampling: Random Forest uses bootstrap sampling (sampling with replacement); Gradient Boosting uses no explicit resampling - each tree is trained on the full data with weights based on previous errors.
● Feature Sampling: Random Forest considers a random subset of features at each split; Gradient Boosting typically considers all features at each split, but can use subsampling.
● Combining Trees: Random Forest averages predictions (regression) or takes a majority vote (classification); Gradient Boosting combines predictions through a weighted sum in the boosting process.
● Goal: Random Forest mainly reduces variance; Gradient Boosting mainly reduces bias.
● Robustness to Noise: Random Forest is generally more robust to noise due to averaging; Gradient Boosting can be sensitive to noisy data if not tuned properly.
● Computational Cost: Random Forest can be computationally expensive when training many deep trees, though they can be trained in parallel; Gradient Boosting can be computationally expensive due to sequential training.
● Overfitting: Random Forest is less prone to overfitting compared to individual trees; Gradient Boosting is more prone to overfitting if not tuned properly.

e) What is Cross Validation? Explain how to perform a 5-fold
Cross Validation on a dataset of 1000 items.

Cross-validation is a technique used to evaluate how well a machine learning model will generalize to an independent, unseen dataset. It involves partitioning the data into multiple subsets, training the model on some subsets, and evaluating it on the remaining subset. This process is repeated multiple times, and the results are averaged to get a more reliable estimate of the model's performance.

Performing 5-fold Cross Validation on a dataset of 1000 items:


1.​Shuffle the Data: First, the dataset of 1000 items should be
randomly shuffled to ensure that the data points are not
ordered in a way that could bias the folds.
2.​Divide into Folds: The shuffled dataset is then divided into 5
equal, non-overlapping folds. With 1000 items, each fold will
contain 1000/5=200 items. Let's label these folds as Fold 1,
Fold 2, Fold 3, Fold 4, and Fold 5.
3.​Iteration and Evaluation: The cross-validation process will
run for 5 iterations:
○​ Iteration 1: Fold 1 is used as the validation (or test) set,
and Folds 2, 3, 4, and 5 are used as the training set. The
model is trained on the 800 items in the training set and
evaluated on the 200 items in Fold 1. The performance
metric (e.g., accuracy, MSE) is recorded.
○​ Iteration 2: Fold 2 is used as the validation set, and
Folds 1, 3, 4, and 5 are used as the training set. The
model is trained on these 800 items and evaluated on
the 200 items in Fold 2. The performance metric is
recorded.
○​ Iteration 3: Fold 3 is used as the validation set, and
Folds 1, 2, 4, and 5 are used as the training set. The
model is trained on these 800 items and evaluated on
the 200 items in Fold 3. The performance metric is
recorded.
○​ Iteration 4: Fold 4 is used as the validation set, and
Folds 1, 2, 3, and 5 are used as the training set. The
model is trained on these 800 items and evaluated on
the 200 items in Fold 4. The performance metric is
recorded.
○​ Iteration 5: Fold 5 is used as the validation set, and
Folds 1, 2, 3, and 4 are used as the training set. The
model is trained on these 800 items and evaluated on
the 200 items in Fold 5. The performance metric is
recorded.
4.​Average Performance: After all 5 iterations, the recorded
performance metrics from each validation set are averaged.
This average provides a more robust estimate of the model's
generalization performance on unseen data compared to a
single train-test split.
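
A sketch of the same 5-fold procedure with scikit-learn's KFold on a synthetic dataset of 1000 items; the model and data are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])          # 800 training items
    acc = model.score(X[val_idx], y[val_idx])      # 200 validation items
    scores.append(acc)
    print(f"Fold {fold}: accuracy = {acc:.3f}")

print(f"Mean CV accuracy = {np.mean(scores):.3f}")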

f) How to prevent overfitting in Decision Trees?

Overfitting occurs in Decision Trees when the tree becomes too complex and learns the noise in the training data. Here are common ways to prevent overfitting:
●​ Pruning: This involves removing branches or nodes from a
fully grown tree.
○​ Pre-pruning: Setting limits on the growth of the tree
during the training process. This can involve setting a
maximum depth for the tree, a minimum number of
samples required to split a node, or a minimum number
of samples required in a leaf node.
○​ Post-pruning: Growing the tree fully and then pruning
back nodes that do not provide significant improvement
in generalization. This often involves using a validation
set to evaluate the impact of pruning.
●​ Limiting Tree Depth: Restricting the maximum depth of the
tree prevents it from becoming too specific to the training data.
●​ Minimum Samples per Leaf: Requiring a minimum number
of data points in each leaf node ensures that the tree doesn't
create overly specific leaves for small, potentially noisy,
subsets of the data.
●​ Minimum Samples to Split a Node: Setting a minimum
number of data points required to split an internal node
prevents splits based on very small subsets of the data.
●​ Cost Complexity Pruning (CCP): A post-pruning technique
that considers a cost function to balance the tree's complexity
and its fit to the training data.
●​ Using Ensemble Methods: Techniques like Random Forest
and Gradient Boosting, which use multiple decision trees, are
inherently less prone to overfitting than a single deep decision
tree due to the aggregation of predictions.
●​ Cross-Validation: Using cross-validation to evaluate the
tree's performance on unseen data during training and
hyperparameter tuning helps in selecting tree parameters that
generalize well and avoid overfitting.
● Feature Selection: Selecting only the most relevant features can reduce the complexity of the model and prevent it from fitting noise in irrelevant dimensions.
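
A brief scikit-learn sketch comparing an unconstrained tree with one that uses pre-pruning limits and cost-complexity pruning; the specific limits and dataset are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree versus one with depth/leaf limits and cost-complexity pruning
full_tree = DecisionTreeClassifier(random_state=0)
pruned_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                     ccp_alpha=0.01, random_state=0)

for name, tree in [("full tree", full_tree), ("pruned tree", pruned_tree)]:
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")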
