
Unit II

Linear Methods for Regression


Syllabus:
Introduction, Linear Regression Models and Least Squares, Subset Selection, Shrinkage
Methods-Ridge Regression, Lasso Regression, Least Angle Regression, Methods Using
Derived Input Directions-Principal Components Regression, Partial Least Squares, A
Comparison of the Selection and Shrinkage Methods, Multiple Outcome Shrinkage and
Selection, More on the Lasso and Related Path Algorithms, Logistic Regression-Fitting
Logistic Regression Models, Quadratic Approximations and Inference, L1 Regularized
Logistic Regression.

Introduction:
Linear regression is one of the simplest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc. The
algorithm models a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Because the relationship is
linear, the model describes how the value of the dependent variable changes as the value of
the independent variable changes.
The linear regression model provides a sloped straight line representing the relationship
between the variables.

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The observed values of the x and y variables are the training data used to fit the linear
regression model.
Linear Regression
In statistics, linear regression is a linear approach to modelling the relationship between a
dependent variable and one or more independent variables. With one independent variable it
is called simple linear regression; with more than one independent variable, the process is
called multiple linear regression. We will be dealing with simple linear regression in this
tutorial.
Let X be the independent variable and Y be the dependent variable. We will define a linear
relationship between these two variables as follows:

Y = mX + c

This is the equation for a line that you studied in high school: m is the slope of the line
and c is the y-intercept. We will use this equation to train our model on a given dataset and
predict the value of Y for any given value of X.
Our challenge is to determine the values of m and c that give the minimum error for the given
dataset. We will do this using the Least Squares method.
Finding the Error
So to minimize the error we need a way to calculate the error in the first place. A loss
function in machine learning is simply a measure of how different the predicted value is
from the actual value.
Here we will use the quadratic loss function to calculate the loss (error) of our model. For
actual values y and predicted values p it can be defined as:

L = ∑ (yi − pi)²

We square the differences because, for points below the regression line, yi − pi is negative,
and we do not want negative values in our total error.
Least Squares method
Now that we have determined the loss function, the only thing left to do is minimize it. This
is done by taking the partial derivatives of L with respect to m and c, setting them to 0, and
solving for m and c. After doing the math, we are left with these equations:

m = ∑ (xi − x̅)(yi − ȳ) / ∑ (xi − x̅)²
c = ȳ − m x̅

Here x̅ is the mean of all the values in the input X and ȳ is the mean of all the values in the
desired output Y. This is the Least Squares method. Now we will implement this in Python
and make predictions.
There are two main problems with least squares estimation. 1. Prediction accuracy: least
squares estimates have low bias, but this can come at the cost of high variance; prediction
accuracy can sometimes be improved by trading a small increase in bias for a larger reduction
in variance. 2. Interpretation: when we have a large number of predictors, we would like to
determine a smaller subset that exhibits the strongest effects; least squares gives us all the
details, whereas we often want the big picture.

Subset selection
The first class of procedures that helps us solve these problems is subset selection. These
procedures all aim to select the set of predictors that minimizes the residual sum of squares
(RSS), given a number of predictors k ≤ p.
Because the RSS is monotonically decreasing in k, these procedures do not by themselves tell
us how to choose the number of predictors in a model. The choice of k is governed by a
trade-off between bias and variance, and a subjective desire for parsimony.
Best subset selection
This procedure is feasible when p, the number of predictors, is roughly 40 or less.
For each k ∈ {0, 1, 2, ..., p}, find the subset of size k that gives the smallest residual sum
of squares.
One strength of this approach is that the chosen subsets are not forced to be nested, which
gives more flexibility.

Forward/Backward stepwise selection


Instead of searching over all possible subsets, this approach seeks a good path through them.
Unlike best subset selection, forward/backward stepwise selection produces nested models, but
it can be computationally and statistically more attractive than best subset selection.
Forward stepwise selection starts with the intercept and sequentially adds to the model the
predictor that most improves the fit.
Backward stepwise selection starts with the full model and drops predictors progressively,
starting with those that exhibit the smallest z-scores.
Note that the standard errors reported by the selected model are not valid, because they do
not account for the search process. This may call for bootstrapping as a correction.
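A sketch of forward stepwise selection on synthetic data; the greedy loop and the RSS criterion follow the description above, while the data, function name, and the choice k = 2 are purely illustrative:

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily add the predictor that most reduces the RSS, up to k predictors."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            # Least squares fit on the current candidate subset (intercept included)
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3 * X[:, 2] - 2 * X[:, 5] + rng.normal(size=100)
print(forward_stepwise(X, y, k=2))   # expected to pick columns 2 and 5
```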

Forward stagewise regression


This other subset selection approach is more heavily constrained and “slow-fitting”. It will
usually be more suitable for high dimensional cases.
The algorithm proceeds as follows:

1. Start with the intercept equal to ȳ and centered predictors with coefficients set to 0.
2. Identify the variable most correlated with the current residual.
3. Run a simple linear regression of the residual on this variable, and add the resulting
coefficient to the current coefficient for that variable.
4. Return to step 2 and repeat until no variable is correlated with the residual.
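A small NumPy sketch of the loop above, assuming centered predictors; the function name, iteration cap, and stopping tolerance are illustrative choices, not part of the original algorithm statement:

```python
import numpy as np

def forward_stagewise(X, y, n_steps=200, tol=1e-6):
    """Forward stagewise regression as described above (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                 # centered predictors
    intercept = y.mean()                    # start: intercept = ȳ, coefficients = 0
    beta = np.zeros(Xc.shape[1])
    r = y - intercept                       # current residual
    for _ in range(n_steps):
        corr = Xc.T @ r
        j = np.argmax(np.abs(corr))         # variable most correlated with the residual
        if np.abs(corr[j]) < tol:           # stop when no correlation remains
            break
        delta = corr[j] / (Xc[:, j] @ Xc[:, j])   # simple regression of r on x_j
        beta[j] += delta                    # add it to that variable's coefficient
        r -= delta * Xc[:, j]               # update the residual
    return intercept, beta
```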

Limits
Subset selection models are discrete processes - they do not optimize continuously. For this
reason, they have high variability and will not necessarily minimize the prediction error of
the full model. This limitation invites alternative procedures that are more efficient
(variance-wise): shrinkage models.

Shrinkage models
The intuition is similar to subset selection models, in that we are still trying to minimize
the RSS, with the additional tweak that the RSS is modified to include a penalty term.
Shrinkage methods can also be understood as tools that seek to alleviate the consequences of
multicollinearity.
The difference between shrinkage models lies in how the penalization term is defined - which
has significant consequences.
Ridge regression
Ridge regression shrinks the regression coefficients by adding an L2 penalty to the residual
sum of squares:

minimize: RSS + λ * ∑βj²

Here λ ≥ 0 controls the amount of shrinkage: the larger λ, the greater the shrinkage.

Sensitivity to regularization parameter
As λ increases, the ridge coefficients are shrunk toward zero (but, unlike the lasso, are not
set exactly to zero); as λ → 0, the ridge solution approaches the ordinary least squares
estimates.
Comparative discussion

Ridge: proportional shrinkage of the least squares estimates
Lasso: soft thresholding
Best subset selection: hard thresholding

Differences
Where the input matrix is orthonormal, all three procedures admit explicit solutions: each
approach transforms the least squares estimates through a different rule, as summarized in
Table 3.4 of Hastie et al.
When the inputs are not orthogonal, however, the differences between the methods become much
wider.
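As an illustration of the comparative behaviour (not a definitive benchmark), the scikit-learn estimators can be fit side by side on synthetic data; the regularization strengths alpha=10.0 and alpha=0.5 are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X @ np.array([3, 0, 0, 1.5, 0, 0, 0, 2, 0, 0]) + rng.normal(size=80)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)     # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X, y)      # L1 penalty: sets some coefficients exactly to zero

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))   # smaller but nonzero coefficients
print("Lasso:", np.round(lasso.coef_, 2))   # sparse coefficient vector
```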
Methods Using Derived Input Directions:
When a large number of (correlated) variables Xj, j = 1, …, p are available, they may be
linearly combined into a smaller number of components (projections) Zm, m = 1, …, M, with
M ≤ p.
● These components can be used as inputs in regression.
● Different methods are available for constructing linear combinations of variables
● Principal components regression
● Partial least squares
1. Principal Components Regression
Principal Components: SVD
Linear components Zm are defined by Principal Component Analysis (PCA).
Principal component (Karhunen-Loève) directions of X are computed by the SVD of X
(equivalently, the eigenvalue decomposition of (X^T)X when X is standardized).
● The SVD of the N x p matrix X can be written as:

X = U D V^T

where:
● U (N x p) and V (p x p) are orthogonal matrices
● Columns of U span the column space of X
● Columns of V span the row space of X
● D is a p x p diagonal matrix with entries d1 >= d2 >= … >= dp >= 0, called the
singular values of X
Principal Components: eigen decomposition
In fact, the matrix (X^T)X (proportional to the sample covariance matrix when X is centered)
can be decomposed as

(X^T)X = V D² V^T

which is the eigen decomposition of (X^T)X.
● The eigenvectors (columns of V) are also called the principal component (Karhunen-Loève)
directions of X.
Principal Components: directions and variance
The first principal component direction v1 (the first column of V, an eigenvector of (X^T)X)
has the property that z1 = X v1 has the largest sample variance among all normalized linear
combinations of the columns of X:

Var(z1) = Var(X v1) = d1² / N

where d1 is the largest singular value of X (so d1² is the largest eigenvalue of (X^T)X) and
N is the total number of observations.
● Subsequent principal components zj have maximum variance subject to being orthogonal to the
earlier ones.
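A possible sketch of principal components regression with scikit-learn: standardize X, keep the first M components, and regress y on them. The synthetic data and the choice M = 4 are assumptions for illustration only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
X[:, 6:] = X[:, :6] + 0.05 * rng.normal(size=(100, 6))   # strongly correlated columns
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)

# Standardize X, compute the principal components Z_m, and regress y on the first M of them
M = 4
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print("training R^2:", pcr.score(X, y))
```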

2. Partial Least Squares


Partial Least Squares (PLS) is a statistical technique commonly used in machine learning and
multivariate analysis for regression and dimensionality reduction. It's particularly useful
when dealing with situations where there are many correlated variables and the goal is to
model relationships between predictors and response variables.
Here's an overview of how PLS works:
1. Regression Using PLS: PLS can be used for regression tasks, where the goal is to
predict a continuous target variable based on a set of predictor variables. It aims to
find a linear relationship between the predictors and the target, even in situations
where there might be multicollinearity among predictors.
2. Dimensionality Reduction Using PLS: PLS can also be used for dimensionality
reduction. It extracts a set of latent variables (also known as components) that capture
the maximum variance in both the predictors and the target variable. These latent
variables are linear combinations of the original variables.
The PLS algorithm involves iteratively constructing these latent variables in a way that
maximizes the covariance between the predictors and the target variable. The process can be
summarized as follows:
● Initialization: Start with the original predictor and target matrices.
● Step 1 - Calculate Latent Variables: Calculate the first latent variable as a linear
combination of the original predictor variables, weighted by their covariance with the
target variable.
● Step 2 - Update Residuals: Remove the variance captured by the first latent variable
from the predictor and target variables to obtain residuals.
● Repeat: Iterate the process to calculate subsequent latent variables, each capturing the
maximum covariance between the predictors and the residuals.
PLS aims to capture the maximum covariance between the predictors and the target while
also considering the covariance between the predictors themselves. It effectively combines
aspects of principal component analysis (PCA) and linear regression.
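A brief sketch using scikit-learn's PLSRegression, assuming synthetic data and an arbitrary choice of 3 latent components:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))                 # more predictors than is comfortable for OLS
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Extract 3 latent components that maximize covariance between X and y, then regress on them
pls = PLSRegression(n_components=3)
pls.fit(X, y)
y_hat = pls.predict(X)
print("training R^2:", pls.score(X, y))
```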
PLS is particularly useful in situations where the number of predictors is much larger than the
number of observations, where there are multicollinearity issues, or when the predictors are
highly correlated. It's used in various fields, including chemometrics, bioinformatics, finance,
and more.
However, it's worth noting that PLS might not always be the best choice for every dataset.
Other techniques like regularized regression (e.g., Ridge, Lasso) and advanced machine
learning models (e.g., tree-based models, neural networks) should also be considered based
on the specific problem and dataset characteristics.

Comparison of the Selection and Shrinkage Methods:


Selection and shrinkage methods are two important categories of techniques used in machine
learning for feature selection and regularization. Let's compare these two approaches:
1. Selection Methods: Selection methods involve choosing a subset of features from the
original set of predictors to include in the model. The aim is to eliminate irrelevant or
redundant features, which can help improve model interpretability and reduce the risk of
overfitting.
Advantages:
● Simplicity and Interpretability: Selection methods explicitly choose a subset of
features, making the model easier to understand and interpret.
● Computationally Efficient: Since only a subset of features is used, these methods are
often computationally efficient and can handle larger datasets.
● Reduced Model Complexity: By removing irrelevant features, the model's complexity is
reduced, which can lead to better generalization.
Disadvantages:
● Potential Information Loss: Removing features might result in loss of information that
could be beneficial for modeling, especially if feature interactions are important.
● Ignoring Correlations: Selection methods might ignore correlated features that together
contribute to predictive power.
● Sensitivity to Feature Set: The selected features might not be the same for different
subsets of data, leading to instability in model results.
Common Selection Methods:
● Forward Selection: Features are added to the model one by one based on their individual
performance.
● Backward Elimination: Features are removed from the model one by one based on their
individual performance.
● Recursive Feature Elimination (RFE): Iteratively removes the least important features
based on model performance until a desired number is reached.
● Feature Importance Ranking: Techniques like decision trees, random forests, and gradient
boosting can provide feature importance scores, which can guide feature selection.
2. Shrinkage Methods (Regularization): Shrinkage methods involve adding a penalty term
to the objective function that the model tries to optimize. This penalty term encourages the
model to keep the coefficients of certain features small, effectively regularizing the model.
Advantages:
● Accounting for Correlations: Shrinkage methods can handle correlated features more
effectively by shrinking their coefficients together.
● Automatic Feature Selection: The penalty term can lead to some coefficients becoming
exactly zero, performing automatic feature selection.
● Stability: Shrinkage methods tend to yield more stable and interpretable models than
selection methods.
Disadvantages:
● Complex Model Interpretation: While the models are more stable, interpreting the impact
of individual features might be harder due to the combined effect of multiple features.
● Hyperparameter Tuning: Shrinkage methods introduce hyperparameters that need to be tuned
to balance the trade-off between regularization and model fit.
● Potential Over-regularization: If the regularization term is too strong, the model might
underfit the data.
Common Shrinkage Methods:
● Lasso (L1 Regularization): Encourages sparsity in coefficients, leading to automatic
feature selection.
● Ridge (L2 Regularization): Penalizes large coefficients, effectively reducing their
impact on the model.
● Elastic Net: Combines L1 and L2 regularization, balancing their strengths.
● Partial Least Squares (PLS): Reduces dimensionality and models relationships between
predictors and responses by creating latent variables.
In summary, selection methods directly choose a subset of features, while shrinkage methods
introduce penalties to control the size of coefficients and encourage sparsity. The choice
between these approaches depends on the specific problem, dataset characteristics, and the
balance between interpretability and predictive performance desired in the model.

The Lasso and Related Path Algorithms:


The Lasso (Least Absolute Shrinkage and Selection Operator) is a regression technique used
in statistics and machine learning for variable selection and regularization. It's particularly
useful when dealing with high-dimensional data where the number of features (variables) is
much larger than the number of observations.
The Lasso works by adding a penalty term to the standard linear regression objective
function. The penalty term is based on the absolute values of the coefficients of the regression
variables. The goal of the penalty term is to encourage the coefficients of less important
variables to be exactly zero, effectively excluding them from the model. This leads to a sparse
solution where only the most important variables are retained.
Mathematically, the Lasso problem can be formulated as:
minimize: RSS + λ * ∑|βi|
where:
● RSS is the residual sum of squares, which measures the difference between the predicted
values and the actual target values.
● βi represents the coefficients of the regression variables.
● λ (lambda) is the regularization parameter that controls the strength of the penalty
term. Higher values of λ result in more aggressive shrinkage of coefficients.
The Lasso algorithm can be solved using optimization techniques like coordinate descent,
gradient descent, or specialized solvers. It's worth noting that when the Lasso has a non-zero
coefficient for a variable, it implies that the variable is considered important in predicting the
target.
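As a sketch of how the solutions vary along a regularization path, scikit-learn's lasso_path computes the coefficients over a grid of λ values; the synthetic data and grid size below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)

# Compute Lasso solutions over a decreasing grid of λ values (called alphas in scikit-learn)
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# As λ decreases, more coefficients become nonzero (variables enter the model)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"lambda={a:.3f}  nonzero coefficients={np.sum(c != 0)}")
```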
Related to the Lasso, there are other regularization techniques and path algorithms that serve
similar purposes, including:
1. Ridge Regression: Similar to Lasso, but it uses the sum of squared coefficients as the
penalty term. This encourages smaller coefficients for less important variables without
forcing them to be exactly zero.
2. Elastic Net: A combination of Lasso and Ridge, using a combination of both absolute
and squared coefficients for the penalty term. This helps to address some of the
limitations of Lasso when dealing with correlated variables.
3. Least-Angle Regression (LARS): An algorithm that efficiently computes the Lasso
solutions along a regularization path. It adds variables to the model in a stepwise
manner, closely related to forward selection.
4. Sequential Threshold Lasso: A modification of Lasso that uses a sequential
thresholding technique to perform variable selection more efficiently, especially for
high-dimensional data.
These algorithms and techniques are widely used in fields such as statistics, machine
learning, and data science for feature selection, model regularization, and handling
multicollinearity in regression problems. They help improve model interpretability, reduce
overfitting, and enhance generalization to new data.

Logistic Regression-Fitting Logistic Regression Models:

Logistic Regression is a statistical method used for binary classification tasks, where the goal
is to predict the probability that an instance belongs to one of two classes. Despite its name,
logistic regression is used for classification rather than regression problems.
The logistic regression model is based on the logistic function (also known as the sigmoid
function), which maps any input to a value between 0 and 1. This makes it suitable for
estimating probabilities. The logistic function is defined as:

σ(z) = 1 / (1 + e^(−z))

In the context of logistic regression, the input to the logistic function (z) is a linear
combination of the features (also called predictors or independent variables), each weighted
by a coefficient (β) and summed up:

z = β0 + β1 x1 + β2 x2 + … + βn xn

Here, x1, …, xn are the feature values, and β0, β1, …, βn are the coefficients that the model
learns during the fitting process (β0 is the intercept).
The logistic regression model then predicts the probability of belonging to the positive class
(P(y=1)) using the logistic function:

P(y=1) = σ(z) = 1 / (1 + e^(−(β0 + β1 x1 + … + βn xn)))

The probability of belonging to the negative class (P(y=0)) is simply 1 − P(y=1).


During the model fitting process, the goal is to find the optimal values of the coefficients
(β0, β1, …, βn) that best fit the training data. This is typically done by maximizing the
likelihood of the observed data given the model parameters, which is equivalent to minimizing
the negative log-likelihood:

L(β) = − (1/N) ∑ [ yi log(pi) + (1 − yi) log(1 − pi) ]

Here yi represents the true class label of each observation, pi = P(yi = 1) is the model's
predicted probability for that observation, and N is the number of observations.
Optimization algorithms such as gradient descent, Newton's method, or specialized solvers
are used to find the coefficients that maximize the likelihood.
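A minimal NumPy sketch of this fitting process by gradient descent on the negative log-likelihood; the learning rate, iteration count, and synthetic data are illustrative assumptions rather than a prescribed setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit β by gradient descent on the negative log-likelihood (no regularization)."""
    Xb = np.column_stack([np.ones(len(X)), X])      # prepend a column of 1s for β0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)                      # P(y=1) for each observation
        grad = Xb.T @ (p - y) / len(y)              # gradient of the negative log-likelihood
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.2 * rng.normal(size=200) > 0).astype(float)
print("fitted coefficients:", np.round(fit_logistic(X, y), 2))
```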
Logistic regression is widely used in various fields such as medical diagnosis, marketing,
finance, and natural language processing. It's a foundational algorithm in machine learning
and provides a simple yet effective approach to binary classification problems.

Quadratic Approximations and Inference in machine learning:


Quadratic approximations and inference play important roles in machine learning,
particularly in optimization and statistical inference tasks. Let's explore each of these
concepts in more detail:
1. Quadratic Approximations: Quadratic approximations are used to simplify complex
functions and make optimization more efficient. In many cases, the objective
functions or loss functions that need to be optimized are not easily solvable directly,
especially in high-dimensional spaces. Quadratic approximations involve
approximating a complex function with a simpler quadratic function around a specific
point.
The idea is to use the Taylor series expansion of the function up to the quadratic term to
create an approximation that is easier to work with. This is particularly useful when designing
optimization algorithms like gradient descent, Newton's method, and quasi-Newton methods.
Quadratic approximations allow these methods to take larger steps towards the optimal
solution and converge faster.
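A small sketch of how a quadratic approximation drives Newton's method: at each step the local quadratic model built from the gradient and second derivative is minimized exactly. The one-dimensional objective used here is an arbitrary example, not tied to any particular loss function in the text:

```python
def newton_minimize(f_grad, f_hess, x0, n_iter=20, tol=1e-10):
    """Minimize a smooth 1-D function by repeatedly minimizing its local quadratic model."""
    x = x0
    for _ in range(n_iter):
        g, h = f_grad(x), f_hess(x)
        step = g / h              # minimizer of the quadratic model f(x) + g*d + 0.5*h*d^2
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = x^4 - 3x^2 + x (gradient and second derivative given analytically)
grad = lambda x: 4 * x**3 - 6 * x + 1
hess = lambda x: 12 * x**2 - 6
print(newton_minimize(grad, hess, x0=2.0))   # converges to a local minimum near x ≈ 1.13
```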
2. Inference: Inference in machine learning refers to the process of drawing conclusions
or making predictions based on observed data. It often involves estimating parameters
of a model or making statistical inferences about the relationships between variables.
Inference is crucial for understanding the underlying patterns and making predictions
in real-world applications.
In the context of machine learning, some common inference tasks include:
● Parameter Estimation: Given a model, estimate the values of its parameters that best
fit the observed data. This is a common task in supervised learning, where you want to
find the best-fitting model to predict outcomes.
● Hypothesis Testing: Determine whether a certain hypothesis about the data is supported
by the observed evidence. This is often used to assess the significance of
relationships between variables.
● Confidence Intervals: Estimate a range within which a parameter is likely to fall with
a certain degree of confidence. This provides a measure of uncertainty around parameter
estimates.
Statistical models often use probabilistic distributions to model uncertainty. Bayesian
inference, for example, combines prior beliefs with observed data to update the belief
distribution about the model parameters.
In summary, quadratic approximations are used to simplify complex functions for efficient
optimization, while inference involves drawing conclusions from data and making
predictions in a probabilistic and statistically sound manner. Both concepts are fundamental
in machine learning, aiding in the development of efficient algorithms and the interpretation
of model behavior.

L1 Regularized Logistic Regression in machine learning:


L1 regularized logistic regression, often referred to as "L1 regularization" or "Lasso
regularization" for logistic regression, is a variant of logistic regression that incorporates a
regularization term based on the L1 norm of the coefficients. L1 regularization is a technique
used to prevent overfitting and to encourage sparsity in the learned model.
In standard logistic regression, the goal is to find the coefficients that best fit the training
data, minimizing the logistic loss function. However, in some cases, especially when dealing
with high-dimensional data where there are many features, the model might become overly
complex and prone to overfitting.
L1 regularization addresses this by adding a penalty term to the logistic loss function, which
is proportional to the absolute values of the coefficients. The regularized logistic loss function
with L1 regularization is:

L(β) = − (1/N) ∑ [ yi log(pi) + (1 − yi) log(1 − pi) ] + λ ∑ |βj|

Where:
● yi is the true class label of the ith observation.
● pi = P(y=1) is the predicted probability of the ith observation belonging to the
positive class.
● βj is the coefficient of the jth feature.
● N is the number of observations, and p is the number of features (the penalty sums over
j = 1, …, p).
● λ is the regularization parameter that controls the strength of the regularization. A
higher λ leads to more aggressive shrinking of coefficients.
The key effect of the L1 regularization term is that it encourages some of the coefficients to
become exactly zero, effectively performing feature selection. This sparsity-inducing
property is particularly valuable when dealing with high-dimensional data, as it allows the
model to focus on the most important features while ignoring irrelevant ones.
The optimization problem associated with L1 regularized logistic regression can be solved
using techniques like coordinate descent, gradient descent, or specialized solvers. The choice
of the regularization parameter (λ) is important and often determined through techniques such
as cross-validation.
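A hedged sketch using scikit-learn, where the parameter C plays the role of 1/λ; the synthetic data, the value C = 0.1, and the cross-validation setup are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 25))                      # many features, only two informative
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

# In scikit-learn, C is the inverse of the regularization strength (C ≈ 1/λ)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

print("nonzero coefficients:", np.sum(clf.coef_ != 0))          # sparse solution
print("cv accuracy:", cross_val_score(clf, X, y, cv=5).mean())  # λ (i.e. C) tuned via CV in practice
```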
L1 regularized logistic regression is a powerful tool in machine learning for feature selection
and regularization. It helps prevent overfitting, increases model interpretability, and is
especially effective when there are many features and a need to identify the most relevant
ones for prediction.
