GTBIT, IT Dept.

Unit: 2
Machine Learning (Reference Guide)

Regression Analysis in Machine learning


Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A which runs various advertisements every year and earns sales from them. The below list shows the advertisements made by the company in the last 5 years and the corresponding sales:


Now the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction about the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining cause-and-effect relationships between variables.

In regression, we plot a graph between the variables which best fits the given datapoints; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression fits a line or curve through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:


o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable, also
called as a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high value
in comparison to other observed values. An outlier may hamper the result, so it should be
avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then such a condition is called multicollinearity. It should not be present in the dataset, because it creates problems while ranking the most influential variables.


o Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with the test dataset, then such a problem is called overfitting. And if our algorithm does not perform well even with the training dataset, then such a problem is called underfitting.

Why do we use Regression Analysis?


As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various scenarios in the real world where we need future predictions, such as weather conditions, sales, marketing trends, etc. For such cases we need a technique which can make predictions accurately. Regression analysis is such a statistical method, and it is used in machine learning and data science. Below are some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can confidently determine the most important factor, the least important factor, and how each factor affects the others.

Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core, all the
regression methods analyze the effect of the independent variable on dependent
variables. Here we are discussing some important types of regression which are given
below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression


o Lasso Regression

Feature Engineering:
Feature engineering is the process of selecting, extracting, and transforming the most relevant
features from the available data to build more accurate and efficient machine learning models.

WHAT IS A FEATURE?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.


NEED FOR FEATURE ENGINEERING IN MACHINE LEARNING?


1. Improving Model Performance: One of the primary purposes of feature engineering is
to enhance the performance of machine learning models. By selecting or creating
relevant features, removing irrelevant or redundant ones, or transforming existing
features, we can provide the model with more useful information, enabling it to make
better predictions or classifications.
2. Handling Missing Data: Feature engineering techniques can help deal with missing data in a dataset. For example, one might fill missing values with the mean, median, or mode of the respective feature, or use more sophisticated methods like interpolation or imputation based on other correlated features (see the sketch after this list).
3. Dimensionality Reduction: Feature engineering can also involve reducing the
dimensionality of the dataset by selecting a subset of relevant features or creating new
composite features that capture the most important information. Techniques like
principal component analysis (PCA) or feature selection algorithms aid in reducing the
number of features while preserving the most relevant information.
4. Handling Categorical Variables: Many machine learning algorithms require numerical
input data, so categorical variables need to be transformed into a numerical format.
Techniques like one-hot encoding, label encoding, or binary encoding can be used to
represent categorical variables as numerical features that can be fed into the models.
5. Capturing Non-linear Relationships: Sometimes, relationships between features and
the target variable are non-linear. Feature engineering techniques like polynomial
features or interaction terms can help capture these non-linear relationships, enabling
linear models to learn more complex patterns.
6. Normalization and Scaling: Scaling features to a similar range or normalizing them can
be beneficial for certain algorithms, such as those based on distance metrics (e.g., K-
nearest neighbors or clustering algorithms). Techniques like min-max scaling or
standardization are commonly used for this purpose.
7. Handling Outliers: Outliers in the dataset can significantly affect model performance.
Feature engineering techniques such as Winsorization (capping outliers), log
transformations, or robust scaling can help mitigate the impact of outliers on model
training.
8. Domain-specific Knowledge Incorporation: Domain knowledge about the problem
being solved can guide the selection and creation of features. This can lead to the
creation of more informative features that capture relevant aspects of the data specific
to the problem domain.
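
As a quick illustration of points 2 and 4 above, here is a minimal pandas/scikit-learn sketch; the column names and values are invented purely for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy dataset with a missing numeric value and a categorical column (illustrative values)
df = pd.DataFrame({
    "age":  [25, 32, None, 41],
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# Point 2 (handling missing data): fill the missing 'age' value with the column mean
imputer = SimpleImputer(strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])

# Point 4 (handling categorical variables): one-hot encode the 'city' column
df = pd.get_dummies(df, columns=["city"])
print(df)
```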


PROCESSES INVOLVED IN FEATURE ENGINEERING


Feature engineering in Machine learning consists of mainly 5 processes: Feature Creation,
Feature Transformation, Feature Extraction, Feature Selection, and Feature Scaling. It is an
iterative process that requires experimentation and testing to find the best combination of
features for a given problem.
1. Feature Creation: Feature Creation is the process of generating new features based on
domain knowledge or by observing patterns in the data. It is a form of feature
engineering that can significantly improve the performance of a machine-learning
model.
2. Feature Transformation: Feature Transformation is the process of transforming the
features into a more suitable representation for the machine learning model. This is
done to ensure that the model can effectively learn from the data.
3. Feature Extraction: Feature Extraction is the process of creating new features from
existing ones to provide more relevant information to the machine learning model. This
is done by transforming, combining, or aggregating existing features.
4. Feature Selection: Feature Selection is the process of selecting a subset of relevant
features from the dataset to be used in a machine-learning model. It is an important
step in the feature engineering process as it can have a significant impact on the
model’s performance.
5. Feature Scaling: Feature Scaling is the process of transforming the features so that they have a similar scale. This is important in machine learning because the scale of the features can affect the performance of the model (a short scaling sketch follows this list).
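
For example, a minimal sketch of feature scaling (process 5) with scikit-learn; the array values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (e.g., age in years and income in rupees)
X = np.array([[25, 30000],
              [32, 52000],
              [41, 98000],
              [29, 45000]], dtype=float)

# Min-max scaling: rescales each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescales each feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```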

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.


o The relationship between variables in the linear regression model can be explained using
the below image. Here we are predicting the salary of an employee on the basis of the
years of experience.

o Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.


Assumptions for a Reliable Fit:


Linear regression makes certain assumptions about the data to ensure its accuracy. Violating these
assumptions can lead to misleading results.

Here are the main assumptions:

1. Linearity: The relationship between the dependent and independent variables is assumed to be
linear. The model assumes that a change in the predictor variable is associated with a constant
change in the response variable.

2. Independence of Errors: The errors (residuals) should be independent, meaning that the value
of the error for one observation should not predict the value of the error for another
observation.

3. Homoscedasticity: The variance of the errors should be constant across all values of the
independent variable. In simpler terms, the spread of the data points around the regression line
should be consistent.

4. Normality of Residuals: The residuals (errors) should be normally distributed around the mean.
This allows for statistical tests to assess the model's validity.

5. No Multicollinearity: The independent variables shouldn't be highly correlated with each other.
If they are, it can be difficult to isolate the effect of each variable on the dependent variable.

Simple Linear Regression Model Building:


1. Understanding the Problem: Identify your dependent variable (what you're trying to predict)
and the independent variable (what you believe influences the dependent variable). They
should have a linear relationship.
2. Data Preparation: This may involve collecting data, cleaning it for inconsistencies, and ensuring
there are no missing values.
3. Model Fitting: Here, you use your data to estimate the model's parameters like slope and
intercept. Popular programming languages like Python have libraries like scikit-learn that can do
this.
4. Evaluation: Assess how well your model fits the data. This involves metrics like R-squared, which
indicates how much variance the model explains in the dependent variable.


This is the section where you’ll find out how to perform the regression in Python. We will use
Advertising sales channel prediction data. You can access the data here.

‘Sales’ is the target variable that needs to be predicted. Now, based on this data, our objective is to
create a predictive model, that predicts sales based on the money spent on different platforms for
marketing.

Let us get right down to some hands-on coding to get this prediction done. Please don't feel left out if you do not have experience with Python. In fact, the best way to learn is to get your hands dirty by solving a problem – like the one we are doing.

Step 1: Importing Python Libraries

The first step is to fire up your Jupyter notebook and load all the prerequisite libraries in your Jupyter
notebook. Here are the important libraries that we will be needing for this linear regression.

 NumPy (to perform certain mathematical operations)
 pandas (to store the data in pandas DataFrames)
 matplotlib.pyplot (you will use matplotlib to plot the data)

In order to load these, just start with these few lines of code in your first cell:
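
The original code cell is not reproduced here; a minimal version of the imports described above would be:

```python
import numpy as np                 # to perform certain mathematical operations
import pandas as pd                # to store the data in pandas DataFrames
import matplotlib.pyplot as plt    # to plot the data
```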

Step 2: Loading the Dataset

Let us now import the data into a DataFrame. A DataFrame is a tabular data structure provided by pandas; the simplest way to understand it is that it stores all your data in rows and columns, like a table.
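
Assuming the advertising data has been downloaded as a CSV file (the file name advertising.csv below is a placeholder), it can be loaded and inspected as follows:

```python
# Load the advertising dataset into a DataFrame (file name is a placeholder)
advertising = pd.read_csv("advertising.csv")

# Inspect the first few rows, the column types, and summary statistics
print(advertising.head())
advertising.info()
print(advertising.describe())
```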


Step 3: Visualization

Let us plot scatter plots of the target variable vs. the predictor variables in a single figure to get some intuition. We also plot a heatmap of the correlations between all the variables.
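
A possible version of this plotting code, using only matplotlib and assuming the columns TV, Radio, Newspaper, and Sales from the advertising dataset:

```python
# Scatter plots of Sales against each advertising channel
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, channel in zip(axes, ["TV", "Radio", "Newspaper"]):
    ax.scatter(advertising[channel], advertising["Sales"])
    ax.set_xlabel(channel)
axes[0].set_ylabel("Sales")
plt.show()

# Heatmap of the correlation matrix
corr = advertising.corr()
plt.imshow(corr, cmap="coolwarm")
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation heatmap")
plt.show()
```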


From the scatterplot and the heatmap, we can observe that ‘Sales’ and ‘TV’ have a higher correlation than the other pairs, because they show a linear pattern in the scatterplot and a correlation of about 0.9.

You can go ahead and play with the visualizations and can find out interesting insights from the data.

Step 4: Split the Data

Here, as TV and Sales have the highest correlation, we will perform simple linear regression for these variables.

We can use sklearn or statsmodels to apply linear regression; here, we will go ahead with statsmodels.

We first assign the feature variable, `TV`, in this case, to the variable `X` and the response variable, `Sales`, to the variable `y`.

After assigning the variables, you need to split the data into training and testing sets. You'll perform this by importing train_test_split from the sklearn.model_selection library. It is usually good practice to keep 70% of the data in your train dataset and the remaining 30% in your test dataset.

In this way, you can split the data into train and test sets.

One can check the shapes of train and test sets with the following code,
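
A sketch of the assignment, the 70/30 split, and the shape check described above (random_state is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Feature (TV spend) and response (Sales)
X = advertising["TV"]
y = advertising["Sales"]

# Keep 70% of the data for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100
)

# Check the shapes of the train and test sets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```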

Step 5: Model Training & Evaluation
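
A sketch of training and evaluating the model with statsmodels, continuing from the previous step (statsmodels needs an explicit constant column for the intercept):

```python
import statsmodels.api as sm
from sklearn.metrics import r2_score, mean_squared_error

# Add a constant column so that statsmodels estimates an intercept
X_train_sm = sm.add_constant(X_train)

# Fit ordinary least squares on the training data
lr = sm.OLS(y_train, X_train_sm).fit()

# Estimated intercept and slope, plus a full summary including R-squared
print(lr.params)
print(lr.summary())

# Evaluate on the test set
X_test_sm = sm.add_constant(X_test)
y_pred = lr.predict(X_test_sm)
print("R-squared on test data:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))
```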


Step 6: Visualize Actual vs Predicted Values
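
Continuing from the previous step, a simple way to compare the actual test values with the fitted line:

```python
# Actual test points with the fitted regression line on top
plt.scatter(X_test, y_test, label="Actual")
plt.plot(X_test, y_pred, color="red", label="Predicted")
plt.xlabel("TV advertising spend")
plt.ylabel("Sales")
plt.legend()
plt.show()
```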

Ordinary Least Squares (OLS):


Ordinary Least Squares (OLS) estimation is a method used to estimate the parameters of a linear
regression model by minimizing the sum of the squared differences between the observed and
predicted values. It provides the best linear unbiased estimates (BLUE) of the coefficients, subject to
certain assumptions.
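
For the simple model y = mx + c, the OLS estimates have the standard closed form shown below (stated for reference; the derivation is omitted):

```latex
% OLS estimates for the simple linear model y = m x + c
\hat{m} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{c} = \bar{y} - \hat{m}\,\bar{x}

% In matrix form, with design matrix X (including a column of ones) and target vector y:
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
```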


Properties of the Least-Squares Estimator in Linear Regression:
1. Unbiasedness: The least-squares estimators (slope and intercept) are unbiased,
meaning that, on average, they provide accurate estimates of the true population
parameters.
2. Efficiency: Among all unbiased estimators, the least-squares estimators have the
minimum variance, making them efficient for estimating the true parameters.
3. Consistency: As the sample size increases, the least-squares estimators converge in
probability to the true population parameters. In other words, they become more
accurate as more data points are included.
4. Normality (Asymptotic): In large samples, the least-squares estimators are approximately
normally distributed. This property is a result of the Central Limit Theorem.


Numerical:
Estimate the intercept and coefficient of the linear regression model using Ordinary Least
Squares method.

The OLS method aims to minimize the sum of squared differences between the observed values
and the values predicted by the regression line.
The equation for the linear regression model is: y = mx + c

 y is the dependent variable (exam score)
 x is the independent variable (study hours)
 m is the coefficient (slope)
 c is the intercept

Alternatively, the same intercept and coefficient can be obtained in matrix form by solving the normal equation, using a design matrix X that includes a column of ones for the intercept (see the note below).


Note:

The reason for adding a column of ones to the design matrix X is to account for the intercept term in the
linear regression model.

When constructing the design matrix X, the column of ones is added to represent the intercept term
(β0). This allows the linear regression model to estimate both the intercept and the slope coefficients
simultaneously.

Without the column of ones, the linear regression model would assume that the data passes through
the origin (0,0), meaning the intercept is forced to be zero, which might not be appropriate for many
real-world scenarios.

Therefore, by including the column of ones, we ensure that the linear regression model is capable of
estimating the intercept term (β0) along with the coefficients for the other independent variables.
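
A minimal NumPy sketch of this construction; the study-hours and exam-score values are made up, since the original worked table is not reproduced here:

```python
import numpy as np

# Hypothetical data: study hours (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 65, 70, 78], dtype=float)

# Design matrix with a column of ones for the intercept term (beta_0)
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta_hat = (X^T X)^(-1) X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
intercept, slope = beta_hat
print("Intercept (c):", intercept)
print("Slope (m):", slope)
```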


Interval estimation in simple linear regression:


Interval estimation in simple linear regression refers to the process of estimating a range of values
within which a population parameter, such as the regression coefficients or the predicted value of the
dependent variable, is likely to lie with a certain level of confidence. This technique provides a measure
of uncertainty around point estimates and helps in understanding the precision of the estimated
parameters.

There are two main types of intervals estimated in regression:

1. Confidence Intervals (CI): These intervals estimate the range of values within which the true
population parameter (slope or intercept) is likely to fall with a certain level of confidence
(usually denoted by 1 - α, where α is the significance level).
For example, a 95% confidence interval for the slope indicates that you are 95% confident that
the true slope of the population regression line lies within the calculated interval.

2. Prediction Intervals (PI): These intervals estimate the range of values within which a future
individual response variable (Y) is likely to fall for a given value of the independent variable (X).
In simpler terms, a prediction interval tells you with a certain confidence level where a new data
point might lie on the line, given a specific X value.
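
A self-contained statsmodels sketch of both kinds of interval (the study-hours and exam-score values are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: study hours (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([50, 55, 61, 66, 70, 74, 79, 85], dtype=float)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# 95% confidence intervals for the intercept and slope
print(results.conf_int(alpha=0.05))

# Confidence interval for the mean response and prediction interval
# for a new observation at x = 5.5
new_X = sm.add_constant(np.array([5.5]), has_constant="add")
pred = results.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05))   # mean_ci_* and obs_ci_* columns
```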

Residuals:
Residuals are the differences between the actual values of the dependent variable and the values predicted by the regression model. They represent the individual errors for each data point in your dataset.

Residualᵢ = Observed Valueᵢ − Predicted Valueᵢ

or eᵢ = yᵢ − ŷᵢ

A good model minimizes these differences. In a perfect model, the residuals would be zero for
all data points because the predicted values would exactly match the observed values.
However, in reality, residuals are almost always present due to the inherent variability in the
data and the simplifications made by the linear regression model.


Multiple Linear Regression:


Multiple linear regression is an extension of simple linear regression where the relationship
between a dependent variable and two or more independent variables is modeled linearly. It
assumes that the dependent variable is a linear combination of the independent variables, each
weighted by a regression coefficient.

The key assumptions of multiple linear regression are similar to those of simple linear
regression:
1. Linearity: This is the core assumption that the relationship between the dependent variable (Y)
and the independent variables (X1, X2, ..., Xn) is linear. In simpler terms, the change in Y is
expected to be proportional to the change in each X, holding all other independent variables
constant. Scatterplots are helpful tools to visually assess linearity.
2. Independence of Errors: The errors (residuals) in the model are assumed to be independent of
each other. This means the error term associated with one observation shouldn't influence the
error term of another observation. Essentially, there's no hidden pattern or correlation between
the residuals.
3. Homoscedasticity: This assumption states that the variance of the errors (residuals) is constant
across all levels of the independent variables. In other words, the spread of the residuals around
the regression line should be consistent irrespective of the values of the independent variables.
Plotting standardized residuals versus predicted values can help diagnose this assumption.
4. No Multicollinearity: Multicollinearity refers to a situation where there's a high degree of
correlation among the independent variables themselves. This can cause problems in estimating
the individual coefficients of the variables and can lead to unreliable results. Techniques like
Variance Inflation Factor (VIF) are used to check for multicollinearity.
5. Normality of Errors: This assumption states that the errors (residuals) are normally distributed
around a mean of zero. Normality helps ensure the validity of statistical tests performed on the
regression coefficients. We can assess normality through histograms and Q-Q plots of the
residuals.

Note: It's important to remember that these assumptions are ideal conditions. In practice, there might
be slight deviations from these assumptions.
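
A sketch of a multiple linear regression on the advertising data from the earlier walkthrough, including a VIF check for multicollinearity (column names are assumed as before):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Several predictors instead of one
X_multi = advertising[["TV", "Radio", "Newspaper"]]
y = advertising["Sales"]

# Fit the multiple linear regression model
X_multi_sm = sm.add_constant(X_multi)
model = sm.OLS(y, X_multi_sm).fit()
print(model.params)

# Variance Inflation Factor for each predictor
# (values well above roughly 5-10 suggest multicollinearity)
for i, col in enumerate(X_multi_sm.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X_multi_sm.values, i), 2))
```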


Logistic Regression:
o Logistic regression is another supervised learning algorithm, which is used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to model the data. The function can be represented as:

f(x) = 1 / (1 + e⁻ˣ)

o f(x) = output between the 0 and 1 value
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:


o It uses the concept of threshold levels: values above the threshold are mapped to 1, and values below the threshold are mapped to 0.

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
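
A minimal scikit-learn sketch of binary logistic regression (the tiny dataset is invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Predicted class and the sigmoid-based probabilities for a new student
print(clf.predict([[4.5]]))         # class label (0 or 1)
print(clf.predict_proba([[4.5]]))   # probability of each class
```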

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of
x and corresponding conditional values of y.
o Suppose there is a dataset in which the datapoints follow a non-linear pattern; in such a case, linear regression will not fit those datapoints well. To cover such datapoints, we need polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.


o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, and b0, b1, ..., bn are the regression coefficients; x is our independent/input variable.
o The model is still considered linear because it is linear in the coefficients, even though the features include quadratic and higher-degree terms of x.

Note: This is different from multiple linear regression in that, in polynomial regression, a single variable appears with different degrees instead of multiple variables each with the same degree.
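
A sketch using scikit-learn's PolynomialFeatures to transform x into polynomial features before fitting an ordinary linear model (the data is generated only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Non-linear data: y roughly follows a quadratic curve plus noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(scale=0.5, size=30)

# Degree-2 polynomial features fed into a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict([[2.0]]))
```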

Support Vector Regression:


Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. So if we use it for regression problems, then
it is termed as Support Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:


o Kernel: It is a function used to map lower-dimensional data into higher-dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line which helps to predict the continuous variable and covers most of the datapoints.
o Boundary lines: Boundary lines are the two lines drawn on either side of the hyperplane, which create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints which lie closest to the boundary lines and determine the position of the margin.

In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number of datapoints is covered within that margin. The main goal of SVR is to include as many datapoints as possible within the boundary lines, and the hyperplane (best-fit line) must cover the maximum number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary
lines.
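
A minimal SVR sketch with scikit-learn; the kernel, C, and epsilon values are illustrative choices rather than tuned settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy non-linear data
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

# SVR is sensitive to feature scale, so scaling is applied first;
# epsilon controls the width of the margin (the "tube") around the hyperplane
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)

print(svr.predict([[5.0]]))
```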


Decision Tree Regression:


o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset), which
splits into left and right child nodes (subsets of dataset). These child nodes are further
divided into their children node, and themselves become the parent node of those nodes.
Consider the below image:

The above image shows an example of decision tree regression; here, the model is trying to predict the choice of a person between a sports car and a luxury car.
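
A short DecisionTreeRegressor sketch (the toy data and max_depth value are chosen only for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: position level vs. salary (values invented for illustration)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([20, 25, 35, 60, 110, 200], dtype=float)

# max_depth limits how many times the tree may split
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict([[4.5]]))
```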


Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output based on the average of each tree's output. The combined decision trees are called base models, and the model can be represented more formally as:

g(x)= f0(x)+ f1(x)+ f2(x)+....

o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other.
o With the help of Random Forest regression, we can prevent Overfitting in the
model by creating random subsets of the dataset.
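
A minimal RandomForestRegressor sketch on the same toy data as the decision-tree example; n_estimators is the number of trees whose outputs are averaged:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same toy data as the decision-tree example (values invented for illustration)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([20, 25, 35, 60, 110, 200], dtype=float)

# 100 trees trained on bootstrap samples; predictions are averaged across trees
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[4.5]]))
```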


Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We can compute this penalty term by multiplying lambda with the squared weight of each individual feature.
o The equation (cost function) for ridge regression will be:

Cost = Σ (yᵢ − ŷᵢ)² + λ Σ bⱼ²

o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model.
o It is similar to ridge regression, except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a coefficient exactly to 0, whereas ridge regression can only shrink it close to 0.
o It is also called L1 regularization. The equation (cost function) for lasso regression will be:

Cost = Σ (yᵢ − ŷᵢ)² + λ Σ |bⱼ|
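
A sketch comparing Ridge and Lasso on the same data; alpha plays the role of the lambda penalty strength, and the values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data with two nearly collinear features
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)     # almost identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

# L2 penalty shrinks coefficients towards zero but rarely to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
# L1 penalty can shrink some coefficients to exactly zero (feature selection)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```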


References:
1. https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/
2. https://www.javatpoint.com/regression-analysis-in-machine-learning
3. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
4. https://www.geeksforgeeks.org/what-is-feature-engineering/

Prepared by: Mrs. Meenakshi Sihag & Mr. Tushar Malhotra
