Unit-2 ML (Reference Guide For Students)
Unit: 2
Machine Learning (Reference Guide)
We can understand the concept of regression analysis using the example below: a company records its advertisement spend and the corresponding sales for each year. Now the company wants to spend $200 on advertisement in the year 2019 and wants to know the predicted sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. And if our algorithm does not perform well even with the training dataset, the problem is called underfitting.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can confidently determine the most important factor, the least important factor, and how each factor affects the others.
Types of Regression
There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Feature Engineering:
Feature engineering is the process of selecting, extracting, and transforming the most relevant
features from the available data to build more accurate and efficient machine learning models.
WHAT IS A FEATURE?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest regression algorithms, and it shows the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using the image below, where we are predicting the salary of an employee on the basis of years of experience.
Y = aX + b, where Y is the dependent (target) variable, X is the independent (predictor) variable, a is the slope of the line and b is the intercept.
The key assumptions of simple linear regression are:
1. Linearity: The relationship between the dependent and independent variables is assumed to be
linear. The model assumes that a change in the predictor variable is associated with a constant
change in the response variable.
2. Independence of Errors: The errors (residuals) should be independent, meaning that the value
of the error for one observation should not predict the value of the error for another
observation.
3. Homoscedasticity: The variance of the errors should be constant across all values of the
independent variable. In simpler terms, the spread of the data points around the regression line
should be consistent.
4. Normality of Residuals: The residuals (errors) should be normally distributed around the mean.
This allows for statistical tests to assess the model's validity.
5. No Multicollinearity: The independent variables shouldn't be highly correlated with each other.
If they are, it can be difficult to isolate the effect of each variable on the dependent variable.
This is the section where you’ll find out how to perform the regression in Python. We will use
Advertising sales channel prediction data. You can access the data here.
‘Sales’ is the target variable that needs to be predicted. Now, based on this data, our objective is to create a predictive model that predicts sales based on the money spent on different platforms for marketing.
Let us get right down to some hands-on coding to get this prediction done. Please don't feel left out if you do not have experience with Python. In fact, the best way to learn is to get your hands dirty by solving a problem – like the one we are doing.
Step 1: Importing the libraries
The first step is to fire up your Jupyter notebook and load all the prerequisite libraries into it. Here are the important libraries that we will need for this linear regression. In order to load these, just start with a few lines of code in your first cell:
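A minimal sketch of such an import cell is shown below; the exact set of libraries is an assumption, chosen to cover the data handling, plotting, and modelling steps used in the rest of this walkthrough.

# Data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling: statsmodels for OLS, scikit-learn for the train/test split
import statsmodels.api as sm
from sklearn.model_selection import train_test_split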
Step 2: Importing the data
Let us now import the data into a DataFrame. A DataFrame is a data type in Python provided by the pandas library; the simplest way to understand it is that it stores all your data in tabular format.
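A sketch of this step, assuming the dataset has been downloaded as a CSV file named 'advertising.csv' (the actual filename and path are assumptions):

# Read the advertising data into a DataFrame (filename is an assumption)
advertising = pd.read_csv('advertising.csv')

# Quick look at the data
print(advertising.head())
print(advertising.info())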
Step 3: Visualization
Let us plot scatter plots of the target variable against each predictor variable in a single figure to get some intuition, and also plot a heatmap of the correlations between all the variables.
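A minimal sketch of these plots, assuming the DataFrame is named advertising and contains the columns TV, Radio, Newspaper and Sales:

# Scatter plots of Sales against each advertising channel in one figure
sns.pairplot(advertising, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales',
             height=4, aspect=1, kind='scatter')
plt.show()

# Heatmap of the pairwise correlations between all variables
sns.heatmap(advertising.corr(), cmap='YlGnBu', annot=True)
plt.show()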
From the scatterplot and the heatmap, we can observe that ‘Sales’ and ‘TV’ have a higher correlation than the others, because they show a linear pattern in the scatterplot as well as a correlation of about 0.9.
You can go ahead and play with the visualizations and can find out interesting insights from the data.
Here, as the TV and Sales have a higher correlation we will perform the simple linear regression for
these variables.
Step 4: Performing simple linear regression
We can use either sklearn or statsmodels to apply linear regression; here we will go ahead with statsmodels. We first assign the feature variable, `TV`, in this case, to the variable `X`, and the response variable, `Sales`, to the variable `y`.
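In code, this assignment looks like the following (the column names 'TV' and 'Sales' are assumed from the dataset described above):

# Feature (predictor) and response variables
X = advertising['TV']
y = advertising['Sales']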
After assigning the variables, you need to split the data into training and testing sets. You'll perform this by importing train_test_split from the sklearn.model_selection library. It is usually good practice to keep 70% of the data in your train dataset and the remaining 30% in your test dataset.
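A sketch of the split is shown below; fixing random_state only makes the split reproducible, and the exact value is arbitrary.

from sklearn.model_selection import train_test_split

# Keep 70% of the data for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100)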
In this way, you can split the data into train and test sets.
One can check the shapes of the train and test sets with the following code:
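For example (the snippet below also fits the simple linear regression with statsmodels, as mentioned above, and continues with the variable names from the earlier sketches):

# Shapes of the train and test sets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# Fit the simple linear regression Sales ~ TV with statsmodels
import statsmodels.api as sm

X_train_sm = sm.add_constant(X_train)        # adds the intercept column of ones
lr_model = sm.OLS(y_train, X_train_sm).fit()
print(lr_model.params)                       # estimated intercept and slope
print(lr_model.summary())                    # full regression summary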
Numerical:
Estimate the intercept and coefficient of the linear regression model using the Ordinary Least Squares (OLS) method.
The OLS method aims to minimize the sum of squared differences between the observed values
and the values predicted by the regression line.
The equation for the linear regression model is 𝑦 = 𝑚𝑥 + 𝑐, where the OLS estimates of the slope and intercept are
m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²  and  c = ȳ − m·x̄
OR, equivalently, in matrix form, β̂ = (XᵀX)⁻¹Xᵀy, where X is the design matrix and y is the vector of observed values.
Note:
The reason for adding a column of ones to the design matrix X is to account for the intercept term in the
linear regression model.
When constructing the design matrix X, the column of ones is added to represent the intercept term
(β0). This allows the linear regression model to estimate both the intercept and the slope coefficients
simultaneously.
Without the column of ones, the linear regression model would assume that the data passes through
the origin (0,0), meaning the intercept is forced to be zero, which might not be appropriate for many
real-world scenarios.
Therefore, by including the column of ones, we ensure that the linear regression model is capable of
estimating the intercept term (β0) along with the coefficients for the other independent variables.
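The role of the column of ones can be seen directly by computing the OLS estimate by hand with NumPy via the normal equation β̂ = (XᵀX)⁻¹Xᵀy; the small dataset below is made up purely for illustration.

import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta_hat = (X^T X)^-1 X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
intercept, slope = beta_hat
print(intercept, slope)   # roughly 0.30 and 1.94 for this toy data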
1. Confidence Intervals (CI): These intervals estimate the range of values within which the true
population parameter (slope or intercept) is likely to fall with a certain level of confidence
(usually denoted by 1 - α, where α is the significance level).
For example, a 95% confidence interval for the slope indicates that you are 95% confident that
the true slope of the population regression line lies within the calculated interval.
2. Prediction Intervals (PI): These intervals estimate the range of values within which a future
individual response variable (Y) is likely to fall for a given value of the independent variable (X).
In simpler terms, a prediction interval tells you with a certain confidence level where a new data
point might lie on the line, given a specific X value.
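With a fitted statsmodels result (for instance the lr_model object from the earlier sketch), both kinds of interval can be obtained as follows; the 95% level (alpha = 0.05) is an assumption.

# 95% confidence intervals for the intercept and slope
print(lr_model.conf_int(alpha=0.05))

# Confidence and prediction intervals for new X values
X_test_sm = sm.add_constant(X_test)
pred = lr_model.get_prediction(X_test_sm)
frame = pred.summary_frame(alpha=0.05)
# mean_ci_lower / mean_ci_upper -> confidence interval for the mean response
# obs_ci_lower  / obs_ci_upper  -> prediction interval for an individual response
print(frame.head())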
Residuals:
Residuals are the differences between the actual values of the dependent variable and the values predicted by the regression model. They represent the individual errors for each data point in your dataset.
𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙𝑖 = 𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑 𝑉𝑎𝑙𝑢𝑒𝑖 − 𝑃𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 𝑉𝑎𝑙𝑢𝑒𝑖
or 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙𝑖 = 𝑌𝑖 − 𝑌̂𝑖
A good model minimizes these differences. In a perfect model, the residuals would be zero for
all data points because the predicted values would exactly match the observed values.
However, in reality, residuals are almost always present due to the inherent variability in the
data and the simplifications made by the linear regression model.
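Continuing the earlier sketch, the residuals of the fitted model can be computed and inspected like this:

import matplotlib.pyplot as plt

# Residuals = observed values minus predicted (fitted) values
y_train_pred = lr_model.predict(X_train_sm)
residuals = y_train - y_train_pred

print(residuals.head())
print(residuals.mean())        # essentially zero for OLS with an intercept

# Histogram of the residuals to eyeball their distribution
plt.hist(residuals, bins=15)
plt.show()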
The key assumptions of multiple linear regression are similar to those of simple linear
regression:
1. Linearity: This is the core assumption that the relationship between the dependent variable (Y)
and the independent variables (X1, X2, ..., Xn) is linear. In simpler terms, the change in Y is
expected to be proportional to the change in each X, holding all other independent variables
constant. Scatterplots are helpful tools to visually assess linearity.
2. Independence of Errors: The errors (residuals) in the model are assumed to be independent of
each other. This means the error term associated with one observation shouldn't influence the
error term of another observation. Essentially, there's no hidden pattern or correlation between
the residuals.
3. Homoscedasticity: This assumption states that the variance of the errors (residuals) is constant
across all levels of the independent variables. In other words, the spread of the residuals around
the regression line should be consistent irrespective of the values of the independent variables.
Plotting standardized residuals versus predicted values can help diagnose this assumption.
4. No Multicollinearity: Multicollinearity refers to a situation where there's a high degree of
correlation among the independent variables themselves. This can cause problems in estimating
the individual coefficients of the variables and can lead to unreliable results. Techniques like the Variance Inflation Factor (VIF) are used to check for multicollinearity (a short VIF sketch follows the note below).
5. Normality of Errors: This assumption states that the errors (residuals) are normally distributed
around a mean of zero. Normality helps ensure the validity of statistical tests performed on the
regression coefficients. We can assess normality through histograms and Q-Q plots of the
residuals.
Note: It's important to remember that these assumptions are ideal conditions. In practice, there might
be slight deviations from these assumptions.
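As mentioned under assumption 4, the Variance Inflation Factor can be used to check for multicollinearity. A minimal sketch with statsmodels is shown below; the DataFrame and column names (TV, Radio, Newspaper) are assumptions based on the advertising data used earlier.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Independent variables only (column names are assumptions)
X_mlr = advertising[['TV', 'Radio', 'Newspaper']]
X_mlr_const = sm.add_constant(X_mlr)

# One VIF value per column; values well above ~5-10 usually signal
# problematic multicollinearity
vif = pd.DataFrame({
    'feature': X_mlr_const.columns,
    'VIF': [variance_inflation_factor(X_mlr_const.values, i)
            for i in range(X_mlr_const.shape[1])]
})
print(vif)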
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, we have dependent variables in a binary or discrete format, such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to model the data. The function can be represented as:
f(x) = 1 / (1 + e⁻ˣ)
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
On the basis of the categories of the target variable, logistic regression can be of three types:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
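As a hedged illustration of these ideas, here is a minimal binary-classification sketch with scikit-learn on a synthetic dataset (all data and hyperparameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (purely illustrative)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# predict_proba returns the sigmoid outputs; predict applies the 0.5 threshold
print(clf.predict_proba(X_test[:5]))
print(clf.predict(X_test[:5]))
print(clf.score(X_test, y_test))   # classification accuracy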
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of
x and corresponding conditional values of y.
o Suppose there is a dataset in which the data points follow a non-linear pattern; in such a case, linear regression will not fit those data points well. To cover such data points, we need polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1,... bn are the regression
coefficients. x is our independent/input variable.
o The model is still linear because the coefficients b0, b1, ..., bn enter the equation linearly; only the features (x, x², x³, ...) are non-linear.
Note: This differs from multiple linear regression in that, in polynomial regression, a single variable is raised to different degrees, instead of multiple variables each appearing with the same degree.
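A minimal sketch of this transform-then-fit idea with scikit-learn; the degree of 3 and the toy data are purely illustrative choices.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Non-linear toy data: y roughly follows a cubic in x
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + np.random.normal(0, 1, 50)

# Transform x into [x, x^2, x^3] and fit an ordinary linear model on top
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)

print(poly_model.predict([[2.0]]))   # prediction at x = 2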
Support Vector Regression:
Support Vector Regression (SVR) is a regression algorithm which works for continuous variables. Below are some keywords which are used in Support Vector Regression: Kernel (a function used to map the data into a higher-dimensional space), Hyperplane (the line used to predict the continuous output), Boundary lines (the two lines drawn around the hyperplane at a distance ε, which create a margin for the data points), and Support vectors (the data points closest to the boundary lines).
Here, the blue line is called the hyperplane, and the other two lines are known as the boundary lines.
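A hedged sketch of how SVR can be applied with scikit-learn; the RBF kernel, C and epsilon values are illustrative defaults rather than recommendations.

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Toy one-dimensional regression data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 100)

# SVR with an RBF kernel; epsilon sets the width of the insensitive tube
# around the hyperplane (the boundary lines described above)
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.1))
svr.fit(X, y)

print(svr.predict([[5.0]]))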
Decision Tree Regression:
The image above shows an example of Decision Tree regression; here, the model is trying to predict the choice of a person between a sports car and a luxury car.
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms and is capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output as the average of the individual tree outputs. The combined decision trees are called base models, and the ensemble can be represented more formally as:
g(x) = (f1(x) + f2(x) + ... + fn(x)) / n, where f1, f2, ..., fn are the individual decision trees (base models).
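A minimal sketch of decision-tree and random-forest regression with scikit-learn; the dataset and hyperparameters are purely illustrative.

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# A single decision tree as the base model
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X, y)

# A random forest: 100 trees whose predictions are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

print(tree.predict(X[:3]))
print(forest.predict(X[:3]))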
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in which a small
amount of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. This penalty term is computed by multiplying lambda by the sum of the squared weights of the individual features.
o The equation (cost function) for ridge regression will be:
Cost = Σ(yᵢ − ŷᵢ)² + λ·Σbⱼ², where λ is the penalty parameter and bⱼ are the regression coefficients (weights).
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique which is used to reduce the complexity of the model. It is also called L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model.
o It is similar to Ridge Regression, except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink the slope all the way to 0, whereas Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation (cost function) for Lasso regression will be:
Cost = Σ(yᵢ − ŷᵢ)² + λ·Σ|bⱼ|, where λ is the penalty parameter and bⱼ are the regression coefficients (weights).
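A hedged sketch comparing the two regularization techniques with scikit-learn; alpha plays the role of lambda and the value of 1.0 is arbitrary.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data with many features relative to samples (illustrative)
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights towards zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set weights exactly to zero

print(sum(ridge.coef_ == 0))   # usually no coefficients are exactly zero
print(sum(lasso.coef_ == 0))   # Lasso typically zeroes out some coefficients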
References:
1. https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/
2. https://www.javatpoint.com/regression-analysis-in-machine-learning
3. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
4. https://www.geeksforgeeks.org/what-is-feature-engineering/