Regression and Classification Problems
The term multivariate analysis describes analyses of data that are multivariate in the sense that several observations or variables are obtained for each individual or unit studied.
Regression Problems:
• Supervised learning problems where the output is a continuous value are called regression problems.
• Regression techniques are used to predict a continuous value.
• For example, predicting the price of a house from its characteristics, or estimating the CO2 emissions of a car's engine.
Regression Analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables.
Regression analysis is a predictive modelling technique: it estimates the relationship between the input variables (x) and the output variable (y). Regression is the problem of predicting the value Y (the response) given the values of the input variables x1, x2, ..., xp (the predictors).
• In linear regression, we assume that the function f(X) in the relationship Y = f(x1, x2, ..., xp) is linear.
• The task is to find the coefficients of the linear model (parameter estimation).
There are two types of Linear Regression models:
Simple Linear Regression:
• When there is a single input variable (x), the method is referred to as simple
linear regression.
• Predict Co2emission using EngineSize of all cars
• Independent variable (x): EngineSize
• Dependent variable (y): Co2emission
Multiple Linear Regression:
• When there are multiple input variables, literature from statistics often refers to
the method as multiple linear regression.
• Predict Co2emission using EngineSize and Cylinders of all cars
• Independent variables (x): EngineSize, Cylinders
• Dependent variable (y): Co2emission
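As a quick illustration of both variants, here is a minimal sketch using scikit-learn's LinearRegression; the engine-size, cylinder, and CO2 values below are made up purely for demonstration, not taken from a real dataset:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical illustration data: engine size (L), cylinders, CO2 emission (g/km).
    engine_size = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 4.4])
    cylinders = np.array([4, 4, 4, 6, 6, 8])
    co2 = np.array([150, 180, 200, 240, 260, 310])

    # Simple linear regression: a single predictor (EngineSize).
    simple = LinearRegression().fit(engine_size.reshape(-1, 1), co2)

    # Multiple linear regression: two predictors (EngineSize, Cylinders).
    X = np.column_stack([engine_size, cylinders])
    multiple = LinearRegression().fit(X, co2)

    print(simple.intercept_, simple.coef_)      # beta0 and beta1
    print(multiple.intercept_, multiple.coef_)  # beta0, beta1, beta2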
Simple Linear Regression
• The simplest mathematical relationship between two variables x and y is a linear
relationship:
y = β0 + β1x
• x: the input, or independent, or predictor, or explanatory variable (usually
known).
• y: the output, or dependent, or response, or study variable.
• Objective: to estimate the parameters β0 and β1.
• The points (x1, y1), …, (xn, yn) resulting from n independent observations will then be scattered about the true regression line.
• The simple linear regression model is:
y = β0 + β1x + ε
where:
β0 and β1 are the parameters of the model,
ε is a random variable called the error term.
Evaluation Metrics in Regression Models:
Evaluation metrics are used to quantify the performance of a model. Basically, we compare the actual values with the predicted values to measure the accuracy of a regression model.
A residual is a measure of how far away a point is from the regression line. Simply, it is
the error between a predicted value and the observed actual value.
Mean Squared Error (MSE) is the mean of the squared errors. It is more popular than mean absolute error because it focuses more on large errors: the squared term amplifies large errors much more strongly than small ones.
Root Mean Squared Error (RMSE) is the square root of the mean squared error. It is one of the most popular evaluation metrics because it is expressed in the same units as the response variable y, which makes it easy to interpret.
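The sketch below computes residuals, MSE, and RMSE directly from their definitions; the observed and predicted values are made up for illustration:

    import numpy as np

    y_true = np.array([150.0, 180.0, 200.0, 240.0])  # observed values (made up)
    y_pred = np.array([155.0, 175.0, 210.0, 230.0])  # model predictions (made up)

    residuals = y_true - y_pred     # error of each prediction
    mse = np.mean(residuals ** 2)   # Mean Squared Error
    rmse = np.sqrt(mse)             # Root Mean Squared Error, in the units of y

    print(mse, rmse)                # 62.5, ~7.9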
Estimation of Parameters in Simple Linear Regression using Ordinary Least
Squares:
Ordinary Least Squares (OLS) works by minimizing the sum of the squares of the
differences between the observed dependent variable in the given dataset and those
predicted by the linear function.
This method finds estimators β̂0 and β̂1 for the parameters β0 and β1 that minimize the sum of squared errors ε(β0, β1) over the n observed experiments.
In other words, we minimize the function
ε(β0, β1) = Σi (yi − β0 − β1xi)², i = 1, ..., n,
and find the arguments minimizing this function.
To solve the minimization problem, we can use the following theorem.
Theorem: The minimum of the function ε(β0, β1) is unique and is attained when
β̂1 = Σi (xi − X̄)(yi − Ȳ) / Σi (xi − X̄)² and β̂0 = Ȳ − β̂1X̄,
where X̄ is the mean of the x values, and Ȳ is the mean of the y values.
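These closed-form estimates translate directly into code. Here is a minimal sketch; the x and y values are made up for illustration:

    import numpy as np

    def ols_simple(x, y):
        """Closed-form OLS estimates for the model y = beta0 + beta1 * x."""
        x_bar, y_bar = x.mean(), y.mean()
        beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        beta0 = y_bar - beta1 * x_bar
        return beta0, beta1

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up predictor values
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # made-up responses
    print(ols_simple(x, y))                  # (0.14, 1.96)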
Example – Dataset of patients' ages and their blood pressures
Our aim is to find the regression line. From the data (n = 10):
X̄ = 491/10 = 49.1, and Ȳ = 1410/10 = 141
The slope (β1) is calculated as: β1 = Σ(xi − X̄)(yi − Ȳ) / Σ(xi − X̄)² = 2335/2048.9 ≈ 1.14
The intercept (β0) is calculated as: β0 = Ȳ − β1X̄ = 141 − 1.14 × 49.1 = 85.026
• Now substitute the regression coefficients into the regression equation.
• Estimated blood pressure:
Ŷ = 85.026 + 1.14 × age
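A quick arithmetic check of this example, using only the summary statistics quoted above (the raw 10-patient table is not reproduced here):

    # Summary statistics from the blood-pressure example.
    x_bar = 491 / 10           # mean age = 49.1
    y_bar = 1410 / 10          # mean blood pressure = 141
    sxy, sxx = 2335, 2048.9    # sum of cross-deviations and of squared x-deviations

    beta1 = sxy / sxx          # slope, about 1.14
    beta0 = y_bar - beta1 * x_bar
    print(beta0, beta1)        # about 85.0 (85.026 if beta1 is rounded to 1.14 first)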
Classification Problems:
• Problems where the output is a discrete value are called classification problems.
• Classification is the process of predicting a discrete class label, or categories.
• For example, if a cell is benign or malignant, if an email is spam or not.
• A classification problem does not necessarily have only two outcomes; it is not limited to two classes. For example, handwritten digit recognition (which is a classification problem) has ten possible outcomes.
Logistic Regression
Logistic regression is a classification algorithm designed to predict categorical target labels based on historical feature data. Given an input and a model, it predicts the probability that the dependent variable takes a particular class. Logistic regression can be used for both binary classification and multi-class classification.
Sigmoid Function
Logistic Regression uses the sigmoid function, also known as the logistic function, to perform classification. The sigmoid function takes any value and maps it to a value between 0 and 1. The key thing to notice is that no matter what value you put into the sigmoid function, you always get a value between 0 and 1. This means we can take our linear regression solution and place it inside the sigmoid function:
P = 1 / (1 + e^(−z)), where z = β0 + β1x1 + β2x2 + ... + βpxp
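A minimal sketch of the sigmoid itself, showing how extreme inputs are squeezed toward 0 and 1:

    import numpy as np

    def sigmoid(z):
        """Logistic (sigmoid) function: maps any real number into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.000045, 0.5, ~0.99995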
• We can formulate the algorithm for predicting the class of a new object x with the predictors (x1, x2, ..., xp) once the coefficients β0, β1, ..., βp are found (a code sketch follows below):
1. Calculate the value z = β0 + β1x1 + β2x2 + ⋯ + βpxp.
2. Calculate the probability P = 1 / (1 + e^(−z)).
3. If P ≥ 0.5, the object x falls into class 1; otherwise, into class 0.
(In practice, the choice of a probability cut-off is up to the researcher.)
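The three steps map directly onto a small helper function. This is a minimal sketch; the coefficient and predictor values in the usage line are hypothetical:

    import numpy as np

    def predict_class(x, beta0, beta, cutoff=0.5):
        """Steps 1-3 of the prediction algorithm for one object x."""
        z = beta0 + np.dot(beta, x)       # step 1: linear combination
        p = 1.0 / (1.0 + np.exp(-z))      # step 2: probability via the sigmoid
        label = 1 if p >= cutoff else 0   # step 3: apply the probability cut-off
        return label, p

    # Hypothetical coefficients and predictors, for illustration only.
    print(predict_class(np.array([1.5, -0.5]), 0.2, np.array([0.8, 1.1])))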
Let’s apply the logistic regression algorithm to specific data.
• Our data is football statistics. It has three predictors: shots on target (X1), possession (X2), and shots (X3).
• The response Y takes only two values. The value 1 corresponds to a win (class 1), and the value 0 to a loss or draw (class 0).
• The training data provides the following values of the model parameters:
β0= −0.046, β1=0.541, β2= −0.014, β3= −0.132.
• We classify the new object z:
z = (1, 40, 3)
• It is a team that had 1 shot on target, 40 percent possession, and 3 shots.
According to the described algorithm, the probability that the team wins equals:
P = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + β3x3)))
  = 1 / (1 + e^(−(−0.046 + 0.541×1 − 0.014×40 − 0.132×3)))
  ≈ 0.39
• Since P < 0.5, the team will likely lose or draw (class 0).
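A short check of this computation in code; the coefficients and the object z = (1, 40, 3) are exactly the ones given above:

    import math

    beta0, beta = -0.046, [0.541, -0.014, -0.132]
    x = [1, 40, 3]  # shots on target, possession (%), shots

    z = beta0 + sum(b * xi for b, xi in zip(beta, x))  # = -0.461
    p_win = 1.0 / (1.0 + math.exp(-z))                 # ~0.387
    print(round(p_win, 3))  # below the 0.5 cut-off, so class 0 (loss or draw)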