Types of regression analysis
1. Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Ridge Regression
5. Lasso Regression
6. Elastic Net Regression
7. Logistic Regression
8. Poisson Regression
9. Support Vector Regression (SVR)
10. Decision Tree and Random Forest Regression
1. Linear Regression:
Linear regression is the simplest form of regression that models the relationship between one
independent variable (X) and one dependent variable (Y) using a straight line.
Equation: Y = β0 + β1X + ϵ
Where:
Y = Dependent variable (target)
X = Independent variable (predictor)
β0 = Intercept
β1 = Slope
ϵ = Error term
Example: Suppose you want to predict a student's score based on the number of hours
studied.
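For one predictor, β1 and β0 can be computed in closed form from means and deviations. A minimal Python sketch (the study-hours data and the function name are made up for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept = mean_y - slope * mean_x
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]   # exactly score = 50 + 2 * hours
b0, b1 = fit_line(hours, scores)
print(b0, b1)   # recovers intercept 50 and slope 2
```

Because the made-up scores lie exactly on a line, the fit recovers the intercept and slope exactly; with noisy data the estimates would only approximate them.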
2. Multiple Linear Regression
This regression model uses two or more independent variables to predict the dependent variable.
Equation: Y=β0+β1X1+β2X2+...+βnXn+ϵ
Example: Predicting house prices based on square footage and location rating.
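With several predictors, the coefficients solve the normal equations (XᵀX)β = Xᵀy. A sketch in plain Python, using a small Gaussian-elimination solver; the house-price data and coefficient values below are made up for illustration:

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_multiple(X, y):
    """Least-squares coefficients [b0, b1, ..., bn], with an intercept column."""
    rows = [[1.0] + list(r) for r in X]          # prepend a column of ones
    n, p = len(rows), len(rows[0])
    XtX = [[sum(rows[i][a] * rows[i][c] for i in range(n)) for c in range(p)]
           for a in range(p)]
    Xty = [sum(rows[i][a] * y[i] for i in range(n)) for a in range(p)]
    return gauss_solve(XtX, Xty)

# price = 10 + 0.5 * sqft + 3 * location_rating (exact, so recovery is exact)
X = [(10, 5), (12, 7), (8, 6), (15, 9), (11, 4)]
y = [10 + 0.5 * a + 3 * b for a, b in X]
beta = fit_multiple(X, y)   # recovers [10, 0.5, 3]
```

In practice a library routine (e.g. a linear-algebra least-squares solver) would be used instead of hand-rolled elimination; the sketch only shows what such a routine computes.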
3. Polynomial Regression:
Polynomial regression fits a non-linear relationship by adding powers of the independent
variable.
Equation: Y = β0 + β1X + β2X² + ... + βnXⁿ + ϵ
Example: Predicting the growth of bacteria over time.
4. Ridge Regression:
Ridge regression is a regularization technique that adds an L2 penalty term (proportional to the squared coefficients) to prevent overfitting.
Example: Predicting house prices with multiple features (square footage, number of bedrooms, etc.) while avoiding overfitting.
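For a single predictor the effect of the penalty is easy to see: the penalty strength λ is added to the denominator of the slope estimate, shrinking the slope toward zero. A simplified one-predictor sketch (penalizing only the slope, with made-up data):

```python
def fit_ridge_line(xs, ys, lam):
    """Ridge-penalized fit for one predictor; lam = 0 gives ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / (sxx + lam)      # larger lam -> smaller |slope|
    b0 = my - b1 * mx           # the intercept is left unpenalized
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]           # slope 2 without any penalty
slopes = [fit_ridge_line(xs, ys, lam)[1] for lam in (0, 1, 10)]
print(slopes)                   # slope shrinks toward zero as lam grows
```

The shrinkage trades a little bias for lower variance, which is what protects against overfitting when there are many correlated features.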
5. Lasso Regression:
Lasso regression uses L1 regularization, which can shrink coefficients to zero, making it
effective for feature selection.
Example: Used in predicting stock prices by selecting only the most relevant features and
ignoring others.
6. Elastic Net Regression:
Elastic Net combines Ridge and Lasso penalties, making it useful for high-dimensional data.
Example: Used in genomics for predicting disease risk by selecting relevant genes.
7. Logistic Regression:
Logistic regression is used for binary classification problems; although called regression, it is a classification
algorithm.
Example: Predicting whether a student will pass or fail based on study hours.
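A minimal sketch of logistic regression fitted by gradient descent on the log-loss, using a made-up pass/fail dataset (the learning rate and epoch count are illustrative choices, not prescribed values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Gradient descent on the log-loss for one predictor."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y   # prediction error for this point
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

hours  = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 0, 1, 1, 1, 1]            # fail below ~4.5 hours of study
b0, b1 = fit_logistic(hours, passed)
p_pass = sigmoid(b0 + b1 * 6)                # predicted probability for 6 hours
```

The model outputs a probability between 0 and 1; a threshold (commonly 0.5) turns that probability into a pass/fail prediction.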
8. Poisson Regression:
Used for count data and models the rate of occurrence of an event.
Example: Modeling number of road accidents based on vehicle density.
9. Support Vector Regression (SVR):
Used in machine learning for regression problems by finding a hyperplane that fits the data within a margin of tolerance.
Example: Predicting stock prices with high-dimensional data.
10. Decision Tree & Random Forest Regression:
Decision Tree Regression: Uses a tree-like structure to model decisions.
Random Forest Regression: Uses multiple decision trees and averages the results for
better accuracy.
Example: Predicting house prices based on multiple variables (size, location, amenities).
Normal distribution
A continuous probability distribution for a variable is called a normal probability distribution, or
simply a normal distribution. It is also known as the Gaussian or Laplace-Gauss distribution.
The normal distribution is determined by two parameters, the mean and the variance. Normal
distributions are used to represent real-valued random variables whose distributions are
unknown, and they are used very frequently in the natural sciences and social sciences.
When the normal distribution is represented in the form of a graph, it is known as the normal
probability distribution curve, or simply the normal curve. A normal curve is a bell-shaped,
bilaterally symmetrical, continuous frequency distribution curve. Such a curve is formed by
plotting the frequencies of scores of a continuous variable in a large sample. The curve is known
as the normal probability distribution curve because its y-ordinates provide relative frequencies,
or probabilities, instead of the observed frequencies. A continuous random variable can be said
to be normally distributed if the histogram of its relative frequency has the shape of a normal curve.
It is important to understand the characteristics of the frequency distribution
of the Normal Probability Curve (NPC).
Importance of Normal Distribution
As discussed earlier, the normal distribution plays a very significant role in the fields of natural
science and other social sciences. Some of the relevance of the normal distribution is described
below:
• The normal distribution is a continuous distribution and plays a significant role in statistical
theory and inference.
• The normal distribution has various mathematical properties which make it convenient to
express a frequency distribution in its simplest form.
• It is useful in describing sampling distributions, as the sampling distributions of many
statistics are approximately normal.
• Many of the variables in the behavioural sciences, like weight, height, achievement and
intelligence, have distributions approximately like the normal curve.
• The normal distribution is a necessary component for many inferential statistics, like the
z-test, t-test and F-test.
Properties of Normal Distribution
As discussed earlier, the representation of normal distribution of random variable in graphic form
is known as Normal Probability Curve (NPC). The following are the properties of the normal
curve:
It is a bell-shaped, bilaterally symmetrical, continuous frequency distribution curve.
It is a continuous probability distribution for a random variable.
It has two halves (right and left), and the values of the mean, median and mode are equal
(mean = median = mode); that is, they coincide at the same point at the middle of the curve.
The normal curve is asymptotic; that is, it approaches but never touches the x-axis as it
moves farther from the mean.
The mean lies in the middle of the curve and divides the curve in to two equal halves.
Nearly the entire area of the normal curve lies within ±3σ of the mean.
For the standard (unit) normal curve, the total area under the curve is equal to one (N = 1),
the standard deviation is one (σ = 1), the variance is one (σ² = 1) and the mean is zero (μ = 0).
The points where the curve changes curvature (from curving upward to curving downward, or
vice versa) are called inflection points; for the normal curve these occur at μ ± σ.
The z-scores or the standard scores in NPC towards the right from the mean are positive
and towards the left from the mean are negative.
About 68% of the curve area falls within the limit of plus or minus one standard
deviation (±1 σ) unit from the mean; about 95% of the curve area falls within the limit of
plus or minus two standard deviations (±2 σ) unit from the mean and about 99.7% of the
curve area falls within the limit of plus or minus three standard deviations (±3 σ) unit
from the mean.
The normal distribution is free from skewness; that is, its coefficient of skewness
is zero.
The fractional area between any two given z-scores is identical in both halves of the
normal curve; for example, the area between the mean and a z-score of +1 is identical to
the area between the mean and a z-score of –1. Further, the height of the ordinate at a
particular z-score is the same in both halves of the normal curve; for example, the height
of the ordinate at +1z is equal to the height of the ordinate at –1z.
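The 68-95-99.7 figures above can be checked numerically. The standard normal CDF can be written with the error function, Φ(z) = ½(1 + erf(z/√2)), which Python's math module provides:

```python
import math

def phi(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# area under the curve within ±1, ±2 and ±3 standard deviations of the mean
for k in (1, 2, 3):
    area = phi(k) - phi(-k)
    print(f"within ±{k}σ: {area:.4f}")   # ≈ 0.6827, 0.9545, 0.9973
```

The printed areas match the commonly quoted 68%, 95% and 99.7% figures.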
Standard Normal Distribution
Standard normal distribution, also known as the z-distribution, is a special type of normal
distribution. In this distribution, the mean (average) is 0 and the standard deviation (a measure of
spread) is 1. This creates a bell-shaped curve that is symmetrical around the mean.
STANDARD SCORES or Z-SCORES
A standard score or z-score is a transformed score which shows the number of standard deviation
units by which the value of an observation (the raw score) lies above or below the mean. The
standard score helps in determining the probability of a score in the normal distribution. It also
helps in comparing scores from different normal distributions.
The standard score is a score that informs about the value and also where that value lies in the
distribution. For example, a z-score of +5 means the value lies five standard deviation units
above the mean. It is a transformed score of a raw score.
A raw score or sample value is the unchanged score, the direct result of measurement. A raw
score (X) by itself cannot give any information about its position within a distribution.
Therefore, raw scores are transformed into z-scores to determine the location of the original
scores in the distribution.
The z-scores are also used to standardise an entire distribution. These scores (z) help compare
the results of a test with the "normal" population. Results from tests or surveys come in
thousands of possible values and units, and such results may not be meaningful unless they are
transformed. For example, a finding that the height of a particular person is 6.5 feet is
meaningful only when it is compared to the average height. In such a case, the z-score gives an
idea of where that person's height lies in comparison to the average height of the population.
Properties of z-score
Following are some of the properties of the Standard (z) Score:
The mean of the z-scores is always 0.
It is also important to note that the standard deviation of the z-scores is always 1.
Further, the graph of the z-score distribution always has the same shape as the original
distribution of sample values.
The z-scores above the value of 0 represent sample values above the mean, while z-
scores below the value of 0 represent sample values below the mean.
The shape of the distribution of the z-scores is identical to that of the original
distribution of the raw scores. Thus, if the original distribution is normal, then the
distribution of the z-scores will also be normal; converting data to z-scores does not,
by itself, normalize the distribution of that data.
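The first two properties are easy to verify: converting raw scores to z-scores always yields mean 0 and standard deviation 1, whatever the original units. A small sketch with made-up scores:

```python
def z_scores(xs):
    """Transform raw scores to z-scores using the population standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [(x - m) / sd for x in xs]

zs = z_scores([50, 60, 66, 70, 80])
z_mean = sum(zs) / len(zs)                          # always 0
z_sd = (sum(z * z for z in zs) / len(zs)) ** 0.5    # always 1 (zs already have mean 0)
```

The transformation only shifts and rescales the data, which is why the shape of the distribution is unchanged.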
Uses of z-score
z-scores are useful in the following ways:
It helps in identifying the position of observation(s) in a population distribution: As
mentioned earlier, z-scores help in determining the position/distance of a value or an
observation from the mean in units of standard deviation. Further, if the distribution
of the scores is like the normal distribution, then we are able to estimate the proportion of
the population falling above or below a particular value. z-scores have important
implications in studies related to the diet and nutrition of children, where they help in
assessing the height, weight and age of children with reference to nutrition.
It is used for standardising the raw data: It helps in standardising or converting the data
to enable standard measurements. For example, if you wish to compare your scores on
one test with the scores achieved in another test, comparison on the basis of raw score is
not possible. In such a situation, comparisons across tests can only be done when you
standardise both sets of test scores.
It helps in comparing scores that are from different normal distributions: As mentioned
in the previous example, z-scores help in comparing scores from different normal
distribution. Thus, z-scores can help in comparing the IQ scores received from two
different tests.
Computation of z-score
As mentioned earlier, a z-score gives the distance of a sample value from the mean in units of
standard deviation. A z-score can be computed for each value of the sample. The following
formula is used to compute the z-score of a sample value:
z = (X - M) / SD or z = (X - M) / σ
where,
X = a particular raw score
M = Sample mean
SD or σ = Standard Deviation
To illustrate, suppose the following are the marks obtained by students in mathematics. The
marks obtained are expressed here in terms of raw scores. The mean, SD and z-scores can be
then calculated accordingly:
Students    Raw Scores (X)    X - M    z
A           50                -15      -1.24
B           60                -5       -0.41
C           66                1        0.08
D           70                5        0.41
E           80                15       1.24

N = 5; Sum = 326; Mean (M) = 65; SD = 12.04
The above illustration shows the z-scores of the marks obtained by each student (A, B, C, D and
E). In this example, student A is 1.24 standard deviation units below the mean; similarly,
student E is 1.24 units above the mean. The standard deviation is used as the unit of
measurement in standard scores. The standard score helps in normalising or collapsing the data
to a common standard based on how many standard deviations the values lie from the mean.
In practice, z-scores mostly range from -3 standard deviations (the far left of the normal
distribution curve) to +3 standard deviations (the far right of the curve). Further, we need to
know the values of μ (the mean) and σ (the standard deviation) of the population.
Thus, if we want to compute the z-score for X = 70, M = 65 and SD = 12.04, we use the formula
z = (X - M) / SD
= (70 - 65) / 12.04
= 5 / 12.04
= 0.42
Thus, the z-score is obtained as 0.42
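The same computation in Python, with the values taken from the worked example above:

```python
def z_score(x, mean, sd):
    """z = (X - M) / SD: distance from the mean in standard deviation units."""
    return (x - mean) / sd

z = z_score(70, 65, 12.04)
print(round(z, 2))   # 0.42
```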