Regression Analysis
A statistical hypothesis test is a method of statistical inference used to decide whether the data
sufficiently supports a particular hypothesis. A statistical hypothesis test typically involves a
calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical
value or equivalently by evaluating a p-value computed from the test statistic. Roughly 100 specialized
statistical tests have been defined.
Null and alternative hypotheses are used in statistical hypothesis testing. The null hypothesis of a test
always predicts no effect or no relationship between variables, while the alternative hypothesis states
your research prediction of an effect or relationship.
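As an illustration, the sketch below uses Python's scipy with made-up sample values; the data and the hypothesized mean of 50 are assumptions chosen only for the example.

```python
# Minimal sketch of a statistical hypothesis test (one-sample t-test).
# The sample data and the hypothesized mean (50) are made up for illustration.
from scipy import stats

sample = [51.2, 49.8, 52.4, 50.9, 48.7, 53.1, 50.5, 51.8]

# H0: population mean = 50   vs   Ha: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"test statistic t = {t_stat:.3f}, p-value = {p_value:.3f}")
# Decision rule: reject H0 when the p-value is below the chosen
# significance level (e.g. 0.05); otherwise fail to reject H0.
```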
What are the methods of fitting a straight line?
Three methods of fitting straight lines to data are described and their purposes are discussed
and contrasted in terms of their applicability in various water resources contexts. The three methods
are ordinary least squares (OLS), least normal squares (LNS), and the line of organic correlation (OC).
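A minimal sketch in Python (numpy, synthetic data) contrasts the slopes the three methods give; the orthogonal-distance and organic-correlation slope formulas used here are the standard textbook forms and are stated as assumptions rather than taken from this text.

```python
# Sketch: three ways of fitting a straight line y = a + b*x (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

sxx, syy = np.var(x), np.var(y)
sxy = np.cov(x, y, bias=True)[0, 1]
r = sxy / np.sqrt(sxx * syy)

# 1. Ordinary least squares (OLS): minimizes vertical squared residuals.
b_ols = sxy / sxx

# 2. Least normal squares (LNS): minimizes perpendicular (orthogonal) distances.
b_lns = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# 3. Line of organic correlation (OC): slope is sign(r) * (sd of y / sd of x).
b_oc = np.sign(r) * np.sqrt(syy / sxx)

for name, b in [("OLS", b_ols), ("LNS", b_lns), ("OC", b_oc)]:
    a = y.mean() - b * x.mean()   # each fitted line passes through (x-bar, y-bar)
    print(f"{name}: y = {a:.3f} + {b:.3f} x")
```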
DEFINITION
Analysis of variance (ANOVA) is a statistical test used to assess the difference between
the means of more than two groups. At its core, ANOVA allows you to simultaneously compare
arithmetic means across groups. You can determine whether the differences observed are due to
random chance or if they reflect genuine, meaningful differences.
A one-way ANOVA uses one independent variable. A two-way ANOVA uses two independent
variables. Analysts use the ANOVA test to determine the influence of independent variables on the
dependent variable in a regression study. While this can sound arcane to those new to statistics, the
applications of ANOVA are as diverse as they are profound. From medical researchers investigating
the efficacy of new treatments to marketers analyzing consumer preferences, ANOVA has become an
indispensable tool for understanding complex systems and making data-driven decisions.
KEY TAKEAWAYS
ANOVA is a statistical method that simultaneously compares means across several groups to
determine if observed differences are due to chance or reflect genuine distinctions.
A one-way ANOVA uses one independent variable. A two-way ANOVA uses two
independent variables.
By partitioning total variance into components, ANOVA unravels relationships between
variables and identifies true sources of variation.
ANOVA can handle multiple factors and their interactions, providing a robust way to better
understand intricate relationships.
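A minimal one-way ANOVA sketch in Python, using scipy and three small made-up groups, shows how the test compares several group means at once:

```python
# Sketch: one-way ANOVA comparing the means of three groups (made-up data).
from scipy import stats

group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 27, 23]

# H0: all group means are equal   vs   Ha: at least one mean differs.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
# A small p-value suggests the observed differences between group means
# are unlikely to be due to random chance alone.
```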
UNIT-III
Multiple regression analysis is a statistical technique that uses a linear regression model to
predict the value of a dependent variable based on multiple independent variables.
Parameter estimation is a mathematical process that uses a model and experimental data to determine
the values of a model's parameters. It's used to calibrate models and understand complex physical
processes.
A partial regression coefficient is a statistical parameter that measures the strength of the linear
relationship between two variables while holding other variables constant. It's used in multiple linear
regression (MLR) analysis and is also known as a regression weight, partial regression weight, slope
coefficient, or partial slope coefficient.
Ordinary least squares (OLS) and maximum likelihood estimation (MLE) are both methods for
calculating the coefficients of a linear regression model. While they may seem different, they can
produce the same results under certain assumptions:
OLS
A deterministic method that minimizes the sum of squared residuals. It's often used when the relationship between the variables is linear and the errors are approximately normally distributed.
MLE
A method that maximizes the probability of observing a dataset given a model and its
parameters. The format of MLE can vary depending on the underlying distribution, such as Poisson,
Bernoulli, or negative binomial.
When the errors are normally distributed, OLS and MLE produce the same estimator for the model
parameters. This is because both methods give rise to the same normal equations.
OLS is a fundamental concept in machine learning and is widely used for predictive modeling. Linear
regression can be used in many fields, such as meteorology, biology, and economics.
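The sketch below (Python, synthetic data) checks the claim numerically: the closed-form OLS solution from the normal equations and a numerical MLE under normally distributed errors return essentially the same coefficients. The data, starting values, and parameterization are assumptions made for the example.

```python
# Sketch: OLS vs. maximum likelihood estimation under normal errors (synthetic data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
y = 1.5 + 2.0 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept

# OLS: solve the normal equations (X'X) b = X'y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# MLE: maximize the Gaussian log-likelihood (here: minimize its negative).
def neg_log_likelihood(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                 # keep sigma positive
    resid = y - (b0 + b1 * x)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
beta_mle = result.x[:2]

print("OLS coefficients:", np.round(beta_ols, 4))
print("MLE coefficients:", np.round(beta_mle, 4))   # matches OLS up to optimizer tolerance
```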
Coefficient of multiple determination (R²)
In regression analysis, the coefficient of multiple determination, also known as R², is a statistical
measure of how well a regression model explains the variation in a dependent variable based on
multiple independent variables. It ranges from 0 to 1 (often expressed as 0% to 100%):
R² = 0: the model explains none of the variability in the response data.
R² = 1: the model explains all of the variability in the response data.
0 < R² < 1: the dependent variable can be predicted to some extent from the independent variables.
One thing to keep in mind when interpreting these values: the lowercase r and the uppercase R are used
to distinguish the (simple) correlation coefficient from the multiple coefficient of determination.
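A short sketch, assuming synthetic data and a numpy-only fit, shows how R² is computed as explained variation relative to total variation:

```python
# Sketch: coefficient of multiple determination R^2 (synthetic data).
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 2 * x1 - 1.5 * x2 + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
y_hat = X @ beta

ss_res = np.sum((y - y_hat) ** 2)              # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)           # total variation
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")                # proportion of variation explained
```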
Adjusted R² and polynomial regression
Adjusted R² (R²adj) is a statistical measure that corrects the goodness of fit for a linear model by
accounting for the number of predictors in the model. It's a modified version of R², which measures
the proportion of variance in the dependent variable that can be explained by the independent variables.
Adjusted R² is useful because R² can overestimate the fit of a linear regression model, especially
when the number of effects in the model increases. Adjusted R² penalizes models for adding
unnecessary predictors, and only increases if the new predictor improves the model's predictive
power.
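A minimal sketch of the adjustment, assuming n observations and p predictors; it applies the usual formula R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1):

```python
# Sketch: adjusted R^2 penalizes R^2 for the number of predictors.
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Example: the same R^2 of 0.80 looks less impressive with many predictors.
print(adjusted_r_squared(0.80, n_obs=30, n_predictors=2))   # ~0.785
print(adjusted_r_squared(0.80, n_obs=30, n_predictors=15))  # ~0.586
```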
Polynomial regression is a kind of linear regression in which the relationship between the
dependent variable Y and the independent variable X is modeled as an nth-degree polynomial in X. This is
done to find the curve that best fits the data points.
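As a quick sketch (Python, numpy, made-up data), a second-degree polynomial can be fitted with polyfit; the model is still linear in its coefficients, which is why polynomial regression counts as a form of linear regression:

```python
# Sketch: polynomial (degree-2) regression with numpy (made-up data).
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.1, 1.9, 4.2, 8.8, 16.3, 25.9, 37.2])

coeffs = np.polyfit(x, y, deg=2)      # fits y = c2*x^2 + c1*x + c0
y_hat = np.polyval(coeffs, x)

print("coefficients (c2, c1, c0):", np.round(coeffs, 3))
print("fitted values:", np.round(y_hat, 2))
```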
The Partial Regression Coefficient
The interpretation of the individual regression coefficients gives rise to an important difference
between simple and multiple regression. In a multiple regression model the regression parameters, βi,
called partial regression coefficients, are not the same, either computationally or conceptually, as the
so-called total regression coefficients obtained by individually regressing y on each x.
Definition
The partial regression coefficients obtained in a multiple regression measure the change in the
average value of y associated with a unit increase in the corresponding x, holding constant all other
variables.
This means that normally the individual coefficients of an m-variable multiple regression model will
not have the same values nor the same interpretations as the coefficients for the m separate simple
linear regressions involving the same variables. Many difficulties in using and interpreting the results
of multiple regression arise from the fact that the definition of “holding constant,” related to the
concept of a partial derivative in calculus, is somewhat difficult to understand.
For example, in the application on estimating sick days of school children, the coefficient associated
with the height variable measures the increase in sick days associated with a unit increase in height for
a population of children all having identical waist circumference, weight, and age. In this application,
the total and partial coefficients for height would differ because the total coefficient for height would
measure not only the effect of height, but also indirectly measure the effect of the other related
variables.
The application on estimating fuel consumption provides a similar scenario: The total coefficient for
temperature would indirectly measure the effect of wind and cloud cover. Again this coefficient will
differ from the partial regression coefficient because cloud cover and wind are often associated with
lower temperatures.
We will see later that the inferential procedures for the partial coefficients are constructed to reflect
this characteristic. We will also see that these inferences and associated interpretations are often made
difficult by the existence of strong relationships among the several independent variables, a condition
known as multicollinearity.
Multiple Regression
Multiple regression is a special kind of regression model that is used to estimate the relationship
between two or more independent variables and one dependent variable. It is also called multiple
linear regression (MLR).
It is a statistical technique that uses several variables to predict the outcome of a response variable.
The goal of multiple linear regression is to model the linear relationship between the independent
variables and dependent variables. It is used extensively in econometrics and financial inference. Multiple regression can be used to determine:
o How strong the relationship is between two or more independent variables and one dependent
variable.
o The estimate of the dependent variable at a certain value of the independent variables.
For example,
A public health researcher is interested in social factors that influence heart disease. In a survey of 500
towns, data are gathered on the percentage of people in each town who smoke, on the percentage of
people in each town who bike to work, and on the percentage of people in each town who have heart
disease.
As we have two independent variables and one dependent variable, and all the variables are
quantitative, we can use multiple regression to analyze the relationship between them.
Regression Formula
ŷ = β0 + β1X1 + … + βnXn + e
Where, ŷ = predicted value of the dependent variable,
β0 = the y-intercept,
β1 = regression coefficient of the first independent variable X1,
βn = regression coefficient of the last independent variable Xn,
e = model error (the variation in the estimate).
o β represents the unit change in Y per unit change in X.
o βi represents the unit change in Y per unit change in Xi.
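The sketch below fits this kind of model in Python with statsmodels, using small made-up numbers in place of the 500-town survey; the variable names (smoking, biking, heart_disease) are chosen only for the example.

```python
# Sketch: multiple linear regression y = b0 + b1*X1 + b2*X2 + e (made-up data
# standing in for the town survey: % smokers, % biking to work, % heart disease).
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "smoking":       [20, 25, 30, 15, 35, 28, 22, 18, 32, 27],
    "biking":        [40, 30, 20, 55, 10, 25, 35, 50, 15, 28],
    "heart_disease": [10, 12, 15,  7, 18, 14, 11,  8, 16, 13],
})

X = sm.add_constant(df[["smoking", "biking"]])   # adds the intercept b0
model = sm.OLS(df["heart_disease"], X).fit()

print(model.params)      # b0 (const), b1 (smoking), b2 (biking)
print(model.rsquared)    # proportion of variation explained
```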
Similar to linear regression, Multiple Regression also makes a few assumptions, as mentioned below.
Homoscedasticity: The size of the error in our prediction should not change significantly across the
values of the independent variable.
Independence of observations: the observations in the dataset are collected using statistically valid
methods, and there should be no hidden relationships among variables.
Linearity: The line of best fit through the data points should be a straight line rather than a curve or
some sort of grouping factor.
UNIT –IV
Multiple regression analysis
For an individual regression coefficient, we want to test if there is a relationship between the
dependent variable y and the independent variable xi.
No Relationship. There is no relationship between the dependent variable y and the independent
variable xi. In this case, the regression coefficient βi is zero. This is the claim for the null hypothesis
in an individual regression coefficient test: H0: βi=0.
Relationship. There is a relationship between the dependent variable y and the independent
variable xi. In this case, the regression coefficient βi is not zero. This is the claim for the alternative
hypothesis in an individual regression coefficient test: Ha:βi≠0. We are not interested if the
regression coefficient βi is positive or negative, only that it is not zero. We only need to find out if the
regression coefficient is not zero to demonstrate that there is a relationship between the dependent
variable and the independent variable. This makes the test on a regression coefficient a two-tailed test.
In order to conduct a hypothesis test on an individual regression coefficient βi, we need to use the
distribution of the sample regression coefficient bi:
The mean of the distribution of the sample regression coefficient is the population regression
coefficient βi.
The standard deviation of the distribution of the sample regression coefficient is σbi. Because we do
not know the population standard deviation we must estimate σbi with the sample standard
deviation sbi.
The distribution of the sample regression coefficient follows a normal distribution.
Because we are using a sample standard deviation to estimate a population standard deviation in a
normal distribution, we need to use a t-distribution with n−k−1 degrees of freedom to find the p-value
for the test on an individual regression coefficient. The t-score for the test is t = (bi − βi)/sbi.
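A small sketch of this calculation, with made-up values for bi, its standard error, n and k, shows how the t-score and the two-tailed p-value follow from the formula above:

```python
# Sketch: t-test for an individual regression coefficient (made-up numbers).
from scipy import stats

b_i  = 0.45    # sample regression coefficient
s_bi = 0.18    # estimated standard error of b_i
n, k = 50, 3   # sample size and number of independent variables

# H0: beta_i = 0, so the t-score is (b_i - 0) / s_bi with n - k - 1 df.
t_score = (b_i - 0) / s_bi
df = n - k - 1
p_value = 2 * stats.t.sf(abs(t_score), df)   # two-tailed test

print(f"t = {t_score:.3f}, df = {df}, p-value = {p_value:.4f}")
```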
How do you test the significance of individual regression coefficients?
Since the coefficients in the regression represent key parameters such as elasticities and cross-
elasticities of demand, it is important that we test their statistical significance. The t-test is used for this
purpose. For each coefficient, it tests whether the coefficient differs significantly from zero.
Testing the overall significance of the sample regression
We want to test if there is a relationship between the dependent variable and the set of independent
variables. In other words, we want to determine if the regression model is valid or invalid.
Invalid Model. There is no relationship between the dependent variable and the set of independent
variables. In this case, all of the regression coefficients βi in the population model are zero. This is
the claim for the null hypothesis in the overall model test: H0: β1 = β2 = ⋯ = βk = 0.
Valid Model. There is a relationship between the dependent variable and the set of independent
variables. In this case, at least one of the regression coefficients βi in the population model is not
zero. This is the claim for the alternative hypothesis in the overall model test: Ha: at least one βi ≠ 0.
The overall model test procedure compares the means of explained and
unexplained variation in the model in order to determine if the explained variation (caused by the
relationship between the dependent variable and the set of independent variables) in the model is
larger than the unexplained variation (represented by the error variable ε). If the explained variation
is larger than the unexplained variation, then there is a relationship between the dependent variable
and the set of independent variables, and the model is valid. Otherwise, there is no relationship
between the dependent variable and the set of independent variables, and the model is invalid.
The logic behind the overall model test is based on two independent estimates of the variance of the
errors:
One estimate of the variance of the errors, MSE, is based on the mean amount of unexplained
variation in the model. The MSE provides an unbiased estimate of the variance of the errors regardless
of whether or not there is a relationship between the dependent variable and the set of independent
variables.
The overall model test compares these two estimates of the variance of the errors to determine
if there is a relationship between the dependent variable and the set of independent variables. Because
the overall model test involves the comparison of two estimates of variance, an F-distribution is used
to conduct the overall model test, where the test statistic is the ratio of the two estimates of the
variance of the errors.
The mean square due to regression, MSR, is the other estimate of the variance of the
errors. The MSR is determined from the variation of the predicted values ŷ from the regression model
about the mean of the y-values in the sample, ȳ. If
there is no relationship between the dependent variable and the set of independent variables, then
the MSR provides an unbiased estimate of the variance of the errors. If there is a relationship between
the dependent variable and the set of independent variables, then the MSR provides an overestimate of
the variance of the errors.
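A minimal sketch of the overall F-test, assuming the sums of squares have already been computed (the numbers here are made up):

```python
# Sketch: overall model F-test from the ANOVA decomposition (made-up numbers).
from scipy import stats

ssr = 480.0    # explained (regression) sum of squares
sse = 220.0    # unexplained (error) sum of squares
n, k = 40, 3   # observations and independent variables

msr = ssr / k              # mean square due to regression
mse = sse / (n - k - 1)    # mean square error
f_stat = msr / mse         # ratio of the two variance estimates

p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f"F = {f_stat:.3f}, p-value = {p_value:.5f}")
# H0: beta_1 = ... = beta_k = 0; a small p-value supports a valid model.
```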
Multiple regression analysis is a statistical technique that uses the values of multiple
independent variables to predict the value of a dependent variable. It's an extension of simple linear
regression.
Variables
The variable being predicted is called the dependent variable, and the variables used to predict it are
called the independent variables.
Accuracy
Multiple regression models aren't always perfectly accurate because each data point may differ
slightly from the predicted outcome. To account for these variations, the model includes a residual
value, which is the difference between the actual and predicted outcomes.
Examples
Multiple regression analysis can be used to predict a variety of things, including:
Exam performance based on revision time, test anxiety, lecture attendance, and gender
Daily cigarette consumption based on smoking duration, age when smoking started, smoker type, income, and
gender
Rice yield per acre based on seed quality, soil fertility, fertilizer used, temperature, and rainfall
Consumer preferences for a soft drink based on age, consumption habits, and lifestyle.
It is pretty easy to test whether a regression coefficient is significantly different from any
constant. E.g. for the multiple linear equation y = b2x + b1z + b0 to test whether b2 is significantly
different from -1, you need to rewrite the regression equation as y+x = (b2+1)x + b1z + b0. This
equation can be represented as Y = B2x + b1z + b0, which is a multiple regression model where Y =
y+x and the coefficient B2 = b2+1.
Note that b2 = −1 when B2 = b2 + 1 = 0, and since we know how to test whether the B2 coefficient is
significantly different from zero, we have a test for whether b2 is significantly different from −1.
To test whether the intercept is equal to a specific constant is even easier. E.g. to test whether the
constant for Example 1 is equal to 40, transform the regression equation to the equation (y-40)
= b2x + b1z + (b0-40), which takes the form Y = b2x + b1z + B0 and test for B0 = 0.
You can use a slightly more complicated trick to test whether two regression coefficients are equal. To
test whether b2 is significantly different from b1 in y = b2x + b1z + b0, you need to rewrite the regression
equation as y = B2(x+z) + B1(x-z) + b0. Expanding the equation results in y = (B1+B2)x + (B2–B1)z + b0,
and so we see that b2 = B1+B2 and b1 = B2–B1. Thus, B1 = (b2–b1)/2, which means that B1 = 0 is
equivalent to b1 = b2.
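The sketch below (Python, statsmodels, synthetic data) applies both tricks: regressing y + x on x and z to test b2 = −1, and regressing y on (x + z) and (x − z) to test b1 = b2. The data are made up; only the transformations come from the text above.

```python
# Sketch: testing regression coefficients against a constant and against each other
# by rewriting the model (synthetic data for y = b2*x + b1*z + b0 + error).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 5 - 1.0 * x + 2.0 * z + rng.normal(scale=0.5, size=n)

# Test H0: b2 = -1.  Regress Y = y + x on x and z; the x coefficient is B2 = b2 + 1.
fit1 = sm.OLS(y + x, sm.add_constant(np.column_stack([x, z]))).fit()
print("p-value for H0: b2 = -1 :", round(fit1.pvalues[1], 4))

# Test H0: b1 = b2.  Regress y on (x + z) and (x - z); the (x - z) coefficient is
# B1 = (b2 - b1) / 2, so B1 = 0 is equivalent to b1 = b2.
fit2 = sm.OLS(y, sm.add_constant(np.column_stack([x + z, x - z]))).fit()
print("p-value for H0: b1 = b2 :", round(fit2.pvalues[2], 4))
```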
UNIT – V
Dummy variable
In regression analysis, a dummy variable (also known as indicator variable or just dummy) is one
that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that
may be expected to shift the outcome.[1] For example, if we were studying the relationship
between biological sex and income, we could use a dummy variable to represent the sex of each
individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice
versa). In machine learning this is known as one-hot encoding.
Dummy variables are commonly used in regression analysis to represent categorical variables that
have more than two levels, such as education level or occupation. In this case, multiple dummy
variables would be created to represent each level of the variable, and only one dummy variable would
take on a value of 1 for each observation. Dummy variables are useful because they allow us to
include categorical variables in our analysis, which would otherwise be difficult to include due to their
non-numeric nature. They can also help us to control for confounding factors and improve the validity
of our results.
As with any addition of variables to a model, the addition of dummy variables will increase the within-
sample model fit (coefficient of determination), but at a cost of fewer degrees of freedom and loss of
generality of the model (out of sample model fit). Too many dummy variables result in a model that
does not provide any general conclusions.
Dummy variables are useful in various cases. For example, in econometric time series analysis,
dummy variables may be used to indicate the occurrence of wars, or major strikes. It could thus be
thought of as a Boolean, i.e., a truth value represented as the numerical value 0 or 1 (as is sometimes
done in computer programming).
Dummy variables may be extended to more complex cases. For example, seasonal effects may be
captured by creating dummy variables for each of the seasons: D1=1 if the observation is for summer,
and equals zero otherwise; D2=1 if and only if autumn, otherwise equals zero; D3=1 if and only if
winter, otherwise equals zero; and D4=1 if and only if spring, otherwise equals zero. In the panel
data fixed effects estimator dummies are created for each of the units in cross-sectional data (e.g. firms
or countries) or periods in a pooled time-series. However in such regressions either the constant
term has to be removed, or one of the dummies removed making this the base category against which
the others are assessed, for the following reason:
If dummy variables for all categories were included, their sum would equal 1 for all observations,
which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient
is the constant term; if the vector-of-ones variable were also present, this would result in perfect
multicollinearity,[2] so that the matrix inversion in the estimation algorithm would be impossible. This is
referred to as the dummy variable trap.
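The sketch below (Python, pandas, a tiny made-up dataset) encodes a categorical season variable as dummies and drops one category to serve as the base, which is what avoids the dummy variable trap:

```python
# Sketch: creating dummy variables and avoiding the dummy variable trap (made-up data).
import pandas as pd

df = pd.DataFrame({
    "season": ["summer", "autumn", "winter", "spring", "summer", "winter"],
    "sales":  [120, 90, 60, 100, 130, 55],
})

# drop_first=True removes one dummy (the base category), so the remaining
# dummies are not perfectly collinear with the model's constant term.
dummies = pd.get_dummies(df, columns=["season"], drop_first=True)
print(dummies)
```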
ANOVA
Analysis of variance (ANOVA) is a statistical method that can be used in regression analysis to
determine the influence of independent variables on a dependent variable.
ANOVA, or Analysis of Variance, is a test used to determine differences between research results
from three or more unrelated samples or groups.
Forward selection
Process: Starts with an empty model and adds variables that contribute the most to explaining the
variance in the dependent variable.
Goal: To track the most efficient features for better prediction accuracy.
Benefits: Helps to improve model performance and avoid overfitting.
Forward selection is one of three basic variations of stepwise regression, along with backward
elimination and stepwise. Backward elimination starts with all possible predictors and removes non-
significant ones until reaching a stopping criterion. Stepwise regression combines forward selection
and backward elimination, adding and removing predictors as it builds the model.
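A minimal forward-selection sketch in Python with statsmodels follows; the p-value threshold of 0.05 and the synthetic data are assumptions for illustration, not a prescribed criterion.

```python
# Sketch: forward selection based on p-values (synthetic data, 0.05 threshold).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 120
X = pd.DataFrame({f"x{i}": rng.normal(size=n) for i in range(1, 6)})
y = 2 + 3 * X["x1"] - 2 * X["x3"] + rng.normal(scale=1.0, size=n)

selected, remaining = [], list(X.columns)
while remaining:
    # Try adding each remaining predictor and record its p-value.
    pvals = {}
    for cand in remaining:
        exog = sm.add_constant(X[selected + [cand]])
        pvals[cand] = sm.OLS(y, exog).fit().pvalues[cand]
    best = min(pvals, key=pvals.get)
    if pvals[best] < 0.05:          # add the most significant candidate
        selected.append(best)
        remaining.remove(best)
    else:
        break                       # no candidate improves the model enough

print("selected predictors:", selected)
```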
Backward regression
Backward regression is a method used in multiple linear regression to build models by removing the
least statistically significant variables. It's also known as backward stepwise regression or backward
elimination.
Here's how backward regression works: the procedure starts with a model containing all candidate
variables and, at each step, removes the least statistically significant variable until every remaining
predictor meets the chosen criterion (a sketch is given below).
Backward regression can be challenging if there are many candidate variables, and it's impossible if
there are more candidate variables than observations.
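A matching backward-elimination sketch (same assumptions: statsmodels, synthetic data, a 0.05 threshold) removes the least significant predictor at each step:

```python
# Sketch: backward elimination based on p-values (synthetic data, 0.05 threshold).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 120
X = pd.DataFrame({f"x{i}": rng.normal(size=n) for i in range(1, 6)})
y = 1 + 2.5 * X["x2"] + 1.5 * X["x4"] + rng.normal(scale=1.0, size=n)

predictors = list(X.columns)
while predictors:
    fit = sm.OLS(y, sm.add_constant(X[predictors])).fit()
    pvals = fit.pvalues.drop("const")        # ignore the intercept
    worst = pvals.idxmax()
    if pvals[worst] > 0.05:                  # drop the least significant predictor
        predictors.remove(worst)
    else:
        break                                # all remaining predictors are significant

print("remaining predictors:", predictors)
```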
Regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the outcome or response variable, or
a label in machine learning parlance) and one or more error-free independent variables (often
called regressors, predictors, covariates, explanatory variables or features).
The most common form of regression analysis is linear regression, in which one finds the line (or a
more complex linear combination) that most closely fits the data according to a specific mathematical
criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane)
that minimizes the sum of squared differences between the true data and that line (or hyperplane). For
specific mathematical reasons (see linear regression), this allows the researcher to estimate
the conditional expectation (or population average value) of the dependent variable when the
independent variables take on a given set of values. Less common forms of regression use slightly
different procedures to estimate alternative location parameters (e.g., quantile regression or Necessary
Condition Analysis) or estimate the conditional expectation across a broader collection of non-linear
models (e.g., nonparametric regression).
Regression analysis is primarily used for two conceptually distinct purposes. First, regression analysis
is widely used for prediction and forecasting, where its use has substantial overlap with the field
of machine learning. Second, in some situations regression analysis can be used to infer causal
relationships between the independent and dependent variables. Importantly, regressions by
themselves only reveal relationships between a dependent variable and a collection of independent
variables in a fixed dataset. To use regressions for prediction or to infer causal relationships,
respectively, a researcher must carefully justify why existing relationships have predictive power for a
new context or why a relationship between two variables has a causal interpretation. The latter is
especially important when researchers hope to estimate causal relationships using observational data.
Analysis of covariance
Analysis of covariance (ANCOVA) is a general linear model that
blends ANOVA and regression. ANCOVA evaluates whether the means of a dependent variable (DV)
are equal across levels of one or more categorical independent variables (IV) and across one or more
continuous variables. For example, the categorical variable(s) might describe treatment and the
continuous variable(s) might be covariates (CV)'s, typically nuisance variables; or vice versa.
Mathematically, ANCOVA decomposes the variance in the DV into variance explained by the CV(s),
variance explained by the categorical IV, and residual variance. Intuitively, ANCOVA can be thought
of as 'adjusting' the DV by the group means of the CV(s).
The one-way ANCOVA model is commonly written as yij = μ + τi + B(xij − x̄) + εij.
In this equation, the DV, yij, is the jth observation under the ith categorical group; the CV, xij, is the jth
observation of the covariate under the ith group. Variables in the model that are derived from the
observed data are μ (the grand mean) and x̄ (the global mean for covariate x). The variables to be
fitted are τi (the effect of the ith level of the categorical IV), B (the slope of the line) and εij (the
associated unobserved error term for the jth observation in the ith group).
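A small sketch of an ANCOVA fit in Python with statsmodels, using a made-up treatment group, covariate and outcome; the formula interface handles the categorical IV and the continuous CV together.

```python
# Sketch: ANCOVA as a linear model with one categorical IV and one covariate (made-up data).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "group":   ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "covar":   [1.2, 2.1, 3.0, 1.8, 2.5, 1.1, 2.4, 3.2, 2.0, 2.8, 1.5, 2.2, 3.1, 1.9, 2.6],
    "outcome": [10, 12, 15, 11, 13, 14, 17, 19, 15, 18, 9, 11, 14, 10, 12],
})

# outcome ~ group effect + common slope on the covariate
model = smf.ols("outcome ~ C(group) + covar", data=df).fit()
print(anova_lm(model, typ=2))   # variance split: covariate, group, residual
```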
In regression analysis, the "backward optimum method" refers to a variable selection technique
called "backward elimination" where you start with a model containing all potential independent
variables and then iteratively remove the least significant variable one at a time until you reach a
model with only the most important predictors left, based on a chosen statistical criterion like p-
values.