
Predictive Analytics Basics: Linear
Regression & Logistic Regression

Minati Rath
Overview of Predictive Analytics
Predictive analytics uses statistical techniques to forecast future
outcomes based on historical data. It is crucial for informed
decision-making in business. Types of predictive models include
regression, classification, and time-series forecasting.
Introduction to Linear Regression
Linear Regression models the relationship between a dependent
variable and one or more independent variables.
The equation:
Y = β0 + β1X + ε.

β0 (intercept) represents the baseline level; β1 (slope) represents
the change in Y for each unit change in X.
Assumptions: Linearity, Independence, Homoscedasticity, and
Normality. These must be met for the model to produce reliable
results.
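To make this concrete, here is a minimal sketch (assuming Python with numpy and statsmodels, and simulated data; none of this comes from the slides) that fits Y = β0 + β1X + ε and prints the estimates:

    # Minimal sketch: fit Y = b0 + b1*X + e by OLS on simulated data.
    # Assumes numpy and statsmodels are available; the data are made up.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=100)            # one predictor
    Y = 2.0 + 3.5 * X + rng.normal(0, 1, 100)   # true b0 = 2.0, b1 = 3.5, plus noise

    X_design = sm.add_constant(X)               # adds the intercept column
    results = sm.OLS(Y, X_design).fit()

    print(results.params)                       # estimated [b0, b1]
    print(results.summary())                    # R-squared, F-statistic, AIC/BIC, ...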
Assumptions of Linear Regression
Check the linearity between the predictor variables and the response variable

1. Scatter Plots Between Predictors and Response: Create scatter plots of
each predictor variable against the response variable. You’re looking for a
linear relationship in these plots. If the points roughly form a straight line (or a
cloud of points that is centered around a line), it suggests a linear relationship.
Pairwise Scatter Plots: For multiple predictors, plot each predictor against
every other predictor. This helps identify multicollinearity and whether the
relationship between predictors is linear.

2. Residual Plots:
After fitting a linear regression model, plot the residuals (the differences
between observed and predicted values) against the predicted values or against
each predictor variable.
In a well-fitting linear model, residuals should be randomly scattered without
any clear pattern. If you notice patterns (like curves or trends), it might
indicate that the relationship is not purely linear.
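A minimal sketch of such a residual plot, assuming matplotlib and the fitted statsmodels results object `results` from the earlier sketch:

    # Sketch: residuals vs. fitted values for a fitted OLS result (`results`).
    # A patternless cloud centered on zero supports the linearity assumption;
    # curves or funnels suggest non-linearity or heteroscedasticity.
    import matplotlib.pyplot as plt

    plt.scatter(results.fittedvalues, results.resid, alpha=0.6)
    plt.axhline(0, color="red", linestyle="--")
    plt.xlabel("Predicted (fitted) values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs. fitted values")
    plt.show()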
Component + Residual Plots (CERES plots):
CERES plots are used to visualize the relationship between a
predictor and the response variable while accounting for the
effects of other predictors in the model. They can help identify
non-linearity that might not be obvious from simple scatter plots.

Polynomial Regression or Transformation:
Fit a polynomial regression model (e.g., quadratic or cubic) and
compare it to the linear model. If the polynomial model provides
a significantly better fit, this may indicate that the relationship
between the predictor and response variable is non-linear.
Apply transformations to the predictors or the response variable
(like logarithmic or square root transformations) and check if the
linearity improves.
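One way to run this comparison, sketched below under the same simulated-data assumptions as before, is to fit a quadratic term alongside the linear one and compare AIC and adjusted R-squared across the two fits:

    # Sketch: compare a linear fit with a quadratic fit on the same data.
    # A clearly lower AIC (or higher adjusted R-squared) for the quadratic
    # model would suggest the relationship is not purely linear.
    import numpy as np
    import statsmodels.api as sm

    X_lin = sm.add_constant(X)                              # columns [1, X]
    X_quad = sm.add_constant(np.column_stack([X, X ** 2]))  # columns [1, X, X^2]

    fit_lin = sm.OLS(Y, X_lin).fit()
    fit_quad = sm.OLS(Y, X_quad).fit()

    print("linear    AIC:", fit_lin.aic, " adj R2:", fit_lin.rsquared_adj)
    print("quadratic AIC:", fit_quad.aic, " adj R2:", fit_quad.rsquared_adj)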
OLS Result Analysis for one X and Y
Model Overview
Dep. Variable: This is the dependent variable Y, the outcome
variable you're trying to predict or explain.
Model: The type of regression model used is OLS (Ordinary Least
Squares). Ordinary Least Squares is a widely used form of linear
regression in statistical analysis. In OLS, the goal is to find the linear
relationship between a dependent variable Y and one or more
independent variables X₁, X₂, …, Xₙ.
The model assumes that this relationship can be expressed as:
𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝛽ₙ𝑋ₙ + 𝜖
Where: Y is the dependent variable,
X₁, X₂, …, Xₙ are the independent variables.
β0 is the intercept (constant term).
β1,β2,…,βn ​ are the coefficients for the independent variables.
ϵ is the error term (residual), representing the difference between the
observed and predicted values of Y.
Method: Least Squares
Least Squares is the method used to estimate the parameters (i.e., the
coefficients β0, β1, …, βn) in the OLS regression model.
The "least squares" method aims to find the line (or hyperplane in
higher dimensions) that minimizes the sum of the squared differences
between the observed values of Y and the values predicted by the
model.
Why Minimize the Sum of Squared Errors?
The method works by minimizing the sum of squared errors (SSE),
which is calculated as:
SSE = Σ(Yᵢ − Ŷᵢ)²
Where:
Yᵢ is the observed value of the dependent variable for the i-th
observation.
Ŷᵢ is the predicted value of Y for the i-th observation, calculated as:
Ŷᵢ = β₀ + β₁Xᵢ₁ + β₂Xᵢ₂ + ⋯ + βₙXᵢₙ.
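The sketch below (numpy only, reusing the simulated X and Y from the earlier sketch) solves this minimization directly through the normal equations, β̂ = (XᵀX)⁻¹XᵀY, and computes the resulting SSE:

    # Sketch: least-squares estimates via the normal equations, plus the SSE
    # that OLS minimizes. X and Y are the simulated arrays used earlier.
    import numpy as np

    Xd = np.column_stack([np.ones_like(X), X])       # design matrix [1, X]
    beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ Y)  # [b0_hat, b1_hat]

    residuals = Y - Xd @ beta_hat
    sse = np.sum(residuals ** 2)                     # sum of squared errors

    print("beta_hat:", beta_hat)
    print("SSE:", sse)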
R-squared and Adjusted R-squared

R-squared: R-squared is a statistical measure that indicates the
proportion of the variance in the dependent variable that is
predictable from the independent variable(s).
Interpretation: An R-squared value close to 1 (like 0.934 in this
case) indicates that a large proportion of the variance in the
dependent variable (Y) is explained by the independent variable(s)
(X). This suggests that the model fits the data well.

Adjusted R-squared: Adjusted R-squared adjusts the R-squared
value for the number of predictors in the model. It accounts for the
degrees of freedom associated with adding additional predictors.
R-squared and Adjusted R-squared
Interpretation: The Adjusted R-squared (0.933 in this case) is
slightly lower than the R-squared, which is common. This
adjustment is useful when comparing models with a different
number of predictors, as it penalizes the inclusion of unnecessary
predictors that do not improve the model significantly.
Summary:
R-squared provides a general measure of fit, while Adjusted R-
squared gives a more accurate measure when comparing models
with different numbers of predictors. Both values being high
indicates that the model explains a significant portion of the
variance in the outcome variable.
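Both measures can be reproduced by hand from the residuals. A sketch, reusing `sse` from the previous numpy snippet, with n observations and p predictors (excluding the intercept):

    # Sketch: R-squared and adjusted R-squared computed by hand.
    # R^2 = 1 - SSE/SST; adjusted R^2 penalizes additional predictors.
    import numpy as np

    n, p = len(Y), 1                      # p = number of predictors (no intercept)
    sst = np.sum((Y - Y.mean()) ** 2)     # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print("R-squared:", r2, " Adjusted R-squared:", adj_r2)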
F-statistic and Prob(F-statistic)
F-statistic: The F-statistic is a measure used in regression analysis to
test the overall significance of the model. It compares the model with
no predictors (the intercept-only model) to the model with the
specified predictors.
Calculation: The F-statistic is calculated by dividing the mean square
of the model by the mean square of the residuals. Mathematically,
F = MS_model / MS_residual = (explained sum of squares / df_model) / (SSE / df_residual),
where SSE is the residual sum of squares defined earlier.
Interpretation: A higher F-statistic indicates that the model explains
a significant amount of the variation in the dependent variable
compared to the noise (residuals). In the example, an F-statistic of
1386 is very high, suggesting that the model is highly significant.
Prob(F-statistic) or p-value of F-statistic: The p-value associated
with the F-statistic (also known as the Prob(F-statistic)) tells us the
probability of observing an F-statistic as extreme as the one
calculated, assuming that the null hypothesis (that all the regression
coefficients are equal to zero) is true.
F-statistic and Prob(F-statistic)
Interpretation: A very small p-value (like 1.24e-59 in the example)
indicates that the null hypothesis can be rejected, meaning that the
model is statistically significant. In other words, there is an extremely
low probability that the relationship between the dependent and
independent variables is due to random chance.

Summary: The F-statistic measures how well the model fits the data
compared to a model with no predictors. The Prob(F-statistic) or p-
value assesses the statistical significance of this fit. A very low p-
value indicates that the model significantly improves the prediction of
the dependent variable compared to having no predictors.
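As a rough arithmetic check (assuming the example has one predictor and about 100 observations; the sample size is not shown on the slide but is consistent with the reported AIC and BIC), the F-statistic can also be recovered from R-squared:
F = (R² / p) / ((1 − R²) / (n − p − 1)) = (0.934 / 1) / (0.066 / 98) ≈ 1387,
which matches the reported 1386 up to rounding of R-squared.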
Log-Likelihood, AIC, and BIC
Log-Likelihood: -807.50
The log-likelihood measures how well the model fits the data. In
general, higher (less negative) values of the log-likelihood indicate a
better fit of the model to the data. However, the log-likelihood value
alone does not provide a clear picture of model performance without
comparing it to other models.
Interpretation: The log-likelihood value of -807.50 indicates the
likelihood of observing the data given the model. Without a
benchmark or comparison, it is challenging to interpret this value
alone.
AIC (Akaike Information Criterion): 1619
The AIC is used for model comparison and penalizes for the number
of parameters in the model. It helps to balance goodness-of-fit with
model complexity. Lower AIC values indicate a better model fit
relative to other models.
Log-Likelihood, AIC, and BIC
BIC (Bayesian Information Criterion): 1624
The BIC also balances goodness-of-fit with model complexity, but it
penalizes complexity more strongly than the AIC. Lower BIC values
suggest a better model fit, with a stronger penalty for additional
parameters compared to the AIC.
Interpretation: AIC: 1619 and BIC: 1624 are used to compare the
fit of different models. Both AIC and BIC penalize for model
complexity, with BIC imposing a stricter penalty.
1. Comparison: These criteria are particularly useful when
comparing multiple models. The model with the lowest AIC
and BIC values is generally preferred. If you have other models
or alternative specifications, you can compare their AIC and
BIC values to assess which model offers a better trade-off
between fit and complexity.
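As a check on the quoted numbers (assuming, as before, k = 2 estimated coefficients and roughly n = 100 observations; these are inferences, not values stated on the slide), both criteria follow directly from the log-likelihood:
AIC = 2k − 2·(log-likelihood) = 2(2) − 2(−807.50) = 1619
BIC = k·ln(n) − 2·(log-likelihood) = 2·ln(100) + 1615.00 ≈ 1624.2 ≈ 1624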
Coefficients: const and X
The coefficient for 'const' (-1751.8412) represents the intercept of the
regression line. It is the value of Y when all X variables are zero.
-1751.8412 indicates that when X is zero, the expected value of the
dependent variable Y is -1751.8412. This might be a theoretical
baseline value, but its practical significance depends on the context of
the data.
t-Statistic: -11.069 and p-Value: 0.000 suggest that the intercept is
significantly different from zero. The intercept is statistically
significant at any conventional significance level (e.g., 0.05, 0.01).
Confidence Interval: The 95% confidence interval for the intercept is
[-2065.906, -1437.776], indicating that we are 95% confident that the
true intercept value falls within this range.
The coefficient for 'X' (101.2787) represents the change in Y for a
one-unit change in X. This suggests a strong positive relationship
between 𝑋 and Y.
Coefficients: const and X
t-Statistic: 37.224 and p-Value: 0.000 indicate that the coefficient of X is
highly statistically significant. The relationship between X and Y is highly
significant at any conventional significance level.
Confidence Interval: The 95% confidence interval for the coefficient of 𝑋 is
[95.879, 106.678], indicating that we are 95% confident that the true
coefficient value falls within this range.
Both coefficients have very small p-values, indicating they are statistically
significant.
Overall Interpretation:
Significance: Both the intercept and the coefficient for X are highly
significant, with very low p-values, indicating that the relationship
between X and Y is statistically significant.
Effect Size: The coefficient for X is quite large (101.2787), suggesting
a substantial effect of X on Y.
Intercept: The negative intercept might be a point of theoretical
interest, especially if X is expected to be zero or close to zero in the
context of the model.
In summary, the model suggests a strong and significant positive
relationship between X and Y, and the intercept is also statistically
significant, although its practical significance may need further context.
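With statsmodels, all of these quantities come straight off the fitted results object; a minimal sketch (reusing `results` from the earlier simulated example, so the numbers will differ from the slide's):

    # Sketch: coefficients, t-statistics, p-values, and 95% confidence
    # intervals from a fitted statsmodels OLS results object.
    print(results.params)               # intercept and slope estimates
    print(results.tvalues)              # t-statistics
    print(results.pvalues)              # two-sided p-values
    print(results.conf_int(alpha=0.05)) # 95% CIs, one row per coefficient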
Omnibus
The Omnibus test is a statistical test that assesses whether the
residuals from the regression model are normally distributed. It
combines tests for skewness and kurtosis into a single test statistic.
The Omnibus test (10.614, Prob: 0.005) tests the skewness and kurtosis
of the residuals. A significant result suggests non-normality.
Result: The p-value of 0.005 is less than the common significance
level of 0.05. This indicates that there is evidence against the null
hypothesis that the residuals are normally distributed. In other
words, the residuals deviate significantly from normality.
Residual Statistics: Skew: 0.628, Kurtosis: 2.274
1. Skew: 0.628. Interpretation: Positive skewness indicates that the
residuals are skewed to the right: most residuals are relatively small,
with a longer tail of large positive residuals, producing an asymmetric
distribution of residuals.
Other Statistics
1. Kurtosis: Value: 2.274. Interpretation: Kurtosis measures the
"tailedness" of the distribution. A kurtosis value of 2.274 is below the
value of 3, which is the kurtosis of a normal distribution. This
suggests that the residuals have lighter tails than a normal
distribution, indicating fewer extreme outliers.
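Both statistics can be computed directly from the residuals; a sketch assuming scipy is available (note that the statsmodels summary reports plain kurtosis, where a normal distribution scores 3, hence fisher=False below):

    # Sketch: skewness and kurtosis of the residuals of a fitted OLS model.
    # fisher=False returns kurtosis on the scale where a normal distribution
    # has kurtosis 3, matching the convention in the summary table.
    from scipy.stats import kurtosis, skew

    resid = results.resid
    print("Skew:    ", skew(resid))
    print("Kurtosis:", kurtosis(resid, fisher=False))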
Actionable Steps:
•Residual Analysis: Further residual diagnostics should be
performed to understand the nature of the deviations from normality.
This includes checking for patterns in residuals versus fitted values,
and considering transformations of the dependent variable if needed.
•Model Refinement: Consider if any model refinements, such as
adding or removing predictors, applying transformations, or using
different modeling techniques, could address the issues with residual
normality.
Durbin-Watson Statistic:
Durbin-Watson: 0.116. The Durbin-Watson statistic tests for the presence
of autocorrelation in the residuals from a regression analysis. The value
ranges from 0 to 4, where: 2 indicates no autocorrelation. Less than 2
indicates positive autocorrelation. Greater than 2 indicates negative
autocorrelation.
Result: A Durbin-Watson statistic of 0.116 is much lower than 2,
suggesting strong positive autocorrelation in the residuals. This means that
the residuals are highly correlated with each other, which violates the
assumption of independence of residuals.
Jarque-Bera Test: Jarque-Bera (JB): 8.763, Prob(JB): 0.0125
The Jarque-Bera test assesses whether the residuals follow a normal
distribution by evaluating both skewness and kurtosis. The null hypothesis
is that the residuals are normally distributed.
Result: The p-value of 0.0125 is less than the common significance level of
0.05, indicating that the residuals significantly deviate from normality. This
confirms the earlier Omnibus test result, suggesting that the residuals are
not normally distributed.
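Both diagnostics are available as helper functions in statsmodels; a sketch, again using the fitted `results` object from the earlier simulated example:

    # Sketch: Durbin-Watson and Jarque-Bera statistics computed from residuals.
    from statsmodels.stats.stattools import durbin_watson, jarque_bera

    resid = results.resid
    print("Durbin-Watson:", durbin_watson(resid))

    jb_stat, jb_pvalue, jb_skew, jb_kurt = jarque_bera(resid)
    print("Jarque-Bera:", jb_stat, " Prob(JB):", jb_pvalue)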
Condition Number
Cond. No.: 117. The condition number assesses multicollinearity in
the model. A higher condition number indicates potential issues with
multicollinearity, which can destabilize the regression estimates.
Result: A condition number of 117 is relatively high. While it is not
excessively high, it suggests that there might be some
multicollinearity issues. Typically, condition numbers above 30 are
considered problematic, so a value of 117 indicates that
multicollinearity could be affecting the regression results.
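The reported value is essentially the condition number of the design matrix (including the constant column); a sketch of how to compute it yourself:

    # Sketch: condition number of the design matrix used in the regression.
    # np.linalg.cond returns the ratio of the largest to smallest singular
    # value, which is what the "Cond. No." line in the summary reflects.
    import numpy as np

    cond_no = np.linalg.cond(results.model.exog)
    print("Condition number:", cond_no)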
Summary and Recommendations:
1.Autocorrelation:
1. Issue: The Durbin-Watson statistic indicates strong positive
autocorrelation in the residuals.
2. Action: Investigate potential sources of autocorrelation, such
as omitted variables or incorrect model specification. Consider
using time series techniques or adding lagged variables if
applicable.
Summary and Recommendations:
2.Non-Normal Residuals:
1. Issue: Both the Omnibus and Jarque-Bera tests indicate
significant deviations from normality.
2. Action: Consider transforming the dependent variable or
applying robust standard errors. Re-examine model
assumptions and residual patterns.
3.Multicollinearity:
1. Issue: The high condition number suggests potential
multicollinearity.
2. Action: Evaluate the correlation between predictors, for example
with variance inflation factors (see the sketch at the end of this
slide). Consider techniques such as Principal Component Analysis
(PCA) or Ridge Regression to address multicollinearity.
Overall, the model diagnostics suggest that there are several issues to
address, including autocorrelation, non-normality of residuals, and
potential multicollinearity. Addressing these issues will improve the
reliability and validity of your regression analysis.
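One common first step before reaching for PCA or Ridge is to compute variance inflation factors for the predictors; a sketch assuming a design matrix with a constant column, such as `results.model.exog` from a multi-predictor fit:

    # Sketch: variance inflation factors (VIFs) for each predictor column.
    # VIFs well above roughly 5-10 are a common warning sign of
    # multicollinearity. Column 0 (the constant) is skipped.
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    exog = results.model.exog
    for i in range(1, exog.shape[1]):
        print(f"VIF for column {i}: {variance_inflation_factor(exog, i):.2f}")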
Summary of OLS Regression Results
The OLS regression results provide a detailed overview of the
relationship between the dependent variable (Y) and the
independent variable (X). The model is statistically significant,
and the high R-squared value indicates a good fit. However, care
must be taken to check for assumptions such as normality and
autocorrelation.

Least Squares is the method used to estimate the coefficients in
the OLS regression model. The method minimizes the sum of
squared differences between observed values of Y and predicted
values.
This method ensures the best linear unbiased estimators of the
coefficients under the assumptions of the OLS model.
OLS Result Analysis for 10 X Variables and Y
Analysis of OLS Regression Results
Summary of Key Statistics:
Dependent Variable: Y
Model: Ordinary Least Squares (OLS)
Number of Observations: 100
Degrees of Freedom (Residuals): 89
Degrees of Freedom (Model): 10
R-squared: 0.808
Adjusted R-squared: 0.787
F-statistic: 37.53
Prob (F-statistic): 1.06×10⁻²⁷
Log-Likelihood: -123.49
AIC: 269.0
BIC: 297.6
Interpretation:
1.Model Fit:
1. R-squared: 0.808 indicates that approximately 80.8% of the
variance in the dependent variable Y is explained by the
independent variables in the model. This is a relatively high R-
squared value, suggesting that the model has a good fit.
2. Adjusted R-squared: 0.787 adjusts for the number of predictors
in the model. This value is slightly lower than R-squared, but still
high, indicating that the predictors are collectively effective at
explaining the variance in Y.
2.F-statistic:
1. F-statistic: 37.53 is quite high, suggesting that the overall
regression model is statistically significant.
2. Prob (F-statistic): 1.06×10⁻²⁷ indicates an extremely low p-value,
much smaller than common significance levels (0.05 or 0.01). This
means the null hypothesis (that all coefficients are zero) is rejected,
confirming that the model as a whole is significant.
Coefficients and Significance:
const (Intercept): Coefficient is 1.1174 with a p-value of 0.011, indicating that the
intercept is significantly different from zero.
X1: Coefficient is 2.8052 with a p-value of 0.000, indicating a strong positive
relationship with Y.
X2: Coefficient is 5.4174 with a p-value of 0.000, also indicating a strong positive
relationship with Y.
X3: Coefficient is 0.5118 with a p-value of 0.105, which is not significant at the 0.05
level.
X4: Coefficient is 0.2176 with a p-value of 0.501, indicating no significant effect on Y.
X5: Coefficient is 0.1181 with a p-value of 0.697, indicating no significant effect on Y.
X6: Coefficient is -0.0087 with a p-value of 0.979, indicating no significant effect on
Y.
X7: Coefficient is -0.1771 with a p-value of 0.625, indicating no significant effect on
Y.
X8: Coefficient is 0.0283 with a p-value of 0.929, indicating no significant effect on Y.
X9: Coefficient is 0.0252 with a p-value of 0.940, indicating no significant effect on Y.
X10: Coefficient is 0.7142 with a p-value of 0.023, indicating a significant positive
relationship with Y.
Model Diagnostics:
Durbin-Watson: 1.689, which is close to 2, suggesting that there is no strong
autocorrelation in the residuals.
Omnibus Test: 0.424 with a p-value of 0.809 provides no evidence against
normality of the residuals.
Jarque-Bera Test: 0.336 with a p-value of 0.845, further supports the normality of
residuals.
Condition Number: 11.0, which is not extremely high, suggesting that
multicollinearity is not a major concern.

Conclusion:
The model has a good fit with an R-squared of 0.808 and a highly significant F-
statistic.
Among the predictors, X1, X2, and X10 have significant coefficients, while the
others (X3, X4, X5, X6, X7, X8, X9) do not show significant effects on Y at the
0.05 significance level.
Diagnostics indicate that the residuals are approximately normally distributed and
there is no strong evidence of autocorrelation or multicollinearity issues.
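A natural follow-up, sketched below under the assumption that the ten predictors sit in a pandas DataFrame `df` with columns 'X1' through 'X10' and the response in `y` (names chosen here for illustration, not taken from the slides), is to refit with only the significant predictors and compare AIC/BIC against the full model:

    # Sketch: refit using only the significant predictors (X1, X2, X10) and
    # compare information criteria with the full ten-predictor model.
    # `df` and `y` are illustrative placeholders, not objects from the slides.
    import statsmodels.api as sm

    full_exog = sm.add_constant(df[[f"X{i}" for i in range(1, 11)]])
    reduced_exog = sm.add_constant(df[["X1", "X2", "X10"]])

    full_fit = sm.OLS(y, full_exog).fit()
    reduced_fit = sm.OLS(y, reduced_exog).fit()

    print("full model    AIC:", full_fit.aic, " BIC:", full_fit.bic)
    print("reduced model AIC:", reduced_fit.aic, " BIC:", reduced_fit.bic)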
