
Foundation of Data Science

UNIT III : Describing Relationships

Correlation - Scatter plots - correlation coefficient for quantitative data - computational formula for correlation coefficient - Regression - regression line - least squares regression line - Standard error of estimate - interpretation of R2 - multiple regression equations - regression towards the mean.

Correlation

• When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied. Here we focus on bivariate analysis, where exactly two measurements are made on each observation.

• The two measurements will be called X and Y. Since X and Y are obtained for
each observation, the data for one observation is the pair (X, Y).

• Some examples :

1. Height (X) and weight (Y) are measured for each individual in a sample.

2. Stock market valuation (X) and quarterly corporate earnings (Y) are recorded
for each company in a sample.

3. A cell culture is treated with varying concentrations of a drug, and the drug concentration (X) and growth rate (Y) are recorded for each trial.

4. Temperature (X) and precipitation (Y) are measured on a given day at a set of
weather stations.

• There is a difference between bivariate data and two-sample data. In two-sample data, the X and Y values are not paired, and there are not necessarily the same number of X and Y values.
• Correlation refers to a relationship between two or more objects. In statistics,
the word correlation refers to the relationship between two variables. Correlation
exists between two variables when one of them is related to the other in some
way.

• Examples: One variable might be the number of hunters in a region and the
other variable could be the deer population. Perhaps as the number of hunters
increases, the deer population decreases. This is an example of a negative
correlation: As one variable increases, the other decreases.

A positive correlation is where the two variables react in the same way,
increasing or decreasing together. Temperature in Celsius and Fahrenheit has a
positive correlation.

• The term "correlation" refers to a measure of the strength of association between


two variables.

• Covariance is the extent to which a change in one variable corresponds systematically to a change in another. Correlation can be thought of as a standardized covariance.

• The correlation coefficient r is a function of the data, so it really should be called the sample correlation coefficient. The (sample) correlation coefficient r estimates the population correlation coefficient ρ.

• If either the X or the Y values are constant (i.e. all have the same value), then one of the sample standard deviations is zero and therefore the correlation coefficient is not defined.

Types of Correlation

1. Positive and negative

2. Simple and multiple

3. Partial and total

4. Linear and non-linear.


1. Positive and negative

• Positive correlation : Association between variables such that high scores on one variable tend to accompany high scores on the other variable. A direct relation between the variables.

• Negative correlation : Association between variables such that high scores on one variable tend to accompany low scores on the other variable. An inverse relation between the variables.

2. Simple and multiple

• Simple : When the study involves only two variables, the relationship is described as simple correlation.

• Example: Quantity of money and price level, demand and price.

• Multiple : When the study involves more than two variables simultaneously, the relationship is described as multiple correlation.

• Example: The relationship of price, demand and supply of a commodity.

3. Partial and total correlation

• Partial correlation : Analysis recognizes more than two variables but considers
only two variables keeping the other constant. Example: Price and demand,
eliminating the supply side.

• Total correlation is based on all the relevant variables, which is normally not
feasible. In total correlation, all the facts are taken into account.

4. Linear and non-linear correlation

• Linear correlation : Correlation is said to be linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other. The graph of variables having a linear relationship will form a straight line.

• Non linear correlation : The correlation would be non linear if the amount of
change in one variable does not bear a constant ratio to the amount of change in
the other variable.
Classification of correlation

• Two methods are used for finding the relationship between variables :

1. Graphic methods

2. Mathematical methods.

• Graphic methods contain two sub methods: Scatter diagram and simple graph.

• Types of mathematical methods are,

a. Karl Pearson's coefficient of correlation

b. Spearman's rank correlation coefficient

c. Coefficient of concurrent deviation

d. Method of least squares.

Coefficient of Correlation

• Correlation : The degree of relationship between the variables under consideration is measured through correlation analysis.

• The measure of correlation is called the correlation coefficient. The degree of relationship is expressed by a coefficient which ranges from -1 to +1 (-1 ≤ r ≤ +1). The direction of change is indicated by the sign.

• The correlation analysis enables us to have an idea about the degree and
direction of the relationship between the two variables under study.

• Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables. Correlation analysis deals with the
association between two or more variables.

• Correlation denotes interdependency among variables. For two phenomena to be correlated, it is essential that they have a cause-and-effect relationship; if such a relationship does not exist, the two phenomena cannot be correlated.

• If two variables vary in such a way that movements in one are accompanied by movements in the other, the variables are said to have a cause-and-effect relationship.

Properties of Correlation

1. Correlation requires that both variables be quantitative.

2. Positive r indicates positive association between the variables and negative r indicates negative association.

3. The correlation coefficient (r) is always a number between - 1 and + 1.

4. The correlation coefficient (r) is a pure number without units.

5. The correlation coefficient measures clustering about a line, but only relative
to the SD's.

6. The correlation can be misleading in the presence of outliers or nonlinear association.

7. Correlation measures association. But association does not necessarily show causation.

Example 3.1.1: A sample of 6 children was selected, data about their age in
years and weight in kilograms was recorded as shown in the following table.
It is required to find the correlation between age and weight.
Solution :

X = age, the independent variable

Y = weight, the dependent variable

• Another formula for calculating the correlation coefficient uses standard scores (z-scores) :

r = Σ(Zx Zy) / N

Interpreting the correlation coefficient

•Because the relationship between two sets of data is seldom perfect, the majority
of correlation coefficients are fractions (0.92, -0.80 and the like).

• When interpreting correlation coefficients it is sometimes difficult to determine what is high, low and average.
• The value of correlation coefficient 'r' ranges from - 1 to +1.

• If r = +1, then the correlation between the two variables is said to be perfect and positive.

• If r = -1, then the correlation between the two variables is said to be perfect and negative.

• If r = 0, then there exists no correlation between the variables.
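As a small illustration of the z-score formula given above, here is a sketch in Python with NumPy; the data values are invented purely for demonstration:

```python
import numpy as np

# Hypothetical paired observations (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Convert each variable to z-scores (population SD, ddof=0, to match r = sum(Zx*Zy)/N)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Correlation coefficient as the mean product of paired z-scores
r = np.sum(zx * zy) / len(x)
print(round(r, 4))               # close to +1 : strong positive correlation
print(np.corrcoef(x, y)[0, 1])   # cross-check with NumPy's built-in correlation
```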

Scatter Plots

• When two variables x and y have an association (or relationship), we say there
exists a correlation between them. Alternatively, we could say x and y are
correlated. To find such an association, we usually look at a scatterplot and try to
find a pattern.

• Scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data
are plotted with a horizontal x axis and a vertical y axis. Each individual (x, y)
pair is plotted as a single point.

• One variable is called independent (X) and the second is called dependent (Y).
Example:

• Fig. 3.2.1 shows the scatter diagram.


• The pattern of data is indicative of the type of relationship between your two
variables :

1. Positive relationship

2. Negative relationship

3. No relationship.

• The scattergram can indicate a positive relationship, a negative relationship or a zero relationship.
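To show how such a diagram is produced in practice, here is a rough sketch using matplotlib; the height and weight values are invented and the library choice is illustrative, not part of the original notes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative paired data : heights (cm) vs. weights (kg) for a hypothetical sample
heights = np.array([150, 155, 160, 165, 170, 175, 180, 185])
weights = np.array([50, 54, 57, 62, 66, 70, 75, 80])

plt.scatter(heights, weights)      # each (x, y) pair is plotted as a single point
plt.xlabel("Height (cm)")          # independent variable on the horizontal x axis
plt.ylabel("Weight (kg)")          # dependent variable on the vertical y axis
plt.title("Scatter diagram : positive relationship")
plt.show()
```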

Advantages of Scatter Diagram

1. It is a simple and attractive method to find out the nature of correlation.

2. It is easy to understand.

3. User will get rough idea about correlation (positive or negative correlation).

4. Not influenced by the size of extreme items.

5. First step in investigating the relationship between two variables.

Disadvantage of scatter diagram

• It cannot give an exact degree of correlation.

Correlation Coefficient for Quantitative Data

• The product moment correlation, r, summarizes the strength of association between two metric (interval or ratio scaled) variables, say X and Y. It is an index used to determine whether a linear or straight-line relationship exists between X and Y.

• As it was originally proposed by Karl Pearson, it is also known as the Pearson correlation coefficient. It is also referred to as simple correlation, bivariate correlation or merely the correlation coefficient.
• The correlation coefficient between two variables will be the same regardless of
their underlying units of measurement.

• It measures the nature and strength between two variables of the quantitative
type.

• The sign of r denotes the nature of association. While the value of r denotes the
strength of association.

• If the sign is positive this means the relation is direct (an increase in one variable
is associated with an increase in the other variable and a decrease in one variable
is associated with a decrease in the other variable).

• While if the sign is negative this means an inverse or indirect relationship (which
means an increase in one variable is associated with a decrease in the other).

• The value of r ranges between (-1) and (+ 1). The value of r denotes the strength
of the association as illustrated by the following diagram,

1. If r = 0, this means no association or correlation between the two variables.

2. If 0 < r < 0.25 : Weak correlation.

3. If 0.25 ≤ r < 0.75 : Intermediate correlation.

4. If 0.75 ≤ r < 1 : Strong correlation.

5. If r = 1 : Perfect correlation.

• Pearson's 'r' is the most common correlation coefficient. Karl Pearson's coefficient of correlation is denoted by 'r'. The coefficient of correlation 'r' measures the degree of linear relationship between two variables, say x and y.
• Formula for calculating correlation coefficient (r) :

1. When deviations are taken from the actual means :

r = Σxy / √(Σx² · Σy²), where x = X - X̄ and y = Y - Ȳ

2. When deviations are taken from an assumed mean :

r = [N Σdxdy - (Σdx)(Σdy)] / [√(N Σdx² - (Σdx)²) · √(N Σdy² - (Σdy)²)]

where dx and dy are the deviations of X and Y from their assumed means.
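A minimal sketch of formula 1 (deviations from the actual means) in Python with NumPy; the data values are invented for illustration:

```python
import numpy as np

x = np.array([12., 15., 18., 21., 27.])   # hypothetical X values
y = np.array([20., 25., 24., 30., 36.])   # hypothetical Y values

# Deviations from the actual means
dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(round(r, 4))
```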

Example 3.3.1 : Compute Pearson's coefficient of correlation between maintenance cost and sales as per the data given below.

Solution : Given data :

n = 10

X = Maintenance cost

Y = Sales cost

Calculate the coefficient of correlation.

The correlation coefficient is positive, so maintenance cost and sales are positively correlated.
Regression

• For an input x, if the output is continuous, this is called a regression problem. For example, based on historical information of demand for toothpaste in your supermarket, you are asked to predict the demand for the next month.

• Regression is concerned with the prediction of continuous quantities. Linear regression is the oldest and most widely used predictive model in the field of machine learning. The goal is to minimize the sum of the squared errors to fit a straight line to a set of data points.

• It is one of the supervised learning algorithms. A regression model requires knowledge of both the dependent and the independent variables in the training data set.

• Simple Linear Regression (SLR) is a statistical model in which there is only one
independent variable and the functional relationship between the dependent
variable and the regression coefficient is linear.

• Regression line is the line which gives the best estimate of one variable from
the value of any other given variable.

• The regression line gives the average relationship between the two variables in
mathematical form. For two variables X and Y, there are always two lines of
regression.

• Regression line of Y on X: Gives the best estimate for the value of Y for any
specific given values of X:

Y = a + bX

where

a = Y-intercept

b = Slope of the line

Y = Dependent variable

X = Independent variable
• By using the least squares method, we are able to construct a best fitting straight
line to the scatter diagram points and then formulate a regression equation in the
form of:

ŷ = a + bx

ŷ = ȳ + b(x- x̄)

• Regression analysis is the art and science of fitting straight lines to patterns of
data. In a linear regression model, the variable of interest ("dependent" variable)
is predicted from k other variables ("independent" variables) using a linear
equation.

• If Y denotes the dependent variable and X1, ..., Xk are the independent variables,
then the assumption is that the value of Y at time t in the data sample is determined
by the linear equation:

Yt = β0 + β1X1t + β2X2t + … + βkXkt + εt

where the betas are constants and the epsilons are independent and identically
distributed normal random variables with mean zero.

Regression Line

• A way of making a somewhat precise prediction based upon the relationship between two variables. The regression line is placed so that it minimizes the predictive error.

• The regression line does not go through every point; instead it balances the
difference between all data points and the straight-line model. The difference
between the observed data value and the predicted value (the value on the straight
line) is the error or residual. The criterion to determine the line that best describes
the relation between two variables is based on the residuals.

Residual = Observed - Predicted

• A negative residual indicates that the model is over-predicting. A positive residual indicates that the model is under-predicting.
Linear Regression

• The simplest form of regression to visualize is linear regression with a single predictor. A linear regression technique can be used if the relationship between X and Y can be approximated with a straight line.

• Linear regression with a single predictor can be expressed with the equation:

y = θ2x + θ1 + e

• The regression parameters in simple linear regression are the slope of the line (θ2), which gives the change in y for a unit change in x, and the y-intercept (θ1), the value of y at the point where the line crosses the y axis (x = 0).

• The model treats 'Y' as a linear function of 'X' : the value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.

Nonlinear Regression:

• Often the relationship between x and y cannot be approximated with a straight line. In this case, a nonlinear regression technique may be used.

• Alternatively, the data could be preprocessed to make the relationship linear. Fig. 3.4.2 shows nonlinear regression.

• Here X and Y have a nonlinear relationship.

• If data does not show a linear dependence we can get a more accurate model
using a nonlinear regression model.

• For example : y = w0 + w1x + w2x² + w3x³
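Because such a polynomial is still linear in the weights w0..w3, it can be fitted by ordinary least squares. A rough sketch with NumPy's polyfit on synthetic data (both the data and the use of polyfit are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
# Synthetic cubic data with added noise
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3 + rng.normal(0, 1, size=x.size)

# np.polyfit returns coefficients from the highest power down to the constant
w3, w2, w1, w0 = np.polyfit(x, y, deg=3)
print(w0, w1, w2, w3)                      # estimates of the four weights

y_hat = np.polyval([w3, w2, w1, w0], x)    # fitted values of the cubic model
```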

• The generalized linear model is the foundation on which linear regression can be applied to modeling categorical response variables.

Advantages:

a. Training a linear regression model is usually much faster than methods such as
neural networks.

b. Linear regression models are simple and require minimum memory to implement.

c. By examining the magnitude and sign of the regression coefficients you can
infer how predictor variables affect the target outcome.

• There are two important shortcomings of linear regression:

1. Predictive ability: The linear regression fit often has low bias but high
variance. Recall that expected test error is a combination of these two quantities.
Prediction accuracy can sometimes be improved by sacrificing some small
amount of bias in order to decrease the variance.

2. Interpretative ability : Linear regression freely assigns a coefficient to each predictor variable. When the number of variables p is large, we may sometimes seek, for the sake of interpretation, a smaller set of important variables.
Least Squares Regression Line

Least square method

• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.

• The Least Squares (LS) criterion states that the sum of the squares of errors is
minimum. The least-squares solutions yield y(x) whose elements sum to 1, but
do not ensure the outputs to be in the range [0, 1].

• How do we draw such a line based on the observed data points? Suppose an imaginary line y = a + bx.

• Imagine the vertical distance between the line and a data point, E = Y - E(Y). This error is the deviation of the data point from the imaginary line (the regression line). Then what are the best values of a and b? They are the a and b that minimize the sum of such errors.

• Deviation does not have good properties for computation. Then why do we use
squares of deviation? Let us get a and b that can minimize the sum of squared
deviations rather than the sum of deviations. This method is called least squares.

• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β.

• The process of getting parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
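A minimal sketch of the OLS estimators for the line y = a + bx, using the usual closed-form formulas b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄ on invented data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.2, 2.8, 4.5, 3.7, 5.5])   # illustrative observations

# Least squares estimators minimize the sum of squared vertical deviations
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()

y_hat = a + b * x                 # predicted values on the regression line
sse = np.sum((y - y_hat)**2)      # minimized sum of squared errors
print(a, b, sse)
```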

Disadvantages of least square

1. Lacks robustness to outliers.

2. Certain datasets are unsuitable for least squares classification.

3. The decision boundary corresponds to the maximum likelihood (ML) solution under a Gaussian noise assumption, which is often inappropriate for classification.

Example 3.4.1: Fit a straight line to the points in the table. Compute m and
b by least squares.
Standard Error of Estimate

• The standard error of estimate represents a special kind of standard deviation that reflects the magnitude of predictive error. The standard error of estimate, denoted s_y|x, tells us approximately how large the prediction errors (residuals) are for our data set, in the same units as Y.

Definition formula for standard error of estimate : s_y|x = √( SS_y|x / (n - 2) ) = √( Σ(Y - Y')² / (n - 2) )

Computation formula for standard error of estimate : s_y|x = √( SS_y (1 - r²) / (n - 2) )
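A short sketch checking that the definition and computation formulas give the same value; the data are invented and a simple least squares line is fitted first:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1, 6.9])   # illustrative data

# Fit the least squares line
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_pred = a + b * x
n = len(y)

# Definition formula : residual sum of squares divided by (n - 2), then square root
se_def = np.sqrt(np.sum((y - y_pred)**2) / (n - 2))

# Computation formula : uses SS_y and the correlation coefficient r
r = np.corrcoef(x, y)[0, 1]
ss_y = np.sum((y - y.mean())**2)
se_comp = np.sqrt(ss_y * (1 - r**2) / (n - 2))

print(round(se_def, 4), round(se_comp, 4))   # the two formulas agree
```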

Interpretation of R2

• The following measures are used to validate the simple linear regression models:

1. Co-efficient of determination (R-square).

2. Hypothesis test for the regression coefficient b1.

3. Analysis of variance for overall model validity (relevant more for multiple
linear regression).

4. Residual analysis to validate the regression model assumptions.

5. Outlier analysis.

• The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R-square) measures the percentage of variation in Y explained by the model (β0 + β1X).

Characteristics of R-square:

• Here are some basic characteristics of the measure:

1. Since R2 is a proportion, it is always a number between 0 and 1.

2. If R2 = 1, all of the data points fall perfectly on the regression line. The predictor
x accounts for all of the variation in y!.
3. If R2 = 0, the estimated regression line is perfectly horizontal. The predictor x
accounts for none of the variation in y!

• The coefficient of determination, R2, is a measure that assesses the ability of a model to predict or explain an outcome in the linear regression setting. More specifically, R2 indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by linear regression and the predictor variable (X, also known as the independent variable).

• In general, a high R2 value indicates that the model is a good fit for the data,
although interpretations of fit depend on the context of analysis. An R2 of 0.35,
for example, indicates that 35 percent of the variation in the outcome has been
explained just by predicting the outcome using the covariates included in the
model.

• That percentage might be a very high portion of variation to predict in a field such as the social sciences; in other fields, such as the physical sciences, one would expect R2 to be much closer to 100 percent.

• The theoretical minimum R2 is 0. However, since linear regression is based on the best possible fit, R2 will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another.

• R2 increases when a new predictor variable is added to the model, even if the
new predictor is not associated with the outcome. To account for that effect, the
adjusted R2 incorporates the same information as the usual R2 but then also
penalizes for the number of predictor variables included in the model.

• As a result, R2 increases as new predictors are added to a multiple linear regression model, but the adjusted R2 increases only if the increase in R2 is greater than one would expect from chance alone. In such a model, the adjusted R2 is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
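A minimal sketch computing R2 and adjusted R2 for a simple linear fit on invented data (here p denotes the number of predictor variables):

```python
import numpy as np

# Illustrative data and a simple linear fit
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([3.1, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_res = np.sum((y - y_hat)**2)        # unexplained (residual) variation
ss_tot = np.sum((y - y.mean())**2)     # total variation in Y

r2 = 1 - ss_res / ss_tot               # proportion of variation explained

n, p = len(y), 1                       # p = number of predictor variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra predictors

print(round(r2, 4), round(adj_r2, 4))
```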

Spurious Regression

• The regression is spurious when we regress one random walk onto another
independent random walk. It is spurious because the regression will most likely
indicate a non-existing relationship:
1. The coefficient estimate will not converge toward zero (the true value). Instead,
in the limit the coefficient estimate will follow a non-degenerate distribution.

2. The t value most often is significant.

3. R2 is typically very high.

• Spurious regression is linked to serially correlated errors.

• Granger and Newbold (1974) pointed out that, along with the large t-values, strong evidence of serially correlated errors will appear in regression analysis; when a low value of the Durbin-Watson statistic is combined with a high value of the t-statistic, the estimated relationship is likely spurious.
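The effect can be demonstrated with a rough simulation: regress one independent random walk on another and observe the inflated R2 and t value (all numbers below are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two independent random walks : cumulative sums of independent noise
x = np.cumsum(rng.normal(size=n))
y = np.cumsum(rng.normal(size=n))

# Regress y on x even though they are unrelated by construction
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

# t statistic for the slope; it is often "significant" despite no true relationship
se_b = np.sqrt(np.sum(resid**2) / (n - 2) / np.sum((x - x.mean())**2))
t = b / se_b
print(round(r2, 3), round(t, 2))
```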

Hypothesis Test for Regression Co-Efficient (t-Test)

• The regression coefficient (β1) captures the existence of a linear relationship between the response variable and the explanatory variable.

• If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.

• Using the Analysis of Variance (ANOVA), we can test whether the overall
model is statistically significant. However, for a simple linear regression, the null
and alternative hypotheses in ANOVA and t-test are exactly same and thus there
will be no difference in the p-value.

Residual analysis

• Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following:

1. The residuals are normally distributed.

2. The variance of residual is constant (homoscedasticity).

3. The functional form of regression is correctly specified.

4. If there are any outliers.
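A rough residual-analysis sketch on synthetic data, using matplotlib for the plots (both the data and the plotting choices are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)   # synthetic linear data

# Fit the line and compute residuals
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

# Residuals vs. fitted values : look for constant spread (homoscedasticity),
# no curvature (correct functional form) and no extreme points (outliers)
plt.scatter(a + b * x, resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Quick normality check of the residuals via a histogram
plt.hist(resid, bins=15)
plt.xlabel("Residual")
plt.show()
```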


Multiple Regression Equations

• Multiple linear regression is an extension of linear regression, which allows a response variable, y, to be modelled as a linear function of two or more predictor variables.

• In a multiple regression model, two or more independent variables, i.e. predictors, are involved in the model. The simple linear regression model and the multiple regression model assume that the dependent variable is continuous.
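A minimal multiple regression sketch using NumPy's least squares solver on invented data with two predictors (the variable names and values are illustrative):

```python
import numpy as np

# Illustrative data : y modelled from two predictors x1 and x2
x1 = np.array([1., 2., 3., 4., 5., 6.])
x2 = np.array([2., 1., 4., 3., 6., 5.])
y  = np.array([6.1, 6.9, 11.2, 11.8, 17.1, 16.8])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve for (b0, b1, b2) minimizing the sum of squared errors
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)

y_hat = X @ coef   # fitted values from the multiple regression equation
```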

Difference between Simple and Multiple Regression

Regression Towards the Mean

• Regression toward the mean refers to a tendency for scores, particularly extreme
scores, to shrink toward the mean. Regression toward the mean appears among
subsets of extreme observations for a wide variety of distributions.

• The rule goes that, in any series with complex phenomena that are dependent
on many variables, where chance is involved, extreme outcomes tend to be
followed by more moderate ones.

• The effects of regression to the mean can frequently be observed in sports, where
the effect causes plenty of unjustified speculations.

• It basically states that if a variable is extreme the first time we measure it, it will
be closer to the average the next time we measure it. In technical terms, it
describes how a random variable that is outside the norm eventually tends to
return to the norm.

• For example, our odds of winning on a slot machine stay the same. We might
hit a "winning streak" which is, technically speaking, a set of random variables
outside the norm. But play the machine long enough and the random variables
will regress to the mean (i.e. "return to normal") and we shall end up losing.
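A rough simulation of regression toward the mean: two independent noisy measurements of the same underlying quantity, with the cases that were extreme on the first measurement examined on the second (all values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

true_score = rng.normal(100, 10, size=n)          # underlying "ability"
test1 = true_score + rng.normal(0, 10, size=n)    # first noisy measurement
test2 = true_score + rng.normal(0, 10, size=n)    # second, independent measurement

# Select the cases that scored extremely high the first time
extreme = test1 > 120

# Their second scores are, on average, closer to the overall mean of 100
print(round(test1[extreme].mean(), 1))   # well above 120
print(round(test2[extreme].mean(), 1))   # noticeably lower, pulled toward 100
```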

• Consider a sample taken from a population. The value of the variable will be some distance from the mean. For instance, we could take a sample of people (it could be just one person), measure their heights and then determine the average height of the sample. This value will be some distance away from the average height of the entire population, though the distance might be zero.

• Regression to the mean usually happens because of sampling error. A good sampling technique is to randomly sample from the population. If we sample asymmetrically, the results may be abnormally high or low relative to the average and will therefore tend to regress back toward the mean. Regression to the mean can also happen because we take a very small, unrepresentative sample.

Regression fallacy

• The regression fallacy assumes that a situation has returned to normal due to corrective actions having been taken while the situation was abnormal. It does not take into consideration normal fluctuations.

• An example of this could be a business program failing and causing problems, which is then cancelled. The return to "normal", which might be somewhat different from the original situation or a situation of "new normal", could fall into the category of regression fallacy. This is considered an informal fallacy.
