CORRELATION ANALYSIS
Correlation is concerned with the relationship between two variables. It measures the
association or strength of relationship between two variables say x and y.
It indicates the extent to which changes in one variable are accompanied by changes in the other.
The coefficient of correlation, denoted by ρ (the Greek letter rho) or 𝑟, measures the similarity of
the changes in the values of x and y. Its range is
−𝟏 ≤ 𝒓 ≤ +𝟏
If y increases when x increases, 𝑟 is positive. If y decreases when x increases,
𝑟 is negative. If y is unaffected by x, then 𝒓 = 𝟎.
Most Familiar Measures of Correlation
Pearson Product Moment Coefficient of Correlation
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$$
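The computational formula for the Pearson coefficient can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function name `pearson_r` and the sample values are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from raw sums, following the computational formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when x or y is constant")
    return numerator / denominator

# Illustrative values only: y = 2x exactly, so r comes out as 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

The function raises an error when either variable is constant, because the denominator is then zero and r is undefined.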
Spearman Rank-Order Coefficient of Correlation
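$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
where D is the difference between the two ranks of each pair and n is the number of pairs.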
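A small Python sketch of the rank-order coefficient, using the usual form r_s = 1 − 6ΣD²/(n(n² − 1)), is given below. The helper names `average_ranks` and `spearman_rs` and the sample values are assumptions made for illustration. Tied values are given the mean of the ranks they occupy. The formula is exact only when there are no ties, so with tied data the result should be treated as an approximation.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman rank-order coefficient from the squared rank differences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Illustrative values only: the rank differences are 0, -1, 1, -1, 1, so sum D^2 = 4.
print(spearman_rs([10, 20, 30, 40, 50], [1, 3, 2, 5, 4]))
```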
REGRESSION ANALYSIS
The primary objective of regression analysis is to estimate the value of a random variable
(the dependent variable) given that the value of an associated variable (the independent variable) is known.
Dependent Variable - also called the response variable or predicted variable.
Independent Variable - also called the predictor variable.
Regression Equation - the algebraic formula by which the estimated value of the dependent or
response variable is determined.
Simple Regression Analysis - the value of a dependent variable is estimated on the basis
of one independent or predictor variable.
Multiple Regression Analysis - the value of a dependent variable is estimated on the basis
of two or more independent variables.
General Assumptions underlying the regression analysis model:
1. The dependent variable is a random variable
2. The independent and dependent variables are linearly associated
Assumption (1) indicates that although the values of the independent variable may be
controlled, the values of the dependent variable must be obtained through the process of
random sampling.
If interval estimation or hypothesis testing is done in the regression analysis, the following
additional assumptions are required:
3. The variances of the conditional distributions of the dependent variable, given different values of
the independent variable, are equal,
4. The conditional distributions of the dependent variable, given different values of the independent
variable, are all normally distributed in the population of values, and
5. The observed values of the dependent variable are independent of each other.
The Method of Least Squares for fitting a Regression Line or Straight Line Model
The statistical procedure for finding the “best-fitting straight line” for a set of points is the
Method of Least Squares. The resulting line is the one that minimizes the sum of the squared deviations of the
observed values of y from the values predicted by the line.
The linear equation that represents the simple linear regression model is
𝒀𝒊 = 𝜷𝟎 + 𝜷𝟏 𝑿𝒊 + 𝜺𝒊
Where: 𝑌𝑖 = value of the dependent variable in the 𝑖th trial or observation
𝛽0 = first parameter of the regression equation, which indicates the value of Y when X =0
𝛽1 = second parameter of the regression equation, called the regression coefficient,
which indicates the slope of the regression line
𝑋𝑖 = the specified value of the independent variable in the 𝑖th trial or observation
𝜀𝑖 = random-sampling error in the 𝑖th trial or observation
The parameters 𝛽0 and 𝛽1 in the linear regression model are estimated by the values 𝑏0 and
𝑏1, which are based on sample data. Thus, the linear regression equation based on the sample data that is
used to estimate a single (conditional) value of the dependent variable, where the “hat” over the Y
indicates that it is an estimated value, is
Ŷ = 𝒃𝟎 + 𝒃𝟏 𝑿
Or we may adopt the simpler notation
𝒚 = 𝒂 + 𝒃𝒙
Where: 𝑦 = the dependent or predicted variable
𝑥 = the independent or predictor variable
𝑎 = the y-intercept (𝑏0)
𝑏 = the slope of the regression line (𝑏1)
To solve for 𝑏
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
To solve for 𝑎
𝑎 = 𝑦̅ − 𝑏𝑥̅
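The two formulas for the slope and the intercept can be combined into one short routine. The sketch below is illustrative only: the function name `fit_line` and the sample points are assumptions, chosen so that the points lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("the slope is undefined when all x values are equal")
    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = sum_y / n - b * (sum_x / n)                 # intercept: y-bar minus b times x-bar
    return a, b

# Illustrative values only: points that lie exactly on y = 1 + 2x.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Computing b first and then a mirrors the order of the hand computation, since the intercept formula needs the slope.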
Example: The data below summarize the midterm grades and final grades of 10 students. Let us
determine whether a student's midterm grade can be used to predict his or her final grade.
Let x = midterm grade and y = final grade.
x   75  70  65  90  85  85  80  70  65  90
y   80  75  65  95  90  85  90  75  70  90
From a previous computation, using the Pearson Product Moment Coefficient of Correlation,
the computed value is r = 0.949, which is highly significant. This shows that there is a very strong
association between the two sets of grades.
Solving the above problem using the Stepwise Method:
I. Problem: Is there a significant relationship between the midterm grades and final
examination grades of 10 students in Mathematics?
II. Hypothesis:
Null Hypothesis:
There is no significant relationship between the midterm grades and final
examination grades of 10 students in Mathematics.
Alternative Hypothesis:
There is a significant relationship between the midterm grades and final
examination grades of 10 students in Mathematics.
III. Level of Significance
α = 0.05 and 𝑑𝑓 = 𝑛 − 2 = 10 − 2 = 8
$r_{0.05} = 0.632$ (the tabulated value)
IV. Statistic Used: Pearson Product Moment Coefficient of Correlation
          x      y      x²      y²      xy
         75     80    5625    6400    6000
         70     75    4900    5625    5250
         65     65    4225    4225    4225
         90     95    8100    9025    8550
         85     90    7225    8100    7650
         85     85    7225    7225    7225
         80     90    6400    8100    7200
         70     75    4900    5625    5250
         65     70    4225    4900    4550
         90     90    8100    8100    8100
Total   775    815   60925   67325   64000
The computations are summarized as follows:
∑x = 775, ∑y = 815, ∑x² = 60,925, ∑y² = 67,325, ∑xy = 64,000
Substituting these values into the formula for the Pearson Product Moment Coefficient of
Correlation gives
$$r = \frac{10(64{,}000) - 775(815)}{\sqrt{\left(10(60{,}925) - 775^2\right)\left(10(67{,}325) - 815^2\right)}} = \frac{8375}{\sqrt{(8625)(9025)}} = 0.949$$
V. Decision Rule: If the computed r value is greater than the tabular r value, reject the
null hypothesis 𝐻0.
VI. Conclusion/Implications:
Since the computed r value of 0.949 is greater than the tabular r value of
0.632 at the 0.05 level of significance, with 8 degrees of freedom, the null hypothesis is
rejected.
This means that there is a significant relationship between the midterm grades
of the students and their final examination grades. It implies that the higher the midterm
grade, the higher the final examination grade. This shows a positive, or direct,
correlation.
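The hand computation in the steps above can be checked with a short Python script. This is a sketch for verification only. The variable names are assumptions, and the critical value 0.632 is simply the tabulated value quoted in step III, not something the script derives.

```python
import math

midterm = [75, 70, 65, 90, 85, 85, 80, 70, 65, 90]  # x
final = [80, 75, 65, 95, 90, 85, 90, 75, 70, 90]    # y

n = len(midterm)
sum_x = sum(midterm)
sum_y = sum(final)
sum_x2 = sum(x * x for x in midterm)
sum_y2 = sum(y * y for y in final)
sum_xy = sum(x * y for x, y in zip(midterm, final))

# Pearson r from the column totals of the table.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

r_critical = 0.632           # tabulated value for alpha = 0.05, df = 8 (taken from the text)
reject_h0 = r > r_critical   # decision rule from step V

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 3), reject_h0)  # prints: 0.949 True
```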
Applying the method of Linear Regression Analysis to the above problem:
1. Compute the values needed for the formulas, namely ∑x, ∑y, ∑xy, and ∑x²:
∑x = 775, ∑y = 815, ∑x² = 60,925, ∑xy = 64,000
2. Solve for 𝑏 by substituting the values obtained above.
Using
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
Then
$$b = \frac{10(64{,}000) - 775(815)}{10(60{,}925) - (775)^2} = \frac{8375}{8625} = 0.971$$
3. Solve for 𝑦̅ and 𝑥̅, where
$$\bar{y} = \frac{\sum y}{n} = \frac{815}{10} = 81.5 \quad \text{and} \quad \bar{x} = \frac{\sum x}{n} = \frac{775}{10} = 77.5$$
Then using
𝑎 = 𝑦̅ − 𝑏𝑥̅
We obtained
𝑎 = 81.5 − 0.971(77.5 ) = 6.25
4. With 𝑎 and 𝑏 computed, the equation of the regression line 𝑦 = 𝑎 + 𝑏𝑥 becomes
𝒚 = 𝟔.𝟐𝟓 + 𝟎.𝟗𝟕𝟏𝒙
where the y-intercept is 6.25 and the slope is 0.971.
For example, the predicted final grades for midterm grades of 86 and 73 are
6.25 + 0.971(86) = 6.25 + 83.51 = 89.76
6.25 + 0.971(73) = 6.25 + 70.88 = 77.13
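The regression computation for this example can be reproduced from the column totals with a few lines of Python. This is a sketch under stated assumptions: the function name `predict` is hypothetical, and the script keeps full precision for 𝑎 and 𝑏 instead of rounding them first.

```python
# Column totals from the worked example (n = 10 students).
n = 10
sum_x, sum_y = 775, 815
sum_x2, sum_xy = 60925, 64000

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * (sum_x / n)                               # y-intercept

def predict(x):
    """Estimated final grade for a midterm grade x, using y = a + b*x."""
    return a + b * x

print(round(b, 3), round(a, 2))  # prints: 0.971 6.25
for grade in (86, 73):
    print(grade, round(predict(grade), 2))
```

Because the script does not round 𝑎 and 𝑏 before predicting, its value for x = 86 is about 89.75, slightly below the 89.76 obtained from the rounded coefficients. The value for x = 73 agrees at 77.13.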
Exercises:
1. Below is a summary of the Advertising-Sales Data of EG Merchandising for the year 2001.
Let x = Advertising Expenses (in thousands)
y = Sales Revenue
x    4   2   3   5   8   7   7   9   3   5   8  10
y   20  25  10  15  30  24  28  35  12  16  32  45
Draw a scatter diagram and determine the equation of the regression line for the above data.
2. Compute the equation of the regression line determined by the data below.
x    2   7   5   4   9   3   3   4   5   8
y   20  35  48  51  71  39  45  25  60  70