Correlation Coefficient Formula: Definition
Correlation coefficient formulas are used to find how strong the relationship between two sets of data is. The formulas
return a value between -1 and 1, where:
1 indicates a perfect positive relationship.
-1 indicates a perfect negative relationship.
A result of zero indicates no linear relationship at all.
Graphs showing a correlation of -1, 0 and +1
Meaning
A correlation coefficient of 1 means that for every increase in one variable, there is an increase
of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot
length.
A correlation coefficient of -1 means that for every increase in one variable, there is a
decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost)
perfect correlation with speed.
Zero means that an increase in one variable comes with no consistent increase or decrease in the other. The two just aren’t linearly related.
The absolute value of the correlation coefficient gives us the relationship strength. The larger the number,
the stronger the relationship. For example, |-.75| = .75, which has a stronger relationship than .65.
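To make these values concrete, here is a minimal Python sketch. The data sets are made up for illustration, and the helper name `pearson_r` is an assumption of this example, not a standard library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises by a fixed amount per step
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls by a fixed amount per step
r_flat = pearson_r(x, [3, 1, 2, 1, 3])   # no linear trend in y

# Strength is judged by the absolute value, so -0.75 beats 0.65
stronger = abs(-0.75) > abs(0.65)

print(r_up, r_down, r_flat, stronger)
```

The first two data sets lie exactly on a straight line, so they give the extreme values; the third was chosen so that the deviations cancel out.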
Types of correlation coefficient formulas.
There are several types of correlation coefficient formulas.
One of the most commonly used formulas is Pearson’s correlation coefficient formula. If you’re taking a
basic stats class, this is the one you’ll probably use:
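Pearson correlation coefficient:
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \]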
Two other formulas are commonly used: the sample correlation coefficient and the population correlation
coefficient.
Sample correlation coefficient:
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
Here sx and sy are the sample standard deviations, and sxy is the sample covariance.
Population correlation coefficient:
\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
The population correlation coefficient uses σx and σy as the population standard deviations, and σxy as the
population covariance.
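The sample and population versions differ only in whether the covariance and standard deviations divide by n - 1 or by n, and that factor cancels in the ratio. The sketch below checks this on the six age and glucose pairs from the worked example later in this article. Treating the same six values as a whole population is purely for illustration, and the helper names `covariance` and `std_dev` are assumptions of this example.

```python
import math

x = [43, 21, 25, 42, 57, 59]  # age
y = [99, 65, 79, 75, 87, 81]  # glucose level

def covariance(xs, ys, ddof):
    """Covariance dividing by (n - ddof): ddof=1 for a sample, 0 for a population."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - ddof)

def std_dev(xs, ddof):
    """Standard deviation as the square root of a variable's covariance with itself."""
    return math.sqrt(covariance(xs, xs, ddof))

# Sample version: s_xy / (s_x * s_y)
r_sample = covariance(x, y, 1) / (std_dev(x, 1) * std_dev(y, 1))
# Population version: sigma_xy / (sigma_x * sigma_y)
rho_population = covariance(x, y, 0) / (std_dev(x, 0) * std_dev(y, 0))

print(round(r_sample, 4), round(rho_population, 4))
```

Both lines print the same number, which is why a single data set gives the same coefficient under either formula.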
What is Pearson Correlation?
Correlation between sets of data is a measure of how well they are related. The most common measure
of correlation in stats is the Pearson Correlation. The full name is the Pearson Product Moment
Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers
the question, “Can I draw a line graph to represent the data?” Two letters are used to represent the Pearson
correlation: the Greek letter rho (ρ) for a population and the letter “r” for a sample.
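Because the PPMC only measures linear relationships, two variables can be perfectly related by a curve and still score zero. The following sketch uses made-up points on the parabola y = x² to show this; `pearson_r` is a hypothetical helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x]      # y is completely determined by x, but not linearly
y_line = [3 * v + 1 for v in x]    # y is a straight-line function of x

r_curve = pearson_r(x, y_curve)
r_line = pearson_r(x, y_line)

print(r_curve, r_line)
```

The curved data gives a coefficient of zero even though y depends entirely on x, so a low r does not rule out a non-linear relationship.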
Potential problems with Pearson correlation.
The PPMC is not able to tell the difference between dependent variables and independent variables. For
example, if you are trying to find the correlation between a high calorie diet and diabetes, you might find a
high correlation of .8. However, you could also get the same result with the variables switched around. In
other words, you could say that diabetes causes a high calorie diet. That obviously makes no sense.
Therefore, as a researcher you have to be aware of the data you are plugging in. In addition, the PPMC
will not tell you how steep the line is; it only tells you the direction and strength of the relationship.
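Both limitations are easy to check numerically. The sketch below uses made-up data and two hypothetical helpers (`pearson_r` and `slope`) to show that swapping the variables leaves r unchanged, and that stretching y by a factor of 10 changes the slope of the best-fit line but not r.

```python
import math

def deviations_sums(xs, ys):
    """Return (sxy, sxx, syy): sums of deviation products and squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    sxy, sxx, syy = deviations_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    sxy, sxx, _ = deviations_sums(xs, ys)
    return sxy / sxx

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
y_steep = [10 * v for v in y]  # same pattern, ten times steeper

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)             # variables switched around
r_steep = pearson_r(x, y_steep)    # rescaled y

print(r_xy, r_yx, r_steep)
print(slope(x, y), slope(x, y_steep))
```

All three correlations come out the same, while the two slopes differ by a factor of 10.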
Real Life Example
Pearson correlation is used in thousands of real life situations. For example, scientists in China wanted to
know whether there was a relationship between weedy rice populations that differ genetically. The goal
was to find out the evolutionary potential of the rice. Pearson’s correlation between the two groups was
analyzed. It showed a positive Pearson Product Moment correlation of between 0.783 and 0.895 for
weedy rice populations. This figure is quite high, which suggested a fairly strong relationship.
If you’re interested in seeing more examples of PPMC, you can find several studies on the National
Institutes of Health’s Openi website, which shows results from studies as varied as breast cyst imaging and the
role that carbohydrates play in weight loss.
How to Find Pearson’s Correlation Coefficients
By Hand
Example question: Find the value of the correlation coefficient from the following table:
SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.
SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257
2         21      65                1365
3         25      79                1975
4         42      75                3150
5         57      87                4959
6         59      81                4779
Step 3: Take the square of the numbers in the x column, and put the result in the x² column.
SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849
2         21      65                1365    441
3         25      79                1975    625
4         42      75                3150    1764
5         57      87                4959    3249
6         59      81                4779    3481
Step 4: Take the square of the numbers in the y column, and put the result in the y² column.
SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561
Step 5: Add up all of the numbers in the columns and put the result at the bottom of the column. The Greek letter
sigma (Σ) is a short way of saying “sum of” or summation.
SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561
Σ         247     486               20485   11409   40022
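If you would rather check the table than add it up by hand, Steps 1 to 5 can be sketched in a few lines of Python. The variable names are assumptions of this example; the data are the six rows from the table.

```python
ages = [43, 21, 25, 42, 57, 59]      # x column
glucose = [99, 65, 79, 75, 87, 81]   # y column

# Steps 2 to 4: build the xy, x squared and y squared columns
xy = [x * y for x, y in zip(ages, glucose)]
x_squared = [x ** 2 for x in ages]
y_squared = [y ** 2 for y in glucose]

# Step 5: add up each column (the sigma row of the table)
sum_x = sum(ages)
sum_y = sum(glucose)
sum_xy = sum(xy)
sum_x2 = sum(x_squared)
sum_y2 = sum(y_squared)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
```

The printed totals should match the Σ row of the table.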
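Step 6: Use the following correlation coefficient formula.
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \]
The answer is: 2868 / 5413.27 = 0.529809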
From our table:
Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case = 6
The correlation coefficient =
\[ \frac{6(20{,}485) - (247 \times 486)}{\sqrt{\left[6(11{,}409) - 247^2\right] \times \left[6(40{,}022) - 486^2\right]}} = \frac{2868}{5413.27} \]
= 0.5298
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which means the
variables have a moderate positive correlation.
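The final step can also be checked with a short Python sketch that plugs the column totals into the formula. The variable names are assumptions of this example; the totals are the ones from the table.

```python
import math

n = 6            # sample size
sum_x = 247
sum_y = 486
sum_xy = 20485
sum_x2 = 11409
sum_y2 = 40022

# Top of the fraction: n * sum(xy) - sum(x) * sum(y)
numerator = n * sum_xy - sum_x * sum_y

# Bottom of the fraction: square root of the product of the two bracketed terms
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

r = numerator / denominator

print(numerator, round(denominator, 2), round(r, 4))
```

The three printed values correspond to the 2868, 5413.27 and 0.5298 shown in the hand calculation.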